28 results on '"short read alignment"'
Search Results
2. Shared and genetically distinct Zea mays transcriptome responses to ongoing and past low temperature exposure
- Author
-
Luis M Avila, Wisam Obeidat, Hugh Earl, Xiaomu Niu, William Hargreaves, and Lewis Lukens
- Subjects
Maize ,Cold ,Abiotic stress ,RNA-Seq ,Short read alignment ,Genotype environment interaction ,Biotechnology ,TP248.13-248.65 ,Genetics ,QH426-470
- Abstract
Background: Cold temperatures and their alleviation affect many plant traits including the abundance of protein coding gene transcripts. Transcript level changes that occur in response to cold temperatures and their alleviation are shared or vary across genotypes. In this study we identify individual transcripts and groups of functionally related transcripts that consistently respond to cold and its alleviation. Genes that respond differently to temperature changes across genotypes may have limited functional importance. We investigate if these genes share functions, and if their genotype-specific gene expression levels change in magnitude or rank across temperatures. Results: We estimate transcript abundances from over 22,000 genes in two unrelated Zea mays inbred lines during and after cold temperature exposure. Genotype and temperature contribute to many genes’ abundances. Past cold exposure affects many fewer genes. Genes up-regulated in cold encode many cytokinin glucoside biosynthesis enzymes, transcription factors, signalling molecules, and proteins involved in diverse environmental responses. After cold exposure, protease inhibitors and cuticular wax genes are newly up-regulated, and environmentally responsive genes continue to be up-regulated. Genes down-regulated in response to cold include many photosynthesis, translation, and DNA replication associated genes. After cold exposure, DNA replication and translation genes are still preferentially downregulated. Lignin and suberin biosynthesis are newly down-regulated. DNA replication, reactive oxygen species response, and anthocyanin biosynthesis genes have strong, genotype-specific temperature responses. The ranks of genotypes’ transcript abundances often change across temperatures. Conclusions: We report a large, core transcriptome response to cold and the alleviation of cold. In cold, many of the core suite of genes are up or downregulated to control plant growth and photosynthesis and limit cellular damage. In recovery, core responses are in part to prepare for future stress. Functionally related genes are consistently and greatly up-regulated in a single genotype in response to cold or its alleviation, suggesting positive selection has driven genotype-specific temperature responses in maize.
- Published
- 2018
- Full Text
- View/download PDF
3. Versatile Succinct Representations of the Bidirectional Burrows-Wheeler Transform
- Author
-
Belazzougui, Djamal, Cunial, Fabio, Kärkkäinen, Juha, Mäkinen, Veli, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Bodlaender, Hans L., editor, and Italiano, Giuseppe F., editor
- Published
- 2013
- Full Text
- View/download PDF
4. A Multi GPU Read Alignment Algorithm with Model-Based Performance Optimization
- Author
-
Drozd, Aleksandr, Maruyama, Naoya, Matsuoka, Satoshi, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Daydé, Michel, editor, Marques, Osni, editor, and Nakajima, Kengo, editor
- Published
- 2013
- Full Text
- View/download PDF
5. Comparing DNA Sequence Collections by Direct Comparison of Compressed Text Indexes
- Author
-
Cox, Anthony J., Jakobi, Tobias, Rosone, Giovanna, Schulz-Trieglaff, Ole B., Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Istrail, Sorin, editor, Pevzner, Pavel, editor, Waterman, Michael S., editor, Raphael, Ben, editor, and Tang, Jijun, editor
- Published
- 2012
- Full Text
- View/download PDF
6. Shared and genetically distinct Zea mays transcriptome responses to ongoing and past low temperature exposure.
- Author
-
Avila, Luis M, Obeidat, Wisam, Earl, Hugh, Niu, Xiaomu, Hargreaves, William, and Lukens, Lewis
- Subjects
- *CORN , *PROTEINS , *LOW temperatures , *TRANSCRIPTION factors , *GENE expression in plants
- Abstract
Background: Cold temperatures and their alleviation affect many plant traits including the abundance of protein coding gene transcripts. Transcript level changes that occur in response to cold temperatures and their alleviation are shared or vary across genotypes. In this study we identify individual transcripts and groups of functionally related transcripts that consistently respond to cold and its alleviation. Genes that respond differently to temperature changes across genotypes may have limited functional importance. We investigate if these genes share functions, and if their genotype-specific gene expression levels change in magnitude or rank across temperatures. Results: We estimate transcript abundances from over 22,000 genes in two unrelated Zea mays inbred lines during and after cold temperature exposure. Genotype and temperature contribute to many genes' abundances. Past cold exposure affects many fewer genes. Genes up-regulated in cold encode many cytokinin glucoside biosynthesis enzymes, transcription factors, signalling molecules, and proteins involved in diverse environmental responses. After cold exposure, protease inhibitors and cuticular wax genes are newly up-regulated, and environmentally responsive genes continue to be up-regulated. Genes down-regulated in response to cold include many photosynthesis, translation, and DNA replication associated genes. After cold exposure, DNA replication and translation genes are still preferentially downregulated. Lignin and suberin biosynthesis are newly down-regulated. DNA replication, reactive oxygen species response, and anthocyanin biosynthesis genes have strong, genotype-specific temperature responses. The ranks of genotypes' transcript abundances often change across temperatures. Conclusions: We report a large, core transcriptome response to cold and the alleviation of cold. 
In cold, many of the core suite of genes are up or downregulated to control plant growth and photosynthesis and limit cellular damage. In recovery, core responses are in part to prepare for future stress. Functionally related genes are consistently and greatly up-regulated in a single genotype in response to cold or its alleviation, suggesting positive selection has driven genotype-specific temperature responses in maize. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
7. parSRA: A framework for the parallel execution of short read aligners on compute clusters.
- Author
-
González-Domínguez, Jorge, Hundt, Christian, and Schmidt, Bertil
- Subjects
NUCLEOTIDE sequencing ,GENOMES ,BIOINFORMATICS ,DATA structures ,WORKFLOW management
- Abstract
The growth of next generation sequencing datasets poses a challenge to the alignment of reads to reference genomes in terms of both accuracy and speed. In this work we present parSRA, a parallel framework to accelerate the execution of existing short read aligners on distributed-memory systems. parSRA can be used to parallelize a variety of short read alignment tools installed in the system without any modification to their source code. We show that our framework provides good scalability on a compute cluster for accelerating the popular BWA-MEM and Bowtie2 aligners. On average, it is able to accelerate sequence alignments on 16 64-core nodes (in total, 1024 cores) with speedup of 10.48 compared to the original multithreaded tools running with 64 threads on one node. It is also faster and more scalable than the pMap and BigBWA frameworks. Source code of parSRA in C++ and UPC++ running on Linux systems with support for FUSE is freely available at https://sourceforge.net/projects/parsra/. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
8. Analysis of optimal alignments unfolds aligners' bias in existing variant profiles.
- Author
-
Quang Tran, Shanshan Gao, and Vinhthuy Phan
- Subjects
- *SINGLE nucleotide polymorphisms , *POPULATION , *KNOT insertion & deletion algorithms , *CHROMOSOME structure , *ALGORITHMS
- Abstract
Efforts such as International HapMap Project and 1000 Genomes Project resulted in a catalog of millions of single nucleotides and insertion/deletion (INDEL) variants of the human population. Viewed as a reference of existing variants, this resource commonly serves as a gold standard for studying and developing methods to detect genetic variants. Our analysis revealed that this reference contained thousands of INDELs that were constructed in a biased manner. This bias occurred at the level of aligning short reads to reference genomes to detect variants. The bias is caused by the existence of many theoretically optimal alignments between the reference genome and reads containing alternative alleles at those INDEL locations. We examined several popular aligners and showed that these aligners could be divided into groups whose alignments yielded INDELs that agreed strongly or disagreed strongly with reported INDELs. This finding suggests that the agreement or disagreement between the aligners’ called INDEL and the reported INDEL is merely a result of the arbitrary selection of one of the optimal alignments. The existence of bias in INDEL calling might have a serious influence in downstream analyses. As such, our finding suggests that this phenomenon should be further addressed. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
9. Distributed hybrid-indexing of compressed pan-genomes for scalable and fast sequence alignment
- Author
-
Altti Ilari Maarala, Ossi Arasalo, Daniel Valenzuela, Veli Mäkinen, Keijo Heljanko, University of Helsinki, Department of Computer Science, Aalto-yliopisto, Aalto University, Helsinki Institute for Information Technology, Algorithmic Bioinformatics, Genome-scale Algorithmics research group / Veli Mäkinen, and Bioinformatics
- Subjects
Computer and Information Sciences ,Bioinformatics ,Science ,RETRIEVAL ,Microbial Genomics ,Research and Analysis Methods ,Microbiology ,Human Genomics ,Chromosomes ,Database and Informatics Methods ,SHORT READ ALIGNMENT ,Escherichia coli ,Genetics ,Humans ,Bacterial Genetics ,BLAST algorithm ,Data Management ,Base Sequence ,Bacterial Genomics ,Genome, Human ,Chromosome Biology ,Microbial Genetics ,High-Throughput Nucleotide Sequencing ,Biology and Life Sciences ,Computational Biology ,Bacteriology ,Genomics ,Cell Biology ,Data Compression ,Genome Analysis ,113 Computer and information sciences ,Medicine ,Sequence Alignment ,Sequence Analysis ,Genome, Bacterial ,Research Article ,STORAGE
- Abstract
Publisher Copyright: Copyright © 2021 Maarala et al. Computational pan-genomics utilizes information from multiple individual genomes in large-scale comparative analysis. Genetic variation between case-controls, ethnic groups, or species can be discovered thoroughly using pan-genomes of such subpopulations. Whole-genome sequencing (WGS) data volumes are growing rapidly, making genomic data compression and indexing methods very important. Despite current space-efficient repetitive sequence compression and indexing methods, the deployed compression methods are often sequential, computationally time-consuming, and do not provide efficient sequence alignment performance on vast collections of genomes such as pan-genomes. For performing rapid analytics with the ever-growing genomics data, data compression and indexing methods have to exploit distributed and parallel computing more efficiently. Instead of strict genome data compression methods, we will focus on the efficient construction of a compressed index for pan-genomes. Compressed hybrid-index enables fast sequence alignments to several genomes at once while shrinking the index size significantly compared to traditional indexes. We propose a scalable distributed compressed hybrid-indexing method for large genomic data sets enabling pan-genome-based sequence search and read alignment capabilities. We show the scalability of our tool, DHPGIndex, by executing experiments in a distributed Apache Spark-based computing cluster comprising 448 cores distributed over 26 nodes. The experiments have been performed both with human and bacterial genomes. DHPGIndex built a BLAST index for n = 250 human pan-genome with an 870:1 compression ratio (CR) in 342 minutes and a Bowtie2 index with 157:1 CR in 397 minutes. For n = 1,000 human pan-genome, the BLAST index was built in 1520 minutes with 532:1 CR and the Bowtie2 index in 1938 minutes with 76:1 CR. Bowtie2 aligned 14.6 GB of paired-end reads to the compressed (n = 1,000) index in 31.7 minutes on a single node. Compressing n = 13,375,031 (488 GB) GenBank database to BLAST index resulted in CR of 62:1 in 575 minutes. BLASTing 189,864 CRISPR-Cas9 gRNA target sequences (23 MB in total) to the compressed index of human pan-genome (n = 1,000) finished in 45 minutes on a single node. 30 MB mixed bacterial sequences (n = 599) were blasted to the compressed index of 488 GB GenBank database (n = 13,375,031) in 26 minutes on 25 nodes. 78 MB mixed sequences (n = 4,167) were blasted to the compressed index of 18 GB E. coli sequence database (n = 745,409) in 5.4 minutes on a single node.
- Published
- 2021
10. Advanced Strategies for Alignment-based Real-time Analysis and Data Protection in Next-Generation Sequencing
- Author
-
Loka, Tobias Pascal
- Subjects
Next-Generation Sequencing ,Pathogen identification ,Genomic Privacy ,Real-time analysis ,Short read alignment ,Illumina sequencing ,500 Naturwissenschaften und Mathematik::570 Biowissenschaften ,Biologie::570 Biowissenschaften ,Biologie ,000 Informatik, Informationswissenschaft, allgemeine Werke::000 Informatik, Wissen, Systeme::004 Datenverarbeitung ,Informatik ,FM-index ,Data protection
- Abstract
Next-generation sequencing (NGS), in particular Illumina sequencing, is the current state-of-the-art DNA sequencing technology. However, when it comes to time-critical analysis, Illumina sequencing lacks sufficiently short turnaround times due to the sequential paradigm of data acquisition and analysis. For clinical application and infectious disease outbreaks, a significant reduction of time needed from sample arrival to analysis outcome is crucial to optimally treat patients and to prevent further spread of disease. At the same time, nucleotide-level analysis is required to enable (sub-)species level classification and determination of organism-specific properties such as, for example, antimicrobial resistances. To accelerate the generation of NGS analysis results, the real-time read aligner HiLive was developed that performs read alignment while sequencing. Still, HiLive delivers results only at the end of the sequencing process and lacks sufficient resolution and scalability. In this thesis, a novel real-time alignment algorithm is introduced that was implemented in HiLive2. Unlike its predecessor, HiLive2 provides results at any desired stage of sequencing at full nucleotide-level resolution. The novel approach is based on an FM-index and is more scalable with respect to reference database size and sample size. HiLive2 enables high-quality downstream analysis as shown by performing variant calling based on real-time alignments of human sequencing data. Further, PathoLive is presented, a pipeline for real-time pathogen identification from metagenomic datasets. Based on the output of HiLive2, PathoLive performs a weighted ranking of identified species. Thereby, sequences that typically do not occur in samples from non-infected human individuals are assumed to be of high clinical significance and therefore highlighted in the results. PathoLive also provides an intuitive and interactive visualization that significantly facilitates the interpretation of results. In a case study of a real-world sample from Sudan, PathoLive enables the correct identification of Crimean–Congo hemorrhagic fever virus based on only a few dozen related reads. Besides analytical challenges, samples from human individuals are problematic with respect to data protection as reads from a human host can be used for the identification of the patient. To address this issue, PriLive was developed that enables the irrevocable removal of human sequences from Illumina sequencing data during the ongoing sequencing process. This enables a much higher level of data protection than conventional post hoc host removal approaches as the human sequences are at no time available in full length.
- Published
- 2020
11. Shared and genetically distinct Zea mays transcriptome responses to ongoing and past low temperature exposure
- Author
-
Hugh J. Earl, Wisam Obeidat, Luis M. Avila, Xiaomu Niu, William Hargreaves, and Lewis Lukens
- Subjects
0301 basic medicine ,Genotype environment interaction ,Genotype ,Transcription, Genetic ,lcsh:QH426-470 ,lcsh:Biotechnology ,Short read alignment ,RNA-Seq ,Environment ,Biology ,Zea mays ,Transcriptome ,03 medical and health sciences ,lcsh:TP248.13-248.65 ,Gene expression ,Genetics ,RNA, Messenger ,Photosynthesis ,Gene–environment interaction ,Gene ,Transcription factor ,2. Zero hunger ,Abiotic stress ,Gene Expression Profiling ,Crossover interactions ,Up-Regulation ,Maize ,Cold Temperature ,lcsh:Genetics ,Glucose ,030104 developmental biology ,DNA microarray ,Research Article ,Signal Transduction ,Biotechnology ,Cold
- Abstract
Background: Cold temperatures and their alleviation affect many plant traits including the abundance of protein coding gene transcripts. Transcript level changes that occur in response to cold temperatures and their alleviation are shared or vary across genotypes. In this study we identify individual transcripts and groups of functionally related transcripts that consistently respond to cold and its alleviation. Genes that respond differently to temperature changes across genotypes may have limited functional importance. We investigate if these genes share functions, and if their genotype-specific gene expression levels change in magnitude or rank across temperatures. Results: We estimate transcript abundances from over 22,000 genes in two unrelated Zea mays inbred lines during and after cold temperature exposure. Genotype and temperature contribute to many genes’ abundances. Past cold exposure affects many fewer genes. Genes up-regulated in cold encode many cytokinin glucoside biosynthesis enzymes, transcription factors, signalling molecules, and proteins involved in diverse environmental responses. After cold exposure, protease inhibitors and cuticular wax genes are newly up-regulated, and environmentally responsive genes continue to be up-regulated. Genes down-regulated in response to cold include many photosynthesis, translation, and DNA replication associated genes. After cold exposure, DNA replication and translation genes are still preferentially downregulated. Lignin and suberin biosynthesis are newly down-regulated. DNA replication, reactive oxygen species response, and anthocyanin biosynthesis genes have strong, genotype-specific temperature responses. The ranks of genotypes’ transcript abundances often change across temperatures. Conclusions: We report a large, core transcriptome response to cold and the alleviation of cold. In cold, many of the core suite of genes are up or downregulated to control plant growth and photosynthesis and limit cellular damage. In recovery, core responses are in part to prepare for future stress. Functionally related genes are consistently and greatly up-regulated in a single genotype in response to cold or its alleviation, suggesting positive selection has driven genotype-specific temperature responses in maize. Electronic supplementary material: The online version of this article (10.1186/s12864-018-5134-7) contains supplementary material, which is available to authorized users.
- Published
- 2018
12. Mapping Reads on a Genomic Sequence: An Algorithmic Overview and a Practical Comparative Analysis.
- Author
-
Schbath, Sophie, Martin, Véronique, Zytnicki, Matthias, Fayolle, Julien, Loux, Valentin, and Gibrat, Jean-François
- Subjects
- *NUCLEOTIDE sequence , *DATA analysis , *BACTERIAL genomes , *HASHING , *COMPUTER programming , *ELECTRONIC data processing , *ALGORITHMS
- Abstract
Mapping short reads against a reference genome is classically the first step of many next-generation sequencing data analyses, and it should be as accurate as possible. Because of the large number of reads to handle, numerous sophisticated algorithms have been developed in the last 3 years to tackle this problem. In this article, we first review the underlying algorithms used in most of the existing mapping tools, and then we compare the performance of nine of these tools on a well-controlled benchmark built for this purpose. We built a set of reads that exist in single or multiple copies in a reference genome and for which there is no mismatch, and a set of reads with three mismatches. We considered as reference genome both the human genome and a concatenation of all complete bacterial genomes. On each dataset, we quantified the capacity of the different tools to retrieve all the occurrences of the reads in the reference genome. Special attention was paid to reads uniquely reported and to reads with multiple hits. [ABSTRACT FROM AUTHOR]
- Published
- 2012
- Full Text
- View/download PDF
13. Challenges of sequencing human genomes.
- Author
-
Koboldt, Daniel C., Ding, Li, Mardis, Elaine R., and Wilson, Richard K.
- Subjects
- *HUMAN genetics , *NUCLEOTIDE sequence , *BIOINFORMATICS , *DNA synthesis , *MEDICAL technology
- Abstract
Massively parallel sequencing technologies continue to alter the study of human genetics. As the cost of sequencing declines, next-generation sequencing (NGS) instruments and datasets will become increasingly accessible to the wider research community. Investigators are understandably eager to harness the power of these new technologies. Sequencing human genomes on these platforms, however, presents numerous production and bioinformatics challenges. Production issues like sample contamination, library chimaeras and variable run quality have become increasingly problematic in the transition from technology development lab to production floor. Analysis of NGS data, too, remains challenging, particularly given the short-read lengths (35–250 bp) and sheer volume of data. The development of streamlined, highly automated pipelines for data analysis is critical for transition from technology adoption to accelerated research and publication. This review aims to describe the state of current NGS technologies, as well as the strategies that enable NGS users to characterize the full spectrum of DNA sequence variation in humans. [ABSTRACT FROM PUBLISHER]
- Published
- 2010
- Full Text
- View/download PDF
14. Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes
- Author
-
Nielsen, H.B., Almeida, M., Sierakowska Juncker, A., Rasmussen, S., Li, J., Sunagawa, S., Plichta, D.R., Gautier, L., Pedersen, A.G., Le Chatelier, E., Pelletier, E., Bonde, I., Nielsen, T., Manichanh, C., Arumugam, M., Batto, J.M., Quintanilha dos Santos, M.B., Blom, N., Borruel, N., Burgdorf, K.S., Boumezbeur, F., Casellas, F., Doré, J., Dworzynski, P., Guarner, F., Hansen, T., Hildebrand, F., Kaas, R.S., Kennedy, S., Kristiansen, K., Kultima, J.R., Leonard, P., Levenez, F., Lund, O., Moumen, B., Le Paslier, D., Pons, N., Pedersen, O., Prifti, E., Qin, J., Raes, J., Sørensen, S., Tap, J., Tims, S., Ussery, D.W., Yamada, T., Jamet, A., Mérieux, A., Cultrone, A., Torrejon, A., Quinquis, B., Brechot, C., Delorme, C., M'Rini, C., de Vos, W.M., Maguin, E., Varela, E., Guedon, E., Gwen, F., Haimet, F., Artiguenave, F., Vandemeulebrouck, G., Denariaz, G., Khaci, G., Blottière, H., Knol, J., Weissenbach, J., van Hylckama Vlieg, J.E., Torben, J., Parkhil, J., Turner, K., van de Guchte, M., Antolin, M., Rescigno, M., Kleerebezem, M., Derrien, M., Galleron, N., Sanchez, N., Grarup, N., Veiga, P., Oozeer, R., Dervyn, R., Layec, S., Bruls, T., Winogradski, Y., Zoetendal, E.G., Renault, D., Sicheritz-Ponten, Bork, P., Wang, J., Brunak, S., Ehrlich, S.D., Center for Biological Sequence Analysis, Technical University of Denmark [Lyngby] (DTU), Novo Nordisk Foundation Center for Biosustainability, MICrobiologie de l'ALImentation au Service de la Santé (MICALIS), Institut National de la Recherche Agronomique (INRA)-AgroParisTech, Department of Computer Science [Baltimore], Johns Hopkins University (JHU), BGI Hong Kong Researche Institute, BGI Shenzhen, School of Bioscience and Biotechnology, Southern University of Science and Technology [Shenzhen] (SUSTech), European Molecular Biology Laboratory, US 1367 MetaGénoPolis, Institut National de la Recherche Agronomique (INRA)-Département Microbiologie et Chaîne Alimentaire (MICA), Institut National de la Recherche Agronomique 
(INRA)-MetaGénoPolis (MGP), Genoscope - Centre national de séquençage [Evry] (GENOSCOPE), Direction de Recherche Fondamentale (CEA) (DRF (CEA)), Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Université Paris-Saclay, Université d'Évry-Val-d'Essonne (UEVE), Novo Nordisk Foundation Center for Basic Metabolic Research (CBMR), Faculty of Health and Medical Sciences, University of Copenhagen = Københavns Universitet (KU)-University of Copenhagen = Københavns Universitet (KU), Digestive System Research Unit, Vall d'Hebron University Hospital [Barcelona], Faculty of Health Sciences, University of Southern Denmark (SDU), Department of Structural Biology, Flanders Institute for Biotechnology, Department of Bioscience Engineering, Vrije Universiteit [Brussels] (VUB), 8National Food Institute - Division for Epidemiology and Microbial Genomics, Department of Biology [Copenhagen], Faculty of Science [Copenhagen], Hagedorn Research Institute, Faculty of Health, Aarhus University [Aarhus], BGI Hong Kong research Institute, Rega Institute - Department of Microbiology and Immunology, Université Catholique de Louvain (UCL), VIB Center for the Biology of Disease, Section of Microbiology [Copenhagen], University of Copenhagen = Københavns Universitet (KU)-University of Copenhagen = Københavns Universitet (KU)-Faculty of Science [Copenhagen], Laboratory of Microbiology, Wageningen University and Research Centre [Wageningen] (WUR), Department of Biological Information, Tokyo Institute of Technology [Tokyo] (TITECH), Max-Delbrück Center for Molecular Medicine, Princess Al Jawhara Center of Excellence in the Research of Hereditary Disorders, King Abdulaziz University, Centre for Host-Microbiome Interactions, Dental Institute Central Office, Guy’s Hospital, King‘s College London, Département Microbiologie et Chaîne Alimentaire (MICA), Institut National de la Recherche Agronomique (INRA), European 
Community's Seventh Framework Programme [FP7-HEALTH-F4-2007-201052, FP7-HEALTH-2010-261376], OpenGPU FUI collaborative research projects, DGCIS, Instituto de Salud Carlos III (Spain), Ministere de la Recherche et de l'Education Nationale (France), [ANR-11-DPBS-0001], Danmarks Tekniske Universitet = Technical University of Denmark (DTU), Beijing Genomics Institute [Shenzhen] (BGI), Southern University of Science and Technology (SUSTech), MetaGenoPolis, Université Paris-Saclay-Direction de Recherche Fondamentale (CEA) (DRF (CEA)), Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Commissariat à l'énergie atomique et aux énergies alternatives (CEA), University of Copenhagen = Københavns Universitet (UCPH)-University of Copenhagen = Københavns Universitet (UCPH), Vrije Universiteit Brussel (VUB), Université Catholique de Louvain = Catholic University of Louvain (UCL), University of Copenhagen = Københavns Universitet (UCPH)-University of Copenhagen = Københavns Universitet (UCPH)-Faculty of Science [Copenhagen], Wageningen University and Research [Wageningen] (WUR), Max Delbrück Center for Molecular Medicine [Berlin] (MDC), Helmholtz-Gemeinschaft = Helmholtz Association, European Project: 201052,EC:FP7:HEALTH,FP7-HEALTH-2007-A,METAHIT(2008), Department of Systems Biology, Center for Biological Sequence Analysis, Ctr Biol Sequence Anal, National University of Singapore (NUS), European Molecular Biology Laboratory [Heidelberg] (EMBL), Department of Mathematics and Computer Science [Odense] (IMADA), Génomique métabolique (UMR 8030), Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Université Paris-Saclay-Direction de Recherche Fondamentale (CEA) (DRF (CEA)), Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Commissariat à l'énergie atomique et aux énergies alternatives (CEA)-Université d'Évry-Val-d'Essonne (UEVE)-Centre National de la Recherche 
Scientifique (CNRS), Vall d’Hebron Research Institute (VHIR), Faculty of Health and Medical Sciences, The Novo Nordisk Foundation Center for Basic Metabolic Research, University of Copenhagen = Københavns Universitet (KU), INRA US1367 MetaGenoPolis, European Molecular Biology Laboratory [Grenoble] (EMBL), Unité de Recherche sur les Maladies Cardiovasculaires, du Métabolisme et de la Nutrition = Institute of cardiometabolism and nutrition (ICAN), Université Pierre et Marie Curie - Paris 6 (UPMC)-Assistance publique - Hôpitaux de Paris (AP-HP) (APHP)-Institut National de la Santé et de la Recherche Médicale (INSERM)-CHU Pitié-Salpêtrière [APHP], Center for Biological Sequence Analysis [Lyngby], Chinese Academy of Agricultural Mechanization Sciences (CCCME), Génétique Microbienne, INRA, Domaine de Vilvert, 78352 Jouy en Josas Cedex, and Department of Bio-engineering Sciences
- Subjects
Cellular immunity ,polypeptide ,[SDV]Life Sciences [q-bio] ,SHORT READ ALIGNMENT SEQUENCES SYSTEMS ALGORITHMS MICROBIOTA PROTEIN LIFE SETS TREE TOOL ,complex metagenomic sample ,Applied Microbiology and Biotechnology ,Genome ,Microbiologie ,Databases, Genetic ,genetic element ,Cluster Analysis ,sets ,short read alignment ,ComputingMilieux_MISCELLANEOUS ,Genetics ,0303 health sciences ,tool ,metagenomic ,tree ,Lactococcus lactis ,IL-12 ,Molecular Medicine ,Biotechnology ,life ,Microbial Genomes ,antigen specific immune response ,Biomedical Engineering ,Bioengineering ,Computational biology ,[SDV.BID]Life Sciences [q-bio]/Biodiversity ,cellular immunity ,Biology ,algorithms ,Microbiology ,03 medical and health sciences ,Genetic variation ,microbiota ,Microbiome ,Gene ,genome ,030304 developmental biology ,adjuvant activity ,VLAG ,030306 microbiology ,Metagenomics ,WIAS ,Microbial genetics ,sequences ,systems ,protein - Abstract
Most current approaches for analyzing metagenomic data rely on comparisons to reference genomes, but the microbial diversity of many environments extends far beyond what is covered by reference databases. De novo segregation of complex metagenomic data into specific biological entities, such as particular bacterial strains or viruses, remains a largely unsolved problem. Here we present a method, based on binning co-abundant genes across a series of metagenomic samples, that enables comprehensive discovery of new microbial organisms, viruses and co-inherited genetic entities and aids assembly of microbial genomes without the need for reference sequences. We demonstrate the method on data from 396 human gut microbiome samples and identify 7,381 co-abundance gene groups (CAGs), including 741 metagenomic species (MGS). We use these to assemble 238 high-quality microbial genomes and identify affiliations between MGS and hundreds of viruses or genetic entities. Our method provides the means for comprehensive profiling of the diversity within complex metagenomic samples.
- Published
- 2014
15. Parallel and scalable short-read alignment on multi-core clusters using UPC++
- Author
-
Liu, Yongchao, Schmidt, Bertil, and González-Domínguez, Jorge
- Subjects
Parallel computing ,Internet ,Genome, Human ,Bioinformatics ,PGAS ,Short read alignment ,lcsh:R ,Computational Biology ,High-Throughput Nucleotide Sequencing ,Reproducibility of Results ,lcsh:Medicine ,004 Informatik ,Humans ,Programming Languages ,lcsh:Q ,High performance computing ,lcsh:Science ,Sequence Alignment ,Algorithms ,004 Data processing ,Research Article - Abstract
The growth of next-generation sequencing (NGS) datasets poses a challenge to the alignment of reads to reference genomes in terms of alignment quality and execution speed. Some available aligners have been shown to obtain high quality mappings at the expense of long execution times. Finding fast yet accurate software solutions is of high importance to research, since availability and size of NGS datasets continue to increase. In this work we present an efficient parallelization approach for NGS short-read alignment on multi-core clusters. Our approach takes advantage of a distributed shared memory programming model based on the new UPC++ language. Experimental results using the CUSHAW3 aligner show that our implementation based on dynamic scheduling obtains good scalability on multi-core clusters. Through our evaluation, we are able to complete the single-end and paired-end alignments of 246 million reads of length 150 base-pairs in 11.54 and 16.64 minutes, respectively, using 32 nodes with four AMD Opteron 6272 16-core CPUs per node. In contrast, the multi-threaded original tool needs 2.77 and 5.54 hours to perform the same alignments on the 64 cores of one node. The source code of our parallel implementation is publicly available at the CUSHAW3 homepage (http://cushaw3.sourceforge.net).
- Published
- 2016
16. RandAL: a randomized approach to aligning DNA sequences to reference genomes
- Author
-
Nobal B. Niraula, Nam S Vo, Vinhthuy Phan, and Quang Tran
- Subjects
Genetics ,Genome ,Research ,next-gen sequencing ,Computational Biology ,High-Throughput Nucleotide Sequencing ,Sequence alignment ,Sequence Analysis, DNA ,Computational biology ,randomization ,Biology ,DNA sequencing ,DNA microarray ,Sequence Alignment ,short read alignment ,Algorithms ,Software ,Biotechnology - Abstract
Background The alignment of short reads generated by next-generation sequencers to genomes is an important problem in many biomedical and bioinformatics applications. Although many proposed methods work very well on narrow ranges of read lengths, they tend to suffer in performance and alignment quality for reads outside of these ranges. Results We introduce RandAL, a novel method that aligns DNA sequences to reference genomes. Our approach utilizes two FM indices to facilitate efficient bidirectional searching, a pruning heuristic to speed up the computing of edit distances, and most importantly, a randomized strategy that enables effective estimation of key parameters. Extensive comparisons showed that RandAL outperformed popular aligners in most instances and was unique in its consistent and accurate performance over a wide range of read lengths and error rates. The software package is publicly available at https://github.com/namsyvo/RandAL. Conclusions RandAL promises to align effectively and accurately short reads that come from a variety of technologies with different read lengths and rates of sequencing error.
- Published
- 2014
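The RandAL abstract above mentions FM indices as the basis for efficient searching. As a rough illustration of that data structure only, the following is a minimal single-index backward search in Python. It is not RandAL's implementation (which, per the abstract, uses two FM indices for bidirectional search plus pruning and randomization); the function names `build_fm_index` and `backward_search` are hypothetical, and the naive suffix sort is only suitable for short demonstration strings.

```python
def build_fm_index(text):
    """Build a toy FM index (suffix array, C table, Occ table) for `text`.

    Assumes `text` does not contain the sentinel character '$'.
    """
    text += "$"  # sentinel, sorts before A, C, G, T
    # Naive suffix array construction: fine for a demo, too slow for genomes.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # C[c] = number of characters in text strictly smaller than c.
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, C, occ


def backward_search(pattern, sa, C, occ):
    """Return sorted start positions of exact matches of `pattern`."""
    lo, hi = 0, len(sa)  # current suffix-array interval [lo, hi)
    for ch in reversed(pattern):
        if ch not in C:
            return []
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


sa, C, occ = build_fm_index("GATTACAGATTACA")
print(backward_search("ATTA", sa, C, occ))  # [1, 8]
```

The pattern is consumed right to left, and each step narrows the suffix-array interval using only the C and Occ tables, which is why the search time depends on the pattern length rather than the reference length.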
17. Harnessing virtual machines to simplify next-generation DNA sequencing analysis
- Author
-
Sébastien Lemieux, Patrick Gendron, Brian T. Wilhelm, Julie Nocq, Magalie Celton, Institut de Recherche en Immunologie et en Cancérologie [UdeM-Montréal] (IRIC), Université de Montréal (UdeM), Department of Medicine, Laboratory for High-Throughput Genomics, Sciences Pour l'Oenologie (SPO), Institut national d’études supérieures agronomiques de Montpellier (Montpellier SupAgro), Institut national d'enseignement supérieur pour l'agriculture, l'alimentation et l'environnement (Institut Agro)-Institut national d'enseignement supérieur pour l'agriculture, l'alimentation et l'environnement (Institut Agro)-Université Montpellier 1 (UM1)-Université de Montpellier (UM)-Institut National de la Recherche Agronomique (INRA), Computer Sciences and Operation Research, Laboratory for Functional and Structural Bioinformatics, Institute for Research in Immunology and Cancer, Université de Montréal, and Université Montpellier 1 (UM1)-Institut de Recherche pour le Développement (IRD [Nouvelle-Calédonie])-Institut National de la Recherche Agronomique (INRA)-Université de Montpellier (UM)-Institut national d’études supérieures agronomiques de Montpellier (Montpellier SupAgro)
- Subjects
Statistics and Probability ,Computer science ,[SDV]Life Sciences [q-bio] ,Genomics ,computer.software_genre ,Biochemistry ,DNA sequencing ,Field (computer science) ,differential expression ,03 medical and health sciences ,0302 clinical medicine ,Software ,framework ,Molecular Biology ,short read alignment ,030304 developmental biology ,0303 health sciences ,business.industry ,High-Throughput Nucleotide Sequencing ,tool ,Sequence Analysis, DNA ,package ,Data science ,3. Good health ,Computer Science Applications ,Variety (cybernetics) ,séquençage nouvelle génération ,Computational Mathematics ,viewer ,Computational Theory and Mathematics ,annotation ,Virtual machine ,030220 oncology & carcinogenesis ,tophat ,technology ,rna seq data ,Data mining ,Personalized medicine ,business ,séquençage adn ,computer - Abstract
Motivation: The growth of next-generation sequencing (NGS) has not only dramatically accelerated the pace of research in the field of genomics, but it has also opened the door to personalized medicine and diagnostics. The resulting flood of data has led to the rapid development of large numbers of bioinformatic tools for data analysis, creating a challenging situation for researchers when choosing and configuring a variety of software for their analysis, and for other researchers trying to replicate their analysis. As NGS technology continues to expand from the research environment into clinical laboratories, the challenges associated with data analysis have the potential to slow the adoption of this technology. Results: Here we discuss the potential of virtual machines (VMs) to be used as a method for sharing entire installations of NGS software (bioinformatic ‘pipelines’). VMs are created by programs designed to allow multiple operating systems to co-exist on a single physical machine, and they can be made following the object-oriented paradigm of encapsulating data and methods together. This allows NGS data to be distributed within a VM, along with the pre-configured software for its analysis. Although VMs have historically suffered from poor performance relative to native operating systems, we present benchmarking results demonstrating that this reduced performance can now be minimized. We further discuss the many potential benefits of VMs as a solution for NGS analysis and describe several published examples. Lastly, we consider the benefits of VMs in facilitating the introduction of NGS technology into the clinical environment. Contact: brian.wilhelm@umontreal.ca
- Published
- 2013
18. Mapping reads on a genomic sequence: a practical comparative analysis
- Author
-
Sophie Schbath, Veronique Martin, Matthias Zytnicki, Julien Fayolle, Valentin Loux, Jean-François Gibrat, Unité Mathématique Informatique et Génome (MIG), Institut National de la Recherche Agronomique (INRA), and Unité de Recherche Génomique Info (URGI)
- Subjects
NGS ,mapping reads ,benchmark ,ComputingMethodologies_PATTERNRECOGNITION ,[SDV]Life Sciences [q-bio] ,ngs ,benchmarking ,short read alignment ,burrows-wheeler transform ,suffix tree ,suffix array ,hashing ,spaced seed ,[SDV.BV]Life Sciences [q-bio]/Vegetal Biology ,[INFO]Computer Science [cs] ,[MATH]Mathematics [math] ,ComputingMilieux_MISCELLANEOUS - Abstract
National audience; Mapping short reads against a reference genome is classically the first step of many next-generation sequencing data analyses, and it should be as accurate as possible. Because of the large number of reads to handle, numerous sophisticated algorithms have been developed in the last 3 years to tackle this problem. In this article, we first review the underlying algorithms used in most of the existing mapping tools, and then we compare the performance of nine of these tools on a well-controlled benchmark built for this purpose. We built a set of reads that exist in single or multiple copies in a reference genome and for which there is no mismatch, and a set of reads with three mismatches. We considered as reference genome both the human genome and a concatenation of all complete bacterial genomes. On each dataset, we quantified the capacity of the different tools to retrieve all the occurrences of the reads in the reference genome. Special attention was paid to reads uniquely reported and to reads with multiple hits.
- Published
- 2012
19. Pattern matching through Chaos Game Representation: bridging numerical and discrete data structures for biological sequence analysis
- Author
-
Jonas S. Almeida, Alexandre P. Francisco, Susana Vinga, Alexandra M. Carvalho, Luís M. S. Russo, and NOVA Medical School|Faculdade de Ciências Médicas (NMS|FCM)
- Subjects
Theoretical computer science ,lcsh:QH426-470 ,Matching (graph theory) ,TANDEM REPEATS ,Computer science ,0206 medical engineering ,02 engineering and technology ,ULTRAFAST ,Rolling hash ,CLASSIFICATION ,03 medical and health sciences ,Structural Biology ,SHORT READ ALIGNMENT ,SEARCH ,ALGORITHM ,Pattern matching ,lcsh:QH301-705.5 ,Molecular Biology ,030304 developmental biology ,GENOMIC SIGNATURE ,0303 health sciences ,Research ,Applied Mathematics ,String (computer science) ,Approximate string matching ,Data structure ,DNA-SEQUENCES ,Substring ,lcsh:Genetics ,lcsh:Biology (General) ,Computational Theory and Mathematics ,TREES ,Graph (abstract data type) ,EFFICIENT ALIGNMENT ,Algorithm ,020602 bioinformatics - Abstract
This work was partially supported by FCT through the PIDDAC Program funds (INESC-ID multiannual funding) and under grant PEst-OE/EEI/LA0008/2011 (IT multiannual funding). In addition, it was also partially funded by projects HIVCONTROL (PTDC/EEA-CRO/100128/2008, S. Vinga, PI), TAGS (PTDC/EIA-EIA/112283/2009) and NEUROCLINOMICS (PTDC/EIA-EIA/111239/2009) from FCT (Portugal). Background: Chaos Game Representation (CGR) is an iterated function that bijectively maps discrete sequences into a continuous domain. As a result, discrete sequences can be the object of statistical and topological analyses otherwise reserved to numerical systems. Characteristically, CGR coordinates of substrings sharing an L-long suffix will be located within 2^(-L) distance of each other. In the two decades since its original proposal, CGR has been generalized beyond its original focus on genomic sequences and has been successfully applied to a wide range of problems in bioinformatics. This report explores the possibility that it can be further extended to approach algorithms that rely on discrete, graph-based representations. Results: The exploratory analysis described here consisted of selecting foundational string problems and refactoring them using CGR-based algorithms. We found that CGR can take the role of suffix trees and emulate sophisticated string algorithms, efficiently solving exact and approximate string matching problems such as finding all palindromes and tandem repeats, and matching with mismatches. The common feature of these problems is that they use longest common extension (LCE) queries as subtasks of their procedures, which we show to have a constant time solution with CGR. Additionally, we show that CGR can be used as a rolling hash function within the Rabin-Karp algorithm. Conclusions: The analysis of biological sequences relies on algorithmic foundations facing mounting challenges, both logistic (performance) and analytical (lack of unifying mathematical framework).
CGR is found to provide the latter and to promise the former: graph-based data structures for sequence analysis operations are entailed by numerical-based data structures produced by CGR maps, providing a unifying analytical framework for a diversity of pattern matching problems.
- Published
- 2012
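The suffix property stated in the CGR abstract above can be checked numerically. The sketch below assumes the common DNA convention of a unit square with one corner per nucleotide and a starting point at the centre; the corner assignment, the names `cgr` and `chebyshev`, and the use of the per-coordinate (maximum) distance are choices made for this illustration, not details taken from the paper.

```python
# Assumed corner assignment for the four nucleotides on the unit square.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr(seq, start=(0.5, 0.5)):
    """Return the CGR coordinate of `seq`: move halfway to each base's corner."""
    x, y = start
    for base in seq:
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
    return x, y


def chebyshev(p, q):
    """Largest per-coordinate difference between two points."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))


suffix = "GATTACA"
a = cgr("ACGTTGCA" + suffix)  # two different prefixes,
b = cgr("TTTT" + suffix)      # same 7-base suffix
print(chebyshev(a, b) <= 2 ** -len(suffix))  # True
```

Each step halves the gap between the two points, so after L shared symbols the coordinates differ by at most 2^(-L) in each dimension, whatever the preceding prefixes were.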
20. Mapping reads on a genomic sequence: an algorithmic overview and a practical comparative analysis
- Author
-
Véronique Martin, Valentin Loux, Jean-François Gibrat, Julien Fayolle, Sophie Schbath, Matthias Zytnicki, Unité Mathématique Informatique et Génome (MIG), Institut National de la Recherche Agronomique (INRA), Unité de Recherche Génomique Info (URGI), Schbath, Sophie, and Martin, Veronique
- Subjects
Burrows–Wheeler transform ,spaced seed ,[SDV]Life Sciences [q-bio] ,Concatenation ,Molecular Sequence Data ,ngs ,benchmarking ,short read alignment ,burrows-wheeler transform ,suffix tree ,suffix array ,hashing ,Hybrid genome assembly ,Genomics ,Bacterial genome size ,Biology ,computer.software_genre ,Set (abstract data type) ,03 medical and health sciences ,Genetics ,Humans ,[INFO]Computer Science [cs] ,[MATH]Mathematics [math] ,Molecular Biology ,Research Articles ,030304 developmental biology ,0303 health sciences ,Bacteria ,Base Sequence ,Genome, Human ,030302 biochemistry & molecular biology ,Chromosome Mapping ,Sequence Analysis, DNA ,analyse comparative ,Computational Mathematics ,ComputingMethodologies_PATTERNRECOGNITION ,Computational Theory and Mathematics ,Modeling and Simulation ,Human genome ,Data mining ,computer ,Sequence Alignment ,algorithme ,Algorithms ,Genome, Bacterial ,Software ,Reference genome - Abstract
International audience; Mapping short reads against a reference genome is classically the first step of many next-generation sequencing data analyses, and it should be as accurate as possible. Because of the large number of reads to handle, numerous sophisticated algorithms have been developed in the last 3 years to tackle this problem. In this article, we first review the underlying algorithms used in most of the existing mapping tools, and then we compare the performance of nine of these tools on a well-controlled benchmark built for this purpose. We built a set of reads that exist in single or multiple copies in a reference genome and for which there is no mismatch, and a set of reads with three mismatches. We considered as reference genome both the human genome and a concatenation of all complete bacterial genomes. On each dataset, we quantified the capacity of the different tools to retrieve all the occurrences of the reads in the reference genome. Special attention was paid to reads uniquely reported and to reads with multiple hits.
- Published
- 2012
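The benchmark described in the abstract above rests on knowing every true occurrence of a read, with zero or a fixed number of mismatches, and then measuring how many of them a mapper reports. A brute-force version of that idea might look as follows; the function names `find_occurrences` and `recall` are hypothetical, the reference string is a toy example, and the actual benchmark construction in the paper may differ.

```python
def find_occurrences(read, reference, max_mismatches=0):
    """Return all start positions where `read` matches `reference`
    with at most `max_mismatches` substitutions (no indels)."""
    hits = []
    n = len(read)
    for pos in range(len(reference) - n + 1):
        window = reference[pos:pos + n]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_mismatches:
            hits.append(pos)
    return hits


def recall(reported, truth):
    """Fraction of true occurrences that a tool reported."""
    if not truth:
        return 1.0
    return len(set(reported) & set(truth)) / len(truth)


reference = "ACGTACGTTACGA"
truth = find_occurrences("ACG", reference)
print(truth)                  # [0, 4, 9]
print(recall([0, 4], truth))  # a tool that missed one of three hits
```

Scanning every position is only practical on small references; it serves here as ground truth against which an index-based mapper's output could be compared.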
21. Prospects and limitations of full-text index structures in genome analysis
- Author
-
Bernard De Baets, Michaël Vyverman, Peter Dawyndt, and Veerle Fack
- Subjects
COMPRESSED SUFFIX ARRAYS ,FM-INDEX ,Genomics ,Biology ,INVERTED FILES ,MASSIVE DATA ,SHORT READ ALIGNMENT ,Genetics ,Survey and Summary ,PRACTICAL ALGORITHM ,Heuristic ,EXTERNAL MEMORY ,String (computer science) ,Full text search ,Biology and Life Sciences ,Data structure ,Data science ,Variety (cybernetics) ,Index (publishing) ,EFFICIENT CONSTRUCTION ,Sequence Analysis ,TREE CONSTRUCTION ,Algorithms ,FM-index ,SEQUENCE COLLECTIONS - Abstract
The combination of incessant advances in sequencing technology producing large amounts of data and innovative bioinformatics approaches, designed to cope with this data flood, has led to new interesting results in the life sciences. Given the magnitude of sequence data to be processed, many bioinformatics tools rely on efficient solutions to a variety of complex string problems. These solutions include fast heuristic algorithms and advanced data structures, generally referred to as index structures. Although the importance of index structures is generally known to the bioinformatics community, the design and potency of these data structures, as well as their properties and limitations, are less understood. Moreover, the last decade has seen a boom in the number of variant index structures featuring complex and diverse memory-time trade-offs. This article brings a comprehensive state-of-the-art overview of the most popular index structures and their recently developed variants. Their features, interrelationships, the trade-offs they impose, but also their practical limitations, are explained and compared.
- Published
- 2012
22. Biased Gene Fractionation and Dominant Gene Expression among the Subgenomes of Brassica rapa
- Author
-
Cheng, F., Wu, J., Fang, L., Sun, S., Liu, B., Lin, K., Bonnema, A.B., and Wang, Xiaowu
- Abstract
Polyploidization, both ancient and recent, is frequent among plants. A “two-step theory” was proposed to explain the meso-triplication of the Brassica “A” genome: Brassica rapa. By accurately partitioning this genome, we observed that genes in the less fractioned subgenome (LF) were dominantly expressed over the genes in more fractioned subgenomes (MFs: MF1 and MF2), while the genes in MF1 were slightly dominantly expressed over the genes in MF2. The results indicated that the dominantly expressed genes tended to be resistant against gene fractionation. By re-sequencing two B. rapa accessions, a vegetable turnip (VT117) and a Rapid Cycling line (L144), we found that genes in LF had fewer nonsynonymous or frameshift mutations than genes in MFs; however, mutation rates were not significantly different between MF1 and MF2. The differences in gene expression patterns and on-going gene death among the three subgenomes suggest that “two-step” genome triplication and differential subgenome methylation played important roles in the genome evolution of B. rapa.
- Published
- 2012
23. ChIP-seq Analysis in R (CSAR): An R package for the statistical detection of protein-bound genomic regions
- Author
-
Muino, J.M., Kaufmann, K., van Ham, R.C.H.J., Angenent, G.C., and Krajewski, P.
- Abstract
Background In vivo detection of protein-bound genomic regions can be achieved by combining chromatin-immunoprecipitation with next-generation sequencing technology (ChIP-seq). The large amount of sequence data produced by this method needs to be analyzed in a statistically proper and computationally efficient manner. The generation of high copy numbers of DNA fragments as an artifact of the PCR step in ChIP-seq is an important source of bias of this methodology. Results We present here an R package for the statistical analysis of ChIP-seq experiments. Taking the average size of DNA fragments subjected to sequencing into account, the software calculates single-nucleotide read-enrichment values. After normalization, sample and control are compared using a test based on the ratio test or the Poisson distribution. Test statistic thresholds to control the false discovery rate are obtained through random permutations. Computational efficiency is achieved by implementing the most time-consuming functions in C++ and integrating these in the R package. An analysis of simulated and experimental ChIP-seq data is presented to demonstrate the robustness of our method against PCR-artefacts and its adequate control of the error rate. Conclusions The software ChIP-seq Analysis in R (CSAR) enables fast and accurate detection of protein-bound genomic regions through the analysis of ChIP-seq experiments. Compared to existing methods, we found that our package shows greater robustness against PCR-artefacts and better control of the error rate.
- Published
- 2011
24. Biased Gene Fractionation and Dominant Gene Expression among the Subgenomes of Brassica rapa
- Author
-
Guusje Bonnema, Ke Lin, Bo Liu, Silong Sun, Lu Fang, Feng Cheng, Jian Wu, and Xiaowu Wang
- Subjects
genome sequence ,lcsh:Medicine ,Gene Expression ,dna methylation ,Plant Genetics ,Genome ,Laboratorium voor Plantenveredeling ,Ploidy ,Vegetables ,Gene duplication ,lcsh:Science ,Genome Evolution ,short read alignment ,Genetics ,Evolutionary Theory ,Multidisciplinary ,plants ,polyploids ,Brassica rapa ,food and beverages ,Agriculture ,Genomics ,duplication ,DNA methylation ,Laboratory of Genetics ,Genome, Plant ,Research Article ,Genome evolution ,Bioinformatics ,Crops ,arabidopsis-thaliana ,Biology ,Laboratorium voor Erfelijkheidsleer ,Chromosomes, Plant ,Frameshift mutation ,Evolution, Molecular ,Bioinformatica ,evolution ,Gene ,Crop Genetics ,Evolutionary Biology ,lcsh:R ,Brassica napus ,Computational Biology ,Genomic Evolution ,Plant Breeding ,lcsh:Q ,EPS ,Genome Expression Analysis ,Population Genetics - Abstract
Polyploidization, both ancient and recent, is frequent among plants. A “two-step theory” was proposed to explain the meso-triplication of the Brassica “A” genome: Brassica rapa. By accurately partitioning this genome, we observed that genes in the less fractioned subgenome (LF) were dominantly expressed over the genes in more fractioned subgenomes (MFs: MF1 and MF2), while the genes in MF1 were slightly dominantly expressed over the genes in MF2. The results indicated that the dominantly expressed genes tended to be resistant against gene fractionation. By re-sequencing two B. rapa accessions, a vegetable turnip (VT117) and a Rapid Cycling line (L144), we found that genes in LF had fewer nonsynonymous or frameshift mutations than genes in MFs; however, mutation rates were not significantly different between MF1 and MF2. The differences in gene expression patterns and on-going gene death among the three subgenomes suggest that “two-step” genome triplication and differential subgenome methylation played important roles in the genome evolution of B. rapa.
- Published
- 2012
25. ChIP-seq Analysis in R (CSAR): An R package for the statistical detection of protein-bound genomic regions
- Author
-
Roeland C. H. J. van Ham, Paweł Krajewski, Gerco C. Angenent, Kerstin Kaufmann, and Jose M. Muiño
- Subjects
0106 biological sciences ,Normalization (statistics) ,False discovery rate ,Computer science ,Bioinformatics ,ultrafast ,Word error rate ,Plant Science ,lcsh:Plant culture ,system ,Poisson distribution ,computer.software_genre ,01 natural sciences ,03 medical and health sciences ,symbols.namesake ,BIOS Applied Bioinformatics ,Software ,Bioinformatica ,Genetics ,Test statistic ,orchestration ,Laboratorium voor Moleculaire Biologie ,lcsh:SB1-1110 ,BIOS Plant Development Systems ,lcsh:QH301-705.5 ,short read alignment ,030304 developmental biology ,0303 health sciences ,EPS-1 ,software ,business.industry ,Ratio test ,Chip ,PRI Bioscience ,lcsh:Biology (General) ,rna-seq ,symbols ,cells ,Data mining ,Laboratory of Molecular Biology ,business ,Algorithm ,computer ,010606 plant biology & botany ,Biotechnology - Abstract
Background In vivo detection of protein-bound genomic regions can be achieved by combining chromatin-immunoprecipitation with next-generation sequencing technology (ChIP-seq). The large amount of sequence data produced by this method needs to be analyzed in a statistically proper and computationally efficient manner. The generation of high copy numbers of DNA fragments as an artifact of the PCR step in ChIP-seq is an important source of bias of this methodology. Results We present here an R package for the statistical analysis of ChIP-seq experiments. Taking the average size of DNA fragments subjected to sequencing into account, the software calculates single-nucleotide read-enrichment values. After normalization, sample and control are compared using a test based on the ratio test or the Poisson distribution. Test statistic thresholds to control the false discovery rate are obtained through random permutations. Computational efficiency is achieved by implementing the most time-consuming functions in C++ and integrating these in the R package. An analysis of simulated and experimental ChIP-seq data is presented to demonstrate the robustness of our method against PCR-artefacts and its adequate control of the error rate. Conclusions The software ChIP-seq Analysis in R (CSAR) enables fast and accurate detection of protein-bound genomic regions through the analysis of ChIP-seq experiments. Compared to existing methods, we found that our package shows greater robustness against PCR-artefacts and better control of the error rate.
- Published
- 2011
26. Analysis of optimal alignments unfolds aligners' bias in existing variant profiles.
- Author
-
Tran Q, Gao S, and Phan V
- Subjects
- Alleles, Data Accuracy, Genomics methods, Humans, Polymorphism, Genetic, Genome, Human, INDEL Mutation, Sequence Alignment methods, Sequence Analysis, DNA methods, Software
- Abstract
Efforts such as International HapMap Project and 1000 Genomes Project resulted in a catalog of millions of single nucleotides and insertion/deletion (INDEL) variants of the human population. Viewed as a reference of existing variants, this resource commonly serves as a gold standard for studying and developing methods to detect genetic variants. Our analysis revealed that this reference contained thousands of INDELs that were constructed in a biased manner. This bias occurred at the level of aligning short reads to reference genomes to detect variants. The bias is caused by the existence of many theoretically optimal alignments between the reference genome and reads containing alternative alleles at those INDEL locations. We examined several popular aligners and showed that these aligners could be divided into groups whose alignments yielded INDELs that agreed strongly or disagreed strongly with reported INDELs. This finding suggests that the agreement or disagreement between the aligners' called INDEL and the reported INDEL is merely a result of the arbitrary selection of one of the optimal alignments. The existence of bias in INDEL calling might have a serious influence in downstream analyses. As such, our finding suggests that this phenomenon should be further addressed.
- Published
- 2016
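The ambiguity described in the abstract above can be illustrated with a small sketch. Inside a repeat, deleting the same number of bases at several different positions yields the identical read, so each placement is a single gap of the same length with no mismatches and scores the same under any position-independent scoring scheme. The function name and the toy sequences are hypothetical and are not taken from the paper.

```python
def equivalent_deletion_positions(reference: str, read: str, length: int) -> list[int]:
    """Return every 0-based start at which deleting `length` bases
    from `reference` reproduces `read` exactly."""
    return [
        start
        for start in range(len(reference) - length + 1)
        if reference[:start] + reference[start + length:] == read
    ]


# Toy example: the read lacks one T from a run of four Ts.
reference = "GATTTTC"
read = "GATTTC"
positions = equivalent_deletion_positions(reference, read, 1)

# An aligner must pick one placement; common conventions are the
# leftmost or the rightmost equivalent position.
leftmost, rightmost = positions[0], positions[-1]
```

Two aligners that choose opposite ends of this range report different INDEL coordinates for the same underlying event, which is the kind of disagreement the abstract attributes to arbitrary selection among optimal alignments.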
27. Short Read Alignment Using SOAP2.
- Author
-
Hurgobin B
- Subjects
- Algorithms, Computational Biology methods, Genomics methods, Sequence Analysis, DNA methods, Software
- Abstract
Next-generation sequencing (NGS) technologies have rapidly evolved in the last 5 years, leading to the generation of millions of short reads in a single run. Consequently, various sequence alignment algorithms have been developed to compare these reads to an appropriate reference in order to perform important downstream analysis. SOAP2 from the SOAP series is one of the most commonly used alignment programs for NGS data, and it handles such data efficiently, with low memory usage and fast alignment speed. This chapter describes the protocol used to align short reads to a reference genome using SOAP2, and highlights the significance of using the built-in command-line options to tune the behavior of the algorithm according to the inputs and the desired results.
- Published
- 2016
28. Analysis of optimal alignments unfolds aligners’ bias in existing variant profiles
- Author
-
Quang Tran, Vinhthuy Phan, and Shanshan Gao
- Subjects
0301 basic medicine ,Population ,Short read alignment ,Computational biology ,Biology ,Genome ,Biochemistry ,03 medical and health sciences ,0302 clinical medicine ,INDEL Mutation ,Structural Biology ,Variant calling ,Humans ,1000 Genomes Project ,International HapMap Project ,Indel ,education ,INDEL detection ,Molecular Biology ,Selection (genetic algorithm) ,Alleles ,Genetics ,education.field_of_study ,Polymorphism, Genetic ,Genome, Human ,Applied Mathematics ,food and beverages ,Genomics ,Sequence Analysis, DNA ,Data Accuracy ,Computer Science Applications ,030104 developmental biology ,Proceedings ,DNA microarray ,Sequence Alignment ,030217 neurology & neurosurgery ,Software ,Reference genome
- Abstract
Efforts such as the International HapMap Project and the 1000 Genomes Project resulted in a catalog of millions of single-nucleotide and insertion/deletion (INDEL) variants of the human population. Viewed as a reference of existing variants, this resource commonly serves as a gold standard for studying and developing methods to detect genetic variants. Our analysis revealed that this reference contained thousands of INDELs that were constructed in a biased manner. This bias occurred at the level of aligning short reads to reference genomes to detect variants. The bias is caused by the existence of many theoretically optimal alignments between the reference genome and reads containing alternative alleles at those INDEL locations. We examined several popular aligners and showed that these aligners could be divided into groups whose alignments yielded INDELs that agreed strongly or disagreed strongly with reported INDELs. This finding suggests that the agreement or disagreement between the aligners’ called INDEL and the reported INDEL is merely a result of the arbitrary selection of one of the optimal alignments. The existence of bias in INDEL calling might have a serious influence on downstream analyses. As such, our finding suggests that this phenomenon should be further addressed.