1. Accurate self-correction of errors in long reads using de Bruijn graphs
- Author
-
Leena Salmela, Riku Walve, Eric Rivals, Esko Ukkonen, Department of Computer Science, Helsinki Institute for Information Technology, Finnish Centre of Excellence in Algorithmic Data Analysis Research (Algodan), Combinatorial Pattern Matching research group / Esko Ukkonen, Genome-scale Algorithmics research group / Veli Mäkinen, Bioinformatics, Algorithmic Bioinformatics, Helsinki Institute for Information Technology (HIIT), Helsingin yliopisto = Helsingfors universitet = University of Helsinki-Aalto University, Méthodes et Algorithmes pour la Bioinformatique (MAB), Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier (LIRMM), Université de Montpellier (UM)-Centre National de la Recherche Scientifique (CNRS)-Université de Montpellier (UM)-Centre National de la Recherche Scientifique (CNRS), Institut de Biologie Computationnelle (IBC), Institut National de la Recherche Agronomique (INRA)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Montpellier (UM)-Centre National de la Recherche Scientifique (CNRS), Academy of Finland (grant 267591)EU FP7 SYSCOL UE7-SYSCOL-258236, ANR-12-BS02-0008,Colib'read,Méthodes d'extraction d'information biologique dans les données HTS non assemblées(2012), ANR-11-BINF-0002,IBC,Institut de biologie Computationnelle(2011), European Project: 258236,EC:FP7:HEALTH,FP7-HEALTH-2010-two-stage,SYSCOL(2011), Aalto University, ANR-11-BINF-0002,IBC,Institut de Biologie Computationnelle de Montpellier(2011), Université de Montpellier (UM)-Institut National de la Recherche Agronomique (INRA)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS), Aalto University-University of Helsinki, and Centre National de la Recherche Scientifique (CNRS)-Université de Montpellier (UM)-Centre National de la Recherche Scientifique (CNRS)-Université de Montpellier (UM)
- Subjects
0301 basic medicine ,Statistics and Probability ,assembly ,Computer science ,0206 medical engineering ,ACM: F.: Theory of Computation/F.2: ANALYSIS OF ALGORITHMS AND PROBLEM COMPLEXITY ,education ,Sequence assembly ,Word error rate ,non hybrid correction ,Recomb-Seq/Recomb-Cbb 2016 ,02 engineering and technology ,Saccharomyces cerevisiae ,Biochemistry ,Set (abstract data type) ,03 medical and health sciences ,substitution ,Escherichia coli ,Quantitative Biology - Genomics ,Molecular Biology ,Self correction ,Throughput (business) ,De Bruijn sequence ,Genomics (q-bio.GN) ,PacBio ,Genome ,ACM: F.: Theory of Computation/F.2: ANALYSIS OF ALGORITHMS AND PROBLEM COMPLEXITY/F.2.2: Nonnumerical Algorithms and Problems/F.2.2.6: Sorting and searching ,Sequence analysis ,1184 Genetics, developmental biology, physiology ,High-Throughput Nucleotide Sequencing ,Sequence Analysis, DNA ,DNA ,113 Computer and information sciences ,Computer Science Applications ,Computational Mathematics ,030104 developmental biology ,Computational Theory and Mathematics ,de Bruijn ,indel ,FOS: Biological sciences ,NGS ,LoRDEC ,Nanopore sequencing ,[INFO.INFO-BI]Computer Science [cs]/Bioinformatics [q-bio.QM] ,Error detection and correction ,Algorithm ,020602 bioinformatics ,Algorithms ,Software - Abstract
New long read sequencing technologies, like PacBio SMRT and Oxford NanoPore, can produce sequencing reads up to 50,000 bp long but with an error rate of at least 15%. Reducing the error rate is necessary for subsequent utilisation of the reads in, e.g., de novo genome assembly. The error correction problem has been tackled either by aligning the long reads against each other or by a hybrid approach that uses the more accurate short reads produced by second generation sequencing technologies to correct the long reads. We present an error correction method that uses long reads only. The method consists of two phases: first we use an iterative alignment-free correction method based on de Bruijn graphs with increasing length of k-mers, and second, the corrected reads are further polished using long-distance dependencies that are found using multiple alignments. According to our experiments the proposed method is the most accurate one relying on long reads only for read sets with high coverage. Furthermore, when the coverage of the read set is at least 75x, the throughput of the new method is at least 20% higher. LoRMA is freely available at http://www.cs.helsinki.fi/u/lmsalmel/LoRMA/., paper accepted at the RECOMB-Seq 2016
- Published
- 2017