1. A fast and scalable high-throughput sequencing data error correction via oligomers
- Author
-
Franco Milicchio, Iain E. Buchan, and Mattia C. F. Prosperi
- Subjects
error correction, 0301 basic medicine, Computer science, 0206 medical engineering, Hash function, Inference, Genomics, 02 engineering and technology, De Bruijn graph, 03 medical and health sciences, Genetics, Artificial Intelligence, next generation sequencing, Health Informatics, Sanger sequencing, Agricultural and Biological Sciences (miscellaneous), Range (mathematics), 030104 developmental biology, Computational Mathematics, Scalability, Data mining, Error detection and correction, 020602 bioinformatics, Biotechnology
- Abstract
Next-generation sequencing (NGS) technologies have superseded the traditional Sanger sequencing approach in many experimental settings, given their tremendous yield and affordable cost. It is now possible to sequence any microbial organism or metagenomic sample within hours, and to obtain a whole human genome in weeks. Nonetheless, NGS technologies are error-prone. Correcting errors is a challenge due to multiple factors, including the data sizes, the machine-specific and non-random characteristics of errors, and the error distributions. Errors in NGS experiments can hamper subsequent data analysis and inference. This work proposes an error correction method based on the de Bruijn graph that can run on gigabyte-sized data sets using ordinary desktop or laptop computers, making it ideal for genome sizes in the megabase range, e.g. bacteria. The implementation makes extensive use of hashing techniques and uses an A* algorithm for optimal error correction, minimizing the distance, measured with the Needleman-Wunsch score, between an erroneous read and its possible replacement. Our approach outperforms other popular methods in terms of both random access memory usage and computing time.
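The building blocks named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `min_count` threshold, the toy reads, and the match/mismatch/gap scores are all assumptions chosen for illustration. The sketch counts oligomers (k-mers) in a hash table, links frequently seen k-mers into de Bruijn graph edges, and computes a Needleman-Wunsch score that could rank a candidate replacement against an erroneous read. The A* search over the graph is deliberately left out.

```python
from collections import Counter, defaultdict


def count_kmers(reads, k):
    """Count every k-mer (oligomer) across all reads using a hash table."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def de_bruijn_edges(counts, min_count):
    """Build de Bruijn edges from k-mers seen at least min_count times.

    Each retained k-mer links its (k-1)-prefix to its (k-1)-suffix.
    The min_count threshold is an illustrative assumption.
    """
    graph = defaultdict(set)
    for kmer, n in counts.items():
        if n >= min_count:
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of a and b (score only, two-row DP).

    The scoring values are assumptions, not taken from the paper.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


# Toy data: the third read differs from the others by one substitution.
reads = ["ACGTACGT", "ACGTACGT", "ACGTTCGT"]
counts = count_kmers(reads, k=4)
graph = de_bruijn_edges(counts, min_count=2)

# k-mers below the threshold are candidates for containing an error.
weak = sorted(kmer for kmer, n in counts.items() if n < 2)
score = needleman_wunsch_score(reads[2], reads[0])
```

In this toy input the rare k-mers all overlap the substituted base, and the alignment score quantifies how close the suspect read is to a well-supported one. A full corrector would search the graph for the replacement path that optimizes this score.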
- Published
- 2016