MegaGTA: a sensitive and accurate metagenomic gene-targeted assembler using iterative de Bruijn graphs
- Source :
- BMC Bioinformatics, Vol 18, Iss S12, Pp 67-75 (2017)
- Publication Year :
- 2017
- Publisher :
- Springer Science and Business Media LLC, 2017.
Abstract
- The recent release of the gene-targeted metagenomics assembler Xander demonstrated that using a trained hidden Markov model (HMM) to guide the traversal of a de Bruijn graph gives a clear advantage over other assembly methods. As a pilot study, however, Xander leaves considerable room for improvement. Apart from its slow speed, Xander uses only one k-mer size for graph construction, and any choice of k compromises either sensitivity or accuracy. Xander uses a Bloom-filter representation of the de Bruijn graph to achieve a lower memory footprint, but Bloom filters introduce false positives, and it is not clear how these affect assembly quality. Xander also does not keep track of the multiplicity of k-mers, which would have been an effective way to differentiate between erroneous and correct k-mers. In this paper, we present a new gene-targeted assembler, MegaGTA, which attempts to improve on Xander in several respects. Quality-wise, it utilizes iterative de Bruijn graphs to take full advantage of multiple k-mer sizes and make the best of both sensitivity and accuracy. Computation-wise, it employs succinct de Bruijn graphs (SdBG) to achieve a low memory footprint and high speed (the latter benefits from a highly efficient parallel algorithm for constructing SdBGs). Unlike a Bloom filter, an SdBG is an exact representation of a de Bruijn graph. It enables MegaGTA to avoid false-positive contigs and to easily incorporate the multiplicity of k-mers for building a better HMM. We compared MegaGTA and Xander on an HMP-defined mock metagenomic dataset and showed that MegaGTA excelled in both sensitivity and accuracy. On a large rhizosphere soil metagenomic sample (327 Gbp), MegaGTA produced 9.7–19.3% more contigs than Xander, and these contigs were assigned to 10–25% more gene references. In our experiments, MegaGTA, depending on the number of k-mers used, is two to ten times faster than Xander.
MegaGTA improves on the algorithm of Xander and achieves higher sensitivity, accuracy and speed. Moreover, it is capable of assembling gene sequences from ultra-large metagenomic datasets. Its source code is freely available at https://github.com/HKU-BAL/megagta.
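The abstract's point about k-mer multiplicity can be illustrated with a small sketch. This is not MegaGTA's implementation: it uses a plain Python `Counter` as an exact k-mer table rather than a succinct de Bruijn graph, and it does not model the HMM guidance or the iterative multi-k construction. The function names, the toy reads and the `min_count` threshold are all assumptions chosen for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer across all reads, keeping its multiplicity."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def solid_kmers(counts, min_count=2):
    """Keep k-mers seen at least min_count times.

    Assumption for this sketch: k-mers seen only once are treated as
    likely sequencing errors. The threshold is illustrative.
    """
    return {kmer for kmer, c in counts.items() if c >= min_count}


def de_bruijn_edges(kmers):
    """Each k-mer is an edge from its (k-1)-mer prefix to its (k-1)-mer suffix."""
    return {(kmer[:-1], kmer[1:]) for kmer in kmers}


# Hypothetical toy reads; the last one carries a single-base error (A -> T).
reads = ["ACGTAC", "ACGTAC", "CGTACG", "CGTACG", "ACGTTC"]

counts = count_kmers(reads, k=4)
solid = solid_kmers(counts, min_count=2)
edges = de_bruijn_edges(solid)
```

Because the table is exact, a k-mer is in the graph only if it really occurred in the reads, and its count is available for filtering. A Bloom filter answers only approximate membership queries, so it can report k-mers that never occurred and stores no counts.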
- Subjects :
- 0301 basic medicine
Theoretical computer science
Source code
Computer science
Assembly
Statistics as Topic
0206 medical engineering
Parallel algorithm
Pilot Projects
02 engineering and technology
lcsh:Computer applications to medicine. Medical informatics
Biochemistry
De Bruijn graph
Soil
03 medical and health sciences
Structural Biology
Databases, Genetic
Humans
Targeted gene
Hidden Markov model
lcsh:QH301-705.5
Molecular Biology
De Bruijn sequence
Applied Mathematics
Multiplicity (mathematics)
Bloom filter
Reference Standards
Graph
Computer Science Applications
Tree traversal
030104 developmental biology
lcsh:Biology (General)
Genes
Rhizosphere
Memory footprint
lcsh:R858-859.7
Metagenomics
Algorithm
Software
Algorithms
020602 bioinformatics
Details
- ISSN :
- 1471-2105
- Volume :
- 18
- Database :
- OpenAIRE
- Journal :
- BMC Bioinformatics
- Accession number :
- edsair.doi.dedup.....b918a965011a2c78f93589f93d15b80c