1. Fast Metagenomic Binning via Hashing and Bayesian Clustering.
- Author
- Popic V, Kuleshov V, Snyder M, and Batzoglou S
- Subjects
- Cluster Analysis, Humans, Metagenome genetics, Bayes Theorem, Computational Biology statistics & numerical data, Metagenomics statistics & numerical data, Microbiota genetics
- Abstract
We introduce GATTACA, a framework for fast unsupervised binning of metagenomic contigs. Similar to recent approaches, GATTACA clusters contigs based on their coverage profiles across a large cohort of metagenomic samples; however, unlike previous methods that rely on read mapping, GATTACA quickly estimates these profiles from k-mer counts stored in a compact index. This approach can yield a speedup of over an order of magnitude, while matching the accuracy of earlier methods on synthetic and real data benchmarks. It also provides a way to index metagenomic samples (e.g., from public repositories such as the Human Microbiome Project) offline once and reuse them across experiments; furthermore, the small size of the sample indices allows them to be easily transferred and stored. Leveraging the MinHash technique, GATTACA also provides an efficient way to identify publicly available metagenomic data that can be incorporated into the set of reference metagenomes to further improve binning accuracy. By enabling easy indexing and reuse of publicly available metagenomic data sets, GATTACA thus makes accurate metagenomic analyses accessible to a much wider range of researchers.
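The abstract mentions MinHash as the means of finding similar metagenomic data sets. As a rough illustration of that general technique only, the sketch below estimates the Jaccard similarity of two sequences' k-mer sets using a bottom-s MinHash variant. It is not GATTACA's implementation: the function names, the choice of BLAKE2b as the hash, and the defaults `k=4` and `size=64` are all assumptions made for this example.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer):
    """Deterministic 64-bit hash of a k-mer (BLAKE2b is an arbitrary choice)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_sketch(seq, k=4, size=64):
    """Bottom-s MinHash sketch: the `size` smallest distinct k-mer hashes."""
    return sorted({hash64(km) for km in kmers(seq, k)})[:size]


def estimate_jaccard(sketch_a, sketch_b, size=64):
    """Estimate the Jaccard similarity of two k-mer sets from their sketches.

    Take the `size` smallest hashes of the merged sketches and report the
    fraction of them that appear in both sketches.
    """
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)
```

For example, `estimate_jaccard(bottom_sketch(a), bottom_sketch(b))` compares two sequences `a` and `b` by touching only the fixed-size sketches, which is why such sketches can be computed once per sample and reused. When the sketch size is at least the number of distinct k-mers, the estimate coincides with the exact Jaccard similarity (barring hash collisions); for larger inputs it is an approximation whose error shrinks as `size` grows.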
- Published
- 2018