1. Binning Metagenomic Contigs Using Contig Embedding and Decomposed Tetranucleotide Frequency.
- Author
- Fu, Long; Shi, Jiabin; Huang, Baohua
- Subjects
- *LANGUAGE models, *MATRIX decomposition, *NONNEGATIVE matrices, *MICROBIAL genomes, *METAGENOMICS
- Abstract
Simple Summary: Metagenomic binning is the step of metagenomic analysis that groups genome sequences from complex microbial communities by their likely source species. It aims to recover more complete genomes and supports the study of microbial community structure and function. Obtaining more effective feature representations of genome sequences to improve binning remains a major challenge. This study proposes a new metagenomic binning method, CedtBin, which uses an improved BERT model to obtain an embedding of each contig and concatenates it with the decomposed tetranucleotide frequencies to form a new feature representation. It also proposes the Annoy-DBSCAN clustering algorithm, which adaptively determines the parameters of the DBSCAN clustering algorithm for binning. The results show that CedtBin bins well on both simulated and real datasets and can reconstruct more genomes.
Metagenomic binning is a crucial step in metagenomic research: it aggregates genome sequences belonging to the same microbial species into separate bins. Most existing methods ignore the semantic information of contigs and do not process tetranucleotide frequency effectively, so the features extracted for binning are both insufficient and overly complex, and binning results are poor. To address these problems, we propose CedtBin, a metagenomic binning method based on contig embedding and decomposed tetranucleotide frequency. First, an improved BERT model is trained on the contigs to obtain their embedding representations. Second, the tetranucleotide frequencies are decomposed using a non-negative matrix factorization (NMF) algorithm. The two features are then concatenated and passed to the clustering algorithm for binning.
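As a rough illustration of the feature construction described above, the following Python sketch computes tetranucleotide frequencies, decomposes them with NMF, and concatenates the result with contig embeddings. It is not the authors' implementation. Several details are assumptions made for the example: all 256 raw tetranucleotides are counted without merging reverse complements, NMF is done with plain multiplicative updates, the `rank` value is arbitrary, and the `embeddings` array is a zero-filled placeholder standing in for the output of the BERT model. The function names are hypothetical.

```python
from itertools import product

import numpy as np

# All 256 tetranucleotides over A, C, G, T (assumption: no reverse-complement merging).
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]
KMER_INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def tetranucleotide_frequency(seq):
    """Return the normalized 256-dimensional tetranucleotide frequency of one contig."""
    counts = np.zeros(len(KMERS))
    seq = seq.upper()
    for i in range(len(seq) - 3):
        idx = KMER_INDEX.get(seq[i:i + 4])  # windows with ambiguous bases (e.g. N) are skipped
        if idx is not None:
            counts[idx] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts


def nmf(V, rank, n_iter=200, seed=0, eps=1e-10):
    """Factor a non-negative matrix V (n x m) into W (n x rank) and H (rank x m)
    using multiplicative updates for the Frobenius-norm objective."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank))
    H = rng.random((rank, V.shape[1]))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


def build_features(contigs, embeddings, rank=8):
    """Concatenate per-contig embeddings with the NMF-reduced tetranucleotide frequencies."""
    tnf = np.vstack([tetranucleotide_frequency(c) for c in contigs])
    W, _ = nmf(tnf, rank)  # each row of W is the low-dimensional TNF representation of a contig
    return np.hstack([embeddings, W])


# Toy usage with made-up sequences and a placeholder embedding matrix.
contigs = ["ACGTACGTTGCAACGT" * 5, "GGGCCCGGGCCCAAAT" * 5, "ATATATATCGCGCGAT" * 5]
embeddings = np.zeros((len(contigs), 4))  # placeholder for BERT contig embeddings
features = build_features(contigs, embeddings, rank=2)
```

Each row of `features` is one contig: the first columns hold the embedding and the last `rank` columns hold the decomposed tetranucleotide frequencies, which is the form a clustering step would consume.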
Because the DBSCAN clustering algorithm is sensitive to its input parameters, and to avoid the drawbacks of setting them manually, we also propose Annoy-DBSCAN, an algorithm that adaptively determines the parameters of DBSCAN. It combines Approximate Nearest Neighbors Oh Yeah (Annoy) with a grid search strategy to find the optimal DBSCAN parameters. On simulated and real datasets, CedtBin achieves better binning results than mainstream methods and can reconstruct more genomes, indicating that the proposed method is effective. [ABSTRACT FROM AUTHOR]
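The idea of choosing DBSCAN parameters from nearest-neighbor distances plus a grid search can be sketched as follows. This is only an illustration under stated assumptions, not the Annoy-DBSCAN algorithm itself. The abstract does not say how candidate parameters are generated or scored, so the quantile-based `eps` candidates, the `min_samples` grid, and the silhouette score used as the selection criterion are all choices made for this example. scikit-learn's exact `NearestNeighbors` search stands in for Annoy's approximate index to keep the example short. The function names are hypothetical.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score
from sklearn.neighbors import NearestNeighbors


def candidate_eps(X, k, n_candidates=10):
    """Derive eps candidates from the distribution of k-th nearest-neighbor distances."""
    # n_neighbors = k + 1 because each point is returned as its own nearest neighbor.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, _ = nn.kneighbors(X)
    kth = dist[:, -1]
    # Assumption: sample candidates between the median and the 95th percentile.
    return np.unique(np.quantile(kth, np.linspace(0.5, 0.95, n_candidates)))


def search_dbscan_params(X, min_samples_grid=(3, 5, 8)):
    """Grid-search (eps, min_samples) and keep the setting with the best silhouette
    score computed on the non-noise points. Returns None if no setting yields
    at least two clusters."""
    best = None
    for m in min_samples_grid:
        for eps in candidate_eps(X, k=m):
            if eps <= 0:  # DBSCAN requires a strictly positive radius
                continue
            labels = DBSCAN(eps=eps, min_samples=m).fit_predict(X)
            clustered = labels != -1  # DBSCAN marks noise points with -1
            if len(set(labels[clustered])) < 2:
                continue  # silhouette needs at least two clusters
            score = silhouette_score(X[clustered], labels[clustered])
            if best is None or score > best[0]:
                best = (score, float(eps), m, labels)
    return best


# Toy usage: three well-separated synthetic blobs in place of contig features.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])
score, eps, min_samples, labels = search_dbscan_params(X)
```

On real contig features the neighbor search dominates the cost, which is where an approximate index such as Annoy would replace the exact search used here.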
- Published
- 2024