
Machine learning classification of archaea and bacteria identifies novel predictive genomic features

Authors :
Tania Bobbo
Filippo Biscarini
Sachithra K. Yaddehige
Leonardo Alberghini
Davide Rigoni
Nicoletta Bianchi
Cristian Taccioli
Source :
BMC Genomics, Vol 25, Iss 1, Pp 1-15 (2024)
Publication Year :
2024
Publisher :
BMC, 2024.

Abstract

Background: Archaea and Bacteria are distinct domains of life that are adapted to a variety of ecological niches. Several genome-based methods have been developed for their accurate classification, yet many aspects of the specific genomic features that determine these differences are not fully understood. In this study, we used publicly available whole-genome sequences from bacteria (N = 2546) and archaea (N = 109). From these, a set of genomic features (nucleotide frequencies and proportions, coding sequences (CDS), non-coding, ribosomal and transfer RNA genes (ncRNA, rRNA, tRNA), Chargaff’s, topological entropy and Shannon’s entropy scores) was extracted and used as input data to develop machine learning models for the classification of archaea and bacteria.

Results: The classification accuracy ranged from 0.993 (Random Forest) to 0.998 (Neural Networks). Over the four models, only 11 examples were misclassified, especially those belonging to the minority class (Archaea). From variable importance, tRNA topological and Shannon’s entropy, nucleotide frequencies in tRNA, rRNA and ncRNA, CDS, tRNA and rRNA Chargaff’s scores have emerged as the top discriminating factors. In particular, tRNA entropy (both topological and Shannon’s) was the most important genomic feature for classification, pointing at the complex interactions between the genetic code, tRNAs and the translational machinery.

Conclusions: tRNA, rRNA and ncRNA genes emerged as the key genomic elements that underpin the classification of archaea and bacteria. In particular, higher nucleotide diversity was found in tRNA from bacteria compared to archaea. The analysis of the few classification errors reflects the complex phylogenetic relationships between bacteria, archaea and eukaryotes.
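Two of the feature families named in the abstract, nucleotide frequencies and Shannon's entropy, can be illustrated with a short sketch. This is not the authors' pipeline: the function names `nucleotide_frequencies` and `shannon_entropy` are hypothetical, and the entropy is computed here over single-nucleotide frequencies in bits, which is an assumption, since the abstract does not state the exact definition used (for instance, whether longer k-mers were considered).

```python
import math
from collections import Counter


def nucleotide_frequencies(seq):
    """Return the relative frequency of A, C, G and T in seq.

    Assumption: symbols other than A/C/G/T (e.g. N) are ignored.
    """
    counts = Counter(base for base in seq.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A/C/G/T characters")
    return {base: counts.get(base, 0) / total for base in "ACGT"}


def shannon_entropy(seq):
    """Shannon entropy (in bits) of the single-nucleotide distribution of seq.

    Assumption: H = -sum(p * log2(p)) over the four bases; terms with
    p == 0 are skipped because they contribute nothing to the sum.
    """
    freqs = nucleotide_frequencies(seq)
    return sum(-p * math.log2(p) for p in freqs.values() if p > 0)


if __name__ == "__main__":
    # Toy sequences, not data from the study.
    print(nucleotide_frequencies("AACG"))
    print(shannon_entropy("ACGT"))
    print(shannon_entropy("AAAA"))
```

Under this definition, a sequence that uses all four bases equally reaches the maximum of 2 bits, while a homopolymer has zero entropy, so a higher value corresponds to greater nucleotide diversity in the gene class being scored.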

Details

Language :
English
ISSN :
1471-2164
Volume :
25
Issue :
1
Database :
Directory of Open Access Journals
Journal :
BMC Genomics
Publication Type :
Academic Journal
Accession number :
edsdoj.1ba5ad1996c0472fa80c950a9624cb3e
Document Type :
article
Full Text :
https://doi.org/10.1186/s12864-024-10832-y