Semantic search using protein large language models detects class II microcins in bacterial genomes.

Authors :
Kulikova AV
Parker JK
Davies BW
Wilke CO
Source :
MSystems [mSystems] 2024 Oct 22; Vol. 9 (10), pp. e0104424. Date of Electronic Publication: 2024 Sep 18.
Publication Year :
2024

Abstract

Class II microcins are antimicrobial peptides that have shown some potential as novel antibiotics. However, to date, only 10 class II microcins have been described, and the discovery of novel microcins has been hampered by their short length and high sequence divergence. Here, we ask if we can use numerical embeddings generated by protein large language models to detect microcins in bacterial genome assemblies and whether this method can outperform sequence-based methods such as BLAST. We find that embeddings detect known class II microcins much more reliably than BLAST does and that any two microcins tend to have a small distance in embedding space even though they are typically highly diverged at the sequence level. In data sets of Escherichia coli, Klebsiella spp., and Enterobacter spp. genomes, we further find novel putative microcins that were previously missed by sequence-based search methods.

Importance: Antibiotic resistance is becoming an increasingly serious problem in modern medicine, but the development pipeline for conventional antibiotics is not promising. Therefore, alternative approaches to combat bacterial infections are urgently needed. One such approach may be to employ naturally occurring antibacterial peptides produced by bacteria to kill competing bacteria. One promising class of such peptides is the class II microcins. However, only a small number of class II microcins have been discovered to date, and the discovery of further such microcins has been hampered by their high sequence divergence and short length, which can cause sequence-based search methods to fail. Here, we demonstrate that a more robust method for microcin discovery can be built on the basis of a protein large language model, and we use this method to identify several putative novel class II microcins.

Competing Interests: The authors declare no conflict of interest.
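
The embedding-based search described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes that each peptide has already been converted to a fixed-length embedding by some protein language model, and it ranks candidate open reading frames by their mean cosine distance to the known microcins. The distance metric, the use of the mean, the function names, and the three-dimensional toy vectors are all assumptions made for illustration only.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def rank_candidates(known, candidates):
    """Sort candidates by mean distance to the known embeddings (closest first).

    Both arguments map a sequence name to its embedding vector.
    Returns a list of (mean_distance, name) tuples.
    """
    scored = []
    for name, emb in candidates.items():
        dists = [cosine_distance(emb, ref) for ref in known.values()]
        scored.append((sum(dists) / len(dists), name))
    return sorted(scored)


# Hypothetical toy embeddings; real model embeddings have far more dimensions.
known = {
    "known_A": [1.0, 0.0, 0.2],
    "known_B": [0.9, 0.1, 0.3],
}
candidates = {
    "orf_1": [0.95, 0.05, 0.25],  # points in nearly the same direction as the known set
    "orf_2": [0.0, 1.0, 0.0],     # points in a very different direction
}

ranking = rank_candidates(known, candidates)
for mean_dist, name in ranking:
    print(f"{name}\t{mean_dist:.3f}")
```

In this toy setup, a candidate whose embedding lies close to the known set ranks first regardless of how similar its amino acid sequence is, which is the property the abstract reports for class II microcins.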

Details

Language :
English
ISSN :
2379-5077
Volume :
9
Issue :
10
Database :
MEDLINE
Journal :
MSystems
Publication Type :
Academic Journal
Accession number :
39291976
Full Text :
https://doi.org/10.1128/msystems.01044-24