Sensitive clustering of protein sequences at tree-of-life scale using DIAMOND DeepClust

Authors :
Benjamin Buchfink
Haim Ashkenazy
Klaus Reuter
John A. Kennedy
Hajk-Georg Drost
Publication Year :
2023
Publisher :
Cold Spring Harbor Laboratory, 2023.

Abstract

The biosphere genomics era is transforming life science research, but existing methods struggle to efficiently reduce the vast dimensionality of the protein universe. We present DIAMOND DeepClust, an ultra-fast cascaded clustering method optimized to cluster the 19 billion protein sequences currently defining the protein biosphere. As a result, we detect 1.7 billion clusters of which 32% hold more than one sequence. This means that 544 million clusters represent 94% of all known proteins, illustrating that clustering across the tree of life can significantly accelerate comparative studies in the Earth BioGenome era.
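The figures in the abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only the rounded numbers quoted above (19 billion sequences, 1.7 billion clusters, 32% with more than one sequence); the variable names are illustrative and do not come from the paper. It relies on one observation: a cluster that does not hold more than one sequence holds exactly one, so every remaining sequence must sit in a multi-sequence cluster.

```python
# Rounded figures as quoted in the abstract.
total_sequences = 19_000_000_000
total_clusters = 1_700_000_000
multi_member_pct = 32  # percent of clusters holding more than one sequence

# Clusters with more than one sequence: 32% of 1.7 billion.
multi_member_clusters = total_clusters * multi_member_pct // 100

# Every other cluster is a singleton and so contributes exactly one sequence.
singleton_clusters = total_clusters - multi_member_clusters

# Sequences that are not singletons must belong to multi-sequence clusters.
sequences_in_multi = total_sequences - singleton_clusters
coverage = sequences_in_multi / total_sequences

print(f"multi-sequence clusters: {multi_member_clusters:,}")
print(f"share of sequences covered: {coverage:.1%}")
```

With these rounded inputs the result is 544 million multi-sequence clusters covering roughly 94% of the sequences, which matches the abstract.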

Details

Database :
OpenAIRE
Accession number :
edsair.doi...........2c170cba8695d29989c657f39b4b6e65