1. Protein family annotation for the Unified Human Gastrointestinal Proteome by DPCfam clustering
- Author
- Federico Barone, Elena Tea Russo, Edith Natalia Villegas Garcia, Marco Punta, Stefano Cozzini, Alessio Ansuini, and Alberto Cazzaniga
- Abstract
Technological advances in massively parallel sequencing have led to an exponential growth in the number of known protein sequences. Much of this growth originates from metagenomic projects producing new sequences from environmental and clinical samples. The Unified Human Gastrointestinal Proteome (UHGP) catalogue is one of the most relevant metagenomic datasets with applications ranging from medicine to biology. However, the lack of sequence annotation impairs its usability. This work aims to produce a family classification of UHGP sequences to facilitate downstream structural and functional annotation. This is achieved through the release of the DPCfam-UHGP50 dataset containing 10,778 putative protein families generated using DPCfam clustering, an unsupervised pipeline grouping sequences into multi-domain architectures. DPCfam-UHGP50 considerably improves family coverage at protein and residue levels compared to the manually curated repository Pfam. It is our hope that DPCfam-UHGP50 will foster future discoveries in the field of metagenomics of the human gut by the release of a FAIR-compliant database easily accessible via a searchable web server and Zenodo repository.
- Published
- 2023