1. PhANNs, a fast and accurate tool and web server to classify phage structural proteins
- Author
Vito Adrian Cantu, Anca M. Segall, David Salamon, Jackson Redfield, Robert Edwards, Victor Seguritan, and Peter Salamon
- Subjects
0301 basic medicine, Computer science, viruses, medicine.medical_treatment, computer.software_genre, Biochemistry, Genome, Viral Packaging, Machine Learning, Bacteriophage, Database and Informatics Methods, Protein Structure Databases, Macromolecular Structure Analysis, Bacteriophages, ORFS, Biology (General), Databases, Protein, Viral Genomics, 0303 health sciences, Ecology, biology, Genomics, Computational Theory and Mathematics, Modeling and Simulation, Viruses, Structural Proteins, Research Article, Computer and Information Sciences, Protein Structure, Web server, Phage therapy, QH301-705.5, 030106 microbiology, Microbial Genomics, Computational biology, Research and Analysis Methods, Microbiology, Cellular and Molecular Neuroscience, 03 medical and health sciences, Artificial Intelligence, Virology, Genetics, medicine, Cluster analysis, Molecular Biology, Gene, Ecology, Evolution, Behavior and Systematics, Artificial Neural Networks, 030304 developmental biology, Viral Structural Proteins, Computational Neuroscience, Internet, 030306 microbiology, Organisms, Reproducibility of Results, Biology and Life Sciences, Computational Biology, Proteins, Protein superfamily, biology.organism_classification, Viral Replication, 030104 developmental biology, Biological Databases, Metagenomics, Neural Networks, Computer, computer, Neuroscience
- Abstract
For any given bacteriophage genome or phage-derived sequences in metagenomic data sets, we are unable to assign a function to 50–90% of genes, or more. Structural protein-encoding genes constitute a large fraction of the average phage genome and are among the most divergent and difficult-to-identify genes using homology-based methods. To understand the functions encoded by phages and their contributions to their environments, and to help gauge their utility as potential phage therapy agents, we have developed a new approach to classify phage ORFs into ten major classes of structural proteins or into an “other” category. The resulting tool is named PhANNs (Phage Artificial Neural Networks). We built a database of 538,213 manually curated phage protein sequences that we split into eleven subsets (ten for cross-validation, one for testing) using a novel clustering method that ensures there are no homologous proteins between sets yet maintains the maximum sequence diversity for training. An Artificial Neural Network ensemble trained on features extracted from those sets reached a test F1-score of 0.875 and test accuracy of 86.2%. PhANNs can rapidly classify proteins into one of the ten structural classes or, if not predicted to fall in one of the ten classes, as “other,” providing a new approach for functional annotation of phage proteins. PhANNs is open source and can be run from our web server or installed locally.

Author summary

Bacteriophages (phages, viruses that infect bacteria) are the most abundant biological entities on Earth. They outnumber bacteria by a factor of ten. As phages are very different from each other and from bacteria, and we have relatively few phage genes in our database compared to bacterial genes, we are unable to assign function to 50–90% of phage genes. In this work, we developed PhANNs, a machine learning tool that can classify a phage gene into one of ten structural roles, or “other”. This approach does not require a similar gene to be known.
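
The abstract describes splitting the sequences into eleven subsets such that no homologous proteins are shared between sets. The record does not describe the authors' clustering method, so the following is only a minimal sketch of the general idea, assuming each protein already carries a cluster label from some homology clustering step. The function name `split_by_cluster`, the `cluster_of` mapping, and the toy identifiers are hypothetical and are not taken from PhANNs.

```python
from collections import defaultdict


def split_by_cluster(cluster_of, n_subsets=11):
    """Assign proteins to subsets so that no cluster spans two subsets.

    cluster_of: dict mapping protein id -> cluster id (assumed to come from
    a separate homology clustering step; hypothetical input format).
    """
    # Group protein ids by their cluster.
    clusters = defaultdict(list)
    for protein_id, cluster_id in cluster_of.items():
        clusters[cluster_id].append(protein_id)

    subsets = [[] for _ in range(n_subsets)]
    # Greedy balancing: place the largest clusters first, each one whole,
    # into whichever subset is currently smallest.
    for cluster_id in sorted(clusters, key=lambda c: (-len(clusters[c]), c)):
        smallest = min(range(n_subsets), key=lambda i: len(subsets[i]))
        subsets[smallest].extend(clusters[cluster_id])
    return subsets


# Toy data: 100 made-up proteins spread over 30 made-up clusters.
cluster_of = {f"prot{i}": f"clu{i % 30}" for i in range(100)}
subsets = split_by_cluster(cluster_of)
```

In this sketch, ten of the returned subsets could serve as cross-validation folds and the remaining one as the held-out test set. Because clusters are never divided, any two proteins from the same cluster always land in the same subset.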
- Published
- 2020