1. Application of Hidden Markov Model based methods for gaining insights into protein domain evolution and function
- Author
- Upadhyay, Amit Anil
- Subjects
- Signal transduction, sensory domains, extracellular, Cache, PAS, PDC, GH48, Bioinformatics, Computational Biology, Genomics
- Abstract
With the explosion in the amount of available sequence data, computational methods have become indispensable for studying proteins. Domains are the fundamental structural, functional and evolutionary units that make up proteins, so studying them is an important part of understanding protein function and evolution. Hidden Markov Models (HMMs) are among the most successful methods applied to protein sequence and structure analysis. In this study, HMM-based methods were applied to study the evolution of sensory domains in microbial signal transduction systems, as well as to functionally characterize and identify cellulases in metagenomic datasets. Use of HMM domain models revealed ambiguity in the sequence- and structure-based definitions of the Cache domain family. Cache domains are extracellular sensory domains that are present in microbial signal transduction proteins and eukaryotic voltage-gated calcium channels. The ambiguity in domain definitions was resolved, and more accurate HMMs were built that detected more than 50,000 new members. It was discovered that Cache domains constitute the largest family of extracellular sensory domains in prokaryotes. Cache domains were also found to be remotely homologous to PAS domains at the level of sequence, a relationship previously suggested purely on the basis of structural comparisons. We used HMM-HMM comparisons to study the diversity of extracellular sensory domains in prokaryotic signal transduction systems. This approach allowed annotation of more than 46,000 sequences and reduced the percentage of unknown domains from 64% to 15%. New relationships were also discovered between domain families that were otherwise thought to be unrelated. Finally, HMMs were used to retrieve Family 48 glycoside hydrolases (GH48) from sequence databases. Analysis of these sequences enabled the identification of distinguishing features of cellulases.
These features were used to identify GH48 cellulases in metagenomic datasets. In summary, HMM-based methods enabled domain identification, remote homology detection and functional characterization of protein domains.
- Published
- 2015