1. Structural Alphabets for Protein Structure Classification: A Comparison Study
- Author
- Quan Le, Gianluca Pollastri, and Patrice Koehl
- Subjects
Models, Molecular, Computer science, Sequence analysis, Molecular Sequence Data, Structural alignment, Sequence alignment, Bioinformatics, Protein Structure, Secondary, Article, Sequence Analysis, Protein, Structural Biology, Amino Acid Sequence, Loop modeling, Databases, Protein, Molecular Biology, Alignment-free sequence analysis, Multiple sequence alignment, Sequence Homology, Amino Acid, Proteins, Pattern recognition, Structural Classification of Proteins database, Markov Chains, Sequence logo, ROC Curve, Structural Homology, Protein, Artificial intelligence
- Abstract
Finding structural similarities between proteins often helps reveal shared functionality that might otherwise not be detected from native sequence information alone. Such similarity is usually detected and quantified by protein structure alignment. Determining the optimal alignment between two protein structures, however, remains a hard problem. An alternative approach is to approximate each three-dimensional protein structure using a sequence of motifs derived from a structural alphabet. Using this approach, structure comparison is performed by comparing the corresponding motif sequences, or structural sequences. In this article, we measure the performance of such alphabets in the context of the protein structure classification problem. We consider both local and global structural sequences. Each letter of a local structural sequence corresponds to the fragment that best matches the corresponding local segment of the protein structure. The global structural sequence is designed to generate the best possible complete chain that matches the full protein structure. We use an alphabet of 20 letters, corresponding to a library of 20 motifs or protein fragments having four residues. We show that the global structural sequences approximate the native structures of proteins well, with an average coordinate root mean square deviation of 0.69 Å over 2225 test proteins. The approximation is best for all-alpha proteins and relatively poorer for all-beta proteins. We then test the performance of four different sequence representations of proteins (their native sequence, the sequence of their secondary-structure elements, and the local and global structural sequences based on our fragment library) with different classifiers in their ability to classify proteins that belong to five distinct folds of CATH. Unsurprisingly, the primary sequence alone performs poorly as a structure classifier. We show that addition of either secondary-structure information or local information from the structural sequence considerably improves the classification accuracy. The two fragment-based sequences perform better than the secondary-structure sequence but not well enough at this stage to be a viable alternative to more computationally intensive methods based on protein structure alignment.
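The idea of a local structural sequence described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the two-fragment toy library, and the letter labels are hypothetical, and the match score here is a distance RMSD over pairwise C-alpha distances, swapped in for a coordinate RMSD after superposition so that the example needs only the standard library. Unlike a coordinate RMSD, a distance RMSD cannot distinguish a fragment from its mirror image.

```python
import math
from itertools import combinations


def pairwise_distances(points):
    """All pairwise Euclidean distances between 3D points, in a fixed order."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]


def drmsd(frag_a, frag_b):
    """Distance RMSD between two fragments with the same number of points.

    Assumption: used here as a superposition-free stand-in for coordinate RMSD.
    """
    da = pairwise_distances(frag_a)
    db = pairwise_distances(frag_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(da, db)) / len(da))


def local_structural_sequence(ca_coords, library, labels, frag_len=4):
    """Assign to each window of frag_len residues the letter of the closest
    library fragment (hypothetical helper, one letter per window)."""
    letters = []
    for i in range(len(ca_coords) - frag_len + 1):
        window = ca_coords[i:i + frag_len]
        best = min(range(len(library)), key=lambda k: drmsd(window, library[k]))
        letters.append(labels[best])
    return "".join(letters)


# Toy two-fragment library (hypothetical; the paper uses 20 four-residue fragments).
extended = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
compact = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]

# A straight six-residue chain along the y axis, spaced 3.8 Å apart.
chain = [(0.0, 3.8 * i, 0.0) for i in range(6)]

sequence = local_structural_sequence(chain, [extended, compact], "EC")
print(sequence)
```

A chain of n residues yields n - 3 letters with four-residue fragments, since each letter describes one overlapping window. The global structural sequence mentioned in the abstract differs in that letters are chosen so the rebuilt complete chain matches the whole structure, which this sketch does not attempt.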
- Published
- 2009