1. The Structure of Evolutionary Model Space for Proteins across the Tree of Life.
- Author
- Scolaro, Gabrielle E. and Braun, Edward L.
- Subjects
- *EVOLUTIONARY models, *PROTEIN models, *AMINO acids, *ENVIRONMENTAL history, *AMINO acid sequence, *RIBOSOMAL proteins, *AMINO acid residues, *RIBOSOMAL DNA
- Abstract
Simple Summary: The relative rates of amino acid substitution over evolutionary time reflect the chemical properties of amino acids. Substitutions that result in an amino acid similar to an ancestral residue accumulate more rapidly than those resulting in a dissimilar amino acid. The substitution rates for each amino acid pair are the parameters in models of evolutionary change for proteins. Although the best-fitting model of protein evolution is known to differ among taxa, a comprehensive picture of model changes across the tree of life is not available. In principle, models of protein change might reflect evolutionary history (i.e., closely related taxa have similar models) or the environment (i.e., taxa living in similar environments have similar models). We estimated models of amino acid evolution for organisms across the tree of life, finding evidence that history and the environment have both contributed to model differences. Bacterial models differed from archaeal and eukaryotic models. Models for Halobacteriaceae (archaea that live in highly saline environments) and Thermoprotei (a group of thermophilic archaea) were found to be very distinctive. The rates of substitution for pairs of aromatic amino acids were especially variable. Overall, these results paint a picture of the "evolutionary model space" for proteins across the tree of life.

The factors that determine the relative rates of amino acid substitution during protein evolution are complex and known to vary among taxa. We estimated relative exchangeabilities for pairs of amino acids from clades spread across the tree of life and assessed the historical signal in the distances among these clade-specific models. We separately trained these models on collections of arbitrarily selected protein alignments and on ribosomal protein alignments. In both cases, we found a clear separation between the models trained using multiple sequence alignments from bacterial clades and the models trained on archaeal and eukaryotic data. We assessed the predictive power of our novel clade-specific models of sequence evolution by asking whether fit to the models could be used to identify the source of multiple sequence alignments. Model fit was generally able to correctly classify protein alignments at the level of domain (bacterial versus archaeal), but the accuracy of classification at finer scales was much lower. The only exceptions to this were the relatively high classification accuracy for two archaeal lineages: Halobacteriaceae and Thermoprotei. Genomic GC content had a modest impact on relative exchangeabilities despite having a large impact on amino acid frequencies. Relative exchangeabilities involving aromatic residues exhibited the largest differences among models. There were a small number of exchangeabilities that exhibited large differences in comparisons among major clades and between generalized models and ribosomal protein models. Taken as a whole, these results reveal that a small number of relative exchangeabilities are responsible for much of the structure of the "model space" for protein sequence evolution. The clade-specific models we generated may be useful tools for protein phylogenetics, and the structure of evolutionary model space that they revealed has implications for phylogenomic inference across the tree of life. [ABSTRACT FROM AUTHOR]
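Illustrative note (not part of the authors' abstract): the abstract treats relative exchangeabilities and amino acid frequencies as separate components of a model. A minimal sketch of how the two combine, assuming the standard time-reversible formulation q_ij = r_ij * pi_j for i != j, is given below. The three-state alphabet, the numeric values, and the function names are hypothetical, and the Euclidean distance over mean-scaled exchangeabilities is only one plausible way to compare two models; it is an assumption, not necessarily the measure used in the study.

```python
import math

def rate_matrix(exch, freqs):
    """Build a reversible rate matrix from symmetric exchangeabilities
    (exch) and equilibrium frequencies (freqs): q_ij = r_ij * pi_j."""
    n = len(freqs)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q[i][j] = exch[i][j] * freqs[j]
        # Diagonal makes each row sum to zero (it is still 0.0 when summed).
        q[i][i] = -sum(q[i])
    # Rescale so the expected substitution rate at equilibrium is 1.
    scale = -sum(freqs[i] * q[i][i] for i in range(n))
    return [[v / scale for v in row] for row in q]

def upper_triangle(exch):
    """Flatten the off-diagonal upper triangle of a symmetric matrix."""
    n = len(exch)
    return [exch[i][j] for i in range(n) for j in range(i + 1, n)]

def model_distance(exch_a, exch_b):
    """Hypothetical distance: Euclidean distance between exchangeability
    vectors after scaling each to mean 1 (removes arbitrary rate scaling)."""
    a, b = upper_triangle(exch_a), upper_triangle(exch_b)
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return math.sqrt(sum((x / mean_a - y / mean_b) ** 2 for x, y in zip(a, b)))

# Toy three-state example (real protein models use 20 states).
exch_a = [[0, 1, 2], [1, 0, 4], [2, 4, 0]]
exch_b = [[0, 1, 2], [1, 0, 8], [2, 8, 0]]  # one pair exchanges faster
freqs = [0.5, 0.3, 0.2]

q_a = rate_matrix(exch_a, freqs)
d_ab = model_distance(exch_a, exch_b)
```

Because the frequencies enter only when the rate matrix is assembled, two models can share exchangeabilities while differing in composition, which is consistent with the abstract's observation that GC content affected frequencies more than exchangeabilities.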
- Published
- 2023