1. A counting renaissance: combining stochastic mapping and empirical Bayes to quickly detect amino acid sites under positive selection
- Author
Filip Bielejec, Vladimir N. Minin, Philippe Lemey, Sergei L. Kosakovsky Pond, and Marc A. Suchard
- Subjects
0106 biological sciences, Statistics and Probability, Posterior probability, Scale (descriptive set theory), Biology, 010603 evolutionary biology, 01 natural sciences, Biochemistry, Evolution, Molecular, Set (abstract data type), 03 medical and health sciences, Bayes' theorem, Statistics, Code (cryptography), Amino Acids, Selection, Genetic, Codon, Molecular Biology, Phylogeny, Selection (genetic algorithm), 030304 developmental biology, Estimation, Stochastic Processes, 0303 health sciences, Models, Genetic, Stochastic process, Bayes Theorem, Original Papers, Computer Science Applications, Computational Mathematics, Amino Acid Substitution, Computational Theory and Mathematics, Viruses, Data mining, Sequence Alignment
- Abstract
Motivation: Statistical methods for comparing relative rates of synonymous and non-synonymous substitutions maintain a central role in detecting positive selection. To identify selection, researchers often estimate the ratio of these relative rates (dN/dS) at individual alignment sites. Fitting a codon substitution model that captures heterogeneity in dN/dS across sites provides a reliable way to perform such estimation, but it remains computationally prohibitive for massive datasets. By using crude estimates of the numbers of synonymous and non-synonymous substitutions at each site, counting approaches scale well to large datasets, but they fail to account for ancestral state reconstruction uncertainty and to provide site-specific dN/dS estimates.
Results: We propose a hybrid solution that borrows the computational strength of counting methods, but augments these methods with empirical Bayes modeling to produce a relatively fast and reliable method capable of estimating site-specific dN/dS values in large datasets. Importantly, our hybrid approach, set in a Bayesian framework, integrates over the posterior distribution of phylogenies and ancestral reconstructions to quantify uncertainty about site-specific dN/dS estimates. Simulations demonstrate that this method competes well with more principled statistical procedures and, in some cases, even outperforms them. We illustrate the utility of our method using human immunodeficiency virus, feline panleukopenia and canine parvovirus evolution examples.
Availability: Renaissance counting is implemented in the development branch of BEAST, freely available at http://code.google.com/p/beast-mcmc/. The method will be made available in the next public release of the package, including support to set up analyses in BEAUti.
Contact: philippe.lemey@rega.kuleuven.be or msuchard@ucla.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
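The general idea of combining per-site substitution counts with empirical Bayes shrinkage can be illustrated with a toy sketch. This is not the BEAST implementation and is not taken from the paper: it assumes a simple Poisson model for the counts at each site with a gamma prior whose parameters are fitted crudely by the method of moments, and it works from a single fixed set of counts rather than integrating over phylogenies and ancestral reconstructions as the abstract describes. The function names (`gamma_moments`, `eb_rates`, `site_ratios`) and the `exposure` parameter are hypothetical.

```python
from statistics import mean, pvariance


def gamma_moments(rates):
    """Crude method-of-moments gamma fit; returns (shape, rate)."""
    m = mean(rates)
    v = pvariance(rates)
    if m <= 0 or v <= 0:
        raise ValueError("need positive mean and variance to fit a gamma prior")
    return m * m / v, m / v


def eb_rates(counts, exposure=1.0):
    """Posterior mean rate per site under an assumed gamma-Poisson model.

    The gamma prior is estimated from all sites (the empirical Bayes step),
    then each site's count is shrunk toward the shared prior mean.
    """
    raw = [c / exposure for c in counts]
    shape, rate = gamma_moments(raw)
    return [(shape + c) / (rate + exposure) for c in counts]


def site_ratios(nonsyn_counts, syn_counts):
    """Site-specific ratio of shrunken non-synonymous to synonymous rates."""
    dn = eb_rates(nonsyn_counts)
    ds = eb_rates(syn_counts)
    return [n / s for n, s in zip(dn, ds)]
```

For example, `site_ratios([5, 0, 1, 0], [0, 2, 1, 3])` gives a finite ratio above 1 at the first site even though its raw synonymous count is zero, which is where a naive count ratio would be undefined. The shrinkage is what keeps sparse per-site counts usable.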
- Published
2012