1. Measuring Phylogenetic Information of Incomplete Sequence Data
- Author
Tae-Kun Seo, Jeffrey L. Thorne, Olivier Gascuel
- Affiliations
Korea Polar Research Institute (KOPRI); Bioinformatique évolutive - Evolutionary Bioinformatics, Institut Pasteur [Paris]-Centre National de la Recherche Scientifique (CNRS); Institut de Systématique, Evolution, Biodiversité (ISYEB), Muséum national d'Histoire naturelle (MNHN)-École pratique des hautes études (EPHE)-Université Paris sciences et lettres (PSL)-Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS)-Université des Antilles (UA); North Carolina State University [Raleigh] (NC State), University of North Carolina System (UNC)
- Funding
This work was supported by the Korea Polar Research Institute [PE21130 and PE21140 to T.-K.S.], PRAIRIE (PaRis Artificial Intelligence Research InstitutE) [ANR-19-P3IA-0001 to O.G.], and N.S.F. [DEB-1754142] and N.I.H. [R01 GM118508] to J.L.T.
- Subjects
0106 biological sciences; [SDV]Life Sciences [q-bio]; Sequence alignment; Biology; 010603 evolutionary biology; 01 natural sciences; insertion; Set (abstract data type); Evolution, Molecular; 03 medical and health sciences; Tree (descriptive set theory); INDEL Mutation; Genetics; deletion; [MATH]Mathematics [math]; Fisher information; Ecology, Evolution, Behavior and Systematics; Phylogeny; 030304 developmental biology; 0303 health sciences; Sequence; Models, Statistical; Phylogenetic tree; Models, Genetic; gaps; Probabilistic logic; Pattern recognition; Data set; model adequacy; indel; goodness-of-fit test; Artificial intelligence; Regular Articles
- Abstract
Widely used approaches for extracting phylogenetic information from aligned sets of molecular sequences rely upon probabilistic models of nucleotide substitution or amino-acid replacement. The phylogenetic information that can be extracted depends on the number of columns in the sequence alignment and will be decreased when the alignment contains gaps due to insertion or deletion events. Motivated by the measurement of information loss, we suggest assessment of the effective sequence length (ESL) of an aligned data set. The ESL can differ from the actual number of columns in a sequence alignment because of the presence of alignment gaps. Furthermore, the estimation of phylogenetic information is affected by model misspecification. Inevitably, the actual process of molecular evolution differs from the probabilistic models employed to describe this process. This disparity means the amount of phylogenetic information in an actual sequence alignment will differ from the amount in a simulated data set of equal size, which motivated us to develop a new test for model adequacy. Via theory and empirical data analysis, we show how to disentangle the effects of gaps and model misspecification. By comparing the Fisher information of actual and simulated sequences, we identify which alignment sites and tree branches are most affected by gaps and model misspecification. [Fisher information; gaps; insertion; deletion; indel; model adequacy; goodness-of-fit test; sequence alignment.]
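As a rough illustration of the idea of an effective sequence length, the following toy sketch compares the Fisher information of a gapped alignment with the per-site information of gap-free data. It is not the authors' definition or implementation. It assumes only two aligned nucleotide sequences, the Jukes-Cantor (JC69) substitution model, a known evolutionary distance `d`, and that any column containing a gap carries no information about `d`. The function names `jc69_site_information` and `toy_effective_length` are hypothetical.

```python
import math


def jc69_site_information(d: float) -> float:
    """Per-site Fisher information about the distance d (expected
    substitutions per site) for one gap-free column of a two-sequence
    alignment under JC69."""
    if d <= 0.0:
        raise ValueError("d must be positive")
    p_prime = math.exp(-4.0 * d / 3.0)   # derivative dp/dd
    p = 0.75 * (1.0 - p_prime)           # P(the two nucleotides differ)
    # Under JC69 the likelihood depends on a column only through
    # "same" versus "different", so this is Bernoulli information.
    return p_prime ** 2 / (p * (1.0 - p))


def toy_effective_length(seq_a: str, seq_b: str, d: float, gap: str = "-") -> float:
    """Total Fisher information of the alignment divided by the
    information of a single gap-free column (toy assumption: gapped
    columns contribute zero information)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    per_site = jc69_site_information(d)
    total = sum(per_site for a, b in zip(seq_a, seq_b) if a != gap and b != gap)
    return total / per_site
```

For example, `toy_effective_length("ACGT-A", "AC-TTA", 0.1)` evaluates, up to floating-point rounding, to 4, because only four of the six columns are gap-free. In this two-sequence toy the ratio reduces to a count of gap-free columns; on a tree with more taxa, a gap in one sequence removes only part of a column's information, which is the situation the abstract's ESL is meant to quantify.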
- Published
- 2021