51. Inferring dated genealogies from modern and ancient genomes
- Author
Wohns, Anthony Wilder Lauritano; McVean, Gil; Kelleher, Jerome; Wong, Yan
- Subjects
Statistics, Genetics
- Abstract
Tree sequences encode the ancestral history of a sample of genomes, reflecting the sum total of what is theoretically knowable about the sample's evolutionary history. However, to realise the promise of this structure, scalable methods that infer time-resolved tree sequences from both modern and ancient samples are needed. In this thesis, I address this challenge and thus realise the power of tree sequences for population genetic analysis. First, I investigate a key assumption of the tree sequence inference algorithm tsinfer: that allele frequency reflects relative allele age. I derive an exact iterative formula for the expected accuracy of frequency in assigning the relative age of pairs of alleles at unlinked loci under the coalescent. Using coalescent simulations approximating human demographic history, I further show that allele frequency correctly orders 91.6% of the mutation pairs that are relevant for topological inference. Second, I introduce new statistics to summarise genetic relationships between individuals and populations. Genealogical Nearest Neighbours quantifies topological relationships in a tree sequence and is applied to population sequencing datasets. The related Genomic Descent statistic quantifies descent from ancient individuals. I use this statistic to reveal links between Neolithic Britons and modern Sardinian individuals, as well as between an Afanasievo family of four and UK Biobank participants from traditionally Celtic regions of Britain and Ireland. Third, I introduce tsdate, an approximate Bayesian method for inferring the age of ancestral haplotypes conditional on a tree sequence topology. I validate the algorithm using simulations and empirical analyses, the latter indicating that 96.3% of tsdate-estimated derived allele ages from Chromosome 20 of the 1000 Genomes Project are consistent with radiocarbon-dated ancient samples. 
Finally, I use an iterative approach combining tsinfer and tsdate to infer the largest time-resolved genealogy constructed to date with 3,719 ancient and modern samples from eight datasets. The genealogy provides estimates of the distribution of time to most recent common ancestor for the 215 populations in the datasets, recovers patterns of descent from Archaics, and suggests that up to 17.5% of variant sites could be the result of more than one ancestral mutation. I also introduce a simple, non-parametric estimator of the spatiotemporal properties of inferred ancestors that recapitulates key events in hominin history. These results demonstrate that tree sequences are a powerful means of synthesising genetic data and provide rich insights into human evolution.
- Published
2021