Authors: Howard M. Cann, Shane A. McCarthy, Chris Tyler-Smith, David Reich, Aylwyn Scally, Qasim Ayub, Pontus Skoglund, Petr Danecek, Jean-François Deleuze, Manjinder S. Sandhu, Pille Hallast, Ruoyun Hui, Yuan Chen, Swapan Mallick, Yali Xue, Sabine Felkel, Anders Bergström, Hélène Blanché, Mohamed A. Almarri, Jack Kamm, Richard Durbin
ORCID: Bergström, Anders [0000-0002-4096-9268]; McCarthy, Shane A [0000-0002-2715-4187]; Hui, Ruoyun [0000-0002-5689-7131]; Almarri, Mohamed A [0000-0003-1255-0918]; Ayub, Qasim [0000-0003-3291-0917]; Danecek, Petr [0000-0002-4159-1666]; Felkel, Sabine [0000-0001-8935-8305]; Hallast, Pille [0000-0002-0588-3987]; Kamm, Jack [0000-0003-2412-756X]; Blanché, Hélène [0000-0003-2115-575X]; Deleuze, Jean-François [0000-0002-5358-4463]; Mallick, Swapan [0000-0002-4531-4439]; Reich, David [0000-0002-7037-5292]; Skoglund, Pontus [0000-0002-3021-5913]; Scally, Aylwyn [0000-0002-0807-1167]; Durbin, Richard [0000-0002-9130-1006]; Tyler-Smith, Chris [0000-0002-6492-5403]
Repository: Apollo - University of Cambridge Repository
INTRODUCTION: Large-scale human genome sequencing studies to date have been limited to large, metropolitan populations or to small numbers of genomes from each group. Much remains to be understood about the extent and structure of genetic variation in our species and how it was shaped by past population separations, admixture, adaptation, size changes, and gene flow from archaic human groups. Larger numbers of genome sequences from more diverse populations are needed to illuminate these questions.

RATIONALE: We sequence 929 genomes from 54 geographically, linguistically and culturally diverse human populations to an average of 35x coverage, and analyze the variation among them. We also physically resolve the haplotype phase of 26 of these genomes using linked-read sequencing.

RESULTS: We identify 67.3 million single-nucleotide polymorphisms (SNPs), 8.8 million small insertions or deletions (indels) and 40,736 copy number variants (CNVs). These include hundreds of thousands of variants that had not been discovered by previous sequencing efforts but that are common in one or more populations. We demonstrate that genome sequences offer benefits over ascertained array genotypes for the study of population relationships, particularly when African populations are involved. Populations in central and southern Africa, the Americas and Oceania each harbour tens to hundreds of thousands of private, common genetic variants. The majority of these variants arose as novel mutations rather than through archaic introgression, except in Oceanian populations, where many private variants derive from Denisovan admixture. While some reach high frequencies, no variants are fixed between major geographical regions. We estimate that the genetic separation between present-day human populations occurred mostly within the last 250,000 years. However, these early separations were gradual in nature and shaped by protracted gene flow. All populations thus still had some genetic contact more recently than this, but there is also evidence that a small fraction of present-day structure might be hundreds of thousands of years older. Most populations expanded in size over the last 10,000 years, but hunter-gatherer groups did not. The low diversity among the Neanderthal haplotypes segregating in present-day populations indicates that, while more than one Neanderthal individual must have contributed genetic material to modern humans, there was likely only one major episode of admixture. In contrast, Denisovan haplotype diversity reflects a more complex history involving more than one episode of admixture. We find small amounts of Neanderthal ancestry in West African genomes, most likely reflecting Eurasian admixture. Despite their very low levels or absence of archaic ancestry, African populations share many Neanderthal and Denisovan variants that are absent from Eurasia, reflecting how a larger proportion of the ancestral human variation has been maintained in Africa.

CONCLUSION: The discovery of substantial amounts of common genetic variation that was previously undocumented, and is geographically restricted, highlights the continued value of anthropologically informed study designs for understanding human diversity. The genome sequences presented here are a freely available resource with relevance to population history, medical genetics, anthropology and linguistics.

[Figure: see text]