Ancestry Inference Using Reference Labeled Clusters of Haplotypes
- Source :
- BMC Bioinformatics, Vol 22, Iss 1, Pp 1-14 (2021)
- Publication Year :
- 2020
- Publisher :
- Cold Spring Harbor Laboratory, 2020.
Abstract
- We present ARCHes, a fast and accurate haplotype-based approach for inferring an individual’s ancestry composition. Our approach works by modeling haplotype diversity from a large, admixed cohort of hundreds of thousands of individuals, then annotating those models with population information from reference panels of known ancestry. The running time of ARCHes does not depend on the size of a reference panel because training and testing are separate processes, and the inferred population-annotated haplotype models can be written to disk and reused to label large test sets in parallel (in our experiments, it takes on average less than one minute to assign ancestry from 32 populations to 1,001 sections of a genotype using 10 CPUs). We test ARCHes on public data from the 1000 Genomes Project and HGDP as well as simulated examples of known admixture. Our results demonstrate that ARCHes outperforms RFMix at correctly assigning both global and local ancestry at finer population scales regardless of the amount of population admixture.
Author Summary
- Human DNA is inherited from ancestors who come from different populations across the globe and across time. Being able to identify which of those populations make up an individual’s DNA, how much they contribute, and on which chromosomes, is currently an important open research problem with many applications in the study of human diversity and history. As DNA sequencing and genotyping technology has developed, ever larger amounts of data have become available. This allows for the development of new, sophisticated machine learning methods to approach this problem, and it creates a need to process large amounts of data efficiently. These methods learn from examples of DNA data from known populations, and must be robust to differences in size and diversity among those reference populations.
- We present a new approach to this problem called ARCHes (Ancestry inference using Reference labeled Clusters of Haplotypes), which models the global diversity of small segments of human DNA sequence (“haplotypes”) and the extent to which these haplotypes are associated with each of a set of population reference panels. It then computes the most likely population assignments and the points along the genome where the populations change. Our experiments show that ARCHes has superior accuracy compared to a state-of-the-art method in identifying source populations and their locations on the genome, regardless of the number of different populations present in the genome or how closely related those populations are. ARCHes can also model populations for which only a small amount of reference DNA data is available.
- Subjects :
- QH301-705.5
Computer science
Computer applications to medicine. Medical informatics
Population
R858-859.7
Inference
Polymorphism, Single Nucleotide
RFMix
Biochemistry
Structural Biology
Humans
Biology (General)
HMM
1000 Genomes Project
education
Molecular Biology
education.field_of_study
Genome, Human
Ancestry inference
Methodology Article
Applied Mathematics
Haplotype
Local ancestry
ARCHes
Computer Science Applications
Running time
Genetics, Population
Haplotypes
Evolutionary biology
Human genome
Haplotype modeling
Details
- Database :
- OpenAIRE
- Journal :
- BMC Bioinformatics, Vol 22, Iss 1, Pp 1-14 (2021)
- Accession number :
- edsair.doi.dedup.....bafdc810d973bc5c3a88aac5dbe1ab98