EnsembleCNV: An ensemble machine learning algorithm to identify and genotype copy number variation using SNP array data
- Source :
- Nucleic Acids Research
- Publication Year :
- 2018
- Publisher :
- Cold Spring Harbor Laboratory, 2018.
Abstract
- The associations between diseases/traits and copy number variants (CNVs) have not been systematically investigated in genome-wide association studies (GWASs), primarily due to a lack of robust and accurate tools for CNV genotyping. Herein, we propose a novel ensemble learning framework, ensembleCNV, to detect and genotype CNVs using single nucleotide polymorphism (SNP) array data. EnsembleCNV a) identifies and eliminates batch effects at the raw data level; b) assembles individual CNV calls from multiple existing callers with complementary strengths into CNV regions (CNVRs) using a heuristic algorithm; c) re-genotypes each CNVR with a local likelihood model adjusted by global information across multiple CNVRs; d) refines CNVR boundaries using the local correlation structure in copy number intensities; e) provides direct CNV genotyping accompanied by a confidence score, directly accessible for downstream quality control and association analysis. Benchmarked on two large datasets, ensembleCNV outperformed competing methods and achieved a high call rate (93.3%) and reproducibility (98.6%), while concurrently achieving high sensitivity by capturing 85% of common CNVs documented in the 1000 Genomes Project. Given this CNV call rate and accuracy, which are comparable to those of SNP genotyping, we suggest ensembleCNV holds significant promise for performing genome-wide CNV association studies and investigating how CNVs predispose to human diseases.
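The abstract describes step (b), assembling calls from several callers into CNVRs, only at a high level. The sketch below is a rough illustration of that general idea and is not the ensembleCNV implementation: it greedily merges sorted calls whose reciprocal overlap with the current region meets a threshold. The names `CNVCall`, `CNVR`, `reciprocal_overlap`, `assemble_cnvrs`, and the `min_overlap=0.5` default are all assumptions introduced for this example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CNVCall:
    """One CNV call from one caller for one sample (hypothetical record type)."""
    sample: str
    caller: str
    chrom: str
    start: int  # inclusive
    end: int    # inclusive


@dataclass
class CNVR:
    """A CNV region built from the calls merged into it (hypothetical type)."""
    chrom: str
    start: int
    end: int
    calls: list = field(default_factory=list)


def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Return the smaller of the two overlap fractions, or 0.0 if disjoint."""
    overlap = min(a_end, b_end) - max(a_start, b_start) + 1
    if overlap <= 0:
        return 0.0
    return min(overlap / (a_end - a_start + 1), overlap / (b_end - b_start + 1))


def assemble_cnvrs(calls, min_overlap=0.5):
    """Greedily merge position-sorted calls into regions.

    Assumption: a call joins the most recent region when both are on the
    same chromosome and their reciprocal overlap is at least `min_overlap`.
    """
    regions = []
    for call in sorted(calls, key=lambda c: (c.chrom, c.start, c.end)):
        last = regions[-1] if regions else None
        if (
            last is not None
            and last.chrom == call.chrom
            and reciprocal_overlap(last.start, last.end, call.start, call.end)
            >= min_overlap
        ):
            # Extend the region to cover the newly merged call.
            last.start = min(last.start, call.start)
            last.end = max(last.end, call.end)
            last.calls.append(call)
        else:
            regions.append(CNVR(call.chrom, call.start, call.end, [call]))
    return regions
```

Counting the distinct `caller` values inside each resulting region, for example with `{c.caller for c in region.calls}`, gives a simple indication of how many callers support it; the method summarized in the abstract goes further by re-genotyping each CNVR and refining its boundaries.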
- Subjects :
- Quality Control
DNA Copy Number Variations
Genotyping Techniques
Computer science
Datasets as Topic
Genome-wide association study
Single-nucleotide polymorphism
Computational biology
Biology
Polymorphism, Single Nucleotide
01 natural sciences
Machine Learning
010104 statistics & probability
03 medical and health sciences
0302 clinical medicine
Genetics
Humans
Copy-number variation
0101 mathematics
1000 Genomes Project
Genotyping
030304 developmental biology
Genetic association
0303 health sciences
Genome, Human
Ensemble learning
SNP genotyping
Methods Online
030217 neurology & neurosurgery
SNP array
Details
- Language :
- English
- Database :
- OpenAIRE
- Journal :
- Nucleic Acids Research
- Accession number :
- edsair.doi.dedup.....e011a94888bef7ee802f3bba21817989
- Full Text :
- https://doi.org/10.1101/356667