Using genotype array data to compare multi- and single-sample variant calls and improve variant call sets from deep coverage whole-genome sequencing data
- Source :
- Bioinformatics
- Publication Year :
- 2016
- Publisher :
- Oxford University Press, 2016.
Abstract
- Motivation: Variant calling from next-generation sequencing (NGS) data is susceptible to false positive calls due to sequencing, mapping and other errors. To better distinguish true from false positive calls, we present a method that uses genotype array data from the sequenced samples, rather than public data such as HapMap or dbSNP, to train an accurate classifier using Random Forests. We demonstrate our method on a set of variant calls obtained from 642 African-ancestry genomes from the Consortium on Asthma among African-ancestry Populations in the Americas (CAAPA), sequenced to high depth (30X).
- Results: We have applied our classifier to compare call sets generated with different calling methods, including both single-sample and multi-sample callers. At a false positive rate of 5%, our method determines true positive rates of 97.5%, 95% and 99% on variant calls obtained using Illumina's single-sample caller CASAVA, Real Time Genomics' multi-sample variant caller, and the GATK UnifiedGenotyper, respectively. Since NGS data may be accompanied by genotype data for the same samples, either collected concurrently with sequencing or from a previous study, our method can be trained on each dataset to provide a more accurate computational validation of site calls compared to generic methods. Moreover, our method allows for adjustment based on allele frequency (e.g. a different set of criteria to determine quality for rare versus common variants) and thereby provides insight into sequencing characteristics that indicate call quality for variants of different frequencies.
- Availability and Implementation: Code is available on GitHub at: https://github.com/suyashss/variant_validation
- Supplementary information: Supplementary data are available at Bioinformatics online.
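The abstract reports true positive rates at a fixed false positive rate of 5%. As a rough illustration of how such a figure can be read off classifier scores, the sketch below ranks calls by score and returns the highest true positive rate reachable without exceeding the allowed false positive rate. The function name `tpr_at_fpr`, the toy scores and the labels are all hypothetical; this is not the authors' code, which is available at the GitHub link above.

```python
def tpr_at_fpr(scores, labels, max_fpr=0.05):
    """Highest true positive rate with false positive rate <= max_fpr.

    scores: classifier scores, higher meaning "more likely a true variant".
    labels: True where an independent truth source (assumed here to be the
            genotype array) confirms the call, False otherwise.
    """
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")

    # Rank calls from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)

    best_tpr = 0.0
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Accept all calls sharing the current score as one threshold step.
        threshold = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        if fp / n_neg > max_fpr:
            break  # lowering the threshold further only adds false positives
        best_tpr = tp / n_pos
    return best_tpr


# Toy data (made up for illustration only).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
toy_labels = [True, True, False, True, False, False]
strict = tpr_at_fpr(toy_scores, toy_labels, max_fpr=0.0)
relaxed = tpr_at_fpr(toy_scores, toy_labels, max_fpr=0.34)
```

With the toy data, allowing no false positives keeps only the two top-ranked calls, while tolerating one false positive out of three negatives recovers all three true calls.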
- Subjects :
- 0301 basic medicine
Statistics and Probability
dbSNP
Genotype
Genotyping Techniques
Computer science
Genomics
computer.software_genre
Biochemistry
Genome
Polymorphism, Single Nucleotide
03 medical and health sciences
Humans
International HapMap Project
Molecular Biology
Allele frequency
Whole genome sequencing
Massive parallel sequencing
Whole Genome Sequencing
Genome, Human
High-Throughput Nucleotide Sequencing
Original Papers
Computer Science Applications
Data Accuracy
Computational Mathematics
030104 developmental biology
Computational Theory and Mathematics
Human genome
Data mining
computer
Sequence Analysis
Details
- Language :
- English
- ISSN :
- 1367-4811 and 1367-4803
- Volume :
- 33
- Issue :
- 8
- Database :
- OpenAIRE
- Journal :
- Bioinformatics
- Accession number :
- edsair.doi.dedup.....885417f3ac07e9b30d7f254fdd129c03