1. Exploiting whole-exome sequencing data to identify copy number variants
- Author
- D'Aurizio R, Tattini L, Pippucci T, Pellegrini M, and Magi A
- Subjects
Computational Biology, Next Generation Sequencing
- Abstract
Whole-exome sequencing (WES) is a cost-effective and widely used method in which protein-coding regions are enriched by hybridizing genomic DNA to oligonucleotide probes (baits) and then sequenced with high-throughput technology. WES is a standard approach for detecting single nucleotide variants (SNVs) and small insertion-deletion (indel) variants. Recently, new tools have been developed to identify larger genomic rearrangements such as copy number variants (CNVs). CNVs are structural variants larger than 50 bp that have been demonstrated to be one of the main sources of genomic variation in humans [1000 Genomes Project Consortium, Abecasis, Nature 2010] and a cause of various diseases, including cancer and cardiovascular disease. A few computational tools have been developed to identify CNVs in targeted regions from WES data using a read count (RC) approach, but none of them allows the identification of CNVs in intergenic regions. We present a new tool able to infer CNVs in both coding and non-coding regions from WES data. We studied the distribution of reads generated by the WES approach in coding and non-coding regions of the genome. To this end, we used a survey dataset of 30 WES samples that includes data generated by Clark et al. and by our labs using all three enrichment kits. By exploiting this dataset, we evaluated the properties of the RC measure (the number of reads that map to a specific region of the genome) in two distinct classes of genomic features: (i) all Consensus Coding Sequence (CCDS) regions and (ii) non-overlapping genomic windows of different sizes that belong to intergenic regions of the genome. We studied the relationship between RC data and classical systematic genomic biases, such as GC-content and mappability. As a further step, we evaluated the capability of RC data to predict the DNA copy number of coding and non-coding regions. Finally, we developed a computational framework for exploiting WES data to identify the genomic regions involved in copy number variation and to predict their copy number.
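The RC measure over non-overlapping windows described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `window_read_counts`, the use of read start positions as the mapping coordinate, and the half-open region convention are all assumptions made for illustration.

```python
def window_read_counts(read_starts, region_start, region_end, window_size):
    """Count reads per non-overlapping window of a half-open region.

    Hypothetical helper: a read is assigned to the window that contains
    its start position; reads outside [region_start, region_end) are ignored.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    # Number of windows needed to tile the region (the last one may be shorter).
    n_windows = (region_end - region_start + window_size - 1) // window_size
    counts = [0] * n_windows
    for pos in read_starts:
        if region_start <= pos < region_end:
            counts[(pos - region_start) // window_size] += 1
    return counts


# Toy example: seven read starts on a 40 bp region tiled with 10 bp windows.
rc = window_read_counts([0, 5, 10, 19, 20, 35, 40], 0, 40, 10)
```

In practice the window size for intergenic regions would be much larger than in this toy example, since out-target coverage is sparse; the abstract states only that windows of different sizes were evaluated.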
We found that the three enrichment kits obtain different results in terms of the percentage of reads that unambiguously map to out-target regions: 38% of reads for TruSeq, 20-50% for SeqCap and 28-50% for SureSelect. As expected, we found that both in-target and out-target RC are affected by traditional biases (GC-content and mappability) and that these biases can be well mitigated by traditional RC normalization methods. By using WES data of individuals sequenced by the 1000 Genomes Project Consortium and previously genotyped by Conrad et al. [Nature 2010], we studied the correlation between RC data and copy number states. We found that RC data are well correlated with CNV state for both coding and non-coding regions, thus demonstrating the capability of our hybrid approach (RC data on coding and non-coding regions) to identify CNVs and predict the correct DNA copy number across the genome. Finally, we extended our previously developed pipeline (EXCAVATOR, Magi, Genome Biol 2013) to this novel framework. We applied our novel computational pipeline to the analysis of a cancer dataset and a population dataset and showed its capability to recognize exomic as well as genomic regions involved in CNV.
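As an illustration of the kind of RC normalization mentioned above, the following Python sketch applies a median-based GC-content correction, in which each window's read count is rescaled by the ratio of the overall median to the median of windows with similar GC-content. The abstract does not specify which normalization was used, so this particular scheme, the function name `gc_median_normalize`, and the 5% GC bin width are assumptions.

```python
from collections import defaultdict
from statistics import median


def gc_median_normalize(read_counts, gc_fractions, bin_percent=5):
    """Median-based GC correction (illustrative, not the authors' code).

    Each read count is multiplied by overall_median / gc_bin_median, where
    gc_bin_median is the median count of windows in the same GC bin.
    """
    if len(read_counts) != len(gc_fractions):
        raise ValueError("read_counts and gc_fractions must have equal length")
    # Assign each window to a GC bin of width bin_percent (assumed 5%).
    bins = [int(round(gc * 100)) // bin_percent for gc in gc_fractions]
    by_bin = defaultdict(list)
    for rc, b in zip(read_counts, bins):
        by_bin[b].append(rc)
    overall = median(read_counts)
    bin_median = {b: median(values) for b, values in by_bin.items()}
    normalized = []
    for rc, b in zip(read_counts, bins):
        m = bin_median[b]
        # A bin whose median is zero cannot be rescaled; mark it as missing.
        normalized.append(rc * overall / m if m > 0 else float("nan"))
    return normalized


# Toy example: two low-GC windows with high counts, two high-GC windows with low counts.
norm = gc_median_normalize([100, 120, 50, 60], [0.40, 0.41, 0.70, 0.71])
```

After correction, the two GC bins in the toy example share the same median, so differences that remain between windows are no longer explained by GC-content. A mappability correction could follow the same pattern with mappability bins in place of GC bins.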
- Published
- 2014