Sequana Coverage: Detection and Characterization of Genomic Variations using Running Median and Mixture Models
- Source :
- GigaScience, BioMed Central, 2018, 7 (12), pp.giy110. ⟨10.1093/gigascience/giy110⟩
- Publication Year :
- 2018
- Publisher :
- HAL CCSD, 2018.
Abstract
- International audience; Background: In addition to mapping quality information, genome coverage contains valuable biological information such as the presence of repetitive regions, deleted genes, or copy number variations (CNVs). It is essential to take into consideration atypical regions, trends (e.g., origin of replication), and known or unknown biases that influence coverage. It is also important that reported events have robust statistics (e.g., a z-score) associated with their detection, as well as a precise location. Results: We provide a stand-alone application, sequana_coverage, that reports genomic regions of interest (ROIs) that are significantly over- or underrepresented in high-throughput sequencing data. Each event is reported with its significance and with characteristics such as the length of the region. The algorithm first detrends the data using an efficient running median algorithm. It then estimates the distribution of the normalized genome coverage with a Gaussian mixture model. Finally, a z-score statistic is assigned to each base position and used to separate the central distribution from the ROIs (i.e., under- and overcovered regions). A double-threshold mechanism is used to cluster the genomic ROIs. HTML reports provide a summary with interactive visual representations of the genomic ROIs, together with standard plots and metrics. Genomic variations such as single-nucleotide variants or CNVs can be effectively identified at the same time.
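The pipeline summarized in the abstract (detrend, normalize, score, cluster) can be illustrated with a minimal Python sketch. This is not the sequana_coverage implementation: all function names, the window size, and the threshold values are hypothetical, the running median is a naive one rather than an efficient one, and the Gaussian mixture fit is replaced here by a simpler median/MAD estimate of the central distribution to keep the example short and dependency-free.

```python
from statistics import median


def running_median(values, window):
    """Naive centered running median (window truncated at both ends)."""
    half = window // 2
    n = len(values)
    return [median(values[max(0, i - half):min(n, i + half + 1)])
            for i in range(n)]


def zscores(coverage, window):
    """Detrend coverage by its running median, then z-score each position.

    Assumption: the central distribution is estimated with the median and
    the median absolute deviation (MAD), standing in for a mixture model.
    """
    trend = running_median(coverage, window)
    norm = [c / t if t > 0 else 0.0 for c, t in zip(coverage, trend)]
    center = median(norm)
    mad = median([abs(x - center) for x in norm])
    sigma = 1.4826 * mad  # MAD scaled to match a Gaussian standard deviation
    if sigma == 0:
        sigma = 1e-9  # avoid division by zero on perfectly flat input
    return [(x - center) / sigma for x in norm]


def cluster_rois(z, high=4.0, low=2.0):
    """Double-threshold clustering of z-scores.

    A run of consecutive positions beyond `low` is kept only if at least
    one position in it is beyond `high`. Returns (start, end, peak_z)
    tuples with `end` exclusive, sorted by start.
    """
    rois = []
    for sign in (1, -1):  # over-covered first, then under-covered
        start, peak = None, 0.0
        for i, value in enumerate(list(z) + [0.0]):  # sentinel ends last run
            s = sign * value
            if s > low:
                if start is None:
                    start, peak = i, s
                else:
                    peak = max(peak, s)
            elif start is not None:
                if peak > high:
                    rois.append((start, i, sign * peak))
                start = None
    return sorted(rois)


# Synthetic coverage: mild periodic noise, one deletion, one amplification.
coverage = [100 + (i % 5) for i in range(200)]
coverage[50:60] = [5] * 10      # under-covered region
coverage[120:125] = [300] * 5   # over-covered region

rois = cluster_rois(zscores(coverage, window=41))
for start, end, peak in rois:
    print(f"ROI [{start}, {end}) peak z-score {peak:.1f}")
```

The window must be wider than the events of interest; otherwise the running median follows the event itself and the event is detrended away.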
- Subjects :
- 0301 basic medicine
DNA Copy Number Variations
Computer science
CNV
Sequencing data
Robust statistics
Health Informatics
Computational biology
Disease cluster
Genome
03 medical and health sciences
sequencing depth
mental disorders
Technical Note
Copy-number variation
Statistic
Snakemake
running median
Bacteria
Fungi
Genetic Variation
High-Throughput Nucleotide Sequencing
Sequana
Repetitive Regions
Mixture model
Computer Science Applications
030104 developmental biology
genome coverage
NGS
Viruses
[INFO.INFO-BI]Computer Science [cs]/Bioinformatics [q-bio.QM]
Algorithms
Python
Details
- Language :
- English
- ISSN :
- 2047-217X
- Database :
- OpenAIRE
- Journal :
- GigaScience, BioMed Central, 2018, 7 (12), pp.giy110. ⟨10.1093/gigascience/giy110⟩
- Accession number :
- edsair.doi.dedup.....18803234164af4ceee283b2594e2d089
- Full Text :
- https://doi.org/10.1093/gigascience/giy110