Sequana Coverage: Detection and Characterization of Genomic Variations using Running Median and Mixture Models

Authors :
Christiane Bouchier
Sean Kennedy
Dimitri Desvillechabrol
Thomas Cokelaer
Pôle Biomics (C2RT)
Centre de Ressources et de Recherche Technologique - Center for Technological Resources and Research (C2RT)
Institut Pasteur [Paris]-Institut Pasteur [Paris]
Hub Bioinformatique et Biostatistique - Bioinformatics and Biostatistics HUB
Institut Pasteur [Paris]-Centre National de la Recherche Scientifique (CNRS)
This work has been supported by the France Génomique Consortium (ANR 10-INBS-09-08).
We are grateful to Nicolas Escriou (Institut Pasteur) for providing the FastQ files and reference of the virus test case. We are also grateful to Benoit Arcangioli (Institut Pasteur) and Serge Gangloff (Institut Pasteur) for providing the FastQ files and reference of the S. pombe test case. We thank Juliana Pipoli da Fonseca for her various comments on the manuscript. We are also grateful to the reviewers who suggested the CNV studies.
ANR-10-INBS-0009,France-Génomique,Organisation et montée en puissance d'une Infrastructure Nationale de Génomique(2010)
Source :
GigaScience, BioMed Central, 2018, 7 (12), pp.giy110. ⟨10.1093/gigascience/giy110⟩
Publication Year :
2018
Publisher :
HAL CCSD, 2018.

Abstract

Background: In addition to mapping quality information, genome coverage contains valuable biological information such as the presence of repetitive regions, deleted genes, or copy number variations (CNVs). It is essential to take into consideration atypical regions, trends (e.g., origin of replication), or known and unknown biases that influence coverage. It is also important that reported events have robust statistics (e.g., z-score) associated with their detection as well as a precise location.

Results: We provide a stand-alone application, sequana_coverage, that reports genomic regions of interest (ROIs) that are significantly over- or underrepresented in high-throughput sequencing data. Significance is associated with the events as well as characteristics such as length of the regions. The algorithm first detrends the data using an efficient running median algorithm. It then estimates the distribution of the normalized genome coverage with a Gaussian mixture model. Finally, a z-score statistic is assigned to each base position and used to separate the central distribution from the ROIs (i.e., under- and overcovered regions). A double-threshold mechanism is used to cluster the genomic ROIs. HTML reports provide a summary with interactive visual representations of the genomic ROIs with standard plots and metrics. Genomic variations such as single-nucleotide variants or CNVs can be effectively identified at the same time.
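The steps named in the abstract (running-median detrending, a Gaussian mixture fit, per-base z-scores, and double-threshold clustering) can be illustrated with a minimal NumPy sketch. This is not the sequana_coverage implementation or its API: the function names, the window size, the two-component EM fit, the choice of the heaviest component as the "central" one, and the thresholds (4 and 2) are all assumptions made for illustration. It also assumes strictly positive coverage and an odd window no longer than the sequence.

```python
import numpy as np

def running_median(x, window):
    """Centered running median (window must be odd); edges padded by reflection."""
    half = window // 2
    padded = np.pad(x, half, mode="reflect")
    windows = np.lib.stride_tricks.sliding_window_view(padded, window)
    return np.median(windows, axis=1)

def fit_central_gaussian(values, n_iter=200):
    """Tiny EM for a 2-component 1D Gaussian mixture (illustrative only).
    Returns mean and standard deviation of the component with the largest weight."""
    med = np.median(values)
    mad = np.median(np.abs(values - med))
    # Assumed initialisation: one narrow (robust) and one wide component.
    mu = np.array([med, values.mean()])
    sigma = np.maximum(np.array([1.4826 * mad, values.std()]), 1e-6)
    weight = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        dens = (weight[:, None]
                * np.exp(-0.5 * ((values - mu[:, None]) / sigma[:, None]) ** 2)
                / (sigma[:, None] * np.sqrt(2 * np.pi)))
        resp = dens / (dens.sum(axis=0) + 1e-300)
        # M-step: update weights, means, and standard deviations.
        nk = resp.sum(axis=1) + 1e-12
        weight = nk / values.size
        mu = (resp * values).sum(axis=1) / nk
        sigma = np.sqrt((resp * (values - mu[:, None]) ** 2).sum(axis=1) / nk)
        sigma = np.maximum(sigma, 1e-6)
    k = int(np.argmax(weight))
    return float(mu[k]), float(sigma[k])

def cluster_rois(z, high=4.0, low=2.0):
    """Double threshold: keep runs where |z| > low that contain a point with |z| > high.
    Returns (start, stop_exclusive, mean_z) tuples, sorted by position."""
    rois = []
    for sign in (1, -1):  # over-covered, then under-covered regions
        s = sign * z
        above_low = np.concatenate(([False], s > low, [False]))
        edges = np.flatnonzero(np.diff(above_low.astype(int)))
        for start, stop in zip(edges[::2], edges[1::2]):
            if s[start:stop].max() > high:
                rois.append((int(start), int(stop), float(z[start:stop].mean())))
    return sorted(rois)

def coverage_rois(coverage, window=1001, high=4.0, low=2.0):
    """Detrend, fit the central distribution, compute z-scores, cluster ROIs."""
    normalized = coverage / running_median(coverage, window)
    mu, sigma = fit_central_gaussian(normalized)
    z = (normalized - mu) / sigma
    return cluster_rois(z, high, low), mu, sigma

# Synthetic example (made-up data): a linear trend, a duplication, and a deletion.
rng = np.random.default_rng(0)
expected = 400.0 * (1.0 + 0.5 * np.arange(10_000) / 10_000)
expected[3000:3200] *= 2.0   # simulated duplicated region
expected[7000:7100] *= 0.1   # simulated deleted region
coverage = rng.poisson(expected).astype(float)
rois, mu, sigma = coverage_rois(coverage)
for start, stop, mean_z in rois:
    print(f"ROI {start}-{stop} length={stop - start} mean z={mean_z:.1f}")
```

Dividing by the running median removes slow trends while leaving events shorter than half the window largely untouched, which is why the window here is chosen well above the length of the simulated events. The low threshold extends each detected event to its neighbouring positions so that one variation is reported as a single region instead of several fragments.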

Details

Language :
English
ISSN :
2047217X
Database :
OpenAIRE
Journal :
GigaScience, BioMed Central, 2018, 7 (12), pp.giy110. ⟨10.1093/gigascience/giy110⟩
Accession number :
edsair.doi.dedup.....18803234164af4ceee283b2594e2d089
Full Text :
https://doi.org/10.1093/gigascience/giy110