Scoring of pathogenic non-coding variants in Mendelian diseases through supervised learning on ancient, recent and ongoing purifying selection signals in human

Authors :
Yufei Luo
Antonio Rausell
Barthelemy Caron
Publication Year :
2018
Publisher :
Cold Spring Harbor Laboratory, 2018.

Abstract

The study of rare Mendelian diseases through exome sequencing typically yields incomplete diagnostic rates, ~8-70% depending on the disease type. Whole genome sequencing of the unresolved cases makes it possible to address the hypothesis that causal variants could lie in non-coding regions with damaging regulatory consequences. The large number of rare and singleton variants found in each individual genome requires computational filtering and scoring strategies to gain power in downstream statistical genetics tests. However, state-of-the-art methods for estimating the functional relevance of non-coding genomic regions have mostly been characterized on sets of variants largely composed of trait-associated polymorphisms associated with common diseases, and even there they show modest accuracy and strong positional biases. In this work we first curated a collection of n=737 high-confidence pathogenic non-coding single-nucleotide variants in proximal cis-regulatory genomic regions associated with monogenic Mendelian diseases. We then systematically evaluated the ability of a comprehensive set of natural selection features, extracted at three genomic levels (the affected position, the flanking region and the associated gene), to predict causal variants. In addition to inter-species conservation, a comprehensive set of recent and ongoing purifying selection signals in humans was explored, making it possible to capture potential constraints associated with regulatory elements recently acquired in the human lineage. A supervised learning approach using gradient tree boosting on these features reached a high predictive performance, with an area under the ROC curve of 0.84 and an area under the Precision-Recall curve of 0.47. These figures represent relative improvements of >10% and >34%, respectively, over the performance of current state-of-the-art methods for prioritizing non-coding variants.
Performance was consistent under multiple configurations of the sets of variants used for learning and for independent testing. The supervised learning design allowed the assessment of newly seen non-coding variants, overcoming gene and positional bias. The scores produced by the approach allow a more consistent weighting and aggregation of candidate pathogenic variants from diverse non-coding regions within and across genes in the context of statistical tests for rare variant association analysis.
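
As an illustration of the two metrics quoted in the abstract, the following minimal Python sketch computes an area under the ROC curve and a step-wise area under the Precision-Recall curve (average precision) from a ranked list of scored variants. It is not the authors' code: the function names, the toy labels and the toy scores are hypothetical. The last helper assumes that "relative improvement" means (new - old) / old, which the abstract does not state explicitly; under that assumption, the reported gains of >10% and >34% would imply upper bounds on the corresponding baseline values.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the pairwise (Mann-Whitney) identity."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Count positive/negative pairs ranked correctly; ties count one half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (assumes no tied scores)."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    n_pos = sum(labels)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at each recalled positive
    return total / n_pos


def implied_baseline_bound(new_value, relative_gain):
    """Baseline value implied by new = old * (1 + gain); an assumed definition."""
    return new_value / (1.0 + relative_gain)


# Hypothetical toy data: 1 = pathogenic, 0 = benign; scores are made up.
labels = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

print(f"toy AUROC: {roc_auc(labels, scores):.3f}")
print(f"toy AUPRC: {average_precision(labels, scores):.3f}")
# Bounds implied by the abstract's figures under the assumed definition.
print(f"baseline AUROC below: {implied_baseline_bound(0.84, 0.10):.3f}")
print(f"baseline AUPRC below: {implied_baseline_bound(0.47, 0.34):.3f}")
```

Because pathogenic variants are rare relative to benign ones, the Precision-Recall area is typically much lower than the ROC area for the same classifier, which is consistent with the abstract reporting both values.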

Details

Language :
English
Database :
OpenAIRE
Accession number :
edsair.doi.dedup.....791a1ad0b4d462f2a1b38a0b19f5d372
Full Text :
https://doi.org/10.1101/363903