Detecting biased validation of predictive models in the positive-unlabeled setting: disease gene prioritization case study.

Authors :
Molotkov, Ivan
Artomov, Mykyta
Source :
Bioinformatics Advances. 2023, Vol. 3 Issue 1, p1-10. 10p.
Publication Year :
2023

Abstract

Motivation: Positive-unlabeled data consists of points with either positive or unknown labels. It is widespread in medical, genetic, and biological settings, creating a high demand for predictive positive-unlabeled models. The performance of such models is usually estimated using validation sets, assumed to be selected completely at random (SCAR) from known positive examples. For certain metrics, this assumption enables unbiased performance estimation when treating positive-unlabeled data as positive/negative. However, the SCAR assumption is often adopted without proper justifications, simply for the sake of convenience.

Results: We provide an algorithm that under the weak assumptions of a lower bound on the number of positive examples can test for the violation of the SCAR assumption. Applying it to the problem of gene prioritization for complex genetic traits, we illustrate that the SCAR assumption is often violated there, causing the inflation of performance estimates, which we refer to as validation bias. We estimate the potential impact of validation bias on performance estimation. Our analysis reveals that validation bias is widespread in gene prioritization data and can significantly overestimate the performance of models. This finding elucidates the discrepancy between the reported good performance of models and their limited practical applications.

Availability and implementation: Python code with examples of application of the validation bias detection algorithm is available at github.com/ArtomovLab/ValidationBias.

[ABSTRACT FROM AUTHOR]

Details

Language :
English
Volume :
3
Issue :
1
Database :
Academic Search Index
Journal :
Bioinformatics Advances
Publication Type :
Academic Journal
Accession number :
179072766
Full Text :
https://doi.org/10.1093/bioadv/vbad128