
Coverage bias in small molecule machine learning

Authors :
Fleming Kretschmer
Jan Seipp
Marcus Ludwig
Gunnar W. Klau
Sebastian Böcker
Source :
Nature Communications, Vol 16, Iss 1, Pp 1-19 (2025)
Publication Year :
2025
Publisher :
Nature Portfolio, 2025.

Abstract

Small molecule machine learning aims to predict chemical, biochemical, or biological properties from molecular structures, with applications such as toxicity prediction, ligand binding, and pharmacokinetics. A recent trend is developing end-to-end models that avoid explicit domain knowledge. These models assume no coverage bias in training and evaluation data, meaning the data are representative of the true distribution. However, the domain of applicability is rarely considered in such models. Here, we investigate how well large-scale datasets cover the space of known biomolecular structures. To do so, we propose a distance measure based on solving the Maximum Common Edge Subgraph (MCES) problem, which aligns well with chemical similarity. Although the MCES problem is computationally hard, we introduce an efficient approach combining Integer Linear Programming and heuristic bounds. Our findings reveal that many widely used datasets lack uniform coverage of biomolecular structures, limiting the predictive power of models trained on them. We propose two additional methods to assess whether training datasets diverge from known molecular distributions, potentially guiding future dataset creation to improve model performance.
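To make the idea of an MCES-based distance concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes, for illustration only, that the distance between two molecular graphs is the number of edges left outside a maximum common edge subgraph, |E1| + |E2| - 2|MCES|, with unweighted bonds and atoms matched by element label. The function names (`mces_size`, `mces_distance`, `edge_type_lower_bound`) are hypothetical. The exact solver here is brute-force enumeration rather than Integer Linear Programming, and the bound is a simple edge-type counting argument standing in for the heuristic bounds mentioned in the abstract.

```python
from collections import Counter
from itertools import permutations


def mces_size(labels1, edges1, labels2, edges2):
    """Exact MCES size by brute force (illustrative; exponential time)."""
    # Always map the smaller vertex set into the larger one.
    if len(labels1) > len(labels2):
        labels1, edges1, labels2, edges2 = labels2, edges2, labels1, edges1
    target_edges = {frozenset(e) for e in edges2}
    best = 0
    for image in permutations(range(len(labels2)), len(labels1)):
        # A vertex only contributes if its atom label is preserved.
        ok = [labels1[u] == labels2[image[u]] for u in range(len(labels1))]
        common = sum(
            1
            for u, v in edges1
            if ok[u] and ok[v] and frozenset((image[u], image[v])) in target_edges
        )
        best = max(best, common)
    return best


def mces_distance(labels1, edges1, labels2, edges2):
    """Assumed distance: edges of both graphs not covered by the MCES."""
    common = mces_size(labels1, edges1, labels2, edges2)
    return len(edges1) + len(edges2) - 2 * common


def edge_type_lower_bound(labels1, edges1, labels2, edges2):
    """Cheap lower bound: a common subgraph can share at most
    min(count1, count2) edges of each (label, label) type."""
    def edge_types(labels, edges):
        return Counter(tuple(sorted((labels[u], labels[v]))) for u, v in edges)

    c1, c2 = edge_types(labels1, edges1), edge_types(labels2, edges2)
    return sum(abs(c1[k] - c2[k]) for k in c1.keys() | c2.keys())


# Heavy-atom graphs: ethanol (C-C-O) and propane (C-C-C).
ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])
propane = (["C", "C", "C"], [(0, 1), (1, 2)])

distance = mces_distance(*ethanol, *propane)
bound = edge_type_lower_bound(*ethanol, *propane)
```

In this toy case the two graphs share a single C-C edge, so the sketch's distance is 2 and the counting bound is also 2. A bound of this kind is useful because, whenever it already exceeds a chosen distance threshold, the expensive exact computation can be skipped. Brute-force enumeration is only feasible for a handful of atoms, which is why an exact solver such as an ILP formulation is needed at dataset scale.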

Subjects

Subjects :
Science

Details

Language :
English
ISSN :
20411723
Volume :
16
Issue :
1
Database :
Directory of Open Access Journals
Journal :
Nature Communications
Publication Type :
Academic Journal
Accession number :
edsdoj.f54fe41fbe0a4077b214283198000de1
Document Type :
article
Full Text :
https://doi.org/10.1038/s41467-024-55462-w