Coverage bias in small molecule machine learning
- Source :
- Nature Communications, Vol 16, Iss 1, Pp 1-19 (2025)
- Publication Year :
- 2025
- Publisher :
- Nature Portfolio, 2025.
Abstract
- Small molecule machine learning aims to predict chemical, biochemical, or biological properties from molecular structures, with applications such as toxicity prediction, ligand binding, and pharmacokinetics. A recent trend is developing end-to-end models that avoid explicit domain knowledge. These models assume no coverage bias in training and evaluation data, meaning the data are representative of the true distribution. However, the domain of applicability is rarely considered in such models. Here, we investigate how well large-scale datasets cover the space of known biomolecular structures. To do so, we propose a distance measure based on solving the Maximum Common Edge Subgraph (MCES) problem, which aligns well with chemical similarity. Although solving the MCES problem is computationally hard, we introduce an efficient approach that combines Integer Linear Programming with heuristic bounds. Our findings reveal that many widely used datasets lack uniform coverage of biomolecular structures, limiting the predictive power of models trained on them. We propose two additional methods to assess whether training datasets diverge from known molecular distributions, potentially guiding future dataset creation to improve model performance.
- Subjects :
- Science
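The MCES-based distance mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' method: the abstract describes Integer Linear Programming with heuristic bounds, whereas the code below swaps in brute-force enumeration that is only feasible for graphs with a handful of atoms. It also assumes, for illustration only, that the distance is the number of edges removed from both graphs to reach a common subgraph (`|E1| + |E2| - 2 * |MCES|`), that atoms match only when their element labels are equal, and that all bonds count equally. The function names `mces_size`, `mces_distance`, and `edge_count_lower_bound` are hypothetical.

```python
from itertools import permutations


def mces_size(labels1, edges1, labels2, edges2):
    """Brute-force size of a maximum common edge subgraph (tiny graphs only)."""
    edge_set2 = {frozenset(e) for e in edges2}
    n1, n2 = len(labels1), len(labels2)
    # Pad graph 2 with n1 dummy targets (None) so any atom of graph 1
    # may stay unmatched.
    targets = list(range(n2)) + [None] * n1
    best = 0
    for mapping in permutations(targets, n1):
        common = 0
        for u, v in edges1:
            fu, fv = mapping[u], mapping[v]
            if fu is None or fv is None:
                continue  # an endpoint is unmatched
            if labels1[u] != labels2[fu] or labels1[v] != labels2[fv]:
                continue  # assumption: atoms must carry the same label
            if frozenset((fu, fv)) in edge_set2:
                common += 1
        best = max(best, common)
    return best


def mces_distance(labels1, edges1, labels2, edges2):
    """Assumed distance: edges removed from both graphs to reach the MCES."""
    m = mces_size(labels1, edges1, labels2, edges2)
    return len(edges1) + len(edges2) - 2 * m


def edge_count_lower_bound(edges1, edges2):
    """Cheap lower bound: a common subgraph has at most min(|E1|, |E2|) edges."""
    return abs(len(edges1) - len(edges2))


# Heavy-atom skeletons: C-C-O versus C-C-C.
g1 = (["C", "C", "O"], [(0, 1), (1, 2)])
g2 = (["C", "C", "C"], [(0, 1), (1, 2)])
print(mces_distance(*g1, *g2))             # 2
print(edge_count_lower_bound(g1[1], g2[1]))  # 0
```

A bound such as `edge_count_lower_bound` shows the general idea behind pairing exact computation with cheap bounds: if the bound already exceeds a distance threshold of interest, the expensive exact computation can be skipped for that pair. The specific bounds used in the article are not given in this record.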
Details
- Language :
- English
- ISSN :
- 2041-1723
- Volume :
- 16
- Issue :
- 1
- Database :
- Directory of Open Access Journals
- Journal :
- Nature Communications
- Publication Type :
- Academic Journal
- Accession number :
- edsdoj.f54fe41fbe0a4077b214283198000de1
- Document Type :
- article
- Full Text :
- https://doi.org/10.1038/s41467-024-55462-w