
The Goldilocks paradigm: comparing classical machine learning, large language models, and few-shot learning for drug discovery applications

Authors: Scott H. Snyder, Patricia A. Vignaux, Mustafa Kemal Ozalp, Jacob Gerlach, Ana C. Puhl, Thomas R. Lane, John Corbett, Fabio Urbina, Sean Ekins
Source: Communications Chemistry, Vol 7, Iss 1, Pp 1-11 (2024)
Publication Year: 2024
Publisher: Nature Portfolio, 2024.

Abstract

Recent advances in machine learning (ML) have led to newer model architectures, including transformers (large language models, LLMs), which show state-of-the-art results in text generation and image analysis, as well as few-shot learning (FSLC) models, which offer predictive power with extremely small datasets. These new architectures may offer promise, yet the 'no free lunch' theorem suggests that no single algorithm can outperform at all possible tasks. Here, we explore the capabilities of classical (SVR), FSLC, and transformer (MolBART) models over a range of dataset tasks and show a 'goldilocks zone' for each model type, in which dataset size and feature distribution (i.e. dataset "diversity") determine the optimal algorithm strategy. When datasets are small (…

Subjects

Chemistry
QD1-999

Details

Language: English
ISSN: 2399-3669
Volume: 7
Issue: 1
Database: Directory of Open Access Journals
Journal: Communications Chemistry
Publication Type: Academic Journal
Accession number: edsdoj.35dfdbecd04240859dca01d619069e05
Document Type: article
Full Text: https://doi.org/10.1038/s42004-024-01220-4