Cautionary Guidelines for Machine Learning Studies with Combinatorial Datasets

Authors :
Jeremy J. Henle
Scott E. Denmark
Andrew F. Zahrt
Source :
ACS Combinatorial Science. 22:586-591
Publication Year :
2020
Publisher :
American Chemical Society (ACS), 2020.

Abstract

Regression modeling is becoming increasingly prevalent in organic chemistry as a tool for reaction outcome prediction and mechanistic interrogation. Frequently, to acquire the requisite amount of data for such studies, researchers employ combinatorial datasets to maximize the number of data points while limiting the number of discrete chemical entities required. An often-overlooked problem in modeling studies using combinatorial datasets is the tendency to fit on patterns in the datasets (i.e., the presence or absence of a reactant or catalyst) rather than to identify meaningful trends between descriptors and the response variable. Consequently, the generality and interpretability of such models suffer. This report illustrates these well-known pitfalls in a case study, demonstrates the necessary control experiments to identify when this property will be problematic, and suggests how to perform further validation to assess general applicability and interpretability of models trained using combinatorial datasets.
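The kind of control the abstract alludes to can be sketched in a few lines. This record does not spell out the specific controls used in the paper, so the following is only an illustrative assumption: a featurization that encodes nothing but the presence or absence of each component, and a split that holds out every reaction containing a given component. All names (`combinatorial_dataset`, `one_hot_features`, `leave_one_component_out`, the `cat_*` and `sub_*` labels) are hypothetical.

```python
from itertools import product


def combinatorial_dataset(catalysts, substrates):
    """Return every (catalyst, substrate) pair, as in a full combinatorial screen."""
    return list(product(catalysts, substrates))


def one_hot_features(dataset, catalysts, substrates):
    """Control featurization (assumed): encodes only which components are present.

    If a model trained on these features performs about as well as one trained
    on chemical descriptors, the descriptors may not be what the model relies on.
    """
    cat_index = {c: i for i, c in enumerate(catalysts)}
    sub_index = {s: i for i, s in enumerate(substrates)}
    rows = []
    for cat, sub in dataset:
        row = [0] * (len(catalysts) + len(substrates))
        row[cat_index[cat]] = 1
        row[len(catalysts) + sub_index[sub]] = 1
        rows.append(row)
    return rows


def leave_one_component_out(dataset, position):
    """Yield (held_out, train_idx, test_idx) for each distinct component.

    position=0 holds out catalysts, position=1 holds out substrates. The
    held-out component never appears in the training indices.
    """
    components = sorted({entry[position] for entry in dataset})
    for held_out in components:
        train = [i for i, e in enumerate(dataset) if e[position] != held_out]
        test = [i for i, e in enumerate(dataset) if e[position] == held_out]
        yield held_out, train, test


# Hypothetical 3 x 2 screen: six reactions built from five discrete entities.
catalysts = ["cat_A", "cat_B", "cat_C"]
substrates = ["sub_1", "sub_2"]
data = combinatorial_dataset(catalysts, substrates)
features = one_hot_features(data, catalysts, substrates)
splits = list(leave_one_component_out(data, position=0))
```

A random split of `data` would place every catalyst in both the training and test sets, so a good test score could reflect memorized component identity. Scoring the same model on the splits above, and against the one-hot control, is one way to probe that possibility.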

Details

ISSN :
2156-8944 and 2156-8952
Volume :
22
Database :
OpenAIRE
Journal :
ACS Combinatorial Science
Accession number :
edsair.doi.dedup.....8182fbb25027132a83930a51d5f78b7f
Full Text :
https://doi.org/10.1021/acscombsci.0c00118