On missing random effects in machine learning.

Authors :: D'Ottaviano, Fabio
Yang, Wenzhao
Source :: Communications in Statistics: Simulation & Computation. 2022, Vol. 51 Issue 11, p6320-6331. 12p.
Publication Year :: 2022
Abstract: The wide availability of undesigned data, a by-product of chemical industrial research and manufacturing, makes the venturesome use of machine learning attractive for its plug-and-play appeal in attempts to extract value from these data. Such data often reflect not only the response to controlled variation but also variation caused by random effects. Machine-learning-based models in this industry may therefore easily leave out active random effects. This study shows by simulation the effect of missing a random effect via machine learning, against the benchmark of including it properly via mixed models, in a context commonly encountered in the chemical industry (mixture experiments with process variables) and as a function of relative cluster size, total variance, proportion of variance attributed to the random effect, and data size. Simulation was used because it allows the comparison (missing vs. not missing random effects) to be made clearly and simply while avoiding the unwanted confounders found in real-world data. Besides the long-established fact that machine learning performs better the larger the data, it was also observed that data lacking due specificity, i.e. without clustering information, cause critical prediction biases regardless of data size. [ABSTRACT FROM AUTHOR]
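
The abstract's central observation, that a model fitted without clustering information stays biased per cluster however much data it sees, can be illustrated with a minimal sketch. This is not the authors' simulation: it assumes a single random intercept per cluster, one hypothetical predictor `x` with made-up coefficients, and pooled least squares as a stand-in for any learner that ignores the cluster label.

```python
import numpy as np

def simulate(n_clusters, n_per_cluster, total_var=1.0, re_share=0.5, seed=0):
    """Clustered data with a random intercept (illustrative values only)."""
    rng = np.random.default_rng(seed)
    sigma_b = np.sqrt(total_var * re_share)        # random-effect std. dev.
    sigma_e = np.sqrt(total_var * (1 - re_share))  # residual std. dev.
    cluster = np.repeat(np.arange(n_clusters), n_per_cluster)
    x = rng.uniform(0.0, 1.0, size=cluster.size)
    b = rng.normal(0.0, sigma_b, size=n_clusters)  # one intercept shift per cluster
    y = 2.0 + 3.0 * x + b[cluster] + rng.normal(0.0, sigma_e, size=cluster.size)
    return x, y, cluster, b

def pooled_cluster_bias(x, y, cluster):
    """Fit y ~ x ignoring clusters; return the mean residual of each cluster."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n_clusters = cluster.max() + 1
    sums = np.bincount(cluster, weights=resid, minlength=n_clusters)
    counts = np.bincount(cluster, minlength=n_clusters)
    return sums / counts

def rms(v):
    return float(np.sqrt(np.mean(v ** 2)))

# Same number of clusters, 100 times more observations per cluster.
x_s, y_s, c_s, b_small = simulate(n_clusters=30, n_per_cluster=20)
x_l, y_l, c_l, b_large = simulate(n_clusters=30, n_per_cluster=2000)

bias_small = pooled_cluster_bias(x_s, y_s, c_s)
bias_large = pooled_cluster_bias(x_l, y_l, c_l)

print("RMS per-cluster bias, small data:", rms(bias_small))
print("RMS per-cluster bias, large data:", rms(bias_large))
print("corr(bias, true random effect), large data:",
      float(np.corrcoef(bias_large, b_large)[0, 1]))
```

Under these assumptions the per-cluster bias of the pooled fit tracks the unmodelled random intercepts, so adding observations per cluster sharpens the estimate of the fixed part without removing the cluster-level error. A mixed model, or any learner given the cluster label, is what would address that part.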