1. Representative splitting cross validation.
- Author
- Xu, Lu; Hu, Ou; Guo, Yuwan; Zhang, Mengqin; Lu, Daowang; Cai, Chen-Bo; Xie, Shunping; Goodarzi, Mohammad; Fu, Hai-Yan; She, Yuan-Bin
- Subjects
- *
LATENT variables , *ESTIMATION theory , *MATHEMATICAL complex analysis , *MULTIVARIATE analysis , *PARTIAL least squares regression - Abstract
- Cross-validation (CV) is widely used to estimate model complexity, i.e. the number of significant latent variables (LVs), for multivariate calibration methods such as partial least squares (PLS). A basic consideration when developing and validating multivariate calibration models is that both the training and validation sets should be representative and distributed as uniformly as possible in the experimental space. Motivated by this idea, we propose a new CV method called representative splitting cross-validation (RSCV). In RSCV, the DUPLEX algorithm is first used to sequentially divide the original training set into k (in this work, k = 2, 4, 8 and 16) equal parts. A series of k-fold (k = 2, 4, 8 and 16) CVs is then performed based on this data splitting. Finally, the pooled root mean squared error of CV (RMSECV) is used to estimate model complexity. Five real multivariate calibration data sets were investigated, and RSCV was compared with leave-one-out CV (LOOCV), 10-fold CV and Monte Carlo CV (MCCV). With a maximum k of 16, RSCV proved to be a useful and stable method for selecting PLS LVs and can obtain simpler models with acceptable computational burden.
- Highlights
  • Representative splitting cross-validation (RSCV) was proposed.
  • The DUPLEX algorithm was used to split the raw data set.
  • RSCV is a fusion of serial k-fold cross-validations.
  • RSCV is stable and can obtain simpler models when necessary.
  [ABSTRACT FROM AUTHOR]
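The procedure described in the abstract, DUPLEX splitting into 2, 4, 8, ... parts, serial k-fold CVs on those splits, and a pooled RMSECV to pick the number of LVs, can be sketched as follows. This is an illustrative reimplementation, not the authors' code: the function names (`duplex`, `duplex_folds`, `pls1_fit`, `rscv`) are hypothetical, PLS1 is implemented with a minimal NIPALS loop, and the DUPLEX seeding and alternation follow the standard description of that algorithm.

```python
import numpy as np

def duplex(X):
    """Split samples into two representative halves (standard DUPLEX scheme).

    Each half is seeded with the most distant remaining pair; the rest are
    alternately assigned by the max-min-distance rule.
    """
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    i, j = np.unravel_index(np.argmax(D), D.shape)
    A, remaining = [i, j], set(range(n)) - {i, j}
    sub = sorted(remaining)
    a, b = np.unravel_index(np.argmax(D[np.ix_(sub, sub)]), (len(sub), len(sub)))
    B = [sub[a], sub[b]]
    remaining -= set(B)
    turn = 0
    while remaining:
        rem = sorted(remaining)
        target = A if turn == 0 else B
        dist = D[np.ix_(rem, target)].min(axis=1)  # distance to nearest member
        pick = rem[int(np.argmax(dist))]
        target.append(pick)
        remaining.remove(pick)
        turn ^= 1
    return np.array(A), np.array(B)

def duplex_folds(X, idx, depth):
    """Recursively DUPLEX-split `idx` into 2**depth representative folds."""
    if depth == 0:
        return [idx]
    A, B = duplex(X[idx])
    return duplex_folds(X, idx[A], depth - 1) + duplex_folds(X, idx[B], depth - 1)

def pls1_fit(X, y, n_lv):
    """Minimal NIPALS PLS1; returns training means and a regression vector."""
    xm, ym = X.mean(axis=0), y.mean()
    Xr, yr = X - xm, y - ym
    W, P, q = [], [], []
    for _ in range(n_lv):
        w = Xr.T @ yr
        w /= np.linalg.norm(w)
        t = Xr @ w
        tt = t @ t
        p = Xr.T @ t / tt
        W.append(w); P.append(p); q.append((yr @ t) / tt)
        Xr = Xr - np.outer(t, p)   # deflate X and y before the next LV
        yr = yr - q[-1] * t
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    B = W @ np.linalg.solve(P.T @ W, q)  # coefficients in the original X space
    return xm, ym, B

def rscv(X, y, max_lv, depths=(1, 2, 3, 4)):
    """Pool RMSECV over serial 2**depth-fold CVs (k = 2, 4, 8, 16) per LV count."""
    n = len(X)
    sse, total = np.zeros(max_lv), 0
    for depth in depths:
        folds = duplex_folds(X, np.arange(n), depth)
        for i, test in enumerate(folds):
            train = np.concatenate([f for j, f in enumerate(folds) if j != i])
            for lv in range(1, max_lv + 1):
                xm, ym, B = pls1_fit(X[train], y[train], lv)
                pred = ym + (X[test] - xm) @ B
                sse[lv - 1] += np.sum((pred - y[test]) ** 2)
        total += n                   # each k-fold CV predicts every sample once
    pooled = np.sqrt(sse / total)    # pooled RMSECV for each number of LVs
    return int(np.argmin(pooled)) + 1, pooled

# Demo on synthetic data: y depends on two directions of X plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 8))
y = X[:, 0] + 0.5 * X[:, 1] + 0.05 * rng.normal(size=64)
best_lv, pooled = rscv(X, y, max_lv=4, depths=(1, 2))
```

The demo uses only depths 1 and 2 (k = 2 and 4) to keep the run short; the paper's full scheme would use depths up to 4 (k = 16) and select the LV count minimizing the pooled RMSECV.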
- Published
- 2018