Fair train-test split in machine learning: Mitigating spatial autocorrelation for improved prediction accuracy.

Authors :
Salazar, Jose J.
Garland, Lean
Ochoa, Jesus
Pyrcz, Michael J.
Source :
Journal of Petroleum Science & Engineering. Feb2022, Vol. 209, pN.PAG-N.PAG. 1p.
Publication Year :
2022

Abstract

Machine learning supports prediction and inference in multivariate and complex datasets where observations are spatially related to one another. Frequently, these datasets exhibit spatial autocorrelation that violates the assumption of independent and identically distributed data. Overlooking this correlation results in over-optimistic models that fail to account for the geographical configuration of the data. Furthermore, although different data split methods account for spatial autocorrelation, these methods are inflexible, and the parameter training and hyperparameter tuning of the machine learning model are set with a different prediction difficulty than the planned real-world use of the model. In other words, it is an unfair training-testing process. We present a novel method that considers spatial autocorrelation and the planned real-world use of the spatial prediction model to design a fair train-test split. Demonstrations include two examples of the planned real-world use of the model using a realistic multivariate synthetic dataset and the analysis of 148 wells from an undisclosed Equinor play. First, the workflow applies the semivariogram model of the target to compute the simple kriging variance as a proxy of spatial estimation difficulty based on the spatial data configuration. Second, the workflow employs a modified rejection sampling to generate a test set with similar prediction difficulty as the planned real-world use of the model. Third, we compare 100 test-set realizations to the model's planned real-world use, using probability distributions and two divergence metrics: the Jensen-Shannon distance and the mean squared error. The analysis ranks the spatial fair train-test split method as the only one to replicate the difficulty (i.e., kriging variance) compared to the validation set approach and spatial cross-validation.
Moreover, the proposed method outperforms the validation set approach, yielding a lower mean percentage error when predicting a target feature in an undisclosed Equinor play using a random forest model. The resulting outputs are training and test sets ready for model fit and assessment with any machine learning algorithm. Thus, the proposed workflow offers spatially aware sets ready for predictive machine learning problems, with similar estimation difficulty as the planned real-world use of the model and compatible with any spatial data analysis task.

• Our data split method handles spatial autocorrelation and imposes prediction fairness.
• The sets impose fair algorithms with similar difficulty in all machine learning steps.
• Kriging variance is a surrogate of spatial prediction difficulty.
• The resulting training and test sets are compatible with any machine learning model.

[ABSTRACT FROM AUTHOR]
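The abstract names two quantities that are easy to illustrate: the simple kriging variance used as a difficulty proxy, and the Jensen-Shannon distance used to compare distributions. The following is a minimal sketch of both, not the authors' implementation. It assumes an isotropic exponential covariance model with a hypothetical sill and practical range, 2-D coordinates, and distinct data locations; the function names (`exp_cov`, `sk_variance`, `js_distance`) are illustrative only. The modified rejection sampling step is not described in enough detail in the abstract to sketch here.

```python
import math

def exp_cov(h, sill=1.0, rng=10.0):
    """Exponential covariance at lag h (sill and practical range are assumptions)."""
    return sill * math.exp(-3.0 * h / rng)

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def sk_variance(target, data, sill=1.0, rng=10.0):
    """Simple kriging variance at `target` given data locations `data`.

    Depends only on the data configuration and the covariance model,
    not on the measured values. Data locations must be distinct.
    """
    if not data:
        return sill  # no conditioning data: variance equals the sill
    cov = [[exp_cov(math.dist(p, q), sill, rng) for q in data] for p in data]
    c0 = [exp_cov(math.dist(p, target), sill, rng) for p in data]
    weights = solve(cov, c0)
    return sill - sum(w * c for w, c in zip(weights, c0))

def js_distance(p, q):
    """Jensen-Shannon distance (base-2 logarithm) between two discrete distributions."""
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    mix = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return math.sqrt(max(0.0, (kl(p, mix) + kl(q, mix)) / 2))
```

For example, with hypothetical wells at `(0, 0)`, `(10, 0)`, and `(0, 10)`, `sk_variance((1.0, 1.0), wells)` is smaller than `sk_variance((50.0, 50.0), wells)`: a location far from all data is harder to estimate, which is the sense in which kriging variance serves as a proxy for prediction difficulty. Histograms of such variances for a candidate test set and for the planned prediction locations could then be compared with `js_distance`.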

Details

Language :
English
ISSN :
09204105
Volume :
209
Database :
Academic Search Index
Journal :
Journal of Petroleum Science & Engineering
Publication Type :
Academic Journal
Accession number :
154452637
Full Text :
https://doi.org/10.1016/j.petrol.2021.109885