A probabilistic approach to training machine learning models using noisy data.

Authors :
Alzraiee, Ayman H.
Niswonger, Richard G.
Source :
Environmental Modelling & Software. Aug2024, Vol. 179, pN.PAG-N.PAG. 1p.
Publication Year :
2024

Abstract

Machine learning (ML) models are increasingly popular in environmental and hydrologic modeling, but they typically contain uncertainties resulting from noisy data (erroneous or outlier data). This paper presents a novel probabilistic approach that combines ML and Markov Chain Monte Carlo simulation to (1) detect and underweight likely noisy data, (2) develop an approach capable of detecting noisy data during model deployment, and (3) interpret the reasons why a data point is deemed noisy to help heuristically distinguish between outliers and erroneous data. The new algorithm recognizes that there is no unique way to split the training data into noisy and clean data, and thus produces an ensemble of plausible splits. The algorithm successfully detected noisy data in synthetic benchmark problems with varying complexity and a real-world public supply water withdrawal dataset. The algorithm is generic and flexible, making it suitable for application across a broad range of hydrologic and environmental disciplines.

• The study presents a new probabilistic method to identify and reduce the impact of noisy data in machine learning datasets.
• The approach generates a supervised noise detection model to identify noisy data during both model development and deployment.
• The supervised noise detection model is interpreted to identify factors causing data to appear as noisy.
• Interpretation of the supervised noise detection model is used to heuristically distinguish between erroneous and outlier data.

[ABSTRACT FROM AUTHOR]

Details

Language :
English
ISSN :
13648152
Volume :
179
Database :
Academic Search Index
Journal :
Environmental Modelling & Software
Publication Type :
Academic Journal
Accession number :
178478197
Full Text :
https://doi.org/10.1016/j.envsoft.2024.106133