1. A data-driven method for detecting and diagnosing causes of water quality contamination in a dataset with a high rate of missing values.
- Author
- Ngouna, Raymond Houé; Ratolojanahary, Romy; Medjaher, Kamal; Dauriac, Fabien; Sebilo, Mathieu; and Junca-Bourié, Jean
- Subjects
- *WATER pollution, *WATER quality, *WATER supply, *ATRAZINE, *ANOMALY detection (Computer security), *GENETIC algorithms
- Abstract
Democratization of sensing devices in industrial systems has made it possible to collect large amounts of data of different types, which has created the need for complex analyses to extract knowledge. Water resources is one of the fields that has drawn the attention of decision-makers seeking to preserve human health and safety. Recent advances in Artificial Intelligence, particularly in Machine Learning, have opened up the potential to leverage massive data to better address the relationship between water quality and human activities. However, a high rate of missing data and heterogeneity of the measurements are scientific issues that cannot be solved by standard methods, especially when no prior knowledge of the label of each observation is provided. In this article, Prognostics and Health Management was implemented to detect and diagnose anomalies in water quality datasets, taking into account the uncertainties induced by the above-mentioned issues. Fuzzy c-means was used to identify the different water quality classes, while Random Forest was applied to determine the most influential parameters with respect to potential contamination of water resources in the southwest of France. The results suggest that multiple imputation methods can handle the missingness issue, while decision rules based on well-known water quality standards can compensate for the lack of labelled observations. In addition, two potential sources of contamination (atrazine and nitrate) were identified and then validated by hydrogeology experts, prior to further online deployment of the proposed model.
• Anomaly detection method able to handle a high rate of missing values.
• Definition of decision rules to handle the lack of prior knowledge on the raw samples.
• Hybridization with Genetic Algorithm to optimize the choice of hyperparameters.
• Recommendations for the data collection strategy to reduce underlying costs.
[ABSTRACT FROM AUTHOR]
- Published
- 2020
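
The abstract above names fuzzy c-means as the step that identifies water quality classes, but this record gives no implementation details. As a rough illustration only, the sketch below shows the standard fuzzy c-means update rules in plain Python. The function name `fuzzy_c_means`, its parameters, and the two-feature sample values are hypothetical and invented for this example; they are not the authors' code or data.

```python
import math
import random


def fuzzy_c_means(points, n_clusters, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Hypothetical helper: return (centers, memberships) for numeric tuples.

    m is the fuzzifier (m > 1); memberships[i][k] is the degree to which
    point i belongs to cluster k, and each row sums to 1.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])

    # Random initial memberships, each row normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        u.append([v / total for v in row])

    centers = []
    for _ in range(n_iter):
        # Center update: mean of the points weighted by membership ** m.
        centers = []
        for k in range(n_clusters):
            w = [u[i][k] ** m for i in range(n)]
            w_sum = sum(w)
            centers.append(tuple(
                sum(w[i] * points[i][d] for i in range(n)) / w_sum
                for d in range(dim)
            ))

        # Membership update: closer centers receive higher membership.
        shift = 0.0
        exponent = 2.0 / (m - 1.0)
        for i in range(n):
            dist = [max(math.dist(points[i], c), 1e-12) for c in centers]
            new_row = [
                1.0 / sum((dist[k] / dist[j]) ** exponent
                          for j in range(n_clusters))
                for k in range(n_clusters)
            ]
            shift = max(shift, max(abs(a - b) for a, b in zip(new_row, u[i])))
            u[i] = new_row

        # Stop once memberships no longer change appreciably.
        if shift < tol:
            break
    return centers, u


# Made-up two-feature samples forming two well-separated groups.
samples = [(5.0, 0.01), (6.0, 0.02), (5.5, 0.015),
           (48.0, 0.12), (52.0, 0.15), (50.0, 0.11)]
centers, memberships = fuzzy_c_means(samples, n_clusters=2)
labels = [max(range(2), key=lambda k: row[k]) for row in memberships]
print(labels)
```

Unlike hard clustering, each sample keeps a graded membership in every class, which is one way to represent uncertainty about class boundaries. In practice the features would normally be scaled first, since the Euclidean distance used here is dominated by whichever feature has the largest range.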