A probabilistic approach to training machine learning models using noisy data

July 8, 2024

Machine learning (ML) models are increasingly popular in environmental and hydrologic modeling, but they typically carry uncertainty that stems from noisy data (erroneous or outlier data). This paper presents a novel probabilistic approach that combines ML and Markov Chain Monte Carlo simulation to (1) detect and underweight likely noisy data, (2) detect noisy data during model deployment, and (3) interpret why a data point is deemed noisy, which helps heuristically distinguish outliers from erroneous data. The new algorithm recognizes that there is no unique way to split the training data into noisy and clean data, and thus produces an ensemble of plausible splits. The algorithm successfully detected noisy data in synthetic benchmark problems of varying complexity and in a real-world public supply water withdrawal dataset. The algorithm is generic and flexible, making it suitable for application across a broad range of hydrologic and environmental disciplines.
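
The idea of sampling an ensemble of plausible clean/noisy splits, rather than committing to one split, can be illustrated with a small sketch. This is not the authors' algorithm: it is a minimal Metropolis sampler written under several assumptions, namely a straight-line fit standing in for the ML model, Gaussian likelihoods with a narrow spread for clean points and a wide spread for noisy points, and a fixed prior probability that any point is noisy. All names and parameter values (`log_posterior`, `sample_splits`, `sigma_clean`, `sigma_noisy`, `prior_noisy`) are hypothetical.

```python
import numpy as np


def log_posterior(mask, x, y, sigma_clean=0.5, sigma_noisy=10.0, prior_noisy=0.1):
    """Score one clean/noisy split (mask True = noisy). Assumed scoring rule."""
    clean = ~mask
    if clean.sum() < 2:
        return -np.inf  # a line cannot be fitted to fewer than two points
    # Fit the stand-in model (a straight line) using only points labeled clean.
    slope, intercept = np.polyfit(x[clean], y[clean], 1)
    resid = y - (slope * x + intercept)
    # Clean points must sit close to the fit; noisy points may sit far from it.
    sigma = np.where(mask, sigma_noisy, sigma_clean)
    log_lik = np.sum(-0.5 * (resid / sigma) ** 2 - np.log(sigma))
    log_prior = np.sum(np.where(mask, np.log(prior_noisy), np.log(1.0 - prior_noisy)))
    return log_lik + log_prior


def sample_splits(x, y, n_iter=4000, burn_in=1000, seed=0):
    """Metropolis sampling over binary masks; returns the retained splits."""
    rng = np.random.default_rng(seed)
    mask = np.zeros(len(x), dtype=bool)  # start with every point labeled clean
    current = log_posterior(mask, x, y)
    kept = []
    for i in range(n_iter):
        # Propose flipping the label of one randomly chosen point.
        proposal = mask.copy()
        j = rng.integers(len(x))
        proposal[j] = ~proposal[j]
        candidate = log_posterior(proposal, x, y)
        # Symmetric proposal, so the acceptance ratio is the posterior ratio.
        if np.log(rng.random()) < candidate - current:
            mask, current = proposal, candidate
        if i >= burn_in:
            kept.append(mask.copy())
    return np.array(kept)


# Synthetic example: a linear trend with three deliberately corrupted points.
rng = np.random.default_rng(42)
x = np.linspace(0.0, 10.0, 40)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, size=x.size)
corrupted = np.array([5, 17, 30])
y[corrupted] += 15.0

splits = sample_splits(x, y)
noise_probability = splits.mean(axis=0)  # fraction of splits flagging each point
weights = 1.0 - noise_probability        # one possible way to underweight points
```

Each row of `splits` is one plausible labeling of the training data, and averaging over rows gives a per-point probability of being noisy. Turning that probability into a training weight as `1 - noise_probability` is only one option and is an assumption of this sketch.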

Publication Year: 2024
Title: A probabilistic approach to training machine learning models using noisy data
DOI: 10.1016/j.envsoft.2024.106133
Authors: Ayman Alzraiee, Richard Niswonger
Publication Type: Article
Publication Subtype: Journal Article
Series Title: Environmental Modelling & Software
Index ID: 70258333
Record Source: USGS Publications Warehouse
USGS Organization: California Water Science Center