Data Release for Evaluation of Six Methods for Correcting Bias in Estimates from Ensemble Tree Machine Learning Regression Models
February 25, 2021
Ensemble-tree machine learning (ML) regression models can be prone to systematic bias: small values are overestimated and large values are underestimated. Additional bias can be introduced if the dependent variable is a transform of the original data. Six methods were evaluated for their ability to correct systematic and introduced bias: (1) empirical distribution matching (EDM); (2) regression of observed on estimated values (ROE); (3) linear transfer function (LTF); (4) linear equation based on Z-score transform (ZZ); (5) second machine learning model used to estimate residuals (ML2-RES); and (6) Duan smearing estimate applied after ROE is implemented (ROE-Duan). The performance of the methods was evaluated using four previously published ML case studies of groundwater quality: (1) pH in the glacial aquifer system; (2) pH in the North Atlantic Coastal Plain; (3) nitrate in the Central Valley of California; and (4) iron in the Mississippi Embayment.
This data release includes nine tables. Each of the four case studies has a training data table and a holdout data table, for a total of eight data tables. Each data table includes observed values and ML estimates, which were obtained from previously published reports (Ransom and others, 2017; DeSimone and others, 2020; Knierim and others, 2020; Stackelberg and others, 2020), and bias-corrected values for each data point. The methods for obtaining the bias-corrected values are described in the primary related publication (Belitz and Stackelberg, 2021). The ninth table includes coefficients of equations associated with selected bias-correction methods for each of the case studies. Not all of the methods were applied to all of the case studies.
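As an illustration of two of the methods named above, the following Python sketch shows one common way to implement regression of observed on estimated values (ROE) and empirical distribution matching (EDM). It is not the authors' code, and the exact formulations in Belitz and Stackelberg (2021) may differ. The function names (`fit_roe`, `apply_roe`, `apply_edm`) are hypothetical, ROE is assumed to be an ordinary least-squares line fitted on the training data, and EDM is assumed to be a rank-based quantile mapping from training estimates to training observations.

```python
import numpy as np


def fit_roe(observed, estimated):
    """Fit observed = intercept + slope * estimated by ordinary least squares.

    Assumption: ROE is taken here to be a simple linear regression fitted
    on the training data.
    """
    slope, intercept = np.polyfit(estimated, observed, deg=1)
    return slope, intercept


def apply_roe(estimated, slope, intercept):
    """Apply the fitted ROE line to ML estimates (training or holdout)."""
    return intercept + slope * np.asarray(estimated, dtype=float)


def apply_edm(estimated_new, estimated_train, observed_train):
    """Map each estimate to the observed value at the same empirical quantile.

    Assumption: EDM is taken here to be quantile mapping between the sorted
    training estimates and the sorted training observations. Estimates
    outside the training range are clamped to the end points by np.interp.
    """
    est_sorted = np.sort(np.asarray(estimated_train, dtype=float))
    obs_sorted = np.sort(np.asarray(observed_train, dtype=float))
    return np.interp(estimated_new, est_sorted, obs_sorted)
```

In this sketch the correction is fitted on the training table and then applied unchanged to the holdout table, so that the holdout data remain an independent check on the corrected estimates.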
Citation Information
Publication Year | 2021 |
---|---|
Title | Data Release for Evaluation of Six Methods for Correcting Bias in Estimates from Ensemble Tree Machine Learning Regression Models |
DOI | 10.5066/P9LCTYI2 |
Authors | Kenneth Belitz, Paul E Stackelberg, Jennifer B Sharpe |
Product Type | Data Release |
Record Source | USGS Asset Identifier Service (AIS) |
USGS Organization | Water Resources Mission Area - Headquarters |
Rights | This work is marked with CC0 1.0 Universal |
Related
Evaluation of six methods for correcting bias in estimates from ensemble tree machine learning regression models
Ensemble-tree machine learning (ML) regression models can be prone to systematic bias: small values are overestimated and large values are underestimated. Additional bias can be introduced if the dependent variable is a transform of the original data. Six methods were evaluated for their ability to correct systematic and introduced bias. Method performance was evaluated using four case...
Authors
Kenneth Belitz, Paul Stackelberg