An exploratory Bayesian network for estimating the magnitudes and uncertainties of selected water-quality parameters at streamgage 03374100 White River at Hazleton, Indiana, from partially observed data
An exploratory discrete Bayesian network (BN) was developed to assess the potential of this type of model for estimating the magnitudes and uncertainties of an arbitrary subset of unmeasured water-quality parameters, given measurements of the remaining parameters historically measured at a U.S. Geological Survey streamgage. Data for 27 water-quality parameters from 596 discrete measurements at U.S. Geological Survey streamgage 03374100 White River at Hazleton, Indiana, were used to develop this BN. Data for each of the water-quality parameters were discretized into five intervals based on the quintiles of the measured values. The 596 discrete measurements were randomly partitioned into a training set with 80 percent of the data and a testing set with 20 percent of the data, which were used to identify and estimate the BN and to assess its training and testing accuracy.
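The discretization and partitioning steps described above can be sketched as follows. This is a minimal illustration only, not the report's code: the synthetic lognormal values, the helper names (`quintile_edges`, `discretize`, `split_train_test`), the random seeds, the choice of the "inclusive" quantile method, and the decision to compute quintiles from all 596 values are all assumptions made for the example.

```python
import bisect
import random
import statistics


def quintile_edges(values):
    """Return the four interior cut points that split values into quintiles."""
    return statistics.quantiles(values, n=5, method="inclusive")


def discretize(values, edges):
    """Map each value to an interval index 0..4 using the cut points."""
    return [bisect.bisect_right(edges, v) for v in values]


def split_train_test(n_rows, train_fraction=0.8, seed=0):
    """Randomly partition row indices into training and testing sets."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_train = int(train_fraction * n_rows)
    return indices[:n_train], indices[n_train:]


# Synthetic stand-in for one water-quality parameter (hypothetical data).
rng = random.Random(42)
values = [rng.lognormvariate(0.0, 1.0) for _ in range(596)]

edges = quintile_edges(values)
bins = discretize(values, edges)
train_idx, test_idx = split_train_test(len(values))

print("cut points:", [round(e, 3) for e in edges])
print("counts per interval:", [bins.count(k) for k in range(5)])
print("training rows:", len(train_idx), "testing rows:", len(test_idx))
```

With 596 rows, an 80/20 split under this rounding choice gives 476 training rows and 120 testing rows. Each interval holds roughly one fifth of the values.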
A BN with 28 nodes was formed from the 27 water-quality parameters and the month of sample collection. Based on data in the training set, a network with 53 directed edges and month as the target node was identified by minimizing the negative log-likelihood function for all nodes treated, in turn, as the target variable. The edge structure determines the number and magnitude of elements in conditional probability tables associated with all nodes.
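To illustrate how the edge structure fixes the size of each conditional probability table (CPT), the sketch below counts CPT entries for a small, purely hypothetical network. The node names, the edges, and the assumption that month has 12 states are illustrative only and are not taken from the report. In a discrete BN, a node's CPT holds one probability for each combination of its own states and its parents' states.

```python
from math import prod

# Hypothetical state counts: five quintile intervals per parameter,
# and an assumed 12 states for month.
n_states = {
    "month": 12,
    "temperature": 5,
    "discharge": 5,
    "turbidity": 5,
    "dissolved_oxygen": 5,
}

# Hypothetical directed edges, stored as child -> list of parents.
parents = {
    "temperature": ["month"],
    "discharge": ["month"],
    "turbidity": ["discharge"],
    "dissolved_oxygen": ["temperature", "month"],
}


def cpt_size(node, parents, n_states):
    """Number of CPT entries: node states times the product of parent states."""
    return n_states[node] * prod(n_states[p] for p in parents.get(node, []))


sizes = {node: cpt_size(node, parents, n_states) for node in n_states}
for node, size in sizes.items():
    print(f"{node}: {size} entries")
```

Each added parent multiplies the table size by that parent's number of states, which is one reason a sparse edge structure matters when only a few hundred samples are available for estimation.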
The effectiveness of the BN was assessed on the basis of correct classification rates to one of the five discrete intervals, which were computed separately for the training and testing datasets and for two conditioning variable sets. The selected sets of conditioning variables represent two of many possible sets of measured parameters on which to base estimates of unmeasured parameters. The first set includes only the month of sample collection (month), and an expanded set includes month and six other continuously measurable parameters, referred to as the ContMeasSet, all of which were obtained from the discrete data.
Results indicated that the training dataset had average correct classification rates of 41.7 and 61.2 percent when conditioned on the month and ContMeasSet sets, respectively. The testing dataset had somewhat lower average correct classification rates of 40.8 and 56.5 percent for the two conditioning variable sets. When conditioned on month only, the average correct classification rate for the testing dataset was only slightly lower than that for the training dataset, indicating little model overfitting. When using the ContMeasSet, however, the average decrease in accuracy between training and testing sets was 4.7 percentage points. Results for both datasets and both sets of conditioning variables nonetheless indicate that the BN would substantially outperform a random assignment model, which would be expected to have a 20-percent correct classification rate. In addition, the edge structure of the BN depicts how information can flow through the network, which may help prioritize parameters for measurement to facilitate estimation of unmeasured parameters. Finally, extension of a static BN, like the one developed in this report, to a dynamic BN may provide a basis for using high-frequency or continuous water-quality data to extend information in time between discrete water-quality samples, and this integration could mitigate some of the limitations of high-frequency and discrete water-quality sampling methods.
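The correct classification rate and the random-assignment baseline discussed above can be sketched as follows. The function name and the short lists of observed and predicted interval indices are hypothetical and chosen only to show the arithmetic; they are not results from the report.

```python
def correct_classification_rate(observed, predicted):
    """Percentage of samples whose predicted interval matches the observed one."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("at least one sample is required")
    hits = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * hits / len(observed)


# With five equally likely intervals, random assignment is expected
# to be correct one time in five.
N_INTERVALS = 5
random_baseline = 100.0 / N_INTERVALS

# Hypothetical interval indices (0..4) for ten samples.
observed = [0, 1, 2, 3, 4, 2, 1, 0, 4, 3]
predicted = [0, 1, 1, 3, 4, 2, 2, 0, 3, 3]

rate = correct_classification_rate(observed, predicted)
print(f"correct classification rate: {rate:.1f} percent")
print(f"random-assignment baseline: {random_baseline:.1f} percent")
```

Because the intervals are quintiles, each holds about one fifth of the measured values, so the 20-percent figure is the natural reference point for judging the rates.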
Citation Information
Publication Year | 2018 |
---|---|
Title | An exploratory Bayesian network for estimating the magnitudes and uncertainties of selected water-quality parameters at streamgage 03374100 White River at Hazleton, Indiana, from partially observed data |
DOI | 10.3133/sir20185053 |
Authors | David J. Holtschlag |
Publication Type | Report |
Publication Subtype | USGS Numbered Series |
Series Title | Scientific Investigations Report |
Series Number | 2018-5053 |
Index ID | sir20185053 |
Record Source | USGS Publications Warehouse |
USGS Organization | National Water Quality Program |