A machine learning approach to modeling streamflow with sparse data in ungaged watersheds on the Wyoming Range, Wyoming, 2012–17

September 7, 2021

Scant availability of streamflow data can impede the utility of streamflow as a variable in ecological models of aquatic and terrestrial species, especially when studying small streams in watersheds that lack streamgages. Streamflow data at fine resolution and broad extent were needed by collaborators for ecological research on small streams in several ungaged watersheds of southwestern Wyoming, where streamflow data are sparse.

To improve the utility of sparse streamflow data to ecological research in ungaged watersheds, we developed a machine learning approach in R for modeling spatially and temporally continuous monthly streamflow from 2012 through 2017 in three semiarid montane-steppe watersheds (with drainage areas of 26–55 square miles and mean elevations of 8,031–8,455 feet) on the Wyoming Range in the upper Green River Basin. A machine learning streamflow (MLFLOW) model was calibrated and validated with 971 discrete streamflow observations and 24 static and dynamic predictor variables derived from geospatial and time series data on climatic, physiographic, and anthropogenic characteristics affecting streamflow. The predictor variables were temporally and spatially conditioned to amplify the relation of predictor variables to monthly streamflow.

The MLFLOW model had satisfactory agreement between observed and predicted streamflow (coefficient of determination [R²]=0.80, Nash-Sutcliffe efficiency [NSE]=0.79, NSE with log-transformed data [logNSE]=0.82, and percent bias [PBIAS]=0.7 percent). NSE and logNSE indicated the MLFLOW model performed about equally well for high and low flows, and PBIAS indicated the MLFLOW model did not substantially overpredict or underpredict monthly streamflow. Streamflow predictions seemed to represent the annual hydrograph well within the study area during the study period.
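The four goodness-of-fit statistics named above can be sketched in a few lines. The following is an illustrative Python sketch, not the report's R code. It makes several assumptions: R² is computed as the squared Pearson correlation, logNSE uses the natural logarithm and requires positive flows, and PBIAS uses the sign convention in which positive values mean overprediction (the abstract does not state which convention was used). The function names and the sample values are hypothetical.

```python
import math


def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 minus the ratio of the squared error
    to the variance of the observations about their mean."""
    mean_obs = sum(obs) / len(obs)
    squared_error = sum((o - s) ** 2 for o, s in zip(obs, sim))
    obs_variance = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - squared_error / obs_variance


def log_nse(obs, sim):
    """NSE on log-transformed flows, which gives low flows more weight.
    Assumes natural logs and strictly positive values."""
    return nse([math.log(o) for o in obs], [math.log(s) for s in sim])


def pbias(obs, sim):
    """Percent bias. Assumed convention: positive means overprediction."""
    return 100.0 * sum(s - o for o, s in zip(obs, sim)) / sum(obs)


def r_squared(obs, sim):
    """Coefficient of determination, assumed here to be the squared
    Pearson correlation between observed and predicted values."""
    n = len(obs)
    mean_obs = sum(obs) / n
    mean_sim = sum(sim) / n
    cov = sum((o - mean_obs) * (s - mean_sim) for o, s in zip(obs, sim))
    var_obs = sum((o - mean_obs) ** 2 for o in obs)
    var_sim = sum((s - mean_sim) ** 2 for s in sim)
    return cov ** 2 / (var_obs * var_sim)


# Hypothetical monthly streamflow values, for illustration only.
observed = [2.0, 4.0, 8.0, 16.0]
predicted = [2.5, 3.5, 9.0, 15.0]

for name, metric in [("R2", r_squared), ("NSE", nse),
                     ("logNSE", log_nse), ("PBIAS", pbias)]:
    print(name, round(metric(observed, predicted), 3))
```

Comparing NSE with logNSE is what supports a statement about high and low flows: NSE is dominated by errors at high flows, whereas the log transform shifts the weight toward low flows.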

The most important variables (statistically important in the MLFLOW model) for explaining monthly streamflow were temporally and spatially conditioned dynamic climatic variables, mostly precipitation and snow water equivalent. Importance of the static and dynamic variables did not differ substantially among the three watersheds but differed considerably among the 6 years. Monthly streamflow increased with increasing precipitation, snow water equivalent, and drainage area but decreased with increasing forest cover, elevation, evapotranspiration, and temperature.

The MLFLOW model was most sensitive to selection of dynamic climatic variables. Unconditioned dynamic climatic variables alone explained 54 percent of the variance (R²=0.54) in monthly streamflow, whereas adding static physiographic and anthropogenic variables explained only 12 percent more of the variance (R²=0.66). Spatial conditioning of all variables together with temporal conditioning of dynamic variables increased the variance explained in the MLFLOW model by another 14 percent (R²=0.80). The MLFLOW model was also more sensitive to temporal than to spatial differences in the data. Whether the model was trained with observations from all watersheds and years or with one watershed or 1 year left out sequentially, performance was better when tested on observations from each watershed than on observations from each year. Performance was also better for models fitted to fewer sites than for models fitted to fewer months of observations.
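The leave-one-out comparison described above, holding out one watershed or 1 year at a time, can be sketched as a grouping step that is independent of the model itself. This Python sketch is illustrative only and is not taken from the report's R scripts. The function name `leave_one_group_out`, the record fields, and the sample values are all hypothetical.

```python
def leave_one_group_out(records, key):
    """Yield (held_out, train, test) for each distinct value of `key`.

    Each pass trains on every record except those in one group (for
    example one watershed or one year) and tests on the held-out group.
    """
    groups = sorted({record[key] for record in records})
    for held_out in groups:
        train = [r for r in records if r[key] != held_out]
        test = [r for r in records if r[key] == held_out]
        yield held_out, train, test


# Hypothetical streamflow observations, for illustration only.
records = [
    {"watershed": "A", "year": 2012, "flow": 1.2},
    {"watershed": "A", "year": 2013, "flow": 0.9},
    {"watershed": "B", "year": 2012, "flow": 2.4},
    {"watershed": "B", "year": 2013, "flow": 1.7},
    {"watershed": "C", "year": 2012, "flow": 3.1},
    {"watershed": "C", "year": 2013, "flow": 2.2},
]

# Spatial test: hold out each watershed in turn.
for held_out, train, test in leave_one_group_out(records, "watershed"):
    print("watershed", held_out, len(train), len(test))

# Temporal test: hold out each year in turn.
for held_out, train, test in leave_one_group_out(records, "year"):
    print("year", held_out, len(train), len(test))
```

Running the same splitting routine once by watershed and once by year, and then comparing test performance between the two, is one way to judge whether a model is more sensitive to spatial or to temporal gaps in the observations.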

The greatest utility of the modeling approach is the ease of use and the speed of processing input data, running the model, and interpreting the model output, whereas the greatest limitation is the need for spatially and temporally representative streamflow observations to drive the model. Although familiarity with R is necessary, only a working knowledge of hydrology (for selecting appropriate predictor variables and evaluating the quality of streamflow observations) and a rudimentary understanding of machine learning models are needed. Therefore, this modeling approach is practicable for other scientists who work with water but who are not hydrologists.

Publication Year	2021
Title	A machine learning approach to modeling streamflow with sparse data in ungaged watersheds on the Wyoming Range, Wyoming, 2012–17
DOI	10.3133/sir20215093
Authors	Ryan R. McShane, Cheryl A. Eddy-Miller
Publication Type	Report
Publication Subtype	USGS Numbered Series
Series Title	Scientific Investigations Report
Series Number	2021-5093
Index ID	sir20215093
Record Source	USGS Publications Warehouse
USGS Organization	WY-MT Water Science Center

Wyoming-Montana Water Science Center - Helena Office

U.S. Geological Survey

U.S. Department of the Interior

Related Content

Input data, model output, and R scripts for a machine learning streamflow model on the Wyoming Range, Wyoming, 2012-17

Ryan R. McShane

Hydrologist

Cheryl A. Eddy-Miller

Supervisory Hydrologist
