Modeling in support of development of biocriteria for wadeable California streams and rivers
The State Water Resources Control Board (SWRCB) of California has initiated a process to develop biological objectives for wadeable freshwater streams and rivers for the entire state.
Previously the value of protecting aquatic resources was recognized by SWRCB and the Regional Water Quality Control Boards (RWQCB). This interest led to a number of projects funded by the RWQCBs and the California Department of Fish and Game to develop methods of bioassessment for California streams; however, there are presently only narrative objectives for protection of aquatic life beneficial uses. There are no numeric objectives or guidance for using biological data in regulatory decision making.
The absence of biological objectives or the lack of guidance has limited the effectiveness of many Water Board programs, leading to:
- the inability to objectively assess whether aquatic life beneficial uses are supported;
- the inability to assess whether chemical and physical criteria are sufficient to protect aquatic life (i.e., whether permits relying on chemical and physical criteria alone are achieving healthy streams and rivers);
- inconsistencies in identifying impaired waterbodies;
- costly development of biological targets on a project-by-project basis.
These problems can be resolved by employing modern tools for directly measuring and protecting aquatic life and developing thresholds and guidance for assessing the data. SCCWRP, in cooperation with the California Department of Fish and Game (CDFG) and SWRCB, is assembling a technical team to develop the tools needed by managers and policy makers (Ken Schiff, SCCWRP, pers. comm.). Modeling of stressor-response relationships will play an important role in a number of tasks associated with developing biocriteria, including stressor identification, definition of reference conditions, and waterbody classification. USGS has experience developing such models (e.g., Waite et al. 2010, Brown et al., in preparation) in California and other geographic regions. As a result of this experience, USGS has been invited to contribute to the process of developing modeling tools appropriate for aiding decisions by managers and policy makers.
The overall goal of this project is to develop stressor-response models to support development of biocriteria for wadeable streams and rivers in California. Meeting this goal has been divided into two tasks to be accomplished sequentially over the two years of the project:
Task 1: Construct stressor-response models for subregions of California
California is a large state that encompasses a wide range of ecological conditions. Stressors vary across these areas, as do the biological response variables most likely to be sensitive to them. There are also a variety of modeling tools that can be applied to determine the best choices of biological response variables and environmental stressor variables. The scope of work defined in this task addresses developing models for six areas in California that roughly correspond to areas regulated by various RWQCBs and EPA Level III ecoregions. The regions are Southern California, California Desert, Central Valley, Chaparral and Oak Woodlands (surrounding Central Valley), the Sierra Nevada the Eastern Cascade Slopes and Foothills (northeastern California), and the North Coast.
Task 2: Refine stressor-response models as needed
It is unclear whether the models developed as part of Task 1 will be sufficient to establish biocriteria. The six subregions are sufficiently large that there is significant ecological variability within them. The proportion of variability explained by the Task 1 models will be assessed. For subregions where unexplained variability is unacceptably high, models will be developed for more focused regions within each subregion as the data allow.
All biological data for this study have already been gathered and are available from a database maintained by CDFG. The environmental stressor data are also available from CDFG databases. These data are a combination of variables derived from land cover/land use data using Geographic Information Systems software and actual measurements taken at the time of biological sampling. All work will be done in cooperation with the team assembled by SCCWRP to ensure the effort remains focused on the needs of the team for the development of biocriteria. Modeling work will commence after the other members of the team prepare and deliver to USGS the final proofed versions of both the environmental stressor data and the biological response data.
Task 1: Construct stressor-response models for subregions of California
For each subregion, USGS will conduct a preliminary analysis of the available stressor variables to determine whether they are appropriate for the modeling efforts. For each variable we will assess the range of response of the variable and whether it is correlated with other proposed variables. Methods for assessing variables are well established and have previously been applied by the USGS and other members of the SCCWRP team for similar efforts (e.g., Ode et al. 2005, Brown et al. 2009, Waite et al. 2010). Final choices of stressor variables will be made in consultation with appropriate members of the SCCWRP team.
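The screening described above can be sketched as follows. This is a minimal illustration, not the team's established procedure: the variable names, the site values, and the 0.7 correlation cutoff are all assumptions chosen for the example.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def screen_stressors(stressors, max_abs_r=0.7):
    """Return each variable's (min, max) range and the pairs with |r| above the cutoff."""
    ranges = {name: (min(v), max(v)) for name, v in stressors.items()}
    correlated = []
    for a, b in combinations(sorted(stressors), 2):
        r = pearson_r(stressors[a], stressors[b])
        if abs(r) > max_abs_r:  # cutoff is an assumption, not a project standard
            correlated.append((a, b, round(r, 3)))
    return ranges, correlated


# Hypothetical site-level values, for illustration only.
stressors = {
    "pct_urban": [2.0, 10.0, 25.0, 40.0, 60.0],
    "road_density": [0.5, 1.2, 2.9, 4.1, 6.3],
    "conductivity": [310.0, 150.0, 420.0, 200.0, 380.0],
}
ranges, correlated = screen_stressors(stressors)
print(ranges)
print(correlated)
```

In this made-up example only the urban cover and road density pair is flagged, which would prompt a choice between the two variables before modeling.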
We will conduct similar preliminary analyses of available response variables. A large number of response variables (100+) can typically be calculated from the data collected during a bioassessment. Rather than evaluating all such response variables for each subregion, we will limit the number considered to those that previous research has shown to be useful in each area. If an Index of Biotic Integrity (e.g., Ode et al. 2005, Rehn et al. 2008) or O/E model (e.g., Hawkins et al. 2000) has already been developed for a region, it will be included in the response variable subset. Otherwise, we will limit assessment to metrics shown by previous work to be responsive to environmental stress. For example, the total number of taxa at a site, the average EPA tolerance value of all the invertebrate taxa found at a site, and the number of taxa at a site belonging to Ephemeroptera, Plecoptera, and Trichoptera are commonly included in bioassessment programs across the United States and have been found to be responsive in a wide range of environmental settings (e.g., Kashuba et al. 2010). The goal is to limit the number of response variables to about five per subregion to minimize the number of models to be developed.
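The three example metrics named above can be computed from a site's taxa list as in the sketch below. The record layout and the tolerance values are assumptions made for illustration; they are not drawn from the CDFG database or from published EPA tolerance tables.

```python
EPT_ORDERS = {"Ephemeroptera", "Plecoptera", "Trichoptera"}


def site_metrics(records):
    """Compute three metrics from (taxon, order, tolerance) records for one site."""
    # Collapse repeated records so that each taxon is counted once.
    unique = {taxon: (order, tol) for taxon, order, tol in records}
    richness = len(unique)
    ept_richness = sum(1 for order, _ in unique.values() if order in EPT_ORDERS)
    mean_tolerance = sum(tol for _, tol in unique.values()) / richness
    return {
        "taxa_richness": richness,
        "ept_richness": ept_richness,
        "mean_tolerance": mean_tolerance,
    }


# Hypothetical sample; tolerance values are placeholders.
sample = [
    ("Baetis", "Ephemeroptera", 4),
    ("Sweltsa", "Plecoptera", 1),
    ("Hydropsyche", "Trichoptera", 4),
    ("Chironomus", "Diptera", 10),
    ("Physa", "Basommatophora", 8),
    ("Baetis", "Ephemeroptera", 4),  # duplicate record, counted once
]
metrics = site_metrics(sample)
print(metrics)
```

The mean tolerance here is an unweighted average over taxa; an abundance-weighted version would be a different metric and would need counts in each record.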
Once the stressor and biological response variables have been selected, we will develop preliminary models for each subregion. Based on previous experience with such modeling in Oregon and California (Waite et al. 2009, Brown et al., in preparation), we will use multiple linear regression, classification and regression trees (CART), and boosted regression trees. Multiple linear regression is a simple and well-known modeling technique that is easy to interpret. We will assess model performance using the adjusted coefficient of determination (adjusted R2) and the Akaike Information Criterion (AIC). Only models with all regression coefficients significant at P < 0.05 will be considered.
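For reference, the two fit statistics can be computed from observed and fitted values as sketched below. The AIC shown is the common least-squares form, n ln(RSS/n) + 2k, which drops additive constants; counting the intercept and the error variance in k is one convention among several, and the data are hypothetical.

```python
from math import log


def fit_statistics(observed, fitted, n_predictors):
    """Return (R2, adjusted R2, least-squares AIC) for a fitted linear model."""
    n = len(observed)
    mean_obs = sum(observed) / n
    rss = sum((o - f) ** 2 for o, f in zip(observed, fitted))  # residual SS
    tss = sum((o - mean_obs) ** 2 for o in observed)           # total SS
    r2 = 1.0 - rss / tss
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)
    # k counts the slopes, the intercept, and the error variance (assumed convention).
    k = n_predictors + 2
    aic = n * log(rss / n) + 2 * k
    return r2, adj_r2, aic


# Hypothetical observed metric values and fitted values from a one-predictor model.
observed = [10.0, 8.0, 7.0, 5.0, 3.0, 2.0]
fitted = [9.5, 8.5, 6.5, 5.0, 3.5, 2.0]
r2, adj_r2, aic = fit_statistics(observed, fitted, n_predictors=1)
print(round(r2, 4), round(adj_r2, 4), round(aic, 4))
```

Adjusted R2 penalizes each added predictor, and AIC values are meaningful only when compared among models fitted to the same response data.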
Regression trees are one type of technique within the commonly used classification and regression tree (CART) or decision tree family, and their use and technical details have been described extensively in the literature (Breiman et al. 1984, De’ath and Fabricius 2000, Prasad et al. 2006). Trees attempt to explain variation in one categorical (classification) or continuous (regression) response variable by one or more explanatory variables, the resultant output being a dendrogram or tree with varying numbers of branches or nodes. Trees are developed following a hierarchical binary splitting procedure that, at each split, attempts to find the single explanatory variable that best minimizes the within-group dissimilarity and maximizes the among-group dissimilarity in the response variable. The procedure evaluates every explanatory variable entered into model development and can thus provide a list of the explanatory or predictive power of the variables.
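A single step of the binary splitting procedure can be sketched as follows for a continuous response, using the within-group sum of squares as the dissimilarity measure. This is a simplified illustration with hypothetical data and variable names; full CART implementations add recursion, stopping rules, and pruning.

```python
def sum_sq(values):
    """Sum of squared deviations from the mean (within-group dissimilarity)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, within-group SS) for the best binary split of y on x."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate tied predictor values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [v for _, v in pairs[:i]]
        right = [v for _, v in pairs[i:]]
        within = sum_sq(left) + sum_sq(right)
        if best is None or within < best[1]:
            best = (threshold, within)
    return best


def rank_predictors(predictors, y):
    """Rank predictors by the fraction of total SS removed by their best split."""
    total = sum_sq(y)
    scores = {}
    for name, x in predictors.items():
        split = best_split(x, y)
        if split is not None:
            scores[name] = (split[0], 1.0 - split[1] / total)
    return sorted(scores.items(), key=lambda item: item[1][1], reverse=True)


# Hypothetical data: EPT richness at six sites and two candidate stressors.
ept_richness = [18, 17, 19, 6, 5, 4]
predictors = {
    "pct_urban": [2, 5, 8, 30, 45, 60],
    "elevation_m": [900, 120, 640, 300, 810, 150],
}
ranked = rank_predictors(predictors, ept_richness)
print(ranked)
```

In this example the split on urban cover separates the three high-richness sites from the three low-richness sites, so it ranks first; a tree would then repeat the search within each resulting group.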
Bagged trees, random forests, and boosted trees are among a family of techniques used to improve upon single classification or regression trees by averaging the results for each binary split from numerous trees or forests, thus reducing the predictive error and improving overall performance (De’ath 2007). After the initial tree has been generated, boosted trees develop successive trees on reweighted versions of the data, giving more weight to cases that are incorrectly classified than to those that are correctly classified within each growth sequence. Thus, as more and more trees are developed, boosting increases the chance that cases that are initially difficult to classify are correctly classified. Overall, boosted trees retain the positive aspects of single CARTs while improving predictive performance, providing a list of the importance of the explanatory variables, and allowing the importance of nonlinearities and interactions to be tested and assessed (De’ath 2007).
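The reweighting idea can be illustrated with a single AdaBoost-style update, one common formulation of boosting for two-class problems. It is shown only to make the weighting concrete and is not necessarily the algorithm that will be used for the boosted regression trees in this project; the site labels and first-tree predictions are hypothetical.

```python
from math import exp, log


def reweight(weights, labels, predictions):
    """One AdaBoost-style update; returns (alpha, new_weights).

    labels and predictions are +1 or -1, and weights must sum to 1.
    """
    err = sum(w for w, lab, p in zip(weights, labels, predictions) if lab != p)
    if not 0.0 < err < 0.5:
        raise ValueError("weighted error must lie strictly between 0 and 0.5")
    alpha = 0.5 * log((1.0 - err) / err)  # vote given to this tree
    # lab * p is +1 for a correct case and -1 for a misclassified one, so
    # misclassified cases are multiplied by exp(alpha) > 1.
    raw = [w * exp(-alpha * lab * p) for w, lab, p in zip(weights, labels, predictions)]
    total = sum(raw)
    return alpha, [r / total for r in raw]


# Hypothetical example: ten sites labeled unimpaired (+1) or impaired (-1),
# and a first tree that misclassifies the first three sites.
labels = [1, 1, 1, -1, -1, -1, 1, 1, 1, 1]
first_tree = [-1, -1, -1, -1, -1, -1, 1, 1, 1, 1]
start = [1.0 / len(labels)] * len(labels)
alpha, weights = reweight(start, labels, first_tree)
print(round(alpha, 4), [round(w, 4) for w in weights])
```

After the update, the three misclassified sites together carry half of the total weight, so the next tree is pushed toward classifying them correctly.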
Task 2: Refine stressor-response models as needed
It is possible that there is sufficient variability in the intensity of stressors or sensitivity of biological response variables within our selected subregions that the preliminary models might not be useful for defining biocriteria. It is also possible that the boundaries of the subregions might need to be adjusted to facilitate more useful modeling or the original subregions may need to be subdivided. In either situation additional modeling will be required to refine understanding of stressor-response relationships. Decisions about model refinement will be made in cooperation with the SCCWRP team.