National Science Foundation/USGS Internship Opportunities

Models of high-dimensional environmental and ecological data



Generalized linear mixed models (GLMMs) have become popular over the past two decades among scientists and applied statisticians. However, these models come with important limitations. First, their use to address unexplained variation at cluster or group scales is limited by the dimension of the variance-covariance matrices. Multivariate datasets may contain more variables than sampling units, which necessitates dimension reduction when working with GLMMs; approaches for doing so within a GLMM framework are currently under development. The dimensionality problem is exacerbated by additional clustering, as occurs when, for example, vectors of species counts are obtained from secondary sampling units that are themselves clustered within primary sampling units.

A second challenge to the use of GLMMs is that scientific colleagues may prefer marginal inferences over the conditional or cluster-specific inferences offered by GLMMs. For example, scientists may be interested in inferences that are not conditional on sampling unit. However, GLMMs will often be preferred for model development because GLMMs may be tailored to better approximate putative generating processes. Hence, models that are conditional (like GLMMs) but which also permit marginal inferences are often needed. Such models have been explored by statisticians but have not become mainstream and also have not, to our knowledge, been explored for use with ecological or environmental data.
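The distinction between conditional and marginal inference can be illustrated with a minimal sketch. It assumes a random-intercept logistic model, which is only one possible GLMM and is not taken from the project itself; the function names, parameter values, and the simple trapezoid integration are all illustrative choices. The conditional probability is evaluated for a given cluster effect, while the marginal probability averages over the assumed normal distribution of cluster effects.

```python
import math


def inv_logit(eta):
    """Inverse logit link."""
    return 1.0 / (1.0 + math.exp(-eta))


def conditional_prob(x, beta0, beta1, b=0.0):
    """Cluster-specific probability for a cluster with random intercept b."""
    return inv_logit(beta0 + beta1 * x + b)


def marginal_prob(x, beta0, beta1, sigma, n_grid=2001, width=8.0):
    """Population-averaged probability, assuming b ~ Normal(0, sigma^2).

    Integrates the conditional probability against the normal density
    with the trapezoid rule over [-width * sigma, width * sigma].
    """
    if sigma <= 0.0:
        return conditional_prob(x, beta0, beta1)
    lo = -width * sigma
    step = 2.0 * width * sigma / (n_grid - 1)
    total = 0.0
    for i in range(n_grid):
        b = lo + i * step
        density = math.exp(-0.5 * (b / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
        weight = 0.5 if i in (0, n_grid - 1) else 1.0
        total += weight * density * conditional_prob(x, beta0, beta1, b)
    return total * step


# Hypothetical parameter values, chosen only for illustration.
cond = conditional_prob(1.0, beta0=0.0, beta1=2.0)
marg = marginal_prob(1.0, beta0=0.0, beta1=2.0, sigma=2.0)
print(f"conditional (b = 0): {cond:.3f}, marginal: {marg:.3f}")
```

Under these assumptions the marginal probability lies closer to 0.5 than the conditional probability for a typical cluster (b = 0), which is why coefficients from a conditional model cannot be read directly as population-averaged effects.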

An intern would have the opportunity to work on the above cutting-edge issues—or use already-proposed models to tackle natural resource questions that are multivariate in nature. The intern would not be limited to the use of parametric models; concerns with making inferences from multivariate ecological data may be addressed using machine learning or other approaches.

Project Hypothesis or Objectives:

The objective of this project may be theoretical or applied. A theoretical objective would focus on elaborating current methods for making inferences or predictions from multivariate and moderately high-dimensional data, often consisting of regular and irregular time series. Such an approach would entail evaluating a proposed method using simulated data as well as an example environmental or ecological dataset. An applied approach would focus on using recently developed computational methods that have seen little application to natural resource questions. A project might have both theoretical and applied components. An example would be the development and/or use of multivariate generalized linear mixed models with fish counts from multiple species to estimate fish community associations with environmental predictors. Another would estimate mercury concentrations in fish of a given species, length, location and year when measurements are left-censored and when concentrations for many species-length-year combinations are missing from most location-year combinations; note that this example is multivariate in species and incorporates clustering within locations and years.
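As a minimal sketch of how left censoring might enter a likelihood, the example below assumes that log concentrations follow a single normal distribution and ignores species, length, and clustering entirely; the function names and the example values are hypothetical. An observed value contributes its density, while a value reported only as below a detection limit contributes the probability of falling below that limit.

```python
import math


def normal_logpdf(x, mu, sigma):
    """Log density of Normal(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of Normal(mu, sigma^2) at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def censored_loglik(log_conc, is_censored, mu, sigma):
    """Log-likelihood with left censoring.

    log_conc holds the log concentration for observed values and the
    log detection limit for censored values; is_censored flags which
    entries are censored.
    """
    total = 0.0
    for y, censored in zip(log_conc, is_censored):
        if censored:
            # Only known to be below the detection limit.
            total += math.log(normal_cdf(y, mu, sigma))
        else:
            total += normal_logpdf(y, mu, sigma)
    return total


# Hypothetical values: two measured log concentrations and one non-detect.
values = [-1.2, -0.4, -2.0]
flags = [False, False, True]
print(censored_loglik(values, flags, mu=-1.0, sigma=0.8))
```

A model of the kind described above would replace the single mean with a structured predictor and add random effects for locations and years; the censored contribution would keep the same form.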

Duration: Up to 12 months

Internship Location: La Crosse, WI

Field(s) of Study: Life Science, Computing, Statistics, Data Science

Applicable NSF Division: OCE Ocean Sciences, DEB Environmental Biology, BD HS Big Data Regional Innovation Hubs and Spokes, HPC High Performance Computing, SES Social and Economic Sciences, DMS Mathematical Sciences, CISE Computer and Information Science and Engineering

Intern Type Preference: Any Type of Intern

Keywords: Statistical models; environmental science; ecology

Expected Outcome:

Expected outcomes include exposure of the intern to natural resource science practice in the USGS; evaluation of methods for use with multivariate and clustered data (theoretical objective) or use of recently described methods to obtain inferences from multivariate natural resource data (applied objective); and one or more peer-reviewed publications.

Special skills/training Required:

Familiarity with statistical methods (including probability distributions and generalized linear models) and/or machine learning, and with analytical software (e.g., R or SAS); if a statistical approach will be taken, the intern will need to have completed undergraduate- or, preferably, graduate-level mathematical statistics courses. Modest familiarity with hydrology, ecology or inorganic chemistry is not required but would be helpful.


The intern will develop models or methods appropriate for use with multivariate and clustered environmental or ecological data. Models will be evaluated using statistical or mathematical software and both simulated and measured data. The intern will have the opportunity to work in a natural resource setting, to interact with natural resource scientists, and to share findings with partner natural resource agencies and the scientific community, the latter via one or more peer-reviewed publications.  


Brian Gray, PhD

Research Statistician
Upper Midwest Environmental Sciences Center
Phone: 608-781-6234