The USGS Science Data Lifecycle Model
By John L. Faundeen, Thomas E. Burley, Jennifer Carlino, David L. Govoni, Heather S. Henkel, Sally L. Holl, Vivian B. Hutchison, Cassandra C. Ladino, Elizabeth Martin, Ellyn Montgomery, Steve Tessler, and Lisa Zolly
Figure 1. USGS Science Data Lifecycle Model.
In 2010, the USGS Community for Data Integration (CDI) established a Data Management Working Group (DMWG) to develop and recommend best practices or policies that would assist the agency in effectively documenting, preserving, and providing access to its science data. To provide a framework for these activities, a DMWG subteam investigated data lifecycle models that could serve as a foundation for USGS data management processes. The model developed by the subteam was approved in November 2012 by CDI executive sponsors Kevin Gallagher, Associate Director for Core Science Systems, and Linda Gundersen, then Director of the Office of Science Quality and Integrity. An Open-File Report containing more details about the Science Data Lifecycle Model is currently in the USGS publication process.
Primary Model Elements
PLAN: a critical data management activity that is a component of the larger process of planning the research project. In the context of the data lifecycle, the Plan stage covers all activities related to the handling of the project’s data assets, from project inception to publication and beyond.
ACQUIRE: represents the point at which new data are collected or generated, or existing data are obtained. USGS stream gage data, historical maps, seismic motion sensor output, biological specimens, and satellite observations are examples of acquired data and information that represent the diverse and robust variety of science data inputs to the USGS data lifecycle.
PROCESS: represents the activities associated with the preparation of various new or existing acquired data inputs. Processing of input data may entail data format transformations; integration; extract, transform, and load operations; or calibration activities to prepare the data for analysis.
ANALYZE: represents the activities associated with the exploration and interpretation of well-managed, processed data for the purpose of knowledge discovery. Analytical methods might include statistical analysis, spatial analysis, or modeling and are used to produce information that is of value to decisionmakers and the public.
PRESERVE: represents the activities associated with preserving data for long-term use and accessibility. Although it is not recommended, preservation is often not considered until the final stage of a project, and it can be further compromised by project budgets, timetables, and rapid changes in technologies and formats.
PUBLISH/SHARE: combines the Bureau’s concepts of traditional peer-reviewed publication with the venues by which USGS makes its rich data stores available through Web sites, data catalogs, and social media.
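As an illustration of the PROCESS element described above, the following minimal sketch parses raw readings, applies a calibration offset, and converts units so the data are ready for analysis. The column names (site, stage_ft), the offset parameter, and the function name are hypothetical and are not part of the USGS model; the sketch only shows the kind of transformation the element refers to.

```python
import csv
import io

FEET_TO_METERS = 0.3048  # exact by definition of the international foot


def process_readings(raw_csv: str, offset_ft: float = 0.0) -> list[dict]:
    """Parse raw CSV text, apply a calibration offset, and convert feet to meters.

    The column names "site" and "stage_ft" are assumptions made for this sketch.
    """
    processed = []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        # Calibration: correct the raw reading by a known sensor offset.
        stage_ft = float(row["stage_ft"]) + offset_ft
        # Format transformation: store the value in meters.
        processed.append(
            {"site": row["site"], "stage_m": round(stage_ft * FEET_TO_METERS, 4)}
        )
    return processed


if __name__ == "__main__":
    sample = "site,stage_ft\nA,10.0\nB,2.5\n"
    for record in process_readings(sample, offset_ft=0.1):
        print(record)
```

Keeping the raw input unchanged and writing the processed values to a new structure means the acquired data remain available for preservation and for reprocessing if the calibration changes.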
Cross-Cutting Model Elements
Each of the primary elements of the Science Data Lifecycle Model addresses discrete activities and outputs unique to that stage. However, other critical activities must be performed continuously across all stages of the lifecycle, including:
DESCRIBE (METADATA, DOCUMENTATION) highlights the importance of step-wise documentation throughout the data lifecycle. Beginning with the data management plan, this element emphasizes detailed lifecycle documentation using recognized community standards.
MANAGE QUALITY involves establishing quality-assurance measures for data at the project’s inception and then undertaking ongoing quality-control monitoring at subsequent lifecycle stages to verify that those measures perform as expected as the project proceeds.
BACKUP & SECURE involves managing physical risks to the data throughout the data lifecycle. Routine local backups are critical to prevent the physical loss of data before the data are preserved for the long term under the PRESERVE element. Preventive measures should address not only the raw and processed research data, but also plans, analysis methods, published products, and associated metadata.
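The routine local backup described under BACKUP & SECURE can be sketched as a copy followed by a checksum comparison, so that a damaged copy is detected at once rather than at the end of the project. This is a minimal sketch under stated assumptions: the function names, the file name, and the choice of SHA-256 are illustrative and are not prescribed by the USGS model.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_with_check(source: Path, backup_dir: Path) -> Path:
    """Copy source into backup_dir and confirm the copy matches the original."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)  # copy2 also preserves file timestamps
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"Backup of {source} failed checksum verification")
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        raw = Path(tmp) / "gage_readings.csv"  # hypothetical file name
        raw.write_text("site,stage_ft\nA,3.2\n")
        copy = backup_with_check(raw, Path(tmp) / "backup")
        print("verified backup written to", copy)
```

The same routine applies equally to plans, analysis scripts, and metadata files, since the element calls for protecting those alongside the raw and processed data.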
While multiple personnel may oversee the various elements, the project lead is responsible for ensuring that each element, including those that are cross-cutting, is addressed throughout the life of the project. Incorporating the Science Data Lifecycle Model into research project planning will help ensure that the science the USGS produces, and the data upon which it rests, will be preserved, accessible, well described, and fit for reuse. For more information about science data management, see http://www.usgs.gov/datamanagement/.