USGS Data Management
Data Lifecycle Overview
When we start thinking of our data as corporate assets with value beyond our immediate need, the idea of managing data through a whole lifecycle becomes more relevant. All of the questions of documentation, storage, quality assurance, and ownership then need to be answered for each stage of the data lifecycle, starting with the recognition of a need, and ending with archiving or updating the information.
Why a Data Lifecycle?
From the time you decide to collect or use data until they become obsolete or no longer needed, those data must be accounted for and managed. Further, as with any other asset, USGS cannot justify or afford acquiring data it does not need. Data should be acquired and maintained only to meet a scientific need.
Data management best practices parallel broader asset management practices in establishing standards and procedures that are documented, well defined, and consistent. The goal is to eliminate waste and abuse of the taxpayers' money while providing the information resources needed to operate efficiently in an era of shrinking budgets. Just as we would want our financial advisor to be a good steward of our investment dollars, so we need to recognize our obligation to be good data stewards on behalf of those who pay our salaries.
The USGS Data Lifecycle
Plan: A Data Management Plan is a documented sequence of intended actions to identify and secure resources and to gather, maintain, secure, and utilize data holdings. It also includes the procurement of funding and the identification of the technical and staff resources needed for full-lifecycle data management. Once the data needs are determined, a system to store and manipulate the data can be identified and developed.
Acquire: Acquisition involves collecting or adding to the data holdings. There are four methods of acquiring data: collecting new data; converting/transforming legacy data; sharing/exchanging data; and purchasing data.
Process & Analyze: Why Are They Not Covered on This Web Site?
Data processing and analysis are highly specialized activities; covering their many permutations across USGS science is beyond the scope of this Web site.
Process: Processing denotes actions or steps performed on data to verify, organize, transform, integrate, and extract data in an appropriate output form for subsequent use. This includes organizing data files and content; synthesizing or integrating data; transforming formats; and, in some cases, calibrating sensors and other field and laboratory instrumentation. Both raw and processed data require complete metadata to ensure that results can be duplicated. Methods of processing must be rigorously documented to ensure the utility and integrity of the data.
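The points above about transformation and documentation can be sketched together: a minimal, hypothetical processing step that converts raw field readings to a standard unit while appending a human-readable log entry so the step can be duplicated. The field names (site, temp_f) and the Fahrenheit-to-Celsius conversion are illustrative, not a USGS standard.

```python
import csv
import io

def process_temperature_records(raw_csv, log):
    """Convert raw temperature readings (deg F, as text) to deg C floats.

    Hypothetical example: column names and the unit conversion are
    illustrative. Each transformation is recorded in `log` so the
    processing can be duplicated later.
    """
    reader = csv.DictReader(io.StringIO(raw_csv))
    processed = []
    for row in reader:
        temp_f = float(row["temp_f"])
        processed.append({"site": row["site"],
                          "temp_c": round((temp_f - 32) * 5 / 9, 2)})
    # Document the processing step alongside the output data.
    log.append("Converted temp_f (deg F) to temp_c (deg C), rounded to 2 places")
    return processed

raw = "site,temp_f\nA1,68.0\nA2,32.0\n"
processing_log = []
result = process_temperature_records(raw, processing_log)
print(result)   # [{'site': 'A1', 'temp_c': 20.0}, {'site': 'A2', 'temp_c': 0.0}]
```

Keeping the log next to the data is one simple way to satisfy the requirement that methods of processing be rigorously documented.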
Analyze: Analysis involves actions and methods performed on data that help describe facts, detect patterns, develop explanations, and test hypotheses. This includes data quality assurance, statistical data analysis, modeling, and interpretation of analysis results.
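As a minimal sketch of the "describe facts, detect patterns" step, the fragment below summarizes a small sample of measurements and checks for a simple monotone-increase pattern before any interpretation. The measurement values are invented for illustration.

```python
import statistics

# Hypothetical measurement series; values are illustrative only.
measurements = [12.1, 12.4, 12.9, 13.3, 13.8]

# Describe facts: basic descriptive statistics.
summary = {
    "n": len(measurements),
    "mean": round(statistics.mean(measurements), 2),
    "stdev": round(statistics.stdev(measurements), 2),
}

# Detect a pattern: is the series strictly increasing?
increasing = all(a < b for a, b in zip(measurements, measurements[1:]))

print(summary, increasing)  # {'n': 5, 'mean': 12.9, 'stdev': 0.68} True
```

A real analysis would follow such descriptive checks with formal statistical tests or modeling, as the text notes.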
Preserve: Preservation involves actions and procedures to keep data for some period of time and/or to set data aside for future use, and includes data archiving and/or data submission to a data repository. A primary goal for the USGS is to preserve well-organized and documented datasets that support research interpretations that can be re-used by others; all research publications should be supported by associated, accessible datasets. Data must be disposed of in accordance with a written policy that conforms to the requirements of the National Archives and Records Administration (NARA). Correct and prompt disposal of outdated information may reduce the Bureau's risk in some FOIA requests or legal actions, by demonstrating strict conformance to written policy and eliminating incorrect, outdated, or irrelevant information from the record.
Publish/Share: The ability to prepare and issue, or disseminate, quality data to the public and to other agencies is an important part of the lifecycle process. The data should be medium- and agent-independent, with an understanding that transfer may occur via automated or non-automated mechanisms. We need to ensure that data are shared, but with controls to protect proprietary and pre-decisional data and the integrity of the data itself. Data sharing also requires complete metadata to be useful to those who are receiving the data.
Describe (Metadata, Documentation): Throughout the data lifecycle, documentation must be updated to reflect actions taken upon the data. This includes acquisition, processing, and analysis, but may touch upon any stage of the lifecycle. Updated and complete metadata are critical to maintaining data quality. The key distinction between metadata and documentation is that metadata, in the standard sense of "data about data," formally describes key attributes of each data element or collection of elements, while documentation makes reference to data in the context of their use in specific systems, applications, and settings. Documentation also includes ancillary materials (e.g., field notes) from which metadata can be derived. In the former sense, it's "all about the data;" in the latter, it's "all about the use."
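A minimal metadata record might look like the sketch below. The field names loosely echo common metadata elements (title, originator, abstract, process steps) but are hypothetical and do not follow a formal FGDC or ISO schema; a real USGS record would use the applicable standard.

```python
import json

# Hypothetical, minimal metadata record; field names are illustrative,
# not a formal metadata standard.
metadata = {
    "title": "Stream temperature observations, Site A1",
    "originator": "Example Science Center",
    "abstract": "Hourly stream temperature readings from a single site.",
    "date_collected": "2023-07-01",
    # Process steps recorded so results can be duplicated (see Process above).
    "process_steps": [
        "Converted raw sensor voltages to degrees Celsius",
        "Flagged readings outside instrument range",
    ],
}

record = json.dumps(metadata, indent=2)
print(record)
```

Storing such a record alongside the dataset keeps the "data about data" current as actions are taken at each lifecycle stage.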
Not all projects will utilize every aspect of the data lifecycle, nor will all projects use the data lifecycle in the same way. Some may not follow the paths as depicted and some may circle back on certain elements.
Manage Quality: Protocols and methods must be employed to ensure that data are properly collected, handled, processed, used, and maintained at all stages of the scientific data lifecycle. This is commonly referred to as "QA/QC" (Quality Assurance/Quality Control). QA focuses on building quality in to prevent defects, while QC focuses on testing for quality (e.g., detecting defects). QA makes sure you are doing the right things, the right way; QC makes sure the results of what you've done are what you expected.
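A common QC step of the "detect defects" kind is a range check: flag any reading outside the physically plausible range for the instrument. The sketch below is a minimal, hypothetical example; the range limits and sample values are invented.

```python
def qc_range_check(values, low, high):
    """QC check: return (index, value) pairs outside the expected range.

    Hypothetical example; `low`/`high` would come from instrument specs
    or documented protocol in practice.
    """
    return [(i, v) for i, v in enumerate(values) if not (low <= v <= high)]

# -999.0 is a typical sensor error sentinel; 120.5 is physically implausible.
readings = [12.1, 11.8, -999.0, 13.4, 120.5]
flags = qc_range_check(readings, low=-5.0, high=45.0)
print(flags)  # [(2, -999.0), (4, 120.5)]
```

The QA counterpart would be preventive: documenting the protocol, calibrating the sensor, and training staff so that such defects are less likely to occur in the first place.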
Back Up & Secure: Steps must be taken to protect data from accidental loss, corruption, and unauthorized access. This includes routinely making additional copies of data files or databases that can be used to restore the original data or to recover earlier instances of the data.
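A simple way to make a backup copy trustworthy is to verify it with a checksum after copying. The sketch below is a minimal illustration using Python's standard library; the file names and layout are hypothetical, and a production backup would also handle rotation, off-site copies, and access controls.

```python
import hashlib
import os
import shutil
import tempfile

def sha256_of(path):
    """Compute the SHA-256 digest of a file, reading in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def backup_with_verification(src, dst):
    """Copy a data file, then confirm the copy matches the original."""
    shutil.copy2(src, dst)
    if sha256_of(src) != sha256_of(dst):
        raise IOError("backup verification failed: checksums differ")
    return sha256_of(dst)

# Demo with a temporary file (paths here are illustrative only).
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "data.csv")
dst = os.path.join(tmpdir, "data.csv.bak")
with open(src, "w") as f:
    f.write("site,temp_c\nA1,20.0\n")

digest = backup_with_verification(src, dst)
print("backup verified, sha256 prefix:", digest[:12])
```

Keeping the recorded digest with the backup also lets later recovery efforts confirm that an archived copy has not been corrupted in storage.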