Data-quality management is the process of employing protocols and methods to ensure that data are properly collected, handled, processed, used, and maintained at all stages of the scientific data lifecycle.
Quality Assurance (QA) & Quality Control (QC)
QA & QC are often used interchangeably, but they mean different things. QA refers to defect prevention, whereas QC refers to defect detection. Generally, QA is applied before and during data acquisition, whereas QC is applied after data are in hand.
What is a Data 'Defect'?
In a data context, a 'defect' is any data issue that negatively affects fitness for use, such as numeric value errors, incorrect classification terms, gaps in a data series, or failed data transformations.
Yes, you can plan ahead for high-quality data! A Quality Assurance Plan (QAP) is used to define the criteria and processes that will ensure and verify that data meet specific data-quality objectives throughout the Data Lifecycle. Some agencies and organizations require a QAP as part of a research proposal, before funding a project (for example, USEPA). Like the DMP, the QAP (if a separate document) would be revised as needed during a project timeline to reflect the reality of the data workflow and activities.
Preventing the creation of defective data is the most effective means of ensuring the ultimate quality of your data products and the research that depends upon them. QA refers to utilizing written criteria, methods, and processes that will ensure the production of data that meet a specified quality standard.
Quality by Design
Having a plan for how to store, enter, edit, and manipulate data BEFORE data collection will save time and directly affect your ability to use those data. By starting with a conceptual design (or schema) of the data you can ensure that you have considered all of the data you intend to store, the data types they represent, the relationships between different chunks of data, and the data domains that will support the primary data you collect.
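As a minimal sketch of what a conceptual design might look like in practice, the snippet below defines two related record types for a hypothetical field-survey dataset. The field names, types, and the site/observation relationship are illustrative assumptions, not part of any USGS standard:

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical schema for a field-survey dataset: each Observation
# links back to a Site, and each field has a declared data type.

@dataclass
class Site:
    site_id: str          # primary key; observations reference this
    name: str
    ecoregion: str        # classification term drawn from a fixed domain

@dataclass
class Observation:
    site_id: str          # relationship: points at Site.site_id
    obs_date: date
    water_temp_c: float   # numeric measurement, degrees Celsius

site = Site(site_id="S001", name="Clear Creek", ecoregion="Central Basin")
obs = Observation(site_id=site.site_id,
                  obs_date=date(2017, 6, 1),
                  water_temp_c=14.2)
```

Writing the schema down first, even this informally, forces the decisions about data types, keys, and relationships to happen before collection rather than during analysis.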
Domain Management and Reference Data
Terms used to classify or describe data elements can help or hurt the usefulness of a dataset. Data domains and reference data define the allowable values for an attribute and are often implemented as lookup tables or drop-down boxes on forms. Descriptive terms (such as color and size) are relative, whereas classification terms (such as ecoregion or land use category) are more discrete.
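A lookup table of this kind can be as simple as a dictionary of allowable codes. The land-use codes below are hypothetical, chosen only to illustrate domain checking:

```python
# Hypothetical lookup table defining the allowable values (the domain)
# for a 'land_use' attribute; the codes are illustrative only.
LAND_USE_DOMAIN = {
    "URB": "Urban",
    "AGR": "Agricultural",
    "FOR": "Forest",
    "WET": "Wetland",
}

def in_domain(code: str) -> bool:
    """Return True if the code is an allowable domain value."""
    return code in LAND_USE_DOMAIN

records = [{"site": "S001", "land_use": "FOR"},
           {"site": "S002", "land_use": "FRO"}]  # typo: not in the domain

out_of_domain = [r for r in records if not in_domain(r["land_use"])]
```

Entering data through a form bound to such a table prevents the out-of-domain typo from ever being recorded, which is QA in action.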
Quality control (QC) of data refers to the application of methods or processes that determine whether data meet overall quality goals and defined quality criteria for individual values. To determine whether data are 'good' or 'bad' - or to what degree they are so - one must have a set of quality goals and specific criteria against which data are evaluated. Rapid data-scanning methods can be used to tag records or sets of records that meet or fail to meet a particular criterion. Remember that QC is a partner to QA: when errors are found, a way to prevent them via QA might also be revealed.
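A rapid data scan of this sort can be a simple pass over the values, tagging each one against the defined criteria. The water-temperature bounds below are assumed for illustration, not an agency standard:

```python
# Hypothetical quality criteria for a water-temperature series:
# values must be present and fall within a plausible range.
TEMP_MIN_C, TEMP_MAX_C = 0.0, 40.0

def qc_scan(values):
    """Tag each value as 'pass' or with the criterion it failed."""
    tags = []
    for v in values:
        if v is None:
            tags.append("fail:missing")       # gap in the series
        elif not (TEMP_MIN_C <= v <= TEMP_MAX_C):
            tags.append("fail:range")         # outside plausible bounds
        else:
            tags.append("pass")
    return tags

tags = qc_scan([14.2, 98.6, None, 21.0])
```

The tagged records can then be routed for review, and a recurring failure mode (say, a sensor reporting Fahrenheit) points back to a QA fix at the collection step.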
Data Quality Assessment and Review
Project staff should perform periodic data assessments during the project cycle to discover errors before project completion. These reviews do not need to be overly complicated; they serve as an opportunity to keep your data management plan, quality goals and metrics, and metadata up to date, and to document adherence to your quality plan. Data from outside sources should be assessed for quality issues prior to use, and real-time and streaming data processes should include some level of quality control.
Using Data Quality Indicators
The quality of individual measurements or observations should not be hidden in the metadata or documentation associated with a dataset. Rather, indicators of quality or usability can and should be stored with the data themselves, in separate fields or columns. This allows potential data users to determine which values are fit for specific uses, and to avoid re-validating unusual values that have already been reviewed and justified.
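Stored this way, a quality indicator is just another column that travels with each value. The flag codes below ('A' approved, 'E' estimated, 'R' rejected) are illustrative assumptions, not a formal standard:

```python
# Hypothetical measurement records with a quality-indicator column
# stored alongside the values themselves.
rows = [
    {"site": "S001", "temp_c": 14.2, "qc_flag": "A"},
    {"site": "S001", "temp_c": 39.5, "qc_flag": "E"},  # unusual but reviewed
    {"site": "S002", "temp_c": -9.0, "qc_flag": "R"},  # failed QC
]

# A data user filters on the flag to keep only values fit for their use,
# without re-validating the unusual-but-justified 'E' record.
usable = [r for r in rows if r["qc_flag"] in {"A", "E"}]
```

Because the flag sits in the data rather than in a separate document, any downstream tool can apply the same fitness-for-use filter automatically.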
Describing your data, like managing quality, is a cross-cutting element of the USGS Science Data Lifecycle. In addition to using data quality indicators within your dataset, quality-management documentation may take the form of a QAP or sections within the DMP about specific quality goals and criteria, along with any quality assessment summaries and notes on massaging data to meet the content needs of your project. The FGDC metadata standard includes sections specifically reserved for Data Quality Information.
Responsibilities for quality work and work products are reflected within the Code of Conduct for Department of the Interior staff (poster), specifically to ensure the highest level of data quality in scientific and scholarly information products:
"I will be responsible for the quality of the data I use or create and the integrity of the conclusions, interpretations, and applications I make. I will adhere to appropriate quality assurance and quality control standards, and not withhold information because it might not support the conclusions, interpretations, and applications I make."
"The USGS provides unbiased, objective scientific information upon which other entities may base judgments. Since its inception in 1879, the USGS has maintained comprehensive internal and external procedures for ensuring the quality, objectivity, utility, and integrity of data, analyses, and scientific conclusions. ... Information Quality ... covers all information produced by the USGS in any medium, including data sets, web pages, maps, audiovisual presentations in USGS-published information products, or in publications of outside entities."
General Policies that apply to Data Quality within the USGS [Links Verified November 30, 2017]
- OMB Guidelines for Ensuring and Maximizing the Quality, Objectivity, Utility, and Integrity of Information Disseminated by Federal Agencies
- DOI Data Quality Management Guide (pdf)
- U.S. Geological Survey Information Quality Guidelines
USGS Fundamental Science Practices [Links Verified November 30, 2017]
- SM 500.25 - Scientific Integrity
- SM 502.3 - Fundamental Science Practices: Peer Review
- SM 502.4 - Fundamental Science Practices: Review, Approval, and Release of Information Products
- SM 502.7 - Fundamental Science Practices: Metadata for USGS Scientific Information Products Including Data
- SM 502.8 - Fundamental Science Practices: Review and Approval of Scientific Data for Release
- Dilbert on Data Quality: Scott Adams offers serious insight into TQM [Link Verified November 30, 2017]
- Chapman, A.D., 2005, Principles of Data Quality, version 1.0 (pdf)
- Helsel, D.R., and Hirsch, R.M., 2002, Statistical Methods in Water Resources: Techniques of Water Resources Investigations, Book 4, Chapter A3. U.S. Geological Survey. 522 pages. [Link Verified November 30, 2017]
- DataONE education modules. [Link Verified July 17, 2017]
- Hook, Les A., Suresh K. Santhana Vannan, Tammy W. Beaty, Robert B. Cook, and Bruce E. Wilson. 2010. Best Practices for Preparing Environmental Data Sets to Share and Archive. Available online from Oak Ridge National Laboratory Distributed Active Archive Center, Oak Ridge, Tennessee, U.S.A. [Link Verified July 17, 2017]
- A. D. Chapman, "Principles of Data Quality: Report for the Global Biodiversity Information Facility" (Global Biodiversity Information Facility, Copenhagen, 2004). [Link Verified July 17, 2017]