Skip to main content

Data Standards

Data standards are the guidelines by which data are described and recorded. In order to share, exchange, combine and understand data, we must standardize the format as well as the meaning.

Why do we need Data Standards?

Standards make it easier to create, share, and integrate data by ensuring that the data are represented and interpreted correctly. Standards also reduce the time spent cleaning and translating data. Cleansing “dirty data” is a common barrier encountered by scientists, taking 26% of data scientists’ on-the-job time (Anaconda, 2020). For example, when integrating datasets from different sources, each of which used a different format for their date variable (e.g., April 2, 2024, 04-02-24, 04/02/2024), it would be a time consuming task to interpret and convert the dates into a common format before integrating the data.

Hear from a USGS Biologist about how data standards increase efficiency in research.

 

Dataset-level standards

Dataset-level standards specify the scientific domain, structure, relationships, field labels, and parameter-level standards for the dataset as a whole. A dataset-level standard is normally documented with a data dictionary (link to data dictionary page). See below for examples of formal dataset-level standards: 

 

Parameter-level standards

Parameter-level standards define the format and units for a given parameter or field within a dataset and help users correctly interpret the values. Parameter-level standards should be adopted at the time of data collection, that is when values in a field are created or recorded.  If standardization of a parameter in an existing dataset will result in any loss of original detail or information, a best practice is to retain the original parameter and add a separate field for the standardized parameter.  The Darwin Core standard, for example, provides verbatim fields for this purpose. 

Below are some examples of common earth and biological science data parameters and data standards. 

Temporal 

Date / time 

  • Data Standard: ISO 8601 
  • Format: YYYY-MM-DD or YYYY-MM-DDT:HH:MM:SS+00:00  
  • Example (Mountain Standard Time (MST)): 2020-08-11T11:02:49-07:00 
The current chart from Divisions of Geologic Time, 2018
Divisions of Geologic Time approved by the U.S. Geological Survey Geologic Names Committee, 2018. (Public domain.)

Geologic time 

Geographic location descriptors  

Geographic coordinates  

  • Data Standard: ISO 6709:2008 

  • Format: ±90.00 and ±180.00 (precision documented by number of decimal places and contingent upon equipment used) 

  • Example: Latitude: 42.3300; Longitude: -98.1449 

Geographic names / codes 

Watershed boundaries  

Country codes 

  • Data Standard: ISO-3166 

  • Example:  Full name: the United States of America; Alpha-3 code: USA; Numeric code: 840

U.S. state and county codes 

Classification standards 

If ITIS does not address your nomenclature requirements, (contact the ITIS team (itiswebmaster@itis.gov)); and/or reference another appropriate taxonomic authority .

 
  event
Date
decimal
Latitude
decimal
Longitude
geodetic
Datum
higher
GeographyID
country
Code
scientific
Name
taxon
ID
name
According
ToID
Value example 2010-05-17  42.33 -98.1449 WGS84  31003 US  Agapostemon virescens  154352  Integrated Taxonomic Information System – https://www.itis.gov 
Parameter level standard ISO-8601  ISO 6709:2008 ISO 6709:2008 WGS84 FIPS code for Antelope county, Nebraska ISO 3166-1 alpha-2  ITIS ITIS taxonomic serial number (TSN) corresponding to this scientific name Not Applicable 
close up of image
Close-up of an Agapostemon virescens specimen.(Credit: USGSBIML Team, .Public domain.)

 

Table CaptionThis excerpt from an occurrence record for a U.S. native bee species uses parameter-level standards recommended by the Darwin Core dataset-level standard (the full record can be viewed at https://www.gbif.org/occurrence/1456598984). 

Data encoding and interface standards

Most dataset-level standards (see above) also offer guidance on how to encode data. Data encoding standards define the rules for structuring and organizing data for use in a given context. These standards ensure that when applications read data, the information and context is preserved (OGC, 2020a). Data encoding standards are generally associated with a file format (see File Formats page). Researchers should use universal, open source data encoding standards and long-term access, open file formats whenever possible. Character encoding standards, such as Unicode Transformation Format (UTF-8) ensure that characters in the data are correctly interpreted. 

Below are some examples of open data encoding standards used in the earth and biological sciences. Abbreviations and acronyms are defined at the end of this section. 

  • GeoTIFF and Cloud Optimized GeoTIFF: provides the rules for describing geographic image data using the TIFF file format 

  • GeoJSON: defines the rules for describing geographic features using JSON.  

  • NetCDF: supports electronic encoding of geospatial data, specifically digital geospatial information representing space and time-varying phenomena, using the HDF file format.   

  • OGC GeoPackage: defines the rules for what goes into the GeoPackage file format, which is an alternative to Esri’s proprietary yet popular Shapefile* format. *Although proprietary, the technical specifications are open

  • NARA RFC 4180: There is no single encoding standard for creating CSV files; however, the Library of Congress uses the NARA RFC 4180 as the encoding format specification to define the structure of CSV files. 

  • OGC Web Map Service: allows users to remotely access georeferenced map images via HTTPS requests. 

  • OGC Web Coverage Service: allows users to access online geospatial data in multiple raster-based data formats (e.g. GeoTiffs, .img, ENVI (.hdr) file types). 

  • GML Web Feature Service: allows users to access online geospatial data at the feature level using formats such as Shapefile, GML, etc. 

 

Abbreviations and Acronyms 

  • CSV – comma separated values 
  • ENVI - ENvironment for Visualizing Images 
  • GML - Geography Markup Language 
  • HTTPS – Hypertext Transfer Protocol Secure  
  • JSON - JavaScript Object Notation 
  • NARA RFC - National Archives and Records Administration Request For Comments 
  • NetCDF - Network Common Data Form 
  • OGC - Open Geospatial Consortium 
  • TIFF - Tag Image File Format

 

Documenting data standards in metadata

Image of a metadata record, with focus on the entity overview section, which indicates that the Darwin Core standard was used.
Screenshot of a metadata record's entity overview section, indicating use of Darwin Core.

Parameter-level and dataset-level data standards should be documented in the accompanying data dictionary and metadata record. For example, if following the Content Standard for Digital Geospatial Metadata (CSDGM), the dataset-level data standards can be documented in the Entity and Attribute Overview Description section of the metadata and the parameter-level data standards can be documented in the Entity and Attribute Detailed Description for each attribute described in the metadata record.  

For more on metadata and data dictionaries, see Metadata Creation and Data Dictionaries.

 

Where can I find other relevant data standards?

In any given Federal agency, there can be multiple groups of experts that produce standards and authorities that approve them, often based on the science topic. There is no single group that establishes or recommends standards for the USGS. 

 

Community Guidelines / Norms 

 Standards often evolve out of communities of practice coming together and agreeing on common practices. Here is an example of an evolving best practice that hasn’t been formally ratified by a standards authority. Does your USGS project use community guidelines that you’d like us to link here? Contact GS_Data_Management@usgs.gov

 

What the U.S. Geological Survey Manual Requires: 

The USGS Survey Manual Chapter 502.2 - Fundamental Science Practices: Planning and Conducting Data Collection and Research addresses data and metadata standards:

"The data collected and the techniques used by USGS scientists should conform to or reference national and international standards and protocols if they exist and when they are relevant and appropriate. For datasets of a given type, and if national or international metadata standards exist, the data are indexed with metadata that facilitate access and integration."

USGS Survey Manual Chapter SM 502.6 - Fundamental Science Practices: Scientific Data Management Foundation specifies that a data management plan will include standards and intended actions as appropriate to the project for acquiring, processing, analyzing, preserving, publishing/sharing, describing, and managing the quality of, backing up, and securing the data holdings.

 

References 

 

Page last updated 5/3/21.