Data Management

File Formats

File formats are standard methods for encoding digital information.

File Format Examples

Examples of file formats include comma-separated values (.csv), ASCII text (.txt), Microsoft Excel (.xlsx), JPEG (.jpg), and Audio Video Interleave (.avi).

 

It’s important to think about file formats before you acquire data because your decisions at this stage may have implications in other stages of the science data lifecycle. The best format for collecting and processing the data might not be the best for analyzing it. The best format for analysis might not be the best for distributing the data, which in turn might not be the best for preserving it. Understanding these differences and connections at the beginning of a project helps you plan conversions in advance. Keep in mind that every time data are converted to a different format, there is a risk of introducing errors or losing data.

 

Best practices for choosing a file format for acquisition

  • Collect data in a file format that is open and non-proprietary to limit the need to convert from one format to another

  • If you need to collect data in a proprietary format, ensure that it can easily be converted to another non-proprietary, open format.

  • Select formats that have broad use and support in your community

  • Be aware of software, hardware, and licensing requirements for viewing and working with the data

  • When possible, choose formats that are self-describing and can automatically capture metadata

  • Within the files, avoid application of formatting, such as highlighting or color, to serve as metadata, because it will likely be lost when converting to different formats
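The last point above can be illustrated with a minimal Python sketch. It assumes a hypothetical table of field counts in which one value would otherwise have been highlighted in a spreadsheet; the column name qc_flag, the site IDs, and the helper write_csv are invented for illustration. The annotation is stored as an ordinary column, so it survives a round trip through plain CSV.

```python
import csv
import io

# Hypothetical field observations. The "qc_flag" column holds the note that
# might otherwise have been conveyed by highlighting a spreadsheet cell.
rows = [
    {"site_id": "A01", "count": 12, "qc_flag": ""},
    {"site_id": "A02", "count": 47, "qc_flag": "suspect: observer unsure of count"},
]


def write_csv(records, fieldnames):
    """Serialize records to CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = write_csv(rows, ["site_id", "count", "qc_flag"])

# Reading the text back shows that the flag is preserved as ordinary data.
restored = list(csv.DictReader(io.StringIO(csv_text, newline="")))
print(restored[1]["qc_flag"])
```

Because the flag is data rather than formatting, any tool that can read the CSV can also read the flag.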
     

Best practices for public data release formats

  • Data for public release must be in open, non-proprietary, and machine-readable formats

  • Release data in multiple formats if the format used by the scientific community does not meet all of these requirements

  • To work within file transfer limitations, compress data using lossless formats

  • If sharing different formats of the same file, be sure to give each file the same base name (e.g. bison_data_v1.xlsx and bison_data_v1.txt).

  • Include in the metadata record information about software and hardware for accessing proprietary formats. Learn more on the Metadata page.

  • When converting a file, be sure to check that the new file does not contain any errors or omissions.

    • Check the actual data: column headings, rows, etc

    • Ensure that values were not truncated and that significant figures were preserved

    • Check the metadata; make sure it is present and accurate.

    • Check that any markup, such as highlights or bolded text, is either removed or moved to the metadata, so that important ancillary information is not lost in the conversion.
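The conversion checks above can be partly automated. The following is a rough Python sketch, not a complete validation tool: it assumes both the original and the converted file are delimited text, and the sample data and the helper names read_table and compare_tables are hypothetical. Values are compared as strings on purpose, so that a dropped trailing digit or a truncated value shows up as a difference.

```python
import csv
import io


def read_table(text, delimiter):
    """Parse delimited text into (header, rows) without type coercion."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    table = list(reader)
    return table[0], table[1:]


def compare_tables(original, converted):
    """Return a list of readable differences; an empty list means none found."""
    problems = []
    orig_header, orig_rows = original
    conv_header, conv_rows = converted
    if orig_header != conv_header:
        problems.append(f"header mismatch: {orig_header} vs {conv_header}")
    if len(orig_rows) != len(conv_rows):
        problems.append(f"row count: {len(orig_rows)} vs {len(conv_rows)}")
    # Compare row by row as strings so lost digits are not hidden by rounding.
    for i, (a, b) in enumerate(zip(orig_rows, conv_rows), start=1):
        if a != b:
            problems.append(f"row {i} differs: {a} vs {b}")
    return problems


# Hypothetical data: the converted copy lost a digit in the second data row.
original_csv = "site_id,mass_kg\nA01,612.40\nA02,587.25\n"
converted_tsv = "site_id\tmass_kg\nA01\t612.40\nA02\t587.2\n"

issues = compare_tables(read_table(original_csv, ","), read_table(converted_tsv, "\t"))
for issue in issues:
    print(issue)
```

A check like this covers headings, row counts, and values only. Metadata and any information carried by markup still need to be reviewed by someone familiar with the data.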

To learn more, visit the Data Release page.
 

Best practices for long-term preservation

  • Save data in open, non-proprietary, unencrypted formats for long-term preservation

  • Once data analysis is complete, the researcher familiar with the data should convert proprietary formats used for acquisition and analysis into standard, long-lasting formats.

  • When converting a file, be sure to check that the new file does not contain any errors or omissions.

    • Check the actual data: column headings, rows, etc.

    • Ensure that values were not truncated and that significant figures were preserved

    • Check the internal metadata; make sure it is present and accurate.

    • Check that any markup, such as highlights or bolded text, is either removed or moved to the metadata, so that important ancillary information is not lost in the conversion.

  • If saving different formats of the same file, be sure to give each file the same base name (e.g. bison_data_v1.xlsx and bison_data_v1.txt).

  • Include in the metadata record information about software and hardware for accessing proprietary formats. Learn more on the Metadata page.

  • Save data in uncompressed formats whenever possible

  • If saving an image file using compression, lossless is better than lossy.
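What "lossless" means in practice can be shown with a short Python sketch. It uses gzip from the standard library as one example of a lossless method (the guidance above does not name a specific tool), and the sample bytes and the helper sha256_hex are hypothetical. The restored bytes are compared with the original through a SHA-256 checksum.

```python
import gzip
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 checksum of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical file contents, repeated so there is something to compress.
original = b"site_id,count\nA01,12\nA02,47\n" * 1000

compressed = gzip.compress(original)
restored = gzip.decompress(compressed)

# Sizes before and after compression.
print(len(original), len(compressed))

# Matching checksums indicate the restored bytes are identical to the original.
print(sha256_hex(original) == sha256_hex(restored))
```

A lossy format, such as JPEG at typical quality settings, would not pass this kind of check, because the decoded data differ from what was originally saved.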

To learn more, visit the Preserve section of the website.

 

File Format Options

Below are links to file format recommendations from the National Archives and Records Administration. You can also check out the Library of Congress Recommended Formats Statement, which is updated annually, for more information.

What the U.S. Geological Survey Manual Requires: 

SM 502.8 - Fundamental Science Practices: Review and Approval of Scientific Data for Release states:

The May 9, 2013, Office of Management and Budget (OMB) memorandum “Open Data Policy—Managing Data as an Asset” also requires agencies to provide free public access to data collected or created by using Federal funds, and to collect or create data in a way that supports downstream processing and dissemination activities.  This includes using machine-readable or open formats, data standards, and common-core and extensible metadata for all data released to the public.

 

SM 502.9 - Fundamental Science Practices: Preservation Requirements for Digital Scientific Data states:

The USGS must comply with National Archives and Records Administration (NARA) formats for records deemed to be permanent at the time of transfer to NARA (refer to https://www.archives.gov/records-mgmt/policy/transfer-guidance-tables.html).

 
