Data Processing covers any set of structured activities resulting in the alteration or integration of data. Data processing can result in data ready for analysis, or generate output such as graphs and summary reports. Documenting the steps for how data are processed is essential for reproducibility and improves transparency.
Data Processing's Place in the Data Lifecycle
Data processing is not a set of activities that must occur only after acquisition and before analysis. A process can support any data handling activity in the lifecycle, such as screening or preparing data for preservation and sharing.
Process and Analyze - Closely Related Activities
It can sometimes be difficult to determine where processing ends and analysis begins. In part this is because the two concepts are often intermingled to ensure that both data and research products meet a common set of goals.
Validation
Data may need to be compared to natural limits, adjacent measurements, or historical data to verify that they are suitable for use. Although this activity falls under the umbrella of Data Quality Management, it is quite common to develop a process to handle those validation steps, one that can be codified and does not require manual intervention (a sketch follows the examples below). See the section on Managing Quality for more information about data quality in the Science Data Lifecycle.
Examples:
Reject spurious values in a real-time data stream that are more than X% different from the previous value.
Flag coordinates that are outside the spatial bounding box of a project and ensure that the longitude portion of a pair of coordinates in the U.S. is given as a negative value.
If field pH is > 1 unit different than lab-derived pH for the same sample, flag both values.
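Checks like these can be codified so they run without manual intervention. Below is a minimal sketch in Python with pandas; the column names, spike threshold, bounding box, and pH tolerance are all illustrative assumptions, not a prescribed USGS method.

```python
import pandas as pd

def validate(df, spike_pct=10.0, bbox=(24.0, 50.0, -125.0, -66.0), ph_tol=1.0):
    """Add boolean flag columns to df; no values are removed or altered."""
    lat_min, lat_max, lon_min, lon_max = bbox

    # Flag spurious values that differ from the previous reading by more
    # than spike_pct percent (real-time spike check).
    pct = (df["value"].diff().abs() / df["value"].shift().abs()) * 100
    df["flag_spike"] = pct > spike_pct

    # Flag coordinates outside the project bounding box, and U.S. longitudes
    # mistakenly given as positive values.
    df["flag_coords"] = (
        ~df["lat"].between(lat_min, lat_max)
        | ~df["lon"].between(lon_min, lon_max)
        | (df["lon"] > 0)
    )

    # Flag sample pairs where field pH and lab pH differ by more than ph_tol.
    df["flag_ph"] = (df["field_ph"] - df["lab_ph"]).abs() > ph_tol
    return df
```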
Transformation
Transforming data includes converting, reorganizing, or reformatting data. This action does not change the meaning of the data, but it can enable use in a context different from the original intent, or facilitate display and analysis. It is not unusual for data to be reformatted for use in different software environments (Fig. 1).
Examples:
Convert raw electrical pulses from in situ remote sensors into a time series of data values, at the same time removing spikes and interpolating for missing sequence values.
Convert spreadsheet data to XML or CSV format.
Convert a series of date-time values from local time to Coordinated Universal Time (UTC).
Transform a series of 'named' coordinates (point values) to locations on a map (same data, different usage format).
Rotate a data table from a tall format (vertical) into a wide format (horizontal), or vice versa (Fig. 1).
All values are preserved and intact, but presented in a different arrangement that may be better suited to certain uses or to meet software requirements.
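Two of these transformations, converting local date-times to UTC and rotating a tall table to wide, might look like the following sketch in Python with pandas. The station time zone and column names are illustrative assumptions.

```python
import pandas as pd

tall = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-06-01 08:00", "2024-06-01 08:00"]),
    "parameter": ["temp_c", "ph"],
    "value": [21.5, 7.8],
})

# Localize to the station's (assumed) time zone, then convert to UTC;
# the data values themselves are unchanged.
tall["timestamp"] = (
    tall["timestamp"].dt.tz_localize("US/Central").dt.tz_convert("UTC")
)

# Rotate from tall (one row per reading) to wide (one column per parameter);
# all values are preserved, only the arrangement changes.
wide = tall.pivot(index="timestamp", columns="parameter", values="value")
```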
Subsetting
Subsetting data includes extracting select parts of a larger dataset (a retrieval filter), as well as filtering columns or rows and excluding values from working datasets based on user-defined criteria. The result of subsetting is a more compact, well-defined set of data that meets a particular set of use requirements.
Examples:
Remove rows of data that do not meet the data requirements of a project, such as data that are outside of the geographic or temporal extent.
Cut out columns of data in a standardized retrieval from an external source to leave only the data of interest.
For example, a project requiring annual total precipitation at a site can remove the redundant individual monthly columns from a standardized weather data retrieval in wide format.
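A minimal sketch of such subsetting in Python with pandas; the column names, date range, and geographic extent are illustrative assumptions.

```python
import pandas as pd

def subset(df):
    # Keep only rows inside the project's temporal and geographic extent.
    in_time = df["date"].between("2010-01-01", "2020-12-31")
    in_space = df["lat"].between(36.0, 40.0) & df["lon"].between(-110.0, -104.0)
    df = df[in_time & in_space]

    # Keep only the columns of interest, dropping (for example) the redundant
    # monthly columns from a wide-format weather retrieval.
    return df[["site_id", "date", "annual_precip_mm"]]
```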
Summarization
Sometimes it is necessary to summarize data through grouping, aggregating, and producing statistics about the data. This can be considered a 'data reduction' step in a process, where fine-grained data are recast at a scale more amenable to integration, analysis or display.
Examples:
Convert 15-minute interval data to a daily total, min, and max.
Calculate the total annual rainfall by county from individual weather station hourly data (or by watershed, HUC, state, municipality, etc.).
Aggregate data by using a classification or domain system (e.g., express a sediment sample as percentages of different grain sizes).
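The first example above, reducing 15-minute interval data to a daily total, min, and max, could be sketched in Python with pandas as follows; the column name and sample data are illustrative assumptions.

```python
import pandas as pd

def daily_summary(df):
    """df: 15-minute readings with a DatetimeIndex and a 'value' column."""
    return df["value"].resample("D").agg(["sum", "min", "max"])

# Usage with one day of synthetic 15-minute readings (96 per day).
idx = pd.date_range("2024-06-01", periods=96, freq="15min")
readings = pd.DataFrame({"value": range(96)}, index=idx)
print(daily_summary(readings))
```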
Integration
Data integration builds a new data structure or combines datasets. Activities could involve merging, stacking, or concatenating data, and may use web services that allow access to authoritative data sources based on user-defined criteria. Integrated modeling can also be an avenue for data integration (see Modeling).
Examples:
Geo Data Portal - Processing and comparing downscaled climate projection models.
Web Coverage Service (WCS), Web Map Service (WMS), Web Processing Service (WPS) - Services that enable selecting, aggregating, and integrating data based on user-defined criteria.
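Two common integration operations, merging on a shared key and stacking retrievals, might look like this sketch in Python with pandas; the frames and column names are illustrative assumptions.

```python
import pandas as pd

# Merge (join) two datasets on a shared key to build a new structure.
sites = pd.DataFrame({"site_id": ["A1", "B2"], "lat": [39.7, 38.5]})
flow = pd.DataFrame({"site_id": ["A1", "B2"], "discharge_cfs": [120.0, 45.0]})
merged = sites.merge(flow, on="site_id", how="inner")

# Stack (concatenate) two retrievals with the same columns into one dataset.
jan = pd.DataFrame({"site_id": ["A1"], "discharge_cfs": [110.0]})
feb = pd.DataFrame({"site_id": ["A1"], "discharge_cfs": [130.0]})
stacked = pd.concat([jan, feb], ignore_index=True)
```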
Derivation
Data derivation is a processing component that creates new value types that were not present in the source data. Typically an algorithm is applied to derive a new value.
Examples:
Compute the difference between potential evapotranspiration and total rainfall, for a point grid in GIS.
Estimate the ion balance for a water sample (total cation meq / total anion meq).
From Census data, compute the population growth rate (or decline) over the last 50 years by congressional district.
Create a uniform grid of geospatial values from a variety of unevenly spaced discrete point values.
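The ion-balance example above (total cation meq / total anion meq) could be derived as in this sketch in Python with pandas; the column names are illustrative assumptions, and the meq values are assumed to be precomputed.

```python
import pandas as pd

def ion_balance(df):
    cations = df[["ca_meq", "mg_meq", "na_meq", "k_meq"]].sum(axis=1)
    anions = df[["cl_meq", "so4_meq", "hco3_meq"]].sum(axis=1)
    # The derived column is a new value type not present in the source data.
    df["ion_balance"] = cations / anions
    return df
```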
Process Documentation
Capturing and communicating information about how data were processed is critical for reproducible science. Documenting changes made to data from acquisition through use and sharing in a project (what you 'did' to the data, not how you 'used' it) should be part of the formal metadata record accompanying a data release, or as part of a published methodology being shared more widely. Also see the next section on diagramming and workflow management.
Ideally, a process should be documented as a set of sequential, nested procedures and steps, and should adhere to the chosen metadata standard format. The general format for FGDC-CSDGM standard metadata is:
Lineage
Process Step (repeat as needed; e.g. Step 1, Step 2, etc.)
Description - Title and description of the activity, which should include the desired outcome and any data requirements or acceptance standards being met by this process.
Date - Date when the process was completed.
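As a sketch, the lineage structure above could be written as FGDC-CSDGM XML with the Python standard library. The element names (lineage, procstep, procdesc, procdate) follow the CSDGM short names; the step text and dates are illustrative.

```python
import xml.etree.ElementTree as ET

lineage = ET.Element("lineage")
for desc, date in [
    ("Step 1: Retrieved raw gage-height data from the sensor network.", "20240115"),
    ("Step 2: Converted gage height to discharge using the site rating.", "20240116"),
]:
    step = ET.SubElement(lineage, "procstep")
    ET.SubElement(step, "procdesc").text = desc  # description of the activity
    ET.SubElement(step, "procdate").text = date  # date the process completed

print(ET.tostring(lineage, encoding="unicode"))
```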
Figure 2 shows part of the Process section of a project metadata record with five defined process steps.
Process Diagrams, Workflow Tools and Automation
The use of flowcharts, data flow diagrams, and workflow tools can be very helpful for communication and capturing the history of a work activity. Workflow Capture is described on a separate page in the Process section of this website.
Data processing at USGS usually involves the use of software and programming languages, and processes to handle routine or repeated interactions with data are often automated. Modular approaches to process development provide the most utility, as well-designed components can be reused in a variety of contexts or combined with other, compatible modules to accomplish very complex tasks.
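A minimal sketch of that modular approach in Python: small, reusable step functions composed into a pipeline. The step functions and time zone here are illustrative placeholders.

```python
import pandas as pd

def drop_missing(df):
    """Reusable module: remove rows with missing values."""
    return df.dropna()

def to_utc(df):
    """Reusable module: convert a local-time index to UTC (time zone assumed)."""
    df.index = df.index.tz_localize("US/Central").tz_convert("UTC")
    return df

def run_pipeline(df, steps):
    """Compose modules: apply each df -> df step in order."""
    for step in steps:
        df = step(df)
    return df

idx = pd.date_range("2024-06-01", periods=4, freq="h")
raw = pd.DataFrame({"value": [1.0, None, 3.0, 4.0]}, index=idx)
result = run_pipeline(raw, [drop_missing, to_utc])
```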
Best Practices
Use existing metadata conventions appropriate to the data (FGDC-CSDGM, ISO 19115, ISO 19157, CF-1.6, ACDD). Note that USGS must follow the FGDC-CSDGM or ISO standard.
Use known and open formats (txt, geoTIFF, netCDF).
Use checklists to manage workflow.
Use scripts to automate processing and enhance reproducibility.
Use open-source solutions when possible.
Save your input data (it will be published at the "Publish/Share" stage).
Conduct a peer review of the processing software. To validate data produced by a 'software process,' that process should ideally be vetted. Visit the Software Management Website to learn best practices for managing, reviewing, and releasing your code.
Release code through USGS GitLab.
Produce data using standards that are appropriate to your discipline.
Policies
Policies that apply to Data Processing address appropriate documentation of the methods and actions used to modify data from an acquired state to the form used for research or produced for sharing. Metadata standards (FGDC, ISO) include sections for describing the "provenance" of data, meaning that enough "process" information is provided for the user to determine where the data originated and what changes were made to arrive at the described form.
"Documentation: Data collected for publication in databases or information products, regardless of the manner in which they are published (such as USGS reports, journal articles, and Web pages), must be documented to describe the methods or techniques used to collect, process, and analyze data (including computer modeling software and tools produced by USGS); the structure of the output; description of accuracy and precision; standards for metadata; and methods of quality assurance."
and
"Standard USGS methods are employed for distinct research activities that are conducted on a frequent or ongoing basis and for types of data that are produced in large quantities. Methods must be documented to describe the processes used and the quality-assurance procedures applied."