Process

By Data Management

Data Processing covers any set of structured activities resulting in the alteration or integration of data. Data processing can result in data ready for analysis, or generate output such as graphs and summary reports. Documenting the steps for how data are processed is essential for reproducibility and improves transparency.

Data Processing's Place in the Data Lifecycle

Data processing does not represent a set of activities that must occur after acquisition and before analysis. A process can support any data handling activity in the lifecycle, such as screening or preparing data for preservation and sharing.

Process and Analyze - Closely Related Activities

Silhouette of USGS scientist collecting data of erupting fissure during Kilauea eruption

It can sometimes be difficult to determine where processing ends and analysis begins. In part this is because the two concepts are often intermingled to ensure that both data and research products meet a common set of goals.

Learn more

Processing activities include validation, transformation, subsetting, summarizing, integration, and derivation, among others.

Validation, Transformation, & Subsetting
Summarization, Integration, & Derivation
Process Documentation
Process Diagrams, Workflow Tools and Automation
Best Practices
Examples
What the U.S. Geological Survey Manual Requires

Validation

Data may need to be compared to natural limits, adjacent measurements, or historical data to verify that they are suitable for use. Although this activity falls under the umbrella of Data Quality Management, it is quite common to develop a process to handle those validation steps, one that can be codified and does not require manual intervention. See the section on Managing Quality for more information about data quality in the Science Data Lifecycle.

Examples:

Reject spurious values in a real-time data stream that are more than X% different from the previous value.
Flag coordinates that are outside the spatial bounding box of a project and ensure that the longitude portion of a pair of coordinates in the U.S. is given as a negative value.
Ignore colored dissolved organic matter (CDOM) values when turbidity readings exceed X.
If field pH is > 1 unit different than lab-derived pH for the same sample, flag both values.

Transformation

Transforming data includes converting, reorganizing, or reformatting data. This action does not change the meaning of the data but can enable use in a context different than the original intent, or facilitate display and analysis. It is not unusual for data to be reformatted for use in different software environments (Fig 1).

Examples:

Image illustrating how table formatting can be modified through data transformation. — Figure 1. Example of two tables where the Tall format contains the columns Site, Year, Count, and the Wide format contains the columns Site and each incremental year with "Count" under each year.(Public domain)

Convert raw electrical pulses from in situ remote sensors into a time series of data values, at the same time removing spikes and interpolating for missing sequence values.
Convert spreadsheet data to XML or CSV format.
Convert a series of date-time values from local time to Coordinated Universal Time (UTC).
Transform a series of 'named' coordinates (point values) to locations on a map (same data, different usage format).
Rotate a data table from a tall format (vertical) into a wide format (horizontal), or vice versa (Fig. 1).
- All values are preserved and intact, but presented in a different arrangement that may be better suited to certain uses or to meet software requirements.

Subsetting

Subsetting data includes extracting not only select parts of a larger dataset (retrieval filter), but also filtering columns or rows and excluding values from working datasets based on user-defined criteria. The result of subsetting is a more compact and well-defined set of data that meet a particular set of use requirements.

Examples:

Remove rows of data that do not meet the data requirements of a project, such as data that are outside of the geographic or temporal extent.
Cut out columns of data in a standardized retrieval from an external source to leave only the data of interest.
- For example, a project requiring annual total precipitation at a Site can remove the redundant individual monthly columns from a standardized weather data retrieval in wide format.

Summarization

Sometimes it is necessary to summarize data through grouping, aggregating, and producing statistics about the data. This can be considered a 'data reduction' step in a process, where fine-grained data are recast at a scale more amenable to integration, analysis or display.

Examples:

Convert 15-minute interval data to a daily total, min, and max.
Calculate the total annual rainfall by county from individual weather station hourly data (or by watershed, HUC, state, municipality, etc.).
Aggregate data by using a classification or domain system (e.g., express a sediment sample as percentages of different grain sizes).

Integration

Data integration builds a new data structure or combines datasets. Activities could involve merging, stacking, or concatenating data, and may use web services that allow access to authoritative data sources based on user-defined criteria. Integrated modeling can also be an avenue for data integration (see Modeling).

Examples:

USGS National Water Dashboard - Water information for the United States
Current Alerts for U.S. Volcanoes - Real-time volcano activity detections
The National Map - Topographic information for the nation
Geo Data Portal - Processing and comparing downscaled climate projection models
Web Coverage Service (WCS), Web Map Service (WMS), Web Processing Service (WPS) - Services that enable selecting, aggregating and integrating, based on user-defined criteria

Derivation

Data derivation is a processing component that creates new value types that were not present in the source data. Typically an algorithm is applied to derive a new value.

Examples:

Compute the difference between potential evapotranspiration and total rainfall, for a point grid in GIS.
Estimate the ion balance for a water sample (total cation meq / total anion meq).
From Census data, compute the population growth rate (or decline) over the last 50 years by congressional district.
Create a uniform grid of geospatial values from a variety of unevenly spaced discrete point values.

Process Documentation

Capturing and communicating information about how data were processed is critical for reproducible science. Documenting changes made to data from acquisition through use and sharing in a project (what you 'did' to the data, not how you 'used' it) should be part of the formal metadata record accompanying a data release, or as part of a published methodology being shared more widely. Also see the next section on diagramming and workflow management.

CSDGM metadata example of Process Steps. — Figure 2: CSDGM metadata example showing Process Steps within a project metadata record having five defined process steps. The Process Step section is under the Data Quality Information section of the full metadata record. (Public domain)

Ideally a process should be documented as a set of sequential, nested procedures and steps, and will adhere to the chosen metadata standard format. The general format for FGDC-CSDGM standard metadata is this:

Lineage

Process Step (repeat as needed; e.g. Step 1, Step 2, etc.)
- Description - Title and description of the activity, and should include the desired outcome and any data requirements or acceptance standards being met by this process.
- Date - Date when the process was completed.

Figure 2 shows part of the Process section of a project metadata record having five defined process steps. View the complete metadata record.

Process Diagrams, Workflow Tools and Automation

The use of flowcharts, data flow diagrams, and workflow tools can be very helpful for communication and capturing the history of a work activity. Workflow Capture is described on a separate page in the Process section of this website.

Data processing at USGS usually involves the use of software and programming languages, and processes to handle routine or repeated interactions with data are often automated. Modular approaches to process development provide the most utility, as well-designed components can be reused in a variety of contexts or combined with other, compatible modules to accomplish very complex tasks.

Best Practices

Use existing metadata conventions appropriate to the data (FGDC-CSDGM, ISO19115, ISO 19157, CF-1.6, ACDD). Note that USGS must follow the FGDC-CSDGM or ISO standard.
Use known and open formats (txt, geoTIFF, netCDF).
Use checklists to manage workflow.
Use scripts to automate, and enhance reproducibility.
Use open-source solutions when possible.
Save your input data (will be published at "Publish/Share" stage).
Conduct a peer review on the processing software. To validate data produced by a 'software process,' that process should ideally be vetted. Visit the Software Management Website to learn best practices for managing, reviewing, and releasing your code.
Release code through USGS GitLab.
Produce data using standards that are appropriate to your discipline.

Examples

USGS

Data processing section of the Documentation of the USGS Oceanographic Time-Series Measurement Database
USGS Water Resources Software used to process data

Other Groups

Integrated Ocean Observing System (IOOS): formalized processes for quality control of real-time data.
Processing and analysis for Sensor Data Management Middleware: using modules for processing and analysis.
Sensor Data Quality: QA/QC processing workflows for high-speed data streams.
Rolling deck to Repository (R2R): includes processes related to R2R data stewardship.
National Ecological Observatory Network (NEON):automated processing of field observatory data into derived data products, with provenance.

What the U.S. Geological Survey Manual Requires:

Policies that apply to Data Processing address appropriate documentation of the methods and actions used to modify data from an acquired state to the form used for research or produced for sharing. Metadata standards (FGDC, ISO) include sections for describing the "provenance" of data, meaning that enough "process" information is provided for the user to determine where data originated and what changes were made to get to the described form.

The USGS Manual Chapter 500.25 - USGS Scientific Integrity discusses the USGS's dedication to "preserving the integrity of the scientific activities it conducts and that are conducted on its behalf" by abiding to the Department of Interior 305 DM 3 - Integrity of Scientific and Scholarly Activities.

The USGS Manual Chapter 502.2 - Fundamental Science Practices: Planning and Conducting Data Collection and Research includes requirements for process documentation.

"Documentation: Data collected for publication in databases or information products, regardless of the manner in which they are published (such as USGS reports, journal articles, and Web pages), must be documented to describe the methods or techniques used to collect, process, and analyze data (including computer modeling software and tools produced by USGS); the structure of the output; description of accuracy and precision; standards for metadata; and methods of quality assurance."

and

"Standard USGS methods are employed for distinct research activities that are conducted on a frequent or ongoing basis and for types of data that are produced in large quantities. Methods must be documented to describe the processes used and the quality-assurance procedures applied."

The USGS Manual Chapter 502.4 - Fundamental Science Practices: Review, Approval, and Release of Information Products addresses documentation of the methodology used to create data and generate research results.:

"Methods used to collect data and produce results must be defensible and adequately documented."

References

USGS Fundamental Science Practices - USGS Code of Scientific Conduct.
Oak Ridge National Laboratory Distributed Active Archive Center. 2019. Data Management. https://daac.ornl.gov/datamanagement/.
The Quartz Guide to Bad Data.
Singh, M.P. & Vouk, M.A. Scientific Workflows: Scientific Computing Meets Transactional Workflows.
IOOS Quality Assurance of Real Time Oceanographic Data.
Open Geospatial Consortium (OGC).
Williams, S.J., Arsenault, M.A., Buczkowski, B.J., Reid, J.A., Flocks, James, Kulp, M.A., Penland, Shea, Jenkins, C.J., 2007, Surficial sediment character of the Louisiana offshore continental shelf region: A GIS Compilation: U.S. Geological Survey Open File Report 2006-1195, nomenclature.

Page last updated 1/2/24.

Process

Data Processing's Place in the Data Lifecycle

Process and Analyze - Closely Related Activities

Table of Contents

Validation

Transformation

Subsetting

Summarization

Integration

Derivation

Process Documentation

Process Diagrams, Workflow Tools and Automation

Best Practices

Examples

What the U.S. Geological Survey Manual Requires:

Recommended Reading

References

U.S. Geological Survey

U.S. Department of the Interior