Data Management

Process

Data Processing covers any set of structured activities resulting in the alteration or integration of data. Data processing can result in data ready for analysis, or generate output such as graphs and summary reports. Documenting the steps for how data are processed is essential for reproducibility and improves transparency.

Data Processing's Place in the Data Lifecycle

Data processing does not represent a set of activities that must occur after acquisition and before analysis. A 'process' can support any data handling activity in the lifecycle, such as screening data or preparing data for preservation and sharing.

Process and Analyze - Closely Related Activities

It can sometimes be difficult to determine where processing ends and analysis begins. In part this is because the two concepts are often intermingled to ensure that both data and research products meet a common set of goals.

Validation 

Data may need to be compared to natural limits, adjacent measurements, or historical data to verify that they are suitable for use. Although this activity falls under the umbrella of Data Quality Management, it is quite common to develop a process to handle those validation steps, one that can be codified and does not require manual intervention. See the section on Managing Quality for more information about data quality in the Science Data Lifecycle.

Examples:

  • Reject spurious values in a real-time data stream that are more than X% different from the previous value.
  • Flag coordinates that are outside the spatial bounding box of a project and ensure that the longitude portion of a pair of coordinates in the U.S. is given as a negative value.
  • Ignore colored dissolved organic matter (CDOM) values when turbidity readings exceed X.
  • If field pH is > 1 unit different than lab-derived pH for the same sample, flag both values.
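
Validation rules like the first example above can be codified so they run without manual intervention. The following is a minimal sketch in Python; the function name and the 25% threshold are illustrative choices, not part of any USGS tool:

```python
def flag_spikes(values, max_pct_change=25.0):
    """Flag readings that differ from the previous value by more than
    max_pct_change percent (a simple spike test for a data stream)."""
    flags = [False]  # the first value has no predecessor to compare against
    for prev, curr in zip(values, values[1:]):
        if prev == 0:
            flags.append(False)  # avoid division by zero; pass the value through
        else:
            pct = abs(curr - prev) / abs(prev) * 100
            flags.append(pct > max_pct_change)
    return flags

readings = [10.0, 10.4, 10.2, 18.9, 10.3]
print(flag_spikes(readings))  # the jump to 18.9 (and back) exceeds 25%
```

A flagging approach like this keeps the spurious values in the record for review, while a rejection rule would drop them outright; either behavior can be wrapped in the same loop.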

 

Transformation 

Figure 1. Example of two tables containing the same data: the Tall format has the columns Site, Year, and Count, while the Wide format has a Site column and one column per year, with the counts under each year.
(Public domain)

Transforming data includes converting, reorganizing, or reformatting data. This action does not change the meaning of the data but can enable use in a context different than the original intent, or facilitate display and analysis. It is not unusual for a single dataset to be reformatted for use in different software environments (Fig 1).

Examples:

  • Convert raw electrical pulses from in situ remote sensors into a time series of data values, at the same time removing spikes and interpolating for missing sequence values.
  • Convert spreadsheet data to XML or CSV format.
  • Convert a series of date-time values from local time to Coordinated Universal Time (UTC).
  • Transform a series of 'named' coordinates (point values) to locations on a map (same data, different usage format).
  • Rotate a data table from a tall format (vertical) into a wide format (horizontal), or vice versa (Fig. 1).
    • All values are preserved and intact, but presented in a different arrangement that may be better suited to certain uses or to meet software requirements.
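
The tall-to-wide rotation of Figure 1 can be sketched in a few lines of Python. This is an illustration only (the column names match the figure; the function name is invented):

```python
def tall_to_wide(rows):
    """Pivot tall records (Site, Year, Count) into one row per site
    with a column per year, as in Figure 1. All values are preserved."""
    years = sorted({r["Year"] for r in rows})
    wide = {}
    for r in rows:
        # start each site with every year column empty, then fill it in
        wide.setdefault(r["Site"], {y: None for y in years})[r["Year"]] = r["Count"]
    return [{"Site": site, **counts} for site, counts in sorted(wide.items())]

tall = [
    {"Site": "A", "Year": 2020, "Count": 5},
    {"Site": "A", "Year": 2021, "Count": 7},
    {"Site": "B", "Year": 2020, "Count": 3},
]
for row in tall_to_wide(tall):
    print(row)  # site B has no 2021 count, so that cell stays empty (None)
```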

 

Subsetting

Subsetting data includes not only extracting select parts of a larger dataset (a retrieval filter), but also filtering columns or rows and excluding values from working datasets based on user-defined criteria. The result of subsetting is a more compact, well-defined set of data that meets a particular set of use requirements.

Examples:

  • Remove rows of data that do not meet the data requirements of a project, such as data that are outside of the geographic or temporal extent.
  • Cut out columns of data in a standardized retrieval from an external source to leave only the data of interest.
    • For example, a project requiring annual total precipitation at a Site can remove the redundant individual monthly columns from a standardized weather data retrieval in wide format.
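
Both kinds of filtering (rows by criteria, columns by interest) can be combined in one pass. A minimal Python sketch, with invented field names and a hypothetical bounding box:

```python
def subset(rows, keep_cols, bbox):
    """Keep only rows inside a lat/lon bounding box (row filter),
    and only the columns of interest (column filter)."""
    min_lat, max_lat, min_lon, max_lon = bbox
    out = []
    for r in rows:
        if min_lat <= r["lat"] <= max_lat and min_lon <= r["lon"] <= max_lon:
            out.append({c: r[c] for c in keep_cols})
    return out

rows = [
    {"site": "A", "lat": 44.9, "lon": -93.2, "temp": 12.1, "notes": "x"},
    {"site": "B", "lat": 51.0, "lon": -93.5, "temp": 9.8,  "notes": "y"},
]
# site B falls outside the latitude range and is dropped;
# the "notes" and coordinate columns are cut from the result
print(subset(rows, ["site", "temp"], (40.0, 50.0, -95.0, -90.0)))
```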

 

Summarization 

Sometimes it is necessary to summarize data through grouping, aggregating, and producing statistics about the data. This can be considered a 'data reduction' step in a process, where fine-grained data are recast at a scale more amenable to integration, analysis or display.

Examples:

  • Convert 15-minute interval data to a daily total, min, and max.
  • Calculate the total annual rainfall by county from individual weather station hourly data (or by watershed, HUC, state, municipality, etc.).
  • Aggregate data by using a classification or domain system (e.g., express a sediment sample as percentages of different grain sizes).
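
The first example, reducing interval data to daily statistics, might be sketched as follows in Python (the record layout is an assumption for illustration):

```python
from collections import defaultdict
from datetime import datetime

def daily_summary(observations):
    """Reduce timestamped readings to a per-day total, min, and max
    (a 'data reduction' step: fine-grained data recast at a coarser scale)."""
    by_day = defaultdict(list)
    for ts, value in observations:
        by_day[ts.date()].append(value)  # group readings by calendar day
    return {
        day: {"total": sum(v), "min": min(v), "max": max(v)}
        for day, v in sorted(by_day.items())
    }

obs = [
    (datetime(2023, 6, 1, 0, 15), 0.2),
    (datetime(2023, 6, 1, 0, 30), 0.5),
    (datetime(2023, 6, 2, 0, 15), 0.1),
]
print(daily_summary(obs))
```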

     

Integration

Data integration builds a new data structure or combines datasets. Activities could involve merging, stacking, or concatenating data, and may use web services that allow access to authoritative data sources based on user-defined criteria.

Examples:

  • Merge field observations with site metadata using a shared site identifier.
  • Stack annual data files from multiple field seasons into a single table.
  • Append records retrieved from an authoritative web service to locally collected data.
     

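A merge on a shared key is the most common integration step. A minimal Python sketch (field names and the one-to-one join behavior are assumptions for illustration):

```python
def merge_on_key(left, right, key):
    """Join two record lists on a shared key; rows with no match in
    'right' are kept unchanged (a simple left join)."""
    lookup = {r[key]: r for r in right}
    merged = []
    for row in left:
        match = lookup.get(row[key], {})
        merged.append({**row, **match})  # combine fields from both sources
    return merged

sites = [{"site": "A", "lat": 44.9}, {"site": "B", "lat": 45.1}]
counts = [{"site": "A", "count": 12}]
print(merge_on_key(sites, counts, "site"))  # site B keeps only its own fields
```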
Derivation

Data derivation is a processing step that creates new values not present in the source data, typically by applying an algorithm to existing values.

Examples:

  • Compute the difference between potential evapotranspiration and total rainfall for a point grid in GIS.
  • Estimate the ion balance for a water sample (total cation meq / total anion meq).
  • From Census data, compute the population growth rate (or decline) over the last 50 years by congressional district.
  • Create a uniform grid of geospatial values from a variety of unevenly spaced discrete point values.
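
The first example, deriving a water deficit from two existing variables, reduces to adding a computed column. A Python sketch with invented field names:

```python
def water_deficit(records):
    """Derive a new value (PET minus rainfall) that is not present in the
    source data; every input field is preserved alongside the new one."""
    return [
        {**r, "deficit": r["pet_mm"] - r["rain_mm"]}
        for r in records
    ]

grid = [{"cell": 1, "pet_mm": 120.0, "rain_mm": 80.0},
        {"cell": 2, "pet_mm": 90.0, "rain_mm": 110.0}]
for row in water_deficit(grid):
    print(row["cell"], row["deficit"])  # cell 2 has a surplus (negative deficit)
```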

     

Process Documentation

Figure 2. CSDGM metadata example showing the Process Steps section within a project metadata record that has five defined process steps. The Process Step section falls under the Data Quality Information section of the full metadata record.
(Public domain)

Capturing and communicating information about how data were processed is critical for reproducible science. Documenting changes made to data from acquisition through use and sharing in a project (what you 'did' to the data, not how you 'used' it) should be part of the formal metadata record accompanying a data release, or part of a published methodology being shared more widely. Also see the next section on diagramming and workflow management.

Ideally a process should be documented as a set of sequential, nested procedures and steps, and should adhere to the chosen metadata standard format. The general format for FGDC-CSDGM standard metadata is:

Lineage

  • Process Step (repeat as needed; e.g., Step 1, Step 2, etc.)
    • Description - Title and description of the activity, including the desired outcome and any data requirements or acceptance standards met by this process.
    • Date - Date when the process was completed.
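
In CSDGM XML, each step becomes a procstep element under lineage, with the description and date in procdesc and procdate. A minimal illustration (the step descriptions and dates below are invented):

```xml
<lineage>
  <procstep>
    <procdesc>Converted raw sensor pulses to a 15-minute time series;
      removed spikes greater than 25% of the previous value.</procdesc>
    <procdate>20230601</procdate>
  </procstep>
  <procstep>
    <procdesc>Aggregated 15-minute values to daily totals.</procdesc>
    <procdate>20230602</procdate>
  </procstep>
</lineage>
```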

Figure 2 shows part of the Process section of a project metadata record having five defined process steps. View the complete metadata record.

     

Process Diagrams, Workflow Tools and Automation

The use of flowcharts, data flow diagrams, and workflow tools can be very helpful for communication and capturing the history of a work activity. Workflow Capture is described on a separate page in the Process section of this website. Good descriptions of a variety of diagramming tools can be found at this Minnesota Department of Health Web site.

Data processing at USGS usually involves the use of software and programming languages, and processes to handle routine or repeated interactions with data are often automated. Modular approaches to process development provide the most utility, as well-designed components can be reused in a variety of contexts or combined with other, compatible modules to accomplish very complex tasks.
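
The modular idea can be sketched as a pipeline of small, reusable functions. The step names below are hypothetical, not a USGS API:

```python
def run_pipeline(data, steps):
    """Apply a sequence of processing steps in order; each step is a small,
    self-contained function, so modules can be recombined across projects."""
    for step in steps:
        data = step(data)
    return data

# Hypothetical modular steps (illustrative names)
def drop_negatives(values):
    return [v for v in values if v >= 0]

def daily_mean(values):
    return sum(values) / len(values)

print(run_pipeline([4.0, -1.0, 6.0, 2.0], [drop_negatives, daily_mean]))  # 4.0
```

Because each step takes data in and returns data out, steps can be tested individually and reordered or swapped without rewriting the pipeline.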

     

Best Practices

  • Use existing metadata conventions appropriate to the data (FGDC-CSDGM, ISO 19115, ISO 19157, CF-1.6, ACDD). Note that USGS must follow the FGDC-CSDGM or ISO standard.
  • Use known and open formats (txt, GeoTIFF, netCDF).
  • Use checklists to manage workflow.
  • Use scripts to automate processing and enhance reproducibility.
  • Use open-source solutions when possible.
  • Keep code releases in public repositories such as GitHub. Note that USGS is developing internal repository capability for version control services, staging pre-release development, and fostering code reviews.
  • Save your input data (it will be published at the "Publish/Share" stage).
  • Conduct a peer review of the processing software; to validate data produced by a 'software process,' that process should ideally be vetted.
  • Produce data using standards that are appropriate to your discipline.

     


What the U.S. Geological Survey Manual Requires:

Policies that apply to Data Processing address appropriate documentation of the methods and actions used to modify data from an acquired state to the form used for research or produced for sharing. Metadata standards (FGDC, ISO) include sections for describing the "provenance" of data, meaning that enough "process" information is provided for the user to determine where data originated and what changes were made to get to the described form.

The USGS Manual Chapter 500.25 - USGS Scientific Integrity discusses the USGS's dedication to "preserving the integrity of the scientific activities it conducts and that are conducted on its behalf" by abiding by the Department of the Interior's 305 DM 3 - Integrity of Scientific and Scholarly Activities.

The USGS Manual Chapter 502.2 - Fundamental Science Practices: Planning and Conducting Data Collection and Research includes requirements for process documentation.

    "Documentation: Data collected for publication in databases or information products, regardless of the manner in which they are published (such as USGS reports, journal articles, and Web pages), must be documented to describe the methods or techniques used to collect, process, and analyze data (including computer modeling software and tools produced by USGS); the structure of the output; description of accuracy and precision; standards for metadata; and methods of quality assurance."

    and

    "Standard USGS methods are employed for distinct research activities that are conducted on a frequent or ongoing basis and for types of data that are produced in large quantities. Methods must be documented to describe the processes used and the quality-assurance procedures applied."

The USGS Manual Chapter 502.4 - Fundamental Science Practices: Review, Approval, and Release of Information Products addresses documentation of the methodology used to create data and generate research results:

"Methods used to collect data and produce results must be defensible and adequately documented."

     

Recommended Reading

  • Wilson, G., et al., Good Enough Practices in Scientific Computing: https://arxiv.org/pdf/1609.00037.pdf [Link Verified July 30, 2018]
  • Read, J.S., Walker, J.I., Appling, A.P., Blodgett, D.L., Read, E.K., Winslow, L.A., 2016. geoknife: reproducible web-processing of large gridded datasets. Ecography. 39(4):354-360. https://doi.org/10.1111/ecog.01880 [Link Verified July 30, 2018]
