USGS - science for a changing world

USGS Data Management

Workflow Capture

Processing, analyzing, and transforming raw data into information can be a lengthy process. Researchers frequently modify and add to data, but they rarely document these changes adequately. Without documentation of how a dataset was processed, others may not be able to reproduce the results, which is why process metadata are required. Metadata communicate the what, where, and when; process metadata describe the how.

What is a Workflow?

Key Points

  • Processing, transforming, and analyzing data should be documented in the form of process metadata.
    • Process metadata enable reproducibility of a researcher's analysis of the data.
  • A workflow formalizes the process metadata by conceptualizing each component of the analysis through a visual diagram.
  • A workflow typically comprises the data inputs, data transformations, and analytical steps that result in the final data output.
  • Workflows come in two types:
    • Informal: visual flow diagram of a series of connected steps.
    • Formal/Executable: workflow diagrams executable in software systems.
  • Workflow capture enables transparency, reproducibility, and potential reuse.

A workflow formalizes the process metadata, which include a description of the researcher's method. In essence, it conceptualizes the data inputs, transformations (e.g., log transformation), and analytical steps that produce the final data output. Workflows come in two types: Formal and Informal.
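
As a minimal sketch of these three parts, the Python snippet below treats a workflow as one data input, one transformation (a base-10 log transformation), and one analytical step (a mean). The function names and sample values are hypothetical and chosen only for illustration.

```python
import math


def load_input():
    """Data input: hypothetical measurements (illustrative values only)."""
    return [1.0, 10.0, 100.0]


def log_transform(values):
    """Transformation: base-10 logarithm of each value."""
    return [math.log10(v) for v in values]


def mean(values):
    """Analytical step: arithmetic mean of the transformed values."""
    return sum(values) / len(values)


def run_workflow():
    """Chain input -> transformation -> analysis into the final output."""
    raw = load_input()
    transformed = log_transform(raw)
    return mean(transformed)


print(run_workflow())
```

Keeping each part in its own named function makes the order of operations visible, which is the information process metadata are meant to record.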

Informal Workflows

Informal workflows are basic conceptualizations that describe the input, analytical steps, and output of a process, which can range from simple to complex. They can include a variety of inputs and outputs, analytical processes that manipulate the data, decision nodes whose conditions determine the next step, and predefined processes that stand for a fixed multi-step procedure.
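
An informal workflow is normally drawn as a flowchart, but the same node types can be written down as plain data. The sketch below is one hypothetical way to do that in Python; the node names, labels, and the `outline` helper are assumptions made for illustration, not part of any USGS tool.

```python
# Each node: (kind, label, next). A decision node maps each outcome to a node.
WORKFLOW = {
    "raw": ("input", "Raw sensor readings", "check"),
    "check": ("decision", "Any missing values?", {"yes": "fill", "no": "stats"}),
    "fill": ("predefined process", "Standard gap-filling procedure", "stats"),
    "stats": ("process", "Compute summary statistics", "report"),
    "report": ("output", "Summary table", None),
}


def outline(workflow, start):
    """Return one text line per node, visiting each node once from `start`."""
    lines = []
    seen = set()
    stack = [start]
    while stack:
        name = stack.pop()
        if name is None or name in seen:
            continue
        seen.add(name)
        kind, label, nxt = workflow[name]
        if isinstance(nxt, dict):
            # Decision node: list every outcome and follow each branch.
            arrows = ", ".join(f"{k} -> {v}" for k, v in nxt.items())
            stack.extend(reversed(list(nxt.values())))
        else:
            arrows = f"-> {nxt}" if nxt else "(end)"
            stack.append(nxt)
        lines.append(f"[{kind}] {name}: {label} {arrows}")
    return lines


for line in outline(WORKFLOW, "raw"):
    print(line)
```

The outline only describes the steps; nothing is executed, which is what distinguishes it from a formal workflow.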

Formal Workflows

Formal, or executable, workflows are also known as analytical pipelines. Each step of the pipeline can be implemented in a different software system. Because the workflow keeps track of every analysis and of the parameters and requirements of each step, it can be stored easily and reused as a single access point for repetitive or new tasks.
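
The idea of tracking every step and its parameters can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how Kepler, VisTrails, or any other workflow system is implemented; `run_pipeline`, the step functions, and the record fields are hypothetical.

```python
import json
import math


def scale(values, factor):
    """Multiply every value by a constant factor."""
    return [v * factor for v in values]


def log_transform(values, base):
    """Take the logarithm of every value in the given base."""
    return [math.log(v, base) for v in values]


def run_pipeline(data, steps):
    """Apply each (function, parameters) pair in order and record what ran."""
    provenance = []
    for func, params in steps:
        data = func(data, **params)
        provenance.append(
            {"step": func.__name__, "parameters": params, "n_values": len(data)}
        )
    return data, provenance


# Hypothetical pipeline definition: the same list can be stored and rerun.
steps = [(scale, {"factor": 2.0}), (log_transform, {"base": 2})]
output, record = run_pipeline([1.0, 2.0, 4.0], steps)
print(json.dumps(record, indent=2))
```

Because the step list and the resulting record are plain data, they can be saved alongside the output and replayed later with the same or different parameters.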

Importance of Workflow Capture

Workflow capture is important within the data lifecycle because it documents how the data are analyzed and transformed after collection. This documentation benefits the researcher by increasing transparency and reproducibility, and it lets others follow the progression of the data analysis later. Clear documentation of the workflow also encourages reuse of the data.

Best Practices: Document Your Workflow

Figure 1. Workflow example flowchart.
  • Process metadata: information about the process to obtain the data output.
    • Include a description of the procedure. Document every manipulation or analysis of the data as a series of conceptual steps.
    • Metadata: create metadata in tandem with data collection rather than afterward.
  • Decide whether your workflow is Informal or Formal.
    • Informal: commented scripts.
      • Include well-documented code.
      • High-level information resides at the top (e.g., description of project, author, data, parameters).
      • Define each section and its dependencies (i.e., what inputs it needs and what it outputs).
      • Construct a complete script which runs without intervention.
    • Formal/Executable
      • Use formal software (e.g., Kepler, VisTrails - see Tools below)
Disclaimer: Any use of trade, product, or firm names is for descriptive purposes only and does not imply endorsement by the U.S. Government.
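
The informal, commented-script practices listed above can be sketched as a short Python template. Everything specific in it is hypothetical: the project description, the sample readings, and the `MAX_VALID_C` threshold are placeholders to be replaced with real project details.

```python
"""
Project:    Example stream-temperature summary (hypothetical)
Author:     <your name>
Data:       In-memory sample readings in degrees Celsius (illustrative only)
Parameters: MAX_VALID_C, upper bound for a plausible reading
"""
import statistics

MAX_VALID_C = 40.0


# Section 1: Acquire.
# Inputs: none. Output: list of raw readings (floats).
def acquire():
    return [12.5, 13.1, 99.9, 12.8]


# Section 2: Quality control.
# Input: raw readings. Output: readings at or below max_valid.
def quality_control(readings, max_valid=MAX_VALID_C):
    return [r for r in readings if r <= max_valid]


# Section 3: Analyze.
# Input: cleaned readings. Output: dict with count and mean.
def analyze(readings):
    return {"n": len(readings), "mean": statistics.mean(readings)}


def main():
    """Run every section in order, with no manual intervention."""
    raw = acquire()
    clean = quality_control(raw)
    return analyze(clean)


if __name__ == "__main__":
    print(main())
```

The header block carries the high-level information, each section states its inputs and outputs, and `main()` runs the whole script from start to finish.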

Tools

  • Kepler
    Description: (free) A scientific workflow application that enables scientists to create, document, and share complex models and analyses. The Java-based application is flexible enough to work with data stored in various formats, whether found locally or over the Internet. Its graphical interface connects data sources with complex analytical components to create an executable representation of the steps required to generate the desired results.
    URL: https://kepler-project.org/
  • VisTrails
    Description: Developed at the University of Utah, VisTrails is open-source scientific workflow and provenance management software that supports simulations, data exploration, and visualization. VisTrails is best known for its comprehensive provenance infrastructure, which maintains a detailed history of the steps and data outputs from each exploratory run.
    URL: http://www.vistrails.org/
  • Taverna
    Description: (free) The Taverna suite of tools brings together a range of features that make it easier for users to find, design, and execute complex workflows and to share them with other people.
    URL: http://www.taverna.org.uk/
  • For additional tools, see the DataONE website
    URL: http://www.dataone.org/software-tools/tags/workflow

Recommended Reading

  • T. McPhillips, S. Bowers, D. Zinn, B. Ludäscher. Scientific workflow design for mere mortals. Fut. Gen. Comp. Sys. 25, 541-551 (2009).
  • Y. Gil, E. Deelman, M. Ellisman, T. Fahringer, G. Fox et al. Examining the Challenges of Scientific Workflows. Computer 40, 24-32 (2007).

References

  • DataONE education modules. Accessed June 13, 2012, at https://www.dataone.org/education-modules
  • The Kepler Project. Accessed June 13, 2012, at https://kepler-project.org/
  • VisTrails. Accessed June 13, 2012, at http://www.vistrails.org/
  • Y. Gil, E. Deelman, M. Ellisman, T. Fahringer, G. Fox et al. Examining the Challenges of Scientific Workflows. Computer 40, 24-32 (2007).
  • B. Ludäscher, I. Altintas, S. Bowers, J. Cummings, T. Critchlow et al. Scientific Process Automation and Workflow Management. Comp. Sci. Ser. Ch. 13 (Chapman and Hall, Boca Raton, 2009).
  • T. McPhillips, S. Bowers, D. Zinn, B. Ludäscher. Scientific workflow design for mere mortals. Fut. Gen. Comp. Sys. 25, 541-551 (2009).
  • B. Ludäscher, I. Altintas, C. Berkley, D. Higgins, E. Jaeger-Frank et al. Scientific workflow management and the Kepler system. Conc. Comp. Prac. Exper., 18 (2006).
  • W. Michener and J. Brunt, Eds. Ecological Data: Design, Management and Processing. (Blackwell, New York, 2000).

U.S. Department of the Interior | U.S. Geological Survey
URL: http://origin-www.usgs.gov/datamanagement/describe/capture.php
Page Last Modified: Tuesday, April 08, 2014