Data Management

Workflow Capture

Metadata communicate the what, where, and when of a dataset; process metadata describe the how.

Capturing the 'How'

Processing and transforming raw data into information can be highly involved, yet researchers rarely document these processes adequately, leaving others unable to reproduce the results. Learn more below about capturing the 'how' using workflow tools.

What is a Workflow?

A scientific workflow is a representation of researchers' data inputs and methods for generating final data products and results.

Informal Workflows 

Informal workflows are basic conceptualizations that describe the inputs, analytical steps, and outputs of a process, which can range from simple to complex. These workflows can include a variety of inputs and outputs, analytical processes that manipulate the data, decision nodes that specify conditions determining the next step, and predefined processes that specify a fixed multi-step sequence.
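These elements can be sketched in a few lines of Python. The observations, quality threshold, and step names below are hypothetical placeholders used only to illustrate the input, analytical step, decision node, and predefined process:

```python
# Minimal sketch of an informal workflow: input -> analysis -> decision -> output.
# All values and names here are hypothetical illustrations.

def read_input(values):
    """Input: raw observations (a plain list standing in for a data file)."""
    return list(values)

def analyze(data):
    """Analytical step: a simple transformation (mean of the observations)."""
    return sum(data) / len(data)

def decide(result, threshold=10.0):
    """Decision node: a condition that determines the next step."""
    return "flag for review" if result > threshold else "accept"

def run_workflow(values):
    """Predefined process: a fixed multi-step sequence producing the output."""
    data = read_input(values)
    result = analyze(data)
    return {"mean": result, "status": decide(result)}

print(run_workflow([4, 8, 15]))  # {'mean': 9.0, 'status': 'accept'}
```

Each function corresponds to one node in the conceptual diagram, so the same structure can later be drawn as a flowchart or formalized in a workflow tool.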

 

Formal Workflows 


Figure 1. Screenshot of a reproducible notebook created using Jupyter Notebooks to demonstrate a workflow for exploring climate data from the USGS GeoDataPortal. This notebook was created by Roland Viger and Rich Signell and is available from https://github.com/reproducible-notebooks/dust_bowl.

Formal or executable workflows, also known as analytical pipelines, allow each step to be implemented in a different software system. Formal workflows can be stored easily and reused as a single access point for repetitive or new tasks, because the workflow keeps track of every analysis and the parameters and requirements of each step. Oftentimes, these workflows can be documented as reproducible notebooks. For example, a Jupyter Notebook titled Dust Bowl (Figure 1) demonstrates a reproducible workflow for exploring climate data from the USGS GeoDataPortal.
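The bookkeeping a formal workflow system performs can be illustrated with a minimal Python sketch that records each step's name and parameters as it executes. The step names and parameters here are hypothetical, and real systems such as Kepler or VisTrails capture far richer provenance:

```python
# Sketch of workflow bookkeeping: each step's name, parameters, and timestamp
# are recorded so a run can be audited or replayed. Step names are hypothetical.
import datetime

log = []  # provenance record: one entry per executed step

def run_step(name, func, **params):
    """Execute one workflow step and record what was done with which parameters."""
    result = func(**params)
    log.append({
        "step": name,
        "params": params,
        "when": datetime.datetime.now().isoformat(timespec="seconds"),
    })
    return result

data = run_step("load", lambda n: list(range(n)), n=5)
total = run_step("sum", lambda xs: sum(xs), xs=data)

for entry in log:
    print(entry["step"], entry["params"])
```

Because every step is routed through `run_step`, the `log` list becomes a single access point describing exactly how the output was produced.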

 

Importance of Workflow Capture 

Workflow capture is important within the data lifecycle because it documents how the data are analyzed and transformed after collection. This documentation benefits the researcher by increasing transparency and reproducibility, and it allows others to follow the progression of the data analysis later. Clear documentation of the workflow also encourages potential reuse of the data.

 

Best Practices: Document Your Workflow 

  • Process metadata: information about the process used to obtain the data output.
     
    • Include a description of the procedure. Document each time you manipulate or analyze the data, represented as a series of conceptual steps. (See Process > Process Documentation)
       
    • Metadata: be sure to create metadata in tandem with the data to be collected.
       
  • Decide whether your workflow is Informal or Formal.
     
    • Informal: commented scripts.
      • Include well-documented code.
      • High-level information resides at the top (e.g., description of project, author, data, parameters).
      • Define each section and its dependencies (i.e., what inputs it needs and what it outputs).
      • Construct a complete script that runs without intervention.
         
    • Formal/Executable
      • Use formal software (e.g., Kepler, VisTrails - see Tools below)
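A commented script following the informal-workflow conventions above might look like the following minimal sketch. The project description, author, sample readings, and cutoff parameter are all hypothetical:

```python
"""Informal workflow script: high-level information resides at the top.

Project:    Example stream-temperature summary (hypothetical)
Author:     A. Researcher
Data:       Hard-coded sample readings stand in for an input file
Parameters: CUTOFF_C, the temperature cutoff in degrees Celsius
"""

CUTOFF_C = 20.0  # parameter used by the filtering section below

# --- Section 1: Load data ------------------------------------------
# Input:  none (sample values inline); Output: list of readings (deg C)
readings = [18.2, 21.5, 19.9, 23.1]

# --- Section 2: Filter ---------------------------------------------
# Input:  readings, CUTOFF_C; Output: readings at or below the cutoff
cool = [r for r in readings if r <= CUTOFF_C]

# --- Section 3: Summarize ------------------------------------------
# Input:  cool; Output: mean of the filtered readings
mean_cool = sum(cool) / len(cool)

# The complete script runs end to end without intervention:
print(f"{len(cool)} of {len(readings)} readings <= {CUTOFF_C} C; mean = {mean_cool:.2f}")
```

Each section states its inputs and outputs in a comment, so a reader can follow the data's progression without running the code.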

 

Tools 

Disclaimer: Any use of trade, product, or firm names is for descriptive purposes only and does not imply endorsement by the U.S. Government.

  • Jupyter Notebooks
    http://jupyter.org/
    The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.
     
  • Kepler
    https://kepler-project.org/
    A free scientific workflow application that enables scientists to create, document, and share complex models and analyses. The Java-based application is extremely flexible for working with data stored in various formats found locally or over the Internet. The program has a user-friendly graphical interface that easily connects data sources with complex analytical components to create an executable representation of the steps required to generate the desired results.
      
  • VisTrails
    https://www.vistrails.org/index.php/Main_Page
    Developed at the University of Utah, VisTrails is an open-source scientific workflow and provenance management tool that provides support for simulations, data exploration, and visualization. VisTrails is most distinguished by its comprehensive provenance infrastructure, which maintains a detailed history of the steps and data outputs from each exploratory run.
     
  • Taverna
    https://taverna.incubator.apache.org/
    The Taverna suite of tools brings together a range of features that make it easier for users to find, design, and execute complex workflows and share them with other people.
     
  • DataONE
    https://www.dataone.org/software-tools/tags/workflow
    The DataONE website provides access to additional data workflow tools.

 

Recommended Reading 

  • T. McPhillips, S. Bowers, D. Zinn, B. Ludäscher. Scientific workflow design for mere mortals. Fut. Gen. Comp. Sys. 25, 541-551 (2009).
  • Y. Gil, E. Deelman, M. Ellisman, T. Fahringer, G. Fox et al. Examining the Challenges of Scientific Workflows. Computer 40, 24-32 (2007).

 

References 

  • DataONE education modules. https://www.dataone.org/education-modules. [Link Verified July 30, 2018]
  • The Kepler Project. https://kepler-project.org/. [Link Verified July 30, 2018]
  • VisTrails. https://www.vistrails.org/index.php/Main_Page. [Link Verified July 30, 2018]
  • Y. Gil, E. Deelman, M. Ellisman, T. Fahringer, G. Fox et al. Examining the Challenges of Scientific Workflows. Computer 40, 24-32 (2007).
  • B. Ludäscher, I. Altintas, S. Bowers, J. Cummings, T. Critchlow et al. Scientific Process Automation and Workflow Management. Comp. Sci. Ser. Ch. 13 (Chapman and Hall, Boca Raton, 2009).
  • T. McPhillips, S. Bowers, D. Zinn, B. Ludäscher. Scientific workflow design for mere mortals. Fut. Gen. Comp. Sys. 25, 541-551 (2009).
  • B. Ludäscher, I. Altintas, C. Berkley, D. Higgins, E. Jaeger-Frank et al. Scientific workflow management and the Kepler system. Conc. Comp. Prac. Exper., 18 (2006).
  • W. Michener and J. Brunt, Eds. Ecological Data: Design, Management and Processing. (Blackwell, New York, 2000).