Metadata communicate the what, where, and when, but process metadata describe the how.
Capturing the 'How'
Processing and transforming raw data into information can be very involved. Researchers rarely document these processes adequately, which leaves others unable to reproduce the results. Learn more below about capturing the 'how' using workflow tools.
What is a Workflow?
A scientific workflow is a representation of researchers' data inputs and methods for generating final data products and results.
Informal workflows are basic conceptualizations that describe the inputs, analytical steps, and outputs of a process, which can range from simple to complex. They can include a variety of inputs and outputs, analytical processes that manipulate the data, decision nodes that specify the conditions determining the next step, and predefined processes that specify a fixed multi-step procedure.
Formal, or executable, workflows are also known as analytical pipelines; they allow each step to be implemented in a different software system. Formal workflows can be stored easily and reused as a single access point for repetitive or new tasks, because the workflow keeps track of every analysis and the parameters and requirements of each step. Often, these workflows are documented as reproducible notebooks. For example, a Jupyter Notebook titled Dust Bowl (Figure 1) was created to demonstrate a reproducible workflow for exploring climate data from the USGS GeoDataPortal.
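As a rough illustration of how an executable workflow can keep track of every analysis and its parameters, the sketch below runs a list of steps in order and logs each one. It is a minimal sketch, not how Kepler, VisTrails, or any other tool works internally; the function names (drop_missing, scale, run_workflow) and the sample data are hypothetical.

```python
import json
from datetime import datetime, timezone


def drop_missing(values):
    """Remove missing (None) readings."""
    return [v for v in values if v is not None]


def scale(values, factor):
    """Multiply every reading by a constant factor."""
    return [v * factor for v in values]


def run_workflow(data, steps):
    """Run each (function, parameters) step in order and log what was done."""
    log = []
    for func, params in steps:
        data = func(data, **params)
        log.append({
            "step": func.__name__,          # which analysis was run
            "parameters": params,           # the settings it was run with
            "n_outputs": len(data),         # size of the result it produced
            "finished_utc": datetime.now(timezone.utc).isoformat(),
        })
    return data, log


# Hypothetical raw readings; None marks a missing value.
raw = [1.0, None, 2.5, 4.0]
steps = [(drop_missing, {}), (scale, {"factor": 10})]

result, provenance = run_workflow(raw, steps)
print(result)
print(json.dumps(provenance, indent=2))
```

Because the steps and their parameters are held in one list, the same workflow can be stored and rerun on new data, and the log records what was done to produce each output.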
Importance of Workflow Capture
Workflow capture is an important part of the data lifecycle because it documents how the data are analyzed and transformed after collection. This documentation benefits the researcher by increasing transparency and reproducibility, and it lets others follow the progression of the data analysis later. Clear documentation of the workflow also encourages reuse of the data.
Best Practices: Document Your Workflow
Process metadata: information about the process to obtain the data output.
Include a description of the procedure. Document each manipulation or analysis of the data as a series of conceptual steps. (See Process > Process Documentation)
Metadata: be sure to create metadata in tandem with the data to be collected.
Decide whether your workflow is Informal or Formal.
Informal: commented scripts.
Include well-documented code.
High-level information resides at the top (e.g., description of project, author, data, parameters).
Define each section and its dependencies (i.e., what inputs it needs and what it outputs).
Construct a complete script which runs without intervention.
Formal: use formal workflow software (e.g., Kepler, VisTrails; see Tools below).
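The informal, commented-script practices listed above can be sketched as follows. This is only one possible layout, assuming a small Python script; the project description, the MAX_VALID_C parameter, the function names, and the embedded sample readings are all hypothetical placeholders.

```python
"""
Project:     Example temperature summary (hypothetical)
Author:      <your name>
Data:        Temperature readings in degrees Celsius (inline sample below)
Parameters:  MAX_VALID_C - readings above this value are treated as sensor errors
"""
import statistics

MAX_VALID_C = 40.0


# --- Section 1: Load ---------------------------------------------------
# Inputs:  none (sample data is embedded so the script runs unattended)
# Outputs: list of raw readings (floats)
def load_readings():
    return [12.0, 12.5, 99.9, 13.0, 12.5]


# --- Section 2: Clean --------------------------------------------------
# Inputs:  raw readings from Section 1, the MAX_VALID_C parameter
# Outputs: list of readings at or below the threshold
def clean_readings(readings, max_valid):
    return [r for r in readings if r <= max_valid]


# --- Section 3: Summarize ----------------------------------------------
# Inputs:  cleaned readings from Section 2
# Outputs: dict with the number of readings and their mean
def summarize(readings):
    return {"n": len(readings), "mean_c": statistics.mean(readings)}


def main():
    """Run every section in order, with no manual intervention."""
    raw = load_readings()
    clean = clean_readings(raw, MAX_VALID_C)
    summary = summarize(clean)
    print(summary)
    return summary


if __name__ == "__main__":
    main()
```

The header carries the high-level information, each section states its inputs and outputs, and main() runs the whole script from start to finish in one call.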
Tools
Disclaimer: Any use of trade, product, or firm names is for descriptive purposes only and does not imply endorsement by the U.S. Government.
Jupyter Notebooks https://jupyter.org/
The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.
Kepler is a free scientific workflow application that enables scientists to create, document, and share complex models and analyses. The Java-based application is flexible enough to work with data stored in various formats, whether local or on the Internet. Its graphical interface connects data sources with complex analytical components to create an executable representation of the steps required to generate the desired results.
Developed at the University of Utah, VisTrails is an open-source scientific workflow and provenance management system that supports simulations, data exploration, and visualization. VisTrails is best known for its comprehensive provenance infrastructure, which maintains a detailed history of the steps and data outputs from each exploratory run.
The Taverna suite of tools brings together a range of features that make it easier for users to find, design, and execute complex workflows and share them with other people.