Workflow Capture
Processing, analyzing, and transforming raw data into information can be a lengthy process. Researchers frequently modify and add to data, but these steps are rarely documented adequately. Without documentation of how a dataset was processed, others may not be able to reproduce the results. Therefore, process metadata are required. Metadata communicate the what, where, and when; process metadata describe the how.
What is a Workflow?
Key Points
- Processing, transforming, and analyzing data should be documented in the form of process metadata.
- Process metadata enable reproducibility of a researcher's analysis of the data.
- A workflow formalizes the process metadata by conceptualizing each component of the analysis through a visual diagram.
- A workflow typically comprises the data inputs, data transformations, and analytical steps that result in the final data output.
- Workflows come in two types:
  - Informal: a visual flow diagram of a series of connected steps.
  - Formal/Executable: workflow diagrams executable in software systems.
- Workflow capture enables transparency, reproducibility, and potential reuse.
A workflow is the formalization of the process metadata, which include a description of the researcher's method. In essence, it conceptualizes the data inputs, transformations (e.g., log transformation), and analytical steps needed to achieve the final data output. Workflows come in two types: Informal and Formal.
Informal Workflows
Informal workflows are basic conceptualizations that describe the input, analytical steps, and output of a process, and they can range from simple to complex. These workflows can include a variety of inputs and outputs, analytical processes that manipulate the data, decision nodes whose conditions determine the next step, and predefined processes that stand for a fixed multi-step procedure.
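The elements of an informal workflow described above can also be written down as a short script. The following is only a minimal sketch: the sample values, the function names, and the `ratio_threshold` parameter are hypothetical and chosen purely for illustration. It shows one input, one decision node, one analytical process, and one predefined process.

```python
import math


def load_input():
    """Input: hypothetical raw measurements (illustrative values only)."""
    return [12.0, 150.0, 3.5, 980.0]


def needs_log_transform(values, ratio_threshold=100.0):
    """Decision node: transform only if the values span a wide range."""
    return max(values) / min(values) > ratio_threshold


def log_transform(values):
    """Analytical process: base-10 log transformation of every value."""
    return [math.log10(v) for v in values]


def summarize(values):
    """Predefined process: a fixed multi-step summary of the values."""
    mean = sum(values) / len(values)
    return {"mean": mean, "min": min(values), "max": max(values)}


def run_workflow():
    """Connect the steps in order and record which ones were taken."""
    values = load_input()
    steps = ["load_input"]
    if needs_log_transform(values):
        values = log_transform(values)
        steps.append("log_transform")
    result = summarize(values)
    steps.append("summarize")
    return steps, result


steps, result = run_workflow()
```

The `steps` list plays the role of the arrows in a flow diagram: it records the path that the data actually followed through the decision node.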
Formal Workflows
Formal, or executable, workflows are also known as analytical pipelines, and they allow each step to be implemented in a different software system. Formal workflows can be stored easily and reused as a single access point for repetitive or new tasks, because the workflow keeps track of every analysis and the parameters and requirements of each step.
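The idea that an executable workflow keeps track of every step and its parameters can be sketched in a few lines of Python. This is not how Kepler or VisTrails work internally; the `Workflow` class, the step functions, and the sample values below are all hypothetical and exist only to illustrate the record-keeping.

```python
import json
import math


class Workflow:
    """A tiny pipeline that logs each executed step and its parameters."""

    def __init__(self, name):
        self.name = name
        self.steps = []  # (label, function, parameters) in execution order
        self.log = []    # filled in by run()

    def add_step(self, label, func, **params):
        self.steps.append((label, func, params))
        return self  # returning self allows the calls to be chained

    def run(self, data):
        self.log = []
        for label, func, params in self.steps:
            data = func(data, **params)
            self.log.append({"step": label, "parameters": params})
        return data

    def describe(self):
        """Return the process record as JSON so it can be stored and reused."""
        return json.dumps({"workflow": self.name, "steps": self.log}, indent=2)


def drop_below(values, minimum):
    return [v for v in values if v >= minimum]


def log_transform(values, base):
    return [math.log(v, base) for v in values]


def mean(values):
    return sum(values) / len(values)


workflow = (
    Workflow("example-analysis")
    .add_step("filter", drop_below, minimum=1.0)
    .add_step("log_transform", log_transform, base=10)
    .add_step("mean", mean)
)
result = workflow.run([0.5, 10.0, 1000.0])
```

Calling `workflow.describe()` after a run yields a JSON record of the steps and parameters, which is one simple form that stored process metadata could take.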
Importance of Workflow Capture
Workflow capture matters within the data lifecycle because it documents how the data are analyzed and transformed after collection. This documentation benefits the researcher by increasing transparency and reproducibility, and it lets others follow the progression of the data analysis later. Clear documentation of the workflow also encourages potential reuse of the data.
Best Practices: Document Your Workflow
- Process metadata: information about the process used to obtain the data output.
  - Include a description of the procedure. Document every manipulation or analysis of the data, and represent the documentation as a series of conceptual steps.
- Metadata: be sure to create metadata in tandem with the data being collected.
- Decide whether your workflow is Informal or Formal.
  - Informal: commented scripts.
    - Include well-documented code.
    - Place high-level information at the top (e.g., description of project, author, data, parameters).
    - Define each section and its dependencies (i.e., what inputs it needs and what it outputs).
    - Construct a complete script that runs without intervention.
  - Formal/Executable
    - Use formal workflow software (e.g., Kepler, VisTrails - see Tools below).
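The recommendations for an informal, commented script might look like the following in practice. This is a sketch under stated assumptions: the project, the inline sample data, the `MAX_VALID_C` parameter, and every function name are hypothetical, and the inline CSV text stands in for a real raw data file.

```python
"""
Project:     Hypothetical stream temperature summary (illustrative only)
Author:      <name>
Data:        Inline sample CSV standing in for a raw data file
Parameters:  MAX_VALID_C - readings above this value are treated as errors
"""
import csv
import io
import statistics

MAX_VALID_C = 40.0

RAW_CSV = """site,temp_c
A,12.5
A,13.1
B,99.9
B,15.2
"""


# --- Section 1: Read input ---
# Input:  CSV text with the columns site and temp_c
# Output: list of (site, temperature) tuples
def read_records(text):
    reader = csv.DictReader(io.StringIO(text))
    return [(row["site"], float(row["temp_c"])) for row in reader]


# --- Section 2: Quality control ---
# Input:  records from Section 1, plus the maximum plausible reading
# Output: records with implausible readings removed
def remove_outliers(records, max_valid):
    return [(site, temp) for site, temp in records if temp <= max_valid]


# --- Section 3: Analysis ---
# Input:  cleaned records from Section 2
# Output: dictionary mapping each site to its mean temperature
def mean_by_site(records):
    grouped = {}
    for site, temp in records:
        grouped.setdefault(site, []).append(temp)
    return {site: statistics.mean(temps) for site, temps in grouped.items()}


def main():
    """Run every section in order, with no manual intervention."""
    records = read_records(RAW_CSV)
    clean = remove_outliers(records, MAX_VALID_C)
    return mean_by_site(clean)


if __name__ == "__main__":
    print(main())
```

The header block carries the high-level information, each section states its inputs and outputs, and `main()` ties the sections together so the whole analysis can be rerun from start to finish.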
Disclaimer: Any use of trade, product, or firm names is for descriptive purposes only and does not imply endorsement by the U.S. Government.
Tools
Recommended Reading
- T. McPhillips, S. Bowers, D. Zinn, B. Ludäscher. Scientific workflow design for mere mortals. Fut. Gen. Comp. Sys. 25, 541-551 (2009).
- Y. Gil, E. Deelman, M. Ellisman, T. Fahringer, G. Fox et al. Examining the Challenges of Scientific Workflows. Computer 40, 24-32 (2007).
References
- DataOne education modules. Accessed June 13, 2012 at https://www.dataone.org/education-modules
- The Kepler Project. Accessed June 13, 2012, at https://kepler-project.org/
- VisTrails. Accessed June 13, 2012, at http://www.vistrails.org/
- Y. Gil, E. Deelman, M. Ellisman, T. Fahringer, G. Fox et al. Examining the Challenges of Scientific Workflows. Computer 40, 24-32 (2007).
- B. Ludäscher, I. Altintas, S. Bowers, J. Cummings, T. Critchlow et al. Scientific Process Automation and Workflow Management. Comp. Sci. Ser. Ch. 13 (Chapman and Hall, Boca Raton, 2009).
- T. McPhillips, S. Bowers, D. Zinn, B. Ludäscher. Scientific workflow design for mere mortals. Fut. Gen. Comp. Sys. 25, 541-551 (2009).
- B. Ludäscher, I. Altintas, C. Berkley, D. Higgins, E. Jaeger-Frank et al. Scientific workflow management and the Kepler system. Conc. Comp. Prac. Exper., 18 (2006).
- W. Michener and J. Brunt, Eds. Ecological Data: Design, Management and Processing. (Blackwell, New York, 2000).