Data Management

Process and Analyze - Closely Related Activities

Important goals for both the processing and analysis of data are maximizing accuracy and productivity while minimizing costs. To that end, researchers should design workflows that are efficient and scripted where possible, and should use methods and software that standardize what is done to the data.

Any method or code developed for a workflow should be clearly written, well documented, modular, and accessible to facilitate research reproducibility. The use of open source solutions and repositories is recommended.

Here are some things to consider when developing methods and workflows:


Data Quality

  • Have a plan for data quality management throughout the workflow
  • Maintain documentation on data quality and provenance
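Data-quality checks like those above are easiest to manage when they are scripted. Below is a minimal sketch of one such check; the field names (`site_id`, `temp_c`) and the plausible temperature range are hypothetical, stand-ins for whatever rules apply to your data.

```python
def check_record(record, required=("site_id", "temp_c"), temp_range=(-50.0, 60.0)):
    """Return a list of quality issues found in one data record (a dict)."""
    issues = []
    # Flag any required field that is absent or empty.
    for field in required:
        if record.get(field) in (None, ""):
            issues.append(f"missing value for {field}")
    # Flag measurements outside a plausible range, or that fail to parse.
    try:
        temp = float(record["temp_c"])
        low, high = temp_range
        if not (low <= temp <= high):
            issues.append(f"temp_c {temp} outside plausible range {temp_range}")
    except KeyError:
        pass  # already reported as missing above
    except ValueError:
        issues.append("temp_c is not numeric")
    return issues
```

Collecting issues into a list, rather than raising on the first failure, lets the same function feed both an automated gate and a human-readable quality report.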


Efficiency

  • Use a scripting language to automate data processing and simplify documentation
  • Use standardized methods and protocols appropriate to your data, when available
  • When possible, support the research by building software or code modules that automatically acquire external datasets and execute processing and analysis code
  • Embrace modular workflows, processes, and code, where component parts are reusable
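A modular, scripted workflow can be as simple as a chain of small reusable functions, one per stage. The sketch below is illustrative: the acquisition step is a stand-in for whatever download or query your research requires, and the analysis is a placeholder statistic.

```python
def acquire():
    """Stand-in for automated acquisition of an external dataset."""
    return [" 3.1 ", "2.7", "bad", "4.4"]  # raw strings, as often received

def clean(raw):
    """Drop records that cannot be parsed as numbers."""
    values = []
    for item in raw:
        try:
            values.append(float(item))
        except ValueError:
            pass  # a real workflow would log the rejected record
    return values

def analyze(values):
    """Compute a simple summary statistic (here, the mean)."""
    return sum(values) / len(values)

def run_pipeline():
    """Compose the stages; each remains independently testable and reusable."""
    return analyze(clean(acquire()))
```

Because each stage has a single responsibility, any one of them can be swapped out (for example, pointing `acquire` at a different data source) without touching the rest of the pipeline.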


Transparency

  • Open source software development is encouraged
  • Readability - code and documentation should be concise and understandable
  • Use published or citable methods, or publish new methods as necessary


Reproducibility

  • Documentation is only one component of this goal
  • Enable anyone (including yourself) to rerun your analyses
    • This is accomplished when an independent reviewer can read your documentation, acquire the requisite datasets, and execute the processing and analysis by running your code or following documented manual steps
  • Use a version control system for your code
  • Use package-management software that installs code together with all of its dependencies
  • Reference your data sources as specifically as possible
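One way to reference data sources as specifically as possible is to record a checksum of each input file alongside the software environment used. The sketch below uses only the Python standard library; the idea of a "provenance record" and its field names are illustrative, not a prescribed format.

```python
import hashlib
import platform
import sys

def file_sha256(path):
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def provenance_record(input_paths):
    """Build a JSON-serializable record of inputs and the runtime environment."""
    return {
        "python_version": sys.version,
        "platform": platform.platform(),
        "inputs": {path: file_sha256(path) for path in input_paths},
    }
```

Saving such a record next to each released result lets a reviewer verify that they have acquired exactly the same input data before rerunning the analysis.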


Accessibility

  • Use open source development environments when possible
  • Make code available through public repositories
  • Ensure that all data used in the research are available


Documentation

  • Maintain documentation on data processing and analysis activities as they happen; reconstructing research activities retrospectively is less efficient and less accurate
  • For release-stage products, include diagrams and other supplemental material, in addition to standard metadata, to assist with understanding or reproducing a process or analysis