USGS Data at Risk: Expanding Legacy Data Inventory and Preservation Strategies

As one of the largest and oldest science organizations in the world, USGS has produced more than a century of earth science data, much of which is currently unavailable to the greater scientific community due to inaccessible or obsolescent media, formats, and technology. Tapping this vast wealth of “dark data” requires 1) a complete inventory of legacy data and 2) methods and tools to effectively evaluate, prioritize, and preserve the data with the greatest potential impact to society. Recognizing these truths and the potential value of legacy data, USGS has been investigating legacy data management and preservation since 2006, including the 2016 “DaR” project, which developed legacy data inventory and evaluation methods and then tested them while preserving and releasing 5 at-risk USGS legacy datasets. This FY17 project will build on those FY16 project successes by:

  1. Improving the legacy data evaluation and prioritization algorithms and increasing user workflow efficiency.
  2. Promoting and expanding the USGS legacy data inventory.
  3. Continuing to preserve and publish critical, at-risk USGS legacy products.

The methods and tools developed through this project will enable USGS Mission Areas, Programs and science centers to efficiently evaluate their legacy data inventories and cost-effectively preserve their highest-priority legacy data products.

Scope

As one of the largest and oldest science organizations in the world, USGS has produced more than a century of earth science data, much of which is currently unavailable to the greater scientific community due to inaccessible or obsolescent media, formats or technology. These “legacy data” are invaluable for extending our historical understanding of the world’s natural resources, landscapes and hazards but lie unused because ultimately they are undiscovered and potentially unknown. Tapping this vast wealth of “dark data” requires 1) a complete inventory of legacy data and 2) methods and tools to effectively evaluate, prioritize, and preserve the data with the greatest potential impact to society.

Recognizing these truths and the potential value of USGS legacy data to modern scientific endeavors, USGS has been investigating methods of inventorying and preserving legacy data since 2006 through projects like the USGS Data Rescue Program (2006-2013), the Legacy Data Inventory and Reporting System (LDIRS; CDI 2014), and the 2016 Developing a USGS Legacy Data Inventory project, also known as the “Data at Risk” or “DaR” project (CDI 2016).

In particular, the FY16 DaR project represents a convergence of earlier USGS legacy data projects, new open data policies, and modern information technology to provide USGS Mission Areas and science centers with legacy data preservation support, tools, and methods. The primary objectives and results of the FY16 DaR project were:

  1. Create a USGS legacy data inventory that catalogs and describes known USGS legacy data sets.

    Results: We used the Legacy Data Inventory and Reporting System (LDIRS) to conduct a USGS-wide “Request for Legacy Data” (RFD) in May 2016. We received 43 submissions from 20 USGS science centers with potential impacts across all USGS Missions. These submissions formed the pool that we evaluated and prioritized in Objective 2 (below) and from which we selected preservation projects in Objective 3 (below). Since the RFD, the Fort Collins Science Center and EROS Center have continued to contribute legacy data to the inventory. The current inventory is available at: https://www.fort.usgs.gov/ldi/legacy-products

  2. Develop methods to evaluate and prioritize legacy data sets based on USGS Mission objectives.

    Results: We developed and tested a method to evaluate the risk and significance factors associated with a legacy data product and a second, algorithm-based method to prioritize legacy data based on its evaluation scores. 

  3. Preserve and release select, priority legacy data sets at risk of damage or loss.

    Results: Applying the methods we developed in FY16 Objective 2 (above), we selected the top 5 legacy data products and partnered with the data owners to preserve and publish them as official USGS data releases. All legacy data products have started the IPDS review and approval process, with official USGS data releases beginning in January 2017.

  4. Develop time and resource estimates to preserve and release legacy data.

    Results: For each of the 5 selected preservation projects, we collected data on the time and resources required to complete each stage of the data management plan (e.g., plan, acquire, process, analyze, preserve, and publish/share). These operational data will better inform future legacy data preservation and release estimates and will be published as case studies.
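
The evaluate-then-prioritize approach described under Objective 2 above could be sketched roughly as follows. This is only an illustration: the factor names, weights, 0-5 rating scale, and the choice to multiply the risk and significance scores are all assumptions made for the example, not the actual LDIRS factors or algorithm.

```python
from dataclasses import dataclass

# Hypothetical factor weights; the real LDIRS factors and weights are not
# given in this document. Each set of weights sums to 1.0.
RISK_WEIGHTS = {"media_obsolescence": 0.5, "single_copy": 0.3, "owner_retiring": 0.2}
SIGNIFICANCE_WEIGHTS = {"mission_relevance": 0.6, "irreplaceable": 0.4}


@dataclass
class Submission:
    title: str
    risk: dict          # factor name -> rating from 0 (none) to 5 (severe)
    significance: dict  # factor name -> rating from 0 (low) to 5 (high)


def weighted_score(ratings, weights):
    """Weighted sum of the ratings; factors that were not rated count as 0."""
    return sum(weight * ratings.get(name, 0) for name, weight in weights.items())


def priority(submission):
    """Multiply risk by significance (an assumed rule) so that a product
    must score on both dimensions to rank highly."""
    return (weighted_score(submission.risk, RISK_WEIGHTS)
            * weighted_score(submission.significance, SIGNIFICANCE_WEIGHTS))


def prioritize(submissions):
    """Return the submissions ordered from highest to lowest priority."""
    return sorted(submissions, key=priority, reverse=True)
```

Under these assumptions, a product that is highly significant but stored safely ranks below one that is both significant and at risk, which matches the stated goal of preserving the data with the greatest potential impact first.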

This FY17 CDI project seeks to build on the DaR FY16 project successes by:

  1. Refining the legacy data evaluation and prioritization algorithms; increasing LDIRS user workflow efficiency.
  2. Promoting and expanding the USGS legacy data inventory to better understand USGS legacy data at risk needs.
  3. Continuing to study, preserve and publish at-risk, mission-critical USGS legacy data.

Beyond the scientific importance of preserving and publicly releasing new USGS legacy data, successfully completing these FY17 project objectives will establish LDIRS as a simple, effective tool to manage the growing USGS legacy data inventory, enabling USGS Mission Areas, Programs and science centers to efficiently evaluate their legacy data inventories and cost-effectively preserve and publish their highest-priority legacy data products.

Technical Approach

Objective 1: Refining the legacy data evaluation and prioritization algorithms; increasing LDIRS user workflow efficiency.

Based on FY16 DaR project data and LDIRS user feedback, we have identified 3 significant changes that will improve the legacy data inventory, evaluation, and prioritization processes for USGS staff:

  1. Expand the library of risk and significance factors and refine risk and significance scores and scoring algorithms;
  2. Aggregate the legacy data submission, the risk and significance evaluation, and the inventory prioritization processes into a single process; and
  3. Create Mission Area, Program and science center inventory dashboards that display multiple LDIRS reports in a single user display.

Objective 2: Promoting and expanding the USGS legacy data inventory to better understand USGS legacy data-at-risk needs.

The FY16 DaR project focused on developing and validating legacy data inventory, evaluation and reporting methods. This work also resulted in engaging, productive community discussions that validated the utility and need for a USGS legacy data inventory. With those positive results to build on, Objective 2 of this project will expand the current USGS legacy data inventory.

To do this we will:

  1. Provide in-person legacy data inventory training and support to two USGS science centers who will conduct inventories of their legacy data collections. Results from these inventories and a third, previous CDI-partnered inventory (Fort Collins Science Center, 2015) will be used to develop case studies and training efforts below.
  2. Develop legacy data inventory case studies that describe the real-world experiences of the three USGS science centers that conducted legacy data inventories. Case studies will be publicly available from the LDIRS web site, as well as presented at the “Legacy Data: Challenges and Solutions” session of the 2017 CDI Workshop.
  3. Create short instructional training videos for USGS data managers, explaining the submission, evaluation, and prioritization processes for legacy data inventories. Training videos and documentation will be available to USGS staff via the LDIRS web site.
  4. Develop a quarterly, opt-in USGS legacy data inventory report that provides USGS managers and data stewards a broad overview of the current USGS inventory from the perspective of a Mission Area, Program and/or science center.
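
As a rough illustration of the inventory report in item 4, the sketch below groups inventory records by an organizational field and summarizes each group. The record fields (`science_center`, `title`, `priority`) and the summary contents are hypothetical; the actual LDIRS report format is not described here.

```python
from collections import defaultdict


def inventory_report(records, group_by="science_center"):
    """Summarize inventory records per group (e.g., per science center).

    `records` is a list of dicts with hypothetical keys: the grouping
    field, "title", and a numeric "priority" score.
    """
    groups = defaultdict(list)
    for record in records:
        groups[record[group_by]].append(record)

    report = {}
    for name, group in groups.items():
        top = max(group, key=lambda r: r["priority"])
        report[name] = {
            "count": len(group),                        # products inventoried
            "top_priority_title": top["title"],         # highest-scoring product
            "mean_priority": sum(r["priority"] for r in group) / len(group),
        }
    return report
```

Passing a different `group_by` value (for example a hypothetical `mission_area` field) would give the same summary from the perspective of a Mission Area or Program.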

Objective 3: Continuing to identify, preserve and study at-risk, mission-critical USGS legacy products.

Undeniably, preserving and publishing at-risk USGS legacy data was the most visible and powerful aspect of the FY16 DaR project. Case in point: the strongest feedback we received on this proposal’s FY17 statement of interest was a set of specific requests to maximize the amount of funding for at-risk data preservation, which we have done. In addition, through our study of the time and resources required to preserve and publish legacy data, we identified patterns and efficiencies that led to FY17 improvements for users (see “Objective 1” above). Therefore, project Objective 3 is designed to:

  • Identify and prioritize at-risk USGS legacy data by conducting a FY17 USGS “Request for Legacy Data” (RFD).
  • Test the USGS Exit Survey process as a method of identifying at-risk USGS legacy data by conducting exit interviews with two career USGS staff members and inventorying and evaluating their legacy data sets.
  • Use the legacy data inventory tools and methods developed in FY16 to select up to four more mission-critical, at-risk USGS data sets to preserve and publish in 2017.

The FORT legacy data steward will ensure that all legacy data releases from this project will:

  • have complete, compliant FGDC-CSDGM metadata
  • address the OSTP memorandum (Increasing Access to the Results of Federally Funded Scientific Research), OMB memorandum M-13-13 (Open Data Policy – Managing Information as an Asset), and Executive Order 13642 (Making Open and Machine Readable the New Default for Government Information)
  • promote project successes and milestones using, at a minimum, USGS regional highlights
  • produce a CDI final report chapter describing the data set(s) released and a summary of time and resources required to complete the release.

Project Timeline

Project Phase | Status
Personal Data Inventory Case Study: Susan Skagen (USGS-FORT) | Complete: May 2017
Personal Data Inventory Case Study: Kathryn Thomas (USGS-SBSC) | Complete: July 2017
LDIRS Technical Improvements | Complete: August 2017
Science Center Inventory Case Study: USGS-GLSC | Complete: September 2017
Science Center Inventory Case Study: USGS-UMESC | Complete: October 2017
Science Center Inventory Case Study: USGS-FORT | In Progress
2017 DaR Request for Legacy Data | Complete: September 2017
Migrating Bird Survey Data Along the San Pedro River and its Tributaries, Southeastern Arizona, 1989-1994 | Complete: January 2018
Crest Stage Gage Site Visit Data, Montana, 1955-2016 | Complete: February 2018
Central Mojave Desert Vegetation Mapping Project, California, 1997-1999 | Peer Review
Golden Eagle International Radio Tracking Data, North America, 1992-1999 | Peer Review
Chironomid Specimen Data, The Great Lakes (USA), 1957-2017 | Preservation

Project Report

We refined the LDIRS prioritization algorithms to better assess temporal, geographic, and taxonomic extents, resulting in clearer prioritization scores with better intra-record differentiation. In addition, we incorporated the data assessment scoring into the data entry workflow, resulting in real-time prioritization.
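
One way to picture extent-based scoring that stays current as a record is entered is the sketch below. It is a minimal illustration only: the field names, the caps used to normalize each extent (50 years, 100,000 km², 100 taxa), and the simple averaging are assumptions, not the LDIRS implementation.

```python
from dataclasses import dataclass


@dataclass
class ExtentRecord:
    title: str
    start_year: int
    end_year: int
    area_km2: float
    taxa_count: int

    @property
    def extent_score(self):
        """Average of three extents, each normalized to the range 0-1.

        Computed on access, so the score reflects the current field
        values whenever the record is edited during data entry.
        The normalization caps are hypothetical.
        """
        temporal = min((self.end_year - self.start_year + 1) / 50, 1.0)
        geographic = min(self.area_km2 / 100_000, 1.0)
        taxonomic = min(self.taxa_count / 100, 1.0)
        return round((temporal + geographic + taxonomic) / 3, 3)
```

Because the score is a computed property rather than a stored value, no separate scoring step is needed after a record is submitted or edited.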

We used several methods to continue to promote and expand the USGS legacy data inventory. First, we worked with two career scientists (Susan Skagen; Kathryn Thomas) and two science centers (GLSC and UMESC) to inventory their scientific records as a means of identifying legacy data. Second, in September we conducted a second USGS-wide “Request for Legacy Data” to further expand the total LDIRS inventory. Third, we continued to communicate the DaR project accomplishments and methods through USGS groups such as CDI, the FSPAC Data Preservation Subcommittee, the Data at Risk Working Group, the National Geospatial and Geophysical Data Preservation Program (NGGDPP) and the USGS Step-Up Program. In particular, the USGS Step-Up Program used the LDIRS prioritization reports to select the North American Bat Banding Program data, an unfinished CDI-funded preservation project from 2014, for its FY18 preservation work.

During the FY16 and FY17 funding periods, the DaR project selected 13 high-priority preservation projects to validate best practices for preserving and publishing USGS legacy data and software. To date, 6 have been published, 3 are in peer review, and 3 are completing data processing. Upon completion, each project is summarized as a case study that describes the methods validated and lessons learned.