USGS Data at Risk: Expanding Legacy Data Inventory and Preservation Strategies Completed
As one of the largest and oldest science organizations in the world, USGS has produced more than a century of earth science data, much of which is currently unavailable to the greater scientific community due to inaccessible or obsolescent media, formats, and technology. Tapping this vast wealth of “dark data” requires 1) a complete inventory of legacy data and 2) methods and tools to effectively evaluate, prioritize, and preserve the data with the greatest potential impact to society. Recognizing these truths and the potential value of legacy data, USGS has been investigating legacy data management and preservation since 2006, including the 2016 “DaR” project, which developed legacy data inventory and evaluation methods and then tested them while preserving and releasing 5 at-risk USGS legacy datasets. This FY17 project will build on those FY16 project successes by:
- Improving the legacy data evaluation and prioritization algorithms and increasing user workflow efficiency.
- Promoting and expanding the USGS legacy data inventory.
- Continuing to preserve and publish critical, at-risk USGS legacy products.
The methods and tools developed through this project will enable USGS Mission Areas, Programs and science centers to efficiently evaluate their legacy data inventories and cost-effectively preserve their highest-priority legacy data products.
Principal Investigators: Anthony L Everette, Tara M Bell
Scope
As one of the largest and oldest science organizations in the world, USGS has produced more than a century of earth science data, much of which is currently unavailable to the greater scientific community due to inaccessible or obsolescent media, formats or technology. These “legacy data” are invaluable for extending our historical understanding of the world’s natural resources, landscapes and hazards but lie unused because ultimately they are undiscovered and potentially unknown. Tapping this vast wealth of “dark data” requires 1) a complete inventory of legacy data and 2) methods and tools to effectively evaluate, prioritize, and preserve the data with the greatest potential impact to society.
Recognizing these truths and the potential value of USGS legacy data to modern scientific endeavors, USGS has been investigating methods of inventorying and preserving legacy data since 2006 through projects like the USGS Data Rescue Program (2006-2013), the Legacy Data Inventory and Reporting System (LDIRS; CDI 2014), and the 2016 Developing a USGS Legacy Data Inventory project, also known as the “Data at Risk” or “DaR” project (CDI 2016).
In particular, the FY16 DaR project represents a convergence of earlier USGS legacy data projects, new open data policies, and modern information technology to provide USGS Mission Areas and science centers with legacy data preservation support, tools, and methods. The primary objectives and results of the FY16 DaR project were:
- Objective 1: Create a USGS legacy data inventory that catalogs and describes known USGS legacy data sets.
  Results: We used the Legacy Data Inventory and Reporting System (LDIRS) to conduct a USGS-wide “Request for Legacy Data” (RFD) in May 2016. We received 43 submissions from 20 USGS science centers with potential impacts across all USGS Missions. These submissions formed the pool that we evaluated and prioritized in Objective 2 (below) and selected from in Objective 3 (below). Since the RFD, the Fort Collins Science Center and EROS Center have continued to contribute legacy data to the inventory. The current inventory is available at: https://www.fort.usgs.gov/ldi/legacy-products (content no longer available)
- Objective 2: Develop methods to evaluate and prioritize legacy data sets based on USGS Mission objectives.
  Results: We developed and tested a method to evaluate the risk and significance factors associated with a legacy data product and a second, algorithm-based method to prioritize legacy data based on its evaluation scores.
- Objective 3: Preserve and release select, priority legacy data sets at risk of damage or loss.
  Results: Applying the methods we developed in FY16 Objective 2 (above), we selected the top 5 legacy data products and partnered with the data owners to preserve and publish them as official USGS data releases. All legacy data products have started the IPDS review and approval process, with official USGS data releases beginning in January 2017.
- Objective 4: Develop time and resource estimates to preserve and release legacy data.
  Results: For each of the 5 selected preservation projects, we collected data on the time and resources required to complete each stage of the data management plan (e.g., plan, acquire, process, analyze, preserve, and publish/share). These operational data will better inform future legacy data preservation and release estimates and will be published as case studies.
This FY17 CDI project seeks to build on the DaR FY16 project successes by:
- Refining the legacy data evaluation and prioritization algorithms; increasing LDIRS user workflow efficiency.
- Promoting and expanding the USGS legacy data inventory to better understand USGS legacy data-at-risk needs.
- Continuing to study, preserve and publish at-risk, mission-critical USGS legacy data.
Beyond the scientific importance of preserving and publicly releasing new USGS legacy data, successfully completing these FY17 project objectives will establish LDIRS as a simple, effective tool to manage the growing USGS legacy data inventory, enabling USGS Mission Areas, Programs and science centers to efficiently evaluate their legacy data inventories and cost-effectively preserve and publish their highest-priority legacy data products.
Technical Approach
Objective 1: Refining the legacy data evaluation and prioritization algorithms; increasing LDIRS user workflow efficiency.
Based on FY16 DaR project data and LDIRS user feedback, we have identified three significant changes that will improve the legacy data inventory, evaluation, and prioritization processes for USGS staff:
- Expand the library of risk and significance factors and refine risk and significance scores and scoring algorithms;
- Aggregate the legacy data submission, the risk and significance evaluation, and the inventory prioritization processes into a single process; and
- Create Mission Area, Program and science center inventory dashboards that display multiple LDIRS reports in a single user display.
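As a rough illustration of the first two improvements, the sketch below scores a submission on risk and significance factors and ranks an inventory in a single pass. The factor names, weights, 0-5 scoring scale, and the product rule that combines the two scores are all hypothetical; the actual LDIRS factor library and scoring algorithm are not described here.

```python
from dataclasses import dataclass, field

# Hypothetical factor weights (each set sums to 1); not the real LDIRS factor library.
RISK_WEIGHTS = {
    "media_obsolescence": 0.5,
    "format_obsolescence": 0.3,
    "custodian_retiring": 0.2,
}
SIGNIFICANCE_WEIGHTS = {"mission_relevance": 0.6, "uniqueness": 0.4}


@dataclass
class Submission:
    title: str
    risk: dict = field(default_factory=dict)          # factor -> score, 0-5 (assumed scale)
    significance: dict = field(default_factory=dict)  # factor -> score, 0-5 (assumed scale)


def weighted_score(scores, weights):
    """Weighted mean of factor scores; factors that were not scored count as 0."""
    return sum(weight * scores.get(name, 0) for name, weight in weights.items())


def priority(sub):
    """Multiply risk by significance so that only data high on both rank near the top."""
    return weighted_score(sub.risk, RISK_WEIGHTS) * weighted_score(
        sub.significance, SIGNIFICANCE_WEIGHTS
    )


def rank(inventory):
    """Return submissions ordered from highest to lowest priority."""
    return sorted(inventory, key=priority, reverse=True)


inventory = [
    Submission(
        "Dataset B",
        risk={"media_obsolescence": 2, "format_obsolescence": 1},
        significance={"mission_relevance": 5, "uniqueness": 3},
    ),
    Submission(
        "Dataset A",
        risk={"media_obsolescence": 5, "format_obsolescence": 4, "custodian_retiring": 5},
        significance={"mission_relevance": 4, "uniqueness": 5},
    ),
]

for sub in rank(inventory):
    print(f"{sub.title}: {priority(sub):.2f}")
```

Because submission, evaluation, and ranking share one data structure, a new submission can be scored and placed in the ranked inventory without a separate evaluation step.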
Objective 2: Promoting and expanding the USGS legacy data inventory to better understand USGS legacy data-at-risk needs.
The FY16 DaR project focused on developing and validating legacy data inventory, evaluation and reporting methods. This work also resulted in engaging, productive community discussions that validated the utility and need for a USGS legacy data inventory. With those positive results to build on, Objective 2 of this project will expand the current USGS legacy data inventory.
To do this we will:
- Provide in-person legacy data inventory training and support to two USGS science centers that will conduct inventories of their legacy data collections. Results from these inventories and from a third, previous CDI-partnered inventory (Fort Collins Science Center, 2015) will be used to develop the case studies and training efforts described below.
- Develop legacy data inventory case studies that describe the real-world experiences of the three USGS science centers that conducted legacy data inventories. Case studies will be publicly available from the LDIRS web site, as well as presented at the “Legacy Data: Challenges and Solutions” session of the 2017 CDI Workshop.
- Create short instructional training videos for USGS data managers, explaining the submission, evaluation, and prioritization processes for legacy data inventories. Training videos and documentation will be available to USGS staff via the LDIRS web site.
- Develop a quarterly, opt-in USGS legacy data inventory report that provides USGS managers and data stewards a broad overview of the current USGS inventory from the perspective of a Mission Area, Program and/or science center.
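The opt-in report in the last item amounts to summarizing the same inventory from different organizational perspectives. A minimal sketch of that grouping follows; the record fields, group names, and priority values are placeholders rather than LDIRS data.

```python
from collections import defaultdict

# Placeholder inventory records; field names and values are illustrative only.
records = [
    {"title": "Dataset A", "mission_area": "Mission Area X", "science_center": "Center 1", "priority": 20.7},
    {"title": "Dataset B", "mission_area": "Mission Area X", "science_center": "Center 2", "priority": 5.5},
    {"title": "Dataset C", "mission_area": "Mission Area Y", "science_center": "Center 2", "priority": 12.0},
]


def summarize(records, group_by):
    """Group records by one field and report each group's size and top-priority title."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[group_by]].append(rec)

    summary = {}
    for key, recs in groups.items():
        top = max(recs, key=lambda r: r["priority"])
        summary[key] = {"count": len(recs), "top_priority": top["title"]}
    return summary


# The same inventory viewed from two perspectives.
for perspective in ("mission_area", "science_center"):
    print(perspective, summarize(records, perspective))
```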
Objective 3: Continuing to identify, preserve and study at-risk, mission-critical USGS legacy products.
Undeniably, preserving and publishing at-risk USGS legacy data was the most visible and powerful aspect of the FY16 DaR project. Case in point: the strongest feedback we received on this proposal’s FY17 statement of interest consisted of specific requests to maximize the amount of funding for at-risk data preservation, which we have done. In addition, through our study of the time and resources required to preserve and publish legacy data, we identified patterns and efficiencies that provided FY17 improvements for users (see “Objective 1” above). Therefore, project Objective 3 is designed to:
- Identify and prioritize at-risk USGS legacy data by conducting a FY17 USGS “Request for Legacy Data” (RFD).
- Test the USGS Exit Survey process as a method of identifying at-risk USGS legacy data by conducting exit interviews with two career USGS staff and inventorying and evaluating their legacy data sets.
- Use the legacy data inventory tools and methods developed in FY16 to select up to four more mission-critical, at-risk USGS data sets to preserve and publish in 2017.
The FORT legacy data steward will ensure that all legacy data releases from this project will:
- have complete, compliant FGDC-CSDGM metadata
- address the OSTP memorandum (Increasing Access to the Results of Federally Funded Scientific Research), OMB memorandum M-13-13 (Open Data Policy – Managing Information as an Asset), and Executive Order 13642 (Making Open and Machine Readable the New Default for Government Information).
- promote project successes and milestones using, at a minimum, USGS regional highlights.
- produce a CDI final report chapter describing the data set(s) released and a summary of time and resources required to complete the release.
Project Timeline
Project Phase | Status |
---|---|
Personal Data Inventory Case Study: Susan Skagen (USGS-FORT) | Complete: May 2017 |
Personal Data Inventory Case Study: Kathryn Thomas (USGS-SBSC) | Complete: July 2017 |
LDIRS Technical Improvements | Complete: August 2017 |
Science Center Inventory Case Study: USGS-GLSC | Complete: September 2017 |
Science Center Inventory Case Study: USGS-UMESC | Complete: October 2017 |
2017 DaR Request for Legacy Data | Complete: September 2017 |
Migrating Bird Survey Data Along the San Pedro River and its Tributaries, Southeastern Arizona, 1989-1994 | Complete: January 2018 |
Crest Stage Gage Site Visit Data, Montana, 1955-2016 | Complete: February 2018 |
Central Mojave Desert Vegetation Mapping Project, California, 1997-1999 | Complete: November 2018 |
Golden Eagle (Aquila chrysaetos) Satellite Telemetry and Observational Data, Western North America, 1993-1997 | Complete: November 2020 |
Project Report
We refined the LDIRS prioritization algorithms to better assess temporal, geographic, and taxonomic extents, resulting in clearer prioritization scores with better intra-record differentiation. In addition, we incorporated the data assessment scoring into the data entry workflow, resulting in real-time prioritization.
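To make the idea of scoring at data entry concrete, the sketch below computes an extent-based score when a record is added and keeps the inventory sorted, so the ranking is current after every entry. The point values, caps, and scope categories are assumptions for illustration; the actual LDIRS formulas are not given here.

```python
# Hypothetical scoring rules; not the actual LDIRS formulas.
GEOGRAPHIC_POINTS = {"site": 1, "regional": 2, "national": 3}


def extent_score(start_year, end_year, geographic_scope, taxa_count):
    """Score a record's extents; larger temporal, geographic, and taxonomic extents score higher."""
    years = end_year - start_year + 1
    temporal = min(years / 10, 5)   # 1 point per decade, capped at 5 (assumed)
    geographic = GEOGRAPHIC_POINTS[geographic_scope]
    taxonomic = min(taxa_count, 5)  # 1 point per taxon, capped at 5 (assumed)
    return temporal + geographic + taxonomic


class Inventory:
    """Holds (score, title) pairs sorted from highest to lowest score."""

    def __init__(self):
        self.records = []

    def add(self, title, **extents):
        score = extent_score(**extents)  # scoring happens as part of data entry
        self.records.append((score, title))
        self.records.sort(reverse=True)
        return score


inv = Inventory()
score_b = inv.add("Dataset B", start_year=1989, end_year=1994, geographic_scope="site", taxa_count=3)
score_a = inv.add("Dataset A", start_year=1955, end_year=2016, geographic_scope="regional", taxa_count=0)
print(inv.records)
```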
We used several methods to continue to promote and expand the USGS legacy data inventory. First, we worked with two career scientists (Susan Skagen; Kathryn Thomas) and two science centers (GLSC and UMESC) to inventory their scientific records as a means of identifying legacy data. Second, in September 2017 we conducted a second USGS-wide “request for legacy data” to further expand the total LDIRS inventory. Third, we continued to communicate the DaR project accomplishments and methods through USGS groups such as CDI, the FSPAC Data Preservation Subcommittee, the Data at Risk Working Group, the National Geospatial and Geophysical Data Preservation Program (NGGDPP) and the USGS Step-Up Program. In particular, the USGS Step-Up program used the LDIRS prioritization reports to select the North American Bat Banding Program data, an unfinished CDI-funded preservation project from 2014, for their FY18 preservation work.
During the FY16 and FY17 funding periods, the DaR project has selected 13 high-priority preservation projects to validate best practices for preserving and publishing USGS legacy data and software. To date, 6 have been published, 3 are in peer review, and 3 are completing data processing. Upon completion, each project is summarized as a case study that describes the methods validated and lessons learned.
- Source: USGS Sciencebase (id: 58b5ddc3e4b01ccd54fde3fa)
Developing a USGS Legacy Data Inventory to Preserve and Release Historical USGS Data
North American Bat Data Integration
Mining the USGS Data Landscape
Central Mojave Desert Vegetation Mapping Project, California, 1997-1999: Plots Points and Photographs
Magnetotelluric Data from the San Andreas Fault, Parkfield CA, 1990
Migrating Bird Survey Data Along the San Pedro River and its Tributaries, Southeastern Arizona, 1989-1994
Community for Data Integration 2017 annual report
Software to Process and Preserve Legacy Magnetotelluric Data
Science
Developing a USGS Legacy Data Inventory to Preserve and Release Historical USGS Data
Legacy data (n) - Information stored in an old or obsolete format or computer system that is, therefore, difficult to access or process. (Business Dictionary, 2016) For over 135 years, the U.S. Geological Survey has collected diverse information about the natural world and how it interacts with society. Much of this legacy information is one-of-a-kind and in danger of being lost forever through d…
North American Bat Data Integration
The purpose of this project was to integrate the Bat Banding Program data (1932-1972) and the U.S. and Canada diagnostic data for white-nose syndrome with the USGS Bat Population Data (BPD) Project and provide the bat research community with secure, role-based access to these previously unavailable datasets. The objectives of this project were to: 1) integrate WNS diagnostic data into the BPD (htt…
Mining the USGS Data Landscape
The scientific legacy of the USGS is the data and the scientific knowledge derived from it gathered over 130 years of research. However, it is widely assumed, and in some cases known, that high quality data, particularly legacy data critical for large time-scale analyses such as climate change and habitat change, is hidden away in case files, file cabinets, and hard drives housed in USGS science c…
Data
Central Mojave Desert Vegetation Mapping Project, California, 1997-1999: Plots Points and Photographs
The Mojave Plots Points data are 1,219 plot locations in the Central Mojave Desert where field data were recorded and photographs were taken from 1997-1999 to provide context for the classification of the Central Mojave Desert into various vegetation classes. The 1,219 plot locations in the plots points shapefile (plots_points.shp) are each assigned a unique identifier called the FinalPlotCode. T…
Magnetotelluric Data from the San Andreas Fault, Parkfield CA, 1990
The U.S. Geological Survey (USGS) Geology, Geophysics and Geochemistry Science Center (GGGSC) collaborated with the USGS Data at Risk (DaR) team to preserve and release a subset of magnetotelluric data from the San Andreas Fault in Parkfield, California. The San Andreas Fault data were collected by the Branch of Geophysics, a precursor to the now GGGSC, between 1989 and 1994. The magnetotelluric d…
Migrating Bird Survey Data Along the San Pedro River and its Tributaries, Southeastern Arizona, 1989-1994
Data files in this data series represent migrating bird count and habitat information collected during 1989, 1991, 1993, and 1994 field seasons at 13 riparian sites along the San Pedro River and its tributaries in southeastern Arizona, USA. At each site observations were made at up to 20 points, separated by 100 m arrayed along the riparian zone. Observation periods started at 20 minutes after sun…
Publications
Community for Data Integration 2017 annual report
The Community for Data Integration (CDI) is a group that helps members grow their expertise on all aspects of working with scientific data. The CDI’s activities advance data and information integration capabilities in the U.S. Geological Survey and in the wider Earth and biological sciences. This annual report describes the presentations, activities, collaboration areas, workshop, and other CDI-sp…
Authors: Leslie Hsu, Madison L. Langseth
Software
Software to Process and Preserve Legacy Magnetotelluric Data
The USGS Crustal Geophysics and Geochemistry Science Center (CGGSC) collaborated with the USGS Data at Risk (DaR) team to preserve and release a subset of magnetotelluric data from the San Andreas Fault in Parkfield, California. The San Andreas Fault data were collected by the Branch of Geophysics, a precursor to the now CGGSC, between 1989 and 1994. The magnetotelluric data selected for this pres…