Mining the USGS Data Landscape



The scientific legacy of the USGS is the data, and the scientific knowledge derived from it, gathered over 130 years of research. However, it is widely assumed, and in some cases known, that high-quality data, particularly legacy data critical for long time-scale analyses of topics such as climate change and habitat change, is hidden away in case files, file cabinets, and hard drives housed in USGS science centers and field stations (both hereafter “science centers”). Many USGS science centers, such as the Fort Collins Science Center, have long-established research histories, are known repositories of datasets, and conduct periodic “file room cleanout” days that establish and enforce some minimal data lifecycle management and maintain a cursory inventory of maintainable data: data of high enough interest or impact that it should be kept, at a minimum, in a readable format for future access and use. But science centers currently lack a clear understanding of data lifecycle management best practices and simple inventory tools to manage their data through its lifecycle. We proposed testing the CDI data lifecycle framework by applying it to a handful of known datasets and documenting the considerations and requirements for applying the framework effectively. Further, we proposed creating a simple “USGS Data Mine” tool that enables science centers to conduct and maintain their data inventories while contributing to the growing USGS-wide data landscape.

Objectives

1. Validate and document the application of the CDI Data Management Lifecycle framework

Through cooperation with the CSAS ‘Species Occurrence Records and Data Transformation Processes’ project, we were able to expand the scope of this objective to include additional bat and white-tailed kite data, increasing our sample size for estimating resource requirements from 3 datasets to 9. Details of the progress for each dataset follow.

  • Southeastern Arizona riparian bird and habitat data (Completed)

    Data has been quality controlled and documented and is being prepared for submission as a USGS Digital Data Series product.

  • Texas, Kansas, Oklahoma, South Dakota, and North Dakota wetlands and shorebird data (Processing)

    The original WB3 format for these data requires Quattro Pro 7, for which we acquired a software license. The files have been converted to CSV and XLSX formats for further processing.

  • Eastern Colorado prairie bird and habitat data (Processing)

    The original WB3 format for these data requires Quattro Pro 7, for which we acquired a software license. The files have been converted to CSV and XLSX formats for further processing.

  • Bats of the Rocky Mountain Arsenal Mist Net Data (Completed)

    Data and metadata have been reviewed and archived in the Fort Collins Science Center ScienceBase community and are ready for FSP approval.

  • Bat Inventory of Ouray National Wildlife Refuge Mist Net Data (Completed)

    Data and metadata have been reviewed and archived in the Fort Collins Science Center ScienceBase community and are ready for FSP approval.

  • Bats of Mesa Verde National Monument Mist Net Data (Processing)

    Data has been reviewed and the metadata is being completed. Both have been archived in the Fort Collins Science Center ScienceBase community.

  • White-tailed Kite Historic Data (Processing)

    We are working with the original USGS PI and the museums that contributed data to this dataset to verify data agreements and to develop metadata. The data itself has been converted from its original Quattro Pro 6 format to XLSX and CSV for final data processing.

  • White-tailed Kite Physiological Data (Processing)

    Metadata for this dataset is complete, and the data has been converted from its original Quattro Pro 6 format to XLSX and CSV for processing.

  • White-tailed Kite Morphological Data (Processing)

    Metadata for this dataset is complete, and the capture data has been converted from its original Quattro Pro format to XLSX and CSV for processing.

  • Chapter 1 of the project’s completion report is in draft form.

2. Inventory, prioritize and estimate the cost of integrating a USGS data mine

  • Conducted an inventory of FORT datasets based on metadata produced between 1994 and 2011. We have identified 440 potential datasets to date. All inventoried records have been migrated to the Fort Collins Science Center’s ScienceBase community as part of a ‘Data Mine’ space, which is restricted to internal FORT data stewards and principal investigators. Once datasets are completely processed and approved for public distribution, those ScienceBase dataset items will be moved to the FORT community’s public ‘Datasets’ space for distribution.

  • FORT Case Files were also inventoried where dataset metadata was associated with active FORT staff. Fewer than a dozen additional datasets have been identified through this review of Case Files; however, this process is ongoing.

  • Initial processing-time estimates have been assigned to datasets similar to those processed under Objective 1. These estimates are updated as dataset processing times are analyzed. The current average is 24 hours of processing per dataset.

  • Chapter 2 of the project’s completion report is in draft form.
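
The estimating approach described above (a running average of processing hours, applied to the remaining inventory) can be sketched in a few lines of Python. This is an illustrative sketch only, not the project's actual method: the dataset names and logged hours are hypothetical values chosen so that they average to the reported 24 hours, and applying that average to all 440 inventoried datasets is an assumption, since the source states only that estimates apply to datasets similar to those in Objective 1.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ProcessedDataset:
    """One dataset whose processing effort has been logged (hypothetical record)."""
    name: str
    hours: float  # staff hours spent processing; illustrative values only


def average_hours(processed):
    """Running average of processing hours across completed datasets."""
    return mean(d.hours for d in processed)


def projected_hours(dataset_count, avg_hours):
    """Naive projection: every remaining dataset costs the current average."""
    return dataset_count * avg_hours


# Hypothetical logged hours, chosen so the mean matches the reported 24 hours.
processed = [
    ProcessedDataset("dataset A", 30.0),
    ProcessedDataset("dataset B", 16.0),
    ProcessedDataset("dataset C", 26.0),
]

avg = average_hours(processed)

# Assumption: all 440 potential datasets resemble the Objective 1 datasets.
total = projected_hours(440, avg)

print(f"average: {avg} hours per dataset")
print(f"projected total for 440 datasets: {total} hours")
```

Because the average is recomputed from the logged records, the projection shifts automatically as more datasets finish processing, which mirrors how the estimates are described as being updated.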

3. Develop a USGS Data Mine Web application

  • Based on our experience inventorying the FORT’s metadata and case files, we are currently documenting our suggested inventory workflows and designing wireframes for the Data Mine data management application. These make up Chapter 3 of the project completion report.

  • We are evaluating other data and project lifecycle management applications being developed within USGS as potential partners, in the hope that the Data Mine application can serve as a legacy data portal that unifies or expands upon the strengths of each of those related applications.



Note: this description is from the FY13 Annual Report

Validated and documented the application of the CDI Data Management Lifecycle framework

Through cooperation with the CSAS ‘Species Occurrence Records and Data Transformation Processes’ project, we were able to expand the scope of this objective to include additional bat and white-tailed kite data, increasing our sample size for estimating resource requirements from 3 datasets to 9.

  • Southeastern Arizona riparian bird and habitat data

  • Texas, Kansas, Oklahoma, South Dakota, and North Dakota wetlands and shorebird data

  • Eastern Colorado prairie bird and habitat data

  • Bats of the Rocky Mountain Arsenal Mist Net Data

  • Bat Inventory of Ouray National Wildlife Refuge Mist Net Data

  • Bats of Mesa Verde National Monument Mist Net Data

  • White-tailed Kite Historic Data

  • White-tailed Kite Physiological Data

  • White-tailed Kite Morphological Data

Inventoried, prioritized and estimated the cost of integrating a USGS data mine

  • Conducted an inventory of FORT datasets based on metadata produced between 1994 and 2011. We have identified 440 potential datasets to date. All inventoried records have been migrated to the Fort Collins Science Center’s ScienceBase community as part of a ‘Data Mine’ space, which is restricted to internal FORT data stewards and principal investigators. Once datasets are completely processed and approved for public distribution, those ScienceBase dataset items will be moved to the FORT community’s public ‘Datasets’ space for distribution.

  • FORT Case Files were also inventoried where dataset metadata was associated with active FORT staff. Fewer than a dozen additional datasets have been identified through this review of Case Files; however, this process is ongoing.

  • Initial processing-time estimates have been assigned to datasets similar to those processed under Objective 1. These estimates are updated as dataset processing times are analyzed. The current average is 24 hours of processing per dataset.

Wireframed a USGS Data Mine Web application

  • Based on our experience inventorying the FORT’s metadata and case files, we documented our suggested inventory workflows and designed wireframes for the Legacy Data Inventory and Reporting System (LDIRS).

  • We evaluated other data and project lifecycle management applications being developed within USGS as potential partners, in the hope that the Data Mine application can serve as a legacy data portal that unifies or expands upon the strengths of each of those related applications.