ScienceBase Updates - Fall 2020

By ScienceBase Instructions and Documentation

Fall 2020 topics include information on the updated ScienceBase homepage, AWS S3 Publishing, persistent identifiers in metadata, a tip on propagating metadata to your landing page, and a featured data release on riparian vegetation along the Colorado River.

ScienceBase Homepage Gets a Refresh

As of September 2020, visitors to the ScienceBase homepage (https://www.sciencebase.gov/catalog) are greeted with an updated look. The revamped design seeks to give ScienceBase a more modern feel while making it easier to navigate to important links and into collections of content within the system. Additionally, the language on the homepage has been revised to make the purpose of the web application, and the functionality that ScienceBase provides, clearer to both first-time visitors and long-time users.

Screenshot of the updated ScienceBase homepage

ScienceBase’s support for metadata files and flexible permission controls provides convenience for publishing data but represents only a small part of ScienceBase’s overall functionality. The system also supports targeted search and full integration with other tools or code-based workflows through the ScienceBase Application Programming Interface (API). Links to resources that help users understand these advanced capabilities and how they can be used to support project needs are now provided from the homepage and in the updated user menus. USGS researchers planning the management of data resources may wish to review some of these advanced functionalities to learn more about how to connect ScienceBase to customized workflows for managing or consuming content stored within the system.

For questions or feedback about the new homepage or general system use, please email sciencebase@usgs.gov.

Unique Persistent Identifiers (PIDs) for USGS Metadata Records

In fiscal year 2021, the SAS Science Data Management Team will launch a new version of the USGS Science Data Catalog (SDC). One important new update is that the SDC will require a unique persistent identifier (PID) for every metadata record in the Catalog. This persistent identifier will enable the SDC and the downstream federal data catalogs to uniquely identify and recognize a metadata record that describes a specific dataset/data release.

The ScienceBase Data Release Team automatically sends ScienceBase data release metadata records to the SDC. Moving forward, the team will also register and add PIDs to all ScienceBase data releases automatically upon publication, so authors do not need to take any action for this step. The team retroactively added PIDs to metadata records for older data releases as of August 2020. To create PIDs for metadata records in ScienceBase data releases, the team leverages the ScienceBase item ID, which is the last part of the ScienceBase URL (e.g., https://www.sciencebase.gov/catalog/item/5ccb4a64e4b09b8c0b7808a6). All USGS metadata PIDs start with the prefix ‘USGS:’. The team takes the item ID from the item where the metadata record is attached and appends it to the prefix to create the PID.

Example ScienceBase PID: USGS:5ccb4a64e4b09b8c0b7808a6
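The rule described above (prefix plus item ID) can be sketched in a few lines of Python. This is only an illustration of the naming convention, not the Data Release Team's actual tooling; the function names item_id_from_url and make_pid are hypothetical.

```python
from urllib.parse import urlparse

# Prefix shared by all USGS metadata PIDs, as stated in the text above.
PID_PREFIX = "USGS:"


def item_id_from_url(item_url):
    """Return the last path segment of a ScienceBase item URL (the item ID)."""
    path = urlparse(item_url).path.rstrip("/")
    item_id = path.rsplit("/", 1)[-1]
    if not item_id:
        raise ValueError(f"No item ID found in URL: {item_url!r}")
    return item_id


def make_pid(item_url):
    """Build a metadata PID by appending the item ID to the USGS prefix."""
    return PID_PREFIX + item_id_from_url(item_url)


pid = make_pid("https://www.sciencebase.gov/catalog/item/5ccb4a64e4b09b8c0b7808a6")
print(pid)
```

Because the PID is derived entirely from the item ID, an item can only yield one PID, which is why a single metadata record per item matters.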

To ensure that PIDs are in fact unique, the ScienceBase Data Release Team will more strictly enforce the current recommendation of having only one metadata record per ScienceBase item (additional metadata records can still be included in zip files or on child items).

Using the ScienceBase item ID will allow the ScienceBase Data Release Team to manage the metadata PIDs more easily, particularly when revising current data releases. Metadata for data release revisions will receive the same PID(s) as the original metadata, and as a result, metadata in the SDC will be replaced with the newer version. If both versions of the data release remain public via separate ScienceBase items, a new PID will be added to the updated metadata and both metadata records will appear in the SDC.

For those who are curious about where the PIDs are placed within each metadata record, it depends on whether you are providing CSDGM XML records (usually created with the Online Metadata Editor or the Metadata Wizard) or ISO 19115-x XML records (usually created with mdEditor or exported from netCDF).

CSDGM

The PIDs will be added to CSDGM records as a theme keyword array:

ISO 19115-x

The PIDs will be added to the fileIdentifier element within ISO 19115-x records:
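As a minimal sketch, assuming the record uses the common ISO 19139 XML encoding (the gmd and gco namespaces), the PID could be read back from a record with Python's standard library. The sample record below is a heavily trimmed, hypothetical stand-in for illustration, not an actual ScienceBase record, and read_file_identifier is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespaces of the ISO 19139 encoding (an assumption about the record format).
NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}

# Hypothetical, trimmed record used only to exercise the function below.
SAMPLE_RECORD = """<gmd:MD_Metadata xmlns:gmd="http://www.isotc211.org/2005/gmd"
                 xmlns:gco="http://www.isotc211.org/2005/gco">
  <gmd:fileIdentifier>
    <gco:CharacterString>USGS:5ccb4a64e4b09b8c0b7808a6</gco:CharacterString>
  </gmd:fileIdentifier>
</gmd:MD_Metadata>
"""


def read_file_identifier(xml_text):
    """Return the text of the fileIdentifier element, or None if it is absent."""
    root = ET.fromstring(xml_text)
    node = root.find("gmd:fileIdentifier/gco:CharacterString", NS)
    if node is None or node.text is None:
        return None
    return node.text.strip()


print(read_file_identifier(SAMPLE_RECORD))
```

Records written against other ISO 19115-x encodings may use different element names or namespaces, so the lookup path would need to be adjusted for those.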

If you have any questions about PIDs for metadata records in ScienceBase, please email sciencebase_datarelease@usgs.gov.

Featured Data Release

Map of the Colorado River corridor through Grand Canyon National Park showing Lees Ferry, major tributaries, recreational reach divisions (C: critical; NC: noncritical), and long‐term monitoring sites (Hadley and others, 2018).

Data citation: Sankey, J.B., Ralston, B.E., Grams, P.E., Schmidt, J.C., and Cagney, L.E., 2015, Riparian vegetation, Colorado River, and climate: Five decades of spatiotemporal dynamics in the Grand Canyon with river regulation—Data: U.S. Geological Survey data release, https://doi.org/10.5066/F7J67F0P.

Science Center: Southwest Biological Science Center

Published in 2015, this data release consists of image-based classifications of total vegetation from 1965, 1973, 1984, 1992, 2002, 2004, 2005, and 2009, and characteristics of the river channel along the riparian area of the Colorado River between Glen Canyon Dam and Lake Mead Reservoir. The data have been cited by four other publications, and the related primary publication (https://doi.org/10.1002/2015JG002991) by 33 publications per Scopus. The publication was also featured in a Forbes article. One citing publication re-used the data to meet a need in the camping community: Hadley and others (2018) used the canyon‐wide maps of vegetation available through this data release to quantify the causes of change in campsite area on sandbars along the Colorado River at 35 of 37 long‐term monitoring sites from 2002–2009. These camping areas are an important recreational resource visited by over 25,000 people annually (Hadley and others, 2018).

Data citation and reuse, as shown in the example above, are only one way of measuring the impact of a data release. If you know of a data product available in ScienceBase that has been reused in other projects, informed policy decisions, garnered attention in major media outlets, or found any other interesting use, we'd love to hear about it. Please complete this form to contribute your data story.

Image and data reuse citation: Hadley, D.R., Grams, P.E., and Kaplinski, M.A., 2018, Quantifying geomorphic and vegetation change at sandbar campsites in response to flow regulation and controlled floods, Grand Canyon National Park, Arizona: River Research and Applications, v. 34, no. 9, p. 1208-1218, https://doi.org/10.1002/rra.3349.

ScienceBase Introduces AWS S3 Publishing, Supporting Direct File Access from Storage for Cloud-Optimized File Formats

In the summer of 2020, ScienceBase added new functionality to better support access to large, publicly accessible files stored in the system. One of the ongoing challenges in the scientific community is the set of access limitations (bandwidth restrictions, download times, network timeouts, etc.) associated with handling larger files.

To help address these challenges, ScienceBase has a new beta feature that allows an authorized user to publish a file to a publicly readable AWS S3 bucket. The files are moved to a dedicated ScienceBase bucket, where they are directly readable from object storage at a Uniform Resource Identifier (URI). The ability to access a file in this way (i.e., direct read from cloud storage) can have multiple benefits for data stored in ‘cloud optimized’ file formats.

The first use case that this workflow is intended to support is Cloud Optimized GeoTIFF (COG) files. Raster files (gridded data such as imagery, landcover data, or thematic measurements such as temperature or precipitation values) are a very common type of data and are often large. A COG file provides the same data and functionality as a normal GeoTIFF, but includes some additional information in the file to support faster rendering of the dataset (pyramids) and subsetting over HTTP. Publishing raster files in COG format and storing them in a publicly readable storage location allows users to access the data (the full dataset or geographically bounded subsets) to view it dynamically in a map viewer, or to pull raw values directly into a programmatic workflow (e.g., in Python or R). This enables more flexible code-based workflows and, in many cases, eliminates the need for users to download entire files.

For more information on using this feature, or working with ScienceBase to publish a collection of data as COG files, please contact sciencebase@usgs.gov.

A screengrab of the advanced file interface that allows a user to copy the files attached to a ScienceBase item to a publicly accessible AWS S3 bucket

An example of a dataset published in ScienceBase where the author provided the file in a Cloud Optimized GeoTIFF (COG) file format to support direct access from storage is available here: https://doi.org/10.5066/P9HEDYNT.

Screenshot of example code connecting to a ScienceBase COG file using Python with rasterio. The direct file URI is provided to the script and the values can be directly read into the interactive workflow.

Screenshot of example code being read into an interactive console straight from ScienceBase cloud storage.
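A workflow like the one pictured might look roughly like the sketch below, which assumes the rasterio package is installed and that the bounds are given in the raster's own coordinate system. It is not the code from the screenshots; read_cog_subset and check_bounds are hypothetical helper names, and the direct file URI has to be supplied by the user.

```python
def check_bounds(bounds):
    """Validate (left, bottom, right, top) bounds and return them as floats."""
    left, bottom, right, top = (float(value) for value in bounds)
    if left >= right or bottom >= top:
        raise ValueError("bounds must satisfy left < right and bottom < top")
    return left, bottom, right, top


def read_cog_subset(cog_uri, bounds, band=1):
    """Read one band of a geographically bounded subset from a remote COG.

    cog_uri: direct file URI of the published COG (supplied by the user).
    bounds: (left, bottom, right, top) in the raster's coordinate system.
    Returns a 2-D NumPy array of pixel values for the requested window.
    """
    # Imported here so the module loads even where rasterio is not installed.
    import rasterio
    from rasterio.windows import from_bounds

    left, bottom, right, top = check_bounds(bounds)
    with rasterio.open(cog_uri) as src:
        window = from_bounds(left, bottom, right, top, transform=src.transform)
        return src.read(band, window=window)


# Example call (cog_uri is whatever direct file URI ScienceBase provides):
# values = read_cog_subset(cog_uri, (left, bottom, right, top))
```

Only the part of the file needed for the requested window is read, so the full raster never has to be downloaded first.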

View a published ScienceBase COG file in an online viewer (consuming directly from ScienceBase).

An image showing a dynamic view of a ScienceBase COG file, consumed directly from ScienceBase into an online viewer (www.cogeo.org). The viewer supports rapid zoom and display, illustrating the benefits of cloud-optimized storage formats and standardized access protocols. This example can be viewed here.

Did You Know?

The most efficient way to add descriptive information to an item in ScienceBase is to use the autopopulate feature, which can parse .xml metadata. If you upload an .xml metadata record (in either FGDC-CSDGM or ISO format), ScienceBase will recognize the format and bring up a dialog window:

Screenshot showing dialog that asks if user wants to populate ScienceBase item fields using content from metadata file.

If you choose "Yes", ScienceBase will automatically populate key fields in the edit form using content from the metadata. You may still need to manually edit some of the information on the item, but this option can provide a helpful head start.

Subscribe to the ScienceBase Mailing List for Quarterly Updates.