ScienceBase Updates - Fall 2022

Fall 2022 topics include news on the ScienceBase integration with Globus to support release of large USGS datasets, making your data release more accessible, a tip on connecting directly to a .csv or .txt file in ScienceBase, and a featured data release on monitoring trends in burn severity.

ScienceBase Updates Header

ScienceBase Integration with Globus to Support Release of Large USGS Datasets

As the size of USGS research outputs continues to increase, the ability to store and publicly host these ever-growing datasets needs to keep pace. In 2017, the Science Analytics and Synthesis (SAS) Science Data Management team completed the certification process to establish ScienceBase as a USGS Trusted Digital Repository. While ScienceBase saw a large uptick in use for public data release, large files continued to pose challenges for researchers; at that time, ScienceBase could only handle file uploads of approximately 2 GB. Since then, the ScienceBase team has incrementally increased the file size the system supports. First, the supported file size rose to 10 GB when the ScienceBase large file uploader was introduced in 2016. Later, in 2019, it rose to 30 GB with the ability to upload files directly to ScienceBase cloud storage.

More recently, the ScienceBase team has been contacted by researchers needing to release much larger data products, both in the size of individual files (e.g., 400+ GB) and in the number of files (e.g., 100,000+ files). To meet this growing need, the ScienceBase data release team has developed two processes that use Globus to facilitate data transfer and access.

First, what is Globus? Globus is a service that allows users to efficiently, reliably, and securely move data between systems through a single web interface. Essentially, Globus can monitor a file transfer and can restart where it left off in the event of a network interruption, dramatically improving resiliency for large data transfers. Globus is in widespread use in the research landscape, with Globus endpoints available at hundreds of universities, laboratories, and computing facilities around the world. The USGS now has a subscription to Globus and multiple Globus endpoints. USGS users can log into Globus with their Active Directory credentials.
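The idea of restarting a transfer "where it left off" can be illustrated with a minimal local sketch. This is not how Globus is implemented and it does not use any Globus API; it only shows the general technique of resuming from the number of bytes already written. The function name and chunk size are assumptions made for illustration.

```python
import os

CHUNK_SIZE = 1024 * 1024  # bytes per read; an arbitrary illustrative value


def resume_copy(src_path: str, dst_path: str, chunk_size: int = CHUNK_SIZE) -> int:
    """Copy src to dst, continuing from however many bytes dst already holds.

    Returns the number of bytes written during this call.
    """
    # Bytes already present at the destination count as transferred.
    offset = os.path.getsize(dst_path) if os.path.exists(dst_path) else 0
    written = 0
    with open(src_path, "rb") as src, open(dst_path, "ab") as dst:
        src.seek(offset)  # skip the part copied before the interruption
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
            written += len(chunk)
    return written
```

Calling `resume_copy` again after an interrupted run appends only the missing bytes. A real transfer service would also need to verify file integrity, which this sketch skips.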

Screenshot of the Globus login screen displaying "U.S. Geological Survey" in the organizational login dropdown menu

Case 1: Globus to ScienceBase Transfer 

ScienceBase can now ingest files from Amazon Web Services (AWS) S3 buckets with the proper Identity and Access Management (IAM) configuration. This supports the ability to pull files from other USGS Cloud Hosting Solutions (CHS) locations, or research partners, into ScienceBase CHS storage. However, many researchers in USGS still do not work directly with S3 buckets (via console or command line interface), and those who do may find the IAM configuration process challenging. To solve this problem, the Science Analytics and Synthesis group within Core Science Systems has established an AWS S3 bucket with the proper IAM configuration to support ingest into ScienceBase. This eliminates the complexity of working through IAM configurations on a case-by-case basis for buckets. The ScienceBase data release team has developed a process using Globus to help users get their data into this staging location, after which the files can be attached to ScienceBase items and moved into ScienceBase cloud storage via the application’s user interface (or via code). 

Who should use this file upload method? 

  • Users with data that are already available on an existing Globus endpoint. 

  • Users with data larger than 30 GB. 

  • Users experiencing timeouts when uploading data through the ScienceBase Cloud Uploader. 

Who should NOT use this file upload method? 

  • Users whose primary challenge pertains to a large number of unique data files. While the upload of S3 files to ScienceBase items can now be scripted using the Python API wrapper sciencebasepy, ScienceBase still has a limit of 100 files per item. Contact the ScienceBase team at sciencebase@usgs.gov for strategies and options. 
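To get a feel for what the 100-files-per-item limit means for a large release, the sketch below groups a file list into batches of at most 100. Spreading files across several items is only one conceivable approach and is an assumption here; the text above does not say which strategy the ScienceBase team recommends. The file names are hypothetical.

```python
from typing import Iterator, List, Sequence

MAX_FILES_PER_ITEM = 100  # per-item file limit stated above


def batch_files(paths: Sequence[str],
                batch_size: int = MAX_FILES_PER_ITEM) -> Iterator[List[str]]:
    """Yield consecutive groups of at most batch_size paths."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(paths), batch_size):
        yield list(paths[start:start + batch_size])


# Hypothetical file names, used only to show the grouping.
example_paths = [f"tile_{i:04d}.tif" for i in range(250)]
groups = list(batch_files(example_paths))
print([len(group) for group in groups])  # [100, 100, 50]
```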

What is the workflow for using Globus to transfer data to ScienceBase? 

  1. Contact the ScienceBase Data Release team (sciencebase_datarelease@usgs.gov) to request a Globus Collection to support your data transfer to ScienceBase.

  2. Upload data from a local machine or an existing Globus endpoint to the Globus Collection via the Globus client.

  3. After the files have been transferred, navigate to File Manager from the ScienceBase landing page and upload the S3 files from your Globus Collection into ScienceBase.

  4. Files are automatically deleted from the Globus staging Collection after 30 days.

Case 2: Globus Deep Storage Data Release 

The ScienceBase Data Release team has also recently developed a process for releasing what the team is calling a “Deep Storage” data release. For these data releases, the data remain in a Globus Collection and public users will need a free Globus account to access the data. The ScienceBase data release landing page and the attached XML metadata record support the discovery and presentation of the data release, but users navigate the data release collection and obtain the files through Globus. Unlike the temporary Globus Collections used to support the S3 data transfer to ScienceBase (described above), these deep storage collections will persist on USGS on-premise or cloud storage configured as a long-term cataloged collection.

Screenshot of a Globus deep storage data release collection

Who should use this file upload method? 

  • Users with large volumes of data files that, when compiled, total multiple TBs of data.  

Who should NOT use this file upload method? 

  • Users who need fast, programmatic access to data file content via web services (e.g., Cloud Optimized GeoTIFFs). For example, web applications cannot be built on top of the data in this deep storage.

What is the workflow for setting up a Globus Deep Storage Data Release? 

  1. Contact the ScienceBase Data Release team to request a Globus Deep Storage Collection.

  2. Upload data to the Globus Collection.

  3. Upload collection metadata to the ScienceBase landing page and appropriately catalog the Globus file holdings within one or more ScienceBase items.

  4. Contact the ScienceBase Data Release team to make the data release public.

Did You Know?

For files stored in ScienceBase on-premise storage, it is possible to connect directly to a .csv file or structured .txt file in ScienceBase using Python or R, without downloading the file first.

You can find a ScienceBase on-premise file path URL in an item's JSON content. Example file paths: https://www.sciencebase.gov/catalog/file/get/5d93775de4b0c4f70d0d48b7?name=wlci_literature_database.csv or https://www.sciencebase.gov/catalog/file/get/5d93775de4b0c4f70d0d48b7?f=__disk__d8%2F37%2F1a%2Fd8371abdc16a266922688d1f3994969296c523bd.

Referencing the https file location enables the data to be brought into a workflow such as a Jupyter Notebook or script with common libraries such as Pandas.

Example Python (after import pandas as pd, with urlTarget set to the file path URL): df = pd.read_csv(urlTarget)

This can provide additional flexibility for analysis and visualization options through data science libraries such as NumPy, Plotly, Shiny, etc. When paired with a targeted ScienceBase query, this can also support an approach to work with multiple files across different items, filtering by item attributes.
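The one-line Python example above can be expanded into a small script. The sketch below builds the name-based file URL shown earlier and then reads it with Pandas. The helper names `build_file_url` and `read_remote_csv` are assumptions introduced for illustration, and the read step needs network access and an installed copy of Pandas, so it is left commented out.

```python
from urllib.parse import urlencode

FILE_GET_BASE = "https://www.sciencebase.gov/catalog/file/get/"


def build_file_url(item_id: str, file_name: str) -> str:
    """Build a file URL of the '?name=' form shown in the example paths above."""
    return f"{FILE_GET_BASE}{item_id}?{urlencode({'name': file_name})}"


def read_remote_csv(url: str):
    """Read a remote .csv straight into a DataFrame without saving it locally."""
    import pandas as pd  # imported here so the URL helper works without Pandas
    return pd.read_csv(url)


url_target = build_file_url("5d93775de4b0c4f70d0d48b7", "wlci_literature_database.csv")
print(url_target)

# Requires network access; uncomment to load the file into a DataFrame.
# df = read_remote_csv(url_target)
# print(df.head())
```

For a structured .txt file, `pd.read_csv` accepts a `sep` argument (for example `sep="\t"` for tab-delimited text).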

Users should note that with the continued migration of ScienceBase to the cloud, there will likely be some updates to how files are referenced from items. As files are moved to object storage in the cloud, the direct read capability described here may change.

However, new workflows are also available for certain file types in cloud storage to support programmatic access (e.g., Cloud Optimized GeoTIFF files). As the data storage environment evolves in ScienceBase, our team will continue to provide updates to users on any changes or new capabilities.

 

Featured Data Release

Burn severity mosaic image

U.S. Geological Survey, USDA Forest Service, Nelson, K., 2021, Monitoring Trends in Burn Severity Thematic Burn Severity Mosaic from 1984 to present (ver. 2.0, June 2022): U.S. Geological Survey data release, https://doi.org/10.5066/P9NETC0T

USGS Data Owner: Earth Resources Observation and Science (EROS) Center 

The Monitoring Trends in Burn Severity (MTBS) program maps wildfires that occur throughout the contiguous United States. Collected data such as the frequency, size, and severity of wildfires allow for analysis of the effects these events can have over time and space. This release contains a burn severity mosaic for the years 1984 to 2021.

The related publication, which investigates changes to the mapping procedures and data products that have occurred in this timeframe, has been cited by 30 other publications. While many of these publications use the data to classify frequency and perimeter trends, others have used the measures of severity to investigate vegetation regrowth (Morresi and others, 2022; Li and others, 2022) or how wildfire impacts snowpack (Giovando and Niemann, 2022).

References

Giovando, J., and Niemann, J.D., 2022, Wildfire Impacts on Snowpack Phenology in a Changing Climate Within the Western U.S.: Water Resources Research, v. 58, no. 8, https://doi.org/10.1029/2021WR031569

Morresi, D., Marzano, R., Lingua, E., Motta, R., and Garbarino, M., 2022, Mapping burn severity in the western Italian Alps through phenologically coherent reflectance composites derived from Sentinel-2 imagery: Remote Sensing of Environment, v. 269, p. 112800, https://doi.org/10.1016/j.rse.2021.112800

Li, Z., Angerer, J.P., and Wu, X.B., 2022, The impacts of wildfires of different burn severities on vegetation structure across the western United States rangelands: Science of The Total Environment, v. 845, p. 157214, https://doi.org/10.1016/j.scitotenv.2022.157214

 

How to Make Your Data Release More FAIR: Accessible

The FAIR (findable, accessible, interoperable, and reusable) guiding principles for data, first outlined in Wilkinson and others (2016), have quickly become a popular way to assess and improve the usability and utility of scientific datasets. However, it can be difficult to glean practical and straightforward ways to implement the principles in your own data releases. We will explore a few small ways to make your data more FAIR in the next few Updates, continuing with Accessible (see the Summer 2022 Updates for the piece on Findable). 

Using the ScienceBase data release process ensures that a few of the principles under Accessible are already fulfilled for you. For example, through the revision process, we ensure that metadata records are available even when data are no longer available, and we maintain ScienceBase as a repository that is free and open to the public. Here are a few other simple ways to make your data more accessible on ScienceBase.

Web Services or Direct Download?  

When creating your data release, it’s important to consider how your users will primarily access the data: through web services or by direct download. 

If you anticipate workflows in which the data are read directly from the ScienceBase item via web services: 

  • Certain geospatial file formats are recognized by ScienceBase and can be displayed in preview maps and used to generate web services. These are shapefiles (.shp), GeoTIFFs (.tif), and ESRI Service Definition files (.sd). 

  • Uploaded spatial zip files must be unzipped for ScienceBase to recognize the format. When one of these geospatial file formats is uploaded, ScienceBase will recognize the format and bring up a popup window, asking if an extension should be created. Selecting "Create Extensions" will allow ScienceBase to display the file in the preview map and generate web services for the data.  

  • Web services can make your data release more accessible if the files are intended to be primarily accessed programmatically. With ScienceBase’s Web Map Services (WMS) and Web Feature Services (WFS), spatial data can be viewed in client-side GIS software or online visualization tools like ArcGIS Online, The National Map (TNM) Viewer, and other applications.

Screenshot of a ScienceBase landing page with the Spatial Services section circled
  • ScienceBase also now supports programmatic access to cloud optimized file types such as cloud optimized GeoTIFFs (COGs). Learn more about providing access to cloud optimized files in the Fall 2020 Updates.

  • While spatial services can help meet certain data needs, they are not required, and they can raise other considerations that introduce more complexity than is necessary in some cases (e.g., optimizing display and performance in mapping applications versus preserving data fidelity).
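As a rough illustration of programmatic access to a WMS, the sketch below builds a standard OGC GetCapabilities request, which asks a service to describe its available layers. The endpoint shown is a placeholder, not a real ScienceBase address; the actual service URL would be copied from the Spatial Services section of an item's landing page. The function name is an assumption.

```python
from urllib.parse import urlencode


def wms_capabilities_url(service_base_url: str, version: str = "1.3.0") -> str:
    """Append standard OGC WMS GetCapabilities parameters to a service URL."""
    params = {"service": "WMS", "version": version, "request": "GetCapabilities"}
    # Use "&" if the base URL already carries a query string.
    separator = "&" if "?" in service_base_url else "?"
    return f"{service_base_url}{separator}{urlencode(params)}"


# Placeholder endpoint: replace with the WMS URL listed on the landing page.
print(wms_capabilities_url("https://example.org/wms"))
```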

If you anticipate users primarily downloading data directly: 

  • Spatial files uploaded in zipped format will not display in preview maps or generate web services; however, they will remain available for download. 

  • If the data are not intended to be primarily accessed via web services, it can be best to keep them zipped. Having the data package zipped together makes the data more accessible to users downloading the data directly. 

  • If you’d like to display a map of the study area of the data but don’t need web services, you can upload the study area map as an image, and it will display in the top right corner of the landing page.
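The choice between the two access modes can be checked before upload with a small helper. The sketch below flags whether a file has one of the extensions listed above as recognized for preview maps and web services; zipped files return False, matching the behavior described for direct download. The function name and the example file names are hypothetical.

```python
from pathlib import PurePath

# Extensions the text above lists as recognized for preview maps and web services.
SERVICE_CAPABLE_EXTENSIONS = {".shp", ".tif", ".sd"}


def can_generate_services(file_name: str) -> bool:
    """Return True if the file extension is one ScienceBase is described as recognizing."""
    return PurePath(file_name).suffix.lower() in SERVICE_CAPABLE_EXTENSIONS


# Hypothetical file names for illustration.
for name in ["study_area.shp", "burn_severity.TIF", "study_area.zip"]:
    print(name, can_generate_services(name))
```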

Updating the DOI 

The digital object identifier (DOI) associated with your data release is key to keeping your data accessible by providing a persistent link to your data. However, DOI links can break if the DOI’s record is not kept up to date. ScienceBase uses the DOI Tool to reserve and publish DOIs for USGS data releases. When your data release is published, the DOI record is updated with the URL of the ScienceBase landing page to which the DOI should resolve. If the data are moved from ScienceBase for any reason, or the landing page is removed, the DOI record must be updated to keep the data release accessible. 

During data release revisions, data authors should work with the ScienceBase team to ensure that the DOI is pointing to the correct URL. If data need to be moved from ScienceBase for any reason, contact the ScienceBase Data Release team (sciencebase_datarelease@usgs.gov) to ensure that the original DOI link is properly redirected.
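One way to spot-check that a DOI still points where you expect is to follow its redirects and compare the final address with the intended landing page. The sketch below is a minimal version using only the Python standard library; the function names are assumptions, the check needs network access, and some servers reject requests from scripts, so a failure here is a prompt to look more closely rather than proof of a broken DOI.

```python
from urllib.parse import urlsplit
from urllib.request import urlopen


def doi_url(doi: str) -> str:
    """Return the https://doi.org/ resolver URL for a bare DOI such as '10.5066/P9NETC0T'."""
    return "https://doi.org/" + doi.strip()


def same_page(url_a: str, url_b: str) -> bool:
    """Compare two URLs by host and path, ignoring scheme, host case, and trailing slashes."""
    a, b = urlsplit(url_a), urlsplit(url_b)
    return (a.netloc.lower(), a.path.rstrip("/")) == (b.netloc.lower(), b.path.rstrip("/"))


def doi_resolves_to(doi: str, expected_landing_page: str, timeout: float = 30.0) -> bool:
    """Follow the DOI's redirects (network required) and compare the final URL."""
    with urlopen(doi_url(doi), timeout=timeout) as response:
        return same_page(response.geturl(), expected_landing_page)


print(doi_url("10.5066/P9NETC0T"))
```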

Utilize tagging  

Using the tagging feature in ScienceBase can make your data easier to query and retrieve. When data are consistently tagged, users can pull together and traverse relevant results more easily. For example, users looking for water quality data releases in ScienceBase could use the query string:

https://www.sciencebase.gov/catalog/items?q=&filter=systemType=Data+Release&filter=tags={"type": "USGS Scientific Topic Keyword","name": "Water Quality"}

to see all results with tag type “USGS Scientific Topic Keyword” and tag name “Water Quality”. From there, you can download a CSV that includes the item IDs of all data releases returned in the search, making it easier to programmatically access files on these pages or otherwise interact with the data.
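The query string above can also be assembled in code, which helps when the tag type or name changes between searches. The sketch below builds the same query with URL encoding applied; the function name is an assumption, and the encoded form differs from the literal string above only in how special characters are escaped.

```python
import json
from urllib.parse import urlencode

CATALOG_ITEMS_URL = "https://www.sciencebase.gov/catalog/items"


def tag_query_url(tag_type: str, tag_name: str) -> str:
    """Build a catalog query for data releases carrying the given tag."""
    tag_filter = "tags=" + json.dumps({"type": tag_type, "name": tag_name})
    params = [
        ("q", ""),                              # empty free-text search
        ("filter", "systemType=Data Release"),  # restrict to data releases
        ("filter", tag_filter),                 # restrict to the tag type and name
    ]
    return f"{CATALOG_ITEMS_URL}?{urlencode(params)}"


print(tag_query_url("USGS Scientific Topic Keyword", "Water Quality"))
```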

The Denver photo library is an example of consistent and thorough tagging, with the tags also being utilized in the accompanying photographic library explorer. To add tags to your data release landing page, you can manually add tag types and tag names in the “Tags” tab in the edit form, or parse your metadata’s keywords onto the page by uploading your metadata file and selecting “yes” when asked if you’d like to propagate the metadata to the page.

 

 

Subscribe to the ScienceBase Mailing List for Quarterly Updates.