Exploring the USGS Science Data Life Cycle in the Cloud

Executive Summary

Traditionally in the USGS, data is processed and analyzed on local researcher computers, then moved to centralized, remote computers for preservation and publishing (ScienceBase, Pubs Warehouse). This approach requires each researcher to have the necessary hardware and software for processing and analysis, and also to bring all external data required for the workflow over the internet to their local computer. To find a more efficient and effective scientific workflow, we explored an alternative model: storing scientific data remotely, and performing data analysis and visualization close to the data, using only a local web browser as an interface. Although this environment was not a good fit for the policies of CHS, we were able to demonstrate substantial efficiency gains using these data-proximate, scalable workflows on both NSF’s XSEDE Jetstream Cloud and the USGS Yeti HPC cluster.



Goals

The USGS Science Data Life Cycle has been characterized by CDI as “Plan, Acquire, Process, Analyze, Preserve, Publish/Share”. Traditionally in the USGS, data is processed and analyzed on local researcher computers, then moved to centralized, remote computers for preservation and publishing (ScienceBase, Pubs Warehouse). This approach requires each researcher to have the necessary hardware and software for processing and analysis, and also to bring all external data required for the workflow over the internet to their local computer.



With the big data storage and multiprocessing capabilities provided by the Cloud or HPC facilities, and new client/server technologies like Jupyter Notebooks, we should be able to conduct most of the Science Data Life Cycle components remotely, where we can share and scale our hardware resources, share software environments, perform analysis close to the data, and preserve, publish and share our data for public use. This is in addition to the normal benefits of the Cloud or an HPC facility: no local sunk costs, economy of scale, and reliable services.



Remote, data-proximate, scalable analysis and visualization are particularly useful for USGS modelers, who conduct surface water, groundwater, ocean and geophysical simulations on hundreds of CPUs and generate datasets in the 10 GB to 10 TB range. These datasets are too big to easily transmit to colleagues, and too big for the existing capabilities of ScienceBase. Modelers need scalable resources to perform simulations (acquire), a reliable and efficient way to access information from these large datasets (process/analyze), and a trusted digital repository for data release and citation (preserve), as required by Instructional Memo IM-OSQI-2015-011. Scientists working with structure-from-motion approaches to transform photographs into topography also have massive data and computing needs.

Lessons Learned

We learned that at its current stage of evolution and level of restrictions, CHS is not the best platform for testing the exploratory services we envisioned for our workflows (e.g. JupyterHub, THREDDS, Docker, Kubernetes and Globus Connect). For exploratory or development work, it is better to develop technologies on more open and flexible systems like NSF's XSEDE, or on commercial cloud environments in settings where we can work under the umbrella of an organization USGS is affiliated with, like the Earth Science Information Partners (ESIP).



Remote Cloud and HPC environments with client/server technologies like JupyterHub hold the potential to dramatically improve the scientific workflows of USGS researchers, removing the need for expensive local hardware and high-bandwidth connections. These technologies allow people to work locally in their browser without transferring data, while harnessing the power and reliability of these remote systems.

Successes

On the Cloud, we need a new approach to storing large multidimensional data. The NetCDF and HDF files we have used for decades on regular file systems don't work well on object storage. On this project we examined a new solution to this problem: the Highly Scalable Data Service (HSDS) developed by John Readey of the HDF Group. HSDS stores each chunk of an HDF5 or NetCDF4 file (which uses HDF5 under the hood) as an S3 object. If the file is not chunked, HSDS will chunk it when it is stored to S3. Python users can then access these "files" using the h5pyd library, which is a drop-in replacement for the h5py library. We worked with the HDF Group and the Xarray team to develop the capability to open these datasets directly in Xarray. The result is that Xarray users can use the same workflows that they currently use for NetCDF4 and HDF5 files, but harness the power of the Cloud transparently.
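To make the chunk-per-object idea concrete, here is a minimal sketch in plain Python of how a chunked array could map onto object keys, and how a reader could work out which objects a given slice needs. The key format, the dataset name "hs", and all function names are hypothetical illustrations; they are not the actual HSDS storage layout or the h5pyd API.

```python
from itertools import product
from math import ceil


def chunk_grid(shape, chunks):
    """Number of chunks along each dimension of the array."""
    return tuple(ceil(n / c) for n, c in zip(shape, chunks))


def chunk_key(dataset_id, index):
    """Hypothetical object key for one chunk (not the real HSDS scheme)."""
    return f"{dataset_id}/" + "_".join(str(i) for i in index)


def chunks_for_selection(shape, chunks, selection):
    """Return the chunk indices that overlap a selection.

    `selection` holds one (start, stop) pair per dimension, stop exclusive.
    """
    ranges = []
    for (start, stop), n, c in zip(selection, shape, chunks):
        if not 0 <= start < stop <= n:
            raise ValueError("selection out of bounds")
        # First and last chunk touched along this dimension.
        ranges.append(range(start // c, (stop - 1) // c + 1))
    return list(product(*ranges))


# Assumed example array: hourly values on a 1000 x 1000 grid for 30 days.
shape = (720, 1000, 1000)   # (time, y, x)
chunks = (24, 250, 250)     # one day of data per 250 x 250 tile

grid = chunk_grid(shape, chunks)
total_objects = grid[0] * grid[1] * grid[2]

# One day of data over a small region touches only a few objects.
selection = ((48, 72), (100, 300), (0, 250))
needed = chunks_for_selection(shape, chunks, selection)
keys = [chunk_key("hs", idx) for idx in needed]
print(f"{len(keys)} of {total_objects} objects needed: {keys}")
```

The point of the sketch is that a subset request only has to fetch the objects whose chunks overlap the selection, rather than the whole file, which is why chunked storage suits object stores such as S3.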





We worked with Unidata (Julian Chastang) and NSF XSEDE Jetstream personnel (Jeremy Fisher, Andrea Zonca) to implement the full workflow on XSEDE HPC and the Jetstream cloud. We used the existing Globus service, implemented JupyterHub on Kubernetes and THREDDS on Docker, and set up an HSDS service (working with John Readey from the HDF Group). Presentations on the full workflow were given at the ESIP 2017 Summer Meeting in Bloomington and the 2018 Winter Meeting in Bethesda.



We also forged a partnership with the Pangeo team, an NSF EarthCube-funded project whose goals overlap strongly with those of this CDI project. Pangeo is developing remote analysis and visualization of big data using JupyterHub, Dask, and Zarr for large multidimensional data storage on the Cloud. Under the auspices of ESIP, John Readey (HDF Group) and I wrote a proposal to implement the Pangeo environment on AWS for exploring the National Water Model; the proposal was recently awarded one year of funding.





Figure 1. Snapshots from a Jupyter notebook using a scalable computer cluster on the USGS Yeti HPC facility to compute maximum wave height on an 80 GB dataset. The user is interacting with Yeti in their browser on their laptop connected on the TIC network. Because the processing is happening remotely, the local user does not need a fast internet connection or powerful local hardware.
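The kind of computation shown in Figure 1 can be illustrated with a small, self-contained sketch of a chunk-wise maximum: split the time axis into blocks, reduce each block independently, then combine the partial results. This is only a toy illustration using the Python standard library. It is not the notebook's code and not the Dask API, and the data values and function names are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def block_max(block):
    """Maximum over time for each grid cell within one block of time steps."""
    return [max(cell_values) for cell_values in zip(*block)]


def chunked_time_max(time_steps, block_size, workers=4):
    """Split the time axis into blocks, reduce each block, then combine."""
    blocks = [time_steps[i:i + block_size]
              for i in range(0, len(time_steps), block_size)]
    # Each block is reduced independently, so the work can run in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(block_max, blocks))
    # The maximum of the per-block maxima equals the maximum over all steps.
    return block_max(partial)


# Made-up toy data: 6 time steps of wave height (m) on 4 grid cells.
wave_height = [
    [0.5, 1.0, 0.2, 0.0],
    [0.7, 0.9, 0.4, 0.1],
    [1.2, 0.8, 0.3, 0.0],
    [0.9, 1.5, 0.1, 0.2],
    [0.4, 1.1, 0.6, 0.1],
    [0.3, 0.7, 0.5, 0.3],
]
max_height = chunked_time_max(wave_height, block_size=2)
print(max_height)
```

Because each block is reduced on its own, the same pattern scales from threads on one machine to many workers on a cluster, with only the small per-block results travelling between processes.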



Presentations

"Scalable, Data-Proximate, Reusable Workflows on the Cloud", Rich Signell, ESIP 2018 Winter Meeting.  [slides][video]



"HDF Data Services" (A service for accessing HDF5/NetCDF4 data on the Cloud), John Readey, HDF Group. [video]



"Pangeo: JupyterHub, Dask, Xarray and Zarr on the Cloud", Matthew Rocklin, Anaconda [video]



"Pangeo: Jupyter, Xarray, Dask, and NetCDF on HPC", Matthew Rocklin, Anaconda [video]

Documents

The Pangeo project 



Storing large multidimensional data in the Cloud 

Running THREDDS, pycsw, ERDDAP, ncWMS and TerriaMap via Docker



Setting up local browser access to scalable, data-proximate analysis on Yeti

Deploying JupyterHub with Kubernetes on Jetstream

Code

Main CDI Project

JupyterHub on CHS

Dockerized THREDDS on CHS

Pangeo Project