Hunting Invasive Species with HTCondor: High Throughput Computing for Big Data and Next Generation Sequencing



Large amounts of data are being generated that require hours, days, or even weeks to analyze using traditional computing resources. Innovative solutions must be implemented to analyze the data in a reasonable timeframe. The program HTCondor (https://research.cs.wisc.edu/htcondor/) combines the processing capacity of individual desktop computers and dedicated computing resources into a single, unified pool. This pool allows HTCondor to process large amounts of data quickly by breaking the work into smaller tasks that are distributed across many computers.
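The idea of breaking data into smaller tasks can be sketched as follows. This is a minimal illustration rather than the project's actual workflow: it assumes the input is FASTA-formatted sequence text, and the names `split_fasta`, `build_submit_description`, `analyze_chunk.sh`, and `chunks.txt` are hypothetical. The generated text uses standard HTCondor submit-description commands, with one job queued for each chunk file listed in `chunks.txt`.

```python
from typing import List


def split_fasta(text: str, records_per_chunk: int) -> List[str]:
    """Split FASTA text into chunks of at most `records_per_chunk` records."""
    if records_per_chunk < 1:
        raise ValueError("records_per_chunk must be at least 1")
    records: List[str] = []
    for line in text.splitlines():
        if line.startswith(">"):
            # A header line starts a new record.
            records.append(line + "\n")
        elif records and line.strip():
            # Sequence lines belong to the most recent header.
            records[-1] += line + "\n"
    return [
        "".join(records[i:i + records_per_chunk])
        for i in range(0, len(records), records_per_chunk)
    ]


def build_submit_description(executable: str, list_file: str) -> str:
    """Return submit-description text that queues one job per listed chunk.

    `executable` and `list_file` are hypothetical names chosen for this sketch.
    """
    lines = [
        f"executable = {executable}",
        "arguments = $(chunk)",
        "transfer_input_files = $(chunk)",
        "should_transfer_files = YES",
        "when_to_transfer_output = ON_EXIT",
        "output = $(chunk).out",
        "error = $(chunk).err",
        "log = chunks.log",
        # One job is queued for every line (chunk file name) in list_file.
        f"queue chunk from {list_file}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    demo = ">seq1\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n"
    chunks = split_fasta(demo, records_per_chunk=2)
    names = [f"chunk_{i}.fasta" for i in range(len(chunks))]
    print(f"{len(chunks)} chunks: {names}")
    print(build_submit_description("analyze_chunk.sh", "chunks.txt"))
```

In practice each chunk would be written to its own file, the file names listed one per line in `chunks.txt`, and the description passed to `condor_submit`. The chunk size is a tuning choice: smaller chunks spread across more machines but add per-job overhead.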



This project team implemented HTCondor at the USGS Upper Midwest Environmental Sciences Center (UMESC) to leverage existing computing capabilities for data processing and analysis. HTCondor can be used for a wide range of projects, including processing DNA sequencing data (currently done as part of invasive species monitoring), validating new statistical models over a wide range of possible parameter combinations, and analyzing long-term vegetation and fish data from the Upper Mississippi River. The HTCondor pool is online and operational at UMESC.

The USGS Wisconsin Water Science Center was able to connect to the pool at the USGS Oregon Water Science Center through “flocking” (fig. 16), in which jobs submitted in one HTCondor pool run on machines in another. This test identified cybersecurity and data transfer challenges that can be overcome in future work. Flocking requires communication among machines at different centers, which in turn requires traffic to be allowed through each center's firewall. Flocking involves many connections, but all traffic is consolidated onto a single port, 9618. As a result, only port 9618 must be opened, and traffic can be limited to USGS computers, which minimizes the risk of allowing two-way traffic through firewalls between centers. The test showed that the underlying technology works and was an important step toward connecting and leveraging computing resources throughout the USGS.
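Because flocking depends on port 9618 being open between centers, a simple reachability check can help separate firewall problems from HTCondor configuration problems. The sketch below is an illustration only: the function name `port_is_reachable` is hypothetical, and the host is taken from the command line (defaulting to `localhost`) because the source does not name the machines involved. A successful TCP connection shows only that the firewall passes traffic on that port; it does not confirm that HTCondor is listening there or that flocking is authorized.

```python
import socket
import sys

# Port named in the text above as the single consolidated port for flocking.
HTCONDOR_PORT = 9618


def port_is_reachable(host: str, port: int = HTCONDOR_PORT,
                      timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # The host is supplied by the user; "localhost" is only a placeholder.
    host = sys.argv[1] if len(sys.argv) > 1 else "localhost"
    state = "reachable" if port_is_reachable(host) else "not reachable"
    print(f"{host}:{HTCONDOR_PORT} is {state}")
```

Running the check from a submit machine against the remote pool's central manager, and again in the opposite direction, would cover the two-way traffic that flocking requires.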

Accomplishments

The accomplishments for this project are described below.

  • HTCondor was installed and configured at UMESC.
  • Configuration files for flocking among USGS centers were created, and flocking was tested.
  • Documentation examples for using HTCondor to scale scientific processing with cluster computing were uploaded to USGS BitBucket at https://my.usgs.gov/bitbucket/projects/CDI/repos/hunting_invasive_specie....
  • Docker was tested as a way to exchange the configuration of complex workflows and datasets across HTCondor pools.





Note: This description is from the Community for Data Integration 2016 Annual Report.