Without data standardization, big data are a big mess. Using existing data standards helps make big data useful and more equitable. USGS scientist Abby Benson was recently recognized by USGS partner Earth Science Information Partners for leading an effort to help new scientists do just that.
For USGS Scientist Abby Benson, data standardization for big data is FAIR game
By Michelle Collier, Public Affairs Specialist (detail), Office of Communications and Publishing
In the complicated world of statistical math, the more data you analyze, the more certain you can be of an accurate result. Having a lot of data also helps to answer really complicated scientific questions. Which is exactly why “Big Data” is such a big deal.
The term "Big Data" may seem intimidating, but it has a modest meaning. It is simply either a very large collection of data, i.e., a big data set, or a very complex data set.
The challenge is, once you try to analyze very large or complex data sets, regular computers start to smoke and fizzle out. So, working with big data often requires more advanced computers and computer software.
There are many benefits, however, to big data that make it worth the computational challenges. First and foremost, it can democratize the way data are collected: if many scientists studying the same phenomenon combine their data, they can get the benefits of working with big data without collecting it themselves. This removes some of the barriers to what a scientist can research, such as limited funding or geopolitical borders. It promotes research into “bigger” questions, geographically and temporally speaking.
Unfortunately, just as scientists may speak different languages, they can also use different units of measurement or methods to acquire the data. Think centimeters versus inches or a tape measure versus a laser rangefinder. This leads to considerable, time-guzzling effort spent translating or converting data into similar formats just so they can be used together.
The solution? Standardization!
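To make the centimeters-versus-inches problem concrete, here is a minimal Python sketch of what agreeing on one unit looks like in practice. The team names, record layout, and measurement values are made up for illustration and do not come from any USGS data set; the only fixed fact is that one inch equals exactly 2.54 centimeters.

```python
# Hypothetical example: two field teams record the same kind of length
# measurement in different units. All names and values are illustrative.
CM_PER_UNIT = {
    "cm": 1.0,
    "in": 2.54,  # 1 inch is exactly 2.54 centimeters
}


def to_centimeters(value, unit):
    """Convert a length to centimeters, rejecting units we do not know."""
    try:
        return value * CM_PER_UNIT[unit]
    except KeyError:
        raise ValueError(f"Unsupported unit: {unit!r}") from None


team_a = [{"length": 10.0, "unit": "cm"}, {"length": 12.5, "unit": "cm"}]
team_b = [{"length": 4.0, "unit": "in"}, {"length": 5.0, "unit": "in"}]

# Once every value is in the same unit, the two data sets can be pooled.
combined_cm = [to_centimeters(r["length"], r["unit"]) for r in team_a + team_b]
```

Rejecting unknown units instead of guessing is a deliberate choice here: a silent wrong conversion is harder to catch later than an error raised up front.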
The USGS Core Science Systems Mission Area handles a lot of big data. Wrangling all those data is surely a big task: enter USGS Scientist Abby Benson. Abby was recognized this year for leading an effort to help new scientists find and use existing data standards. When followed, these standards ensure that new studies produce data using compatible methods and units of measurement so they can become big data.
Abby’s work also encourages scientists to ensure the data are findable, accessible, interoperable, and reusable, or FAIR. Using the FAIR principles along with data standardization not only upholds a vital tenet of science, but it also makes scientific research more equitable.
I recently chatted with Abby to learn more about her Catalyst Award from the Earth Science Information Partners (ESIP) and get to know her work.
Thank you very much, Abby, for meeting with me today. I’m excited to learn more about your work. First and foremost, can you tell me about yourself? What’s your background and what first interested you about working with big data?
My background is actually in ecology. I studied thirteen-lined ground squirrels on prairie dog colonies in northeastern Colorado. I didn’t actually start working with big data until I came to USGS. The Mission Area I am part of, Core Science Systems, is heavily involved in big data efforts including the ones I am the lead for.
Speaking of the USGS, what is your role at the USGS?
I’m the biodiversity science specialist within the Science Analytics and Synthesis program and I’m the United States node manager for the Global Biodiversity Information Facility and the Ocean Biodiversity Information System, known as GBIF and OBIS. I assist data providers throughout the Bureau and the U.S. with sharing their data in standardized ways to those global systems and with ensuring biological observation data meet the FAIR principles, which make the data easier to discover and use in many projects. I help shepherd data from its raw, native format, which makes it difficult to reuse, into a standardized form that allows it to be integrated with other datasets from around the world.
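The step from a raw, native format to a standardized form can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Abby's actual workflow: the interview does not name a specific standard, so the sketch assumes Darwin Core, the vocabulary GBIF and OBIS use for observation records. The raw column names, the date format, and the coordinates are hypothetical.

```python
from datetime import datetime

# Hypothetical raw record as one field team might store it.
# The coordinates and date are made-up illustrative values.
raw_record = {
    "Species": "Ictidomys tridecemlineatus",  # thirteen-lined ground squirrel
    "Date": "06/15/2021",
    "Lat": "40.81",
    "Lon": "-104.75",
}

# Assumed mapping from this team's column names to Darwin Core terms.
FIELD_MAP = {
    "Species": "scientificName",
    "Date": "eventDate",
    "Lat": "decimalLatitude",
    "Lon": "decimalLongitude",
}


def standardize(raw):
    """Rename fields to Darwin Core terms and normalize value formats."""
    record = {FIELD_MAP[key]: value for key, value in raw.items() if key in FIELD_MAP}
    # Darwin Core recommends ISO 8601 dates; this team wrote MM/DD/YYYY.
    parsed = datetime.strptime(record["eventDate"], "%m/%d/%Y")
    record["eventDate"] = parsed.date().isoformat()
    # Coordinates arrive as text and are stored as decimal degrees.
    record["decimalLatitude"] = float(record["decimalLatitude"])
    record["decimalLongitude"] = float(record["decimalLongitude"])
    return record


standardized = standardize(raw_record)
```

After this step, records from different teams share the same field names and value formats, which is what lets them be combined with other datasets.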
Wow! That’s very inspiring. You know, I have to admit that I’ve heard the term “big data” thrown around a lot, but it’s been hard for me to fully grasp what that means. How would you explain what “big data” is to a kid?
I’ll do my best. So, people tend to lean on the definition that includes the three Vs: volume, velocity, and variety, which obviously, that's still a little out there. But really in my mind, big data are data that are difficult to work with easily on your computer. You might need tools that will help you work with it or maybe a different or bigger computer that will help you work with those data. And so, it's bringing together a lot of data from different places or just a data set that's really big itself, that it almost doesn't load on your machine because it's so big. So that's what I think of as big data.
So, it’s usually a lot of data or it's just really complicated. It probably needs a really fancy, high-tech computer to be able to work with the data. Is that right?
Yeah, a lot of times a high-tech computer, tools or different ways to format the data to help you work with it.
OK, that helps. So, the people that work with big data, like you, are they called data scientists?
You know, it’s funny because I think of myself as a biologist first, but data scientists are the ones working with that big data to answer questions that are impossible to answer with only little bits of information. They bring existing data together in new and exciting ways to answer questions that are bigger than one researcher or one project. So, if we really want to understand patterns at global scales or even national scales, or across long time periods, a single project isn't going to be able to go out and collect that kind of data. We're gonna have to bring data together from multiple people to answer those questions and so that's kind of what data scientists do.
Very interesting! Okay, going back to your work, you recently received an award from the Earth Science Information Partners or ESIP. Can you tell me about the work you were recognized for and why it matters?
Definitely! I received the 2022 ESIP Catalyst Award, which is given to participants who have brought about positive change in ESIP and inspired other members to take action. I was recognized for helping to kick off a new collaborative group in ESIP focused on biological data standards. The goal of this group is to maximize data relevance and utility for understanding changes in biodiversity over time. Basically, we want to make sure that biological observations that scientists are making are reusable for others down the road. We’re really focused on increasing the awareness of existing standards and trying to get more people interested in those standards and, hopefully, using them for their data. That’s why we made the Biological Data Standards Primer, which is an infographic-style product that highlights the existing standards that are out there for biological observation data. I do want to say that work was truly a team effort, and it wouldn’t have been possible without my co-chairs Diana LaScala-Gruenewald from the Monterey Bay Aquarium Research Institute, Robert McGuinn from NOAA [National Oceanic and Atmospheric Administration], and Erin Satterthwaite from the University of California San Diego. The whole ESIP Biological Data Standards Cluster also pitched in to create this product that we’re using to promote the use of standards.
Oh, that is so cool! I love the idea of standardizing methods to make data more relevant, equitable and useable. How does the work you did with ESIP intersect with your work at the USGS?
The work of the cluster is closely linked with my work at USGS. As the U.S. Node Manager for GBIF and OBIS, getting more people to use standards for their biological observation data is one of my primary roles. The more data that are standardized, the more data that can be reused by data scientists for analyses at national or even global scales and over longer periods of time. And so, the cluster is really part of trying to grow that and build that out into the scientific community.
With the data standardized and integrated, we can ask really big questions. Questions that were previously unimaginable to try to answer. For example, “How does the changing status and trend of coral reefs affect ecosystem function and the provision of ecosystem services and benefits?” That's a really big question and if we're going to answer that at a global scale, we're going to need data from all over the place to come together to really answer that question robustly. And so that's where I see the work of this ESIP cluster and my work at USGS coming together.
That’s great. It really does help to put it all into perspective. On the subject of ESIP, can you explain what ESIP is and why USGS staff are part of it?
The Earth Science Information Partners, or ESIP, is a community of data practitioners who come together to discuss all things data-related. This is a group that really focuses on the nexus of data management, computer science, information systems and open science, among other things. It’s the place where the data nerds get to dive into the details. The USGS, along with NASA [National Aeronautics and Space Administration] and NOAA, provides financial support to ESIP, and USGS staff get to learn from and share their data science accomplishments, needs, and challenges with people who are working on similar topics. It’s a really collaborative and supportive space that ESIP creates, which enhances the connections within the data science community.
Wow, that’s so great! I love that data nerds are coming together to create a larger community and advance Earth science data. Thank you so much for speaking with me today. I do have one more question for you. Do you have a favorite dataset?
That’s really hard, but one that comes to mind for me is the Puerto Rico Coral Reef Monitoring Program data set. We just published the newest 2021 update. For me, it exemplifies the power of long-term monitoring and collaborating across networks because there are multiple entities involved, all sharing data in standardized ways. That data set, published initially last year, already has 29 publications citing it in GBIF, which tells me these data are extremely valuable to downstream users.
Find out what else Abby is working on in her USGS staff profile.
Responses have been lightly edited for clarity.