By Nicholas Baumstein
Natural history researchers have a problem: too much data. More precisely, an enormous backlog of handwritten field notes, physical specimens, images, videos, and just about every other form of collected data spanning thousands of years of natural history collecting.
There are an estimated 2 billion specimens in museums around the world, built up over decades through donated “orphan collections” from institutions that can no longer maintain them. The true number, however, is unknown, because many of these specimens remain uncataloged in cabinets, closets, and backroom archives.
“Natural history collections started as curiosity cabinets,” said Lily Hughes, research curator of ichthyology, the branch of zoology devoted to the study of fish, at the North Carolina Museum of Natural Sciences (NCMNS). These collections were highly local, tied to the places where they were gathered and shaped by thousands of curious individuals documenting the world around them.
Now that these records are housed in museums, the goal is to sort and digitize them, making the data widely accessible and allowing researchers to track changes in the natural world.
Integrated Digitized Biocollections, funded by the National Science Foundation, currently hosts more than 121 million digital records of specimens housed in the United States. That represents a major leap in public access to this information, much of which was unavailable just a decade ago. Still, according to the National Library of Medicine, this accounts for less than 30 percent of all natural history specimens in the country.
With over 300 million specimens still awaiting digitization, the work ahead is substantial. Museums are not exactly flush with funding, and recent federal cuts have only slowed progress. As ichthyology collections manager Gabriela Hogue put it, “In many natural history museums, a single person is responsible for entering all the collection data and cannot work quickly enough to keep pace with how our planet is changing.”
At the NCMNS alone, there are roughly 4 million specimens, including 1.3 million in ichthyology. That collection holds more than 110,000 jars of fish from over 80 countries, representing more than 3,000 species out of roughly 37,000 known worldwide. Just as important are the archives: an estimated 450,000 pages of field notes documenting locations and conditions. “What differentiates us from being just a warehouse of fish specimens is that we really care not just about the specimen but the data that is associated with it,” Hughes said.
Many of those notes capture moments that cannot be recreated: a rare fish caught miles off the Atlantic coast half a century ago, or one found in a remote mountain stream in North Carolina. Each moment becomes a data point. “If you take a data point, and all the other data points from other museum collections that have a species, we could put together a map that shows where this species lives,” Hughes said. “If we put all this information together, we can start to understand where things live on our planet, which is not always easy for humans to see.”
For Hughes, getting that information into the public domain is critical. “The thing that is really important about that data is we are digitizing it, publishing it, and sharing it,” she said. “All of this data powers biodiversity research. We are enabling researchers to ask their own questions about whatever fish they are interested in.”
“A lot of species are slowly disappearing,” Hughes said, but the data is being indexed even more slowly. That leaves a small team with the task of processing massive amounts of material before it loses relevance.
This is where artificial intelligence enters the picture. A team led by Chris Bizon at UNC-Chapel Hill’s Renaissance Computing Institute, or RENCI, has been developing a tool to streamline digitization. “Our whole goal at RENCI is to use computational methods and thinking to advance research in a variety of fields,” Bizon said.
The first obstacle the team encountered was the handwriting, which is often barely legible. “Chicken scratch is charitable,” Bizon said. The variability is extensive: The same term might appear under multiple abbreviations, numbers are crossed out and rewritten, and the system has to decide which version is correct. The text itself refuses to behave, drifting past margins, hovering above lines or dipping below them.
“As humans, we don’t have much trouble with these things,” Bizon said. “We understand how the text is going to flow, and we recognize the specialized vocabulary.” To illustrate the limits of AI, he described uploading an image of dominoes and asking two models to add up the numbers. One answer came back as 170, the other 190. The correct total was 168.

“The point is not that they’re wrong all the time, it is that they’re just tools,” Bizon said. “They are not magic. They have an error rate, like any other algorithm. You have to build guardrails to get useful information out of them.”
Another challenge, and one of the most time-consuming parts of digitization, is figuring out where specimens were collected. “Now everyone has a GPS in their pocket, but people just used to write down where they were,” Hughes said. Those notes can be frustratingly vague: a road name and a nearby river, or a town plus a rough distance. The problem is that those clues do not age well. “Road names change, and there are like seven little rivers in North Carolina,” she added.
To compensate, the system looks for patterns across the data. If most records point to North Carolina but one entry places a specimen in Virginia, the system can flag it for review. “We can use that internal consistency as checks on the data to make sure that we are getting it right,” Bizon said.
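RENCI has not published the details of its pipeline, but a minimal sketch of the kind of majority-vote consistency check Bizon describes might look like this in Python. The record format, the field name, and the 0.8 consensus threshold are all illustrative assumptions; the key idea is that outliers get flagged for human review rather than silently corrected.

```python
from collections import Counter

def flag_outliers(records, field="state"):
    """Flag records whose value for `field` disagrees with the batch majority.

    Flagged records are returned for human review, not auto-corrected,
    mirroring the human-in-the-loop workflow described above.
    """
    values = [r[field] for r in records if r.get(field)]
    majority, count = Counter(values).most_common(1)[0]
    # Only trust the majority if it clearly dominates the batch
    # (0.8 is an arbitrary illustrative threshold).
    if count / len(values) < 0.8:
        return []  # no strong consensus; nothing to flag
    return [r for r in records if r.get(field) != majority]

# Hypothetical transcribed field notes: four point to North Carolina,
# one to Virginia, so the Virginia entry is flagged for review.
notes = [
    {"id": 1, "state": "North Carolina"},
    {"id": 2, "state": "North Carolina"},
    {"id": 3, "state": "Virginia"},
    {"id": 4, "state": "North Carolina"},
    {"id": 5, "state": "North Carolina"},
]
for record in flag_outliers(notes):
    print(f"Review record {record['id']}: {record['state']!r}")
```

A real system would weigh many more signals (dates, collector names, river systems), but the principle is the same: agreement across records becomes a guardrail on any single transcription.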
The tool is still an early prototype, designed in part to support a pitch to the National Science Foundation in the coming months. Even so, early results are promising. Initial tests suggest it can reduce the time required to process a single field note from about 20 minutes to five. Human review is still necessary, but across hundreds of thousands of records, those saved minutes add up quickly, potentially cutting thousands of hours from the workload.
With the rapid improvement of “multimodal” AI models, systems that can process and generate information across images, video, and text, researchers in other fields are starting to see ways around bottlenecks much like those facing ichthyology. Julie Horvath, a former researcher at the NCMNS who now works at RENCI, serves as a bridge between the two worlds, translating research needs into technical tools.
“We do not have the personnel to review all of this research,” Horvath said. “The museum has a great group of researchers, and integrating AI into their work could be highly promising.”
One of the projects she works on focuses on Cayo Santiago, an island off the coast of Puerto Rico with no human residents that is home to more than 1,500 monkeys. Researchers visit periodically to collect blood samples, which are photographed under a microscope and uploaded online.
From there, thousands of volunteer scientists help identify white blood cells in the images. “Researchers look at the immune system and how it reacts to external conditions,” said Dev Gandhi, a solutions architect at RENCI. “They also study how environmental changes on the island affect that response. Because monkeys are among the closest relatives to humans, researchers are trying to translate those findings to us.”
The project runs into familiar problems. “Because this is a free project, the image quality varies a lot,” Gandhi said. “Some have more debris, some are not captured properly, so it is difficult to tell whether there is a white blood cell or not.”
And like many volunteer-driven efforts, progress is slow. “The monkey project has been going on for a decade, and we only have 100,000 images classified,” Gandhi said. “With a machine learning model, that same number could be processed in a day or two.”
Farther south, researchers face a different kind of visibility problem: the open ocean. “The Galápagos are about 1,000 kilometers from the mainland,” said Corbin Jones, a biology professor and director of genomic technologies at UNC-Chapel Hill. “Around the islands there is a lot of life, but once you get into the open ocean, not so much. So how do you figure out what is actually out there?”
To answer that, teams deploy baited 360-degree cameras and record hours of footage. “We run these surveys twice a year across a large stretch of ocean, and there are about 20 sites,” Jones said. He estimates there are more than 20 terabytes of video waiting to be reviewed. “It takes an enormous amount of time for someone to watch it all, and people make mistakes.”
At the Universidad San Francisco de Quito, researcher Felipe Grijalva is using AI segmentation to break down that footage into usable pieces. “It can identify species, count how many there are, and even track which species appear together,” Jones said. The result is a process that replaces thousands of hours of manual review with something far more manageable.
Jones and his colleagues are also looking at something even smaller: the microscopic organisms that form the base of the ocean’s food chain. “Phytoplankton are the primary producers of the ocean,” he said. “What they do determines everything else. When they thrive, there is plenty of food. When they do not, everything else struggles.”
Studying them is slow, meticulous work that can keep researchers at microscopes for hours on end. An AI model being developed by Dr. Margarita Lankford, an environmental research specialist at UNC-Chapel Hill, could significantly speed that up, opening the door to questions that would otherwise take too long to answer. Jones expects a working version of the model within the next few years.

With so much anxiety surrounding AI, and no shortage of urgent global problems, natural science research can feel distant or even expendable. But much of this work operates on a different timeline, one focused on building a clearer, deeper understanding of the planet over decades.
That baseline knowledge is not abstract. It underpins efforts to track biodiversity loss, monitor environmental change, and even anticipate the spread of disease. Without it, responses are slower, less precise, and often too late.
Despite the hope that AI can improve human understanding of the environment, its potential costs are real. A 2025 report from Greenpeace Germany warned that AI’s electricity demand, emissions, water use, and raw material needs are all rising quickly, and that data center electricity demand could be 11 times higher in 2030 than in 2023 without government intervention. The same systems that accelerate research also require significant energy, water for cooling, and large-scale infrastructure to operate. Data centers are already facing pushback from communities across the country that view AI less as a solution and more as an environmental strain.
That tension between potential benefit and potential harm is not going away. But in places like NCMNS and RENCI, AI is extending researchers’ reach, turning decades of backlog into usable data. As Corbin Jones put it: “To protect something, you have to know what you have. A lot of our job is figuring out what’s there. The next step is understanding the risks. The third is knowing what makes it unique. From there, you can actually do something with it.”