Number background from Unsplash. Photo of Davis from UNC Libraries. Journals created using Adobe Stock, Elsevier logo from Elsevier. | Graphic designed by Jane Durden.
Story by Hannah Rosenberger
Graphics by Jane Durden
After months of negotiations, UNC-Chapel Hill’s library system has yet to sign a contract renewal with two large academic publishers due to proposed language about artificial intelligence use.
The new contracts from Elsevier and Springer, two of the largest scholarly journal publishers in the world, would require a near-total ban on AI usage related to content licensed to UNC-CH, said Anne Gilliland, UNC Libraries’ scholarly communications officer.
The publishers want to prohibit their content from being used to train AI systems created by the university or entered as a prompt into AI models like ChatGPT, Gilliland said.
Academic research often sits behind a paywall, accessible only through expensive subscriptions or university library systems. If publishers’ articles are used to train AI models and produce new work, the publishers could fear competition coming from their own content, now made available through an entity they don’t control.
But from the library side, Gilliland said UNC Libraries has no way to fulfill that contract language because it can’t monitor the activity happening with an article after it is pulled from a library database.
For now, UNC Libraries has maintained access to journals through both publishers. But the ongoing negotiations make things complicated, Gilliland said, especially because the Elsevier journals make up the largest selection of journals paid for out of the library budget.
“This isn’t something that we would ever, ever walk away from lightly, if at all,” Gilliland said. “I mean, they’re very important journals to us. And one of the things that’s sort of part of the whole negotiating situation is — how do we pay them without agreeing to contract terms that will bind us long term in ways that we don’t think we can sustain?”
AI and copyright concerns
Publisher concerns about AI usage are at least partially rooted in the question of copyright.
Although academic publishers like Elsevier and Springer do not create original research material, they do have copyright control over their content as publishers, giving them a level of authority over what happens to it — and who can use it — after publication.
One of the first battlegrounds for addressing the question of copyright for material used to train AI systems is an ongoing lawsuit brought by the New York Times against OpenAI, the company that created ChatGPT.
In the lawsuit, the Times alleges the tech firm used millions of copyrighted articles to train its AI system without permission, and that, when prompted by users, ChatGPT regurgitated NYT articles almost verbatim. In one instance, the chatbot also falsely stated that two recommendations for office chairs came from the Times’ Wirecutter product review page.
The specific legal question raised by this claim is whether using copyrighted material to train AI models falls under the “fair use” doctrine.
Copyright regulations protect intellectual property from being used in undesirable ways by people who are not the creator or rightsholder, but there are ways that copyrighted material can be used legally under the fair use doctrine. For example, if copyrighted material is transformed into a completely different piece of work, this “transformative use” would be an acceptable — and legal — fair use of the material.
The New York Times lawsuit presents an interesting dilemma because there are some instances of substantial similarity with the copyrighted source material, said Dave Hansen, the executive director of Authors Alliance, an organization advocating for authors’ ability to share their work broadly.
But according to UNC-CH School of Law assistant professor Dustin Marlan, there’s at least an argument to be made for fair use in examples other than direct reproduction.
“Here you have a situation where, yes, the copyrights are probably getting infringed, as a result of OpenAI using them as training data, but the output looks very different,” Marlan said. “It’s not a word-for-word replica of the original source. Instead, what’s going on is ChatGPT is producing something that looks very different than the original New York Times stories.”
Using AI for research
AI systems like ChatGPT use their knowledge of language patterns — developed from the content used to train them — to predict the next word in a sequence or sentence, said Shashank Srivastava, an assistant professor in UNC-CH’s computer science department who specializes in AI.
It’s similar to the predictive function in email apps or text messages that anticipates the next word you’ll type, Srivastava said.
“These language models like ChatGPT are essentially the same thing — autocomplete but on steroids,” he said.
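The autocomplete comparison can be illustrated with a toy sketch. The code below is not how ChatGPT is built; it is a simple word-pair counter that guesses the next word from whichever word most often followed the current one in its training text. The sample sentences and the function names `build_bigram_counts` and `predict_next` are invented here for illustration.

```python
from collections import Counter, defaultdict

# Toy training text, invented for illustration only.
CORPUS = (
    "the library renewed the contract . "
    "the library paid the publisher . "
    "the library renewed the license ."
)

def build_bigram_counts(text):
    """Count how often each word follows each other word."""
    words = text.split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = counts.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

counts = build_bigram_counts(CORPUS)
print(predict_next(counts, "library"))  # prints "renewed"
```

In this sample, "renewed" follows "library" twice and "paid" follows it once, so the sketch predicts "renewed." Large language models rely on far richer statistical patterns learned from vastly more text, but the basic task of predicting a likely next word is the same one Srivastava describes.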
In most cases, then, AI models use common sentence constructions and phrasing choices to produce their content. Hansen said the way the system works can be more like statistics than a new form of creative expression — and copyright doesn’t protect facts, data or ideas.
One reason the fair use doctrine exists in the first place is to promote scientific progress, Hansen said, including the research people are specifically producing with AI.
But with the proposed language in UNC Libraries’ contract negotiations, Hansen is concerned about researchers who rely on AI and library-accessed content in their work, such as using an AI system to identify patterns in a large body of text.
That type of activity, in his view, would generally be protected by fair use. But, he said, because publishers are concerned about commercial AI applications, they’re proposing license clauses that disregard the copyright conversation altogether — which could prevent these researchers from doing their work at all.
“There’s a group of researchers who’s been using versions of this technology for a decade plus,” Hansen said. “They’ve been quietly going along doing it without any problem, and then OpenAI stormed on the scene, and now all of the large publishers that own rights are terrified of what’s gonna happen to their content and their copyrights. And so they’re placing all these restrictions on it.”
The publisher Elsevier is concerned about generative AI systems being trained on datasets that combine peer-reviewed research with less reliable sources, according to a statement from an Elsevier spokesperson.
“Elsevier recognises the value of GenAI for the researcher as a valuable support tool as long as the tool is used in a self-hosted or closed environment and does not train the algorithm of a third party, nor that content or data feeds the (large language model) of the third party,” the statement said.
Implications for UNC Libraries’ users
Naji Husseini, the associate director of undergraduate studies for UNC-CH’s joint department of biomedical engineering with North Carolina State University, recognizes the power of AI programs like ChatGPT for enhancing student learning.
As someone who has read a lot of poor scientific writing, he said he’s in favor of tools that can help improve writing quality, including using ChatGPT as an editor, or for brainstorming.
But, he said, when students input homework problems directly into generative AI software — without reading or comprehending its response — that causes more concern about students’ ability to actively apply course material.
“Students just generate gibberish in some free response questions on homework — I can tell immediately,” Husseini said. “I can read something and it has no connection to class, it’s written in florid language that doesn’t really mesh with their peers. That’s not a student that wrote it, and that’s not a student that thought about it.”
While publishers are more directly concerned with their content being used to train AI systems, students’ inadvertent, almost naive use of generative AI has caused the most hesitation for Gilliland about UNC-CH’s contracts with Elsevier and Springer.
In some cases, Gilliland believes using AI with journal content is fair use, such as when a student pastes a section of an article into ChatGPT and asks for a summary of the information.
She said she thinks an unaware student using AI in this capacity is most likely to breach this publisher policy, were it to go into effect. But it’s also unlikely they would be caught.
UNC Libraries has no way of enforcing the AI usage policies Elsevier and Springer want in their contracts, Gilliland said. And because the university has yet to reach a contract agreement, she is still unclear on what exactly would happen to the library-publisher relationship or to the violator.
“We don’t feel that we can exert sufficient control — and don’t want to — over the activities of students, faculty and staff,” Gilliland said. “We can’t somehow keep every student from ever typing in some text from a journal article if they hope to get a better idea of an explainer of what the article meant.”
Gilliland said the publishers did express some willingness to be more lenient in instances where the AI use isn’t connected to training a larger model.
An Elsevier spokesperson said what was important was making “appropriate reasonable efforts” to ensure that products were being used according to contract terms.
Looking forward
Marlan, the copyright lawyer, expects that companies like OpenAI might pay licensing fees to publishers — both organizations like the New York Times and academic distributors like Springer and Elsevier — if they want to use the content to train their systems.
But Liz Milewicz, the director of the ScholarWorks Center for Open Scholarship at Duke University, said she’s concerned this type of licensing agreement will cause AI to become a paid service, like Internet access. That could come through companies like OpenAI or publishers creating their own AI systems, like Elsevier’s Scopus AI, which is trained on Elsevier-published content.
“I can see a real value in at least acknowledging and compensating those entities that enable that content to be shared,” she said. “But I also recognize that right now there are a lot of publishing companies that stand to benefit quite richly from that model because they own the content. It doesn’t matter who created it originally. They’re the ones who now own and control it.”
That type of licensing deal also doesn’t offer as clear a solution for university library systems like UNC-CH’s, whose main usage of these resources isn’t AI-driven.
In 2020, UNC Libraries’ license deal with Elsevier cost $2.6 million annually, according to a UNC-CH announcement. The high price tag forced the library to reduce the number of titles it subscribed to, but even with the reduction, Elsevier still has the most expensive content of UNC-CH’s subscriptions, Gilliland said.
“Much of the university is blissfully unaware of the incredible cost of providing access to scholarly research,” Milewicz said. “And much of the great wealth that has accrued to publishers over the last decade — not all publishers, but some of these major science publishers — has been because they are providing access to the content that scholars want to use.”
Gilliland said the next step in UNC Libraries’ negotiations with Elsevier and Springer will likely be a meeting with an attorney representing the university, and then with representatives from the publishers.
She’s hopeful journal access will continue during negotiations. But she knows these conversations are also ongoing at many other research universities, including Duke University Libraries.
“It seems to be growing as more and more contracts come up for renewal and people are dealing with these contract terms,” Gilliland said. “I don’t know how it’s all going to end up.”