
Copyright and fair use: GenAI hangs in the balance

A breakdown of the New York Times vs. OpenAI case

Introduction

One of the most significant GenAI-related lawsuits to date is making its way through the court system: New York Times vs. OpenAI. Essentially, the NY Times is suing OpenAI and Microsoft for intellectual property violations because Times articles were used in large language model training (ChatGPT in this case). This is fundamentally a copyright infringement case that touches on the nuances of LLM training, prompting, model memorization, and how content is ultimately used.
Personally, I’m hoping the courts find a balanced perspective. After all, the purpose of copyright law is to properly incentivize the innovation and creativity of authors and artists, while balancing the societal benefits gained from fair use.
What would be extremely disappointing is if the courts come down with decisions that effectively prevent GenAI tools from training on publicly available copyrighted content altogether. That would likely force GenAI providers to pay for licenses, which could lead to a range of negative outcomes.
The worst-case scenario is that GenAI simply ceases to exist. LLMs are used for far more than content creation, and halting GenAI's progress would mean these tools couldn't be used to better understand clinical trials in medicine, for example, or for myriad other use cases that benefit humanity. The best case among these negative outcomes is that only the biggest players could afford such licenses, foreclosing on open-source models and smaller providers.
Here's how I see things:
Copyright owners and publishers—like the NY Times—have claimed that GenAI uses data to train models in a way that violates copyright protections. Under copyright law, the copyright owner is the only one that can make copies of, distribute and create derivative works based on their original content, allowing authors and artists to financially benefit from their creativity. That being said, use by others under the fair use exception is permitted for research, criticism, and other limited purposes for the public interest.
Now, there's not much dispute that GPT and other GenAI tools scrape text, images, and other copyrighted material from the internet for training. That means copies of, and arguably derivative works based on, copyrighted materials are being created.
But the controversy here is whether we can rely on the fair use exception for this case. That boils down to four criteria:

1. How “transformed” is generated content?

We need to look at what's being created by these models to answer this question. Put simply: the more transformative the generated content is, the more likely it falls under fair use.
Now, unfortunately for OpenAI, the NY Times lawsuit includes examples of GPT generating both substantially similar stories and regurgitated paragraphs. In my view, this is the Times' most compelling argument, and it may result in a copyright infringement decision in its favor.
OpenAI has emphasized training its large language models to avoid the buggy behavior that results in regurgitation, and it seems only a matter of time before regurgitation is reduced to irrelevance. We should have little doubt that GPT will evolve its capabilities beyond where it is today. It would be short-sighted for the court to come down with a broader ruling or more severe damages than necessary simply because the NY Times managed to find prompts that demonstrate a current gap in GPT.
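To make the notion of regurgitation a bit more concrete, here is a minimal sketch of how one might measure verbatim overlap between a model's output and a source article. This is not OpenAI's or the Times' methodology; the function names, the 8-word n-gram window, and the placeholder strings are illustrative assumptions.

```python
# Minimal sketch: flag potential "regurgitation" by measuring how many word
# n-grams in a model's output also appear verbatim in a source article.
# The n-gram size (8) is an illustrative choice, not a legal or technical standard.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def verbatim_overlap(generated: str, source: str, n: int = 8) -> float:
    """Fraction of the generated text's n-grams that appear verbatim in the source."""
    gen = ngrams(generated, n)
    if not gen:
        return 0.0
    return len(gen & ngrams(source, n)) / len(gen)

if __name__ == "__main__":
    source_article = "..."  # hypothetical: the original NY Times article text
    model_output = "..."    # hypothetical: text produced from a targeted prompt
    score = verbatim_overlap(model_output, source_article)
    print(f"Verbatim 8-gram overlap: {score:.1%}")
```

A high overlap score would suggest memorized, copied passages rather than transformative output; a low score, on its own, would not settle the fair use question.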

2. How much of the copyrighted work was used?

In general, GPT has been trained on countless underlying works, without using significant portions of any single source of copyright-protected materials for training. Of course, if OpenAI has a specific dataset in mind like the NY Times database, with millions of articles that would really benefit model training, it should probably explore a licensing deal to mitigate risks and foster partnership—and this has reportedly been in progress with publishers for some time prior to the Times filing its lawsuit.
While it's feasible for OpenAI to work out an expensive deal—and OpenAI will probably settle on a bigger number than it initially would have liked—we should factor in the societal benefits of open-source GenAI models and smaller players in the GenAI space, instead of normalizing licensing deals that would only be feasible for large, well-funded players. This could lead to an anticompetitive ecosystem and exacerbate the problem of GenAI oligopolies.

3. How “creative” is the copyrighted work?

Copyright only protects originality and creativity, not underlying facts or the labor involved in simply compiling facts. Which is to say, NY Times articles would fall lower on the spectrum of creativity relative to works of fiction or art.
Even when it comes to fiction or art, my understanding is that GenAI models typically don't copy and incorporate creative content in a literal sense. What's actually stored in a model is usually something akin to a statistical analysis of the input as it relates to given parameters, e.g. shape, line, color, etc. This argument does get trickier with some text-to-image models, since a user can prompt the model in the style of a certain artist and get something that really does look like, say, Hieronymus Bosch. Text generation is simply a lot more nuanced.
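The claim that a model stores something akin to a statistical analysis, rather than the text itself, can be illustrated with a toy example. The sketch below uses a deliberately simple bigram counter; real LLMs learn dense numerical weights rather than explicit counts, and the corpus and function names here are made up for illustration.

```python
from collections import Counter, defaultdict

# Toy "language model" that stores only token-transition statistics.
# A deliberate simplification: the artifact that persists after training
# is statistical, not a verbatim copy of the training text.

def train_bigram_model(corpus: list[str]) -> dict[str, Counter]:
    stats: dict[str, Counter] = defaultdict(Counter)
    for document in corpus:
        tokens = document.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            stats[prev][nxt] += 1
    return stats

corpus = [
    "the court weighed the four fair use factors",
    "the court weighed the public interest",
]
model = train_bigram_model(corpus)

# What the "model" actually contains: counts, not articles.
print(model["the"])  # Counter({'court': 2, 'four': 1, 'public': 1})
```

The point of the toy example is that what persists after training is a table of statistics about which tokens tend to follow which, not a verbatim archive of the training documents.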
I also think fair use analysis for GenAI differs from the traditional analysis: it's much more practical and compelling to evaluate the output, rather than what was consumed (the input).

4. What's the effect on the market for the copyrighted work?

While OpenAI began as a non-profit entity focused on AI research, the argument against OpenAI grows stronger as it monetizes GPT and develops more commercial applications.
Still, it’s hard to tell whether OpenAI is really competing in the current or potential market for publishers like the NY Times. GPT is simply not designed to produce accurate research. There’s a reason you see that “ChatGPT can make mistakes. Consider checking important information” disclaimer on their app.
Which is to say: if I wanted to generate content based on current events, I would still go to news sources such as the NY Times to verify the content I am working on. LLMs are improving at a rapid pace but they still do hallucinate. This is well-established and something GenAI companies will readily admit.

Parting Thoughts

I remain open-minded and interested in empirical data that explores concerns about a reduction in the value of, or demand for, innovative journalism and art. Moreover, it's yet to be seen how GenAI might change the traditional roles of journalists and artists over the longer term, keeping in mind that they can also benefit from GenAI tools that enhance productivity. I think this article by Ken Liu is an interesting look at how an artist can approach these tools with curiosity, learning what they can and cannot do, rather than rejecting them on their face.
Finally, an important overarching consideration is that users are also responsible for their prompts and for vetting output to avoid copyright infringement. Users determine how generated content is distributed (e.g. whether it's published or not), whether for commercial gain or not-for-profit, and in which market. Therefore, we can't attribute control solely to the GenAI tools and companies themselves.
Fair use analysis was intentionally designed to be vague, allowing courts to decide on a case-by-case basis depending on specific facts. In NY Times v. OpenAI, the NY Times might be able to prevail based on a bug or rare model behavior that led to regurgitation. But beyond that, courts should tread lightly, avoid overly broad rulings, and leave licensing issues (and the future of the GenAI ecosystem) alone.


No information contained in this blog post should be construed as legal advice from Weights & Biases or the individual author, nor is it intended to be a substitute for legal counsel on any subject matter.
