The Augmented Educator


Is AI Training 'Stealing'?

What a $1.5 Billion Settlement Reveals About AI, Copyright, and Data Provenance

Michael G Wagner
Oct 19, 2025

Before we begin, a quick disclaimer: I am not a lawyer, and nothing in this essay should be considered legal advice. This is my interpretation as an educator and creator, based on my understanding of publicly available court documents and reporting, aimed at helping us all navigate these complex and rapidly developing issues.

The Copyrighted Character Conundrum

When OpenAI unveiled Sora 2, the response was immediate and visceral. The system’s ability to generate video clips featuring unmistakable facsimiles of beloved, copyrighted characters made the abstract debate about AI and copyright feel suddenly, uncomfortably concrete. This was not science fiction, nor was it far-off speculation. It was happening now, and it was remarkably good.

This ability to imitate copyrighted material exists far beyond video. Image generators like Midjourney have long allowed users to create images of “Mickey Mouse in the style of Van Gogh” with a few keystrokes. The technology doesn’t just approximate these characters; it recreates them with stunning fidelity, deploying them in contexts their creators never imagined or allowed.

Many creatives consider this a deep betrayal. Artists, writers, and filmmakers have articulated their reaction in sharp terms: this is theft. It’s digital strip-mining, they argue, where a lifetime of creative work is consumed without permission or compensation to build a commercial product worth billions. The assumption driving this anger is simple. For an AI to replicate a style or character so effectively, the company behind it must have done something illegal to acquire that knowledge.

This reaction is completely understandable. However, the term “stealing” may hide more than it reveals. A closer examination of recent high-stakes lawsuits reveals a far more nuanced legal landscape than the rhetoric of theft suggests. U.S. courts are beginning to indicate that the process of training an AI may indeed constitute legal fair use, provided the training data was lawfully obtained. The illegality, it turns out, lies not in the training itself but in the provenance of the data. Understanding this distinction is essential for having a productive conversation about the future of creativity and compensation in the age of AI.

The High Price of Piracy: Bartz v. Anthropic

In August 2024, a trio of authors (Andrea Bartz, Charles Graeber, and Kirk Wallace Johnson) filed a class-action lawsuit against Anthropic, the company behind the Claude family of language models. The primary charge was serious: Anthropic had trained Claude by feeding it hundreds of thousands of books obtained from notorious pirated ebook repositories, specifically Library Genesis and the Pirate Library Mirror, commonly known as “shadow libraries.” This wasn’t a question of inadvertent use or ambiguous licensing terms. The complaint accused Anthropic of knowingly downloading and storing massive quantities of copyrighted material from sources explicitly dedicated to piracy.

The legal stakes were existential. Under U.S. copyright law, statutory damages range from $750 to $30,000 per work, escalating to $150,000 per work if a court finds the infringement was willful. With a certified class representing approximately 500,000 infringed works, Anthropic faced potential liability in the tens of billions of dollars, enough to destroy even a highly valued company.
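To see why observers described this exposure as existential, it helps to multiply the statutory figures out. The sketch below is a back-of-the-envelope illustration using the per-work damages ranges cited above and the approximate class size of 500,000 works; it is not a calculation from the court filings themselves.

```python
# Rough statutory-damages exposure, given the figures cited above:
# $750 floor and $30,000 ceiling per work, rising to $150,000 per work
# for willful infringement, across a class of ~500,000 works.

works = 500_000  # approximate size of the certified class

floor = works * 750            # statutory minimum per work
standard_max = works * 30_000  # ordinary statutory maximum per work
willful_max = works * 150_000  # maximum if infringement is found willful

print(f"Floor:        ${floor:,}")         # $375,000,000
print(f"Standard max: ${standard_max:,}")  # $15,000,000,000
print(f"Willful max:  ${willful_max:,}")   # $75,000,000,000
```

Even at the ordinary statutory ceiling the exposure reaches $15 billion, and a willfulness finding would push it to $75 billion, which is why "tens of billions of dollars" is, if anything, a conservative characterization.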

On June 23, 2025, Judge William Alsup of the U.S. District Court for the Northern District of California issued a pivotal ruling that fundamentally reframed the legal debate. Rather than delivering a sweeping judgment for either side, Judge Alsup meticulously separated Anthropic’s actions into distinct components and evaluated each under the fair use doctrine.

His assessment of the legality of the training procedure itself was quite remarkable. Judge Alsup found that using books to teach an AI statistical patterns of language is “spectacularly transformative.” In one widely quoted passage, he drew an analogy to human learning: “Like any reader aspiring to be a writer, Anthropic’s LLMs trained upon works not to race ahead and replicate or supplant them—but to turn a hard corner and create something different.” The AI, he reasoned, doesn’t function as a substitute for the original books. It uses text to learn how to generate something entirely new. On this crucial point, the court sided with the AI company, suggesting that the act of training on lawfully acquired material falls under fair use.

However, Judge Alsup’s ruling contained a critical caveat. While the training process might be transformative, the initial act of downloading and storing a massive library of pirated books remained what it had always been: large-scale copyright infringement. Fair use, the court made clear, cannot be invoked to legitimize theft. The judge found this aspect of Anthropic’s conduct to be “inherently, irredeemably infringing,” an action that “plainly displaced demand for authors’ books—copy for copy.”

This bifurcated ruling created what legal observers have called a “liability firewall.” It protected the innovative act of training under fair use while unequivocally condemning the underlying piracy. The distinction proved decisive. In September 2025, Anthropic agreed to a historic $1.5 billion settlement, the largest publicly known copyright settlement in history. The settlement was not a fine for the act of training but compensation for the illegal acquisition of training data, a crucial distinction that reshapes the entire debate.
