Is AI Training 'Stealing'?
What a $1.5 Billion Settlement Reveals About AI, Copyright, and Data Provenance
Before we begin, a quick disclaimer: I am not a lawyer, and nothing in this essay should be considered legal advice. This is my interpretation as an educator and creator, based on my understanding of publicly available court documents and reporting, aimed at helping us all navigate these complex and rapidly developing issues.
The Copyrighted Character Conundrum
When OpenAI unveiled Sora 2 not that long ago, the response was immediate and visceral. The system’s ability to generate video clips featuring unmistakable facsimiles of beloved, copyrighted characters made the abstract debate about AI and copyright feel suddenly, uncomfortably concrete. It was not science fiction, nor was it something far-off. It was happening now, and it was remarkably good.
This ability to imitate copyrighted material exists far beyond video. Image generators like Midjourney have long allowed users to create images of “Mickey Mouse in the style of Van Gogh” with a few keystrokes. The technology doesn’t just approximate these characters; it recreates them with stunning fidelity, deploying them in contexts their creators never imagined or allowed.
Many creatives consider this a deep betrayal. Artists, writers, and filmmakers have articulated their reaction in sharp terms: this is theft. It’s digital strip-mining, they argue, where a lifetime of creative work is consumed without permission or compensation to build a commercial product worth billions. The assumption driving this anger is simple. For an AI to replicate a style or character so effectively, the company behind it must have done something illegal to acquire that knowledge.
This reaction is completely understandable. However, the term “stealing” may hide more than it reveals. A closer examination of recent high-stakes lawsuits reveals a far more nuanced legal landscape than the rhetoric of theft suggests. U.S. courts are beginning to indicate that the process of training an AI may indeed constitute legal fair use, provided the training data was lawfully obtained. The illegality, it turns out, lies not in the training itself but in the provenance of the data. Understanding this distinction is essential for having a productive conversation about the future of creativity and compensation in the age of AI.
The High Price of Piracy: Bartz v. Anthropic
In August 2024, a trio of authors (Andrea Bartz, Charles Graeber, and Kirk Wallace Johnson) filed a class-action lawsuit against Anthropic, the company behind the Claude family of language models. The primary charge was serious: Anthropic had trained Claude by feeding it hundreds of thousands of books obtained from notorious pirated ebook repositories, specifically Library Genesis and the Pirate Library Mirror, commonly known as “shadow libraries.” This wasn’t a question of inadvertent use or ambiguous licensing terms. The complaint accused Anthropic of knowingly downloading and storing massive quantities of copyrighted material from sources explicitly dedicated to piracy.
The legal stakes were existential. Under U.S. copyright law, statutory damages range from $750 to $30,000 per work, escalating to $150,000 per work if a court finds the infringement was willful. With a certified class representing approximately 500,000 infringed works, Anthropic faced potential liability in the tens of billions of dollars, enough to destroy even a highly valued company.
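The exposure behind those figures is simple multiplication. As a rough illustration of the stakes (the per-work amounts are the statutory ranges cited above, the class size is the reported estimate, and actual awards would be set by the court):

```python
# Back-of-the-envelope estimate of Anthropic’s potential statutory
# exposure, using the figures cited above. Illustrative only: real
# awards are determined by the court, not by simple multiplication.

WORKS = 500_000  # approximate size of the certified class

def exposure(per_work_damages: int, works: int = WORKS) -> int:
    """Total statutory liability at a given per-work award."""
    return per_work_damages * works

low = exposure(750)          # statutory minimum per work
high = exposure(30_000)      # statutory maximum per work (non-willful)
willful = exposure(150_000)  # statutory ceiling if willfulness is found

print(f"Minimum:     ${low:,}")      # $375,000,000
print(f"Maximum:     ${high:,}")     # $15,000,000,000
print(f"Willful max: ${willful:,}")  # $75,000,000,000
```

Even the non-willful ceiling lands at $15 billion, and a willfulness finding pushes the figure to $75 billion, which is why observers described the exposure as existential.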
On June 23, 2025, Judge William Alsup of the U.S. District Court for the Northern District of California issued a pivotal ruling that fundamentally reframed the legal debate. Rather than delivering a sweeping judgment for either side, Judge Alsup meticulously separated Anthropic’s actions into distinct components and evaluated each under the fair use doctrine.
His assessment of the training process itself was remarkable. Judge Alsup found that using books to teach an AI statistical patterns of language is “spectacularly transformative.” In one widely quoted passage, he drew an analogy to human learning: “Like any reader aspiring to be a writer, Anthropic’s LLMs trained upon works not to race ahead and replicate or supplant them—but to turn a hard corner and create something different.” The AI, he reasoned, doesn’t function as a substitute for the original books. It uses text to learn how to generate something entirely new. On this crucial point, the court sided with the AI company, suggesting that the act of training on lawfully acquired material falls under fair use.
However, Judge Alsup’s ruling contained a critical caveat. While the training process might be transformative, the initial act of downloading and storing a massive library of pirated books remained what it had always been: large-scale copyright infringement. Fair use, the court made clear, cannot be invoked to legitimize theft. The judge found this aspect of Anthropic’s conduct to be “inherently, irredeemably infringing,” an action that “plainly displaced demand for authors’ books—copy for copy.”
This bifurcated ruling created what legal observers have called a “liability firewall.” It protected the innovative act of training under fair use while unequivocally condemning the underlying piracy. The distinction proved decisive. In September 2025, Anthropic agreed to a historic $1.5 billion settlement, the largest publicly known copyright settlement in history. The settlement was not a fine for the act of training but compensation for the illegal acquisition of training data. This is a crucial distinction that reshapes the entire debate.
A Wrinkle in the Ruling: Kadrey v. Meta
Just two days after Judge Alsup’s ruling, a parallel case produced a strikingly different conclusion. Authors had sued Meta for training its Llama model using books sourced from the same shadow libraries. Judge Vince Chhabria, also of the Northern District of California, found that Meta’s training constituted fair use regardless of the data’s pirated source. Judge Chhabria placed greater emphasis on the fourth fair use factor—market harm—and concluded that because the plaintiffs had not adequately demonstrated that the AI’s output was directly harming the market for their books, the fair use defense succeeded.
Judge Chhabria explicitly criticized Judge Alsup’s reasoning, suggesting it gave too much weight to the transformative nature of the technology while undervaluing market harm analysis. This direct contradiction between two federal judges in the same courthouse highlights the profound legal uncertainty that persists in this area. The outcome of an AI copyright lawsuit may depend heavily on the specific judge assigned to the case and their particular interpretation of fair use.
The Broader Legal Battlefield
The Anthropic and Meta cases represent just two data points in a much larger and still-evolving legal landscape. Over 50 similar lawsuits are currently working their way through U.S. courts, involving defendants ranging from OpenAI and Microsoft to Stability AI, and plaintiffs spanning news organizations, visual artists, and book publishers.
Not all judges are siding with AI companies. In The New York Times v. Microsoft Corporation, a judge denied OpenAI’s motion to dismiss, allowing the case to proceed on the argument that ChatGPT’s outputs can act as a direct market substitute for the Times’ journalism. The court found this claim plausible enough to warrant further discovery. In Andersen v. Stability AI, involving visual artists, the court rejected the notion that AI models contain only unprotectable “data,” acknowledging the possibility of infringement claims based on both the training process and the outputs generated.
These divergent rulings show that the “transformative use” argument is not a guaranteed shield for AI developers, but they also reveal an emerging pattern in judicial reasoning. While judges disagree on whether pirated data taints an otherwise transformative use, there is growing consensus that training on lawfully acquired material strengthens the fair use defense considerably. Judge Alsup’s distinction between the “spectacularly transformative” training process and the “irredeemably infringing” act of piracy suggests that provenance, rather than the training methodology itself, may be the decisive factor.
Legal experts predict we are unlikely to see additional major summary judgment decisions on fair use until mid-2026 at the earliest, leaving the industry to operate under legal uncertainty for the foreseeable future. Yet the settlements that have been reached, particularly Anthropic’s massive payout, provide the clearest indication of where liability actually lies. Companies are recognizing that relying on pirated data carries existential financial risks they can no longer afford to take—not because training is inherently illegal, but because the foundation upon which that training rests can undermine even the strongest transformative use argument.
Provenance, Not Process
These cases make one thing clear: the origin of the data is crucial. The legal fight is shifting away from whether training itself constitutes fair use (a defense that increasingly appears to succeed when the data is lawfully sourced) and toward the legality of the data collection methods. The critical question for AI developers is therefore no longer simply “is our use transformative?” but “can we prove we acquired this data legally?”
This shift exposes a fundamental problem: our entire copyright and licensing infrastructure was built for a pre-AI world. While most platforms and content providers have begun updating their terms of service to explicitly address AI training, the vast majority of existing material, particularly content created before 2022, exists under license agreements that never considered this use case. The terms of service we accepted on social media, the licenses we purchased for stock photos, the permissions we granted when uploading content to platforms—these were often drafted in an era when “use” meant human consumption, not machine learning.
These legal frameworks were designed around human readers, viewers, and listeners engaging with discrete works, not algorithms that consume billions of words to extract and encode statistical patterns. This temporal mismatch creates a gray zone where content that was legally acquired for one purpose may or may not be legally deployable for another, depending on how courts interpret the scope of those original licenses.
These court rulings are forcing a long-overdue evolution. From now on, every licensing agreement must contain explicit clauses that grant or deny permission for content to be used in AI training datasets. We are witnessing the birth of a new licensing market, driven by legal necessity rather than abstract principle. The Anthropic settlement, with its benchmark of approximately $3,000 per work, provides rights holders with concrete leverage in these negotiations.
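That benchmark falls straight out of the reported settlement terms. A one-line sanity check, assuming the roughly 500,000-work class size cited earlier:

```python
# Implied per-work payout under the reported settlement terms.
settlement = 1_500_000_000  # reported settlement amount in USD
works = 500_000             # approximate certified class size

per_work = settlement / works
print(f"${per_work:,.0f} per work")  # $3,000 per work
```

For rights holders negotiating AI licensing deals, that per-work figure now serves as a concrete, court-adjacent reference price.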
This creates an urgent imperative for creators and educators. We must become vigilant about the fine print of every platform we use. We need to scrutinize licensing terms, advocate for clarity about AI use, and understand how our work may be deployed in ways we never expected. The days of passive acceptance of boilerplate terms of service are over for anyone who cares about maintaining control over their creative output.
Beyond the Law: The Unanswered Ethical Question
A fair-use ruling from a judge does not settle the question of what is fair in a moral or ethical sense. Legal analysis addresses only a narrow set of questions; it does not resolve the deeper tensions that drive this debate.
Consider the central ethical dilemma: even if an AI company legally purchases one copy of every book ever written, is it right for them to build a $183 billion enterprise on the collective knowledge encoded in those works without a model for sharing that value with the creators who produced it? The law, as it currently stands, suggests this is permissible. Our ethical intuitions may suggest otherwise.
The Anthropic settlement, while historic in scale, raises important questions about accountability and market structure. At $1.5 billion, the payout represents less than 1% of Anthropic’s reported valuation. For a well-capitalized company, this functions less as a deterrent than as a retroactive licensing fee. It is a way to clean up past infractions while continuing to benefit from models trained on the very data they’re now paying for. Smaller companies and startups, meanwhile, cannot follow this “infringe now, pay later” playbook. They must bear the higher costs and longer timelines of ethical data acquisition from day one, raising barriers to entry and potentially safeguarding the market position of the very companies whose behavior created the legal precedent.
But framing this practice solely as theft also prevents us from designing the new systems we desperately need. It channels our energy into fighting yesterday’s battles rather than constructing tomorrow’s solutions. We need licensing frameworks that account for AI use. We need transparent reporting about what data was used to train which models. And we need mechanisms for compensating creators when their work contributes to systems that generate enormous economic value. None of these solutions emerge from shouting “theft” into the void.
What the Courts Tell Us—and What They Don’t
The cases discussed here have brought clarity to one aspect of the AI copyright debate. The training process itself, when conducted on lawfully acquired material, appears increasingly likely to be protected as transformative fair use. The vulnerability lies instead in the data’s origin: in the industry’s historical reliance on mass piracy to acquire the vast datasets required for effective model training. Anthropic’s $1.5 billion settlement was not payment for innovation but compensation for theft of the traditional kind, downloading and storing copyrighted works from pirate sites.
This distinction matters immensely for how the industry moves forward. Companies now recognize that data provenance is not a secondary concern but a primary legal obligation. The era of treating the internet as a free repository for training data is over. The settlements, the court rulings, and the ongoing litigation all point toward a future where AI developers must demonstrate that their training data was lawfully obtained through explicit licensing agreements, partnerships with content owners, or the use of verifiably public-domain sources.
But legal compliance alone does not resolve the ethical questions. The law may determine what is permissible, but our values must determine what is acceptable. The path forward for creators and educators is to use the clarity these lawsuits have provided to demand better. We must push for transparency in training data, advocate for licensing models that explicitly address AI use, and lead nuanced ethical discussions about how we will value and sustain human creativity in our increasingly augmented world. We must recognize that legal compliance and ethical obligation are not the same thing, and both require our attention.
The conversation is just beginning. The questions are complex, and the stakes are high. But we cannot address these challenges productively if we remain anchored to a framing that the courts are telling us misses the mark. Understanding what the law actually says—and what it fails to address—is the first step toward building a future where both human creativity and technological innovation can flourish.
How are you thinking about AI and copyright in your creative or educational practice? Have you reviewed the terms of service for the platforms where you share your work? If you’re using AI tools, have you considered the provenance of their training data? For educators: how are you helping students understand the distinction between legal compliance and ethical responsibility in this rapidly evolving landscape? Share your experiences, concerns, and strategies in the comments.
P.S. I believe transparency builds the trust that AI detection systems fail to enforce. That’s why I’ve published an ethics and AI disclosure statement, which outlines how I integrate AI tools into my intellectual work.