Training AI Models? Here's What You Need to Know About Copyright Risks
It is undeniable that artificial intelligence (AI) is everywhere, and its effects on business and society are both rapid and unprecedented. The law, however, evolves at a slower pace, and it takes key decisions by courts and government agencies to help us understand the legal issues surrounding new technologies.
Recently, several key federal court decisions and Copyright Office reports have grappled with how large language models (LLMs) are trained and whether that training violates copyright law. These district court decisions out of Delaware and California provide important guidance on building models, training them, and other key legal risks to consider in the use of AI. However, as the law struggles with the tensions among rights owners, model developers, and the AI ecosystem as a whole, the facts and specifics of each situation and technology are likely to shape how cases are resolved and how the law evolves.
1. Thomson Reuters v. Ross Intelligence [1]
This was the first district court decision to assess, and reject, the fair use defense in the context of AI training data. Judge Stephanos Bibas of the Third Circuit, sitting by designation in the District of Delaware, determined that the defendant’s use was not transformative, especially because the protected material was used to create a tool that directly competed with the plaintiff.
Here the defendant, Ross Intelligence (Ross), used headnotes from Westlaw, a renowned legal research platform owned by Thomson Reuters, to train a competing legal research AI product. Thomson Reuters alleged that this use of its protected Westlaw headnotes infringed its rights. Ross argued that its use of Westlaw’s content was transformative because, while the headnotes were used to train Ross’s AI system, they were not provided directly to users. Notably, Ross’s tool was a non-generative AI: it did not create new content but instead used the ingested data to return relevant case law in response to user questions.
In ruling against Ross, the court determined that Ross’s AI tool served a similar purpose as Westlaw headnotes—legal research—and that Ross’s commercial intent undermined its fair use arguments. The court explained that a “transformative use” must add something fundamentally new to the copyrighted content. Using the content for the same purpose, albeit through a different technical process, does not transform the protected work sufficiently for fair use to apply.
Ross further argued that it did not directly use the Westlaw headnotes; rather, it had a vendor create bulk memos synthesizing case holdings based on the headnotes, and those bulk memos were what Ross’s AI ingested as training material. Upon review, however, the court found the bulk memos to be substantially similar to the Westlaw headnotes and rejected the notion that going through a vendor could insulate Ross from liability, as Ross was ultimately the one using the protected content. Ross also argued that there was no market for licensing the headnotes, and that it therefore was not interfering with the plaintiff’s business and caused no harm to Thomson Reuters. The court rejected this argument as well, finding that the potential market for licensing headnotes was both legitimate and foreseeable.
Key takeaway: Businesses must be cautious with their use of copyrighted data. There are serious risks associated with using third-party materials to train, develop, improve, or operate AI tools. The fair use defense may be unavailable when the entirety of the protected data is input into the AI system, and the risk is particularly acute when the parties are competitors.
2. Bartz v. Anthropic [2]
In Bartz v. Anthropic, Judge William Alsup of the Northern District of California held that using copyrighted books to train an AI model might qualify as fair use, as long as the books were obtained through lawful means, such as purchasing physical or digital copies.
Anthropic used millions of copyrighted books to train its Claude LLM, which is capable of generating prose that mimics the writing style of human authors. A group of authors brought suit, arguing that Anthropic used their books without permission to train its AI. Anthropic sourced its data both by purchasing millions of books and by downloading millions of pirated digital copies for free. Interestingly, the Claude system never provided users with infringing copies of the plaintiffs’ copyrighted works; instead, the allegations focus on the reproduction of the protected works to train Anthropic’s AI.
Judge Alsup ruled that training with lawfully acquired books could be considered fair use, emphasizing the transformative nature of using the purchased books to develop a generative AI system that wrote original works in the style of a human author. Judge Alsup noted that authors cannot exclude others from using their works to learn, particularly since this is how books have been used for centuries. However, he allowed the claims involving pirated books to proceed, making clear that “every factor points against fair use” when it comes to illegally sourced material and that “no damages from pirating copies could be undone by later paying for copies of the same works.”
Just last month, Judge Alsup granted class certification in this matter, allowing the case to proceed on behalf of a class of authors whose works were allegedly accessed by Anthropic. Anthropic has also filed a motion for interlocutory appeal of the piracy finding, which is currently pending.
Key takeaway: Using copyrighted data to train AI might be fair use, but, as Bartz makes clear, how the data is acquired can determine whether the use is protected. The output of the AI is equally important: in Bartz, the trained AI never reproduced or published the copyrighted materials to the public, which allowed the court to focus solely on the inputs and how they were used.
3. Kadrey v. Meta [3]
Meta developed its own LLM, Llama, by exposing it to massive amounts of text from various sources, including books. As in Bartz, the books were not republished to users—they were used to teach Llama how language works so it could generate entirely new content. Several authors took issue with their works being included in Llama’s training data and brought suit.
In assessing Meta’s motion for summary judgment, Judge Vince Chhabria of the Northern District of California started from the viewpoint that, in most cases, using copyrighted materials to train AI models without the authors’ permission would likely constitute infringement, especially where such use undermines the market for the copyrighted works. Further, in assessing Meta’s fair use arguments, Judge Chhabria acknowledged the transformative nature of using copyrighted works to train a generative LLM but cautioned that transformation alone does not guarantee fair use. The most significant factor, he explained, is how the use impacts the market value of the original works.
The court found that the model’s output was in no way a “recasting or adaptation of the plaintiffs’ books,” as Llama could output only trivial snippets of the ingested works; its output was therefore neither a direct infringement nor an infringing derivative work. For those claims to survive, the plaintiffs would have needed to allege that Llama’s outputs “incorporate in some form a portion of” their books. Further, the court found that the plaintiffs did not provide evidence that Meta’s use of their works resulted in any market harm, either by causing the AI to reproduce substantial portions of their books or by negatively affecting the market for licensing books as AI training data.
Importantly, Judge Chhabria’s ruling was fact-specific. He cautioned that his decision “does not stand for the proposition that Meta’s use of copyrighted materials to train its language models is lawful. It stands only for the proposition that these plaintiffs made the wrong arguments and failed to develop a record in support of the right one.”
Key takeaway: For rights holders, alleging infringement isn't enough. Plaintiffs should be ready to explain how AI training harms the market for the original work. The court noted that plaintiffs barely gave lip service to the issue of market impact and presented no evidence of how AI-generated content would actually compete with their specific works.
For AI developers, this suggests that transformation alone may not be enough to successfully assert a fair use defense, and that uses creating minimal market impact may receive stronger fair use protection.
4. The Interplay of LLM Training and Copyright in AI Outputs
Beyond training inputs, another important copyright question is whether AI-generated outputs are protectable. The U.S. Copyright Office, in reports issued in January and May 2025, states that works produced solely by AI, without meaningful human creative input, cannot be copyrighted. However, it allows for copyright protection when AI assists human creativity. Determining whether human contributions to AI-generated outputs are sufficient to constitute authorship must be analyzed on a case-by-case basis. Further, in the May report, the Copyright Office acknowledged that training using copyrighted materials “clearly implicate[s] the right of reproduction,” apparently establishing a presumption that such training could be infringing absent fair use or other related defenses.
This interplay is the subject of a pending case recently filed by film studios, including Disney and Universal, against the AI image generation platform Midjourney.[4] The allegations in the suit center on the platform’s creation of allegedly infringing derivative works that feature the studios’ characters. The studios also allege that the training data Midjourney uses is itself the product of copyright infringement, as the data was allegedly scraped from the internet without proper consent or license. This action has the potential to provide further guidance in the realms of both AI training and output.
Key takeaway: Copyright concerns don't end with training data. AI-generated outputs raise ongoing questions about ownership, protection, and potential infringement that companies need to consider in their AI strategies.
Conclusion
The common thread across these developments is that training AI models on copyrighted materials carries risk. While some training uses have received fair use protection, companies can't assume all AI development is automatically covered by the doctrine. Indeed, if the training data came from unlicensed material, that alone could create liability for pirating works—even if the ultimate use of that material to train the model constitutes fair use.
If you're developing models in-house, conduct a thorough audit of your training data sources. If you're working with outside vendors or purchasing pre-trained models, understand where their data comes from and how it was acquired. Consider seeking licenses for high-risk content categories, and when possible, negotiate contractual protections such as representations and indemnities.
The legal landscape is evolving rapidly, with courts balancing innovation against creators' rights. Companies need to stay informed about new developments and build compliance into AI development from the ground up.
[1] Thomson Reuters Enter. Centre GmbH et al. v. ROSS Intelligence Inc., No. 1:20-cv-00613-SB (D. Del. Feb. 11, 2025).
[2] Andrea Bartz et al. v. Anthropic PBC, No. C 24-05417 WHA (N.D. Cal. June 23, 2025).
[3] Richard Kadrey et al. v. Meta Platforms, Inc., No. 3:23-cv-03417-VC (N.D. Cal. June 25, 2025).
[4] Disney Enterprises, Inc. et al. v. Midjourney, Inc., No. 2:25-cv-05275 (C.D. Cal. 2025).