The Legality of AI Training on Copyrighted Written Works
2025 | Josiah Coronado (Editor-in-Chief)
Introduction
“Imagine a future where the very books, articles, and works of art we cherish could be leveraged by algorithms to create new forms of expression — without the permission of the authors. The rising use of AI to train on copyrighted literature has sparked intense debate about ownership, creativity, and the boundaries of intellectual property. As technology accelerates and AI becomes an ever more powerful tool for generating new works, the question looms: when AI is trained on copyrighted content, who truly owns the output — the machine, the creator of the algorithm, or the original author? This paper examines the legal landscape surrounding the training of AI on copyrighted literature, exploring the delicate balance between fostering innovation and protecting the rights of creators in the digital age.” This hook was written entirely by Artificial Intelligence (AI), and it may have been difficult to tell. It is becoming increasingly difficult to distinguish human-made writing and art from AI-generated work, and the problem will only worsen as AI continues to be trained on an ever-growing body of material.
AI training occurs when an AI model is fed large amounts of data so that it learns to recognize patterns and to produce improved outputs based on the patterns it identifies. Issues arise when AI is trained on the work of writers without their consent, allowing it to replicate their writing styles and words and essentially depriving them of the financial benefits that would otherwise be theirs. Currently, multiple parties in separate legal cases are suing OpenAI and similar AI companies for alleged copyright infringement. The defendants in these lawsuits are alleged to have trained their large language models on databases comprising pirated copies of various authors’ works. Because these cases share many similarities, examining select cases from this roster makes it possible to identify their potential future implications.
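To make the training process concrete, the sketch below is a deliberately simplified, hypothetical illustration of learning patterns from text: it counts which word tends to follow which in a toy corpus and then generates new text from those counts. It is not how any system at issue in these cases actually works; commercial large language models use neural networks trained on vastly larger datasets. It is only loosely analogous, showing how statistical patterns can be extracted from input text, including potentially copyrighted text, and reused to produce new output.

```python
import random
from collections import Counter, defaultdict

# Minimal illustrative sketch (not any company's actual system): a tiny
# word-level bigram model "trained" on a toy corpus of text.

def train(corpus: str) -> dict:
    """Count which word tends to follow which; these counts are the learned 'patterns'."""
    words = corpus.split()
    follows = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows

def generate(model: dict, start: str, length: int = 10) -> str:
    """Produce new text by repeatedly sampling a likely next word from the learned counts."""
    word, output = start, [start]
    for _ in range(length):
        if word not in model:
            break
        candidates = model[word]
        word = random.choices(list(candidates), weights=list(candidates.values()))[0]
        output.append(word)
    return " ".join(output)

# Hypothetical example corpus; in real disputes, the training data itself is
# the crux, since it may contain protected expression.
corpus = "the court ruled that the court denied the motion that the court heard"
model = train(corpus)
print(generate(model, "the"))
```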
Historical Context
In Tremblay v. OpenAI, Inc., a group of authors filed a class-action lawsuit against OpenAI, alleging that the company used copyrighted literary works as training data for its AI model, ChatGPT. In this case, as in many others, the company moved to dismiss all claims against it except for direct copyright infringement, and the court ruled in favor of OpenAI.[1] The court dismissed the Plaintiffs’ vicarious copyright infringement claim, which was based on the potentially copyright-violating outputs users could generate with ChatGPT. The court also rejected the argument that any output from ChatGPT is inherently an infringing derivative work and found that the Plaintiffs had not alleged “what the outputs entail” or “that any particular output is substantially similar…to [plaintiff’s] books.”[2] Substantial similarity is defined differently depending on the jurisdiction in which a case is filed. This case was filed in the United States District Court for the Northern District of California, placing it within the Ninth Circuit. The Ninth Circuit determines substantial similarity using a two-part test: an extrinsic test that examines the objective details of the material in question, and an intrinsic test based on an “ordinary person’s subjective impressions of the similarities between the works.”[3] Because none of the evidence provided established substantial similarity between the protected expression of the copyrighted works and the outputs, the court ruled that the Plaintiffs failed to allege any direct infringement by users for which OpenAI could be held indirectly liable.[4] The court also dismissed claims alleging that OpenAI had violated the Digital Millennium Copyright Act (DMCA), an amendment to federal copyright law created to address the relationship between copyright and the internet.[5] The court reasoned that the Plaintiffs had failed to allege that OpenAI intentionally altered or removed any copyright management information (CMI), such as copyright notices, authors’ names, or works’ titles, from the data to conceal infringement.
Although the training process automatically removed this information, that alone was insufficient to support a claim under Section 1202(b)(3) of the DMCA.[6] The section makes it unlawful for a person, without the authorization of the copyright owner or the law, to distribute or import for distribution works or copies of works while knowing that CMI has been removed or altered without proper authorization and that this will enable infringement.[7] OpenAI was cleared of this claim because the Plaintiffs could not show that the company knew the removal would enable infringement or that it intended to infringe. For the most part, this is how other cases have proceeded, with plaintiffs unable to prove infringement despite the blatant use of copyrighted materials by various AI companies.
The Facts
Recently, a new case emerged that may offer hope for authors, researchers, and traditional publishers. In Thomson Reuters v. Ross Intelligence, Ross Intelligence (Ross), an up-and-coming competitor to Westlaw, failed to acquire a license from Westlaw to use its legal material and instead hired a third party to create memos containing legal questions and answers for Ross’s AI to train on.[8] Thomson Reuters (Reuters) alleged that this third party, LegalEase Solutions (LegalEase), used Westlaw’s copyrighted headnotes and Key Number System in completing the task. Reuters also claimed that the 25,000 questions used for training were essentially Westlaw headnotes. Ross conceded that the headnotes had “influenced” the questions but maintained that its lawyers had the final hand in drafting them.[9] The case was decided in a Delaware District Court, but similar cases of alleged copyright infringement by AI companies have occurred and are ongoing in multiple other states, including California and New York.
The Case
In Thomson Reuters v. Ross Intelligence, the main defense Ross employed was fair use. Ross claimed that because Westlaw’s copyright covered a large number of different headnotes and key numbers, copying only a small portion would not be enough for infringement. Ross further stated that the headnotes and Key Number System had played a role, but that its lawyers had not directly copied them, only drafted questions informed by them. Reuters, on the other hand, argued that all 25,000 question-and-answer sets were copies and moved for summary judgment on 2,830 of them, claiming that Ross’s expert had admitted LegalEase copied those sets and that the copying was therefore undisputed. Summary judgment is a ruling a party asks the court to make when the material facts are undisputed and the law can be applied to them without a trial. In this case, the parties filed five summary judgment motions between them: Reuters moved for summary judgment on its claims that Ross had tortiously (wrongfully) interfered with LegalEase’s contract and had infringed its copyrights, Ross counter-moved on its preemption defense to the interference claim, and both sides moved for summary judgment on Ross’s fair-use defense.
A copyright-infringement claim has three elements: ownership of a valid copyright, actual copying, and substantial similarity.[10] All three elements were disputed in the case, with the second being the only legal dispute. Judge Stephanos Bibas could not decide the first element on summary judgment due to a factual dispute. He decided the second in favor of Reuters and held that a jury must decide the third, given the multifaceted nature of comparing such a large number of different headnotes. The court denied the remaining cross-motions for summary judgment, holding that it could not rule as a matter of law that the legal research startup’s alleged use of Westlaw headnotes to train a competing artificial intelligence platform constituted fair use.[11] Although the fair-use question was not fully resolved, the ruling still represented a significant development for publishers. The underlying issues of copyright infringement and fair use will not be confined to this case. Other cases brought by authors are still ongoing and may result in different outcomes, and only time will tell whether or how lawmakers will settle these issues.
Conclusion
AI and copyright law often struggle to coexist in the same space without conflict arising. There is potential for numerous future cases involving infringement by AI companies, and many such cases are already pending, with other fields such as visual art and music similarly affected. This case does not necessarily prevent AI companies from committing copyright infringement, but it should serve as a warning and a possible precedent for future cases. Those future cases could drastically change the future of all forms of art, affecting the livelihoods of authors, traditional publishers, illustrators, editors, artists, and more, for better or worse. Conversely, they will also affect AI companies, whose models may gain or lose limits on the material they can be trained on, resulting in decreased or increased profits.
Sources
“Paul Tremblay, et al., Plaintiffs, v. OpenAI, Inc., et al., Defendants.” 2024. FindLaw.
Dunning, Angela, Arminda Bepko, and Jessica Graham. 2024. “Court Dismisses Most Claims in Authors’ Lawsuit Against OpenAI.” Cleary AI and Technology Insights.
Funky Films, Inc. v. Time Warner Entertainment Co., 462 F.3d 1072 (9th Cir. 2006).
Ibid.
“The Digital Millennium Copyright Act.” n.d. U.S. Copyright Office.
Ibid.
“17 U.S. Code § 1202 - Integrity of Copyright Management Information.” n.d. Legal Information Institute, Cornell Law School.
Delman, Edward. 2023. “Thomson Reuters Enterprise Centre GmbH v. Ross Intelligence Inc.”
Ing, Vanessa K. 2023. “Thomson Reuters v. Ross Intelligence: AI Copyright Law and Fair Use on Trial.”
“Thomson Reuters Enterprise Centre GmbH et al v. ROSS Intelligence Inc., No. 1:2020cv00613 - Document 547 (D. Del. 2023).” 2023. Justia Law.
Delman, Edward. 2023. “Thomson Reuters Enterprise Centre GmbH v. Ross Intelligence Inc.”