Does Copyright Apply to AI Training? – Summary of the U.S. Copyright Office Report
As someone who frequently uses AI tools to summarize materials and organize complex legal content, I’ve often asked myself a question:
“If I didn’t train this AI myself, do I really have the right to use the result freely?”
In recent months, this question has moved from theory to courtroom reality. Several major copyright lawsuits have been filed in the U.S. involving the use of copyrighted materials in generative AI training. Two particularly significant cases include:
The New York Times vs. OpenAI & Microsoft
The Times alleges that ChatGPT reproduced its articles nearly verbatim, raising serious questions about whether AI models trained on journalistic content are infringing copyright.
Sarah Silverman et al. vs. Meta and OpenAI
This class action suit claims that published books were used in AI training datasets without permission, bringing attention to how text-based models acquire and use copyrighted material.
In response to these growing concerns, the U.S. Copyright Office released a pre-publication version of a report in May 2025 titled
“Copyright and Artificial Intelligence, Part 3: Generative AI Training”, which examines how copyright law applies to the use of protected content in AI model training.
Why Does Generative AI Raise Copyright Concerns?
Generative AI (GAI) systems produce human-like text, images, or music based on training data, which is often scraped in massive volumes from online sources.
The problem is that much of this data may be protected by copyright. Whether that use is lawful depends on key questions:
Was the data simply referenced, or was it copied and reused?
Does the AI output replace or replicate the original content?
The answers have legal consequences.
Is Fair Use a Valid Defense?
Under U.S. copyright law, unauthorized use of copyrighted material may be allowed if it qualifies as fair use, based on four key factors:
1. Purpose and Character of Use
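If the material is used for nonprofit education or research, fair use is more likely to apply.
This factor also asks whether the use is transformative, meaning it serves a new purpose rather than substituting for the original.
Commercial AI systems like ChatGPT or Claude often face higher scrutiny.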
2. Nature of the Copyrighted Work
Use of factual content (e.g., news, public data) weighs in favor of fair use.
Use of highly creative works (e.g., novels, music, art) weighs against it.
3. Amount and Substantiality
Copying entire works, or the most important parts of a work, weighs against fair use.
For AI training, how the data is selected and used matters.
4. Effect on the Market
If the AI output replaces the original or affects the creator’s income, fair use is unlikely.
Even style imitation may count as market harm.
The Licensing Debate: Voluntary vs. Compulsory
The report acknowledges increasing interest in voluntary licensing models, where AI developers obtain content through direct licensing or licensing platforms.
For example, Getty and Bria are actively building AI models on fully licensed datasets.
On the other hand, some groups have proposed compulsory licensing, a system in which the law permits use of content without the owner's individual permission in exchange for a set fee.
However, the Copyright Office recommends caution:
“There is no clear evidence of market failure that would justify government-mandated access to copyrighted content.”
Practical Takeaways for Companies and Developers
If your AI system trains on protected content, getting a license is the safest legal path.
If you rely on fair use, be prepared to justify it with evidence (e.g., transformative use, minimal impact on market).
SaaS and API-based platforms should update their Terms of Use to clearly state data usage scope and allow opt-outs if necessary.
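For developers, the takeaways above can be sketched as a simple pre-training filter. This is only an illustration under stated assumptions: the record fields (`licensed`, `opted_out`, `public_domain`) and the function name are hypothetical and do not come from the Copyright Office report, and a real pipeline would need legal review of what counts as a valid license or opt-out.

```python
from dataclasses import dataclass


# Hypothetical record shape; field names are illustrative assumptions,
# not terms defined in the Copyright Office report.
@dataclass(frozen=True)
class TrainingRecord:
    source_id: str
    licensed: bool               # a license covering training use is on file
    opted_out: bool              # the rights holder or user has opted out
    public_domain: bool = False  # no copyright protection applies


def partition_for_training(records):
    """Sort records into three buckets.

    cleared:      no opt-out, and either licensed or public domain
    excluded:     opt-out on file, never used for training
    needs_review: unlicensed protected content, held until a documented
                  fair use analysis (or a license) is in place
    """
    buckets = {"cleared": [], "excluded": [], "needs_review": []}
    for record in records:
        if record.opted_out:
            buckets["excluded"].append(record)
        elif record.licensed or record.public_domain:
            buckets["cleared"].append(record)
        else:
            buckets["needs_review"].append(record)
    return buckets


records = [
    TrainingRecord("news-001", licensed=True, opted_out=False),
    TrainingRecord("novel-042", licensed=False, opted_out=False),
    TrainingRecord("user-upload-7", licensed=True, opted_out=True),
    TrainingRecord("statute-3", licensed=False, opted_out=False, public_domain=True),
]
buckets = partition_for_training(records)
for name, items in buckets.items():
    print(name, [r.source_id for r in items])
```

The design choice here is to hold unlicensed material for review rather than include it by default, which mirrors the point that a fair use position should be backed by evidence before it is relied on.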
The law is still evolving, and this report serves as a key step in shaping future copyright guidance in the age of AI.
© LexSoy Legal LLC. All rights reserved.