The Legal Status of AI Training Data

Generative AI systems are trained on vast corpora that often include copyrighted books, articles, images, and code. Courts and commentators are now grappling with whether copying works into training datasets—and retaining intermediate copies—implicates the reproduction right, whether training is fair use, and how market harm should be measured when outputs may substitute for licensed content.

Copyright · Federal · Current · Last checked 2026-06-08

Issue overview

Core legal questions

Does copying copyrighted works to create or use training datasets infringe the reproduction right under 17 U.S.C. § 106?
Can AI training qualify as fair use under 17 U.S.C. § 107, and how does transformative use analysis apply?
Does the source of training data—licensed, scraped, or allegedly pirated—affect liability or fair use?
How should courts assess market harm when AI models may compete with licensing markets for creative works?
Do model weights, memorization, or output substitution change the copyright analysis?

Aidicia is an educational legal research portfolio. It does not provide legal advice, create a lawyer-client relationship, or replace advice from a licensed attorney.
