The Legal Status of AI Training Data
Issue overview
Generative AI systems are trained on vast corpora that often include copyrighted books, articles, images, and code. Courts and commentators are now grappling with whether copying works into training datasets—and retaining intermediate copies—implicates the reproduction right, whether training is fair use, and how market harm should be measured when outputs may substitute for licensed content.
Core legal questions
- Does copying copyrighted works to create or use training datasets infringe the reproduction right under 17 U.S.C. § 106(1)?
- Can AI training qualify as fair use under 17 U.S.C. § 107, and how does transformative use analysis apply?
- Does the source of training data—licensed, scraped, or allegedly pirated—affect liability or fair use?
- How should courts assess market harm when AI models may compete with licensing markets for creative works?
- Do model weights, memorization, or output substitution change the copyright analysis?