Educational AI-law intelligence · no legal advice · source-grounded research

The Legal Status of AI Training Data

Generative AI systems are trained on vast corpora that often include copyrighted books, articles, images, and code. Courts and commentators are now grappling with whether copying works into training datasets—and retaining intermediate copies—implicates the reproduction right, whether training is fair use, and how market harm should be measured when outputs may substitute for licensed content.

Copyright · Federal · Current · Last checked 2026-06-08

Core legal questions

  • Does copying copyrighted works to create or use training datasets infringe the reproduction right under 17 U.S.C. § 106?
  • Can AI training qualify as fair use under 17 U.S.C. § 107, and how does transformative use analysis apply?
  • Does the source of training data—licensed, scraped, or allegedly pirated—affect liability or fair use?
  • How should courts assess market harm when AI models may compete with licensing markets for creative works?
  • Do model weights, memorization, or output substitution change the copyright analysis?

Aidicia is an educational legal research portfolio. It does not provide legal advice, create a lawyer-client relationship, or replace advice from a licensed attorney.