The New Copyright Problem Created by AI Training

Generative AI did not invent copyright disputes, but it did create a new scale of copying problem: models learn from millions of works, yet the law still asks whether that learning is copying, fair use, or market substitution.

The Legal Status of AI Training Data

Copyright · Federal · Current · Updated 2026-06-08 · 18 min read

Parent issue: The Legal Status of AI Training Data

Introduction: Why AI Training Created a New Copyright Problem

Generative AI systems require enormous amounts of text, images, and code. Much of that material is protected by copyright. The resulting litigation is not merely about file-sharing or pirated downloads; it is about whether the training process itself, and the artifacts it produces, fit within existing exclusive rights and limitations.

This article maps the core copyright problem: courts are using the fair use doctrine to decide whether AI training is a transformative use or a violation of the copyright owner's exclusive rights.

How Generative AI Training Uses Copyrighted Works

Training begins with data acquisition. Providers assemble corpora from web scrapes, licensed datasets, user uploads, and, according to several complaints, shadow libraries. Works are tokenized, embedded, and used to update model weights.

The process rarely distributes full copies to end users, but it typically requires creating internal copies at scale. That technical pipeline is what triggers copyright analysis.
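
The distinction between internal copies and learned parameters can be illustrated with a deliberately tiny sketch. It is an assumption-laden toy, not how any production model is trained: the function names, the two-sentence corpus, and the use of bigram counts as stand-in "weights" are all hypothetical. It only shows that a pipeline of this shape holds a full tokenized copy of each work while the statistics it produces are aggregated across works.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Split a work into lowercase word tokens (a stand-in for a real tokenizer)."""
    return text.lower().split()


def train_bigram_counts(corpus: dict[str, str]) -> tuple[dict[str, list[str]], Counter]:
    """Return the intermediate token copies and the aggregate 'weights'.

    Hypothetical toy: real systems use learned numeric parameters,
    not word-pair counts.
    """
    token_copies: dict[str, list[str]] = {}  # internal copy of each work
    weights: Counter = Counter()             # statistics pooled across all works
    for work_id, text in corpus.items():
        tokens = tokenize(text)
        token_copies[work_id] = tokens           # the work is fixed here in full
        weights.update(zip(tokens, tokens[1:]))  # only adjacent-pair counts are kept
    return token_copies, weights


# Hypothetical two-work corpus used only for illustration.
corpus = {
    "work_a": "the cat sat on the mat",
    "work_b": "the cat ate the fish",
}
token_copies, weights = train_bigram_counts(corpus)

print(token_copies["work_a"])    # the full work, recoverable token by token
print(weights[("the", "cat")])   # a count that blends both works together
```

In this sketch the `token_copies` dictionary is the kind of internal copy the reproduction-right arguments focus on, while `weights` is the kind of aggregate artifact defendants describe as statistical. Whether either characterization holds for a real model is a factual and legal question the toy does not answer.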

Why Training Implicates the Reproduction Right

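Section 106(1) of the Copyright Act, 17 U.S.C. § 106(1), grants copyright owners the exclusive right to reproduce the work in copies. Plaintiffs argue that every ingestion step that fixes a work in a server or dataset copy infringes that right, including copies retained longer than necessary for transient processing.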

Defendants respond that weights are not copies of expressive works and that any copies are instrumental to a non-expressive statistical process. The law has not fully resolved which copies count and when.

Why This Is Different From Ordinary Piracy

Classic piracy involves distributing substitutes for the original work. AI training litigation often separates three stages: acquisition of data, training on copies, and deployment of outputs.

A defendant may argue training is lawful even if some outputs are not, or vice versa. That staging is why AI copyright feels structurally different from file-sharing cases, even when shadow libraries are involved.

Fair Use as the Central Legal Battleground

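Fair use under 17 U.S.C. § 107 is the primary defense. Courts weigh four factors: the purpose and character of the use, the nature of the copyrighted work, the amount and substantiality of the portion used, and the effect of the use on the potential market for the work. Early cases such as Thomson Reuters v. ROSS suggest that commercial substitution can defeat fair use even when copying is intermediate.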

LLM defendants emphasize transformative purpose: models learn patterns, not expression. Rights holders emphasize scale, expressiveness of outputs, and harm to licensing markets.

The Learning vs. Copying Problem

Copyright protects expression, not ideas. Machine learning blurs the line because training consumes expression to build a functional system. Commentators and courts struggle with whether the process extracts unprotectable facts or retains protectable expression in weights or outputs.

Memorization cases, where models regurgitate training text, push the analysis toward copying. Where outputs are novel, defendants have a stronger transformative narrative.

Why Dataset Source Matters

Not all training data is equivalent in litigation. Licensed corpora, open-licensed code, and public-domain works raise different risks than datasets built from infringing libraries.

Allegations that a dataset was built from pirated sources affect willfulness, damages, and the equitable tone of fair use. Source transparency is becoming a practical compliance issue separate from pure doctrinal analysis.

The Market Harm and Licensing Problem

The fourth fair use factor asks about harm to actual or potential markets. News publishers, image licensors, and software vendors argue that AI products substitute for their works and undercut emerging AI licensing programs.

If courts recognize a robust licensing market for training uses, defendants must show why unpaid copying should still be permitted at commercial scale.

How Courts Are Starting to Draw Lines

Pending cases—including Bartz, Kadrey, NYT v. OpenAI, and Getty—may diverge on early motions, settlements, or fact-specific trials. Thomson Reuters v. ROSS already illustrates a skeptical approach when the defendant markets a competing product trained on expressive material.

Expect courts to separate acquisition, training, and output infringement rather than treating "AI" as a single legal event.

Conclusion: Why AI Copyright Law Will Turn on Acquisition, Training, Outputs, and Markets

The new copyright problem is layered. Acquisition asks whether data was licensed or pirated. Training asks whether the copies made serve a transformative learning purpose. Outputs ask whether the deployed model substitutes for protected works in their markets. Markets ask whether fair use should yield to licensing.

Aidicia tracks this issue through linked cases, statutes, and definitions so researchers can move from doctrine to docket without treating AI copyright as a single yes-or-no question.

Aidicia is an educational legal research portfolio. It does not provide legal advice, create a lawyer-client relationship, or replace advice from a licensed attorney.