Copyright, Carbon, and Consent: The Ethical Price of Training on Trillions of Tokens

The Hard Truth
What if the most expensive thing about building a language model is not the compute, not the engineers, not the data centers — but the things we decided not to ask permission for?
Every large language model begins the same way: by consuming the written record of human thought. Books, articles, code, conversations, research papers, personal essays. The process is called pre-training, and it is simultaneously the most technically impressive and the most ethically unexamined phase of modern AI development. The question is not whether this works. It clearly does. The question is what it costs — and who pays.
The Harvest Nobody Agreed To
The first cost is consent — or more precisely, its absence. By the end of 2025, more than 70 copyright lawsuits had been filed against AI companies, doubling from roughly 30 the year before (Copyright Alliance). The Bartz v. Anthropic case settled for $1.5 billion, covering nearly half a million titles at roughly $3,000 per work. The court drew a line that satisfied almost no one: training on copyrighted material could qualify as fair use, but storing pirated copies did not (Copyright Alliance). Meanwhile, the NYT v. OpenAI case remains open — a court ordered the production of over 20 million ChatGPT logs in January 2026, with summary judgment set for April 2026 (Norton Rose Fulbright).
These are not edge cases. They are the legal system catching up to a decision that was already made years ago: the world’s creative output was treated as a free input, and the burden of objection fell on the people whose work was taken.
The EU’s Copyright Directive offers a formal opt-out — rightsholders can reserve their rights, and general-purpose AI providers must exclude reserved content before training (IAPP). But the mechanism exposes its own limits. Perplexity AI was caught evading robots.txt restrictions, and the Berkeley Tech Law Journal calls the entire opt-out framework a “false compromise” — because once trained, data cannot be unlearned. Is it ethical to pre-train AI models on copyrighted data scraped without author consent? The lawsuits pose this in legal language. The harder version of the question has no courtroom: what does it mean for a society to treat extraction as the default relationship between creators and the systems that learn from them?
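For readers unfamiliar with the mechanism being evaded, the opt-out signal at issue is machine-readable and trivial to honor. Below is a minimal sketch, using Python's standard library, of the check a compliant crawler would perform before fetching a page; the crawler name and URLs are illustrative placeholders, not any company's actual setup.

```python
# Minimal sketch: honoring a site's robots.txt before scraping.
# The crawler name and URLs are hypothetical placeholders.
from urllib import robotparser

rules = robotparser.RobotFileParser()
rules.set_url("https://example.com/robots.txt")
rules.read()  # fetch and parse the site's published crawl rules

crawler_agent = "ExampleTrainingBot"
target_page = "https://example.com/essays/personal-essay.html"

if rules.can_fetch(crawler_agent, target_page):
    print("robots.txt permits crawling this page")
else:
    print("the site has reserved this content; a compliant crawler stops here")
```

The simplicity is the point: nothing in the protocol is hard to respect. It is also nothing more than a request; robots.txt carries no technical enforcement, which is why evading it is possible in the first place.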
The Case for Extraction
Fair use advocates make a case worth hearing on its own terms. Copyright law has always permitted transformative use, and the Kadrey v. Meta ruling in June 2025 held that LLM training qualifies as fair use “regardless of whether underlying materials were obtained from legitimate sources” (Norton Rose Fulbright). This is a significant ruling — though as of March 2026, no appellate court has weighed in on AI training fair use, and the precedent remains US-only.
The market is also moving. Disney signed a three-year deal with OpenAI for over 200 characters in Sora, investing $1 billion — a signal that licensing, not litigation, may define the next phase (Norton Rose Fulbright). The music industry reached settlements along similar lines: UMG and Warner Music Group both settled with AI music companies, establishing licensing terms and artist opt-in provisions (Copyright Alliance).
The argument from this vantage point is familiar: innovation requires friction with existing frameworks. Scaling Laws reward larger datasets, the resulting technology benefits everyone, and the system will self-correct through markets and legal clarity. This has happened before.
That last sentence is the one worth pausing on.
The Assumption Inside the Argument
Every defense of the current model — fair use, market correction, innovation imperative — shares a single hidden premise: that extraction is the natural starting point, and consent is a refinement to be negotiated afterward.
This premise is so embedded in the infrastructure that it barely registers as a choice. Pre-training pipelines apply Data Deduplication to remove redundant text and self-supervised objectives such as Masked Language Modeling or next-token prediction to learn language structure from billions of documents. The engineering frameworks that enable this — Megatron-LM and DeepSpeed among them — are optimized for throughput, not provenance. Nobody built a consent layer into the training stack because consent was never treated as an engineering requirement. It was treated as a legal afterthought.
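To make the absent layer concrete, here is a hedged sketch of what a consent gate ahead of deduplication and tokenization could look like. This is not how any major pipeline works today; the document fields and the opt-out registry are hypothetical stand-ins, and a queryable registry of reserved works does not exist at web scale.

```python
# Hypothetical sketch: a provenance/consent gate applied before the usual
# deduplication and tokenization steps. No mainstream pre-training
# framework ships anything like this today.
from dataclasses import dataclass
from typing import Iterable, Iterator

@dataclass
class Document:
    doc_id: str   # hypothetical identifier, e.g. a URL or content hash
    source: str   # domain or publisher the text was collected from
    text: str

# Stand-in for an opt-out registry of reserved works. In practice this would
# need to be a maintained, queryable index, which is exactly the
# infrastructure that was never built.
OPTED_OUT_SOURCES = {"example-publisher.com", "example-author.net"}

def consent_gate(docs: Iterable[Document]) -> Iterator[Document]:
    """Yield only documents whose sources have not reserved their rights."""
    for doc in docs:
        if doc.source in OPTED_OUT_SOURCES:
            continue  # excluded before training, per an EU-style opt-out
        yield doc

corpus = [
    Document("a1", "example-publisher.com", "A reserved novel excerpt ..."),
    Document("b2", "public-domain.org", "An 1890 essay, long out of copyright ..."),
]
kept = list(consent_gate(corpus))
print(f"{len(kept)} of {len(corpus)} documents pass the consent gate")
```

The sketch is a few lines of logic. The hard part was never the code; it was deciding that provenance belonged in the pipeline at all.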
The environmental cost follows the same logic of extraction without accounting. Training GPT-3 consumed 1,287 MWh of electricity and produced roughly 552 tonnes of CO2 equivalent (MIT News). Projected emissions for GPT-4 — based on researcher estimates, not official OpenAI disclosure — reach approximately 21,660 tonnes of CO2 equivalent (Scientific Reports). By 2026, global data center electricity consumption is projected to hit roughly 1,050 TWh, enough to rank fifth among nations, between Japan and Russia (MIT News). Training at this scale also evaporates massive quantities of freshwater — a cost that varies by geography and cooling method but is consistently externalized.
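Those figures also imply a number the companies rarely state outright: the carbon intensity of the grid the training ran on. A back-of-the-envelope check, taking the cited GPT-3 figures at face value:

```python
# Back-of-the-envelope check on the cited GPT-3 training figures.
energy_mwh = 1_287        # reported training energy, MWh
emissions_tco2e = 552     # reported emissions, tonnes CO2e

implied_intensity = emissions_tco2e / energy_mwh  # tonnes CO2e per MWh
print(f"Implied grid intensity: {implied_intensity:.2f} tCO2e/MWh")
# Roughly 0.43 tCO2e/MWh, about 430 g CO2e per kWh: consistent with a
# fossil-heavy grid mix. The same training run on a low-carbon grid would
# emit several times less, which is one reason disclosure of where and how
# models are trained matters as much as the headline energy number.
```

The arithmetic is trivial; the disclosure is what is missing.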
What is the environmental and energy cost of pre-training large language models at scale? The honest answer is that we do not fully know, because the companies training these models are not required to disclose the numbers. But the pattern is visible: environmental resources, like creative resources, are treated as unpriced inputs to private value.
The Mill That Came Before the Regulation
There is a pattern here, and it predates software by centuries. When textile mills first industrialized, they drew water from rivers without compensation, consumed labor without negotiation, and treated both as natural resources available for extraction. Regulation came later — decades later, after the structural damage was done. The same sequence repeated with fossil fuels, with chemical agriculture, with personal data. The rhythm is always the same: extract, accumulate wealth, then negotiate terms only when the cost of not negotiating exceeds the cost of compliance.
Pre-training follows this rhythm with uncomfortable precision. Creative work and environmental capacity are consumed at industrial scale. The value concentrates with the companies that train the models. The costs distribute across creators who were never consulted and ecosystems that cannot advocate for themselves. Fine Tuning and RLHF refine the model after pre-training, but the foundational extraction — the phase that defines what the model knows and how it reasons — happens before any alignment conversation begins.
The difference this time is speed. Previous extraction cycles unfolded over decades. Pre-training compressed the cycle into years. The models were trained, the value was captured, and the legal system is still drafting its first response.
The Invisible Subsidy
Here is the thesis, stated plainly: pre-training depends on two unpriced subsidies — creative labor extracted without consent and environmental resources consumed without accountability. This is not an argument that AI should not exist. It is a demand that we stop pretending the inputs are free.
The emerging licensing deals suggest that markets can partially resolve the consent problem — but licensing works for entities with bargaining power. Studios, labels, major publishers. It does not work for the independent blogger, the academic whose dissertation trained a model she will never use, or the open-source contributor whose code became someone else’s product. The consent gap is not closing evenly. It is closing for those who can afford to be at the table.
The environmental cost is harder still to address through markets, because the atmosphere does not retain counsel and the water table does not file briefs.
Questions That Belong to Us
If consent had been a prerequisite rather than an afterthought, would the training pipeline look different? Would datasets be smaller, better curated, more accountable — or would the entire arc of large-scale pre-training have taken another shape?
These are not hypothetical exercises. They are design decisions that were framed as inevitability. The choice to scrape first and negotiate later was not a law of nature — it was a business decision. The choice to externalize environmental costs was not physics — it was accounting.
The most uncomfortable realization is not that mistakes were made. It is that the system worked exactly as designed — and the people who designed it are not the ones absorbing the cost.
Where This Argument Is Weakest
Intellectual honesty requires naming the vulnerability. If licensing frameworks scale beyond the major studios — if they reach independent creators, if they include meaningful compensation — then the consent problem may be resolved by markets without requiring a moral reckoning. If renewable energy and more efficient architectures make training carbon-neutral within a generation, the environmental argument loses force. And if smaller, more capable models reduce the resource appetite of pre-training altogether, the frame of this essay may age into irrelevance.
The counterargument is that none of those conditions exist today, and building policy on the hope that they will is precisely the logic that delayed climate regulation for three decades.
The Question That Remains
Pre-training built the foundation of every major language model by consuming creative work and environmental resources at a pace that outran consent, regulation, and honest accounting. The technology is extraordinary. The question is whether we have the will to demand that extraordinary capability rest on something other than extraction — or whether speed will always be the permission we grant ourselves for taking what was never offered.
Disclaimer
This article is for educational purposes only and does not constitute professional advice. Consult qualified professionals for decisions in your specific situation.