Copyright, Carbon, and Consent: The Ethical Price of Training on Trillions of Tokens

The Hard Truth
What if the most expensive thing about building a language model is not the compute, not the engineers, not the data centers — but the things we decided not to ask permission for?
Every large language model begins the same way: by consuming the written record of human thought. Books, articles, code, conversations, research papers, personal essays. The process is called pre-training, and it is simultaneously the most technically impressive and the most ethically unexamined phase of modern AI development. The question is not whether this works. It clearly does. The question is what it costs — and who pays.
The Harvest Nobody Agreed To
The first cost is consent — or more precisely, its absence. By the end of 2025, more than 70 copyright lawsuits had been filed against AI companies, doubling from roughly 30 the year before (Copyright Alliance). The Bartz v. Anthropic case settled for $1.5 billion, covering nearly half a million titles at roughly $3,000 per work. The court drew a line that satisfied almost no one: training on copyrighted material could qualify as fair use, but storing pirated copies did not (Copyright Alliance). Meanwhile, the NYT v. OpenAI case remains open — a court ordered the production of over 20 million ChatGPT logs in January 2026, with summary judgment set for April 2026 (Norton Rose Fulbright).
These are not edge cases. They are the legal system catching up to a decision that was already made years ago: the world’s creative output was treated as a free input, and the burden of objection fell on the people whose work was taken.
The EU’s Copyright Directive offers a formal opt-out — rightsholders can reserve their rights, and general-purpose AI providers must exclude reserved content before training (IAPP). But the mechanism exposes its own limits. Perplexity AI was caught evading robots.txt restrictions, and the Berkeley Tech Law Journal calls the entire opt-out framework a “false compromise” — because once trained, data cannot be unlearned. Is it ethical to pre-train AI models on copyrighted data scraped without author consent? The lawsuits pose this in legal language. The harder version of the question has no courtroom: what does it mean for a society to treat extraction as the default relationship between creators and the systems that learn from them?
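For readers unfamiliar with the mechanism being evaded, the opt-out signal at issue is machine-readable and trivial to honor. Below is a minimal sketch, using Python's standard library, of the check a compliant crawler would perform before fetching a page; the crawler name and URLs are illustrative placeholders, not any company's actual setup.

```python
# Minimal sketch: honoring a site's robots.txt before scraping.
# The crawler name and URLs are hypothetical placeholders.
from urllib import robotparser

rules = robotparser.RobotFileParser()
rules.set_url("https://example.com/robots.txt")
rules.read()  # fetch and parse the site's published crawl rules

crawler_agent = "ExampleTrainingBot"
target_page = "https://example.com/essays/personal-essay.html"

if rules.can_fetch(crawler_agent, target_page):
    print("robots.txt permits crawling this page")
else:
    print("the site has reserved this content; a compliant crawler stops here")
```

The simplicity is the point: nothing in the protocol is hard to respect. It is also nothing more than a request; robots.txt carries no technical enforcement, which is why evading it is possible in the first place.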
The Case for Extraction
Fair use advocates make a case worth hearing on its own terms. Copyright law has always permitted transformative use, and the Kadrey v. Meta ruling in June 2025 held that LLM training qualifies as fair use “regardless of whether underlying materials were obtained from legitimate sources” (Norton Rose Fulbright). This is a significant ruling — though as of March 2026, no appellate court has weighed in on AI training fair use, and the precedent remains US-only.
The market is also moving. Disney signed a three-year deal with OpenAI for over 200 characters in Sora, investing $1 billion — a signal that licensing, not litigation, may define the next phase (Norton Rose Fulbright). The music industry reached settlements along similar lines: UMG and Warner Music Group both settled with AI music companies, establishing licensing terms and artist opt-in provisions (Copyright Alliance).
The argument from this vantage point is familiar: innovation requires friction with existing frameworks. Scaling Laws reward larger datasets, the resulting technology benefits everyone, and the system will self-correct through markets and legal clarity. This has happened before.
That last sentence is the one worth pausing on.
The Assumption Inside the Argument
Every defense of the current model — fair use, market correction, innovation imperative — shares a single hidden premise: that extraction is the natural starting point, and consent is a refinement to be negotiated afterward.
This premise is so embedded in the infrastructure that it barely registers as a choice. Pre-training pipelines apply Data Deduplication to remove redundant text and self-supervised objectives such as Masked Language Modeling or next-token prediction to learn language structure from billions of documents. The engineering frameworks that enable this — Megatron-LM and DeepSpeed among them — are optimized for throughput, not provenance. Nobody built a consent layer into the training stack because consent was never treated as an engineering requirement. It was treated as a legal afterthought.
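To make the absent layer concrete, here is a hedged sketch of what a consent gate ahead of deduplication and tokenization could look like. This is not how any major pipeline works today; the document fields and the opt-out registry are hypothetical stand-ins, and a queryable registry of reserved works does not exist at web scale.

```python
# Hypothetical sketch: a provenance/consent gate applied before the usual
# deduplication and tokenization steps. No mainstream pre-training
# framework ships anything like this today.
from dataclasses import dataclass
from typing import Iterable, Iterator

@dataclass
class Document:
    doc_id: str   # hypothetical identifier, e.g. a URL or content hash
    source: str   # domain or publisher the text was collected from
    text: str

# Stand-in for an opt-out registry of reserved works. In practice this would
# need to be a maintained, queryable index, which is exactly the
# infrastructure that was never built.
OPTED_OUT_SOURCES = {"example-publisher.com", "example-author.net"}

def consent_gate(docs: Iterable[Document]) -> Iterator[Document]:
    """Yield only documents whose sources have not reserved their rights."""
    for doc in docs:
        if doc.source in OPTED_OUT_SOURCES:
            continue  # excluded before training, per an EU-style opt-out
        yield doc

corpus = [
    Document("a1", "example-publisher.com", "A reserved novel excerpt ..."),
    Document("b2", "public-domain.org", "An 1890 essay, long out of copyright ..."),
]
kept = list(consent_gate(corpus))
print(f"{len(kept)} of {len(corpus)} documents pass the consent gate")
```

The sketch is a few lines of logic. The hard part was never the code; it was deciding that provenance belonged in the pipeline at all.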
The environmental cost follows the same logic of extraction without accounting. Training GPT-3 consumed 1,287 MWh of electricity and produced roughly 552 tonnes of CO2 equivalent (MIT News). Projected emissions for GPT-4 — based on researcher estimates, not official OpenAI disclosure — reach approximately 21,660 tonnes of CO2 equivalent (Scientific Reports). By 2026, global data center electricity consumption is projected to hit roughly 1,050 TWh, enough to rank fifth among nations, between Japan and Russia (MIT News). Training at this scale also evaporates massive quantities of freshwater — a cost that varies by geography and cooling method but is consistently externalized.
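Those figures also imply a number the companies rarely state outright: the carbon intensity of the grid the training ran on. A back-of-the-envelope check, taking the cited GPT-3 figures at face value:

```python
# Back-of-the-envelope check on the cited GPT-3 training figures.
energy_mwh = 1_287        # reported training energy, MWh
emissions_tco2e = 552     # reported emissions, tonnes CO2e

implied_intensity = emissions_tco2e / energy_mwh  # tonnes CO2e per MWh
print(f"Implied grid intensity: {implied_intensity:.2f} tCO2e/MWh")
# Roughly 0.43 tCO2e/MWh, about 430 g CO2e per kWh: consistent with a
# fossil-heavy grid mix. The same training run on a low-carbon grid would
# emit several times less, which is one reason disclosure of where and how
# models are trained matters as much as the headline energy number.
```

The arithmetic is trivial; the disclosure is what is missing.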
What is the environmental and energy cost of pre-training large language models at scale? The honest answer is that we do not fully know, because the companies training these models are not required to disclose the numbers. But the pattern is visible: environmental resources, like creative resources, are treated as unpriced inputs to private value.
The Mill That Came Before the Regulation
There is a pattern here, and it predates software by centuries. When textile mills first industrialized, they drew water from rivers without compensation, consumed labor without negotiation, and treated both as natural resources available for extraction. Regulation came later — decades later, after the structural damage was done. The same sequence repeated with fossil fuels, with chemical agriculture, with personal data. The rhythm is always the same: extract, accumulate wealth, then negotiate terms only when the cost of not negotiating exceeds the cost of compliance.
Pre-training follows this rhythm with uncomfortable precision. Creative work and environmental capacity are consumed at industrial scale. The value concentrates with the companies that train the models. The costs distribute across creators who were never consulted and ecosystems that cannot advocate for themselves. Fine Tuning and RLHF refine the model after pre-training, but the foundational extraction — the phase that defines what the model knows and how it reasons — happens before any alignment conversation begins.
The difference this time is speed. Previous extraction cycles unfolded over decades. Pre-training compressed the cycle into years. The models were trained, the value was captured, and the legal system is still drafting its first response.
The Invisible Subsidy
Here is the thesis, stated plainly: pre-training depends on two unpriced subsidies — creative labor extracted without consent and environmental resources consumed without accountability. This is not an argument that AI should not exist. It is a demand that we stop pretending the inputs are free.
The emerging licensing deals suggest that markets can partially resolve the consent problem — but licensing works for entities with bargaining power. Studios, labels, major publishers. It does not work for the independent blogger, the academic whose dissertation trained a model she will never use, or the open-source contributor whose code became someone else’s product. The consent gap is not closing evenly. It is closing for those who can afford to be at the table.
The environmental cost is harder still to address through markets, because the atmosphere does not retain counsel and the water table does not file briefs.
Questions That Belong to Us
If consent had been a prerequisite rather than an afterthought, would the training pipeline look different? Would datasets be smaller, better curated, more accountable — or would the entire arc of large-scale pre-training have taken another shape?
These are not hypothetical exercises. They are design decisions that were framed as inevitability. The choice to scrape first and negotiate later was not a law of nature — it was a business decision. The choice to externalize environmental costs was not physics — it was accounting.
The most uncomfortable realization is not that mistakes were made. It is that the system worked exactly as designed — and the people who designed it are not the ones absorbing the cost.
Where This Argument Is Weakest
Intellectual honesty requires naming the vulnerability. If licensing frameworks scale beyond the major studios — if they reach independent creators, if they include meaningful compensation — then the consent problem may be resolved by markets without requiring a moral reckoning. If renewable energy and more efficient architectures make training carbon-neutral within a generation, the environmental argument loses force. And if smaller, more capable models reduce the resource appetite of pre-training altogether, the frame of this essay may age into irrelevance.
The counterargument is that none of those conditions exist today, and building policy on the hope that they will is precisely the logic that delayed climate regulation for three decades.
The Question That Remains
Pre-training built the foundation of every major language model by consuming creative work and environmental resources at a pace that outran consent, regulation, and honest accounting. The technology is extraordinary. The question is whether we have the will to demand that extraordinary capability rest on something other than extraction — or whether speed will always be the permission we grant ourselves for taking what was never offered.
Disclaimer
This article is for educational purposes only and does not constitute professional advice. Consult qualified professionals for decisions in your specific situation.