Always-On AI: The Environmental Price and Access Inequality of Large-Scale Inference

The Hard Truth
Every AI prompt you send draws water from a river, electricity from a grid, and carbon into the atmosphere. The question is not whether this cost exists — it is whether anyone has decided it is yours to bear, or someone else’s to absorb.
There is a peculiar comfort in asking a machine a question and receiving an answer in under a second. The interaction feels weightless: pure information, arriving from nowhere. But inference is not weightless. It is an industrial process, running continuously on hardware that consumes electricity, generates heat, and requires water to cool. The convenience is real. The invisibility of its cost is engineered.
The Meter Nobody Reads
We have grown accustomed to treating AI as a service that simply exists, like weather. You ask, it answers. But behind every prompt is a chain of physical resources — accelerators drawing power, cooling systems cycling water, grids burning fuel to keep servers running through the night. Inference accounts for 60–90% of total AI energy use (MIT News), and global data center electricity consumption reached approximately 415 TWh in 2024, roughly 1.5% of global electricity (IEA). By 2030, that figure is projected to reach 945 TWh — about 3% of global supply — with AI’s share climbing from 5–15% today to 35–50% (IEA).
These numbers are abstract until you translate them into something a person can feel. A single median text prompt to Google’s Gemini consumes 0.24 Wh of energy and 0.26 mL of water — about five drops (Google Research). Scale that to 700 million queries a day, and the freshwater consumed equals the annual drinking needs of 1.2 million people (Luccioni et al.). The cost is not theoretical — it is hydrological. By 2030, AI’s projected water footprint could reach 731–1,125 million cubic meters per year, equivalent to household water use for 6–10 million Americans (Cornell/Nature Sustainability). The projected carbon footprint reaches 24–44 million metric tons of CO2 annually — the equivalent of 5–10 million cars on the road (Cornell/Nature Sustainability).
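To make the per-prompt figures concrete at scale, here is a back-of-envelope sketch. The per-prompt energy and water values are the Google Research medians cited above; the 700 million daily queries and 365-day annualization are illustrative assumptions, and the result is a floor, since per-prompt medians exclude training, idle capacity, and the broader facility overhead that aggregate estimates typically count.

```python
# Back-of-envelope scaling of per-prompt inference costs.
# Per-prompt figures are the cited Google Research medians;
# query volume and annualization are illustrative assumptions.

ENERGY_PER_PROMPT_WH = 0.24   # median Gemini text prompt (Wh)
WATER_PER_PROMPT_ML = 0.26    # median Gemini text prompt (mL)
QUERIES_PER_DAY = 700e6       # assumed global daily query volume
DAYS_PER_YEAR = 365

daily_energy_mwh = ENERGY_PER_PROMPT_WH * QUERIES_PER_DAY / 1e6   # Wh -> MWh
daily_water_liters = WATER_PER_PROMPT_ML * QUERIES_PER_DAY / 1e3  # mL -> L

annual_energy_gwh = daily_energy_mwh * DAYS_PER_YEAR / 1e3        # MWh -> GWh
annual_water_megaliters = daily_water_liters * DAYS_PER_YEAR / 1e6

print(f"Daily energy:  {daily_energy_mwh:,.0f} MWh")
print(f"Daily water:   {daily_water_liters:,.0f} L")
print(f"Annual energy: {annual_energy_gwh:,.0f} GWh")
print(f"Annual water:  {annual_water_megaliters:,.1f} ML")
```

On these assumptions: roughly 168 MWh and 182,000 liters per day, about 61 GWh and 66 million liters per year. A floor, not a ceiling, because the medians count only the marginal draw of each prompt and nothing around it.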
Who approved this allocation? Not the communities downstream from the data centers. Not the grids straining under new demand. Not the residents of regions where water scarcity was already a crisis before the nearest hyperscaler broke ground.
The Case for Optimism in Watts per Token
The counterargument is sophisticated, and it deserves to be heard at full strength. Quantization techniques compress models to run on less hardware. Frameworks like vLLM, TensorRT-LLM, and SGLang implement continuous batching and paged attention to maximize throughput per watt. Speculative decoding reduces per-token generation latency while keeping hardware utilization high. Custom silicon from companies like Groq promises roughly an order-of-magnitude improvement in energy efficiency per inference task compared to conventional GPUs, though independent benchmarks remain pending (Groq). Google Research reports a 44x reduction in total emissions per prompt between May 2024 and May 2025. Inference prices have fallen at a median rate of 50x per year, accelerating to roughly 200x per year after January 2024 (Epoch AI).
This narrative is not wrong. The efficiency gains are measurable and the price trajectory is genuine. If you follow the curve far enough, it looks like the problem solves itself — that the market, left to its own devices, will optimize away the environmental burden one chip redesign at a time.
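Much of that optimism reduces to a single fraction: watts divided by tokens. Here is a simplified sketch of why batching dominates that fraction; the power draw, decode speed, and scaling factor below are illustrative assumptions, not measurements of any particular accelerator.

```python
# Illustrative model of energy per token under batching.
# All figures are assumptions for the sketch, not measurements.

ACCELERATOR_POWER_W = 700.0    # assumed steady-state draw of one accelerator
TOKENS_PER_SEC_SINGLE = 60.0   # assumed decode speed at batch size 1

def energy_per_token_wh(batch_size: int, scaling: float = 0.9) -> float:
    """Energy per generated token, assuming near-linear throughput
    scaling with batch size (continuous batching keeps slots full)."""
    throughput = TOKENS_PER_SEC_SINGLE * batch_size * scaling
    return ACCELERATOR_POWER_W / throughput / 3600.0  # W*s -> Wh

for batch in (1, 8, 64):
    print(f"batch {batch:3d}: {energy_per_token_wh(batch) * 1000:.3f} mWh/token")
```

Idle accelerator-seconds cost the same watts as busy ones, which is why continuous batching, paged attention, and speculative decoding all attack the same denominator: tokens delivered per second of powered hardware.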
The Subsidy We Do Not Name
But falling prices do not mean falling costs. They mean someone else is absorbing the difference. OpenAI reportedly generated $3.7B in revenue against approximately $5B in operating losses in 2025, losing roughly $1.35 for every dollar it earned; that figure comes from third-party analysis (AI Automation Global), not official filings, and is best treated as approximate. OpenAI, Google, Anthropic, and Meta are all pricing inference below cost (AI Automation Global). This is not efficiency. It is subsidy, funded by venture capital and corporate treasuries, designed to capture market share before the bill comes due.
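The loss ratio is simple arithmetic, and worth writing out because it is easy to misstate; the inputs are the reported, approximate figures above.

```python
# Subsidy arithmetic from the reported (approximate) 2025 figures.
revenue_b = 3.7          # reported revenue, $B
operating_loss_b = 5.0   # reported operating loss, $B

expenses_b = revenue_b + operating_loss_b        # ~8.7
spent_per_dollar = expenses_b / revenue_b        # ~2.35 spent per $1 earned
lost_per_dollar = operating_loss_b / revenue_b   # ~1.35 lost per $1 earned

print(f"Spent per dollar earned: ${spent_per_dollar:.2f}")
print(f"Lost per dollar earned:  ${lost_per_dollar:.2f}")
```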
The environmental cost follows the same logic of deferral. The Cornell/Nature Sustainability roadmap demonstrates that smart geographic siting and efficiency improvements could achieve 73% carbon reduction and 86% water reduction, but only if companies choose to invest in mitigation rather than raw expansion. The current trajectory is expansion. Despite corporate renewable energy pledges, only an estimated 40–60% of AI workloads run on renewables (MIT Technology Review), which means the remainder does not. The gap between the pledge and the grid is where the externality lives.
We have seen this pattern before. Fossil fuel subsidies made energy cheap for a century by deferring the climate cost to a later generation. The structural similarity is uncomfortable: make the resource artificially cheap today, socialize the environmental debt across communities and ecosystems that never consented to the bargain, and call it progress. The question is whether inference subsidies are a new chapter of the same story — or whether this time the efficiency curve bends fast enough to outrun the debt.
The Access Fault Line
Environmental burden is one half of the equation. The other half is access. When the largest AI providers price inference below cost, they create an artificial floor that no smaller organization can match. A startup cannot sustain losing $1.35 for every dollar of revenue. A university research lab cannot subsidize compute at hyperscaler scale. A hospital system in Nairobi cannot negotiate the same per-token rate as a Fortune 500 company in San Francisco.
The efficiency tools themselves — vLLM, SGLang, TensorRT-LLM — are open-source, which is genuinely democratizing at the software layer. But running them requires GPU clusters that cost millions. The software is free; the hardware it demands is not. And even self-hosting carries a hidden cost: a shared vulnerability pattern known as ShadowMQ was recently discovered across vLLM, TensorRT-LLM, and SGLang, enabling remote code execution through pickle deserialization — a reminder that operating inference infrastructure at scale demands not just capital, but security expertise that many organizations lack.
Security & compatibility notes:
- vLLM (ShadowMQ — RCE): CVE-2025-30165 enables remote code execution via pickle deserialization over ZeroMQ. Fix: upgrade to v0.8.0+.
- TensorRT-LLM (ShadowMQ — RCE): CVE-2025-23254, same root cause. Fix: upgrade to v0.18.2+.
- SGLang (ShadowMQ — RCE): Same vulnerability pattern. Patch applied; verify latest release.
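The root cause is easy to illustrate. Below is a minimal sketch of the pattern class using pyzmq directly; it is not any framework's actual code. `recv_pyobj()` unpickles whatever bytes arrive on the socket, and unpickling attacker-controlled data can execute arbitrary code, while a JSON-based exchange confines the wire format to data.

```python
# Minimal illustration of the ShadowMQ root cause, using pyzmq directly.
# This is NOT the frameworks' actual code; it shows only the pattern class.
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.REP)
sock.bind("tcp://0.0.0.0:5555")  # exposed socket: anyone who can reach it can send

# UNSAFE: recv_pyobj() unpickles incoming bytes. A crafted pickle payload
# runs arbitrary code during deserialization (the RCE vector).
# msg = sock.recv_pyobj()

# SAFER: treat the wire as data, not objects. JSON cannot smuggle code.
msg = sock.recv_json()
sock.send_json({"ok": True, "echo": msg})
```

Upgrading to the patched releases above closes the specific CVEs; the broader lesson is that self-hosted inference stacks inherit a networked attack surface that managed APIs keep out of sight.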
The result is a two-tier system. Well-funded organizations access inference at subsidized rates, absorb security risks with dedicated teams, and treat compute as a utility they can take for granted. Everyone else faces the full economic and operational weight of the infrastructure — or goes without. The technology is open. The capacity to use it is not.
The Invisible Utility
Thesis: Large-scale inference is becoming an invisible utility — and like every utility before it, the distribution of its benefits and burdens follows existing lines of wealth and power, not lines of need.
This is not a novel pattern. Electricity, clean water, telecommunications — each became essential infrastructure, and each reproduced the inequalities of the society that built it. Rural electrification took decades after urban grids were complete. Clean water access still tracks income globally. The internet was supposed to be the great equalizer, and broadband access remains stratified by geography and wealth a quarter-century later.
Inference is following the same trajectory, but faster and with a compounding cost that earlier utilities did not carry. Every query, every API call, every automated pipeline adds load to a system that is already outpacing the renewable energy infrastructure meant to sustain it. The convenience is distributed globally. The heat, the water, the carbon — those land somewhere specific. And the communities that host the infrastructure rarely sit at the table where its governance is decided.
Questions We Cannot Defer
If inference is a utility, then the questions that apply to utilities apply here. Who regulates capacity? Who audits environmental impact not through voluntary corporate reports, but through independent, verifiable measurement? Who ensures that the communities hosting data centers benefit from the infrastructure rather than merely bearing its externalities? Who decides that 1.2 million people’s worth of drinking water is an acceptable cost for 700 million daily queries — and on whose authority?
These are not questions that efficiency improvements can answer. A 44x reduction in emissions per prompt is significant — and meaningless if query volume grows by a larger factor in the same period. Efficiency is a necessary condition for sustainability. It is not a sufficient one.
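The arithmetic behind that caveat fits in one line: the total footprint scales with query volume divided by per-query efficiency. The 44x gain is the cited figure; the volume growth factors below are assumptions for illustration.

```python
# Total footprint = volume growth / per-query efficiency gain (vs. baseline).
# 44x is the cited efficiency figure; growth factors are illustrative.
EFFICIENCY_GAIN = 44.0

for volume_growth in (10.0, 44.0, 100.0):
    total = volume_growth / EFFICIENCY_GAIN
    direction = "falls" if total < 1 else ("flat" if total == 1 else "rises")
    print(f"volume x{volume_growth:>5.0f}: total footprint x{total:.2f} ({direction})")
```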
Where This Argument Breaks
The vulnerability in this position is real, and worth naming. If renewable energy capacity catches up with data center demand, with the renewable share of AI workloads rising to 90% or higher, then the carbon argument weakens considerably. If hardware efficiency continues its current trajectory, and if custom silicon approaches like Groq's deliver on their energy promises with independent verification at scale, then inference could become genuinely low-impact per query. The Cornell/Nature Sustainability study suggests this is physically possible. The question is whether it is economically and politically likely: whether companies will choose mitigation when expansion is more profitable.
The access argument is harder to resolve through technology alone. Falling prices help, but structural inequality in compute access requires structural intervention — not just cheaper tokens, but governance frameworks that treat inference as shared infrastructure rather than a luxury good that happens to be temporarily discounted.
The Question That Remains
We built a technology that answers questions instantly, at global scale, around the clock. We did not build the governance to ask who pays for the water it drinks, the carbon it emits, or the access it quietly denies. The meter is running. The question is whether we will read it before the bill arrives — or after.
Disclaimer
This article is for educational purposes only and does not constitute professional advice. Consult qualified professionals for decisions in your specific situation.
AI-assisted content, human-reviewed. Images AI-generated.