Always-On AI: The Environmental Price and Access Inequality of Large-Scale Inference

The Hard Truth
Every AI prompt you send draws water from a river, electricity from a grid, and carbon into the atmosphere. The question is not whether this cost exists — it is whether anyone has decided it is yours to bear, or someone else’s to absorb.
There is a peculiar comfort in asking a machine a question and receiving an answer in under a second. The interaction feels weightless: pure information, arriving from nowhere. But inference is not weightless. It is an industrial process, running continuously on hardware that consumes electricity, generates heat, and requires water to cool. The convenience is real. The invisibility of its cost is engineered.
The Meter Nobody Reads
We have grown accustomed to treating AI as a service that simply exists, like weather. You ask, it answers. But behind every prompt is a chain of physical resources — accelerators drawing power, cooling systems cycling water, grids burning fuel to keep servers running through the night. Inference accounts for 60–90% of total AI energy use (MIT News), and global data center electricity consumption reached approximately 415 TWh in 2024, roughly 1.5% of global electricity (IEA). By 2030, that figure is projected to reach 945 TWh — about 3% of global supply — with AI’s share climbing from 5–15% today to 35–50% (IEA).
These numbers are abstract until you translate them into something a person can feel. A single median text prompt to Google’s Gemini consumes 0.24 Wh of energy and 0.26 mL of water — about five drops (Google Research). Scale that to 700 million queries a day, and the freshwater consumed equals the annual drinking needs of 1.2 million people (Luccioni et al.). The cost is not theoretical — it is hydrological. By 2030, AI’s projected water footprint could reach 731–1,125 million cubic meters per year, equivalent to household water use for 6–10 million Americans (Cornell/Nature Sustainability). The projected carbon footprint reaches 24–44 million metric tons of CO2 annually — the equivalent of 5–10 million cars on the road (Cornell/Nature Sustainability).
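To make the per-prompt figures concrete at scale, here is a back-of-envelope sketch. The per-prompt energy and water values are the Google Research medians cited above; the 700 million daily queries and 365-day annualization are illustrative assumptions, and the result is a floor, since per-prompt medians exclude training, idle capacity, and the broader facility overhead that aggregate estimates typically count.

```python
# Back-of-envelope scaling of per-prompt inference costs.
# Per-prompt figures are the cited Google Research medians;
# query volume and annualization are illustrative assumptions.

ENERGY_PER_PROMPT_WH = 0.24   # median Gemini text prompt (Wh)
WATER_PER_PROMPT_ML = 0.26    # median Gemini text prompt (mL)
QUERIES_PER_DAY = 700e6       # assumed global daily query volume
DAYS_PER_YEAR = 365

daily_energy_mwh = ENERGY_PER_PROMPT_WH * QUERIES_PER_DAY / 1e6   # Wh -> MWh
daily_water_liters = WATER_PER_PROMPT_ML * QUERIES_PER_DAY / 1e3  # mL -> L

annual_energy_gwh = daily_energy_mwh * DAYS_PER_YEAR / 1e3        # MWh -> GWh
annual_water_megaliters = daily_water_liters * DAYS_PER_YEAR / 1e6

print(f"Daily energy:  {daily_energy_mwh:,.0f} MWh")
print(f"Daily water:   {daily_water_liters:,.0f} L")
print(f"Annual energy: {annual_energy_gwh:,.0f} GWh")
print(f"Annual water:  {annual_water_megaliters:,.1f} ML")
```

On these assumptions: roughly 168 MWh and 182,000 liters per day, about 61 GWh and 66 million liters per year. A floor, not a ceiling, because the medians count only the marginal draw of each prompt and nothing around it.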
Who approved this allocation? Not the communities downstream from the data centers. Not the grids straining under new demand. Not the residents of regions where water scarcity was already a crisis before the nearest hyperscaler broke ground.
The Case for Optimism in Watts per Token
The counterargument is sophisticated, and it deserves to be heard at full strength. Quantization techniques compress models to run on less hardware. Frameworks like vLLM, TensorRT-LLM, and SGLang implement continuous batching and paged attention to maximize throughput per watt. Speculative decoding reduces per-token generation latency while keeping hardware utilization high. Custom silicon from companies like Groq promises roughly an order-of-magnitude improvement in energy efficiency per inference task compared to conventional GPUs, though independent benchmarks remain pending (Groq). Google Research reports a 44x reduction in total emissions per prompt between May 2024 and May 2025. Inference prices have fallen at a median rate of 50x per year, accelerating to roughly 200x per year after January 2024 (Epoch AI).
This narrative is not wrong. The efficiency gains are measurable and the price trajectory is genuine. If you follow the curve far enough, it looks like the problem solves itself — that the market, left to its own devices, will optimize away the environmental burden one chip redesign at a time.
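Much of that optimism reduces to a single fraction: watts divided by tokens. Here is a simplified sketch of why batching dominates that fraction; the power draw, decode speed, and scaling factor below are illustrative assumptions, not measurements of any particular accelerator.

```python
# Illustrative model of energy per token under batching.
# All figures are assumptions for the sketch, not measurements.

ACCELERATOR_POWER_W = 700.0    # assumed steady-state draw of one accelerator
TOKENS_PER_SEC_SINGLE = 60.0   # assumed decode speed at batch size 1

def energy_per_token_wh(batch_size: int, scaling: float = 0.9) -> float:
    """Energy per generated token, assuming near-linear throughput
    scaling with batch size (continuous batching keeps slots full)."""
    throughput = TOKENS_PER_SEC_SINGLE * batch_size * scaling
    return ACCELERATOR_POWER_W / throughput / 3600.0  # W*s -> Wh

for batch in (1, 8, 64):
    print(f"batch {batch:3d}: {energy_per_token_wh(batch) * 1000:.3f} mWh/token")
```

Idle accelerator-seconds cost the same watts as busy ones, which is why continuous batching, paged attention, and speculative decoding all attack the same denominator: tokens delivered per second of powered hardware.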
The Subsidy We Do Not Name
But falling prices do not mean falling costs. They mean someone else is absorbing the difference. OpenAI reportedly generated $3.7B in revenue against approximately $5B in operating losses in 2025, losing roughly $1.35 for every dollar it earned; that figure comes from third-party analysis (AI Automation Global), not official filings, and is best treated as approximate. OpenAI, Google, Anthropic, and Meta are all pricing inference below cost (AI Automation Global). This is not efficiency. It is subsidy, funded by venture capital and corporate treasuries, designed to capture market share before the bill comes due.
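The loss ratio is simple arithmetic, and worth writing out because it is easy to misstate; the inputs are the reported, approximate figures above.

```python
# Subsidy arithmetic from the reported (approximate) 2025 figures.
revenue_b = 3.7          # reported revenue, $B
operating_loss_b = 5.0   # reported operating loss, $B

expenses_b = revenue_b + operating_loss_b        # ~8.7
spent_per_dollar = expenses_b / revenue_b        # ~2.35 spent per $1 earned
lost_per_dollar = operating_loss_b / revenue_b   # ~1.35 lost per $1 earned

print(f"Spent per dollar earned: ${spent_per_dollar:.2f}")
print(f"Lost per dollar earned:  ${lost_per_dollar:.2f}")
```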
The environmental cost follows the same logic of deferral. The Cornell/Nature Sustainability roadmap demonstrates that smart geographic siting and efficiency improvements could achieve 73% carbon reduction and 86% water reduction, but only if companies choose to invest in mitigation rather than raw expansion. The current trajectory is expansion. Despite corporate renewable energy pledges, only an estimated 40–60% of AI workloads run on renewables (MIT Technology Review), which means the remainder does not. The gap between the pledge and the grid is where the externality lives.
We have seen this pattern before. Fossil fuel subsidies made energy cheap for a century by deferring the climate cost to a later generation. The structural similarity is uncomfortable: make the resource artificially cheap today, socialize the environmental debt across communities and ecosystems that never consented to the bargain, and call it progress. The question is whether inference subsidies are a new chapter of the same story — or whether this time the efficiency curve bends fast enough to outrun the debt.
The Access Fault Line
Environmental burden is one half of the equation. The other half is access. When the largest AI providers price inference below cost, they create an artificial floor that no smaller organization can match. A startup cannot sustain losing $1.35 for every dollar of revenue. A university research lab cannot subsidize compute at hyperscaler scale. A hospital system in Nairobi cannot negotiate the same per-token rate as a Fortune 500 company in San Francisco.
The efficiency tools themselves — vLLM, SGLang, TensorRT-LLM — are open-source, which is genuinely democratizing at the software layer. But running them requires GPU clusters that cost millions. The software is free; the hardware it demands is not. And even self-hosting carries a hidden cost: a shared vulnerability pattern known as ShadowMQ was recently discovered across vLLM, TensorRT-LLM, and SGLang, enabling remote code execution through pickle deserialization — a reminder that operating inference infrastructure at scale demands not just capital, but security expertise that many organizations lack.
Security & compatibility notes:
- vLLM (ShadowMQ — RCE): CVE-2025-30165 enables remote code execution via pickle deserialization over ZeroMQ. Fix: upgrade to v0.8.0+.
- TensorRT-LLM (ShadowMQ — RCE): CVE-2025-23254, same root cause. Fix: upgrade to v0.18.2+.
- SGLang (ShadowMQ — RCE): Same vulnerability pattern. Patch applied; verify latest release.
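The root cause is easy to illustrate. Below is a minimal sketch of the pattern class using pyzmq directly; it is not any framework's actual code. `recv_pyobj()` unpickles whatever bytes arrive on the socket, and unpickling attacker-controlled data can execute arbitrary code, while a JSON-based exchange confines the wire format to data.

```python
# Minimal illustration of the ShadowMQ root cause, using pyzmq directly.
# This is NOT the frameworks' actual code; it shows only the pattern class.
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.REP)
sock.bind("tcp://0.0.0.0:5555")  # exposed socket: anyone who can reach it can send

# UNSAFE: recv_pyobj() unpickles incoming bytes. A crafted pickle payload
# runs arbitrary code during deserialization (the RCE vector).
# msg = sock.recv_pyobj()

# SAFER: treat the wire as data, not objects. JSON cannot smuggle code.
msg = sock.recv_json()
sock.send_json({"ok": True, "echo": msg})
```

Upgrading to the patched releases above closes the specific CVEs; the broader lesson is that self-hosted inference stacks inherit a networked attack surface that managed APIs keep out of sight.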
The result is a two-tier system. Well-funded organizations access inference at subsidized rates, absorb security risks with dedicated teams, and treat compute as a utility they can take for granted. Everyone else faces the full economic and operational weight of the infrastructure — or goes without. The technology is open. The capacity to use it is not.
The Invisible Utility
Thesis: Large-scale inference is becoming an invisible utility — and like every utility before it, the distribution of its benefits and burdens follows existing lines of wealth and power, not lines of need.
This is not a novel pattern. Electricity, clean water, telecommunications — each became essential infrastructure, and each reproduced the inequalities of the society that built it. Rural electrification took decades after urban grids were complete. Clean water access still tracks income globally. The internet was supposed to be the great equalizer, and broadband access remains stratified by geography and wealth a quarter-century later.
Inference is following the same trajectory, but faster and with a compounding cost that earlier utilities did not carry. Every query, every API call, every automated pipeline adds load to a system that is already outpacing the renewable energy infrastructure meant to sustain it. The convenience is distributed globally. The heat, the water, the carbon — those land somewhere specific. And the communities that host the infrastructure rarely sit at the table where its governance is decided.
Questions We Cannot Defer
If inference is a utility, then the questions that apply to utilities apply here. Who regulates capacity? Who audits environmental impact not through voluntary corporate reports, but through independent, verifiable measurement? Who ensures that the communities hosting data centers benefit from the infrastructure rather than merely bearing its externalities? Who decides that 1.2 million people’s worth of drinking water is an acceptable cost for 700 million daily queries — and on whose authority?
These are not questions that efficiency improvements can answer. A 44x reduction in emissions per prompt is significant — and meaningless if query volume grows by a larger factor in the same period. Efficiency is a necessary condition for sustainability. It is not a sufficient one.
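The arithmetic behind that caveat fits in one line: the total footprint scales with query volume divided by per-query efficiency. The 44x gain is the cited figure; the volume growth factors below are assumptions for illustration.

```python
# Total footprint = volume growth / per-query efficiency gain (vs. baseline).
# 44x is the cited efficiency figure; growth factors are illustrative.
EFFICIENCY_GAIN = 44.0

for volume_growth in (10.0, 44.0, 100.0):
    total = volume_growth / EFFICIENCY_GAIN
    direction = "falls" if total < 1 else ("flat" if total == 1 else "rises")
    print(f"volume x{volume_growth:>5.0f}: total footprint x{total:.2f} ({direction})")
```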
Where This Argument Breaks
The vulnerability in this position is real, and worth naming. If renewable energy capacity catches up with data center demand, with the renewable share of AI workloads rising to 90% or higher, then the carbon argument weakens considerably. If hardware efficiency continues its current trajectory, and if custom silicon approaches like Groq's deliver on their energy promises with independent verification at scale, then inference could become genuinely low-impact per query. The Cornell/Nature Sustainability study suggests this is physically possible. The question is whether it is economically and politically likely: whether companies will choose mitigation when expansion is more profitable.
The access argument is harder to resolve through technology alone. Falling prices help, but structural inequality in compute access requires structural intervention — not just cheaper tokens, but governance frameworks that treat inference as shared infrastructure rather than a luxury good that happens to be temporarily discounted.
The Question That Remains
We built a technology that answers questions instantly, at global scale, around the clock. We did not build the governance to ask who pays for the water it drinks, the carbon it emits, or the access it quietly denies. The meter is running. The question is whether we will read it before the bill arrives — or after.
Disclaimer
This article is for educational purposes only and does not constitute professional advice. Consult qualified professionals for decisions in your specific situation.
AI-assisted content, human-reviewed. Images AI-generated.