
The Hidden Cost of Million-Token Context: Who Gets Priced Out

[Image: Contrast between vast data-centre infrastructure and a small developer's workspace, signalling long-context AI access inequality.]
Before you dive in

This article is a specific deep-dive within our broader topic of Long-Context vs RAG.

Coming from software engineering? Read the bridge first: RAG Quality for Developers: What Testing Instincts Still Apply →

The Hard Truth

A million-token context window is sold as convenience: paste everything in, let the model sort it out. But convenience is never evenly distributed. What happens when the default way of building with AI quietly excludes everyone who cannot afford to run it?

The pitch is seductive. Drop your entire codebase, every contract, the whole research corpus into a single prompt and let the model decide what matters. No retrieval pipelines, no chunking, no vector store to maintain. Convenience encoded as a frontier feature. But convenience is never neutral, and the bill for this one is being quietly passed to people who never asked to pay it.

The Question Vendors Are Not Asking

For most of the last two years, the conversation about Long-Context vs RAG has been framed as a technical contest — which architecture retrieves better, which one hallucinates less, which one wins on long-document benchmarks. Useful questions, all of them. But they share a quiet assumption: that the choice belongs to whoever is building the system.

That assumption deserves to be questioned. Because the moment a frontier vendor sets pricing, energy demand, and infrastructure prerequisites for a million-token call, the choice has already been narrowed for everyone downstream. The actual question is not which architecture is better on a leaderboard. It is who can still afford to participate when the default prompt size expands by a factor of a thousand.

What the Convenience Argument Gets Right

The case for long-context is genuinely strong, and it deserves to be heard at full strength before it is criticized. RAG pipelines are operationally expensive. Chunking strategies leak nuance. Embedding models age. RAG Evaluation is its own engineering discipline, and many teams that adopt retrieval discover, six months in, that the quality of their answers depends less on the model and more on how well their retrieval layer was tuned.

A million-token window collapses that complexity into a single call. No retrieval layer, no embedding drift, no chunking heuristic to argue about. Anthropic positions this directly as an economic differentiator: Opus 4.7 charges $5 per million input tokens and $25 per million output tokens across the full one-million-token window, with no long-context surcharge (Anthropic News). For a development team that values architectural simplicity, that is a real and defensible offer. The convenience argument is not wrong. It is just incomplete.

The Hidden Variable in the Pricing Curve

Where the argument quietly breaks is the moment inference costs are examined honestly across the field. Two of the three frontier vendors price long context as a premium good. Gemini 2.5 Pro charges $1.25 per million input tokens below 200K but doubles to $2.50 once a prompt crosses that threshold (Google AI Docs). GPT-5.5 goes further: a single prompt over 272K tokens flips the entire session to twice the input price and 1.5× the output price for the rest of the conversation (OpenAI Docs). One large prompt, and every subsequent message in that session is taxed.
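
To make the asymmetry concrete, here is a minimal sketch of the three billing schemes, in Python, using the per-million-token input rates quoted above. The tier logic is simplified, the GPT-5.5 base input rate is a placeholder (only the multiplier is cited here), and the function names are mine; treat it as an illustration, not a billing reference.

```python
# Illustrative sketch of the three billing schemes described above.
# Rates are the per-million-token input figures quoted in this
# article; the GPT-5.5 base rate is a placeholder assumption.

MTOK = 1_000_000

def flat_cost(tokens, rate=5.00):
    """Flat scheme (the Opus 4.7 pricing cited above): one rate everywhere."""
    return tokens / MTOK * rate

def tiered_cost(tokens, low=1.25, high=2.50, threshold=200_000):
    """Tiered scheme (the Gemini 2.5 Pro pricing cited above).
    Simplified: the whole prompt bills at the higher rate once it
    crosses the threshold."""
    return tokens / MTOK * (high if tokens > threshold else low)

def session_cost(prompts, base=1.25, multiplier=2.0, threshold=272_000):
    """Session-flip scheme (the GPT-5.5 behaviour described above):
    one oversized prompt raises the input rate for every later
    message in the session. `base` is a placeholder rate."""
    total, flipped = 0.0, False
    for tokens in prompts:
        flipped = flipped or tokens > threshold
        total += tokens / MTOK * (base * multiplier if flipped else base)
    return total

big = 800_000                # one long-context prompt
followups = [2_000] * 10     # ten ordinary follow-up messages

print(f"flat:    ${flat_cost(big):.2f}")                   # $4.00
print(f"tiered:  ${tiered_cost(big):.2f}")                 # $2.00
print(f"session: ${session_cost([big] + followups):.3f}")  # $2.050
```

The session-flip scheme is the one that stings. The ten small follow-ups never cross the threshold themselves, yet each one pays the doubled rate because a single earlier prompt did.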

The energy footprint moves in the same direction, only steeper. Attention compute scales quadratically with context length — doubling the input quadruples the work (arXiv 2507.04239). Industry estimates suggest an 800K-token query on a frontier model consumes roughly 14.1 Wh of energy, against around 0.7 Wh for a short chat — a 10–20× ratio per query (Digital Applied Report). Those figures are estimates, not measurements, but the directional signal aligns with peer-reviewed work showing the most energy-intensive models exceed 29 Wh per long prompt (Jegham et al. 2025). The IEA reports data-centre electricity consumption grew 17% year over year in 2025, with AI capacity on track to triple by 2030 (IEA News).
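
The quadratic claim itself is easy to sanity-check. A back-of-envelope sketch, counting only the operations in one attention score matrix and assuming a typical head dimension of 128:

```python
# Back-of-envelope check on quadratic attention scaling. The score
# matrix QK^T is (n x n), so a single full pass over the context
# costs on the order of n^2 operations per head.

def attention_flops(n, d=128):
    """Rough FLOPs for one head's score matrix: an (n x d) by (d x n)
    matmul is about 2 * n^2 * d multiply-adds. d=128 is an assumed
    typical head dimension."""
    return 2 * n * n * d

short, long_ctx = 4_000, 800_000
ratio = attention_flops(long_ctx) / attention_flops(short)
print(f"{long_ctx // short}x more tokens -> {ratio:,.0f}x more attention compute")
# prints: 200x more tokens -> 40,000x more attention compute
```

Real serving stacks blunt this with KV caching, batching, and kernels like FlashAttention, and linear-scaling layers dominate at moderate lengths, so per-query energy grows far more slowly than raw attention compute. But the direction is the same: longer context, disproportionately more work.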

That is the variable hidden inside the pricing curve. Long-context is not merely “the same conversation, only longer.” It is a structurally more expensive way of asking a question, and the structure tilts every cost — financial, electrical, hydrological — in a single direction.

What an Older Discipline Knew About Default Costs

Public utility regulators learned this lesson a century ago, and the lesson is worth remembering. When a utility raises the marginal cost of a basic service — whether electricity, water, or telephony — the rate change rarely lands evenly. Large industrial users absorb the increase and continue. Small users, who have no contract leverage and no economy of scale, either ration their use or fall out of the system entirely. The practical effect of a “neutral” pricing change becomes a sorting mechanism, separating who continues to participate from who quietly disappears.

The parallel matters. A solo researcher in Nairobi cannot negotiate a prompt-caching contract. A small NGO in Manila cannot amortize a long-context surcharge across thousands of paying users. A computer-science student in São Paulo, working through a problem set on a free tier, cannot opt out of GPT-5.5’s session-wide premium once a single document pushes the conversation over the threshold. They simply stop participating, or they participate at a lower tier of capability than the people who set the defaults assume is normal.

The Default That Decides Who Belongs

Thesis: Long-context is becoming the default architecture for AI applications, and that shift, treated as a technical convenience, is quietly redrawing the line between who builds with these systems and who is locked out.

This is not an argument against long-context. It is an argument against treating it as the obvious answer. When developer documentation, tutorials, and reference implementations standardize on the assumption that you can simply paste the corpus, the lighter-weight alternatives — retrieval, Sparse Retrieval, careful prompt construction — start to look outdated rather than appropriate. The default carries normative weight. It tells builders what a reasonable AI system looks like, and that picture is increasingly one that requires a frontier API key and a patient finance team.

A back-of-envelope analysis suggests that a full-context query can run roughly a thousand times more expensive per call than the equivalent retrieval-based approach (TianPan Analysis). The figure is illustrative rather than precise — retrieval costs vary enormously by pipeline quality — but the magnitude points to something real. When the assumed-normal pattern is the more expensive one, every team that cannot afford it is implicitly told they are building wrong.
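
The magnitude is straightforward to reconstruct from token counts alone. A minimal sketch, assuming a paste-everything prompt of a million tokens against a retrieval prompt of roughly a thousand, and reusing the flat input rate quoted earlier:

```python
# Reconstructing the rough "thousand times" magnitude. Every number
# here is an assumption for illustration, not a measurement, and it
# ignores the retrieval pipeline's own (much smaller) serving cost.

RATE = 5.00 / 1_000_000            # $/input token, the flat rate above

full_context_tokens = 1_000_000    # paste the whole corpus
rag_tokens = 200 + 4 * 200         # a question plus four ~200-token chunks

full_cost = full_context_tokens * RATE
rag_cost = rag_tokens * RATE

print(f"full context: ${full_cost:.4f} per call")
print(f"rag prompt:   ${rag_cost:.4f} per call (~{full_cost / rag_cost:,.0f}x cheaper)")
```

Per call, the gap is pennies against dollars. At a few thousand calls a day, it is the difference between a hobby budget and a funded engineering team.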

Questions Worth Sitting With

What does it mean for the future of open development if the most-documented AI patterns assume budgets that smaller actors do not have? What happens to the global research community when the default workflow for working with a long document is one most public-sector institutions cannot sustainably run?

There is also a quieter question about provenance. RAG Guardrails And Grounding exists because retrieval, done well, makes sources visible — you can audit which document supplied which claim. Long-context calls collapse that audit trail into a single opaque computation. If the default shifts toward the opaque option because it is operationally simpler, what do we lose in our ability to ask, after the fact, where an answer actually came from?

These are not questions with clean engineering answers. They are questions about which trade-offs we are willing to bake into the infrastructure that increasingly mediates how knowledge is made.

Where This Argument Is Weakest

The argument has real vulnerabilities, and they should be named. Sub-quadratic attention research, if it delivered on its long-standing promises, could collapse the energy and cost gap that this essay treats as structural — though independent analyses suggest many such claims have not held up in practice (LessWrong). Anthropic’s flat-rate pricing also undercuts the premise: if other vendors follow, the access argument weakens considerably (Anthropic News). And no clean dataset yet quantifies which actors actually drop out at higher context tiers; the exclusion case rests on pricing curves and infrastructure economics, not on observed user-loss data. If those curves flatten or reverse, the thesis weakens with them.

The Question That Remains

Convenience always looks free until you ask who is paying for it. The deeper question is not whether million-token context windows are useful — they clearly are — but whether we are willing to let “useful for the well-resourced” quietly become “the way AI is built.” Who do we owe a seat at this table, and what does it cost us, ethically, to keep that seat available?

— Ethically, Alan.

Disclaimer

This article is for educational purposes only and does not constitute professional advice. Consult qualified professionals for decisions in your specific situation.

AI-assisted content, human-reviewed. Images AI-generated. Editorial Standards · Our Editors