Cheap Models, Hidden Costs: Routing Agents to the Lowest Bidder

The Hard Truth
Every routing decision is a quiet moral choice about who absorbs the error budget. So why do we keep treating it as if it were only a line in the cloud bill?
Somewhere right now, an Agent Cost Optimization router is comparing two candidates for the next call. One is slower, sharper, and several times more expensive. The other is fast, cheap, and demonstrably worse at refusing manipulation. The router will pick the cheap one most of the time. It will be right about the cost. It will say nothing about the cost that does not appear in the invoice.
The Conversation We’re Not Having
Most of the discussion about agent routing happens in the language of optimization. Latency budgets, token economics, cascade thresholds, performance-per-dollar. The metrics are clean, the dashboards are crisp, and the savings are real. None of that is wrong. But it is incomplete in a way that matters morally, because the routing decision is not just a choice between two models — it is a choice about who carries the residual risk when the cheaper model is wrong.
That question is rarely on the slide deck. And the silence is itself a kind of answer.
The Engineer’s Case, Stated Honestly
It would be unfair to caricature the engineers building these systems. The case for routing is strong, and it deserves to be stated at its strongest before it is challenged.
Frontier models are expensive, and an enormous share of real traffic does not need a frontier brain. Cost reductions of 50 to 98 percent are achievable while matching the quality of a single top-tier model on tested benchmark sets (FrugalGPT paper). A learned router can reach roughly 95 percent of GPT-4-class quality at a fraction of the spend on standard preference benchmarks (RouteLLM paper). API prices themselves fell by around 80 percent from 2025 to 2026 (LLM Quality vs Cost vs Safety 2026). For an operator running millions of calls a day, refusing to route would be closer to negligence than to virtue. Capital not burned on tokens funds Agent Evaluation And Testing, safety review, and the patient work of building Agent Guardrails that actually hold.
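The cascade pattern behind those numbers is simple to state: try the cheap tier first, and escalate only when the cheap answer looks unreliable. A minimal sketch, in the spirit of FrugalGPT-style cascades; the tier names, prices, and self-scoring function here are illustrative placeholders, not any vendor's real API:

```python
# Minimal sketch of a cost-aware cascade router. Tier names, prices,
# and the confidence scorer are hypothetical stand-ins.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Tier:
    name: str
    cost_per_call: float                      # assumed price, arbitrary units
    answer: Callable[[str], str]              # produces a reply
    confidence: Callable[[str, str], float]   # scores its own reply in [0, 1]

def cascade(prompt: str, tiers: list[Tier], threshold: float = 0.8):
    """Try cheap tiers first; escalate when self-confidence is low."""
    spent = 0.0
    for tier in tiers:
        reply = tier.answer(prompt)
        spent += tier.cost_per_call
        if tier.confidence(prompt, reply) >= threshold:
            return reply, tier.name, spent
    # Fell through every tier: the last (most capable) answer stands.
    return reply, tier.name, spent
```

Note where the moral weight sits in this sketch: the single `threshold` value decides how often the cheap tier's errors are allowed through, and nothing in the code records that decision for the person on the other end.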
So when an engineer says “we route to save cost, and we route well,” they are usually telling the truth.
What the Cost Curve Doesn’t Show
Honesty cuts the other way too. As of 2026, hallucination rates across a benchmark of 37 frontier and mid-tier models span a wide band, from about 15 percent at the careful end to roughly 52 percent at the careless end (LLM Quality vs Cost vs Safety 2026). The cheapest, fastest tier of the market is not located at the careful end. It sits, almost by construction, closer to the careless one.
Then comes the math of multi-step agents, which the dashboards rarely render. If every step in a workflow is right 85 percent of the time — a number most operators would happily accept — a ten-step chain succeeds end-to-end roughly 20 percent of the time (NH Journal counterpoint). Cheap routing trims per-call quality by a few points; agent composition multiplies the consequence. The user does not see the per-step accuracy. They see the broken booking, the wrong dose, the missed appeal deadline.
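The compounding arithmetic in that paragraph is worth making explicit: a chain of n steps, each independently correct with probability p, succeeds end-to-end with probability p to the power n. In code:

```python
# Compounding error in a multi-step agent chain: each of n steps is
# independently correct with probability p, so the whole chain
# succeeds with probability p ** n.

def chain_success(p: float, n: int) -> float:
    return p ** n

print(round(chain_success(0.85, 10), 3))  # 0.197: roughly 20 percent
```

Shaving a cheap-tier model from 90 to 85 percent per-step accuracy looks like a five-point concession on a dashboard; over ten steps it cuts end-to-end success from about 35 percent to about 20 percent.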
This is what the cost curve is unable to display. The savings accrue on the operator’s ledger. The errors accrue somewhere else.
Routing as Quiet Underwriting
There is a useful historical parallel here, and it is not from computer science. It is from insurance.
When an underwriter sorts applicants into risk pools, the math looks neutral. But the choice of which signals to weigh, which thresholds to set, and which populations end up paying more is a moral act dressed in actuarial language. The decisions are technical; the consequences are political. Societies eventually noticed this and built institutions — regulators, ombudsmen, anti-discrimination frameworks — not to abolish underwriting, but to make its tradeoffs answerable.
Routing inside agentic systems is a younger cousin of the same act. Every tier assignment is a tiny underwriting decision: this case gets the careful model, that case gets the cheaper one with the higher jailbreak susceptibility. OWASP’s 2026 list for agentic applications names this directly under categories like Excessive Agency, Misinformation, Improper Output Handling, and Unbounded Consumption (OWASP Gen AI Security Project). These are not exotic edge cases. They are the predictable failure modes of routing without a conscience.
The difference between insurance and agent routing is that the agent system has no ombudsman, no statutory disclosure, and often no Agent Observability layer that surfaces tier-by-tier outcomes to anyone outside the engineering team.
Cost Versus Accountability
Thesis: the real tradeoff at the routing layer is not cost versus quality — it is cost versus accountability.
Cost is paid by the operator and is therefore visible to the operator. Quality is paid in part by the operator too — in refund tickets, churn, support load — and so it shows up, if dimly, on internal dashboards. Accountability is something else. Accountability is the capacity of an affected person to know what happened to them, to contest it, and to find someone whose name is attached to the decision. Cheap routing does not destroy accountability outright. It dilutes it. The decision to use the lower-tier model on a particular call is opaque even to the operator, sometimes even to the agent itself, and almost always to the user. When the call goes wrong, the chain of explanation is too long and too statistical to hold any single point of contact.
This is not a hypothetical concern. UnitedHealth, Humana, and Cigna are facing putative class actions alleging that algorithmic tools improperly denied coverage in ways that look, from the outside, like a cost-driven routing decision whose downstream costs were paid by patients (STAT News). Whatever the courts conclude, the pattern is instructive: a system optimized end-to-end for the operator becomes, in practice, a system whose worst errors are absorbed by the people with the least leverage.
Questions for the Person Holding the Routing Key
So what should the engineer with the routing dial do? I do not think the answer is to stop routing. It is to stop pretending that routing is only an engineering question.
What would change if the Agent Error Handling And Recovery path treated tier downgrade as an event worth logging in the user-facing record, not just the internal metric? What would change if Human In The Loop For Agents review were triggered not by confidence scores alone, but by tier-of-origin on high-stakes calls? What would the team build differently if the question “would we still route this case to the cheap tier if the affected person could see the routing log?” were part of design review, the way privacy review is now part of design review in mature organizations?
None of those are technical questions. All of them shape technical decisions.
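Still, the logging half of the idea is buildable today. A minimal sketch of tier-aware review gating, where human review is triggered by tier-of-origin on high-stakes calls rather than by confidence alone; the field names, tier labels, and threshold policy are hypothetical, not a standard:

```python
# Hedged sketch: log every routing decision as an auditable event,
# and gate high-stakes calls on which tier actually served them.
# Field names and tier labels are illustrative assumptions.

from dataclasses import dataclass, field
import time

@dataclass
class RoutingEvent:
    call_id: str
    tier: str            # which model tier actually served the call
    high_stakes: bool    # e.g. medical, legal, or financial intent
    timestamp: float = field(default_factory=time.time)

def needs_human_review(event: RoutingEvent,
                       careful_tiers: frozenset = frozenset({"frontier"})) -> bool:
    """Trigger review by tier-of-origin, not by confidence score alone."""
    return event.high_stakes and event.tier not in careful_tiers

audit_log: list[RoutingEvent] = []

def record(event: RoutingEvent) -> bool:
    """Append to the user-facing record; return whether review is required."""
    audit_log.append(event)   # a record, not just an internal metric
    return needs_human_review(event)
```

The design choice worth noticing: the log exists whether or not review fires, which is what makes the routing decision contestable after the fact.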
Where This Argument Is Weakest
The honest place where this argument is most fragile is the absence of a clean comparative study. No published research isolates the safety delta strictly attributable to routing-to-cheaper-tier as opposed to using a cheap model directly. The reasoning here rests on measured quality and jailbreak gaps between tiers plus the compounding error math of multi-step agents — a defensible chain, but a chain. If careful empirical work showed that calibrated cascades inherit the safety profile of the upstream model, much of this critique would need to be retired. I would welcome that result. The point is that we should not be assuming it.
The Question That Remains
The routing layer is becoming one of the most consequential decision surfaces in the AI stack, and almost nobody outside the engineering team can see it. If we accept that this layer is a quiet form of governance, then the question is not whether to route — it is whose interests the router is allowed to forget.
Disclaimer
This article is for educational purposes only and does not constitute professional advice. Consult qualified professionals for decisions in your specific situation.