Cheap Models, Hidden Costs: Routing Agents to the Lowest Bidder

The Hard Truth
Every routing decision is a quiet moral choice about who absorbs the error budget. So why do we keep treating it as if it were only a line in the cloud bill?
Somewhere right now, an Agent Cost Optimization router is comparing two candidates for the next call. One is slower, sharper, and several times more expensive. The other is fast, cheap, and demonstrably worse at refusing manipulation. The router will pick the cheap one most of the time. It will be right about the cost. It will say nothing about the cost that does not appear in the invoice.
The Conversation We’re Not Having
Most of the discussion about agent routing happens in the language of optimization. Latency budgets, token economics, cascade thresholds, performance-per-dollar. The metrics are clean, the dashboards are crisp, and the savings are real. None of that is wrong. But it is incomplete in a way that matters morally, because the routing decision is not just a choice between two models — it is a choice about who carries the residual risk when the cheaper model is wrong.
That question is rarely on the slide deck. And the silence is itself a kind of answer.
The Engineer’s Case, Stated Honestly
It would be unfair to caricature the engineers building these systems. The case for routing is strong, and it deserves to be stated at its strongest before it is challenged.
Frontier models are expensive, and an enormous share of real traffic does not need a frontier brain. Cost reductions of 50 to 98 percent are achievable while matching the quality of a single top-tier model on tested benchmark sets (FrugalGPT paper). A learned router can reach roughly 95 percent of GPT-4-class quality at a fraction of the spend on standard preference benchmarks (RouteLLM paper). API prices themselves fell by around 80 percent from 2025 to 2026 (LLM Quality vs Cost vs Safety 2026). For an operator running millions of calls a day, refusing to route would be closer to negligence than to virtue. Capital not burned on tokens funds Agent Evaluation And Testing, safety review, and the patient work of building Agent Guardrails that actually hold.
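The cascade pattern behind those numbers is simple to state: try the cheap tier first, and escalate only when the cheap answer looks unreliable. A minimal sketch, in the spirit of FrugalGPT-style cascades; the tier names, prices, and self-scoring function here are illustrative placeholders, not any vendor's real API:

```python
# Minimal sketch of a cost-aware cascade router. Tier names, prices,
# and the confidence scorer are hypothetical stand-ins.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Tier:
    name: str
    cost_per_call: float                      # assumed price, arbitrary units
    answer: Callable[[str], str]              # produces a reply
    confidence: Callable[[str, str], float]   # scores its own reply in [0, 1]

def cascade(prompt: str, tiers: list[Tier], threshold: float = 0.8):
    """Try cheap tiers first; escalate when self-confidence is low."""
    spent = 0.0
    for tier in tiers:
        reply = tier.answer(prompt)
        spent += tier.cost_per_call
        if tier.confidence(prompt, reply) >= threshold:
            return reply, tier.name, spent
    # Fell through every tier: the last (most capable) answer stands.
    return reply, tier.name, spent
```

Note where the moral weight sits in this sketch: the single `threshold` value decides how often the cheap tier's errors are allowed through, and nothing in the code records that decision for the person on the other end.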
So when an engineer says “we route to save cost, and we route well,” they are usually telling the truth.
What the Cost Curve Doesn’t Show
Honesty cuts the other way too. As of 2026, hallucination rates across a benchmark of 37 frontier and mid-tier models span a wide band, from about 15 percent at the careful end to roughly 52 percent at the careless end (LLM Quality vs Cost vs Safety 2026). The cheapest, fastest tier of the market is not located at the careful end. It sits, almost by construction, closer to the careless one.
Then comes the math of multi-step agents, which the dashboards rarely render. If every step in a workflow is right 85 percent of the time — a number most operators would happily accept — a ten-step chain succeeds end-to-end roughly 20 percent of the time (NH Journal counterpoint). Cheap routing trims per-call quality by a few points; agent composition multiplies the consequence. The user does not see the per-step accuracy. They see the broken booking, the wrong dose, the missed appeal deadline.
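The compounding arithmetic in that paragraph is worth making explicit: a chain of n steps, each independently correct with probability p, succeeds end-to-end with probability p to the power n. In code:

```python
# Compounding error in a multi-step agent chain: each of n steps is
# independently correct with probability p, so the whole chain
# succeeds with probability p ** n.

def chain_success(p: float, n: int) -> float:
    return p ** n

print(round(chain_success(0.85, 10), 3))  # 0.197: roughly 20 percent
```

Shaving a cheap-tier model from 90 to 85 percent per-step accuracy looks like a five-point concession on a dashboard; over ten steps it cuts end-to-end success from about 35 percent to about 20 percent.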
This is what the cost curve is unable to display. The savings accrue on the operator’s ledger. The errors accrue somewhere else.
Routing as Quiet Underwriting
There is a useful historical parallel here, and it is not from computer science. It is from insurance.
When an underwriter sorts applicants into risk pools, the math looks neutral. But the choice of which signals to weigh, which thresholds to set, and which populations end up paying more is a moral act dressed in actuarial language. The decisions are technical; the consequences are political. Societies eventually noticed this and built institutions — regulators, ombudsmen, anti-discrimination frameworks — not to abolish underwriting, but to make its tradeoffs answerable.
Routing inside agentic systems is a younger cousin of the same act. Every tier assignment is a tiny underwriting decision: this case gets the careful model, that case gets the cheaper one with the higher jailbreak susceptibility. OWASP’s 2026 list for agentic applications names this directly under categories like Excessive Agency, Misinformation, Improper Output Handling, and Unbounded Consumption (OWASP Gen AI Security Project). These are not exotic edge cases. They are the predictable failure modes of routing without a conscience.
The difference between insurance and agent routing is that the agent system has no ombudsman, no statutory disclosure, and often no Agent Observability layer that surfaces tier-by-tier outcomes to anyone outside the engineering team.
Cost Versus Accountability
Thesis: the real tradeoff at the routing layer is not cost versus quality — it is cost versus accountability.
Cost is paid by the operator and is therefore visible to the operator. Quality is paid in part by the operator too — in refund tickets, churn, support load — and so it shows up, if dimly, on internal dashboards. Accountability is something else. Accountability is the capacity of an affected person to know what happened to them, to contest it, and to find someone whose name is attached to the decision. Cheap routing does not destroy accountability outright. It dilutes it. The decision to use the lower-tier model on a particular call is opaque even to the operator, sometimes even to the agent itself, and almost always to the user. When the call goes wrong, the chain of explanation is too long and too statistical to hold any single point of contact.
This is not a hypothetical concern. UnitedHealth, Humana, and Cigna are facing putative class actions alleging that algorithmic tools improperly denied coverage in ways that look, from the outside, like a cost-driven routing decision whose downstream costs were paid by patients (STAT News). Whatever the courts conclude, the pattern is instructive: a system optimized end-to-end for the operator becomes, in practice, a system whose worst errors are absorbed by the people with the least leverage.
Questions for the Person Holding the Routing Key
So what should the engineer with the routing dial do? I do not think the answer is to stop routing. It is to stop pretending that routing is only an engineering question.
What would change if the Agent Error Handling And Recovery path treated tier downgrade as an event worth logging in the user-facing record, not just the internal metric? What would change if Human In The Loop For Agents review were triggered not by confidence scores alone, but by tier-of-origin on high-stakes calls? What would the team build differently if the question “would we still route this case to the cheap tier if the affected person could see the routing log?” were part of design review, the way privacy review is now part of design review in mature organizations?
None of those are technical questions. All of them shape technical decisions.
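Still, the logging half of the idea is buildable today. A minimal sketch of tier-aware review gating, where human review is triggered by tier-of-origin on high-stakes calls rather than by confidence alone; the field names, tier labels, and threshold policy are hypothetical, not a standard:

```python
# Hedged sketch: log every routing decision as an auditable event,
# and gate high-stakes calls on which tier actually served them.
# Field names and tier labels are illustrative assumptions.

from dataclasses import dataclass, field
import time

@dataclass
class RoutingEvent:
    call_id: str
    tier: str            # which model tier actually served the call
    high_stakes: bool    # e.g. medical, legal, or financial intent
    timestamp: float = field(default_factory=time.time)

def needs_human_review(event: RoutingEvent,
                       careful_tiers: frozenset = frozenset({"frontier"})) -> bool:
    """Trigger review by tier-of-origin, not by confidence score alone."""
    return event.high_stakes and event.tier not in careful_tiers

audit_log: list[RoutingEvent] = []

def record(event: RoutingEvent) -> bool:
    """Append to the user-facing record; return whether review is required."""
    audit_log.append(event)   # a record, not just an internal metric
    return needs_human_review(event)
```

The design choice worth noticing: the log exists whether or not review fires, which is what makes the routing decision contestable after the fact.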
Where This Argument Is Weakest
The honest place where this argument is most fragile is the absence of a clean comparative study. No published research isolates the safety delta strictly attributable to routing-to-cheaper-tier as opposed to using a cheap model directly. The reasoning here rests on measured quality and jailbreak gaps between tiers plus the compounding error math of multi-step agents — a defensible chain, but a chain. If careful empirical work showed that calibrated cascades inherit the safety profile of the upstream model, much of this critique would need to be retired. I would welcome that result. The point is that we should not be assuming it.
The Question That Remains
The routing layer is becoming one of the most consequential decision surfaces in the AI stack, and almost nobody outside the engineering team can see it. If we accept that this layer is a quiet form of governance, then the question is not whether to route — it is whose interests the router is allowed to forget.
Disclaimer
This article is for educational purposes only and does not constitute professional advice. Consult qualified professionals for decisions in your specific situation.