Agent Cost Optimization Prerequisites: Pricing, Latency, Caching Limits

ELI5
Before tuning agent costs, learn three things: tokens are priced asymmetrically (output is far more expensive than input), latency comes from two different mechanics, and caches help only when their preconditions hold exactly.
An agent run is not one API call. It is a chain, sometimes dozens of calls long, where the next prompt is built from the last response, the tool outputs, and a slowly growing scratchpad of context. Most teams optimize this by trimming words, which feels productive until the bill arrives. The cost surface of an agent is shaped by mechanics most prompts never expose: a pricing asymmetry between input and output, two distinct latency regimes, and a caching layer that only works under conditions you usually violate without noticing.
The Token Bill Is Geometry, Not a Subscription
Treating model cost as a flat per-call expense is the most common mistake in Agent Cost Optimization. The number on the invoice is driven by three vectors: model tier, output share, and prefill volume. Move any one of them and the cost surface deforms.
What do you need to understand before optimizing agent costs?
Start with the price sheet, because the price sheet is not symmetric. Output tokens cost roughly 5–10x more than input tokens across the current frontier. Claude Opus 4.7 sits at $5 / $25 per million input/output tokens, Claude Sonnet 4.6 at $3 / $15, Claude Haiku 4.5 at $1 / $5 (Anthropic Docs). GPT-5.5 is $5 / $30 (OpenAI Pricing). A reasoning-heavy agent that emits long traces will burn output budget regardless of how clean its prompts look.
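To make the asymmetry concrete, here is a minimal cost sketch using the list prices quoted above. The model keys are illustrative identifiers, not official API strings, and the prices are a snapshot, not a live feed.

```python
# Per-million-token prices quoted in this section: (input_usd, output_usd).
PRICES = {
    "claude-opus-4.7": (5.00, 25.00),
    "claude-sonnet-4.6": (3.00, 15.00),
    "claude-haiku-4.5": (1.00, 5.00),
    "gpt-5.5": (5.00, 30.00),
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one call in USD. Ignores caching and long-context multipliers."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# The asymmetry in action: the same 10K tokens, split two ways.
print(call_cost("claude-sonnet-4.6", 9_000, 1_000))  # input-heavy:  $0.042
print(call_cost("claude-sonnet-4.6", 1_000, 9_000))  # output-heavy: $0.138
```

Same total token count, more than a 3x difference in cost, purely from which side of the asymmetry the tokens land on.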
The second vector is decode latency. Prefill — the phase where the model reads your prompt and computes attention over every token — is parallelizable across the GPU. Decode, where the model emits the actual response one token at a time, is autoregressive and runs 20–400x slower per token than prefill (BentoML Inference Handbook). This is why time-to-first-token scales with input length while total response time scales with output length. An agent that retrieves a 30K-token document and produces a one-line answer pays prefill cost on the input but cheap decode on the output. An agent that produces a 4K-token plan pays the inverse — and decode latency is what users actually feel as slowness.
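A back-of-envelope model makes the two regimes visible. The throughput constants below are assumptions for illustration only; real prefill and decode rates vary widely by model, hardware, and batch size.

```python
# Assumed rates for illustration; the ratio (200x) sits inside the
# 20-400x per-token spread cited above.
PREFILL_TOK_PER_S = 10_000  # parallel: the prompt is processed in bulk
DECODE_TOK_PER_S = 50       # autoregressive: one output token at a time

def ttft_seconds(input_tokens: int) -> float:
    """Time to first token is dominated by prefill over the input."""
    return input_tokens / PREFILL_TOK_PER_S

def total_seconds(input_tokens: int, output_tokens: int) -> float:
    """Total response time adds the serial decode of every output token."""
    return ttft_seconds(input_tokens) + output_tokens / DECODE_TOK_PER_S

# Retrieval-heavy turn: 30K tokens in, one line out.
print(total_seconds(30_000, 40))    # ~3.8s, almost all of it prefill
# Plan-writing turn: small prompt, 4K-token answer.
print(total_seconds(2_000, 4_000))  # ~80.2s, almost all of it decode
```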
The third vector is prefill volume — and this is where agents accidentally light money on fire. Every tool result, every intermediate reasoning step, every cached scratchpad gets folded into the next prompt. By turn ten, the “input” might be 50K tokens of conversation history that the model re-prefills on every call. On GPT-5.5, crossing 272K input tokens triggers long-context billing of 2x input and 1.5x output for the entire session (OpenAI Pricing). A single accidental tool dump can re-price the rest of the agent loop.
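A sketch of the compounding, using the GPT-5.5 figures above. One simplification to flag: this version applies the long-context surcharge per call once the threshold is crossed, while the quoted rule re-prices the entire session, so the real penalty can be larger than what this computes.

```python
# Quoted GPT-5.5 figures: $5/$30 per million tokens, with 2x input and
# 1.5x output pricing above 272K input tokens.
IN_PRICE, OUT_PRICE = 5.00, 30.00
LONG_CTX_THRESHOLD = 272_000

def loop_cost(turns: int, tool_dump_tokens: int, output_tokens: int) -> float:
    history, total = 0, 0.0
    for _ in range(turns):
        history += tool_dump_tokens  # prior turns are re-prefilled every call
        over = history > LONG_CTX_THRESHOLD
        in_mult, out_mult = (2.0, 1.5) if over else (1.0, 1.0)
        total += (history * IN_PRICE * in_mult
                  + output_tokens * OUT_PRICE * out_mult) / 1_000_000
    return total

print(loop_cost(turns=10, tool_dump_tokens=5_000, output_tokens=500))   # ~$1.53
print(loop_cost(turns=10, tool_dump_tokens=30_000, output_tokens=500))  # ~$9.91; turn 10 crosses the threshold
```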
The geometry, then, is not “make the prompt shorter.” It is: keep output disciplined, keep the prefix stable, and choose model tier per turn — small models for routing, mid models for reasoning, frontier models for the one step where reasoning matters most.
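In code, per-turn tiering can be as blunt as a lookup. The mapping below is an illustrative policy, not a prescription; the turn categories are whatever your agent loop can already distinguish.

```python
# Illustrative tier policy: route cheap, reason mid, escalate rarely.
def pick_model(turn_kind: str) -> str:
    if turn_kind == "route":   # classify intent, pick a tool
        return "claude-haiku-4.5"
    if turn_kind == "reason":  # multi-step tool use, drafting
        return "claude-sonnet-4.6"
    return "claude-opus-4.7"   # the one step where depth matters most
```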
The Caching Layer That Looks Free Until It Isn’t
Caching is the lever everyone reaches for, and the promise is real — Anthropic cache reads cost 0.1x base input price (Anthropic Docs), and OpenAI reports up to 80% latency reduction on cached prefixes (OpenAI Docs). What the marketing pages do not surface is that caches fail silently. The request still completes. The bill just doesn’t go down.
What are the technical limitations of prompt caching and semantic caching for agents?
Prompt caching has hard preconditions. Anthropic requires a minimum of 4,096 tokens before Opus 4.5+ and Haiku 4.5 will cache anything; Sonnet 4.6 needs 2,048; older models needed only 1,024 (Anthropic Docs). Below the floor, your request runs normally and is silently not cached. OpenAI’s automatic caching has a 1,024-token minimum and matches in 128-token increments (OpenAI Docs). The cache also requires an exact-match prefix — a single timestamp, request ID, or tool-call UUID injected near the front of the prompt invalidates everything behind it. Agents that prepend volatile state to the system prompt break the cache on every turn while still paying the cache-write surcharge of 1.25x to 2x base input price.
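The fix is structural, not clever: everything volatile goes after everything stable. Below is a minimal sketch of cache-friendly message assembly; the field names follow common chat-API shapes but are not any specific SDK.

```python
import time

def build_messages(history: list[dict], user_turn: str) -> list[dict]:
    system = {
        "role": "system",
        # Stable: byte-identical on every call, so the prefix can cache.
        "content": "You are a support agent. Tools: search, ticket, refund.",
    }
    # WRONG: appending volatile state here invalidates the entire prefix.
    # system["content"] += f"\nCurrent time: {time.time()}"
    # RIGHT: volatile values ride in the final, never-cached turn instead.
    final = {"role": "user",
             "content": f"[time={int(time.time())}] {user_turn}"}
    return [system, *history, final]
```

Every byte before the first volatile value is cacheable; everything after it is not. That single ordering rule is most of prompt-cache hygiene.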
Semantic caching tries to escape exact-match by storing embeddings of past prompts and returning past responses when a new prompt is similar enough. The trade-off is mathematical, not an engineering choice. Below a similarity threshold of around 0.8, hit rates rise but false positives degrade output accuracy (arXiv 2411.05276 (Regmi et al.)); above 0.8, hits collapse to a few percent. Hit rate is also workload-dependent: code queries see 40–60% hit rates because syntax is repetitive, while conversational queries see 5–15% because users phrase the same intent fifty different ways (arXiv 2509.17360 (Asteria)).
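To see where the threshold trade-off lives, here is a toy lookup. The embedding step is assumed to happen elsewhere, and 0.8 is the threshold discussed above, not a constant any library ships with.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []  # (embedding, response)

    def store(self, query_emb: list[float], response: str) -> None:
        self.entries.append((query_emb, response))

    def lookup(self, query_emb: list[float]) -> str | None:
        best_sim, best_resp = -1.0, None
        for emb, resp in self.entries:
            sim = cosine(query_emb, emb)
            if sim > best_sim:
                best_sim, best_resp = sim, resp
        # Lowering the threshold raises hit rate and false positives together;
        # raising it trades the false positives away for misses.
        return best_resp if best_sim >= self.threshold else None
```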
Worse, cache hits in agents are not benign. A recent study found cache-hit accuracy significantly below cache-miss accuracy — the false positives systematically degrade agent reliability (arXiv 2506.14852 (Agentic Plan Caching)). Long multi-turn histories amplify this: two unrelated sessions with similar conversational scaffolding can collide in embedding space, and the agent returns a plan built for a different user’s problem.
Not a glitch. An emergent property of similarity-based retrieval.
The mechanism is geometric. Embedding space compresses semantic meaning, but it also compresses noise. Two requests that look 0.87 similar to an embedding model can have different intents, different tool requirements, and different correct answers. The cache cannot tell the difference. It only knows which point in latent space is closer.
Compatibility notes for 2026:
- Anthropic prompt caching minimums shifted upward. Opus 4.5+ and Haiku 4.5 require 4,096 tokens; Sonnet 4.6 requires 2,048. Older code targeting 1,024 will silently skip caching on the newer models (see the floor-check sketch after this list).
- Anthropic workspace isolation. As of Feb 5, 2026, caches are isolated per workspace on Claude API, AWS Bedrock, and Microsoft Foundry. Cross-workspace cache sharing no longer works — re-architect any pooled-key setup that relied on it.
- GPT-5.5 long-context billing. Prompts above 272K input tokens are billed at 2x input and 1.5x output for the entire session. Trivially triggered by accumulated tool dumps in long agent loops.
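A floor check against those minimums is cheap insurance. This is a hedged sketch: the model strings are illustrative, so verify the exact identifiers and current floors against the provider docs.

```python
# Minimums quoted in the notes above; values are a snapshot, not a live feed.
CACHE_MIN_TOKENS = {
    "claude-opus-4.5": 4_096,   # stands in for Opus 4.5 and later
    "claude-haiku-4.5": 4_096,
    "claude-sonnet-4.6": 2_048,
}

def will_cache(model: str, prefix_tokens: int) -> bool:
    # Below the floor the request still succeeds, it is just never cached,
    # which is why this failure shows up only on the invoice.
    return prefix_tokens >= CACHE_MIN_TOKENS.get(model, 1_024)
```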

What the Geometry Predicts
The three mechanisms above are not independent. They compose, and the composition produces predictable failure modes that turn passive understanding into something operational.
- If output tokens dominate your bill, the lever is model routing, not prompt trimming. Reasoning happens during decode; cutting input by half changes nothing about the cost of a 4K-token response.
- If TTFT is your latency complaint, the lever is prefill volume and caching — not output speed.
- If your cache hit rate is unexpectedly low, look at what changes between calls. A timestamp, a session ID, a per-user tool-call sequence — anything volatile at the prefix kills the cache.
- If your semantic cache hit rate looks healthy but quality complaints rise, the false-positive trap is active. You need Agent Evaluation And Testing that compares cached responses against fresh generations, not just an aggregate quality metric; a shadow-check sketch follows this list.
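For the last item, a shadow-evaluation sketch: on a small sample of cache hits, pay for a fresh generation anyway and compare. The `generate` and `judge_equivalent` callables and the 5% sample rate are placeholders for your own stack, not a known API.

```python
import random

def shadow_check(prompt, cached, generate, judge_equivalent,
                 sample_rate: float = 0.05):
    """Return the cached answer, but audit a sample of hits for drift."""
    if random.random() < sample_rate:
        fresh = generate(prompt)  # pay once, on purpose, to measure drift
        if not judge_equivalent(prompt, cached, fresh):
            print(f"semantic-cache false positive: {prompt[:60]!r}")
    return cached
```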
You also need infrastructure the optimization quietly assumes. Without Agent Observability, you cannot see which calls hit the cache, which calls trigger long-context billing, or which tools amplify prefill. Without Agent Guardrails that cap tool-call depth, a runaway loop can multiply token cost before any human notices. Without Agent Error Handling And Recovery, retries on transient failures silently double cost — and uncached retries triple it. For decisions where wrong-but-fast is worse than slow-and-correct, leave Human In The Loop For Agents review in place; cache-driven speedups should not bypass the checkpoints that exist for accuracy reasons.
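A minimal guardrail for the runaway-loop case: hard caps on tool-call depth and dollar spend, checked before every model call. The limits and the exception type are illustrative, not a standard interface.

```python
class BudgetExceeded(Exception):
    pass

class RunBudget:
    def __init__(self, max_tool_calls: int = 20, max_usd: float = 2.00):
        self.max_tool_calls, self.max_usd = max_tool_calls, max_usd
        self.tool_calls, self.spent_usd = 0, 0.0

    def charge(self, usd: float, is_tool_call: bool = False) -> None:
        """Record one call's cost; raise before the loop can compound it."""
        self.tool_calls += int(is_tool_call)
        self.spent_usd += usd
        if self.tool_calls > self.max_tool_calls or self.spent_usd > self.max_usd:
            raise BudgetExceeded(
                f"{self.tool_calls} tool calls, ${self.spent_usd:.2f} spent")
```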
Rule of thumb: Cost shrinks when you change which model handles which turn, not when you tighten the wording of any single prompt.
When it breaks: Caching collapses when prefixes are unstable, when the cache minimum is not met, or when semantic similarity quietly substitutes a wrong answer for a right one. None of these failures appear in the response — only on the invoice or in the quality regression that nobody traces back to the cache.
The Data Says
The cost surface of an agent is shaped by three asymmetries: output tokens cost several times more than input, decode latency runs 20–400x slower per token than prefill, and caches return value only when their preconditions hold exactly. Prompts that ignore these asymmetries get expensive in ways the prompt itself cannot reveal.