Prompt Caching

Also known as: prefix caching, context caching, KV cache reuse

Prompt Caching: Prompt caching is an API feature that stores processed attention states of repeated prompt prefixes — system prompts, large documents, or conversation history — so subsequent calls reuse them instead of reprocessing, reducing input token costs by up to 90%.

Prompt caching stores processed attention states of system prompts and documents so subsequent API calls reuse them instead of reprocessing — cutting input token costs by up to 90%.

What It Is

Every time you call an LLM API, the model processes two phases: a prefill phase — reading and computing attention across all input tokens — and a decoding phase that generates the response. The prefill phase is where input-side cost and latency accumulate. For most production workflows, a substantial chunk of every call is identical: the same system prompt, the same loaded document, the same growing conversation history.

Prompt caching exploits this repetition. Instead of rerunning the prefill computation for the static portion of your prompt, the API stores the resulting attention states — the key-value (KV) pairs the model computed during prefill — and reuses them on subsequent calls. Think of it like caching compiled object files: the first build pays full compilation cost, but subsequent builds skip recompilation for unchanged files entirely.

According to arXiv KV Cache, KV pairs encode what every token in the prefix “knows” about every other token — the full attention context for the cached portion. Reusing them is mathematically equivalent to recomputing them. The model doesn’t approximate; it receives the exact internal representation it would have computed fresh.

The cost structure rewards repetition. The first call writes the KV states at a write fee. Every subsequent call within the cache window reads from that stored state at a fraction of the base token price. This makes prompt caching the most direct cost lever in optimizing LLM API spend — the discount scales with call volume and context length. Long, frequently reused contexts see the largest savings.

Anthropic’s implementation: According to Anthropic Docs, Anthropic offers 5-minute and 1-hour TTL options, with per-model minimum cacheable token thresholds. The API uses a cache_control field at the request level or at explicit content-block breakpoints. Since February 2026, caches are isolated per workspace.

OpenAI’s implementation: According to OpenAI Docs, caching is automatic once the prompt exceeds 1,024 tokens and produces hits in 128-token increments. The standard TTL is 5–10 minutes, with longer windows available for some newer models.

Google’s implementation: According to Google AI Docs, Gemini context caching charges storage time and retrieval separately, with per-hour storage costs in addition to the per-token retrieval price.

How It’s Used in Practice

The highest-impact scenario: a shared system prompt that every user session reloads. If your application sends the same large system prompt on every API call, caching it means every call after the first reads from the stored state instead of processing from scratch. In an agentic workflow where the same instructions fire dozens of times per task, the savings compound across the entire run.

A second common pattern: loading a reference document once per session. If your workflow attaches a product manual, codebase context, or policy document to each API call, that document is an ideal cache candidate — it changes rarely and is typically large.

Pro Tip: Cache hits depend on exact prefix matching. Structure your prompts so everything stable — system instructions, documents, few-shot examples — comes before everything variable (the user message, dynamic data). Inserting even a single token into the stable portion forces a full cache re-write and resets your savings for that window.

When to Use / When Not

Scenario	Use	Avoid
Long system prompt reused across many user sessions	✅
One-off single API call with a unique prompt		❌
Large reference document attached to every request	✅
Prompt prefix that changes significantly per call		❌
Multi-turn conversation with a stable system prompt	✅
Short prompts under the provider’s minimum token threshold		❌

Common Misconception

Myth: Prompt caching saves the raw text of your prompt and replays it on subsequent calls.

Reality: The cache stores KV attention states — the mathematical representations the model computed during the prefill phase. The text is never re-sent or replayed. The model skips the computation phase entirely and receives the pre-computed attention context directly, producing results identical to a full recompute.

One Sentence to Remember

If your workflow sends the same context repeatedly, prompt caching is the most direct cost lever available — the first call writes the cache, and every call within the time window reads back at a fraction of the price.

FAQ

Q: Does prompt caching change the quality or accuracy of model responses? A: No. Cached attention states are mathematically identical to freshly computed ones. The model produces the same output as if it had processed the full prompt from scratch on every call.

Q: How long does a cached prompt stay valid? A: According to Anthropic Docs, Anthropic offers a 5-minute or 1-hour TTL depending on the write option chosen. According to OpenAI Docs, OpenAI’s standard TTL is 5–10 minutes for most models, with extended windows for some newer ones.

Q: Does caching work across different users or API keys? A: According to Anthropic Docs, since February 2026 Anthropic isolates caches per workspace — not per organization. Two users in different workspaces don’t share a cache even with identical prompts.

Sources

Anthropic Docs: Claude API — Prompt Caching - implementation guide with TTL options, minimum token thresholds, and workspace isolation details
arXiv KV Cache: KV Cache Optimization Strategies for Scalable and Efficient LLM Inference - mechanism behind KV attention state storage and reuse

Expert Takes

MONA

The cache stores KV attention pairs — the intermediate mathematical representations from the transformer’s prefill phase. These pairs encode what each token knows about every other token in the prefix. Reusing them means the model skips the prefill computation entirely for the cached portion — the latency and compute savings are real, not approximations. The mechanism is identical to what runs inside a single model call; prompt caching externalizes it between calls.

MAX

If you’re building any workflow where the same context appears across multiple agent calls — shared instructions, reference documents, tool schemas — caching that prefix is the single most impactful cost-reduction move available. Structure your prompts so stable content sits at the top and variable content follows. According to Anthropic Docs, each model has a minimum cacheable token threshold; anything shorter won’t qualify and the write fee will be charged with no benefit.

DAN

The race to zero inference cost just hit a concrete inflection point. Deep discounts on repeated context are not a feature — they’re a pricing signal. Providers who don’t offer them lose multi-agent deployments where the same system prompt fires thousands of times per day. Caching is now table stakes for any serious API budget.

ALAN

Prompt caching is efficient, but it creates an invisible layer between your instruction and the model’s processing. When you cache a system prompt containing policy rules or safety guardrails, you’re trusting that a stored computation from minutes ago still accurately represents your current intent. If your policy changes between calls, stale cache can mean stale compliance — a timing gap most teams never track.

Back to Glossary