KV Cache

Also known as: Key-Value Cache, KV-Cache, Key Value Cache

A memory optimization technique in transformer models that stores previously computed attention keys and values during text generation, eliminating redundant computation and reducing the time to produce each new token.

What It Is

When a language model generates a response — whether answering your question in a chatbot or autocompleting code in an editor — it produces text one token at a time. Each new token requires the model to reference everything that came before it. Without optimization, the model would recalculate the same intermediate values for every previous token, each time it generates a new one. That’s like rereading an entire book from page one every time you turn to the next page.

KV cache solves this by storing two specific sets of values from the model’s attention mechanism: the “keys” and “values” for each token already processed. Think of it as a running set of notes. Instead of re-reading every previous page, the model checks its notes and only does fresh work on the newest token. This is what makes real-time conversations with AI assistants feel responsive rather than painfully slow.
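
To make the mechanism concrete, here is a minimal single-head attention sketch in PyTorch; the dimensions, weights, and function name are toy placeholders rather than any real model's internals. Each step projects only the newest token, appends its key and value to the cache, and attends over everything cached so far.

```python
import torch

torch.manual_seed(0)
d_model = 16

# Toy projection matrices standing in for one attention head's learned weights.
W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

k_cache, v_cache = [], []  # the "notes": one key and one value per token seen so far

def generate_step(x_new):
    """Attend from the newest token; reuse cached keys/values for all earlier ones."""
    q = x_new @ W_q              # fresh work: query for the new token only
    k_cache.append(x_new @ W_k)  # cache its key ...
    v_cache.append(x_new @ W_v)  # ... and its value
    K = torch.stack(k_cache)     # (tokens_so_far, d_model), never recomputed
    V = torch.stack(v_cache)
    weights = torch.softmax(q @ K.T / d_model**0.5, dim=-1)
    return weights @ V           # attention output for the new token

# One call per generated token; earlier tokens are never re-projected.
for _ in range(5):
    out = generate_step(torch.randn(d_model))  # stand-in for the new token's embedding
```

Without the cache, every step would have to re-project all earlier tokens before attending over them.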

KV cache matters especially for decoder-only architectures — the design behind GPT, Claude, and Llama. These models generate text purely left-to-right, where every token depends on all previous tokens, making the cache critical for practical speed. Encoder-decoder models can process the input side in parallel, but their decoder still needs KV cache for generation. Decoder-only models, by unifying everything into one sequential pass, lean on KV cache even more heavily. This is partly why so much of the research into decoder-only scaling focuses on making the cache smaller and cheaper to maintain.

The trade-off is memory. KV cache grows with both sequence length and model size. For long conversations or large documents, the cache can consume gigabytes of GPU memory. This has driven innovations like Grouped-Query Attention (GQA) and Multi-Head Latent Attention (MLA), both designed to shrink the KV cache without hurting output quality. According to Ainslie et al., GQA uses fewer key-value heads than query heads and can be uptrained from a standard multi-head attention checkpoint with only about 5% of the original pre-training compute. According to the DeepSeek-V2 paper, MLA achieved a 93.3% reduction in KV cache size compared to its predecessor.
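
To see why the memory cost adds up, a back-of-the-envelope estimate helps: per sequence, the cache holds roughly 2 x layers x KV heads x head dimension x sequence length elements. The shapes below are illustrative, loosely modeled on a 7B-parameter model in fp16, not measurements from any specific deployment.

```python
# Rough per-sequence KV cache size: 2 (keys and values) x layers x KV heads
# x head dimension x sequence length x bytes per element.
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative 7B-class shapes in fp16 at a 4,096-token context.
full_mha = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096)
gqa_8kv  = kv_cache_bytes(layers=32, kv_heads=8,  head_dim=128, seq_len=4096)

print(f"Full multi-head attention: {full_mha / 2**30:.1f} GiB per sequence")  # ~2.0 GiB
print(f"GQA with 8 KV heads:       {gqa_8kv / 2**30:.1f} GiB per sequence")   # ~0.5 GiB
```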

How It’s Used in Practice

When you use an AI assistant and notice that the first few words of a response arrive slowly but the rest streams quickly, you’re seeing KV cache in action. The initial delay happens because the model processes your entire prompt, building the cache from scratch. Once that cache exists, each new token only needs to be processed against the stored keys and values, which is much faster.
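
The two phases are visible if you drive a model step by step with the Hugging Face transformers API; the checkpoint and prompt below are placeholders chosen for illustration. The first call processes the full prompt and returns the cache; every later call feeds only one new token plus that cache.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model and prompt; any decoder-only checkpoint behaves the same way.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
input_ids = tokenizer("KV caching makes generation", return_tensors="pt").input_ids

with torch.no_grad():
    # Prefill: process the whole prompt once and build the cache (the slow first token).
    out = model(input_ids, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)

    # Decode: each later step feeds only one token plus the cached keys/values (fast).
    for _ in range(10):
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
```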

For teams running AI-powered applications, KV cache behavior directly affects cost and latency. Longer prompts mean larger caches and higher memory usage per request. This is why many API providers charge differently for input tokens (which build the cache) versus output tokens (which read from it). Understanding this distinction helps when deciding between one long conversation thread and several shorter exchanges.

Pro Tip: If your AI application feels slow during generation, figure out whether the bottleneck is prompt processing (building the cache) or token generation (reading from it). Long system prompts repeated on every request are a common source of unnecessary cache rebuilding. Some providers offer prompt caching features that persist the KV cache across requests, saving both time and cost.

When to Use / When Not

Scenario | Use | Avoid
Long multi-turn conversations where response latency matters | ✓ |
Offline batch processing where speed is secondary to throughput | | ✓
Production API serving with many concurrent users | ✓ |
Single-token classification tasks with no sequential generation | | ✓
Models processing long documents within large context windows | ✓ |

Common Misconception

Myth: KV cache stores the model’s “memory” of past conversations, like a database that persists between sessions. Reality: KV cache is temporary. It exists only during a single generation pass and holds mathematical representations (key and value tensors) from the attention mechanism, not semantic memories. When a request ends, the cache is discarded. What feels like “memory” in multi-turn chats comes from re-sending the conversation history in the prompt, which rebuilds the cache from scratch each time.

One Sentence to Remember

KV cache is the reason AI models generate text at conversational speed — it trades GPU memory for the ability to skip redundant computation on every token, and the race to shrink it is shaping how decoder-only architectures evolve.

FAQ

Q: Does KV cache make AI models more accurate or just faster? A: Just faster. KV cache produces mathematically identical outputs to running without it. The optimization eliminates redundant computation without changing the results.
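
A quick way to convince yourself of this, sketched with a small public checkpoint (gpt2 here is just a convenient stand-in), is to compare the last-token logits computed with and without the cache:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any small causal LM works here
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
ids = tokenizer("The cache changes speed, not answers", return_tensors="pt").input_ids

with torch.no_grad():
    # Full pass with no cache vs. cached prefill plus one cached decode step.
    no_cache = model(ids, use_cache=False).logits[:, -1]
    prefill = model(ids[:, :-1], use_cache=True)
    cached = model(ids[:, -1:], past_key_values=prefill.past_key_values).logits[:, -1]

print(torch.allclose(no_cache, cached, atol=1e-4))  # True: same output, less work
```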

Q: Why do longer prompts cost more with AI APIs? A: Longer prompts require building a larger KV cache, consuming more GPU memory and compute per request. Many providers price input tokens higher to reflect this overhead.

Q: What happens to the KV cache between separate API requests? A: It’s discarded by default. Each new request rebuilds the cache from scratch, though some providers now offer prompt caching to persist frequently used cache segments across calls.

Sources

Ainslie et al., "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints" (2023)
DeepSeek-AI, "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model" (2024)

Expert Takes

KV cache is a direct consequence of the autoregressive constraint. In decoder-only models, each token attends to all predecessors, so a naive generation step that reprocesses the whole context costs quadratic time in sequence length. The cache makes each step linear by persisting key-value projections across steps. Recent compression methods like Grouped-Query Attention and Multi-Head Latent Attention cut the cache footprint by as much as an order of magnitude, creating a design space where memory efficiency shapes architectural choices as much as parameter count does.

Most developers interact with KV cache without knowing it. That slow first-token latency in your API call? That’s cache construction. The faster streaming afterward? That’s cache reads. If you’re building applications with long system prompts, you’re paying to rebuild that cache on every request. Prompt caching features exist for exactly this reason — reusing stored key-value pairs across calls to cut both latency and cost. Understand the mechanism, and you’ll make better engineering decisions about prompt design.

KV cache efficiency is becoming a competitive differentiator in the AI serving market. Companies handling millions of concurrent requests live and die by how much memory each request consumes. Smaller caches mean more users per GPU, lower serving costs, and faster responses. The architectural innovations compressing cache size — GQA in Llama, MLA in DeepSeek — are not academic exercises. They are the engineering choices that determine which providers can offer the strongest price-to-performance ratio at scale.

There is something worth examining about how KV cache shapes the economics of access. Longer conversations require more memory, and memory costs money. This creates a quiet incentive structure: keep interactions short, limit context windows, or charge more for extended reasoning. The technical constraint becomes a business boundary. When we ask who gets access to AI and at what depth, cache size is one of the invisible gatekeepers deciding whether a conversation continues or gets cut off.