Paged Attention
Also known as: PagedAttention, paged KV cache, vLLM paging
Paged Attention is a memory management algorithm that splits the key-value cache used during LLM inference into fixed-size blocks, dramatically reducing GPU memory waste and allowing servers to handle more requests at once.
What It Is
Every time a large language model generates a response, it builds a temporary data structure called a key-value (KV) cache — a running record of the tokens processed so far. The model needs this cache to avoid recomputing earlier work at each generation step. The problem: traditional memory allocation reserves one large, contiguous block of GPU memory per request, sized for the maximum possible sequence length. Most requests never fill that block. The result is vast stretches of reserved-but-unused GPU memory, which directly limits how many users a single server can handle — a central constraint in the engineering economics of LLM inference.
Paged Attention solves this by borrowing a well-tested idea from operating systems: virtual memory paging. Instead of one continuous memory slab per request, the KV cache is divided into small, fixed-size pages that can be stored anywhere in GPU memory. Think of it like a library that shelves books wherever there’s space and keeps an index card to track each one, rather than reserving an entire bookcase per visitor. A page table maps logical cache positions to their physical locations, so the model accesses them the same way it would a contiguous block — but the underlying storage is scattered and tightly packed.
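The page-table idea can be made concrete with a toy allocator. This is a minimal sketch, not vLLM's actual implementation: the class name, block size, and methods are invented for illustration. Blocks are handed out on demand as a request grows, so the only slack is the tail of one partially filled block.

```python
BLOCK_SIZE = 16  # tokens per block (illustrative; real systems tune this)

class PagedKVCache:
    """Toy page table mapping each request's logical token positions to
    physical blocks. Blocks are allocated on demand, so a request never
    reserves more than one partially filled block of slack."""

    def __init__(self, num_physical_blocks):
        self.free_blocks = list(range(num_physical_blocks))
        self.page_table = {}   # request_id -> [physical block ids]
        self.lengths = {}      # request_id -> tokens cached so far

    def append_token(self, request_id):
        """Reserve cache space for one more token, grabbing a new
        physical block only when a block boundary is crossed."""
        n = self.lengths.get(request_id, 0)
        if n % BLOCK_SIZE == 0:  # current block is full (or none exists yet)
            if not self.free_blocks:
                raise MemoryError("out of KV-cache blocks")
            self.page_table.setdefault(request_id, []).append(self.free_blocks.pop())
        self.lengths[request_id] = n + 1

    def physical_slot(self, request_id, pos):
        """Translate a logical token position to (physical block, offset) --
        the indirection that lets storage be scattered."""
        block = self.page_table[request_id][pos // BLOCK_SIZE]
        return block, pos % BLOCK_SIZE

    def free(self, request_id):
        """Return a finished request's blocks to the pool."""
        self.free_blocks.extend(self.page_table.pop(request_id, []))
        self.lengths.pop(request_id, None)
```

For example, a request that has generated 20 tokens holds exactly two 16-token blocks, and freeing it returns both to the pool immediately, regardless of where they sit in physical memory.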
According to the PagedAttention Paper, Kwon et al. introduced this technique at SOSP 2023, drawing directly on the virtual memory abstraction that operating systems have used for decades. According to the vLLM Blog, PagedAttention reduced KV cache memory waste from 60-80% to under 4%. That reclaimed memory translates directly into throughput: the same GPU can serve more concurrent users. For anyone studying why inference costs remain among the hardest engineering constraints in LLM deployment — the memory wall problem — KV cache management is the core battleground.
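A back-of-the-envelope calculation shows where waste figures of this magnitude come from. The numbers below are made up for illustration, not taken from the paper: a contiguous allocator reserves for the maximum sequence length, while a paged allocator rounds up only to the next block.

```python
# Illustrative waste comparison (numbers are assumptions, not benchmarks).
MAX_SEQ_LEN = 2048   # contiguous allocation reserves for the worst case
ACTUAL_LEN = 407     # a typical request uses far fewer tokens
BLOCK_SIZE = 16      # paged allocation rounds up to one block

contiguous_reserved = MAX_SEQ_LEN
paged_reserved = -(-ACTUAL_LEN // BLOCK_SIZE) * BLOCK_SIZE  # ceiling division

contiguous_waste = 1 - ACTUAL_LEN / contiguous_reserved  # ~80% idle
paged_waste = 1 - ACTUAL_LEN / paged_reserved            # ~2% idle
```

With these assumed lengths, the contiguous scheme leaves roughly 80% of the reservation idle, while paging wastes at most one block's worth of slack per request, which is the shape of the 60-80% versus under-4% gap the vLLM team reports.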
Paged Attention also borrowed another operating system trick: copy-on-write. When multiple requests share the same prompt prefix — common in batch processing and parallel sampling — the shared portion of the KV cache is stored only once, with each request referencing the same physical pages. According to the vLLM Blog, this can cut memory usage by up to 55% for parallel sampling workloads. In the context of memory walls and inference scaling, these savings compound: less waste per request plus shared storage across requests means the gap between theoretical hardware capacity and actual usable throughput shrinks substantially.
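The copy-on-write mechanism can also be sketched with a toy reference counter. Again, the class and method names here are invented for illustration: shared prefix blocks carry a reference count, and a block is physically copied only at the moment one request tries to write to a block that others still share.

```python
# Toy copy-on-write for shared prompt prefixes (illustrative sketch only).
class SharedBlocks:
    def __init__(self):
        self.refcount = {}  # physical block id -> number of referencing requests

    def share(self, block_ids):
        """A new request reuses an existing prefix's physical blocks:
        no data is copied, only reference counts go up."""
        for b in block_ids:
            self.refcount[b] = self.refcount.get(b, 0) + 1
        return list(block_ids)  # the new page table points at the same blocks

    def write(self, block_id, alloc_fn):
        """Before mutating a block, copy it only if someone else shares it."""
        if self.refcount.get(block_id, 1) > 1:
            self.refcount[block_id] -= 1
            new_block = alloc_fn()        # allocate a private copy
            self.refcount[new_block] = 1
            return new_block
        return block_id                   # sole owner: write in place
```

Two parallel samples sharing a prompt therefore store the prefix once; the first divergent token triggers a copy of just the affected block, not the whole cache.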
How It’s Used in Practice
The most common place you encounter Paged Attention is inside LLM serving frameworks. If your organization runs self-hosted models through vLLM, SGLang, or TensorRT-LLM, Paged Attention (called “paged KV cache” in NVIDIA’s terminology) operates behind the scenes. It enables continuous batching — the ability to add new requests to a running batch without waiting for all current requests to finish — which directly affects how quickly users receive their first tokens.
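The link between paging and continuous batching can be shown with a toy scheduler step. This is a sketch under simplifying assumptions, not any framework's real scheduler: because memory is tracked in blocks, the server can admit a waiting request the moment enough free blocks exist, instead of waiting for the whole batch to drain.

```python
# Illustrative continuous-batching admission step (not a real scheduler).
def admit(running, waiting, free_blocks, blocks_needed):
    """At each decode step, admit waiting requests while free blocks last."""
    while waiting and free_blocks >= blocks_needed(waiting[0]):
        req = waiting.pop(0)
        free_blocks -= blocks_needed(req)  # reserve this request's blocks
        running.append(req)
    return running, waiting, free_blocks
```

With 5 free blocks and three waiting requests needing 2 blocks each, two are admitted immediately and the third waits for the next request to finish and release its blocks.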
Even if you never touch infrastructure, Paged Attention shapes the experience of every AI product built on these serving stacks. The responsiveness you feel when using an AI coding assistant or chatbot depends in part on whether the backend uses paged memory to serve more users per GPU.
Pro Tip: When evaluating LLM serving frameworks, confirm they support paged KV caches. This is not optional for production workloads — without it, you hit memory walls at a fraction of the concurrent users your hardware could handle. It is the single biggest determinant of how far your GPU budget stretches.
When to Use / When Not
| Scenario | Use | Avoid |
|---|---|---|
| Serving LLMs to multiple concurrent users in production | ✅ | |
| Running a single local model for personal experimentation | | ❌ |
| Long-context requests (large documents, full code repositories) | ✅ | |
| Batch offline processing with no latency requirements | | ❌ |
| Production API with variable-length inputs from many clients | ✅ | |
| Small models on CPU where GPU memory is not a constraint | | ❌ |
Common Misconception
Myth: Paged Attention makes the model itself faster — it speeds up the actual computation of attention. Reality: Paged Attention does not change how attention is computed. It changes how memory is allocated for the KV cache. The speed gains come from fitting more requests into the same GPU memory, not from faster math. The computation is identical; only the memory layout improves.
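The "identical computation" point can be verified numerically. The sketch below (toy sizes, invented page table) computes attention once over a contiguous KV cache and once over the same keys and values scattered across physical blocks and gathered back through a page table; the results match exactly.

```python
import numpy as np

np.random.seed(0)
d, n, BLOCK = 4, 8, 2          # toy head dim, cached tokens, block size
q = np.random.randn(d)
k = np.random.randn(n, d)      # keys for 8 cached tokens, contiguous
v = np.random.randn(n, d)      # values, contiguous

def attend(q, K, V):
    """Plain softmax attention for a single query vector."""
    w = np.exp(q @ K.T)
    w /= w.sum()
    return w @ V

# Paged layout: the same rows scattered across physical blocks,
# recovered through a page table before the identical math runs.
phys_k = np.zeros((6, BLOCK, d))
phys_v = np.zeros((6, BLOCK, d))
page_table = [5, 2, 0, 3]      # logical block i -> physical block
for i, b in enumerate(page_table):
    phys_k[b] = k[i * BLOCK:(i + 1) * BLOCK]
    phys_v[b] = v[i * BLOCK:(i + 1) * BLOCK]

K_gathered = np.concatenate([phys_k[b] for b in page_table])
V_gathered = np.concatenate([phys_v[b] for b in page_table])

assert np.allclose(attend(q, k, v), attend(q, K_gathered, V_gathered))
```

Real kernels index into the blocks directly rather than materializing a gathered copy, but the arithmetic is the same either way: only the addressing changes.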
One Sentence to Remember
Paged Attention treats GPU memory like a filing system instead of a warehouse — small, indexed pages rather than huge reserved blocks — and that single change is what lets one GPU serve dozens of users instead of a handful.
FAQ
Q: Does Paged Attention work with all LLM architectures? A: It works with transformer-based models that use key-value caches during generation. Most modern LLMs fit this description, including decoder-only architectures like GPT-style and Llama-style models.
Q: Do I need to configure Paged Attention manually? A: In most serving frameworks like vLLM and SGLang, paged KV caching is enabled by default. You typically don’t need to set it up separately — the framework handles allocation automatically.
Q: How does Paged Attention relate to context window limits? A: Longer context windows produce larger KV caches. Paged Attention makes those larger caches practical by eliminating memory waste, which is why it is central to serving models with extended context lengths.
Sources
- PagedAttention Paper: Efficient Memory Management for LLM Serving with PagedAttention - Original paper introducing the algorithm, published at SOSP 2023
- vLLM Blog: vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention - Technical overview with performance benchmarks from the vLLM team
Expert Takes
The insight behind Paged Attention is that KV cache allocations share the same fragmentation problem operating systems solved decades ago with virtual memory. The fix is structurally identical: indirection through a page table, non-contiguous physical storage, and on-demand allocation. What changed was not the algorithm — it was recognizing that GPU memory management for inference had regressed to pre-paging-era inefficiency. The correction was conceptual, not computational.
If you are setting up any production LLM serving stack, paged KV caches are non-negotiable. Every major framework now includes this by default, and it directly determines how many concurrent users your hardware supports. The practical difference between a paged and unpaged system is not marginal — it is the gap between hitting memory walls at a fraction of your capacity and actually using the GPU memory you paid for.
Paged Attention turned GPU memory from the tightest bottleneck in LLM inference into something manageable. Before it, scaling meant buying more hardware. After it, the same hardware handles significantly more traffic. For any organization running self-hosted models, this single optimization shapes the economics of the entire deployment — it is the reason serving costs dropped enough to make many AI products financially viable.
Efficiency gains in inference infrastructure tend to be treated as pure wins, but they carry a quieter consequence: they lower the barrier to deploying larger models with less scrutiny. When memory stops being the limiting factor, what replaces it as the constraint on what gets deployed, and who decides? The engineering problem was solved. The governance question that follows it remains wide open.