Context Compression

Also known as: context summarization, context pruning, context distillation

Context Compression
Context compression reduces the token count in an AI conversation by summarizing or removing older content, allowing long sessions to continue without exceeding the model’s fixed context window limit.

Context compression is a technique that trims or summarizes older parts of a conversation so the total token count stays within a model’s context window limit.

What It Is

Every AI model has a maximum context window — a cap on how many tokens (roughly, words and punctuation) it can process in a single request. In a short exchange, this limit is invisible. But in long debugging sessions, multi-step research, or agentic workflows — sequences where the model executes a chain of tool calls to complete a task — conversations grow. Eventually, the accumulated text — your messages, the model’s replies, system instructions, and tool outputs — pushes toward that ceiling.

Context compression is the answer to that problem. Instead of stopping the conversation or silently dropping the oldest messages, compression produces a condensed version of what came before. The model gets the essential information — key decisions, constraints, conclusions, open questions — without the token overhead of the full verbatim transcript.

Think of it like a meeting summary. After three hours of discussion, no one reads the full transcript. The action items document is what drives the next meeting. Context compression does the same: it distills a long exchange into something the model can act on, at a fraction of the token cost.

There are three main approaches:

Summarization generates a prose summary of earlier turns, capturing what was decided, what was tried, and what remains open. It preserves meaning but loses the exact wording of earlier exchanges.

Truncation simply cuts off the oldest messages. Fast and deterministic, but it discards content without considering its relevance. A critical constraint mentioned at turn 3 disappears at turn 50, even if it still applies.

Selective retention scores messages by relevance to the current task and keeps only the highest-scoring ones. More accurate than truncation, but it requires either a second model call (adding latency) or a simpler heuristic scoring function.

Most production systems combine these: summarize the oldest portion, retain recent turns verbatim, and always keep the system prompt (the standing instructions that define the model’s behavior and persona) intact. The specific balance depends on the use case, the model’s context window size, and how tolerant the task is to information loss.

In the context of context window management — the broader practice of staying within token limits across every LLM interaction — compression is one of the two main strategies. The other is chunking: splitting input into smaller pieces and processing them sequentially rather than all at once. Compression applies within a single conversation session; chunking applies to large documents or datasets processed in stages.

How It’s Used in Practice

The most common place you’ll encounter context compression is in AI coding assistants. When you start a session in a tool like Claude Code or Cursor, it begins accumulating context: the system prompt, your questions, code snippets, tool call results, file contents. After an extended session — debugging a complex feature across many files, for example — the tool often triggers compression automatically. You might notice a brief pause, then a message indicating the session is continuing with a summarized context. The thread of what you were working on stays intact, but the token count drops back to a manageable level.

In agentic pipelines — systems where an AI model loops through a series of tool calls to complete a long-horizon task — compression is essential. Each tool call appends its output to the context. Without compression, the context grows with every step and hits the ceiling before the task completes.

Pro Tip: If an AI assistant seems less aware of something you said early in a long session, compression likely dropped or summarized that exchange. Repeat critical constraints explicitly when you notice this — especially instructions about output format, persona, or scope that were set at the start of the conversation.

When to Use / When Not

ScenarioUseAvoid
Long-running chat session approaching the model’s token limit
Fresh conversation with minimal context accumulated so far
Agentic workflow accumulating tool call results across many steps
Task where exact wording of earlier instructions must be preserved verbatim
Reducing per-request API token costs on long or repeated sessions
Legal or compliance contexts requiring a full, unaltered conversation record

Common Misconception

Myth: Context compression permanently deletes your conversation history.

Reality: Compression only affects what gets sent to the model in each request. The full conversation history typically remains stored client-side or in a separate database. You can still access earlier messages — the model simply does not receive them verbatim once they have been compressed. What is lost is the model’s direct access to the original wording, not the data itself.

One Sentence to Remember

Context compression lets AI tools maintain coherence through long sessions by summarizing what came before — so the next request still fits inside the model’s token limit without starting the conversation from scratch.

FAQ

Q: Does context compression cause the AI to forget important details? A: It can, if the compression logic is unsophisticated. Well-designed compression retains key decisions, open constraints, and conclusions while trimming filler exchanges and reasoning threads that led nowhere.

Q: Is context compression automatic in tools like Claude or ChatGPT? A: Most AI assistants handle it without user input. Claude Code and similar tools trigger compression when a session approaches capacity, then continue without interruption — you may notice a brief pause as the summary generates.

Q: How is context compression different from truncation? A: Truncation cuts off the oldest messages based on position alone. Compression first summarizes them, preserving the substance and key decisions while reducing the token count — so the model retains meaning without retaining every word.

Expert Takes

Context compression is an information-theoretic problem. Every conversation carries signal (decisions, constraints, key findings) and noise (pleasantries, dead-end reasoning, repetition). Good compression maximizes the signal-to-token ratio. Summarization models learn to distinguish the two; truncation ignores the distinction entirely. The theoretical lower bound on useful compression is set by how much actual information the conversation holds — and for most sessions, that fits in far fewer tokens than the raw transcript.

In specification-driven pipelines, context compression is a reliability concern, not just a cost concern. A compressed context that drops a constraint from the system prompt causes behavior drift — the model starts violating rules it followed earlier. The reliable pattern: pin system-prompt content as ineligible for compression, summarize only the turn history. Trigger compression well before the context ceiling, not at the very end, so there is room for the summary itself.

Context compression is where token economics become real. Every uncapped session costs more as it grows — each request resends the entire accumulated context. Compression cuts that cost directly. For teams running AI features at scale, it is the difference between a prototype that works and a product that stays affordable. The teams ignoring compression now will hit usage bill shock the moment their user base grows.

Context compression raises an uncomfortable question: who decides what matters enough to keep? The model or the compression system makes that choice, and it is never neutral. Therapeutic conversations, legal discussions, statements of consent — these contain content where forgetting a detail is not a minor inconvenience. The efficiency gain is real, but “lossy” and “inconsequential” are not synonyms. Every conversation compressed is a conversation where the system chose your history for you.