Prompt Compression

Also known as: context compression, token compression, prompt pruning

Prompt compression reduces the token count of a prompt before it reaches a language model, removing redundant or low-value tokens while preserving the information needed for accurate responses. It cuts inference cost and latency without changing the model.

Prompt compression reduces the token count of an LLM prompt before inference by stripping redundant text, lowering cost and latency without retraining or modifying the underlying model.

What It Is

Token count is the most direct lever on inference cost. Most hosted APIs bill per token, and real-world prompts often carry far more tokens than the model needs: retrieval results, conversation history, and verbose instructions can pad a prompt to thousands of tokens even when only a fraction of the text shapes the answer. Prompt compression removes that excess before the prompt reaches the model. It is a preprocessing step that runs outside the model, shrinks the input, and leaves the model’s weights and fine-tuning untouched. In automated prompt tuning workflows, it sits at a distinct layer from techniques like few-shot selection or chain-of-thought elicitation: it doesn’t change what the prompt instructs the model to do, only how many tokens deliver those instructions.
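To make the cost argument concrete, here is a rough back-of-the-envelope sketch. The function name, the price per 1,000 tokens, and the call volume are illustrative assumptions, not figures from any provider, and the estimate ignores output tokens and the cost of running the compressor itself.

```python
def estimate_savings(prompt_tokens: int, ratio: float,
                     price_per_1k: float, calls: int) -> dict:
    """Estimate input-token spend before and after compression.

    ratio is original tokens / compressed tokens (so 4.0 means "4x").
    price_per_1k is a hypothetical price per 1,000 input tokens.
    """
    if ratio < 1:
        raise ValueError("ratio must be >= 1")
    compressed_tokens = round(prompt_tokens / ratio)
    cost_before = prompt_tokens / 1000 * price_per_1k * calls
    cost_after = compressed_tokens / 1000 * price_per_1k * calls
    return {
        "compressed_tokens": compressed_tokens,
        "cost_before": cost_before,
        "cost_after": cost_after,
        "saved": cost_before - cost_after,
    }


# Hypothetical workload: 4,000-token prompts, 4x compression, 100,000 calls.
report = estimate_savings(4000, 4.0, price_per_1k=0.01, calls=100_000)
print(report)
```

The saving scales linearly with call volume, which is why the technique matters most for high-traffic workloads.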

Think of it as editing a brief before sending it to a consultant. You cut the filler, keep the signal, and the consultant still produces the right answer—but you pay for less of their time.

Two main approaches exist. Soft compression adds lightweight trainable components between the prompt and the model, condensing text into fewer but denser internal representations. This requires fine-tuning and only works with the specific model it was trained against, which makes it harder to deploy across providers. Hard compression (also called token-level or lexical compression) takes a different path: it drops individual tokens from the original text, producing a shorter string that any LLM can read. The output often looks clipped or garbled to humans, but the model can usually still interpret it because the most information-dense tokens survive.

The leading hard-compression method is the LLMLingua family from Microsoft Research. According to Microsoft Research, LLMLingua was introduced at EMNLP 2023 and uses a small language model to score each token’s informativeness; the lowest-scoring tokens are dropped. LLMLingua-2, which Microsoft Research described as state of the art for lossy compression as of 2024, trains a token-classification model on distilled data, making the approach task-agnostic and faster than its predecessor. Microsoft Research reports token-level compression ratios of up to 20×, though at that ratio the surviving text is unreadable to a human. For long-document scenarios, LongLLMLingua uses a question-aware, coarse-to-fine strategy: it first ranks source documents by relevance to the query, then compresses each one, so the most relevant passages receive the highest fidelity after compression.
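The score-then-drop idea can be illustrated with a toy sketch. This is not LLMLingua’s algorithm: as an assumption for illustration, a smoothed unigram frequency model built from a small reference text stands in for the small language model, and whitespace-separated words stand in for tokens. The function name and parameters are hypothetical.

```python
import math
from collections import Counter


def compress_by_self_information(text: str, reference_corpus: str,
                                 keep_ratio: float) -> str:
    """Keep the most surprising words of `text`, preserving original order.

    Surprisal is -log p(word) under an add-one-smoothed unigram model
    estimated from `reference_corpus` (a stand-in for a small LM).
    """
    ref_counts = Counter(reference_corpus.lower().split())
    total = sum(ref_counts.values())
    vocab = len(ref_counts) + 1  # one extra bucket for unseen words

    def surprisal(word: str) -> float:
        p = (ref_counts[word.lower()] + 1) / (total + vocab)
        return -math.log(p)

    words = text.split()
    n_keep = max(1, round(len(words) * keep_ratio))
    # Rank word positions by surprisal, highest first.
    ranked = sorted(range(len(words)),
                    key=lambda i: surprisal(words[i]), reverse=True)
    # Restore document order so the surviving text stays verbatim.
    kept_positions = sorted(ranked[:n_keep])
    return " ".join(words[i] for i in kept_positions)


reference = "the of is the a of the is in the and of a"
print(compress_by_self_information("the capital of France is Paris",
                                   reference, keep_ratio=0.5))
# prints "capital France Paris"
```

Frequent function words score low and are dropped first, which is why compressed prompts read as clipped. A real compressor conditions each score on the surrounding context rather than on word frequency alone.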

How It’s Used in Practice

The most common place to encounter prompt compression is inside RAG (retrieval-augmented generation) pipelines. RAG retrieves several documents or chunks and feeds them all into the prompt as context. Even with a large context window, more tokens mean higher cost and slower responses per call. Prompt compression trims the retrieved context before it enters the model, keeping the parts most relevant to the query and discarding the rest.
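A minimal sketch of the query-aware filtering step, under loud assumptions: relevance is approximated here by plain word overlap with the query, and the budget is counted in whitespace-separated words rather than model tokens. Production compressors use learned relevance scores; the function names below are hypothetical.

```python
import re
from typing import List


def _words(text: str) -> List[str]:
    """Lowercase alphanumeric words; a crude stand-in for a tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_chunks(query: str, chunks: List[str]) -> List[str]:
    """Order chunks by the fraction of their words that appear in the query."""
    query_words = set(_words(query))

    def relevance(chunk: str) -> float:
        words = _words(chunk)
        if not words:
            return 0.0
        return sum(1 for w in words if w in query_words) / len(words)

    return sorted(chunks, key=relevance, reverse=True)


def select_context(query: str, chunks: List[str], word_budget: int) -> List[str]:
    """Keep whole chunks in relevance order until the word budget is spent."""
    selected = []
    remaining = word_budget
    for chunk in rank_chunks(query, chunks):
        size = len(chunk.split())
        if size <= remaining:
            selected.append(chunk)
            remaining -= size
    return selected


chunks = [
    "Bananas are rich in potassium.",
    "Paris is the capital of France.",
    "France borders Spain and Italy.",
]
print(select_context("What is the capital of France?", chunks, word_budget=11))
```

This covers only the coarse, chunk-level pass. A token-level compressor could then be applied to the surviving chunks.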

According to Microsoft Research, LLMLingua is available as an integration in LlamaIndex, one of the most widely used RAG frameworks, making it accessible without building a custom compression layer from scratch. In practice, this means wiring it as a preprocessing step between your retriever and your LLM call: retrieved chunks in, compressed text out.
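The wiring described above can be sketched as a pure transform. This does not use the LlamaIndex or LLMLingua APIs; the retriever, compressor, and LLM are passed in as plain callables, and every name here (including `answer_with_compression` and `target_ratio`) is a hypothetical placeholder for whatever your stack provides.

```python
from typing import Callable, List

Retriever = Callable[[str], List[str]]    # query -> retrieved chunks
Compressor = Callable[[str, float], str]  # (text, target ratio) -> shorter text
LLM = Callable[[str], str]                # prompt -> completion


def answer_with_compression(query: str, retrieve: Retriever,
                            compress: Compressor, llm: LLM,
                            target_ratio: float = 3.0) -> str:
    """Retrieve, compress the retrieved context only, then call the model."""
    if target_ratio < 1:
        raise ValueError("target_ratio must be >= 1")
    context = "\n\n".join(retrieve(query))
    # Only the retrieved context is compressed; the question stays verbatim.
    compressed = compress(context, target_ratio)
    prompt = f"Context:\n{compressed}\n\nQuestion: {query}"
    return llm(prompt)
```

Keeping the ratio as an argument rather than a constant makes it easy to tune per route or per customer, and passing the compressor in as a callable lets you swap implementations without touching the pipeline.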

Pro Tip: Start at conservative compression ratios (3–5×) on a held-out test set before pushing to 10× or higher. Accuracy degradation is non-linear at high ratios—benchmarks in research papers reflect average cases across tasks, not your specific domain or query distribution.
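One way to act on this tip is a small ratio sweep over a held-out set. The sketch below assumes a hypothetical `compress(context, ratio)` function and a hypothetical `answer_fn(context, question)` wrapper around your model, and it scores answers with a naive substring match; substitute whatever metric fits your task.

```python
from typing import Callable, Dict, List, Optional, Tuple

Example = Tuple[str, str, str]  # (context, question, expected answer)


def sweep_ratios(examples: List[Example],
                 compress: Callable[[str, float], str],
                 answer_fn: Callable[[str, str], str],
                 ratios: List[float]) -> Dict[float, float]:
    """Return accuracy on the held-out examples at each compression ratio."""
    results = {}
    for ratio in ratios:
        correct = 0
        for context, question, expected in examples:
            prediction = answer_fn(compress(context, ratio), question)
            correct += int(expected.lower() in prediction.lower())
        results[ratio] = correct / len(examples)
    return results


def highest_safe_ratio(results: Dict[float, float], baseline: float,
                       max_drop: float = 0.02) -> Optional[float]:
    """Walk ratios upward and stop at the first one that loses too much."""
    best = None
    for ratio in sorted(results):
        if baseline - results[ratio] > max_drop:
            break
        best = ratio
    return best
```

Including ratio 1.0 in the sweep gives the uncompressed baseline. Stopping at the first failing ratio, rather than taking the largest passing one, guards against a lucky result at a high ratio.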

When to Use / When Not

Scenario | Use | Avoid
Long RAG context with many retrieved chunks | Yes |
Short, precision-critical prompts (legal, medical, compliance) | | Yes
High-volume API calls where token cost is a concern | Yes |
Prompts containing structured data like code blocks or tables | | Yes
Long-context summarization over multiple documents | Yes |
Safety-critical instructions that must survive verbatim | | Yes

Common Misconception

Myth: Prompt compression is the same as summarization.

Reality: Summarization generates new, shorter text, typically using the LLM itself. Hard prompt compression removes tokens from the original text without rewriting, so the surviving tokens are verbatim from the source. The result often reads as clipped or fragmented to a human, but the model can usually still interpret it because the most informative tokens remain.

One Sentence to Remember

Prompt compression is a pre-inference preprocessing step that strips token waste from the prompt before the model sees it—a cost and latency optimization that lives entirely outside the model, with accuracy tradeoffs that scale with compression ratio.

FAQ

Q: Does prompt compression degrade output quality?

A: At low ratios (2–5×), quality loss is typically minor. At high ratios (10–20×), accuracy on complex tasks can degrade noticeably. Testing on your specific task and data before deployment is essential.

Q: Does the LLM need to be retrained to handle compressed prompts?

A: For hard compression (token removal), no—the compressed text is plain text that any LLM can read. Soft compression does require a fine-tuned adapter layer, making it model-specific.

Q: How does prompt compression differ from prompt caching?

A: Prompt caching stores the computation for a previously seen prompt (typically a shared prefix) so that repeat calls skip reprocessing it. Compression shortens the prompt itself on every call. They address different costs and can be used together.

Expert Takes

Token-level compression exploits a measurable property of language: not all tokens contribute equally to the output distribution. A small model scores each token’s informativeness relative to a task signal, then discards the lowest-scoring ones. The surviving sequence preserves enough conditional probability structure for the large model to recover the intended output. The mechanism is lossy—information is genuinely dropped—and the accuracy impact is a function of which tokens were removed, not a rounding error.

In a RAG spec, prompt compression sits between the retriever and the LLM call—a preprocessing function that runs on the assembled context before you dispatch the request. Wire it as a pure transform: retrieved chunks in, compressed text out. Treat compression ratio as a configurable parameter, not a hardcoded constant. Benchmark output quality against your specific query distribution before shipping to production, especially on precision-sensitive domains.

Inference cost is real, and long-context prompts compound fast. Prompt compression is one of the few optimizations that sits entirely outside the model—no fine-tuning, no provider negotiation, no architecture change. You deploy it as a preprocessing step, and it cuts your token bill on every call. That’s the kind of lever that pays for itself quickly in production traffic.

Prompt compression optimizes for what the model needs, not what the user sent. That gap matters. When a compressed prompt drops a sentence the model didn’t need to answer correctly in tests—but that sentence was a safety constraint or an ethical boundary—the accuracy metric looks fine; the actual behavior may not. Compression benchmarks measure task accuracy. They don’t measure which parts of the prompt stopped the model from doing something it shouldn’t.
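One possible guard against this failure mode is to mark constraint text as protected and compress only the rest. The sketch below is an illustration under stated assumptions, not a feature of any particular library: the `Segment` type, the `protected` flag, and `compress_with_protected` are hypothetical, and deciding which segments count as constraints remains a human judgment.

```python
from typing import Callable, List, Tuple

Segment = Tuple[str, bool]  # (text, protected)


def compress_with_protected(segments: List[Segment],
                            compress: Callable[[str, float], str],
                            ratio: float) -> str:
    """Compress unprotected segments; pass protected ones through verbatim."""
    parts = []
    for text, protected in segments:
        parts.append(text if protected else compress(text, ratio))
    # Drop segments that compressed down to nothing.
    return "\n".join(part for part in parts if part)
```

Protected text still costs tokens, so the effective compression ratio of the whole prompt will be lower than the ratio passed to the compressor.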