Semantic Caching
Also known as: similarity caching, vector-based response caching, LLM response cache
Semantic caching sits between your application and the LLM API, storing previous responses and returning them when a new query is semantically close to one already answered — without making an API call.
What It Is
Every LLM API call has a cost: you pay per token for input and output. In many real-world applications — customer support bots, internal FAQ tools, AI assistants answering developer questions — the same question arrives in dozens of phrasings. “What’s the refund policy?” and “How do I get my money back?” mean the same thing. Traditional exact-match caching misses the connection. Semantic caching doesn’t.
Semantic caching embeds each incoming query into a vector — a numerical representation that captures meaning — and compares it against a store of prior query embeddings using cosine similarity. If the similarity score clears a configurable threshold, the system returns the cached response from the matching prior query. The LLM never sees the new request.
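The lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not any library's implementation: the three-dimensional vectors stand in for real embeddings, the cached strings are placeholders, and the `lookup` helper and the 0.9 threshold are assumptions chosen for the example.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def lookup(
    query_vec: Sequence[float],
    store: List[Tuple[Sequence[float], str]],
    threshold: float = 0.9,  # assumed value; tune per workload
) -> Optional[str]:
    """Return the response of the most similar stored query, or None on a miss."""
    best_score, best_response = -1.0, None
    for stored_vec, response in store:
        score = cosine_similarity(query_vec, stored_vec)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= threshold else None


# Toy vectors standing in for embeddings of previously answered queries.
store = [
    ([0.9, 0.1, 0.0], "cached answer about refunds"),
    ([0.0, 0.2, 0.9], "cached answer about passwords"),
]

paraphrase_vec = [0.85, 0.15, 0.05]  # points almost the same way as the refund query
unrelated_vec = [0.1, 0.9, 0.1]      # points away from both stored queries

hit = lookup(paraphrase_vec, store)
miss = lookup(unrelated_vec, store)
```

A linear scan like this is only reasonable for small stores; production setups usually delegate the nearest-neighbor search to a vector index.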
Think of it as pattern recognition for intent: instead of checking whether two strings are identical, it checks whether two questions mean the same thing. The decision happens at the application layer, before any tokens leave your infrastructure.
The main open-source implementation is GPTCache, maintained by Zilliz. According to the GPTCache GitHub repository, it supports multiple vector backends (Milvus, Faiss, Redis, and Qdrant) and integrates with LangChain and LlamaIndex. A Redis-based alternative, documented by the Spheron Blog, uses LangChain’s RedisSemanticCache with RediSearch for vector similarity matching.
Semantic caching is distinct from the prompt caching (KV cache reuse) that cloud providers offer on their model APIs. Prompt caching is server-side: the provider saves intermediate computation for a repeated token prefix, which reduces recomputation cost but still requires a request. Semantic caching is client-side: on a hit, it avoids the API call entirely. According to the arXiv semantic caching paper listed under Sources, workloads with repeated or paraphrased queries can see a 61.6–68.8% reduction in API calls. Among strategies for optimizing LLM spend, such as token budgeting, model routing, and batch processing, semantic caching targets the most direct cost lever: not paying for answers you already have.
How It’s Used in Practice
The most common deployment is a customer-facing chatbot or FAQ bot built on an LLM. Support teams handle hundreds of daily queries that cluster around a handful of topics: billing, cancellation, troubleshooting steps. Once the first representative answer is cached, all paraphrased variants of that question return the cached response in milliseconds, at near-zero marginal cost.
A developer integrating GPTCache into a LangChain pipeline connects the cache between the prompt template and the LLM call. Any ChatOpenAI or compatible call goes through the cache layer first. On a cache miss, the response is stored alongside the query embedding. On a hit, it returns immediately. The application code doesn’t change — only the cache configuration.
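The miss-then-store, hit-then-return flow can be sketched without any particular library. The code below is not GPTCache's or LangChain's API: `SemanticCache`, `cached_completion`, `toy_embed`, and `fake_llm` are hypothetical names, the keyword-counting embedding is a crude stand-in for a real embedding model, and the fake LLM only counts how often it is called.

```python
import math
import re
from typing import Callable, List, Optional, Tuple

Vector = List[float]

# Hypothetical topic keywords; a real deployment would call an embedding model.
TOPIC_KEYWORDS = {"refund": 0, "money": 0, "password": 1, "login": 1}


def toy_embed(text: str) -> Vector:
    """Count topic keywords per dimension (a stand-in for a real embedding)."""
    vec = [0.0, 0.0]
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in TOPIC_KEYWORDS:
            vec[TOPIC_KEYWORDS[word]] += 1.0
    return vec


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


class SemanticCache:
    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.9):
        self._embed = embed
        self._threshold = threshold
        self._entries: List[Tuple[Vector, str]] = []

    def get(self, query: str) -> Optional[str]:
        """Return the closest stored response if it clears the threshold."""
        query_vec = self._embed(query)
        best = max(self._entries, key=lambda e: cosine(query_vec, e[0]), default=None)
        if best is not None and cosine(query_vec, best[0]) >= self._threshold:
            return best[1]
        return None

    def put(self, query: str, response: str) -> None:
        self._entries.append((self._embed(query), response))


def cached_completion(query: str, cache: SemanticCache, call_llm: Callable[[str], str]) -> str:
    """Check the cache first; call the model and store the result only on a miss."""
    hit = cache.get(query)
    if hit is not None:
        return hit
    response = call_llm(query)
    cache.put(query, response)
    return response


calls: List[str] = []


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; records each prompt it receives."""
    calls.append(prompt)
    return f"answer #{len(calls)}"


cache = SemanticCache(toy_embed)
first = cached_completion("What's the refund policy?", cache, fake_llm)     # miss
second = cached_completion("How do I get my money back?", cache, fake_llm)  # hit
third = cached_completion("How do I reset my password?", cache, fake_llm)   # miss
```

Because the cache wraps the call rather than replacing it, the calling code only needs to swap the direct model call for the wrapped one.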
Pro Tip: Start with a high similarity threshold and lower it gradually while monitoring cache hit quality. A threshold that’s too low will return semantically adjacent but not equivalent answers — which is often worse than no caching at all.
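One way to make that tuning concrete is to label a sample of query pairs by hand and sweep the threshold from high to low. The sketch below assumes such a labeled sample exists; the scores and labels are invented for illustration, and `evaluate_threshold` and `pick_threshold` are hypothetical helper names.

```python
from typing import List, Optional, Tuple

# (similarity score, whether a human judged the two queries equivalent)
LabeledPair = Tuple[float, bool]


def evaluate_threshold(pairs: List[LabeledPair], threshold: float) -> Tuple[float, float]:
    """Return (hit_rate, precision) for the given threshold.

    hit_rate  = share of pairs that would be served from the cache
    precision = share of those hits that are truly equivalent
    """
    hits = [equivalent for score, equivalent in pairs if score >= threshold]
    hit_rate = len(hits) / len(pairs) if pairs else 0.0
    # With no hits there are no wrong answers, so precision is treated as 1.0.
    precision = sum(hits) / len(hits) if hits else 1.0
    return hit_rate, precision


def pick_threshold(
    pairs: List[LabeledPair], candidates: List[float], min_precision: float
) -> Optional[float]:
    """Start high and lower the threshold until precision drops below the floor."""
    chosen = None
    for threshold in sorted(candidates, reverse=True):
        _, precision = evaluate_threshold(pairs, threshold)
        if precision < min_precision:
            break
        chosen = threshold
    return chosen


# Invented sample; real labels would come from reviewing production queries.
sample = [
    (0.98, True),
    (0.95, True),
    (0.91, True),
    (0.88, False),  # close in wording, but a different question
    (0.86, True),
    (0.80, False),
]

chosen = pick_threshold(sample, candidates=[0.95, 0.90, 0.85, 0.80], min_precision=0.95)
loose_hit_rate, loose_precision = evaluate_threshold(sample, 0.85)
```

On this sample the sweep stops at 0.90: dropping to 0.85 would raise the hit rate but start serving the non-equivalent pair scored 0.88.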
When to Use / When Not
| Scenario | Use | Avoid |
|---|---|---|
| FAQ or support bot with predictable, recurring question patterns | ✅ | |
| Creative or generative tool where each response should be original | | ❌ |
| Internal knowledge assistant answering repetitive policy questions | ✅ | |
| Queries that depend on real-time data, user state, or session context | | ❌ |
| Batch processing of structurally similar prompts or documents | ✅ | |
| Multi-turn conversations where prior messages shape the answer | | ❌ |
Common Misconception
Myth: Semantic caching and prompt caching (KV cache) are the same optimization with different names.
Reality: They operate at different layers and solve different problems. Prompt caching is server-side: the model provider caches intermediate KV states for a repeated context prefix, reducing recomputation cost. Semantic caching is client-side: it intercepts the query before it reaches the API and returns a stored response when meaning matches, so no tokens are sent to the provider at all. You can — and often should — use both together.
One Sentence to Remember
Semantic caching trades a vector lookup for an LLM API call — the economics work when your queries repeat, and break down the moment every query is genuinely unique.
FAQ
Q: How is semantic caching different from standard HTTP response caching? A: HTTP caching requires an exact URL or request match. Semantic caching compares meaning using vector embeddings, so paraphrased or reworded queries can return a cached response even when the text differs completely.
Q: What happens when a cached response goes stale? A: Most implementations support a time-to-live (TTL) setting per cached entry. When the TTL expires, the next matching query triggers a fresh API call and the response is re-cached. Without TTL configuration, stale answers can persist indefinitely.
Q: Is semantic caching safe for personalized queries? A: Not by default. If users need different answers based on account state or permissions, a shared cache can return incorrect responses. Scope the cache by user or session for any personalized use case to prevent data leakage between users.
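The TTL and scoping answers above can be combined in one small sketch. It assumes entries are partitioned by a scope string (for example a user or session id) and expire after a fixed number of seconds. `ScopedTTLCache` is a hypothetical class, not a library API, and the injected clock exists only so that expiry can be demonstrated without waiting.

```python
import math
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Entry:
    vector: List[float]
    response: str
    expires_at: float


def _cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


class ScopedTTLCache:
    def __init__(
        self,
        ttl_seconds: float,
        threshold: float = 0.9,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._ttl = ttl_seconds
        self._threshold = threshold
        self._clock = clock
        self._by_scope: Dict[str, List[Entry]] = {}

    def put(self, scope: str, vector: List[float], response: str) -> None:
        entry = Entry(vector, response, self._clock() + self._ttl)
        self._by_scope.setdefault(scope, []).append(entry)

    def get(self, scope: str, vector: List[float]) -> Optional[str]:
        """Search only this scope's live entries; expired ones are dropped."""
        if scope not in self._by_scope:
            return None
        now = self._clock()
        live = [e for e in self._by_scope[scope] if e.expires_at > now]
        self._by_scope[scope] = live
        best = max(live, key=lambda e: _cosine(vector, e.vector), default=None)
        if best is not None and _cosine(vector, best.vector) >= self._threshold:
            return best.response
        return None


# A controllable clock so the example can "advance time" without sleeping.
fake_now = [0.0]
cache = ScopedTTLCache(ttl_seconds=60, clock=lambda: fake_now[0])
cache.put("user-a", [1.0, 0.0], "answer for user A")

same_user = cache.get("user-a", [0.99, 0.05])    # similar query, same scope
other_user = cache.get("user-b", [0.99, 0.05])   # same query, different scope
fake_now[0] = 61.0                               # move past the 60-second TTL
after_expiry = cache.get("user-a", [0.99, 0.05])
```

Scoping trades hit rate for safety: answers cached for one user are never candidates for another, so the cache only helps when the same user repeats or rephrases a question.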
Sources
- arXiv semantic caching: From Exact Hits to Close Enough: Semantic Caching for LLM Embeddings (arXiv 2603.03301) — empirical analysis of hit rates and API call reduction across workload types
- GPTCache GitHub: GPTCache: Semantic Cache for LLMs (Zilliz) — open-source library documentation, supported backends, and integration guides
Expert Takes
Semantic caching applies nearest-neighbor search to a deceptively hard question: when are two natural language queries equivalent? Cosine similarity over embedding vectors answers this probabilistically, not definitively. The similarity threshold is a precision-recall dial — moving it shifts the tradeoff between returning incorrect cached answers and missing obvious paraphrases. The quality of the underlying embedding model determines how faithfully meaning is captured, which is the hidden variable most implementations never measure.
In an LLM middleware stack, semantic caching belongs at the outermost layer — before token counting, routing, or model selection. The cache intercept should be stateless and keyed to the query embedding, not the full prompt including system context. If the system prompt varies per user, either include it in the cache key or scope the cache to contexts with identical system prompts. Mixing contexts in a shared cache is the most common implementation mistake.
The cost case for semantic caching is direct: if your application receives the same questions repeatedly, you’re paying full API price for answers you already have. The ROI disappears the moment queries diversify — a cache with a low hit rate adds latency and infrastructure overhead with no return. Know your query distribution before committing to a caching layer.
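A back-of-the-envelope model makes the break-even point explicit. The sketch assumes every query pays a lookup cost (embedding plus vector search), only misses pay for the LLM call, and the cache has a fixed monthly infrastructure cost. All prices and volumes below are made-up inputs, and both function names are hypothetical.

```python
def monthly_savings(
    queries: int,
    hit_rate: float,
    llm_cost_per_call: float,
    lookup_cost_per_query: float,
    fixed_infra_cost: float = 0.0,
) -> float:
    """Baseline spend minus spend with the cache; negative means the cache loses money."""
    baseline = queries * llm_cost_per_call
    per_query_with_cache = lookup_cost_per_query + (1.0 - hit_rate) * llm_cost_per_call
    with_cache = queries * per_query_with_cache + fixed_infra_cost
    return baseline - with_cache


def break_even_hit_rate(
    queries: int,
    llm_cost_per_call: float,
    lookup_cost_per_query: float,
    fixed_infra_cost: float = 0.0,
) -> float:
    """Hit rate at which savings are exactly zero."""
    overhead_per_query = lookup_cost_per_query + fixed_infra_cost / queries
    return overhead_per_query / llm_cost_per_call


# Made-up inputs: 100,000 queries a month, one cent per LLM call,
# a hundredth of a cent per lookup, and 50 per month for the vector store.
inputs = dict(
    queries=100_000,
    llm_cost_per_call=0.01,
    lookup_cost_per_query=0.0001,
    fixed_infra_cost=50.0,
)

needed = break_even_hit_rate(**inputs)
repetitive = monthly_savings(hit_rate=0.30, **inputs)  # FAQ-like traffic
diverse = monthly_savings(hit_rate=0.02, **inputs)     # mostly unique queries
```

With these inputs the cache needs roughly a 6% hit rate to pay for itself: the repetitive workload comes out ahead, while the diverse one ends up slightly more expensive than calling the model directly.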
The “close enough” assumption in semantic caching carries a transparency cost: it makes system behavior less predictable for users. Two users who ask nearly identical questions may get different responses depending on cache state — or get the same response when they needed different ones. The opaque nature of vector similarity means neither user nor developer can easily audit why a particular cached answer was served. That’s a transparency gap worth designing around explicitly.