Contextual Retrieval: How Prepended Context Reduces RAG Failures

ELI5
Contextual retrieval is a Retrieval Augmented Generation technique that prepends a short, LLM-generated explanation to each chunk before embedding and indexing — so retrieval sees what the chunk is about, not just what it says.
A user asks: “What was Acme’s revenue in Q2 2023?” The vector store dutifully retrieves a chunk containing the sentence “The company reported revenue of $314M, an 18% year-over-year increase.” Plausible. Confident. Wrong company, wrong quarter.
The chunk was correct in its source document. The chunk became wrong the moment it lost its source document.
This is the failure mode contextual retrieval was built to fix.
What Standard Chunking Forgets
Vector retrieval starts from a clean assumption: chunks are self-describing. Embed the words; similarity will find what matters. The assumption breaks the moment a chunk relies on its parent document for meaning — which, for any document longer than a paragraph, is most chunks.
What is contextual retrieval in RAG?
Contextual retrieval is a recipe published by Anthropic in September 2024 (Anthropic Blog) that enriches each chunk with a short, document-aware explanation before embedding and before BM25 indexing. The chunk is no longer just the raw text. It is the raw text plus 50 to 100 tokens of LLM-generated context: “This chunk is from Acme Corp’s Q2 2023 earnings report; the prior section discussed segment-level revenue growth.” Then embed that.
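Concretely, what gets embedded is the generated context followed by the raw chunk. A minimal illustration, reusing the example strings from above:

```python
# Illustrative only: the enriched text that gets embedded and BM25-indexed.
context = (
    "This chunk is from Acme Corp's Q2 2023 earnings report; "
    "the prior section discussed segment-level revenue growth."
)
chunk = "The company reported revenue of $314M, an 18% year-over-year increase."

# Prepend the generated context to the raw chunk before embedding / indexing.
enriched_chunk = f"{context}\n\n{chunk}"
```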
The change is small. The consequence is geometric. The chunk’s vector now lives in a region of embedding space anchored by its parent document, not adrift among every other chunk that happens to mention “$314M” or “18% growth.”
Not a clever embedding model. A pre-processing step that gives the embedding model something worth embedding.
How does contextual retrieval work to enrich chunks before indexing?
The pipeline runs offline, once per document. For each chunk, the recipe (sketched in code after the list) is:
- Pass the whole document plus the target chunk to an LLM.
- Ask: in 50–100 tokens, situate this chunk within the document — what is it about, and how does it relate to what surrounds it?
- Prepend that generated context to the chunk text.
- Send the enriched chunk to the embedding model.
- Send the same enriched chunk to a hybrid-search BM25 index.
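A minimal sketch of that offline loop, using the Anthropic Python SDK with prompt caching on the parent document. The helpers `load_document` and `split_into_chunks`, the model id, and the prompt wording are illustrative assumptions, not the published recipe verbatim:

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

# Hypothetical helpers: your own loader and chunker.
document = load_document("acme_q2_2023_earnings.txt")
chunks = split_into_chunks(document)

enriched_chunks = []
for chunk in chunks:
    response = client.messages.create(
        model="claude-3-5-haiku-latest",  # assumption: any small, cheap model works
        max_tokens=150,
        system=[{
            # The full parent document, marked for prompt caching so it is paid
            # for once and reused across every chunk of the same document.
            "type": "text",
            "text": f"<document>\n{document}\n</document>",
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{
            "role": "user",
            "content": (
                "Here is a chunk from the document above:\n"
                f"<chunk>\n{chunk}\n</chunk>\n"
                "In 50-100 tokens, situate this chunk within the document: "
                "what is it about, and how does it relate to what surrounds it? "
                "Answer with the context only."
            ),
        }],
    )
    context = response.content[0].text
    # Steps 3-5: prepend, then hand the same enriched text to both indexes.
    enriched_chunks.append(f"{context}\n\n{chunk}")
```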
Anthropic calls the two arms of this “Contextual Embeddings” and “Contextual BM25” (Anthropic Blog). A precise reading: BM25 itself is unchanged. Plain BM25, run over context-enriched chunks. The index is the same. The input is richer.
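That distinction is easy to see in code. A sketch using the `rank_bm25` package over the `enriched_chunks` built above; the package choice is an assumption, not part of the recipe:

```python
from rank_bm25 import BM25Okapi

# Ordinary BM25. The only change is that the indexed text already carries
# the LLM-generated context prepended to each chunk.
tokenized = [c.lower().split() for c in enriched_chunks]
bm25 = BM25Okapi(tokenized)

query = "Acme revenue Q2 2023"
scores = bm25.get_scores(query.lower().split())  # one score per enriched chunk
```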
The expensive part is step 1 — calling an LLM once per chunk, multiplied by every chunk in the corpus. Anthropic reports the cost falling to roughly $1.02 per million document tokens when prompt caching is enabled, since the same parent document is reused across all its chunks (Anthropic Blog). Caching cuts the bill by up to ~90% relative to uncached generation. Treat the figure as order-of-magnitude — it was measured against then-current Haiku pricing in late 2024, and today’s exact cost depends on the model and caching tier you actually run.
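To make the order of magnitude concrete, a back-of-the-envelope estimate under the reported figure; the corpus size is illustrative, and the uncached number simply inverts the ~90% caching saving:

```python
# Rough enrichment cost at ~$1.02 per million document tokens with caching.
corpus_tokens = 50_000_000                           # illustrative corpus size
cached_cost = corpus_tokens / 1_000_000 * 1.02       # ~ $51
uncached_cost = cached_cost / 0.10                   # ~90% saving implies roughly 10x more
print(f"cached: ~${cached_cost:.0f}, uncached: ~${uncached_cost:.0f}")
```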
Three Components Stacked Together
The technique is rarely run alone. The interesting result is what happens when contextual retrieval is layered with two other steps the RAG community already understood: hybrid retrieval and Reranking.
What are the components of a contextual retrieval pipeline?
A full contextual retrieval pipeline has four stages, each addressing a distinct failure mode:
| Stage | What it adds | What it fixes |
|---|---|---|
| 1. Contextual enrichment | LLM-generated 50–100 token preface per chunk | Chunks that depend on their parent document |
| 2. Contextual Embeddings | Dense semantic retrieval over enriched chunks | Misses on paraphrase and conceptual similarity |
| 3. Contextual BM25 | Lexical (sparse) retrieval over the same enriched chunks | Misses on rare names, IDs, exact strings |
| 4. Reranking | Cross-encoder scoring of top candidates, then top-20 to the model | Order quality among already-relevant candidates |
Anthropic’s reference implementation uses Voyage and Gemini embeddings paired with Cohere Rerank for the final ordering (Anthropic Blog). The original 2024 post named a specific Gemini embedding version — Google’s text-embedding family has shifted since, so treat embedding model choice as a benchmark-it-yourself decision rather than a hard recommendation.
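A sketch of how the last three stages fit together: merge the dense and BM25 candidate lists, then hand the survivors to a cross-encoder and keep the top 20. The reciprocal-rank-fusion rule, the `dense_search`/`bm25_search` helpers, and the specific rerank model are assumptions rather than the published implementation; `enriched_chunks` carries over from the earlier sketch:

```python
import cohere

def rrf_merge(ranked_lists, k=60):
    """Reciprocal rank fusion over several ranked lists of chunk ids."""
    scores = {}
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

query = "What was Acme's revenue in Q2 2023?"

# Hypothetical helpers: each returns chunk ids, best first.
dense_ids = dense_search(query, top_k=150)
sparse_ids = bm25_search(query, top_k=150)
candidate_ids = rrf_merge([dense_ids, sparse_ids])[:150]

co = cohere.Client()  # assumes COHERE_API_KEY is set
reranked = co.rerank(
    model="rerank-english-v3.0",   # assumption: any cross-encoder reranker works
    query=query,
    documents=[enriched_chunks[i] for i in candidate_ids],
    top_n=20,                      # the top-20 passed to the generator
)
final_chunks = [enriched_chunks[candidate_ids[r.index]] for r in reranked.results]
```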
The shape of the pipeline is the point: each stage corrects a different category of error, and the gains compound rather than substitute. Note also what this pipeline does not address — it improves the index, not the question. Reformulating the user’s input is the job of Query Transformation, and dynamic retrieval logic that decides what to fetch belongs to Agentic RAG systems. Different stages, different concerns; layer them together.

What the Layers Predict When You Stack Them
Here is where the geometry gets interesting. Anthropic’s evaluation reports retrieval-failure rates across the stack, measured against a 5.7% baseline (Anthropic Blog):
- Contextual Embeddings alone: 35% reduction in retrieval failures, dropping the failure rate from 5.7% to 3.7%.
- Contextual Embeddings plus Contextual BM25 (hybrid): 49% reduction, dropping to 2.9%.
- All three plus reranking: 67% reduction, dropping to 1.9%.
A clarification the third-party blogosphere keeps mangling: a 67% reduction in retrieval failures is not a 67% accuracy lift. The headline is “1.9% of queries still fail to surface the right chunk in the top 20.” The improvement is enormous in relative terms and modest in absolute terms — both readings are true. These numbers also come from Anthropic’s own internal evaluation. They have been widely cited but, as of the time of writing, have not been independently reproduced at the same scale.
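The arithmetic behind that distinction, spelled out with the reported figures:

```python
# Relative vs. absolute reduction in retrieval-failure rate (Anthropic's reported numbers).
baseline = 0.057
for name, rate in [("embeddings only", 0.037), ("plus BM25", 0.029), ("plus rerank", 0.019)]:
    relative = (baseline - rate) / baseline   # e.g. (5.7 - 1.9) / 5.7 = 67%
    absolute = baseline - rate                # e.g. 3.8 percentage points
    print(f"{name}: {rate:.1%} still fail, {relative:.0%} relative reduction, "
          f"{absolute:.1%} absolute reduction")
```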
If you change one variable at a time, the predictions are reasonably clean:
- If your corpus is mostly long, structurally cohesive documents — legal filings, financial reports, scientific papers — contextual enrichment should help more than on a corpus of standalone snippets. The fix targets chunks that need their parent document; standalone chunks have nothing to gain.
- If your queries lean on rare identifiers — product SKUs, error codes, person names — the BM25 leg matters disproportionately. Embeddings smooth over exact tokens; BM25 doesn’t.
- If your top-k is small (k=5), reranking matters more than retrieval recall, because there is less room to be wrong before the right chunk falls out of the window.
Rule of thumb: before tuning embedding models, check whether your chunks know what document they came from.
When it breaks: the technique assumes a corpus that fits the offline-enrichment economics. Generating context for every chunk costs real money on multi-billion-token corpora, and the index must be re-enriched whenever the source documents change. For high-churn or massive corpora, the LLM-context step is the bottleneck — not the embedding model.
A Detail Most Implementations Miss
The token budget on the prepended context is not a hyperparameter to over-engineer. The recipe specifies 50 to 100 tokens for a reason: short enough that the chunk text still dominates the embedding, long enough to anchor the chunk to its document. Push the context to 300 tokens and the embedding starts representing the document summary more than the chunk content — relevance scores degrade, retrieval shifts toward whichever chunks share parent-document framing rather than whichever chunks actually answer the query.
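One practical guard is to cap the generated context before prepending it. A sketch using `tiktoken`; the encoder choice is an assumption, and in practice you should count with whatever tokenizer matches your embedding model:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def cap_context(context: str, max_tokens: int = 100) -> str:
    """Truncate the generated context so the chunk text still dominates the embedding."""
    tokens = enc.encode(context)
    if len(tokens) <= max_tokens:
        return context
    return enc.decode(tokens[:max_tokens])
```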
The pre-processing is solving a specific problem: chunks that are ambiguous without their document. It is not solving “embeddings are bad.”
This is also why two newer approaches sit alongside the technique rather than replacing it. Voyage AI’s voyage-context-3, released July 23, 2025, embeds chunks with global document context built into the model itself — no separate LLM enrichment step (Voyage AI Blog). Jina’s late chunking takes a different route entirely: embed the whole document with a long-context encoder first, then mean-pool chunks after token-level embedding (Jina AI). Different architectures, same diagnosis. The original failure mode — chunks losing their document — is now a recognized class of problem, not a quirk of one recipe.
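For contrast, the late-chunking idea in miniature, using a Hugging Face long-context encoder. The model name and the `chunk_token_spans` offsets are illustrative assumptions; production implementations handle token-to-chunk alignment more carefully:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumption: any long-context embedding encoder; the exact model is illustrative.
name = "jinaai/jina-embeddings-v2-base-en"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, trust_remote_code=True)

inputs = tokenizer(document, return_tensors="pt", truncation=True)
with torch.no_grad():
    token_embeddings = model(**inputs).last_hidden_state[0]  # (seq_len, dim)

# "Late" chunking: pool per chunk AFTER the whole document has been encoded,
# so every token embedding already carries document-wide context.
chunk_vectors = [
    token_embeddings[start:end].mean(dim=0)   # mean-pool the chunk's token span
    for start, end in chunk_token_spans       # hypothetical (start, end) token offsets
]
```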
What this tells us, indirectly, is that “Contextual Retrieval” (Anthropic’s branded September 2024 recipe) and “contextualised chunk embeddings” (the broader technique family) are not synonyms. The recipe is one implementation. The category is older and wider.
The Data Says
Contextual retrieval’s contribution is not a new index or a new embedding model. It is the recognition that the embedding model is being asked to represent something the chunker stripped away. Prepend the missing context, and Anthropic reports retrieval failures drop by roughly half before reranking, and by two-thirds with reranking — measured against their own internal eval, on a baseline that was already only 5.7% wrong.