Context Recall
Also known as: retrieval recall, RAG context recall, Ragas context recall
Context Recall is a retrieval-side RAG evaluation metric that measures how completely the retrieved documents cover the information required to produce the ideal answer, scored against a human-labeled ground truth.
What It Is
When you build a Retrieval-Augmented Generation (RAG) pipeline, the model’s answer is only as good as the documents it gets handed. Context Recall is the metric that asks a simple question: did your retriever find everything it needed? If half the relevant facts never made it into the prompt, the model cannot produce a complete answer — no matter how capable that model is.
Context Recall sits on the retrieval side of RAG evaluation. It compares the documents your retriever returned against a human-written reference answer (the ground truth). According to Ragas Docs, the metric decomposes that reference answer into individual claims, checks how many of those claims are attributable to the retrieved context, and reports a score from 0 to 1, where 1 means every claim in the ideal answer can be supported by something the retriever fetched.
In practice, the scoring is done by an LLM judge: each claim from the reference answer is checked against the retrieved chunks, and the judge marks it as supported or not. The recall score is the fraction of supported claims. Earlier formulations decomposed the retrieved context into sentences instead — current Ragas versions decompose the reference answer into claims, which is more aligned with what the metric is actually trying to measure: did the retrieval cover everything the ideal answer needed?
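To make the mechanics concrete, here is a minimal sketch of that computation in Python. This is not Ragas's implementation: `judge_supports` is a hypothetical stand-in for the LLM-judge call that decides whether a single claim is attributable to the retrieved chunks.

```python
from typing import Callable

def context_recall_score(
    reference_claims: list[str],
    retrieved_chunks: list[str],
    judge_supports: Callable[[str, list[str]], bool],  # hypothetical LLM-judge call
) -> float:
    """Fraction of reference-answer claims supported by the retrieved context."""
    if not reference_claims:
        raise ValueError("the reference answer yielded no claims to check")
    supported = sum(
        judge_supports(claim, retrieved_chunks) for claim in reference_claims
    )
    return supported / len(reference_claims)

# A perfect retrieval supports every claim: score == 1.0.
# If 3 of 4 claims are attributable, score == 0.75.
```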
This is what makes Context Recall different from its sibling, Context Precision. Precision asks “of the documents you retrieved, how many were relevant?” Recall asks “of the documents you needed, how many did you actually get?” Both run on the retrieval side, but they catch opposite failure modes — a retriever that grabs too much noise versus one that misses key evidence.
According to Ragas Docs, Context Recall is the only one of the four core Ragas metrics that requires a human-labeled ground truth. Faithfulness, Answer Relevancy, and Context Precision can all be evaluated reference-free using LLM judges. Context Recall cannot — somebody has to write the ideal answer first.
How It’s Used in Practice
Most teams encounter Context Recall when they build their first RAG evaluation harness. You assemble a small test set — typically a few dozen representative queries with hand-written reference answers — and run Ragas (or an equivalent framework) on every change to the retrieval stack. When you swap embedding models, change chunk size, or adjust the top-k cutoff, the Context Recall score on that test set tells you whether the change helped or hurt coverage.
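A minimal harness along those lines might look like the sketch below. It assumes the pre-0.2 Ragas API, where `evaluate` accepts a Hugging Face `Dataset` with `question`, `contexts`, and `ground_truth` columns; newer Ragas releases rename these pieces, so treat the imports and column names as version-dependent.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_recall

# A tiny test set: one query, its retrieved chunks, and a hand-written reference answer.
data = {
    "question": ["What does Context Recall measure?"],
    "contexts": [[
        "Context Recall measures whether the retrieved documents cover "
        "all the information needed to produce the ideal answer."
    ]],
    "ground_truth": ["It measures how completely retrieval covers the ideal answer."],
}

dataset = Dataset.from_dict(data)
result = evaluate(dataset, metrics=[context_recall])
print(result)  # e.g. {'context_recall': 1.0}
```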
In day-to-day product work, the score becomes a guardrail. If your support chatbot starts giving thin answers in production, Context Recall on a regression set usually flags the cause: the retriever is no longer pulling the documents that contain the key facts. Generation looks fine; retrieval has degraded. Without this metric, the team is left guessing whether the model, the prompt, or the retriever is at fault.
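Wired into CI, the score from a run like the one above can act as that guardrail. The floor used here is an illustrative choice, not a standard; you would calibrate it against your own baseline runs.

```python
# Continuing from the evaluate() call above. RECALL_FLOOR is a hypothetical
# threshold chosen for illustration, not a recommended value.
RECALL_FLOOR = 0.80

score = result["context_recall"]
if score < RECALL_FLOOR:
    raise SystemExit(
        f"Context Recall regressed to {score:.2f} (floor {RECALL_FLOOR}): "
        "the retriever is likely no longer surfacing the key documents."
    )
```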
Pro Tip: Build your evaluation set before you start tuning. If you have already optimized the retriever by gut feel, your reference answers will silently match what the system can already do, and Context Recall will look great while telling you nothing useful. Write the reference answers first, then measure.
When to Use / When Not
| Scenario | Use | Avoid |
|---|---|---|
| Comparing embedding models on a fixed test set | ✅ | |
| You have no human-written reference answers available | ❌ | |
| Diagnosing why production answers feel incomplete | ✅ | |
| You only care about hallucination, not coverage | ❌ | |
| Tuning chunk size, top-k, or retrieval prompts | ✅ | |
| Open-ended creative queries with no “correct” answer | ❌ |
Common Misconception
Myth: A high Context Recall score means the RAG system is producing good answers. Reality: Context Recall only evaluates retrieval. The model can still ignore the retrieved evidence, paraphrase it incorrectly, or hallucinate around it. Pair Context Recall with Faithfulness (does the answer stick to the context?) and Answer Relevancy (does the answer address the question?) to get a full picture of RAG quality.
One Sentence to Remember
Context Recall tells you whether your retriever found the right material — not whether the model used it correctly — so treat it as one signal in a battery of RAG metrics, never as a standalone verdict.
FAQ
Q: What is the difference between Context Recall and Context Precision? A: Recall measures coverage — did you retrieve all the relevant documents? Precision measures noise — of what you retrieved, how much was actually relevant? They catch opposite retrieval failure modes.
Q: Why does Context Recall need a ground-truth answer when other RAG metrics do not? A: Because “did you retrieve everything needed?” requires knowing what was needed. Without a reference answer to extract claims from, there is no way to check whether the retriever missed something important.
Q: Is Context Recall enough to validate a RAG system on its own? A: No. It only measures retrieval coverage. You also need Faithfulness for grounding, Answer Relevancy for question alignment, and ideally Context Precision to check retrieval noise.
Sources
- Ragas Docs: List of available metrics - Official Ragas documentation defining Context Recall, its computation, score range, and required inputs.
- RAGAS paper: "RAGAS: Automated Evaluation of Retrieval Augmented Generation" - Original paper introducing the Ragas framework and the precision/recall trade-off on the retrieval side.
Expert Takes
Recall as a concept in information retrieval predates RAG by decades. What Ragas added is the trick of letting an LLM extract claims from a reference answer, then checking each claim’s attributability against the retrieved context. The metric inherits a clean theoretical property: it isolates the retrieval failure mode. If the relevant evidence never reaches the model, no amount of prompt engineering downstream can recover the missing facts.
The interesting design choice in Context Recall is that it forces you to write down what “done” looks like before you measure. A reference answer is a specification of the ideal output. Most teams skip this step and end up with retrieval scores that drift to match whatever the system already produces. Spec first, then measure. The metric itself is fine; the discipline it imposes is the actual product.
Every team building a RAG product hits the same wall: the demo works, production answers feel thin, nobody knows whether retrieval or generation is at fault. Context Recall splits that question cleanly. Buyers are starting to ask vendors for evaluation harnesses, not just accuracy claims. If you cannot show retrieval scores against a labeled set, you are competing on vibes — and vibes lose enterprise deals.
A bounded numerical score feels objective, but Context Recall depends entirely on whose reference answer you trust. Whoever writes the ground truth defines what “complete” means — and in domains like medical or legal question answering, that choice is consequential. The metric is honest about needing human labels. The harder question is who gets to label, what biases their answers carry, and whose questions never make it into the test set at all.