RAG Evaluation Explained: Faithfulness, Relevance, Context Metrics

ELI5
RAG evaluation measures a Retrieval-Augmented Generation pipeline as two separable subsystems — the retriever that fetches documents and the generator that writes the answer — using LLM-as-a-judge metrics like Faithfulness, Answer Relevancy, Context Precision, and Context Recall.
A team puts a Retrieval-Augmented Generation chatbot into production. The answers read beautifully. Confidence is high. Then a domain expert asks an uncomfortable question — are these numbers actually in the source documents, or did the model invent them? — and nobody on the team can produce a defensible answer. The pipeline returns text. It does not return evidence about whether that text is grounded, on-topic, or sourced from the right material at all. RAG evaluation is the discipline that turns those three opaque questions into numbers.
The hidden assumption that fluency equals correctness
Most teams audit their RAG pipeline by reading outputs and nodding. The assumption underneath that habit is that an answer which sounds coherent must be drawing on the retrieved context, and that the retrieved context must be the right context. Both halves of that belief can fail independently — and a coherent-sounding answer is exactly the kind of failure that survives review.
Not intuition. Decomposition.
The Ragas paper (Es et al., 2023), which crystallised the modern vocabulary, makes the move explicit: stop scoring the pipeline as one black box, and start scoring it as two subsystems wired in series. The retriever is judged on what it pulls out of the index. The generator is judged on what it does with what the retriever handed it. A failure in one looks nothing like a failure in the other, and a single quality score that mixes them tells you neither where to look nor what to fix.
Two subsystems, two scoreboards
The first conceptual move in any serious RAG evaluation framework is to refuse to evaluate the system end-to-end as a single artifact. Retrieval and generation have different failure modes; they need different metrics. Once that split is in place, the metric vocabulary becomes legible.
What is RAG evaluation?
RAG evaluation is the practice of measuring a Retrieval-Augmented Generation pipeline as two separable subsystems — a retriever judged by how well it surfaces relevant evidence, and a generator judged by how faithfully and relevantly it answers using that evidence. The vocabulary is anchored in the Ragas paper (Es et al., 2023) and the parallel TruLens “RAG Triad” framework, both of which converged on the same insight: end-to-end accuracy hides where the pipeline is breaking.
The Ragas formulation, which has become the de facto reference set, uses four core metrics — Faithfulness, Answer Relevancy, Context Precision, and Context Recall (Ragas Docs). Two evaluate the retriever. Two evaluate the generator. A pipeline can score 0.95 on one pair and 0.42 on the other, and that asymmetry is the whole point — it tells you which subsystem to invest in.
What makes this measurable in practice is a property the Ragas paper introduced and called reference-free: most of these metrics can be computed from the question, the retrieved context, and the response alone, without a hand-labelled ground-truth answer (Ragas Docs). An LLM acts as the judge, breaking the response into atomic claims, attributing each one back to the context, and counting matches. The scoring is not free of subjectivity — but the subjectivity is moved out of the human reviewer’s head and into a procedure you can run on a thousand questions before lunch.
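What “reference-free” means operationally is that a scoring pass needs only those three artifacts. Below is a minimal sketch using the Ragas v0.2+ class-based API; the class and field names follow the Ragas docs at the time of writing, the sample content is invented, and the judge LLM and embedding model are assumed to be configured via environment (e.g. an OpenAI key), so treat this as a shape rather than a pinned recipe.

```python
from ragas import evaluate, EvaluationDataset
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import Faithfulness, ResponseRelevancy

# One evaluation row: question, retrieved context, response. No reference answer.
sample = SingleTurnSample(
    user_input="What is the refund window?",
    retrieved_contexts=["Refunds are accepted within 30 days of purchase."],
    response="You can request a refund within 30 days of purchase.",
)

# Faithfulness and ResponseRelevancy are reference-free; Context Recall would
# additionally require a `reference` field on each sample.
results = evaluate(
    dataset=EvaluationDataset(samples=[sample]),
    metrics=[Faithfulness(), ResponseRelevancy()],
)
print(results)
```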
The TruLens project arrived at a near-identical decomposition under different names: Context Relevance for the retriever’s output, Groundedness for whether the answer is supported by that context, and Answer Relevance for whether the answer addresses the question (TruLens Docs). The mapping is close enough that teams routinely move between the two vocabularies — Context Relevance maps to Context Precision, Groundedness maps to Faithfulness, Answer Relevance maps to Answer Relevancy — but the names are not identical, and conflating them silently produces noisy comparisons.
The mechanics of measurement
Every one of these metrics resolves, eventually, to a fraction. Once you see that, the framework stops feeling like vibes-as-a-service and starts feeling like a small library of well-defined estimators.
How does RAG evaluation work across retrieval and generation stages?
The pipeline produces three artifacts for every query: the question, the retrieved context (a list of chunks), and the response. RAG evaluation runs different probes against different combinations of those three.
On the retrieval side, the question is what evidence got pulled out of the index. Context Precision asks: of the chunks the retriever returned in the top-K, were the relevant ones ranked early? The Ragas formula is Context Precision@K = (Σₖ₌₁ᴷ Precision@k × vₖ) / (total number of relevant items in the top K), where vₖ ∈ {0,1} flags whether the chunk at rank k is relevant (Ragas Docs). The shape of that formula matters: it punishes a retriever that buries the right answer at rank 9, even though that answer did make it into the result set.
Context Precision captures signal-to-noise at the top of the ranked list, not just whether relevant material is anywhere in the result set.
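The arithmetic is small enough to hold in your head. A self-contained sketch of the formula above, with invented relevance flags standing in for the per-chunk verdicts an LLM judge would produce:

```python
def context_precision_at_k(relevance_flags):
    """Ragas-style Context Precision@K.

    relevance_flags: 0/1 per retrieved chunk, in rank order; this is the v_k
    in the formula, and in practice it comes from an LLM judge.
    """
    total_relevant = sum(relevance_flags)
    if total_relevant == 0:
        return 0.0
    hits, weighted = 0, 0.0
    for k, v_k in enumerate(relevance_flags, start=1):
        if v_k:
            hits += 1
            weighted += hits / k  # Precision@k, counted only at relevant ranks
    return weighted / total_relevant

# Same two relevant chunks, different rankings: burying them is punished.
print(context_precision_at_k([1, 1, 0, 0, 0]))  # 1.0
print(context_precision_at_k([0, 0, 0, 1, 1]))  # 0.325
```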
Context Recall asks the complementary question: did the retriever pull back enough of the right evidence to support the ideal answer? The DeepEval formulation has an LLM judge attribute each statement of the ground-truth answer back to the retrieved chunks, and scores the fraction that can be sourced (DeepEval Docs (Contextual Recall)). Context Recall is the one Ragas metric that genuinely needs a reference answer — recall is undefined without something to recall against. Treat any pipeline that claims to measure recall without ground truth with suspicion.
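The attribution step is an LLM call, but everything around it is counting. A sketch with the judge stubbed out; `is_attributable` is a hypothetical stand-in for the judge prompt, not a library function:

```python
def context_recall(ground_truth_statements, retrieved_chunks, is_attributable):
    """Fraction of ground-truth statements attributable to the retrieved chunks."""
    if not ground_truth_statements:
        # Recall is undefined without a reference answer to recall against.
        raise ValueError("Context Recall requires a ground-truth answer")
    supported = sum(
        1 for statement in ground_truth_statements
        if is_attributable(statement, retrieved_chunks)  # one judge call each
    )
    return supported / len(ground_truth_statements)
```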
On the generation side, the probes pivot. Faithfulness asks: do the claims in the response actually appear in the retrieved context? The Ragas formula is (# claims supported by retrieved context) / (total # claims in response), scored between 0 and 1, with higher meaning more grounded (Ragas Docs (Faithfulness)). DeepEval implements the same idea — # truthful claims / total claims — and ships a default pass threshold of 0.5 with an optional strict mode that forces a binary 0/1 verdict (DeepEval Docs).
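Faithfulness is the same counting pattern pointed the other way: decompose the response instead of the reference. In the sketch below, `extract_claims` and `is_supported` are hypothetical stand-ins for the two LLM-judge steps, and the strict-mode behaviour mimics DeepEval's semantics rather than quoting its implementation:

```python
def faithfulness(response, retrieved_context, extract_claims, is_supported,
                 strict=False):
    """(# claims supported by the context) / (total # claims), in [0, 1]."""
    claims = extract_claims(response)  # judge step 1: split into atomic claims
    supported = sum(1 for claim in claims
                    if is_supported(claim, retrieved_context))  # judge step 2
    score = supported / len(claims)
    if strict:
        return 1.0 if score == 1.0 else 0.0  # binary verdict, DeepEval-style
    return score  # compare against a pass threshold (DeepEval default: 0.5)
```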
Answer Relevancy reverses the arrow. The Ragas implementation generates N synthetic questions (default 3) from the response and computes the mean cosine similarity between embeddings of those reverse-engineered questions and the original query (Ragas Docs (Response Relevancy)). The intuition is mechanical: if the response is on-topic, you should be able to reconstruct something close to the original question by reading only the answer. If the answer drifts, the reconstructed questions drift with it, and similarity collapses.
The score range here has a quiet wrinkle worth flagging — cosine similarity is theoretically bounded between −1 and 1, even though sensible RAG outputs almost always land between 0 and 1. A 0.3 here is not the same as a 0.3 on a clamped scale.
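Once the N questions are generated and embedded, the metric is a mean of dot products. A sketch over raw embedding vectors; generating the reverse-engineered questions and embedding them is assumed to happen upstream:

```python
import numpy as np

def answer_relevancy(original_question_emb, generated_question_embs):
    """Mean cosine similarity between the original question and the questions
    reverse-engineered from the response (Ragas generates 3 by default).
    Cosine similarity lives in [-1, 1], so the score is not clamped to [0, 1]."""
    q = np.asarray(original_question_emb, dtype=float)
    q = q / np.linalg.norm(q)
    sims = [
        float(q @ (g / np.linalg.norm(g)))
        for g in (np.asarray(e, dtype=float) for e in generated_question_embs)
    ]
    return sum(sims) / len(sims)
```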
What are the core components of a RAG evaluation framework?
A working RAG evaluation framework, as it has standardised across Ragas, DeepEval, and TruLens, has five moving parts.
| Component | What it does | Examples |
|---|---|---|
| Metric library | Defines the scoring functions for retriever and generator | Ragas (4 metrics), DeepEval (5 RAG metrics), TruLens (3-metric Triad) |
| LLM-as-a-judge | The scoring engine that decomposes claims, judges relevance, attributes evidence | GPT-class models by default; calibrate against human labels |
| Eval dataset | A set of questions, optionally with ground-truth answers, used to drive the metrics | Reference-free for Faithfulness/Answer Relevancy/LLM-Context-Precision; reference-needed for Context Recall |
| Test runner | Executes the pipeline against the dataset, captures traces, computes metrics | DeepEval (“Pytest for LLMs”), Ragas evaluate(), LangSmith eval runs |
| Observability layer | Stores traces, metric scores, and embeddings over time so regressions are visible | LangSmith (native LangChain integration), Arize Phoenix (OpenTelemetry-native, framework-agnostic) |
The metric library and the LLM judge together produce a number; the dataset and runner make those numbers comparable across runs; the observability layer makes them comparable across time. A team that has the metric library and judge but skips the dataset and runner is doing one-shot evaluation, not continuous evaluation, and a RAG pipeline that drifts silently after a corpus update is exactly the kind of system continuous evaluation exists to catch.
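The “Pytest for LLMs” framing is literal: an eval run is a test suite. A minimal DeepEval sketch; the metric and test-case names follow the DeepEval docs at the time of writing, the example content is invented, and the judge model is assumed to be configured via environment:

```python
from deepeval import assert_test
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

def test_refund_answer_is_grounded():
    test_case = LLMTestCase(
        input="What is the refund window?",
        actual_output="You can request a refund within 30 days of purchase.",
        retrieval_context=["Refunds are accepted within 30 days of purchase."],
    )
    # Default pass threshold is 0.5; strict mode would force a 0/1 verdict.
    assert_test(test_case, [FaithfulnessMetric(threshold=0.5)])
```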
Phoenix’s distinguishing move at the observability layer is to project the embedding space of retrieved chunks into 2D or 3D so that retrieval drift becomes literally visible — a cluster that used to overlap with the query embeddings starts pulling away. Numbers tell you something is wrong; geometry sometimes tells you what.
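Phoenix's own view is built on UMAP; as a dependency-light illustration of the idea (and explicitly not Phoenix's API), the same geometry can be approximated by projecting chunk and query embeddings into one shared plane with PCA:

```python
import numpy as np

def project_to_2d(chunk_embs, query_embs):
    """Project both embedding sets into a shared 2D plane (PCA via SVD) so a
    chunk cluster drifting away from the query cluster becomes visible."""
    X = np.vstack([chunk_embs, query_embs]).astype(float)
    X -= X.mean(axis=0)
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    coords = X @ vt[:2].T
    n = len(chunk_embs)
    return coords[:n], coords[n:]  # chunk points, then query points
```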

What the scores predict about your failures
Once the four metrics are in place, the cross-table of high and low scores becomes a diagnostic chart. The mechanism is the math; the value is the failure mode each pattern points at, and a toy triage sketch follows the list.
- If Faithfulness is low but Answer Relevancy is high, the model is fluently fabricating — generating on-topic statements the retrieved context does not support. Tighten generation: lower temperature, add explicit instructions to refuse when context is insufficient.
- If Context Precision is high but Context Recall is low, the retriever is conservative — what it returns is on-topic, but it is missing chunks the answer needs. Increase top-K, revisit chunking strategy, or upgrade the embedding model.
- If Context Recall is high but Context Precision is low, the retriever is dragging in noise. The generator either gets confused or wastes context window. Add a reranker or tighten the relevance threshold.
- If Faithfulness and Answer Relevancy are both high but users still complain, suspect Context Recall — the retrieved evidence may be self-consistent but incomplete, and the model is faithfully answering from a partial picture.
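The patterns are mechanical enough to script. In the toy sketch below, the threshold values are illustrative, not taken from any of the cited frameworks:

```python
def triage(scores):
    """Map the four-metric score vector to the subsystem to investigate first.
    Keys follow the Ragas metric names; thresholds are illustrative only."""
    f = scores["faithfulness"]
    ar = scores["answer_relevancy"]
    cp = scores["context_precision"]
    cr = scores["context_recall"]
    if f < 0.7 and ar >= 0.8:
        return "generator: fluent fabrication; tighten grounding instructions"
    if cp >= 0.8 and cr < 0.7:
        return "retriever: too conservative; raise top-K or revisit chunking"
    if cr >= 0.8 and cp < 0.7:
        return "retriever: noisy; add a reranker or tighten the threshold"
    if f >= 0.8 and ar >= 0.8 and cr < 0.8:
        return "retrieval completeness: faithful answer, partial evidence"
    return "no single-subsystem signature; inspect the traces"
```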
A common practical heuristic, surfaced in the Redis on Ragas guide, is that scores at or above 0.8 on the four core metrics typically signal production-ready quality, though the exact threshold drifts by domain (Redis on Ragas).
Rule of thumb: Always read the four scores as a vector, not as an average. Averaging Faithfulness with Context Recall throws away the diagnostic signal that justified the framework in the first place.
When it breaks: Faithfulness measures consistency between the response and the retrieved context, not agreement with the world. If the retrieved context itself is wrong, a “faithful” answer can still be factually false; the metric will quietly score the pipeline well while the user gets a confident, well-cited lie.
Compatibility notes:
- Ragas v0.1.x metric APIs: Pre-v0.2 instantiation patterns were restructured around v0.2 (Ragas Docs). Legacy code still runs in v0.4.x, but newer tutorials assume the class-based API. Action: prefer the v0.2+ class-based instantiation when starting a new project.
- LangChain legacy chains: `LLMChain` and other legacy chains used in older RAG eval tutorials moved to `langchain-classic` in LangChain 1.0 (released Oct/Nov 2025). LangSmith eval examples have been rewritten around `create_agent` and LCEL. Action: avoid copy-pasting `LLMChain`-based eval scripts from pre-2026 blog posts.
The judge needs its own audit
There is a deeper consequence baked into the framework: the LLM acting as a judge is itself a model with biases, and the metric is only as reliable as the judge’s calibration. Snowflake’s own benchmarking of the RAG Triad shows that judge prompts often need calibration against human labels before their agreement with human raters reaches usable levels (Snowflake Engineering Blog). Treat any RAG metric score as a measurement, not as a truth — a number with an error bar that depends on which model you used to compute it and how its prompt was tuned.
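The minimum calibration check is agreement between the judge's verdicts and human labels on the same examples, ideally corrected for chance agreement. A sketch for binary pass/fail labels, using the standard Cohen's kappa formula:

```python
def judge_calibration(judge_verdicts, human_labels):
    """Raw agreement rate and Cohen's kappa between binary (0/1) LLM-judge
    verdicts and human labels on the same examples."""
    assert judge_verdicts and len(judge_verdicts) == len(human_labels)
    n = len(judge_verdicts)
    p_observed = sum(j == h for j, h in zip(judge_verdicts, human_labels)) / n
    p_judge = sum(judge_verdicts) / n   # judge's positive rate
    p_human = sum(human_labels) / n     # humans' positive rate
    p_chance = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    kappa = (p_observed - p_chance) / (1 - p_chance) if p_chance < 1 else 1.0
    return p_observed, kappa
```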
This is not a flaw of the framework. It is the unavoidable cost of automating quality judgement. The reason RAG evaluation works at all — that an LLM can tear a paragraph into claims and check them — is the same reason its scores require their own validation pass.
The Data Says
RAG evaluation works because retrieval and generation fail in different ways and demand different probes. The Ragas four — Faithfulness, Answer Relevancy, Context Precision, Context Recall — read together as a diagnostic vector that points at the broken subsystem. Read them as an average and you reintroduce the black box you were trying to open.