
Prerequisites for RAG Grounding: Retrieval Quality, the RAG Triad, and Faithfulness Metrics

[Figure: a RAG pipeline split into three measurement points — retrieval relevance, generation faithfulness, answer relevance — with a triangle overlay]

ELI5

**RAG guardrails and grounding** is not a wall you bolt onto the output of a retrieval-augmented system. It is a measurement framework with three checkpoints — was the right context retrieved, did the answer stay faithful to it, and did it address the user’s question?

The first time most teams add guardrails to a RAG system, they reach for a hallucination filter, wire it to the output, and call the work done. Six weeks later the same teams discover the hallucinations were never quite the bug they thought. The retrieval was returning loosely related chunks, the generator was paraphrasing those chunks faithfully, and the filter was silently approving every confident-sounding answer that happened to be grounded in the wrong evidence. Guardrails do not sit on top of RAG. They sit underneath — in the metrics that decide whether each stage of the pipeline is even functioning.

The Mental Models You Need Before Guarding a RAG System

A RAG pipeline has three failure surfaces, not one. Retrieval can return irrelevant or incomplete chunks. Generation can ignore the chunks it was given and free-associate from the model’s pretraining. The final answer can be technically supported by the retrieved context yet still miss what the user actually asked. Guardrails are interventions on these three surfaces; you cannot intervene on a surface you cannot measure.

The mental shift is small but consequential: stop thinking of grounding as a property of the answer and start thinking of it as a property of the pipeline.

What do you need to understand before implementing RAG guardrails?

Three background threads, in order.

First, retrieval evaluation. If you cannot score the retriever in isolation, you cannot tell whether downstream errors are coming from bad context or bad generation. The standard metrics split into two families. Order-unaware metrics — Precision@k = relevant in top-k / k, and Recall@k = relevant in top-k / total relevant in the corpus — measure whether the right documents made it into the cutoff at all (Weaviate Blog). Order-aware metrics — Mean Reciprocal Rank, where MRR=1 means the first result is always correct, and Normalized Discounted Cumulative Gain, which rewards placing more relevant docs higher and is preferred when relevance is graded rather than binary — measure whether the retriever ranked the good results above the noise (Weaviate Blog; Towards Data Science). Without one metric from each family, you are flying half-blind.
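
These definitions are compact enough to implement directly. Below is a minimal sketch in plain Python, where `retrieved` is a ranked list of document IDs, `relevant` is the gold set for one query, and `gains` maps IDs to graded relevance; all three structures are illustrative assumptions, not any library’s types:

```python
from math import log2

def precision_at_k(retrieved, relevant, k):
    """Order-unaware: share of the top-k results that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Order-unaware: share of all relevant docs that made the top-k."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def mrr(runs):
    """Order-aware: mean of 1/rank of the first relevant hit per query.
    runs is a list of (retrieved, relevant) pairs, one per query."""
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)

def ndcg_at_k(retrieved, gains, k):
    """Order-aware: DCG of the ranking divided by the ideal DCG,
    for graded (non-binary) relevance. gains maps doc ID -> grade."""
    dcg = sum(gains.get(doc, 0) / log2(rank + 1)
              for rank, doc in enumerate(retrieved[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0
```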

Second, the sparse versus dense retrieval split. The retriever you ship determines what kinds of failures the guardrails downstream have to catch. Sparse methods like BM25 fail on paraphrase; dense embedding methods fail on rare named entities. Hybrid retrieval is now the default precisely because the two failure modes are orthogonal. The guardrail layer cannot fix a retrieval method that is structurally blind to the user’s query.
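
One standard way to build that hybrid is reciprocal rank fusion, which merges the two ranked lists without needing their scores to be comparable. RRF is not named in the text above, so treat this as one common option rather than a prescription; the document IDs are toy values:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists into one. Each doc scores 1/(k + rank)
    per list it appears in; k=60 is the conventional smoothing constant."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc_7", "doc_2", "doc_9"]   # lexical: strong on rare named entities
dense_ranking = ["doc_2", "doc_4", "doc_7"]  # embeddings: strong on paraphrase
print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))
# doc_2 and doc_7 rise to the top because both retrievers surfaced them
```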

Third, the distinction between grounding and retrieval quality. These two phrases are used almost interchangeably in vendor blog posts, and the conflation matters. Strictly, grounding refers to generation faithfulness — whether the answer’s claims can be traced back to the retrieved context. Retrieval quality refers to whether the right context was retrieved in the first place. A perfectly grounded answer that cites the wrong document is still wrong; a poorly grounded answer over the right document is also wrong. The two failures look identical to the end user and have completely different fixes.

Why isn’t a hallucination filter alone enough?

A hallucination filter operates on the (answer, context) pair after both have been produced. It can catch one specific failure: the generator inventing facts not present in the retrieved context. It cannot catch the retrieval that returned irrelevant chunks, because by the time the filter runs the irrelevant chunks have already been promoted to “ground truth” for the faithfulness check. It cannot catch the answer that is faithful to the context but unresponsive to the question, because faithfulness and relevance are independent dimensions.

Not a filter. A framework.

The filter is one component of a measurement framework that also has to score the retriever and score the answer-question alignment. Without those two upstream checks, the filter is verifying a fluent paraphrase of whatever happened to land in the context window — an audit of the wrong artifact.

The RAG Triad and Its Reference Implementations

The cleanest articulation of this three-checkpoint framework is the RAG Triad, introduced in 2023 by the TruEra team in the TruLens open-source project and computed via LLM-as-a-Judge feedback functions (TruEra). The same three checks appear under different names in other libraries — Ragas calls them Context Precision, Faithfulness, and Answer Relevancy; DeepEval uses similar splits — so the Triad is best understood as a naming convention for an idea that the field converged on independently.

The shift the Triad demands is from one number to three.

What is the RAG Triad and why is it the prerequisite framework for grounding?

The Triad scores a RAG response on three independent axes (TruLens Docs); a runnable sketch for computing all three follows the list:

  • Context Relevance — verifies that the retrieved chunks are relevant to the query, before generation runs. This is a guardrail on the retriever.
  • Groundedness — measures the extent to which claims in the LLM response can be attributed back to source text. The response is split into atomic claims, and each claim is independently checked against the retrieved context. This is the guardrail on the generator.
  • Answer Relevance — confirms the final response actually addresses the user input, regardless of whether it was grounded in the right evidence.
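
In Ragas’s naming, all three checks can be run in a few lines. Below is a sketch assuming the Ragas 0.1-style evaluate API, an OpenAI key in the environment for the default judge, and invented sample texts; column requirements (such as ground_truth for context_precision) vary across versions, so check the docs for yours:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

sample = Dataset.from_dict({
    "question": ["What is the refund window?"],
    "answer": ["Refunds are accepted within 30 days of purchase."],
    "contexts": [["Policy: refunds are accepted within 30 days of purchase, with receipt."]],
    "ground_truth": ["Refunds are accepted within 30 days."],  # used by context_precision
})

# One score per axis, each in the 0-1 range, higher is better.
result = evaluate(sample, metrics=[context_precision, faithfulness, answer_relevancy])
print(result)
```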

The three axes catch failure classes the others cannot see. A high-groundedness, low-context-relevance answer is a faithful summary of the wrong document. A high-context-relevance, low-groundedness answer is a confident hallucination on top of correct evidence. A response that scores high on both of those axes but low on answer relevance is the model answering a question the user did not ask. Any guardrail strategy that does not measure all three is, by construction, blind to two of these failure modes.

The implementations matter because each picks a different mechanism for the central question — how do you decide if a claim is supported by a chunk of text?

The RAG evaluation ecosystem has settled on three families of approaches.

LLM-as-a-Judge. TruLens, Ragas, and DeepEval all default to prompting a separate LLM to judge entailment between answer and context. Ragas formalizes its Faithfulness score as (claims in answer that can be inferred from retrieved context) / (total claims in answer), with the calculation done in two steps: extract atomic claims from the generated answer, then cross-check each claim against the retrieved context for entailment (Ragas Docs). The score lives in the 0–1 range, higher is better. The judge can be GPT-4-class, Claude, or any model capable of structured entailment judgments — and that flexibility is also the method’s main weakness, because the score depends on which judge you pick.
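
The two-step calculation is simple to express once the judge is abstracted away. In the sketch below, extract_claims and entails are hypothetical stand-ins for whichever judge (LLM prompt or classifier) performs each step:

```python
def faithfulness_score(answer, contexts, extract_claims, entails):
    """Ragas-style faithfulness: supported atomic claims / total claims.

    extract_claims(answer) -> list[str]   # hypothetical judge call, step 1
    entails(context, claim) -> bool       # hypothetical judge call, step 2
    """
    claims = extract_claims(answer)
    if not claims:
        return 0.0
    context = "\n".join(contexts)
    supported = sum(1 for claim in claims if entails(context, claim))
    return supported / len(claims)  # 0-1 range, higher is better
```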

Specialized hallucination-detection models. Vectara ships HHEM-2.1-Open as a T5-based open-weights classifier on Hugging Face — under 600 MB at 32-bit precision, roughly 1.5 seconds for a 2k-token input on a modern x86 CPU (Vectara Hugging Face). The closed-weight successor HHEM-2.3 powers Vectara’s public hallucination leaderboard via API, while the open variant can be plugged into Ragas as the entailment classifier in the second step of the Faithfulness calculation, replacing the LLM judge with a smaller, deterministic, locally-runnable model (Ragas Docs).
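
Running HHEM-2.1-Open locally follows the usage pattern on the model’s Hugging Face card at the time of writing; the predict helper ships with the model via trust_remote_code, and the premise/hypothesis pair here is invented:

```python
from transformers import AutoModelForSequenceClassification

# (retrieved context, answer claim) pairs; texts are illustrative.
pairs = [
    ("Policy: refunds are accepted within 30 days of purchase, with receipt.",
     "Refunds are accepted within 30 days."),
]

model = AutoModelForSequenceClassification.from_pretrained(
    "vectara/hallucination_evaluation_model", trust_remote_code=True
)
scores = model.predict(pairs)  # one consistency score in [0, 1] per pair; higher = more grounded
```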

Reasoning-grade detector models. Patronus Lynx fine-tunes Llama-3-70B-Instruct (also released as an 8B variant) specifically for hallucination detection. As of its July 2024 release, Lynx scored 87.4% accuracy on HaluBench against GPT-4o’s 86.5%, becoming the first open-source detector to surpass GPT-4o on that benchmark; HaluBench itself is a 15,000-sample benchmark drawn from real-world domains including Finance and PubMedQA medical QA, released alongside the model (Patronus AI; arXiv Lynx paper). Lynx’s differentiator is that it produces a score and the reasoning for the verdict, which makes its judgments auditable in a way that black-box scores are not.
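
Because Lynx is an open-weights fine-tune, it can be served with standard tooling. Below is a sketch assuming the 8B checkpoint published under the PatronusAI organization on Hugging Face; the prompt is an illustrative placeholder, not the exact template the model was fine-tuned on, which the model card specifies:

```python
from transformers import pipeline

# 8B variant; the 70B model needs multi-GPU serving.
detector = pipeline("text-generation",
                    model="PatronusAI/Llama-3-Patronus-Lynx-8B-Instruct")

# Placeholder prompt -- use the exact template from the Patronus model card.
prompt = (
    "QUESTION: What is the refund window?\n"
    "DOCUMENT: Refunds are accepted within 30 days of purchase, with receipt.\n"
    "ANSWER: Refunds are accepted within 90 days.\n"
    "Is the ANSWER faithful to the DOCUMENT? Give a verdict and your reasoning."
)
print(detector(prompt, max_new_tokens=256)[0]["generated_text"])
```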

How do production guardrails wire these scores into the response loop?

A score is not yet a guardrail. The guardrail is the action the system takes when the score crosses a threshold. NeMo Guardrails is the reference open-source framework for this wiring. Its fact-checking rail offers two built-in approaches: a Self-Check approach where an LLM verifies entailment between the response and the evidence and is allowed to abstain, and an AlignScore approach using a built-in RoBERTa-based factual-consistency scorer (NVIDIA NeMo Guardrails Docs). The framework exposes a $relevant_chunks context variable that holds the retrieved evidence used for the fact-checking step — making the wiring between the retriever’s output and the guardrail’s input explicit rather than implicit.
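
A sketch of that wiring, following the pattern in the NeMo Guardrails documentation; it assumes a ./config directory whose config.yml enables the fact-checking output rail, and the evidence string is invented:

```python
from nemoguardrails import LLMRails, RailsConfig

config = RailsConfig.from_path("./config")  # rails config enabling fact-checking
rails = LLMRails(config)

# The "context" message populates $relevant_chunks, the evidence the
# fact-checking rail verifies the response against.
response = rails.generate(messages=[
    {"role": "context",
     "content": {"relevant_chunks": "Refunds are accepted within 30 days of purchase."}},
    {"role": "user", "content": "What is the refund window?"},
])
print(response["content"])
```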

The 0.20.0 release (January 2026) is the current canonical entrypoint, and the repository now lives at github.com/NVIDIA-NeMo/Guardrails after the move into the NVIDIA-NeMo organization (NVIDIA NeMo Guardrails Repo). Older tutorials pointing at the previous repo path still resolve via redirect, but new integrations should target the new canonical path.

[Figure: three-panel diagram of the RAG Triad — Context Relevance scoring the retriever, Groundedness scoring the generator via claim decomposition, and Answer Relevance scoring the final response against the user query]
The RAG Triad isolates three independent failure surfaces — retrieval, generation, and answer relevance — and assigns one metric to each.

What the Triad Predicts — and Where It Breaks

Once you hold the three-axis structure, the catalogue of guardrail tools stops looking like a market of competing products and starts looking like engineered responses to one of three failure surfaces. The framework gives you a diagnostic discipline: when a RAG system misbehaves, you can localize the fault before you start changing components.

If Context Relevance is low, expect downstream Groundedness to look misleadingly fine — the generator will faithfully ground its answer in the wrong evidence. The fix is upstream, in the retriever or the chunking strategy, not in a stronger hallucination filter.

If Context Relevance is high but Groundedness is low, expect that the generator is over-relying on its pretraining priors. The fix is in the generation prompt or in adding a dedicated detector model on the output, not in tightening the retriever.

If both Context Relevance and Groundedness are high but Answer Relevance is low, expect the failure to be in question understanding — query rewriting, intent classification, or a missing step that decomposes multi-part questions. No improvement to retrieval or generation will fix this.
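
That triage reduces to a small decision procedure. Here is a sketch with a single illustrative threshold; real thresholds must be calibrated against your own judge configuration, per the caveat at the end of this section:

```python
def localize_fault(context_relevance, groundedness, answer_relevance,
                   threshold=0.7):  # illustrative; calibrate per judge
    """Map the three Triad scores to the pipeline layer that needs work,
    checked in upstream-to-downstream order."""
    if context_relevance < threshold:
        return "retriever"     # fix retrieval/chunking; groundedness may look fine
    if groundedness < threshold:
        return "generator"     # fix the prompt or add an output detector
    if answer_relevance < threshold:
        return "question understanding"  # query rewriting, intent, decomposition
    return "healthy"
```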

Match the metric to the failure surface; do not match the loudest tool to the loudest symptom. A faithfulness-focused guardrail on a retrieval-broken pipeline buys you nothing but a more confident description of the same wrong answer.

Rule of thumb: measure all three Triad axes before you choose any guardrail product. The shape of the failure tells you which layer to harden, and most teams discover that two of the three axes are already healthy and only one needs intervention.

When it breaks: every score in this framework is judge-dependent. Ragas, TruLens, and DeepEval all default to LLM-as-a-Judge for the entailment check, and the resulting scores vary across judge models, prompt versions, and even temperature settings. A faithfulness score of 0.78 from GPT-4 is not directly comparable to 0.78 from Claude or from HHEM-2.1-Open. Treat the Triad as a trend signal within a fixed judge configuration, not as a portable absolute number you can quote across systems or vendors.

The Data Says

RAG guardrails are not a hallucination problem solved at the output. They are a measurement discipline applied at three checkpoints — retrieval, generation, and answer relevance — with each checkpoint catching failures invisible to the others. The reference implementations of this discipline (TruLens, Ragas, HHEM, Lynx, NeMo Guardrails) differ in how they score entailment, not in whether the three checkpoints are needed. That part the field has already settled.
