Faithfulness

Also known as: groundedness, factual consistency, source-faithfulness

Faithfulness is a RAG evaluation metric that measures whether a generated answer is factually consistent with the retrieved context: every claim should be traceable to a source document, not invented by the model. It is calculated as the ratio of supported claims to total claims in the response, producing a score between 0 and 1.

What It Is

When a RAG (Retrieval-Augmented Generation) chatbot answers a customer question, two things can go wrong: it can pull the wrong documents, or it can invent details that aren’t in the documents it pulled. Faithfulness catches the second failure. It scores whether the model stuck to what the retrieved sources actually said, or whether it slipped into making things up.

Think of faithfulness like a citation check on a student essay. If the student writes “the experiment ran for six weeks” but the cited paper says “four weeks,” that claim isn’t faithful — even if it sounds plausible and even if other parts of the essay are correct. Faithfulness asks: for every factual statement in the answer, is it backed by the source the system was given?

The score works the same way across the major evaluation frameworks. According to Ragas Docs (Faithfulness), it is calculated as the number of claims in the response supported by retrieved context, divided by the total number of claims. A perfect 1.0 means every statement traces back to a source; 0.5 means half the claims are unsupported, whether invented outright or contradicting the source. The computation runs in three steps: extract atomic claims from the response, verify each claim against the retrieved chunks, then compute the ratio.
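As a minimal sketch of that three-step loop in Python (the claim splitter and judge below are hypothetical stand-ins for the LLM or classifier calls a real framework would make):

    # Minimal faithfulness sketch. extract_claims and judge_supports are
    # hypothetical stand-ins for the LLM (or HHEM-style classifier) calls
    # that Ragas, DeepEval, or TruLens would make.

    def extract_claims(response: str) -> list[str]:
        # Real frameworks prompt an LLM to split the answer into atomic
        # claims; a naive sentence split is a crude stand-in.
        return [s.strip() for s in response.split(".") if s.strip()]

    def judge_supports(claim: str, contexts: list[str]) -> bool:
        # Hypothetical judge: a real one asks whether the retrieved
        # context entails the claim, not whether it contains the string.
        return any(claim.lower() in c.lower() for c in contexts)

    def faithfulness(response: str, contexts: list[str]) -> float:
        claims = extract_claims(response)
        if not claims:
            return 0.0
        supported = sum(judge_supports(c, contexts) for c in claims)
        return supported / len(claims)  # 1.0 = fully grounded

    ctx = ["The experiment ran for four weeks."]
    print(faithfulness("The experiment ran for six weeks.", ctx))  # 0.0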

Different tools name it differently but measure the same thing. According to TruLens Docs, the framework calls this metric “Groundedness” inside its RAG Triad. According to DeepEval Docs, the default pass threshold is 0.5, with a strict mode that scores binary pass or fail. The judge can be an LLM evaluating each claim, or a lightweight classifier — according to Ragas Docs (Faithfulness), the Vectara HHEM-2.1-Open model can replace the LLM judge for cheaper production-scale scoring.

How It’s Used in Practice

Most teams hit faithfulness when they ship an internal RAG chatbot — a documentation assistant, a policy lookup tool, a customer support bot — and start hearing complaints that “the answers sound right but are wrong in the details.” The fix isn’t always better retrieval. Sometimes the right documents are coming back, and the model is still embellishing. Faithfulness separates those two problems.

In a typical evaluation loop, the team builds a test set of representative questions paired with the chunks the retriever returns. They run the answers through a tool like Ragas, DeepEval, or LangSmith and get a faithfulness score per answer plus an aggregate. Answers that score below the threshold get reviewed by hand to spot the pattern: is the model adding numbers, inventing names, or combining facts that weren’t combined in the source? Each pattern points to a different fix — a tighter prompt, a stricter system message, or a smaller, more focused context window.
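A minimal sketch of that loop using Ragas (following the evaluate-with-metrics pattern from the Ragas docs; import paths shift between Ragas versions, the judge LLM must be configured separately, and the question, answer, and contexts values here are placeholders):

    # Sketch of a Ragas faithfulness run; assumes a judge LLM is configured
    # (by default Ragas calls OpenAI via OPENAI_API_KEY). Data is placeholder.
    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import faithfulness

    eval_set = Dataset.from_dict({
        "question": ["What is the refund window?"],
        "answer": ["Refunds are accepted within 30 days of purchase."],
        "contexts": [["Policy: refunds are accepted within 30 days."]],
    })

    report = evaluate(eval_set, metrics=[faithfulness])
    print(report)  # aggregate score; inspect per-answer rows for triage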

Pro Tip: Don’t trust an aggregate faithfulness score above 0.9 without spot-checking. The judge LLM rarely flags subtle drift like swapped dates or merged entities, so sample twenty answers from your top-scoring bucket and read them. That’s where the silent hallucinations hide.
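One way to run that spot check, assuming your eval tool exports per-answer scores as a list of dicts (the field names here are placeholders):

    # Spot-check sketch: sample up to 20 answers from the top-scoring bucket.
    import random

    # Placeholder rows; in practice, export these from your eval run.
    results = [
        {"question": "Q1", "answer": "A1", "faithfulness": 0.95},
        {"question": "Q2", "answer": "A2", "faithfulness": 0.97},
    ]

    top_bucket = [r for r in results if r["faithfulness"] > 0.9]
    for row in random.sample(top_bucket, min(20, len(top_bucket))):
        print(f'{row["faithfulness"]:.2f}  {row["question"]} -> {row["answer"]}')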

When to Use / When Not

Use it for:
- Evaluating a customer support chatbot grounded in a knowledge base
- Comparing two prompt versions on the same retrieved context
- Catching hallucinations in a regulated-domain Q&A bot (legal, medical, financial)

Avoid it for:
- Testing a creative writing assistant with no source documents
- Measuring whether facts are true in the world rather than true to source
- Running on every production query without sampling

Common Misconception

Myth: A high faithfulness score means the answer is correct. Reality: Faithfulness only measures consistency with the retrieved chunks, not truth. If the retriever pulls a wrong or outdated document, the model can quote it perfectly and score 1.0 while telling the user something false. Faithfulness is paired with Context Precision and Context Recall for a reason — together they cover both halves of the problem.

One Sentence to Remember

Faithfulness tells you whether your model stayed inside the lines the retriever drew; it does not tell you whether those lines pointed at the truth, so always read it next to your context metrics.

FAQ

Q: What’s the difference between faithfulness and a hallucination metric? A: Faithfulness checks consistency with retrieved context only. A general hallucination metric — like DeepEval’s separate HallucinationMetric — compares the answer to ground-truth knowledge, regardless of what was retrieved.

Q: What’s a good faithfulness score for production? A: According to DeepEval Docs, the default pass threshold is 0.5. Most teams treat that as a floor and aim well above it for regulated domains, with the exact bar set by domain risk tolerance.
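As a sketch of how that threshold is set in DeepEval (following the FaithfulnessMetric pattern in the DeepEval docs; the test-case values are placeholders, and measure() needs a judge LLM configured):

    # DeepEval faithfulness check with the threshold raised above the 0.5
    # default; values are placeholders, and measure() uses a judge LLM
    # (OpenAI by default).
    from deepeval.metrics import FaithfulnessMetric
    from deepeval.test_case import LLMTestCase

    case = LLMTestCase(
        input="What is the refund window?",
        actual_output="Refunds are accepted within 30 days of purchase.",
        retrieval_context=["Policy: refunds are accepted within 30 days."],
    )

    metric = FaithfulnessMetric(threshold=0.9)  # stricter than the 0.5 default
    metric.measure(case)
    print(metric.score, metric.is_successful())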

Q: Why does the same answer get different faithfulness scores from different tools? A: Each framework uses a different judge model and prompt to verify claims. Ragas, DeepEval, and TruLens's Groundedness can disagree on the same input because they extract and check claims differently.

Expert Takes

Faithfulness is not a truth check. It is a constraint check. The model’s job under this metric is to map every output token back onto evidence already in the prompt — nothing more. When you confuse it with hallucination detection, you build the wrong intuition. The model can be perfectly faithful to a wrong document and perfectly unfaithful to a correct one. The metric measures discipline, not correctness.

Faithfulness becomes useful the moment you stop reading aggregate scores and start reading the failures. Pull the worst answers, line them up against their retrieved chunks, and the pattern shows up fast — usually a too-broad prompt or a context window that swallowed too many sources. The metric is a debugging surface, not a dashboard number. Wire it into your eval set first, your CI second, and your prompt iteration loop third.

Every team shipping a RAG product is going to publish a faithfulness number eventually. Buyers in regulated industries will start asking for it the way they already ask for security audits. If your evaluation story is “we tested it manually,” you are losing the procurement conversation. Get the metric into your pitch deck and your trust page before your competitor does. This stops being optional the moment regulators start enforcing AI rules with real teeth.

A high faithfulness score is comforting, and that comfort is dangerous. The metric is honest about a narrow question — did the model echo the sources? — but it cannot answer the question users actually have, which is whether they should believe the answer. Who curates the source set? Who audits the retriever’s silent omissions? Reporting faithfulness without reporting what was excluded from retrieval is a half-truth dressed as a number.