Answer Relevancy
Also known as: Response Relevancy, Answer Relevance
Answer Relevancy is a generation-side RAG evaluation metric that measures how directly a generated response addresses the user’s original question. Scores fall on a 0–1 scale, where higher means better alignment; the metric does not check factual correctness, only whether the answer stays on-topic and avoids irrelevant padding.
What It Is
RAG systems can fail in a particularly frustrating way. The retrieval step finds the right documents, the model grounds its answer in those documents, and the response is technically accurate — but it doesn’t actually address what the user asked. The model rambles, hedges, partially answers, or pivots to a related but different topic. Answer Relevancy exists to catch this class of failure, where the answer is correct in isolation but wrong as a response to the question. It is one of the three core RAG evaluation metrics, alongside Faithfulness and Context Relevance.
The metric works by reverse engineering. A judge model reads the generated answer and produces several hypothetical questions that the answer could plausibly be a response to. Each generated question is converted to an embedding — a numerical representation of meaning — and these embeddings are compared to the embedding of the original user question using cosine similarity. The mean similarity score is the relevancy result, falling between 0 and 1, where higher values mean the answer aligns with what was asked. According to Ragas Docs (Response Relevancy), the default setup generates 3 hypothetical questions per answer. The same metric is sometimes called Response Relevancy in newer documentation and Answer Relevance in the TruLens framework — same idea, different label.
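The core computation is small enough to sketch directly. In the sketch below, `generate_questions` (an LLM judge that reverse-engineers questions from the answer) and `embed` (an embedding model) are hypothetical placeholder callables, not part of any specific library — plug in whatever backend you use.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer_relevancy(question: str, answer: str, generate_questions, embed, n: int = 3) -> float:
    """Score how directly `answer` addresses `question`.

    generate_questions(answer, n) -> list[str]  # judge LLM: reverse-engineer n questions
    embed(text) -> np.ndarray                   # embedding model
    Both callables are supplied by the caller; they stand in for whatever
    LLM and embedding backend your evaluation stack uses.
    """
    hypothetical = generate_questions(answer, n)           # default in Ragas is 3 questions
    q_emb = embed(question)                                # embedding of the original question
    sims = [cosine_similarity(q_emb, embed(h)) for h in hypothetical]
    return float(np.mean(sims))                            # mean similarity = relevancy score
```

An off-topic answer produces hypothetical questions that sit far from the original in embedding space, so its mean similarity — and its score — drops.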
Crucially, Answer Relevancy says nothing about factual correctness. A response can be perfectly on-topic and entirely hallucinated; it will still score high. According to Ragas Docs (Response Relevancy), the metric is reference-free — it needs no ground-truth answer to compute, only the original question and the generated response. That makes it cheap to run continuously over real production traffic, not just on a curated test set. The trade-off is real: relevancy must always sit next to a faithfulness metric. One catches drift away from the question; the other catches drift away from the source documents. Treat them as a pair, never as substitutes.
How It’s Used in Practice
Most teams encounter Answer Relevancy through a RAG evaluation framework. Ragas, DeepEval, and TruLens all expose it as a built-in metric you can run over a batch of question/answer pairs collected from your pipeline. The typical workflow looks like this: the team logs real or synthetic queries, runs them through the RAG pipeline, captures the generated answers, then feeds the (question, answer) pairs to the evaluator. The evaluator returns a per-example score and an aggregate average across the batch.
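As a concrete example, a batch evaluation with Ragas has historically looked roughly like the sketch below. Import paths and column names have changed across Ragas versions (newer releases rename some of these), so treat this as the shape of the workflow rather than a verbatim recipe, and check the current docs.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness  # names vary by Ragas version

# (question, answer) pairs captured from the pipeline; contexts are needed by
# faithfulness but not by answer relevancy, which is reference-free.
# Assumes a judge LLM and embedding model are configured (e.g. via API keys).
eval_data = Dataset.from_dict({
    "question": ["How do I rotate an API key?"],
    "answer":   ["Go to Settings > API Keys, revoke the old key, then create a new one."],
    "contexts": [["API keys are managed under Settings > API Keys. Revoking a key..."]],
})

result = evaluate(eval_data, metrics=[answer_relevancy, faithfulness])
print(result)  # per-metric aggregate scores for the batch
```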
In a CI-style setup, that average becomes a guardrail: if a prompt change or a model swap drops the relevancy score below a threshold, the build fails and the change does not ship. In production, the same scoring runs as a sampled background job — every Nth interaction gets evaluated and surfaces in a dashboard, so the team sees relevancy degradation in days, not after the first wave of customer complaints. The score is also useful as a debugging signal during prompt iteration, where you can attribute a drop to a specific change in the system prompt or retrieval configuration.
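The guardrail itself is just a threshold check over the per-example scores. The 0.8 cutoff and the function below are illustrative choices, not a recommended standard — the right threshold depends on your domain and embedding model.

```python
import sys

RELEVANCY_THRESHOLD = 0.8  # illustrative cutoff; tune to your own baseline

def gate_on_relevancy(per_example_scores: list[float]) -> None:
    """Fail the build if average answer relevancy drops below the threshold."""
    avg = sum(per_example_scores) / len(per_example_scores)
    print(f"answer_relevancy: avg={avg:.3f} over {len(per_example_scores)} examples")
    if avg < RELEVANCY_THRESHOLD:
        print("Relevancy regression detected; failing the build.")
        sys.exit(1)

gate_on_relevancy([0.91, 0.84, 0.78, 0.88])  # scores from the eval run
```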
Pro Tip: Pin a faithfulness metric next to relevancy in the same eval run. Relevancy alone will give you a confident green light on answers that are perfectly on-topic and completely made up. The two metrics catch different failure modes, and running them together is the cheapest insurance you can buy.
When to Use / When Not
| Scenario | Use | Avoid |
|---|---|---|
| Shipping a customer-facing RAG chatbot and you need a guardrail against off-topic answers | ✅ | |
| You want to verify a response is factually grounded in the retrieved documents | | ❌ |
| A/B testing two prompt templates and you need a quantitative signal for which one stays on-task | ✅ | |
| Looking for a single metric that certifies your RAG pipeline as “production ready” | | ❌ |
| Sampling production traffic and you want a cheap, reference-free quality signal | ✅ | |
| Your use case involves intentionally exploratory, open-ended responses where staying tightly on-question is undesired | | ❌ |
Common Misconception
Myth: A high Answer Relevancy score means the answer is correct. Reality: Answer Relevancy only measures whether the response stays on-topic with the question. A confidently relevant answer can still be entirely hallucinated. Pair it with Faithfulness to check whether the answer is actually grounded in the retrieved source documents.
One Sentence to Remember
Answer Relevancy asks one narrow question — did this response answer what was asked — and gives no opinion on whether the answer is true; treat it as one third of the RAG evaluation triangle, never as the whole picture.
FAQ
Q: What is the difference between Answer Relevancy and Faithfulness? A: Answer Relevancy checks whether the response is on-topic with the user’s question. Faithfulness checks whether the response is supported by the retrieved documents. Different failure modes; both are needed.
Q: Does Answer Relevancy require a ground-truth answer? A: No. It is reference-free — it only needs the original question and the generated response. The judge model reverse-generates hypothetical questions from the answer and compares them to the original.
Q: What counts as a good Answer Relevancy score? A: Scores closer to 1 indicate strong alignment between answer and question. There is no universal threshold — the right cutoff depends on your domain, your embedding model, and how much off-topic drift your users tolerate.
Sources
- Ragas Docs (Response Relevancy): Answer/Response Relevancy — Ragas. Reference implementation and formula for the metric.
- TruLens Docs: The RAG Triad — TruLens. Defines Answer Relevance as one of the three core RAG evaluation metrics.
Expert Takes
What’s elegant here is the inversion. Instead of asking whether the answer matches the question, the metric asks: if we reverse-engineered questions from this answer, would they cluster near the original? That’s a proxy for relevance, not a direct measurement. It works because off-topic answers reverse-engineer into off-topic questions. Just remember the proxy assumes your embedding model captures the semantics that matter for your domain — which is not always a safe bet.
Treat Answer Relevancy as a regression test on your prompt template. When the score drops, the generator started drifting — usually because the system prompt left room for the model to wander, or the retrieved chunks confused the response shape. Pin it to your CI for the prompts you ship. The fix is rarely the model itself; it’s a clearer specification of what “answer the question” means in your domain. Tighter context, tighter output.
Every team shipping RAG to customers is one bad answer away from a support ticket they didn’t see coming. Answer Relevancy is the metric your eval dashboard needs before users start filing them. The companies winning with retrieval right now aren’t the ones with the fanciest models — they’re the ones who instrumented relevancy from week one and treated each drop as a product bug, not a vibes problem. If you can’t measure off-topic, you’ll keep shipping it.
Answer Relevancy gives a tidy number for something deeply slippery: did the system understand what the user was actually asking? The metric measures surface alignment between question and response, not whether the user’s real intent was honoured. A confidently relevant answer to the wrong interpretation of the question scores high. Worth asking who decides what “relevant” means in your evaluation set, and whose questions never made it into the test data in the first place.