TruLens
Also known as: TruLens-Eval, TruLens RAG Triad, TruEra TruLens
TruLens is an open-source evaluation and tracing framework for LLM applications and agents, built around the RAG Triad: Context Relevance, Groundedness, and Answer Relevance. These three feedback functions score retrieval quality, grounding in source documents, and how well the answer addresses the question, and they are backed by OpenTelemetry-based tracing of every retrieval and tool call.
What It Is
Build a RAG chatbot, and the first scary question lands within a week: how do we know it isn’t making things up? TruLens answers that question with numbers instead of hope. It is the open-source framework that turned “is the answer grounded in the source documents” into a feedback function any developer can drop into their pipeline.
TruLens started inside the TruEra team, who in 2023 published the framing that gave the project its identity: the RAG Triad. According to TruEra, three questions cover almost everything that can go wrong with a retrieval-augmented answer. Did the retriever pull relevant chunks? That is Context Relevance. Did the model stick to those chunks instead of inventing details? That is Groundedness. Did the final answer actually address what the user asked? That is Answer Relevance. Each question gets its own score, and a regression in any one of them points at a specific failure mode.
Groundedness is the function most developers reach for first when they care about guardrails. According to TruLens Docs, it works by decomposing the model’s response into atomic claims — short factual sentences — and then checking each claim against the retrieved context. A claim that has direct support in the retrieved chunks scores high. A claim that drifts from the source, or worse, fabricates a citation, scores low. The output is not a single fuzzy similarity number; it is a per-claim audit trail you can read like a code review.
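As a rough sketch of how the triad gets wired up, the snippet below defines the three feedback functions with an LLM-as-a-judge provider. It follows the shape of the trulens_eval quickstart; the exact module paths, method names, and the `retrieve` selector are assumptions that shift between releases, so check the docs for your installed version.

```python
import numpy as np
from trulens_eval import Feedback, Select
from trulens_eval.feedback.provider import OpenAI

provider = OpenAI()  # LLM-as-a-judge provider; other providers are supported

# Answer Relevance: does the final answer address the user's question?
f_answer_relevance = Feedback(provider.relevance, name="Answer Relevance").on_input_output()

# Context Relevance: did the retriever pull chunks related to the question?
# The selector assumes the instrumented app exposes a `retrieve` method.
f_context_relevance = (
    Feedback(provider.context_relevance_with_cot_reasons, name="Context Relevance")
    .on_input()
    .on(Select.RecordCalls.retrieve.rets)
    .aggregate(np.mean)  # average the per-chunk relevance scores
)

# Groundedness: is each atomic claim in the answer supported by the context?
f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(Select.RecordCalls.retrieve.rets.collect())  # all retrieved chunks as one context
    .on_output()
)
```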
The 2025–2026 generation of TruLens added agent tracing on top of the feedback functions. According to TruLens Site, the framework instruments retrievals, tool calls, and planning steps using OpenTelemetry, so a Groundedness regression can be traced back to the exact span of an agent run that produced it.
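A minimal sketch of what that instrumentation looks like on a hand-rolled RAG app, assuming the `@instrument` decorator from trulens_eval's custom-app support (module paths differ in trulens 1.x, and the OpenTelemetry export configuration is omitted here):

```python
from trulens_eval.tru_custom_app import instrument


class RAGApp:
    """Toy RAG app; the retrieval and generation bodies are placeholders."""

    @instrument
    def retrieve(self, query: str) -> list[str]:
        # Fetch chunks from your vector store here.
        return ["Refunds are available within 30 days of purchase."]

    @instrument
    def generate(self, query: str, chunks: list[str]) -> str:
        # Call your LLM with the retrieved chunks here.
        return "You can request a refund within 30 days."

    @instrument
    def query(self, query: str) -> str:
        chunks = self.retrieve(query)
        return self.generate(query, chunks)
```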
How It’s Used in Practice
The mainstream pattern looks like this: a team building a RAG chatbot or a customer-support agent imports TruLens alongside their existing pipeline, registers the three RAG Triad feedback functions against an evaluation set, and runs the suite on every meaningful change. Adjusted the chunk size? Re-run. Added a reranker? Re-run. Swapped the system prompt? Re-run. The dashboard shows whether Context Relevance moved, whether Groundedness held, and whether Answer Relevance survived the change.
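Concretely, the re-run loop can look like the sketch below, which reuses the RAGApp class and feedback functions from the earlier snippets. TruCustomApp, Tru, and the app_id convention follow the trulens_eval quickstart and are illustrative rather than canonical; trulens 1.x renames several of these classes.

```python
from trulens_eval import Tru, TruCustomApp

tru = Tru()  # local store for records and feedback results
rag = RAGApp()

tru_app = TruCustomApp(
    rag,
    app_id="support-rag-v2",  # bump the id on every meaningful change
    feedbacks=[f_context_relevance, f_groundedness, f_answer_relevance],
)

eval_questions = [
    "How do I reset my API key?",
    "What is the refund window for annual plans?",
]

# Replay the evaluation set; the feedback functions score each recorded run.
with tru_app as recording:
    for q in eval_questions:
        rag.query(q)

tru.run_dashboard()  # compare this run against previous app_ids
```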
Teams shipping into production typically pair TruLens with a CI step. A pull request that drops Groundedness below the agreed threshold gets blocked the same way a pull request that breaks unit tests gets blocked. The OpenTelemetry traces feed the same observability stack the rest of the application already uses, so an SRE looking at a latency spike and a developer looking at a Groundedness regression are reading from the same span tree.
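One way to express that gate, assuming the leaderboard aggregation from trulens_eval and the app_id and feedback name used in the sketches above; the 0.8 threshold is a project decision, not a TruLens default:

```python
import sys

from trulens_eval import Tru

GROUNDEDNESS_THRESHOLD = 0.8  # agreed release bar, set by the team

tru = Tru()
# get_leaderboard() returns a DataFrame of mean feedback scores per app_id.
scores = tru.get_leaderboard(app_ids=["support-rag-v2"])
groundedness = scores["Groundedness"].iloc[0]  # column named after the feedback

if groundedness < GROUNDEDNESS_THRESHOLD:
    print(f"Groundedness {groundedness:.2f} is below {GROUNDEDNESS_THRESHOLD}; blocking merge.")
    sys.exit(1)  # non-zero exit fails the CI step and blocks the pull request
print(f"Groundedness {groundedness:.2f} passes the release bar.")
```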
Pro Tip: Don’t treat the three RAG Triad scores as a single average. Groundedness and Answer Relevance can move in opposite directions: an answer that closely paraphrases the retrieved text scores high on grounding, but if that text never actually addresses the user's question, relevance stays low. Read the three numbers as a diagnostic panel, not a leaderboard.
When to Use / When Not
| Scenario | Use | Avoid |
|---|---|---|
| Production RAG chatbot or customer-support agent | ✅ | |
| One-off prototype that will never see real users | | ❌ |
| Catching grounding regressions before release | ✅ | |
| Single-turn classification with no retrieval step | | ❌ |
| Tracing agent runs across retrievals and tool calls | ✅ | |
| Domains where every answer is human-reviewed anyway | | ❌ |
Common Misconception
Myth: A high Groundedness score means the answer is correct. Reality: Groundedness only measures whether the answer is supported by the retrieved context. If the retriever fetched the wrong documents — or if the source documents themselves are outdated — the model can produce a perfectly grounded answer that is also factually wrong. That is why Context Relevance sits alongside Groundedness in the RAG Triad: it catches the case where the generator is loyal to bad sources.
One Sentence to Remember
TruLens turns “does the bot hallucinate” into three measurable feedback functions you can wire into CI — and once Groundedness lives on a dashboard, grounding stops being a hope and becomes a contract.
FAQ
Q: How is TruLens different from Ragas or DeepEval? A: All three implement reference-free LLM-as-a-judge metrics for RAG. TruLens is built around the RAG Triad framing and pairs feedback functions with OpenTelemetry tracing for agents. Pick by stack, not by feature checklist.
Q: Do I need ground-truth answers to use TruLens? A: For Context Relevance, Groundedness, and Answer Relevance — no. The three RAG Triad functions score the question, the retrieved context, and the produced answer against each other, no human-labelled key required.
Q: Is TruLens open source? A: Yes. According to TruLens Site, the framework lives at github.com/truera/trulens as an independent open-source project, separate from any commercial offering. You can install it from PyPI and run feedback functions locally.
Sources
- TruLens Docs, "RAG Triad" (TruLens core concepts): canonical definition of the three feedback functions and the atomic-claim Groundedness method.
- TruLens Site, "TruLens: Evals and Tracing for Agents": project home, repository link, and OpenTelemetry agent-tracing positioning.
Expert Takes
Not orchestration. Decomposition. The Groundedness function takes a generated answer apart claim by atomic claim, then checks each against the retrieved context. That is what makes the score actionable. A free-text similarity number tells you nothing useful when an answer mixes one true sentence with one fabricated one. Per-claim verification surfaces the fabrication. The RAG Triad is three small experiments, not one global judgment.
Wire the feedback functions into the same repo as your agent. When OpenTelemetry traces each retrieval and tool call, a regression in Groundedness points at a specific span — the chunk that was fetched, the prompt that summarized it, the tool that returned the wrong document. The diagnosis stops being “the bot hallucinates.” It becomes “this retriever stopped returning the right section after we changed the chunker.” That is fixable in an afternoon.
Eval frameworks are no longer optional add-ons. Customers ask whether the bot hallucinates and the only credible answer is a number from a documented framework. TruLens, Ragas, and DeepEval all carry that weight, and TruLens earned its spot by being the team that named the RAG Triad. Buyers now read the eval section of an architecture doc the way they used to read the security section. Skip it and lose the deal.
Groundedness scored by a language model judging another language model is a hall of mirrors. The judge inherits the same training quirks as the writer, the same cultural blind spots, the same confidence in plausible-sounding wrong answers. When the score says the answer is grounded, ask: grounded according to whom? The RAG Triad measures internal consistency between retrieved text and produced text. Whether the retrieved text was true in the first place is a different question, and a harder one.