
Judging the Judges: Bias and Ethics of LLM-Based RAG Evaluation

Critical examination of bias and accountability gaps when one LLM grades another LLM's outputs in RAG evaluation pipelines
Before you dive in

This article is a specific deep-dive within our broader topic of RAG Evaluation.

This article assumes familiarity with the core RAG evaluation metrics (faithfulness, answer relevancy, context precision, context recall) and with the LLM-as-judge pattern.

Coming from software engineering? Read the bridge first: RAG Quality for Developers: What Testing Instincts Still Apply →

The Hard Truth

We built the judges to scale evaluation. Then we forgot to ask who scales accountability when the judge gets it wrong. What does it mean when an opaque model grades another opaque model — and a passing score becomes the only artifact a regulator, a compliance reviewer, or a downstream team will ever audit?

A retrieval pipeline goes into production. It scores well on faithfulness, well on answer relevancy, well on context precision and context recall. The dashboard turns green. Nobody mentions that the green came from a closed-source language model that was asked to grade the work of another closed-source language model. The grade is real. The reviewer is not.

The Question We Stopped Asking About RAG Quality Scores

When teams talk about RAG Evaluation today, they talk about metrics: Faithfulness, Answer Relevancy, Context Precision, Context Recall. The conversation is technical, almost domestic. Which threshold counts as good. Whether the dashboard refreshes nightly. Which framework integrates with the test harness.

What rarely surfaces is the substrate of every one of those numbers. Each metric, in its modern implementation, is computed by another large language model — a judge, prompted to read the answer and decide whether it is grounded, relevant, complete. The judge is the audit. The score is the receipt. And the receipt is increasingly the only document a downstream reader, an internal compliance reviewer, or an external auditor will ever see.

So the quiet question is this: when the receipt is generated by the same class of system being audited, what exactly are we auditing?

Why Reaching for the Judge Was Reasonable

The case for LLM-as-judge is genuinely strong, and treating it as a strawman misses why thoughtful engineers adopted it. Manual evaluation does not scale. Crowd-rated golden datasets cost money, take weeks to refresh, and grow stale the moment the model behind the application changes. A language model that can grade thousands of outputs an hour, at a fraction of a cent each, transforms what evaluation can mean inside a release cycle.

The empirical foundation is also more solid than skeptics often grant. GPT-4-class judges reach above 80% agreement with human annotators on chat-quality preference — comparable to human-to-human inter-annotator agreement, per Zheng et al. (MT-Bench). That is not a trivial result. It is also a result about chat preference, not specifically about whether a RAG answer is grounded in retrieved context — a distinction that often goes missing when the headline number gets cited. Frameworks like Ragas formalised the practice into well-documented metrics: faithfulness, for instance, decomposes the answer into atomic statements and verifies each against the retrieved context, producing a score between zero and one (Ragas Docs).
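To make that concrete, here is a minimal sketch of the decompose-and-verify idea, not the Ragas implementation itself. It assumes a hypothetical `call_judge` function that sends a prompt to whatever judge model you use and returns its raw text; the prompts are illustrative only.

```python
# Minimal sketch of a Ragas-style faithfulness check, not the library's actual
# implementation. `call_judge` is a hypothetical wrapper around whatever LLM
# API you use; the prompts are illustrative only.
from typing import Callable, List


def faithfulness_score(
    answer: str,
    contexts: List[str],
    call_judge: Callable[[str], str],
) -> float:
    """Return supported statements / total statements, a value in [0, 1]."""
    # Step 1: ask the judge to break the answer into atomic statements.
    statements_raw = call_judge(
        "Split the following answer into short, self-contained factual "
        f"statements, one per line:\n\n{answer}"
    )
    statements = [s.strip() for s in statements_raw.splitlines() if s.strip()]
    if not statements:
        return 0.0

    # Step 2: ask the judge whether each statement is supported by the context.
    context_block = "\n".join(contexts)
    supported = 0
    for statement in statements:
        verdict = call_judge(
            "Answer only YES or NO. Is the statement supported by the context?\n"
            f"Context:\n{context_block}\n\nStatement: {statement}"
        )
        if verdict.strip().upper().startswith("YES"):
            supported += 1

    return supported / len(statements)
```

Notice that both the splitting and the verifying are themselves LLM calls, which is exactly where the biases discussed below enter the score.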

NIST’s Generative AI Profile, published as AI 600-1 in July 2024, treats automated evaluation as part of the Measure function, not as a prohibited shortcut. The thoughtful position is not “do not use LLM judges.” It is “calibrate them carefully, hedge their outputs, and keep humans in the loop on stakes that matter.” That position is reasonable. It is also incomplete.

Twelve Biases Hidden Inside a Single Score

The hidden assumption is that a judge can be biased on individual cases yet unbiased in aggregate — that errors cancel out at scale. This is the assumption every dashboard quietly relies on, and the assumption the bias literature has been steadily dismantling.

Justice or Prejudice catalogues twelve distinct biases that LLM judges exhibit systematically — verbosity bias, authority bias, sentiment bias, beauty bias, position bias, and others. These are not noise. They are directional preferences that survive averaging. A judge prefers longer answers. It prefers answers that quote authority. It prefers prose that flows. None of these correlate cleanly with whether the answer is actually grounded in the retrieved context.

Then there is the deeper problem. A judge cannot be neutral about itself. The Self-Preference Bias paper shows that GPT-4, when judging GPT-4 outputs against alternatives, exhibits a measurable preference for its own family — and the preference correlates with how confidently the model itself generated the text. The judge rewards stylistic familiarity. It mistakes its own voice for correctness.

Reproducibility cracks the foundation further. OpenAI’s and Anthropic’s own documentation acknowledges that outputs are not fully deterministic across runs and model versions, even at temperature zero with a fixed seed (Reliability of LLM-as-a-Judge). The vendor can update the underlying model without notice. Yesterday’s score is not strictly reproducible tomorrow. The dashboard rests on a denominator that nobody outside the vendor controls.
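A small probe makes the reproducibility problem visible without pretending to fix it. The sketch below reuses the hypothetical `call_judge` and `faithfulness_score` from earlier, re-runs an identical evaluation several times, and reports the spread; the 0.05 threshold in the usage note is illustrative, not a standard.

```python
# A reproducibility probe: re-run an identical judged evaluation and report
# the spread. Assumes the hypothetical `faithfulness_score` / `call_judge`
# sketch above; nothing here fixes non-determinism, it only surfaces it.
from statistics import mean, pstdev
from typing import Callable, Dict, List


def score_spread(run_once: Callable[[], float], runs: int = 5) -> Dict[str, object]:
    scores: List[float] = [run_once() for _ in range(runs)]
    return {
        "scores": scores,
        "mean": mean(scores),
        "stdev": pstdev(scores),
        "range": max(scores) - min(scores),
    }


# Usage (illustrative threshold, not a standard):
# spread = score_spread(lambda: faithfulness_score(answer, contexts, call_judge))
# if spread["range"] > 0.05:
#     print("Identical inputs drift too much to gate a release on this score.")
```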

Peer Review Earned Authority by Adding Friction

Modern science did not gain its authority by speeding up evaluation. It gained authority by slowing it down — institutionalising peer review, demanding methods sections another lab could replicate, requiring conflicts of interest to be disclosed, and treating the reviewer as someone external to the work. The friction was the point. A reviewer drawn from the same lab as the author was, by definition, compromised.

Compare that to the LLM-as-judge stack as it currently sits in production. The reviewer is not external — it shares training data, architecture, and often vendor with the system under review. The methods section is a system prompt that may be considered a trade secret. The disclosure of conflict is, at best, a paragraph in a model card. The replication requirement is silently negotiated against an API endpoint that may have shifted since last week.

Evaluation historically paid for its authority through transparency, separation, and replicability — and the current practice quietly skips all three. The metric did not earn the trust we extend to it. It inherited the trust by occupying the same slot on the dashboard that human review used to occupy.

What This Argument Actually Concludes

The thesis, in one sentence: LLM-as-judge is acceptable as instrumentation, not as accountability — and conflating the two is the ethical failure most production systems are currently making.

The distinction matters. Instrumentation tells you something has changed. It does not tell you whether the change is right. A faithfulness score that drops between releases is a useful signal that something in the retrieval layer has shifted. It is not a verdict that the system is now unsafe, and not a substitute for a human reviewing whether a particular answer misled a particular user.
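A sketch of what treating the judge as instrumentation can look like in practice: a drop in the score opens a human review task rather than passing or failing the release on its own. Both `open_review_ticket` and the threshold are placeholders for whatever workflow your team actually uses.

```python
# Sketch of "instrumentation, not accountability": a score drop triggers a
# human review task instead of an automatic pass/fail. `open_review_ticket`
# and the threshold are placeholders, not a recommended standard.
from typing import Callable


def check_faithfulness_regression(
    previous_score: float,
    current_score: float,
    open_review_ticket: Callable[[str], None],
    drop_threshold: float = 0.05,
) -> None:
    drop = previous_score - current_score
    if drop > drop_threshold:
        # The signal says "something changed"; a person decides what it means.
        open_review_ticket(
            f"Faithfulness dropped from {previous_score:.2f} to "
            f"{current_score:.2f}; sampled outputs need human review."
        )
```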

Accountability requires a chain of human decisions someone can be held to: what was tested, by whom, against what reference, with what acknowledged limitations. NIST’s AI RMF 1.0 makes this concrete in the Govern function, which establishes organizational accountability, policies, and oversight for AI risk. A pipeline whose only audit trail is a score generated by another opaque model has instrumentation but no Govern layer worth the name. It has a number. It does not have a person.

The discomfort is that this is not visible from inside the dashboard. Everything looks fine until something breaks badly enough to require a postmortem — and the postmortem discovers there was never anyone in a position to take the call.

The Questions Worth Sitting With

So what is left to do, if outsourcing judgment to another model is the wrong frame? The most promising practice — hybrid evaluation, where human-verified golden datasets calibrate the automated judge, and subject-matter experts arbitrate sampled outputs (Evidently AI) — is not a technical innovation. It is the deliberate refusal to let a metric stand alone.
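In practice, that calibration step can be as plain as the sketch below: before the judge's verdicts are trusted at scale, compare them with human labels on the golden set and report raw agreement alongside Cohen's kappa. The binary labels and function names are assumptions for illustration, not part of any framework.

```python
# Hedged sketch of the calibration step in a hybrid setup: compare judge
# verdicts with human labels on a golden set before trusting the judge at
# scale. Labels are assumed binary (1 = faithful, 0 = not) for illustration.
from typing import Dict, List


def judge_human_agreement(judge_labels: List[int], human_labels: List[int]) -> Dict[str, float]:
    assert judge_labels and len(judge_labels) == len(human_labels)
    n = len(human_labels)
    agreement = sum(j == h for j, h in zip(judge_labels, human_labels)) / n

    # Cohen's kappa corrects raw agreement for agreement expected by chance.
    p_judge = sum(judge_labels) / n
    p_human = sum(human_labels) / n
    p_chance = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    kappa = 1.0 if p_chance == 1 else (agreement - p_chance) / (1 - p_chance)

    return {"agreement": agreement, "kappa": kappa}
```

If the kappa on the golden set is poor, the problem is the judge, not the pipeline it is grading.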

That refusal raises questions worth holding open. Who in your organisation has the authority to override a green dashboard? What happens when a passing score conflicts with a user complaint? Which decisions about the judge — its prompt, its model, its threshold — are documented somewhere a successor could find them? When the model behind the API silently updates, who notices, and who carries responsibility for re-running the calibration?

These are not engineering questions. They are questions about institutional accountability dressed up in technical clothing.

Where This Argument Could Break

The argument weakens if open-weight, deterministic judges with audited prompts and stable benchmarks become the default, and if the bias gap between the judge model and the production model can be measured and reported alongside every score. If a judge becomes reproducible, transparent, and clearly external to the system under review, much of the accountability critique above dissolves. The current practice is not inevitable. It is just what we have settled for.

The Question That Remains

If the green dashboard is the only artifact downstream readers ever see, and the green came from a model whose biases we know, whose training data we cannot inspect, and whose behaviour we cannot reliably reproduce — what exactly have we delegated, and to whom?

Disclaimer

This article is for educational purposes only and does not constitute professional advice. Consult qualified professionals for decisions in your specific situation.
