
When Agent Evals Lie: The Ethics of LLM-as-Judge Scoring

[Image: Silhouette of a judge replaced by a mirrored language model, raising questions about who evaluates AI agents]
Before you dive in

This article is a specific deep-dive within our broader topic of Agent Evaluation and Testing.

This article assumes familiarity with the basics of agent evaluation and LLM-as-Judge scoring.

The Hard Truth

What if the most consequential judgment in your AI pipeline is being made by a system that systematically prefers its own family of models — and nobody on your team is checking? The judge you trust may be quietly marking its own homework. And the evaluation scores you carry to stakeholders may be a confident fiction.

Every quarter, more engineering teams adopt LLM-as-Judge scoring as the default mechanism for agent evaluation and testing. The reasoning is seductive — humans are slow, model judges scale, and the early agreement numbers between automated judges and human raters looked respectable enough to anchor an entire methodology. But the moment one model becomes the arbiter of another, we have created a new institution with judicial power, and we have built it without any of the procedural protections that real institutions accumulated over centuries.

The Question Behind the Convenience

The question almost nobody asks out loud is this: why are we comfortable letting one opaque system grade another, and treating the result as truth? In the original LLM-as-a-Judge paper (Zheng et al. 2023), the authors reported greater than 80% agreement between a GPT-4 judge and human preferences on MT-Bench — comparable to the agreement humans reach among themselves. That single result has been quoted as a kind of permission slip ever since. If a model can match human consensus most of the time, why bother with humans at all?

The answer is not that LLM judges are useless. The answer is that “most of the time” is not the part of the distribution that matters in safety-critical evaluation. The cases where judgment matters most — the close calls, the unfamiliar domains, the adversarial inputs — are precisely the cases where automated judges drift the furthest from anything resembling truth.

The Case for the Machine Judge

It is worth stating the steelman in its strongest form. LLM judges are cheap. They are fast. They never get tired. They produce structured outputs that integrate into continuous integration pipelines. They scale to millions of evaluations a month for a fraction of the cost of a human review team, and they make iterative agent development feasible at all. For many teams, the realistic alternative to a flawed automated judge is not careful human review; in practice, it is no evaluation at all.
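To make that convenience concrete, the sketch below shows roughly what such an integration looks like: a judging prompt, a structured verdict, and a pass/fail gate a pipeline can act on. The prompt wording, the 1–5 scale, and the `call_judge` callable are illustrative assumptions, not any particular vendor's API.

```python
import json
from typing import Callable

# Illustrative prompt; real pipelines version and disclose this text.
JUDGE_PROMPT = """You are an impartial evaluator. Score the agent's answer
from 1 to 5 for correctness and helpfulness, and reply with JSON only:
{{"score": <int>, "rationale": "<one sentence>"}}

Question: {question}
Agent answer: {answer}
"""

def judge_answer(question: str, answer: str,
                 call_judge: Callable[[str], str]) -> dict:
    """Ask a judge model for a structured score. `call_judge` stands in for
    whatever client your stack uses to send a prompt and get text back."""
    raw = call_judge(JUDGE_PROMPT.format(question=question, answer=answer))
    verdict = json.loads(raw)  # fails loudly if the judge ignores the schema
    assert 1 <= verdict["score"] <= 5
    return verdict

def ci_gate(results: list[dict], threshold: float = 4.0) -> bool:
    """The kind of pass/fail gate a CI pipeline hangs on the judge's scores."""
    mean_score = sum(r["score"] for r in results) / len(results)
    return mean_score >= threshold
```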

This is not a trivial argument. Replace the LLM judge with a panel of human raters and you have not solved bias — you have introduced a different distribution of biases, slower feedback, and a hiring bottleneck. The conventional position is reasonable: imperfect measurement at scale is more valuable than perfect measurement that never happens.

What the Agreement Number Hides

The trouble is that the agreement number is not what it appears to be. The CALM framework (Ye et al. 2024) systematically quantified twelve distinct biases inside LLM-as-a-Judge methodology — position bias, verbosity bias, authority bias, self-enhancement, and others. Many of these persist even when the source of the answer is hidden from the judge. A judge that prefers its own outputs when the labels are stripped is not making an aesthetic choice. It is recognizing its own statistical fingerprint and rewarding it.
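Position bias, at least, is cheap to measure yourself: present the same pair of answers to the judge in both orders and count how often the verdict flips. A minimal sketch, assuming a hypothetical `pairwise_judge` callable that returns "A" when it prefers the answer shown first and "B" otherwise:

```python
from typing import Callable

def position_flip_rate(pairs: list[tuple[str, str, str]],
                       pairwise_judge: Callable[[str, str, str], str]) -> float:
    """pairs: (question, answer_1, answer_2).
    A consistent judge should pick the same underlying answer regardless of
    the order in which the two answers are presented."""
    flips = 0
    for question, ans_1, ans_2 in pairs:
        forward = pairwise_judge(question, ans_1, ans_2)   # ans_1 shown first
        reverse = pairwise_judge(question, ans_2, ans_1)   # ans_1 shown second
        # Consistency means forward == "A" and reverse == "B" (or vice versa);
        # the same letter both times means the judge followed position, not content.
        if forward == reverse:
            flips += 1
    return flips / len(pairs)
```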

That fingerprint problem has a name in the literature: self-preference. Judges assign higher scores to outputs that are more familiar to them in distributional terms (Wataoka et al. 2024). And there is a related contamination called preference leakage — when the model that generated training data and the model evaluating outputs come from the same family, scores inflate in ways that have nothing to do with quality (LLMs-as-Judges Survey 2024). In practical terms: if you fine-tune on outputs from one frontier model and then grade your agent with a sibling model, the leaderboard you produce reflects family resemblance more than merit.
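A first-order hygiene check follows directly from this: record which model family produced the judge, the candidate, and the fine-tuning data, and flag any overlap before trusting the scores. The family table below is a small illustrative assumption, not an authoritative taxonomy.

```python
# Illustrative mapping only: extend it with the models your stack actually uses.
MODEL_FAMILY = {
    "gpt-4o": "openai",
    "gpt-4o-mini": "openai",
    "claude-3-5-sonnet": "anthropic",
    "gemini-1.5-pro": "google",
}

def leakage_warnings(judge_model: str, candidate_model: str,
                     finetune_data_sources: list[str]) -> list[str]:
    """Flag the two contamination paths described above: a judge grading its
    own family, and a judge grading a candidate tuned on its family's outputs."""
    warnings = []
    judge_family = MODEL_FAMILY.get(judge_model)
    if judge_family is None:
        return ["unknown judge family: disclose it before trusting the scores"]
    if MODEL_FAMILY.get(candidate_model) == judge_family:
        warnings.append(f"self-preference risk: judge and candidate are both {judge_family}")
    if any(MODEL_FAMILY.get(src) == judge_family for src in finetune_data_sources):
        warnings.append(f"preference-leakage risk: candidate was tuned on {judge_family} outputs")
    return warnings
```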

The deeper structural issue is that judges degrade exactly where they need to be sharpest. A systematic study of position bias across fifteen judges and roughly forty solution-generating models (Shi et al. IJCNLP-AACL 2025) found that judge inconsistency rises sharply when candidate answers are close in quality. The closer the comparison, the less reliable the verdict. The implication for any team using LLM-as-Judge to compare two versions of the same agent is uncomfortable — the regime in which you most need a discriminating judge is the regime in which the judge is least discriminating.

Judges Without Juries, Without Appeals

Step back from the technical layer for a moment. Every functional judicial institution humans have built — court systems, peer review, financial audit, clinical trial review — accumulated procedural protections after painful experience. Recusal when a judge has a conflict of interest. Appeals when a verdict seems unsound. Transparency about reasoning. Independent oversight.

LLM-as-Judge as currently practiced has none of these. There is no recusal mechanism that prevents a model from grading outputs derived from its own family. There is no appeals process for a candidate model that believes it was unfairly scored. There is rarely transparency about which judge model graded which run, on what version, with what system prompt. The closest we have to oversight is that the same engineers who built the evaluation pipeline also choose the judge.
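The missing transparency, at least, is cheap to build. Below is a minimal sketch of the kind of per-run record that would let a verdict be re-examined later; the field names and the append-only log file are chosen here for illustration, not drawn from any existing tool.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import hashlib
import json

@dataclass
class JudgeRecord:
    """Everything needed to re-litigate a single automated verdict."""
    run_id: str
    candidate_model: str
    judge_model: str
    judge_version: str
    judge_prompt_sha256: str   # hash of the exact judging prompt used
    score: float
    rationale: str
    timestamp: str

def make_record(run_id: str, candidate_model: str, judge_model: str,
                judge_version: str, judge_prompt: str,
                score: float, rationale: str) -> JudgeRecord:
    return JudgeRecord(
        run_id=run_id,
        candidate_model=candidate_model,
        judge_model=judge_model,
        judge_version=judge_version,
        judge_prompt_sha256=hashlib.sha256(judge_prompt.encode()).hexdigest(),
        score=score,
        rationale=rationale,
        timestamp=datetime.now(timezone.utc).isoformat(),
    )

def append_to_log(record: JudgeRecord, path: str = "judge_log.jsonl") -> None:
    """Append-only log an auditor, or an appeals reviewer, can replay later."""
    with open(path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")
```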

Frameworks like the NIST AI Risk Management Framework articulate the principles that ought to govern this — accountability, transparency, validity, fairness with harmful bias managed (NIST AI RMF 1.0). But the NIST RMF is voluntary. The EU AI Act is more demanding — providers of high-risk systems are required to establish testing procedures that verify performance, accuracy, and bias mitigation, and general-purpose model providers with systemic risk already have evaluation duties in force (European Commission). The proposed Digital AI Omnibus would defer the high-risk obligations to December 2027, but the direction is clear. The question is whether voluntary good intentions arrive before the audit shows up.

The Thesis

Thesis: When an LLM judges another LLM, the score we report is not an objective measurement of quality. It is a record of one model’s preferences applied at scale — and treating that record as ground truth is an act of governance, not an act of engineering.

This is uncomfortable because it shifts the moral weight of an evaluation pipeline. If the judge is just an instrument, then a wrong score is a measurement error. If the judge is a governance layer that encodes preferences and amplifies them across thousands of decisions, then a wrong score is a small policy choice multiplied. The same agent that scores well on Judge A may score poorly on Judge B, not because the agent changed, but because the implicit value system of the evaluator did. The team that picks the judge picks the policy.

The Questions We Owe Ourselves

What does it mean to “trust” an evaluation when no human reviewed any of the cases that produced it? What does informed consent look like for end users whose interactions with an agent were graded by a system whose biases were never disclosed? How do we treat a vendor’s published benchmark differently when we know the evaluation was conducted with a judge from the same model family as the candidate? And what is owed to the user who suffered a downstream harm because a judge approved a behavior that a thoughtful human reviewer would have caught?

These are not questions a tool can answer. They are questions an institution must answer — and the institution does not yet exist.

Where This Argument Is Weakest

The honest counterweight is that the field is moving. Agent-as-a-Judge methodologies, in which multiple specialized agents verify each other against tools and external evidence rather than offering single-pass scores, may dissolve some of the worst pathologies of single-judge evaluation (Survey on Agent-as-a-Judge 2025). Inter-rater reliability is increasingly measured with Cohen’s Kappa rather than raw correlation, which catches the failure mode where a judge is consistent but systematically harsh. And empirical work showing that prompt-level interventions can reduce certain biases suggests that the design space is not yet exhausted.
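Cohen's Kappa is worth spelling out, because it corrects observed agreement for the agreement two raters would reach by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate and p_e the agreement expected from each rater's marginal label rates. A minimal sketch for categorical labels, with a toy example showing why a judge that passes everything can look 90% accurate while its kappa is zero:

```python
def cohens_kappa(judge_labels: list[str], human_labels: list[str]) -> float:
    """Chance-corrected agreement between a judge and a human rater."""
    assert len(judge_labels) == len(human_labels)
    n = len(judge_labels)
    categories = set(judge_labels) | set(human_labels)
    p_o = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    p_e = sum(
        (judge_labels.count(c) / n) * (human_labels.count(c) / n)
        for c in categories
    )
    return (p_o - p_e) / (1 - p_e) if p_e != 1 else 1.0

# A judge that passes everything agrees with a human 90% of the time when the
# human passes 9 of 10 cases, yet kappa is 0: the agreement is all chance.
print(cohens_kappa(["pass"] * 10, ["pass"] * 9 + ["fail"]))
```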

If hybrid pipelines combining model judges with human review on a sampled subset of edge cases become standard practice, and if judge–human agreement is reported as Cohen's Kappa alongside disclosure of model family relationships, then the argument here weakens. The thesis is not that LLM-as-Judge is irredeemable. It is that, today, the practice has outpaced the procedural protections it requires.
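In code, that hybrid is less a research program than a routing rule: send the judge's borderline verdicts, plus a small random spot check of everything else, to a human queue instead of straight to the leaderboard. The thresholds and sampling rate below are purely illustrative assumptions.

```python
import random

def route_for_human_review(results: list[dict],
                           low: float = 3.5, high: float = 4.5,
                           spot_check_rate: float = 0.05,
                           seed: int = 0) -> list[dict]:
    """Queue borderline scores (the regime where judges are least reliable)
    plus a random spot check of the rest for human review."""
    rng = random.Random(seed)
    queue = []
    for r in results:
        borderline = low <= r["score"] <= high
        spot_check = rng.random() < spot_check_rate
        if borderline or spot_check:
            queue.append(r)
    return queue
```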

The Question That Remains

When the judge of your AI is itself an AI, the bias is not a bug to be patched — it is the worldview the judge inherited from its training and now scales across every interaction it grades. Who, then, is responsible when the verdict is confidently wrong — the engineer who chose the judge, the model provider who released it without disclosing its biases, or the user who trusted a score that nobody had any institutional reason to question?

Disclaimer

This article is for educational purposes only and does not constitute professional advice. Consult qualified professionals for decisions in your specific situation.

AI-assisted content, human-reviewed. Images AI-generated.