
When Agent Evals Lie: The Ethics of LLM-as-Judge Scoring

[Image: Silhouette of a judge replaced by a mirrored language model, raising questions about who evaluates AI agents]
Before you dive in

This article is a specific deep-dive within our broader topic of Agent Evaluation and Testing.

This article assumes familiarity with the basics of agent evaluation and LLM-as-Judge scoring.

The Hard Truth

What if the most consequential judgment in your AI pipeline is being made by a system that systematically prefers its own family of models — and nobody on your team is checking? The judge you trust may be quietly marking its own homework. And the evaluation scores you carry to stakeholders may be a confident fiction.

Every quarter, more engineering teams adopt LLM-as-Judge scoring as the default mechanism for agent evaluation and testing. The reasoning is seductive — humans are slow, model judges scale, and the early agreement numbers between automated judges and human raters looked respectable enough to anchor an entire methodology. But the moment one model becomes the arbiter of another, we have created a new institution with judicial power, and we have built it without any of the procedural protections that real institutions accumulated over centuries.

The Question Behind the Convenience

The question almost nobody asks out loud is this: why are we comfortable letting one opaque system grade another, and treating the result as truth? In the original LLM-as-a-Judge paper (Zheng et al. 2023), the authors reported greater than 80% agreement between a GPT-4 judge and human preferences on MT-Bench — comparable to the agreement humans reach among themselves. That single result has been quoted as a kind of permission slip ever since. If a model can match human consensus most of the time, why bother with humans at all?

The answer is not that LLM judges are useless. The answer is that “most of the time” is not the part of the distribution that matters in safety-critical evaluation. The cases where judgment matters most — the close calls, the unfamiliar domains, the adversarial inputs — are precisely the cases where automated judges drift the furthest from anything resembling truth.

The Case for the Machine Judge

It is worth stating the steelman in its strongest form. LLM judges are cheap. They are fast. They never get tired. They produce structured outputs that integrate into continuous integration pipelines. They scale to millions of evaluations a month for a fraction of the cost of a human review team, and they make iterative agent development feasible at all. For many teams, the realistic alternative to a flawed automated judge is not careful human review; in practice, it is no evaluation at all.
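To make that convenience concrete, the sketch below shows roughly what such an integration looks like: a judging prompt, a structured verdict, and a pass/fail gate a pipeline can act on. The prompt wording, the 1–5 scale, and the `call_judge` callable are illustrative assumptions, not any particular vendor's API.

```python
import json
from typing import Callable

# Illustrative prompt; real pipelines version and disclose this text.
JUDGE_PROMPT = """You are an impartial evaluator. Score the agent's answer
from 1 to 5 for correctness and helpfulness, and reply with JSON only:
{{"score": <int>, "rationale": "<one sentence>"}}

Question: {question}
Agent answer: {answer}
"""

def judge_answer(question: str, answer: str,
                 call_judge: Callable[[str], str]) -> dict:
    """Ask a judge model for a structured score. `call_judge` stands in for
    whatever client your stack uses to send a prompt and get text back."""
    raw = call_judge(JUDGE_PROMPT.format(question=question, answer=answer))
    verdict = json.loads(raw)  # fails loudly if the judge ignores the schema
    assert 1 <= verdict["score"] <= 5
    return verdict

def ci_gate(results: list[dict], threshold: float = 4.0) -> bool:
    """The kind of pass/fail gate a CI pipeline hangs on the judge's scores."""
    mean_score = sum(r["score"] for r in results) / len(results)
    return mean_score >= threshold
```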

This is not a trivial argument. Replace the LLM judge with a panel of human raters and you have not solved bias — you have introduced a different distribution of biases, slower feedback, and a hiring bottleneck. The conventional position is reasonable: imperfect measurement at scale is more valuable than perfect measurement that never happens.

What the Agreement Number Hides

The trouble is that the agreement number is not what it appears to be. The CALM framework (Ye et al. 2024) systematically quantified twelve distinct biases inside LLM-as-a-Judge methodology — position bias, verbosity bias, authority bias, self-enhancement, and others. Many of these persist even when the source of the answer is hidden from the judge. A judge that prefers its own outputs when the labels are stripped is not making an aesthetic choice. It is recognizing its own statistical fingerprint and rewarding it.
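Position bias, at least, is cheap to measure yourself: present the same pair of answers to the judge in both orders and count how often the verdict flips. A minimal sketch, assuming a hypothetical `pairwise_judge` callable that returns "A" when it prefers the answer shown first and "B" otherwise:

```python
from typing import Callable

def position_flip_rate(pairs: list[tuple[str, str, str]],
                       pairwise_judge: Callable[[str, str, str], str]) -> float:
    """pairs: (question, answer_1, answer_2).
    A consistent judge should pick the same underlying answer regardless of
    the order in which the two answers are presented."""
    flips = 0
    for question, ans_1, ans_2 in pairs:
        forward = pairwise_judge(question, ans_1, ans_2)   # ans_1 shown first
        reverse = pairwise_judge(question, ans_2, ans_1)   # ans_1 shown second
        # Consistency means forward == "A" and reverse == "B" (or vice versa);
        # the same letter both times means the judge followed position, not content.
        if forward == reverse:
            flips += 1
    return flips / len(pairs)
```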

That fingerprint problem has a name in the literature: self-preference. Judges assign higher scores to outputs that are more familiar to them in distributional terms (Wataoka et al. 2024). And there is a related contamination called preference leakage — when the model that generated training data and the model evaluating outputs come from the same family, scores inflate in ways that have nothing to do with quality (LLMs-as-Judges Survey 2024). In practical terms: if you fine-tune on outputs from one frontier model and then grade your agent with a sibling model, the leaderboard you produce reflects family resemblance more than merit.
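A first-order hygiene check follows directly from this: record which model family produced the judge, the candidate, and the fine-tuning data, and flag any overlap before trusting the scores. The family table below is a small illustrative assumption, not an authoritative taxonomy.

```python
# Illustrative mapping only: extend it with the models your stack actually uses.
MODEL_FAMILY = {
    "gpt-4o": "openai",
    "gpt-4o-mini": "openai",
    "claude-3-5-sonnet": "anthropic",
    "gemini-1.5-pro": "google",
}

def leakage_warnings(judge_model: str, candidate_model: str,
                     finetune_data_sources: list[str]) -> list[str]:
    """Flag the two contamination paths described above: a judge grading its
    own family, and a judge grading a candidate tuned on its family's outputs."""
    warnings = []
    judge_family = MODEL_FAMILY.get(judge_model)
    if judge_family is None:
        return ["unknown judge family: disclose it before trusting the scores"]
    if MODEL_FAMILY.get(candidate_model) == judge_family:
        warnings.append(f"self-preference risk: judge and candidate are both {judge_family}")
    if any(MODEL_FAMILY.get(src) == judge_family for src in finetune_data_sources):
        warnings.append(f"preference-leakage risk: candidate was tuned on {judge_family} outputs")
    return warnings
```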

The deeper structural issue is that judges degrade exactly where they need to be sharpest. A systematic study of position bias across fifteen judges and roughly forty solution-generating models (Shi et al. IJCNLP-AACL 2025) found that judge inconsistency rises sharply when candidate answers are close in quality. The closer the comparison, the less reliable the verdict. The implication for any team using LLM-as-Judge to compare two versions of the same agent is uncomfortable — the regime in which you most need a discriminating judge is the regime in which the judge is least discriminating.

Judges Without Juries, Without Appeals

Step back from the technical layer for a moment. Every functional judicial institution humans have built — court systems, peer review, financial audit, clinical trial review — accumulated procedural protections after painful experience. Recusal when a judge has a conflict of interest. Appeals when a verdict seems unsound. Transparency about reasoning. Independent oversight.

LLM-as-Judge as currently practiced has none of these. There is no recusal mechanism that prevents a model from grading outputs derived from its own family. There is no appeals process for a candidate model that believes it was unfairly scored. There is rarely transparency about which judge model graded which run, on what version, with what system prompt. The closest we have to oversight is that the same engineers who built the evaluation pipeline also choose the judge.
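The missing transparency, at least, is cheap to build. Below is a minimal sketch of the kind of per-run record that would let a verdict be re-examined later; the field names and the append-only log file are chosen here for illustration, not drawn from any existing tool.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import hashlib
import json

@dataclass
class JudgeRecord:
    """Everything needed to re-litigate a single automated verdict."""
    run_id: str
    candidate_model: str
    judge_model: str
    judge_version: str
    judge_prompt_sha256: str   # hash of the exact judging prompt used
    score: float
    rationale: str
    timestamp: str

def make_record(run_id: str, candidate_model: str, judge_model: str,
                judge_version: str, judge_prompt: str,
                score: float, rationale: str) -> JudgeRecord:
    return JudgeRecord(
        run_id=run_id,
        candidate_model=candidate_model,
        judge_model=judge_model,
        judge_version=judge_version,
        judge_prompt_sha256=hashlib.sha256(judge_prompt.encode()).hexdigest(),
        score=score,
        rationale=rationale,
        timestamp=datetime.now(timezone.utc).isoformat(),
    )

def append_to_log(record: JudgeRecord, path: str = "judge_log.jsonl") -> None:
    """Append-only log an auditor, or an appeals reviewer, can replay later."""
    with open(path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")
```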

Frameworks like the NIST AI Risk Management Framework articulate the principles that ought to govern this — accountability, transparency, validity, fairness with harmful bias managed (NIST AI RMF 1.0). But the NIST RMF is voluntary. The EU AI Act is more demanding — providers of high-risk systems are required to establish testing procedures that verify performance, accuracy, and bias mitigation, and general-purpose model providers with systemic risk already have evaluation duties in force (European Commission). The proposed Digital AI Omnibus would defer the high-risk obligations to December 2027, but the direction is clear. The question is whether voluntary good intentions arrive before the audit shows up.

The Thesis

Thesis: When an LLM judges another LLM, the score we report is not an objective measurement of quality. It is a record of one model’s preferences applied at scale — and treating that record as ground truth is an act of governance, not an act of engineering.

This is uncomfortable because it shifts the moral weight of an evaluation pipeline. If the judge is just an instrument, then a wrong score is a measurement error. If the judge is a governance layer that encodes preferences and amplifies them across thousands of decisions, then a wrong score is a small policy choice multiplied. The same agent that scores well on Judge A may score poorly on Judge B, not because the agent changed, but because the implicit value system of the evaluator did. The team that picks the judge picks the policy.

The Questions We Owe Ourselves

What does it mean to “trust” an evaluation when no human reviewed any of the cases that produced it? What does informed consent look like for end users whose interactions with an agent were graded by a system whose biases were never disclosed? How do we treat a vendor’s published benchmark differently when we know the evaluation was conducted with a judge from the same model family as the candidate? And what is owed to the user who suffered a downstream harm because a judge approved a behavior that a thoughtful human reviewer would have caught?

These are not questions a tool can answer. They are questions an institution must answer — and the institution does not yet exist.

Where This Argument Is Weakest

The honest counterweight is that the field is moving. Agent-as-a-Judge methodologies, in which multiple specialized agents verify each other against tools and external evidence rather than offering single-pass scores, may dissolve some of the worst pathologies of single-judge evaluation (Survey on Agent-as-a-Judge 2025). Inter-rater reliability is increasingly measured with Cohen’s Kappa rather than raw correlation, which catches the failure mode where a judge is consistent but systematically harsh. And empirical work showing that prompt-level interventions can reduce certain biases suggests that the design space is not yet exhausted.
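Cohen's Kappa is worth spelling out, because it corrects observed agreement for the agreement two raters would reach by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate and p_e the agreement expected from each rater's marginal label rates. A minimal sketch for categorical labels, with a toy example showing why a judge that passes everything can look 90% accurate while its kappa is zero:

```python
def cohens_kappa(judge_labels: list[str], human_labels: list[str]) -> float:
    """Chance-corrected agreement between a judge and a human rater."""
    assert len(judge_labels) == len(human_labels)
    n = len(judge_labels)
    categories = set(judge_labels) | set(human_labels)
    p_o = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    p_e = sum(
        (judge_labels.count(c) / n) * (human_labels.count(c) / n)
        for c in categories
    )
    return (p_o - p_e) / (1 - p_e) if p_e != 1 else 1.0

# A judge that passes everything agrees with a human 90% of the time when the
# human passes 9 of 10 cases, yet kappa is 0: the agreement is all chance.
print(cohens_kappa(["pass"] * 10, ["pass"] * 9 + ["fail"]))
```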

If hybrid pipelines combining model judges with human review on a sampled subset of edge cases become standard practice, and if judge–human agreement is reported as Cohen's Kappa alongside disclosure of model family relationships, then the argument here weakens. The thesis is not that LLM-as-Judge is irredeemable. It is that, today, the practice has outpaced the procedural protections it requires.
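In code, that hybrid is less a research program than a routing rule: send the judge's borderline verdicts, plus a small random spot check of everything else, to a human queue instead of straight to the leaderboard. The thresholds and sampling rate below are purely illustrative assumptions.

```python
import random

def route_for_human_review(results: list[dict],
                           low: float = 3.5, high: float = 4.5,
                           spot_check_rate: float = 0.05,
                           seed: int = 0) -> list[dict]:
    """Queue borderline scores (the regime where judges are least reliable)
    plus a random spot check of the rest for human review."""
    rng = random.Random(seed)
    queue = []
    for r in results:
        borderline = low <= r["score"] <= high
        spot_check = rng.random() < spot_check_rate
        if borderline or spot_check:
            queue.append(r)
    return queue
```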

The Question That Remains

When the judge of your AI is itself an AI, the bias is not a bug to be patched — it is the worldview the judge inherited from its training and now scales across every interaction it grades. Who, then, is responsible when the verdict is confidently wrong — the engineer who chose the judge, the model provider who released it without disclosing its biases, or the user who trusted a score that nobody had any institutional reason to question?

Disclaimer

This article is for educational purposes only and does not constitute professional advice. Consult qualified professionals for decisions in your specific situation.

AI-assisted content, human-reviewed. Images AI-generated.