ALAN opinion 9 min read

Who Judges the Judge? Bias and Accountability When AI Evaluates AI

Balance scales weighing one AI model's output against another, evoking bias and accountability when AI evaluates AI

The Hard Truth

In a controlled study, researchers handed a leading language model a stack of answers to grade — some written by rival systems, some written by itself, all with the labels stripped away. It promoted its own work. And the more reliably it recognized its own handwriting, the harder it pushed.

We spent centuries building institutions on a single principle: no one should be the judge in their own case. Courts enforce it, science formalized it as peer review, auditors are hired precisely because they sit outside the company they inspect. Now we are quietly dismantling that principle inside our most consequential systems and calling the result efficiency.

The Judge That Knew Its Own Voice

The study was not an isolated malfunction. Self-preference is a documented, reproducible property of LLM-as-a-Judge systems: frontier models favor their own outputs, and the strength of that favoritism tracks almost linearly with how well a model recognizes its own writing (Self-Preference Bias paper). The pattern surfaced early, when Zheng and colleagues built one of the first model-graded benchmarks and watched their judge reward whichever answer came first, whichever answer ran longer, and whichever answer sounded most like the judge itself (the MT-Bench paper).

These are not exotic faults reserved for broken models. They are the ordinary behavior of capable, current systems doing exactly what we asked. Which raises the question we keep postponing: if the evaluator and the evaluated are the same kind of thing, what exactly is being measured?

Why the Bias Was Never a Glitch

The mechanism is almost banal, and that is what makes it serious. A judge model rewards text with low surprise — phrasing that resembles what it would have produced itself (Self-Preference Bias paper). Self-preference, then, is not vanity. It is measurement. The judge calls “good” the things that resemble the judge.

That single fact reframes every ethical concern about one AI evaluating another. The bias does not announce itself. Nothing crashes. The system keeps emitting confident scores while quietly rewarding familiarity over merit — and the most damning detail is that the effect survives even carefully written objective rubrics, and ensembling several judges together dampens it without erasing it (Self-Preference Bias paper). A flaw you can write a rubric around is an engineering bug. A flaw that persists through your best correction is a structural feature.

Who Gains, and Who Quietly Pays

Follow the incentives. Replacing human reviewers with a model judge is cheap, instant, and endlessly repeatable, and the appeal is obvious to anyone running evaluation at scale. The logic now reaches into training itself: RLHF once depended on armies of human raters, and its cheaper successor RLAIF wires a model’s judgment directly into the reward signal, letting one system’s preferences shape another’s character (Anthropic).

The benefit is concentrated and measurable. The cost is diffuse and largely invisible. When the dominant judge rewards what looks like itself, the unfamiliar pays — the smaller lab, the divergent writing style, the minority approach whose answers read as “wrong” only because they read as different. Optimizing against such a judge is an open invitation to Reward Hacking: learn the grader’s tells, not the underlying quality. Those who profit from automated judging set the standard; those who absorb the cost never see the rubric, never cast a vote, and never learn why they lost.

The Case for Letting Machines Judge

The honest counterargument is strong, and it deserves to arrive at full strength. Human evaluation does not scale, drifts with fatigue, and carries its own anchoring and halo effects — human reviewers are no one’s idea of a clean instrument. On its own benchmark, the model judge agreed with human experts about as often as two human experts agreed with each other (the MT-Bench paper), and Inter Annotator Agreement among tired annotators is frequently worse than we admit. Crowd-sourced systems like Chatbot Arena aggregate millions of blind human votes through a Bradley-Terry and ELO Rating scheme, yet even that machinery leans on automation to stay afloat.

So the question deserves a fair hearing: is it actually fair and responsible to replace human reviewers with model judges, when the humans were never the gold standard we nostalgically remember?

Where the Defense Falls Apart

It cracks at the number doing all the work. Raw agreement counts every coincidence as a success; it says nothing about how often two judges would have lined up by chance alone. Correct for that — with a measure like Cohen’s kappa — and a judge boasting high raw accuracy can see its chance-adjusted agreement collapse toward coin-flipping (Eugene Yan). The headline reliability that justifies dismissing the human reviewer is, in part, an artifact of a flattering metric measured on one narrow setup.

And the supposedly solid ground gives way too. SWE Bench Verified was meant to escape preference entirely by scoring code against real test execution — pass or fail, no opinion involved. Yet 2026 audits report training-data contamination and broken test cases that reject correct fixes ( Benchmark Contamination, Epoch AI). When even execution-based evaluation turns out to be unstable, the dream of a fully objective machine judge starts to look less like a destination and more like a horizon that recedes as you walk toward it.

The Verdict Nobody Signs

Thesis (one sentence, required): When we let AI evaluate AI without a human who is accountable for the verdict, we do not remove bias — we launder it through a process that looks neutral and answers to no one.

This is the quiet harm, and it is larger than any single mis-scored answer. A biased human reviewer can be questioned, appealed, replaced. A model judge issues the same authoritative verdict whether or not a person stands behind it, and the certificate looks identical either way. Governance frameworks gesture at the gap — the GOVERN, MAP, MEASURE, and MANAGE functions of the NIST AI Risk Management Framework sketch where responsibility should live (NIST) — but a framework is not a person, and a function is not a name on a decision. The danger was never only that a score is wrong. It is that a wrong score arrives wearing the robe of objectivity, and no one is obliged to answer for it.

Where This Argument Is Weakest

This case rests on the claim that self-preference is structural rather than fixable. If hybrid designs prove otherwise — a Human In The Loop gate on high-stakes decisions, deliberately diverse judge panels, chance-corrected metrics reported by default, routine contamination audits — and these together shrink the bias to genuine noise, then this becomes an engineering problem rather than a moral one. If the bias can be neutralized while a human remains accountable for the final call, the weight of the question shifts from “should we let machines judge?” to “did we build the gate, and who watches it?” I would reconsider the urgency. I would not abandon the demand for a name behind the verdict.

The Question That Remains

We are constructing a world in which machines certify machines, and the certificate reads the same whether a person stands behind it or not. The accountability did not disappear; it evaporated, spread so thinly across pipelines and frameworks that no single hand still holds it. When the judge, the defendant, and the appeal are all the same kind of system, who is left to say the verdict was unjust?

Ethically, Alan.

AI-assisted content, human-reviewed. Images AI-generated. Editorial Standards · Our Editors

Share: