ALAN opinion 10 min read June 19, 2026

Teaching to the Test: How Benchmark Optimization Distorts AI Progress

Benchmark scores climbing on a leaderboard while real AI capability stays flat, the hidden cost of optimizing for the test

The Hard Truth

Every few weeks, a new model claims the summit of some leaderboard, and we read the number as a verdict on intelligence itself. But a score only tells you how well something performed on a test — not whether the test still measures what we think it does. What happens to a field when its most trusted numbers quietly stop meaning what they used to?

Teachers have a phrase for what happens when a school’s funding rides on its exam results: they call it teaching to the test. The curriculum narrows, the drills multiply, the scores climb — and somewhere in that climb, the thing the exam was supposed to capture gets lost. Artificial intelligence has built its entire sense of progress on tests. It may be time to ask whether the field is learning, or just rehearsing.

The Number We Have Learned to Trust

We have agreed, almost without ever debating it, that a benchmark score is a measure of capability. When a lab tops a leaderboard, we treat the claim the way an earlier generation treated a published proof — settled, external, trustworthy. The score becomes shorthand for the model, and the model for the future.

But there is a quieter question hiding underneath the celebration: is the model genuinely better, or is the lab simply better at the test? Those are not the same achievement, and for years we did not have to tell them apart because they tended to move together. The unsettling possibility now is that they have come apart — that the score rises while the capability stands still — and that we have not yet adjusted our trust to match.

Why Benchmarks Earned Our Trust

It would be unfair, and untrue, to treat benchmarks as a mistake. The field fell in love with them for good reasons. Shared benchmark datasets turned arguments about whose system was smarter into something closer to an experiment, where claims could be checked rather than taken on faith. The discipline of the train-test split (learn from one set of examples, then prove yourself on another the model has never seen) gave machine learning a working version of falsifiability.
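A minimal sketch of why the split matters, under loud assumptions: the data is synthetic, and `MemorizingModel` and `accuracy` are hypothetical names invented for illustration, not any library's API. The toy model does nothing but store the answers it was shown, which is the extreme case the held-out set exists to expose.

```python
import random


class MemorizingModel:
    """Toy model: stores training answers verbatim, answers 0 for anything unseen."""

    def __init__(self):
        self.memory = {}

    def fit(self, examples):
        for question, answer in examples:
            self.memory[question] = answer

    def predict(self, question):
        # No generalization at all: unseen questions get a fixed guess.
        return self.memory.get(question, 0)


def accuracy(model, examples):
    correct = sum(model.predict(q) == a for q, a in examples)
    return correct / len(examples)


rng = random.Random(0)

# Synthetic data: each "question" is an id and each answer a random bit,
# so there is no pattern to learn, only answers to memorize.
data = [(i, rng.randint(0, 1)) for i in range(1000)]
rng.shuffle(data)
train, test = data[:800], data[800:]

model = MemorizingModel()
model.fit(train)

train_acc = accuracy(model, train)  # scored on examples the model has seen
test_acc = accuracy(model, test)    # scored on examples it has never seen

print(f"train accuracy: {train_acc:.2f}")
print(f"test accuracy:  {test_acc:.2f}")
```

Scored on its own training examples, the memorizer looks flawless. Scored on the held-out examples, it falls back to roughly coin-flip accuracy. The second number is the only one that says anything about capability, and it stays meaningful only while the test examples stay out of the training data.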

Benchmarks replaced opinion with evidence, and for a discipline prone to hype, that is precious. The conventional wisdom that measurement drives progress is not naive — it is most of why the field advanced as fast as it did. Which is exactly what makes the failure mode so hard to see. The danger never arrives by discrediting the method. It arrives by quietly corrupting the one assumption the method depends on.

The Assumption Hiding Inside the Leaderboard

Every leaderboard rests on a single, rarely stated premise: that the score is a faithful stand-in for the capability it claims to measure. In April 2025, that premise cracked in public. Meta submitted a model called Llama-4-Maverick-03-26-Experimental to the Chatbot Arena. It was a chat-optimized, verbose, emoji-friendly variant tuned for the arena, not the version the public could download, as The Register documented. It reached an Elo score of 1417 and second place overall before the arena conceded that Meta should have made it clearer that this was a customized model.

The scale of the selective disclosure is what lingers. Meta had privately tested 27 Llama 4 variants and published only the one that landed near the top, according to the Leaderboard Illusion study, which also found that access to the arena’s own data can yield up to a 112% relative performance gain on the arena’s distribution. A systematic review of 445 language-model benchmarks reached a parallel verdict from the opposite direction, finding recurring flaws in what the tests measure and how they score it. Those flaws, its authors argue, are enough to weaken the validity of the claims built on top of them, a concern detailed in Measuring what Matters. And beneath all of it sits benchmark contamination: test-set leakage has become so routine across popular suites like MMLU, GSM8K, and HellaSwag that a high score can be memorization wearing the mask of reasoning, as the Eriksson et al. review describes. There is even a grim irony in the headline successes: when a reasoning test gets so saturated that every frontier model brushes against its ceiling, the test has stopped telling the models apart at all.
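Why does publishing only the best of many private attempts matter so much? A toy simulation makes the selection effect concrete. Everything in it is an assumption for illustration: the rating of 1300, the noise of 15 points, and the premise that every variant has identical true skill are hypothetical values, not figures from Meta or from the Leaderboard Illusion study. Only the count of 27 comes from the reporting above.

```python
import random
import statistics


def measured_score(true_skill, noise_sd, rng):
    """One noisy leaderboard measurement of a model with a fixed true skill."""
    return rng.gauss(true_skill, noise_sd)


def best_of_n(n_variants, true_skill, noise_sd, rng):
    """Test n equally skilled variants in private, report only the top score."""
    return max(measured_score(true_skill, noise_sd, rng) for _ in range(n_variants))


rng = random.Random(42)

TRUE_SKILL = 1300.0  # hypothetical rating shared by every variant
NOISE_SD = 15.0      # hypothetical measurement noise, in rating points
TRIALS = 2000        # repeat the experiment to average out luck

single = [best_of_n(1, TRUE_SKILL, NOISE_SD, rng) for _ in range(TRIALS)]
best27 = [best_of_n(27, TRUE_SKILL, NOISE_SD, rng) for _ in range(TRIALS)]

avg_single = statistics.mean(single)
avg_best27 = statistics.mean(best27)
inflation = avg_best27 - avg_single

print(f"average score, one submission:    {avg_single:.1f}")
print(f"average score, best of 27:        {avg_best27:.1f}")
print(f"inflation from selection alone:   {inflation:.1f}")
```

In this sketch no variant is better than any other, yet the reported score rises simply because the maximum of many noisy draws sits above the average draw. Real variants do differ in skill, so the simulation isolates one ingredient of the gap, not the whole of it.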

What Goodhart Saw Coming

Economists named this pattern decades before today’s AI leaderboards existed. The principle commonly known as Goodhart’s Law holds that when a measure becomes a target, it ceases to be a good measure: the act of optimizing for the number hollows out the number’s meaning. Researchers have shown the same dynamic operating inside AI evaluation itself, where scores can be inflated without any matching change in real behavior, a finding traced in the arXiv work on Goodhart’s Law in NLP. It is the machine-learning cousin of reward hacking: optimize the proxy thoroughly enough and the proxy detaches from the goal.

The field’s most thoughtful work reads as a direct answer to Goodhart. FrontierMath was built by more than 60 mathematicians entirely from unpublished problems, so that no model could have absorbed the answers during training; at its 2024 launch, state-of-the-art systems solved under 2% of it, according to Epoch AI. Newer efforts refresh their questions continuously, so the test a model faces this month is not the test it could have quietly memorized last month. The lesson underneath both is the same: a benchmark is only honest for as long as it stays unseen.
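How would anyone check whether a test item has stayed unseen? One widely used heuristic is to look for long word sequences shared between a benchmark item and the training text. The sketch below assumes whitespace tokenization and a window of 8 words, both arbitrary choices, and the helper names `ngrams` and `overlap_fraction` are hypothetical. It is not the procedure used by FrontierMath or by any benchmark named here.

```python
def ngrams(text, n):
    """Return the set of n-word sequences in a text (lowercased, split on whitespace)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(test_item, corpus_docs, n=8):
    """Fraction of the test item's n-grams that also appear somewhere in the corpus."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0  # item shorter than n words: nothing to compare
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Hypothetical training corpus containing a verbatim copy of one test question.
corpus = [
    "forum post: if a train leaves the station at noon travelling at sixty "
    "miles per hour how far has it gone by three",
]

leaked = (
    "If a train leaves the station at noon travelling at sixty "
    "miles per hour how far has it gone by three"
)
fresh = (
    "A cyclist climbs a hill at nine kilometres per hour and descends "
    "at twice that speed what is her average"
)

leaked_score = overlap_fraction(leaked, corpus)
fresh_score = overlap_fraction(fresh, corpus)

print(f"leaked item overlap: {leaked_score:.2f}")
print(f"fresh item overlap:  {fresh_score:.2f}")
```

A check like this only catches verbatim leakage, and only for those who can read the training corpus. A paraphrased or translated copy of a question slips straight through, which is part of why unpublished and continuously refreshed problems are the stronger defense.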

The Quiet Decoupling

Benchmark optimization is steadily decoupling the progress we report from the progress we actually make, and the more authority we grant the leaderboard, the less the leaderboard deserves it.

This is not mainly a story about cheating. Whether benchmark-driven development is harming real progress is not a yes-or-no question, because most of the damage comes from people behaving rationally inside a broken incentive. An interdisciplinary review of roughly 100 studies named the real culprits — contamination, weak construct validity, gaming, and commercial incentives bending evaluation toward the appearance of progress — and cautioned that benchmarks in their current state are not dependable enough to anchor regulatory or safety decisions, a warning laid out in the Eriksson et al. review. That last point should give us pause, because regulation is precisely where society is beginning to lean on these numbers. A measurement we have learned to distrust is quietly being asked to carry the weight of public accountability.

Questions Worth Sitting With

So what do we do with a measurement system we can no longer fully trust? The instinct is to demand better benchmarks, and the work is underway: private holdout sets, contamination-resistant designs, even synthetic data generation used to mint fresh problems no model has encountered. These help, and they matter.

But a sharper ruler does not answer the deeper question: who decides what gets measured, and who audits the people who decide? A benchmark is never neutral. It encodes a definition of what counts as intelligence, and when that definition is authored by the same labs whose models are judged against it, the line between measuring progress and manufacturing it grows thin. The numbers do not have to lie for the system to mislead. They only have to be written by people with a stake in the answer.

Where This Argument Could Break

Intellectual honesty demands naming where this case could fail. If contamination-resistant benchmarks and private, held-out evaluations become the norm, and if scores on genuinely unseen problems begin to climb, then the decoupling I have described would turn out to be a growing pain, not a structural flaw. And if the hard new tests reward real generalization rather than a fresh flavor of memorization, the field will have corrected itself faster than its critics feared. I would be glad to be wrong about this.

The Question That Remains

Benchmarks were supposed to keep us honest — a shared, external check on our own enthusiasm. The danger now is not that they lie outright, but that we keep reading them as truth long after they have quietly become theater. If the numbers no longer measure the thing we care about, and we have built our sense of progress on the numbers, then what, exactly, do we still know about how far we have come?

Ethically, Alan.

Sources

arXiv (Goodhart/NLP): Goodhart’s Law Applies to NLP’s Explanation Benchmarks - Evidence that benchmark metrics can be inflated without any change in real model behavior.
The Register: Meta accused of Llama 4 bait-and-switch to juice LMArena rank - Reporting on the April 2025 Llama 4 Chatbot Arena variant.
Leaderboard Illusion: The Leaderboard Illusion - Study documenting selective disclosure, data asymmetry, and the performance payoff of arena access.
Measuring what Matters: Measuring what Matters: Construct Validity in Large Language Model Benchmarks - Systematic review of 445 benchmarks and their construct-validity gaps.
Eriksson et al. review: Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation - Review naming contamination, gaming, and misaligned incentives, and cautioning against regulatory reliance.
Epoch AI: FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI - Contamination-proof benchmark built from unpublished problems.

Aha Moments

MONA

What Alan frames as a betrayal, I would call predictable. Any time you fix a test set and then optimize a model hard against it, you are teaching the model to fit that particular sample, not the world the sample was drawn from. The gap between them has an old name: overfitting. A leaderboard is just a test set with an audience, and an audience changes behavior. The honest reading of any benchmark has always been narrow — it tells you how a model did on these examples, under these conditions, at this moment. We were the ones who inflated that into a claim about intelligence. The measurement did not fail us. Our interpretation of it did, and that distinction matters more than any single score.

MAX

Mona is right that the math was never hidden, but I would put it in engineering terms. A benchmark is a specification for “done” that forgot to specify “good.” It defines a target output and a scoring function, and a model satisfies exactly what you wrote down — no more, no less. That is not a defect in the model; it is an incomplete contract. The recent arena episode is what you get when the only acceptance test is a public scoreboard: teams optimize the test because the test is the spec. The fix is not to abandon benchmarks but to treat evaluation like any serious contract — versioned, held out, and authored by someone who does not also collect the reward. Whoever writes the test owns the behavior.

DAN

Here is the part neither of you will say out loud: the leaderboard is marketing. A top rank is a launch asset, a fundraising slide, a recruiting magnet — which is exactly why the pressure to optimize it will never disappear on its own. But Max’s point about who writes the test is where the real opening sits. The moment buyers stop trusting public scores, trust itself becomes the product. Private, held-out evaluations, independent auditors, contamination-proof suites — whoever owns credible measurement owns the next wave of this market, because everyone else is selling numbers nobody believes anymore. So here is what I keep asking the founders I meet: when the scoreboard stops meaning anything, who do your customers turn to for the truth?

AI-assisted content, human-reviewed. Images AI-generated.