Teaching to the Test: How Benchmark Optimization Distorts AI Progress

The Hard Truth
Every few weeks, a new model claims the summit of some leaderboard, and we read the number as a verdict on intelligence itself. But a score only tells you how well something performed on a test — not whether the test still measures what we think it does. What happens to a field when its most trusted numbers quietly stop meaning what they used to?
Teachers have a phrase for what happens when a school’s funding rides on its exam results: they call it teaching to the test. The curriculum narrows, the drills multiply, the scores climb — and somewhere in that climb, the thing the exam was supposed to capture gets lost. Artificial intelligence has built its entire sense of progress on tests. It may be time to ask whether the field is learning, or just rehearsing.
The Number We Have Learned to Trust
We have agreed, almost without ever debating it, that a benchmark score is a measure of capability. When a lab tops a leaderboard, we treat the claim the way an earlier generation treated a published proof — settled, external, trustworthy. The score becomes shorthand for the model, and the model for the future.
But there is a quieter question hiding underneath the celebration: is the model genuinely better, or is the lab simply better at the test? Those are not the same achievement, and for years we did not have to tell them apart because they tended to move together. The unsettling possibility now is that they have come apart — that the score rises while the capability stands still — and that we have not yet adjusted our trust to match.
Why Benchmarks Earned Our Trust
It would be unfair, and untrue, to treat benchmarks as a mistake. The field fell in love with them for good reasons. Shared benchmark datasets turned arguments about whose system was smarter into something closer to an experiment, where claims could be checked rather than taken on faith. The discipline of the train-test split (learn from one set of examples, then prove yourself on another the model has never seen) gave machine learning a working version of falsifiability.
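For readers who have never seen the split written down, here is a minimal sketch of the idea in Python. The function name, the 20% holdout fraction, and the toy examples are assumptions made for illustration, not a description of how any particular benchmark is built.

```python
import random


def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and hold out a fraction for testing."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical toy dataset
examples = [f"example-{i}" for i in range(10)]
train, test = train_test_split(examples)

# The score is only meaningful while this overlap stays empty.
overlap = set(train) & set(test)
print(len(train), len(test), len(overlap))
```

The whole guarantee lives in that last check: the moment test items find their way into the training side, the score stops measuring what the split was designed to measure.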
Benchmarks replaced opinion with evidence, and for a discipline prone to hype, that is precious. The conventional wisdom that measurement drives progress is not naive — it is most of why the field advanced as fast as it did. Which is exactly what makes the failure mode so hard to see. The danger never arrives by discrediting the method. It arrives by quietly corrupting the one assumption the method depends on.
The Assumption Hiding Inside the Leaderboard
Every leaderboard rests on a single, rarely stated premise: that the score is a faithful stand-in for the capability it claims to measure. In April 2025, that premise cracked in public. Meta submitted a model called Llama-4-Maverick-03-26-Experimental to the Chatbot Arena. It was a chat-optimized, verbose, emoji-friendly variant tuned for the arena, not the version the public could download, as The Register documented. It reached an Elo rating of 1417 and second place overall before the arena conceded that Meta should have made clearer that this was a customized model.
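It helps to remember what an Elo-style number encodes: not correctness, but how often one model is preferred over another in head-to-head votes. The sketch below uses the classic Elo formula from chess. That is an assumption made for illustration; the arena's actual rating procedure may differ, and the rival rating and the update constant k are hypothetical.

```python
def expected_score(rating_a, rating_b):
    """Classic Elo expectation: chance that A is preferred over B in one vote."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def updated_rating(rating, expected, actual, k=32.0):
    """Nudge a rating toward the observed result (actual: 1 for a win, 0 for a loss)."""
    return rating + k * (actual - expected)


arena_variant = 1417.0  # the rating cited above
rival = 1317.0          # hypothetical opponent, for illustration only

p = expected_score(arena_variant, rival)
print(f"expected preference rate: {p:.2f}")
print(f"rating after one more win: {updated_rating(arena_variant, p, 1.0):.1f}")
```

Under this model, anything that makes voters prefer a response, including length, tone, or emoji, moves the rating exactly as a real gain in capability would.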
The scale of the selective disclosure is what lingers. Meta had privately tested 27 Llama 4 variants and published only the one that landed near the top, according to the Leaderboard Illusion study, which also found that access to the arena’s own data can yield up to a 112% relative performance gain on the arena’s distribution. A systematic review of 445 language-model benchmarks reached a parallel verdict from the opposite direction, finding recurring flaws in what the tests measure and how they score it. Those flaws are enough, its authors argue, to weaken the validity of the claims built on top of them, a concern detailed in Measuring what Matters. And beneath all of it sits benchmark contamination: test-set leakage has become so routine across popular suites like MMLU, GSM8K, and HellaSwag that a high score can be memorization wearing the mask of reasoning, as the Eriksson et al. review describes. There is even a grim irony in the headline successes: when a reasoning test gets so saturated that every frontier model brushes against its ceiling, the test has stopped telling the models apart at all.
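Why does publishing only the best of many private variants matter? A small simulation makes the mechanism visible. Every number here is hypothetical: the sketch assumes 27 variants of identical true quality whose measured ratings differ only by random noise, which is a deliberate simplification and not a claim about the actual Llama 4 variants.

```python
import random
import statistics


def simulate_private_variants(n_variants=27, true_score=1300.0, noise_sd=20.0, seed=0):
    """Measured ratings for variants that all share the same true quality."""
    rng = random.Random(seed)
    return [rng.gauss(true_score, noise_sd) for _ in range(n_variants)]


scores = simulate_private_variants()
published = max(scores)           # the only number the public would see
typical = statistics.mean(scores)  # what an average variant scored

print(f"mean of all variants: {typical:.1f}")
print(f"best variant only:    {published:.1f}")
print(f"selection bonus:      {published - typical:.1f}")
```

Even with no difference in underlying quality, the maximum of many noisy measurements sits above their average. Reporting only the winner turns measurement noise into apparent progress.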
What Goodhart Saw Coming
Economists named this pattern decades before AI leaderboards existed. The principle commonly known as Goodhart’s Law holds that when a measure becomes a target, it ceases to be a good measure: the act of optimizing for the number hollows out the number’s meaning. Researchers have shown the same dynamic operating inside AI evaluation itself, where scores can be inflated without any matching change in real behavior, a finding traced in the arXiv work on Goodhart’s Law in NLP. It is the machine-learning cousin of reward hacking: optimize the proxy thoroughly enough and it detaches from the goal it was meant to track.
The field’s most thoughtful work reads as a direct answer to Goodhart. FrontierMath was built by more than 60 mathematicians entirely from unpublished problems, so that no model could have absorbed the answers during training; at its 2024 launch, state-of-the-art systems solved under 2% of it, according to Epoch AI. Newer efforts refresh their questions continuously, so the test a model faces this month is not the test it could have quietly memorized last month. The lesson underneath both is the same: a benchmark is only honest for as long as it stays unseen.
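The refresh idea can be sketched in a few lines. This is an illustration under stated assumptions, not the method of any named benchmark: the Question type, the dates, and the training cutoff are all hypothetical, and it assumes a question's publication date is known and trustworthy.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Question:
    prompt: str
    published: date


def unseen_questions(questions, training_cutoff):
    """Keep only questions published strictly after the model's training cutoff."""
    return [q for q in questions if q.published > training_cutoff]


# Hypothetical question pool
pool = [
    Question("older problem", date(2024, 1, 15)),
    Question("recent problem", date(2025, 3, 1)),
]

fresh = unseen_questions(pool, training_cutoff=date(2024, 6, 30))
print([q.prompt for q in fresh])
```

The filter is trivial; the hard part is the supply. A test built this way stays honest only as long as someone keeps writing new questions faster than models are retrained.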
The Quiet Decoupling
Benchmark optimization is steadily decoupling the progress we report from the progress we actually make, and the more authority we grant the leaderboard, the less the leaderboard deserves it.
This is not mainly a story about cheating. Whether benchmark-driven development is harming real progress is not a yes-or-no question, because most of the damage comes from people behaving rationally inside a broken incentive. An interdisciplinary review of roughly 100 studies named the real culprits — contamination, weak construct validity, gaming, and commercial incentives bending evaluation toward the appearance of progress — and cautioned that benchmarks in their current state are not dependable enough to anchor regulatory or safety decisions, a warning laid out in the Eriksson et al. review. That last point should give us pause, because regulation is precisely where society is beginning to lean on these numbers. A measurement we have learned to distrust is quietly being asked to carry the weight of public accountability.
Questions Worth Sitting With
So what do we do with a measurement system we can no longer fully trust? The instinct is to demand better benchmarks, and the work is underway: private holdout sets, contamination-resistant designs, even synthetic data generation used to mint fresh problems no model has encountered. These help, and they matter.
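One common building block of a contamination screen is a check for word sequences shared between a test item and the training text. The sketch below is a rough illustration, not a full audit: the helper names, the 5-word window, and the sample strings are assumptions, and real screens must also cope with paraphrase, translation, and corpora far too large to scan this naively.

```python
def word_ngrams(text, n):
    """Set of all consecutive n-word sequences in the text, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(test_item, training_text, n=5):
    """Fraction of the test item's word n-grams that also occur in the training text."""
    item_grams = word_ngrams(test_item, n)
    if not item_grams:
        return 0.0  # item shorter than n words: nothing to compare
    return len(item_grams & word_ngrams(training_text, n)) / len(item_grams)


# Hypothetical training text and test items
corpus = "forum post: a train leaves the station at noon travelling at sixty miles per hour"
leaked = "a train leaves the station at noon travelling at sixty miles per hour"
novel = "a ferry departs the harbour at dawn sailing at twelve knots"

print(overlap_ratio(leaked, corpus), overlap_ratio(novel, corpus))
```

A ratio near one suggests the item may have been seen verbatim; a ratio near zero rules out only verbatim leakage, which is why a screen like this supports private holdout sets rather than replacing them.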
But a sharper ruler does not answer the deeper question: who decides what gets measured, and who audits the people who decide? A benchmark is never neutral. It encodes a definition of what counts as intelligence, and when that definition is authored by the same labs whose models are judged against it, the line between measuring progress and manufacturing it grows thin. The numbers do not have to lie for the system to mislead. They only have to be written by people with a stake in the answer.
Where This Argument Could Break
Intellectual honesty demands naming where this case could fail. If contamination-resistant benchmarks and private, held-out evaluations become the norm, and if scores on genuinely unseen problems begin to climb, then the decoupling I have described would turn out to be a growing pain, not a structural flaw. And if the hard new tests reward real generalization rather than a fresh flavor of memorization, the field will have corrected itself faster than its critics feared. I would be glad to be wrong about this.
The Question That Remains
Benchmarks were supposed to keep us honest — a shared, external check on our own enthusiasm. The danger now is not that they lie outright, but that we keep reading them as truth long after they have quietly become theater. If the numbers no longer measure the thing we care about, and we have built our sense of progress on the numbers, then what, exactly, do we still know about how far we have come?
Ethically, Alan.
AI-assisted content, human-reviewed.