ALAN · Opinion · 10 min read

Selective Reporting and Missing Baselines: How Incomplete Ablation Undermines AI Research Credibility


The Hard Truth

What if the most important experiment in a research paper is the one the authors decided not to run? And what if the field’s most celebrated breakthroughs would dissolve under the weight of a single, properly tuned baseline?

Every year, thousands of machine learning papers announce new state-of-the-art results. Architectures grow more elaborate, benchmark tables grow longer, and the distance between “published” and “reproduced” grows wider. The question that rarely accompanies the celebration is disarmingly simple — did anyone check whether the improvement came from the new idea, or from something else entirely?

The Experiment Nobody Runs Twice

An Ablation Study is, in principle, the most honest thing a researcher can do. You remove a component from your system, run the experiment again, and measure what changes. If the component matters, the numbers drop. If it doesn’t, you have discovered that your architecture is carrying dead weight — or worse, that the gains you are claiming belong to something else.
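The mechanic is simple enough to sketch in a few lines. Everything here is illustrative: `train_and_eval`, the component flags, and the scores are invented for the example, not taken from any paper.

```python
# Minimal ablation-study sketch: toggle one component off at a time
# and compare against the full system. All names and numbers are
# hypothetical stand-ins for a real training pipeline.

def train_and_eval(use_attention: bool, use_aux_loss: bool) -> float:
    """Stand-in for a real training run; returns a validation score."""
    score = 70.0                      # hypothetical base score
    if use_attention:
        score += 2.5                  # hypothetical real contribution
    if use_aux_loss:
        score += 0.1                  # near-zero contribution: dead weight?
    return score

full = train_and_eval(use_attention=True, use_aux_loss=True)
ablations = [
    ("attention", dict(use_attention=False, use_aux_loss=True)),
    ("aux_loss",  dict(use_attention=True,  use_aux_loss=False)),
]
for name, kwargs in ablations:
    ablated = train_and_eval(**kwargs)
    print(f"without {name}: {ablated:.1f} (delta {full - ablated:+.1f})")
```

If removing a component barely moves the number, the honest conclusion is that the component is not where the gain comes from, and the paper's story has to change accordingly.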

The practice sounds unremarkable. In reality, it is one of the most frequently omitted steps in machine learning research. Not because researchers fail to understand its value, but because the answer might be inconvenient. A careful ablation can reveal that the novel contribution — the thing the paper exists to propose — adds nothing meaningful. The Hyperparameter Tuning mattered more than the architecture. The Baseline Model was never given a fair chance. Why would a researcher invest months of work, compute budgets, and career capital into a paper, only to run the one experiment that might invalidate the entire effort?

What Rigorous Evaluation Promises

The research community is not unaware of this tension. There is, in fact, a well-established consensus that thorough Model Evaluation requires controlled comparisons, fair baselines, and transparent reporting of what works and what doesn’t.

NeurIPS has enforced a mandatory paper checklist since 2021, with desk rejection for submissions that omit it; the checklist covers Reproducibility, transparency, ethics, and societal impact. Separate efforts like the REFORMS framework, developed by consensus across computer science, data science, mathematics, social sciences, and biomedical sciences, provide structured templates for rigorous reporting. The tools exist. The norms are stated. The conferences have spoken.

And yet the gap persists. The NeurIPS checklist, notably, does not explicitly mandate ablation studies. It asks whether experiments are “sufficient to support the claims” — a question that researchers answer about their own work, without independent verification. The system trusts the people most incentivized to say yes.

The Incentive Hiding in the Benchmark Table

The conventional wisdom assumes that peer review catches what self-assessment misses. That assumption deserves more scrutiny than it typically receives.

Consider what independent teams found when they reran experiments with proper controls. Melis et al. showed that properly tuned standard LSTMs outperformed more complex architectures on Penn Treebank and WikiText-2, and that the prior comparisons declaring victory for novel methods had relied on inconsistent codebases and limited compute. Lucic et al. found that no tested GAN variant consistently outperformed the original 2014 non-saturating GAN once given fair hyperparameter tuning and random restarts. And by 2018, Lipton & Steinhardt had already identified four troubling trends in machine learning scholarship (failure to identify sources of gains, mathiness, speculation presented as explanation, and misuse of language), concluding that gains often stem from tuning, not architecture.
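"Fairly tuned," in the sense these replications use, means giving the simple baseline its own hyperparameter search and multiple random restarts rather than a single default run. A minimal sketch of that discipline, with a toy closed-form objective standing in for real training (the response surface, ranges, and scores are all invented):

```python
import random

def baseline_score(lr: float, dropout: float, seed: int) -> float:
    """Toy stand-in for training a simple baseline; higher is better."""
    random.seed(seed)
    # hypothetical response surface plus per-seed run-to-run noise
    return (70.0 - 50 * (lr - 0.01) ** 2
            - 5 * (dropout - 0.3) ** 2
            + random.gauss(0, 0.2))

rng = random.Random(0)
best = -float("inf")
for trial in range(50):                     # random search over hyperparams
    lr = 10 ** rng.uniform(-4, -1)          # log-uniform learning rate
    dropout = rng.uniform(0.0, 0.6)
    # average over random restarts, not a single lucky seed
    mean_score = sum(baseline_score(lr, dropout, s) for s in range(3)) / 3
    best = max(best, mean_score)
print(f"tuned baseline: {best:.2f}")
```

The point of the sketch is the shape of the loop: a baseline that never gets its own search and its own restarts is not a comparison, it is a foil.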

These are not outliers. They describe a pattern. The pressure behind it is structural: publishing depends on novelty, novelty requires outperforming baselines, and the simplest path to outperforming a baseline is to not tune it properly. Nobody writes a paper announcing, “Our new approach performs the same as a standard model when both are fairly evaluated.” That paper does not get accepted. It might be the most important result in the session — but the incentive system cannot recognize it as such.

Peer Review as Ceremony

If the internal incentives are misaligned, the external safeguards should compensate. But peer review in machine learning operates under constraints that make thorough verification rare. Reviewers volunteer their time, face thousands of submissions, and lack the compute resources to reproduce results. Few reviews check whether reported improvements cross the threshold of Statistical Significance, let alone whether each component was ablated in isolation. The review is a reading exercise, not a replication exercise.
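Checking whether a reported gain clears seed noise does not require heavy machinery. Even a Welch-style t statistic over a handful of seeds, sketched below with made-up scores, is more than many benchmark tables report:

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical validation scores across 5 random seeds; not real results.
new_model = [81.2, 80.7, 81.5, 80.9, 81.1]
baseline  = [80.8, 81.0, 80.6, 81.3, 80.7]

def welch_t(a: list, b: list) -> float:
    """Welch's t statistic for two independent samples."""
    va, vb = stdev(a) ** 2, stdev(b) ** 2
    return (mean(a) - mean(b)) / sqrt(va / len(a) + vb / len(b))

t = welch_t(new_model, baseline)
print(f"mean gain: {mean(new_model) - mean(baseline):+.2f}, t = {t:.2f}")
# |t| well below ~2 suggests the 'improvement' may be seed noise
```

Here the headline gain of +0.20 yields t ≈ 1.09: exactly the kind of difference a leaderboard will happily rank and a significance test will not support.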

The AblationBench project put this to a sharper test. When frontier large language models were tasked with identifying missing ablations in ICLR submissions, the best performer identified only thirty-eight percent of the ablations that human reviewers had flagged (AblationBench). If the most capable automated systems cannot reliably detect what is missing from a paper, and human reviewers lack the time to check, the review process functions as ritual, not verification.

The downstream consequences are not theoretical. Surveys of reproducibility failures, spanning forty-one papers across thirty scientific fields, have implicated six hundred and forty-eight papers in total and identified eight distinct types of data leakage (Kapoor & Narayanan). In radiology alone, thirty-nine of fifty papers had methodological pitfalls; in law, one hundred and fifty-six of one hundred and seventy-one. These are fields that trusted machine learning results which had never been properly ablated, never checked for Benchmark Contamination, never stress-tested against a fairly tuned alternative.

The Missing Accountability Layer

Step back from individual papers and the pattern becomes structural. In medicine or civil engineering, incomplete testing is not a scholarly debate — it is a liability. A bridge engineer who omits a load test does not get to argue that the bridge “probably” holds. A pharmaceutical company that skips a control group does not get to market the drug.

Machine learning occupies an unusual position: its outputs shape hiring decisions, medical diagnoses, and risk assessments, but its research validation standards remain closer to those of a theoretical discipline than an applied one. The Confusion Matrix and Precision, Recall, and F1 Score metrics that accompany published models tell you how the model performed on the author’s chosen test set, under the author’s chosen conditions. They do not tell you whether the architecture itself — the thing the paper proposes as its contribution — is the reason for the performance.
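Those metrics are themselves trivial to compute, which underlines the point: they summarize performance on one chosen test set, and say nothing about where the performance comes from. A self-contained sketch with made-up binary labels:

```python
# Precision, recall, and F1 from a binary confusion matrix.
# Labels are illustrative, not from any dataset.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp)          # of predicted positives, how many are real
recall = tp / (tp + fn)             # of real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)
print(f"confusion matrix: tp={tp} fp={fp} fn={fn} tn={tn}")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Every number above is conditional on the author's test set and evaluation choices; none of them is evidence that the proposed architecture, rather than tuning or preprocessing, produced the result.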

The thesis, stated plainly: selective ablation reporting is not a minor oversight in research practice — it is a structural failure in accountability that allows unverified claims to propagate through the scientific record and into production systems that affect human lives.

The gap is not about individual dishonesty. Most researchers are not fabricating results. They are responding rationally to a system that rewards certain kinds of evidence and punishes others. The Regularization techniques, training schedules, and data preprocessing choices that actually drive performance are rarely the headline — they are the fine print. And when the fine print is where the truth lives, a system that only reads headlines will consistently get the story wrong.

Questions the Research Community Owes Itself

Prescribing solutions from outside the laboratory would be premature — and somewhat beside the point. The problem is not that nobody knows what rigorous evaluation looks like. Checklists exist. Reporting guidelines exist. The problem is that the incentive structure treats thoroughness as optional and novelty as mandatory.

The directions worth examining are structural, not procedural. What would change if conferences required authors to release not only their code but also the complete set of ablation experiments they considered and chose not to run? What if reviewer guidelines specifically asked: “Did the authors compare against a properly tuned simple baseline?” What if compute costs for ablation were funded as part of the research grant, rather than treated as a tax on the researcher’s own time and budget?

New compute resource reporting requirements, introduced for 2026, mark a step in this direction (NeurIPS). Transparency about how much computation went into a result is a necessary first step — though by itself, it does not address whether that computation was spent wisely. Reporting that a model trained for thousands of GPU-hours tells you the cost. It does not tell you whether the simplest baseline was ever given a fraction of that budget to make a fair case.

Where This Argument Falls Short

This critique leans heavily on high-profile examples — papers that were later challenged by dedicated teams with the resources to rerun experiments at scale. The vast majority of papers are never tested this way, which means the true prevalence of the problem remains unknown. It is possible, though I find it unlikely, that the examples are outliers and most published architectures genuinely earn their claimed improvements.

It is also true that ablation studies have real costs. Compute is expensive. Time is finite. Not every component interaction can be tested. A more honest version of this argument would acknowledge that the line between “selective reporting” and “practical constraint” is not always clear — and that the field’s responsibility is to treat that ambiguity as a reason for transparency, not as cover for omission.

The Question That Remains

We have built an entire research ecosystem around the idea that progress is measured by benchmark leaderboards. But if the experiments that would reveal whether progress is real are the ones most likely to go unrun — who is the leaderboard actually for?

Disclaimer

This article is for educational purposes only and does not constitute professional advice. Consult qualified professionals for decisions in your specific situation.

AI-assisted content, human-reviewed. Images AI-generated.