DAN Analysis 8 min read June 19, 2026

SWE-bench Pro, ARC-AGI-2, and Humanity's Last Exam: The Benchmarks Defining Frontier Models in 2026

TL;DR

The shift: AI labs are abandoning saturated coding benchmarks for contamination-resistant tests that models can’t memorize their way through.
Why it matters: The benchmark score on a model card now tells you less than the test it came from — and the labs know it.
What’s next: SWE-bench Pro, ARC-AGI-2, and Humanity’s Last Exam become the numbers that separate real capability from leaderboard theater.

Every model card in your feed still leads with a SWE-bench score. Most are quoting a number the test’s own backers no longer trust. On February 23, 2026, OpenAI stopped reporting SWE-bench Verified, the benchmark it had ridden to the top of every coding leaderboard. That’s not a footnote. That’s the industry conceding its favorite yardstick broke.

The Ruler the Labs Threw Out

Thesis: The frontier just traded saturated, contaminated benchmarks for tests with real headroom — and the score everyone still quotes is now the one that tells you the least.

For two years, a handful of benchmark datasets ranked every frontier release. Now the people who built them are walking away.

SWE-bench Verified did its job too well. The frontier cluster bunched above 80 percent, with top models packed near the ceiling, per LLM-Stats. When every model aces a test, the test stops ranking anything.
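
To put a number on "stops ranking anything," here is a minimal sketch that measures how far apart a benchmark spreads its models. The scores are hypothetical, picked only to contrast a saturated leaderboard with one that has headroom, and `score_spread` is an illustrative helper rather than any tracker's tooling.

```python
from statistics import pstdev


def score_spread(scores: list[float]) -> dict[str, float]:
    """Summarize how well a benchmark separates models.

    Returns the range and the population standard deviation, in points.
    """
    if len(scores) < 2:
        raise ValueError("need at least two model scores to measure spread")
    return {"range": max(scores) - min(scores), "stdev": pstdev(scores)}


# Hypothetical scores (percent), not taken from any leaderboard.
saturated = [93.0, 94.0, 95.0, 94.0]
headroom = [41.0, 47.0, 52.0, 59.0]

print(score_spread(saturated))  # small range: models are hard to tell apart
print(score_spread(headroom))   # large range: the test still ranks them
```

A two-point range across four models is within the noise of most evaluation runs. An eighteen-point range is a ranking.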

OpenAI’s stated reason was benchmark contamination. Frontier models were reproducing the gold-standard fixes almost verbatim: they had seen the answers. Worse, OpenAI reported that most of the failures it reviewed came from broken or flawed test cases, not model mistakes.

So a high Verified score measured two things: memorization and a noisy grader. Neither is capability.
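
One rough way to spot that kind of recall is to compare a model's patch against the reference fix, line by line. The sketch below uses Python's standard `difflib`. The helper names and the 0.9 threshold are assumptions for illustration, and this is not a description of OpenAI's method.

```python
from difflib import SequenceMatcher


def patch_similarity(model_patch: str, gold_patch: str) -> float:
    """Return a 0..1 similarity ratio between two patches.

    Lines are stripped and blank lines dropped, so whitespace-only
    differences do not lower the ratio.
    """
    def normalize(patch: str) -> list[str]:
        return [line.strip() for line in patch.splitlines() if line.strip()]

    return SequenceMatcher(None, normalize(model_patch), normalize(gold_patch)).ratio()


def flag_near_verbatim(model_patch: str, gold_patch: str, threshold: float = 0.9) -> bool:
    """Flag a patch whose similarity to the gold fix meets the (assumed) threshold."""
    return patch_similarity(model_patch, gold_patch) >= threshold
```

A high ratio on its own proves nothing, since simple fixes often have one obvious form. It is a prompt for a closer look, not a verdict.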

It’s the same arc that flattened MMLU and HumanEval before it. And as labs lean harder on synthetic data generation to feed training runs, the overlap between training corpora and test sets only widens.

The labs didn’t lose faith in the models. They lost faith in the ruler.

Three Tests, One Bet

Three teams answered with three different tests. They made the same bet: measure what a model can’t fake.

SWE-bench Pro, from Scale AI, runs 1,865 real-world tasks across 41 professional repositories, scored Pass@1, per Scale Labs. The difficulty gap is brutal: models that hit 80 to 95 percent on Verified solve roughly 59 percent on Pro under a standardized harness, according to the Morph SWE-bench Pro tracker. Same models. Honest test. Roughly a third of the score, gone.
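
For readers new to the metric: Pass@1 is the share of tasks a model solves on its single, first attempt. A minimal sketch follows. The count of 1,100 solved tasks is hypothetical, chosen only because it lands near the 59 percent figure cited above.

```python
def pass_at_1(first_attempt_passed: list[bool]) -> float:
    """Fraction of tasks whose single, first attempt passed the task's tests."""
    if not first_attempt_passed:
        raise ValueError("no task results to score")
    return sum(first_attempt_passed) / len(first_attempt_passed)


# Hypothetical run: 1,865 tasks, 1,100 solved on the first try.
results = [True] * 1100 + [False] * 765
print(f"Pass@1: {pass_at_1(results):.1%}")  # prints "Pass@1: 59.0%"
```

There are no retries and no best-of-N in that number, which is part of why it is harder to inflate.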

ARC-AGI-2, from the ARC Prize Foundation, launched in March 2025 — and every frontier model scored zero. The average person solves about two-thirds of it; the grand-prize bar sits above 85 percent, per ARC Prize. Through 2026 the numbers are climbing but still contested across trackers, and the headroom is real.

Humanity’s Last Exam, built by the Center for AI Safety with Scale AI, runs 2,500 expert-written, multi-modal questions. Without tools, the frontier cluster sits in the mid-forties on the official leaderboard, per Scale Labs. Domain experts score around 90 percent. That gap is the point.

Three teams, three tests, one signal: the era of the ace-able benchmark is over.

Who Gains From a Harder Test

The test-makers just became kingmakers. Scale AI now sits behind two of the three benchmarks that matter: SWE-bench Pro outright, and Humanity’s Last Exam alongside the Center for AI Safety. The ARC Prize Foundation holds the third. Whoever defines the test defines the frontier.

Labs that post strong numbers on contamination-resistant tests get something a saturated leaderboard could never sell: a credible claim. A near-perfect Verified score is a press release. A leading SWE-bench Pro score is a moat.

Open-weight models gain the most. On the standardized Pro harness, open releases — GLM-5.2 among them — are running near the top of the coding tier, per the Morph tracker. When the test can’t be gamed by memorization, the gap to closed models narrows on merit.

You’re either competing on the honest test or you’re selling the reflection in a broken mirror.

Who Just Got Exposed

Any model whose entire pitch was a ceiling-scraping Verified score just lost its headline.

The vendor-versus-standardized gap is now visible to everyone. Self-reported SWE-bench Pro figures run higher than what third-party harnesses reproduce, per the Morph tracker. That spread used to be invisible. Now it’s a credibility test.
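
If you track models yourself, that spread is easy to monitor. Here is a minimal sketch, assuming you have collected both figures per model. The model names, the scores, and the three-point tolerance are all hypothetical placeholders, not values from the Morph tracker.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScorePair:
    model: str
    self_reported: float  # percent, as published by the vendor
    standardized: float   # percent, as reproduced by a third-party harness


def reporting_gap(pair: ScorePair) -> float:
    """Points by which the vendor figure exceeds the standardized one."""
    return pair.self_reported - pair.standardized


def flag_large_gaps(pairs: list[ScorePair], tolerance_points: float = 3.0) -> list[ScorePair]:
    """Return pairs whose gap exceeds the tolerance, largest gap first."""
    flagged = [p for p in pairs if reporting_gap(p) > tolerance_points]
    return sorted(flagged, key=reporting_gap, reverse=True)


# Hypothetical entries for illustration only.
pairs = [
    ScorePair("model_a", 64.0, 58.0),
    ScorePair("model_b", 57.0, 56.0),
    ScorePair("model_c", 61.5, 52.0),
]
for p in flag_large_gaps(pairs):
    print(p.model, reporting_gap(p))
```

A gap is not proof of gaming. Different scaffolds and effort settings can move a score legitimately, so treat a flagged model as a question to ask the vendor.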

Teams picking models off saturated leaderboards are flying blind. If your selection process still ranks on Verified, you’re optimizing for a test the field already retired.

The saturated leaderboard was a mirror. The new ones are a window.

What Happens Next

Base case (most likely): Labs converge on SWE-bench Pro, ARC-AGI-2, and Humanity’s Last Exam as the default trio on model cards, and the vendor-versus-standardized gap becomes a standing scrutiny point. Signal to watch: More labs follow OpenAI and quietly drop SWE-bench Verified from their reporting. Timeline: Through the second half of 2026.

Bull case: Contamination-resistant evaluation restores trust in published scores, and ARC-AGI-3’s interactive-reasoning track becomes the next proving ground. Signal: Third-party standardized harnesses become the citation everyone defaults to, not vendor numbers. Timeline: Late 2026.

Bear case: The new tests get gamed or fragment across harness settings, until no two reported scores are comparable and the trust problem simply moves addresses. Signal: Self-reported Pro numbers drift far above standardized results with no reconciliation. Timeline: Into 2027.

Frequently Asked Questions

Q: Which benchmarks are AI labs reporting on model cards in 2026? A: Frontier labs increasingly report SWE-bench Pro for coding, ARC-AGI-2 for abstract reasoning, and Humanity’s Last Exam for expert knowledge. OpenAI dropped SWE-bench Verified in February 2026, pushing the field toward contamination-resistant tests.

Q: How did frontier models score on SWE-bench Verified and ARC-AGI-2 in 2026? A: On SWE-bench Verified, top models bunched above 80 percent — the saturation that got it deprecated. On ARC-AGI-2, scores climbed from zero at its 2025 launch but stayed contested and far from the 85-percent prize bar, according to third-party trackers.

Q: Will new benchmarks like ARC-AGI-2 and Humanity’s Last Exam restore meaningful AI evaluation in 2026? A: They restore headroom — domain experts still beat top models by more than 40 points on Humanity’s Last Exam. But durability depends on resisting contamination as training data catches up. Expect the cycle to repeat unless harnesses stay standardized.

The Bottom Line

The number on the model card matters less than the test behind it. In 2026, a benchmark nobody can ace is worth more than one everybody tops. You’re either tracking which labs post contamination-resistant scores — or you’re trusting a ruler their peers already threw out.

Stay ahead, Dan.

Sources

OpenAI: Why we no longer evaluate SWE-bench Verified - Rationale for deprecating SWE-bench Verified: contamination and flawed test cases.
Scale Labs: SWE-bench Pro Leaderboard - Dataset size, task design, and standardized Pass@1 scores for SWE-bench Pro.
Scale Labs: Humanity’s Last Exam Leaderboard - Official no-tools frontier scores and the human-expert gap.
Center for AI Safety: Humanity’s Last Exam - Benchmark creators, scope, and the 2,500-question expert-written design.
ARC Prize: ARC Prize 2025 Results and Analysis - ARC-AGI-2 launch, zero-at-launch scores, human baseline, and prize thresholds.
Morph SWE-bench Pro tracker: SWE-bench Pro Leaderboard (2026) - Standardized versus vendor-reported model scores and the Verified-to-Pro difficulty gap.
LLM-Stats: SWE-bench Verified & ARC-AGI-v2 Leaderboards - Saturation data for SWE-bench Verified and ARC-AGI-2 frontier tracking.

Aha Moments

MONA

Saturation is a measurement failure, not a milestone. When a test’s items leak into training data, a high score reflects recall, not reasoning — the benchmark stops discriminating between models because it is grading memory. Not smarter models. A broken ruler. What the new tests recover is construct validity: tasks held out from training, graded cleanly, with enough difficulty that the score still varies across systems. That variance is the whole purpose of a benchmark. The moment every frontier model clusters at the ceiling, the instrument has gone blind — it can no longer tell you which model is actually better, only that they all memorized the same set. Contamination resistance is not a feature. It is the precondition for measuring anything at all.

MAX

Mona is right about the grader, and that is the part teams underestimate. A benchmark is a spec. If the test cases are broken — and OpenAI found that many were — you are optimizing against a wrong spec, and every point you gain is noise dressed as progress. The fix is the same discipline you would apply to any acceptance suite: held-out tasks, a standardized harness, real repositories instead of toy problems, and graders you have actually audited. SWE-bench Pro reads like someone finally wrote the spec properly. The lesson for anyone choosing a model: do not trust a score until you know how it was produced. The harness, the effort setting, the tooling — those define the number more than the model does.
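
One way to hold yourself to that rule is to refuse to store a score without the conditions that produced it. A minimal sketch, with hypothetical field names and values. Real harnesses expose more settings than these.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedScore:
    model: str
    benchmark: str
    score: float         # percent
    harness: str         # e.g. "standardized" or "vendor" (assumed labels)
    effort: str          # reasoning-effort setting used for the run
    tools_enabled: bool  # whether the model could call tools


def comparable(a: ReportedScore, b: ReportedScore) -> bool:
    """Two scores are comparable only if benchmark and run conditions match."""
    return (a.benchmark, a.harness, a.effort, a.tools_enabled) == (
        b.benchmark, b.harness, b.effort, b.tools_enabled
    )
```

If `comparable` returns False, the two numbers describe different experiments, and ranking one above the other tells you little.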

ALAN

Both of them are describing a better instrument. I would ask who holds it. A handful of organizations now define what “frontier” means by designing the tests everyone is judged against — and the same companies sometimes build the benchmarks and the models. Harder tests restore headroom, yes. But a harder test is still a test someone chose, scoped, and scored according to their own idea of what intelligence should look like. When the bar moves, capability follows the bar, not the other way around. We optimize for what we measure, and we measure what a few institutions decided was worth measuring. So before we celebrate the new rulers: who gets to decide what counts as progress, and who is in the room when that line is drawn?

AI-assisted content, human-reviewed. Images AI-generated.