DAN Analysis 8 min read

SWE-bench Pro, ARC-AGI-2, and Humanity's Last Exam: The Benchmarks Defining Frontier Models in 2026

TL;DR

  • The shift: AI labs are abandoning saturated coding benchmarks for contamination-resistant tests that models can’t memorize their way through.
  • Why it matters: The benchmark score on a model card now tells you less than the test it came from — and the labs know it.
  • What’s next: SWE-bench Pro, ARC-AGI-2, and Humanity’s Last Exam become the numbers that separate real capability from leaderboard theater.

Every model card in your feed still leads with a SWE-bench score. Most are quoting a number the test’s own backers no longer trust. On February 23, 2026, OpenAI stopped reporting SWE-bench Verified, the benchmark it had ridden to the top of every coding leaderboard. That’s not a footnote. That’s the industry conceding its favorite yardstick broke.

The Ruler the Labs Threw Out

Thesis: The frontier just traded saturated, contaminated benchmarks for tests with real headroom — and the score everyone still quotes is now the one that tells you the least.

For two years, a handful of benchmark datasets ranked every frontier release. Now the people who built them are walking away.

SWE-bench Verified did its job too well. The frontier cluster bunched above 80 percent, with top models packed near the ceiling, per LLM-Stats. When every model aces a test, the test stops ranking anything.

OpenAI’s stated reason was benchmark contamination. Frontier models were reproducing the gold-standard fixes almost verbatim, which means they had seen the answers. Worse, OpenAI reported that most of the failures it reviewed came from broken or flawed test cases, not model mistakes.

So a high Verified score measured two things: memorization and a noisy grader. Neither is capability.
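
One rough way to picture "almost verbatim" is a line-level similarity check between a model's patch and the gold fix. The sketch below is an illustration only, not OpenAI's method: the function names, the whitespace normalization, and the 0.9 threshold are all assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def patch_similarity(model_patch: str, gold_patch: str) -> float:
    """Return a 0..1 line-level similarity ratio between two patches.

    Blank lines and leading/trailing whitespace are ignored so that
    formatting noise does not hide a near-verbatim match.
    """
    def normalize(patch: str) -> list[str]:
        return [line.strip() for line in patch.splitlines() if line.strip()]

    return SequenceMatcher(None, normalize(model_patch), normalize(gold_patch)).ratio()


def looks_memorized(model_patch: str, gold_patch: str, threshold: float = 0.9) -> bool:
    """Flag a patch whose similarity to the gold fix meets the (assumed) threshold."""
    return patch_similarity(model_patch, gold_patch) >= threshold
```

A high ratio alone does not prove contamination, since a small bug can have only one sensible fix. It is a prompt for a closer look, not a verdict.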

It’s the same arc that flattened MMLU and HumanEval before it. And as labs lean harder on synthetic data generation to feed training runs, the overlap between training corpora and test sets only widens.

The labs didn’t lose faith in the models. They lost faith in the ruler.

Three Tests, One Bet

Three teams answered with three different tests. They made the same bet: measure what a model can’t fake.

SWE-bench Pro, from Scale AI, runs 1,865 real-world tasks across 41 professional repositories, scored by Pass@1, per Scale Labs. The difficulty gap is brutal: models that hit 80 to 95 percent on Verified solve roughly 59 percent on Pro under a standardized harness, according to the Morph SWE-bench Pro tracker. Same models. Honest test. A much lower score.
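
Pass@1 is usually read as the share of tasks a model resolves on a single attempt, with no retries. A minimal sketch of that arithmetic follows, assuming each task reduces to one pass/fail outcome. The function name and input shape are illustrative and are not taken from Scale AI's harness.

```python
def pass_at_1(first_attempt_resolved: list[bool]) -> float:
    """Percentage of tasks resolved on the model's one and only attempt.

    Each entry is True if the single submitted patch passed that task's
    tests, False otherwise. (Input format is an assumption for this sketch.)
    """
    if not first_attempt_resolved:
        raise ValueError("need at least one task result")
    return 100.0 * sum(first_attempt_resolved) / len(first_attempt_resolved)


# Toy data: 3 of 5 hypothetical tasks resolved on the first try.
toy_results = [True, False, True, True, False]
print(f"Pass@1: {pass_at_1(toy_results):.1f}%")
```

Because there is no second try, a single attempt per task leaves less room for a score to be padded by resampling until something passes.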

ARC-AGI-2, from the ARC Prize Foundation, launched in March 2025 — and every frontier model scored zero. The average person solves about two-thirds of it; the grand-prize bar sits above 85 percent, per ARC Prize. Through 2026 the numbers are climbing but still contested across trackers, and the headroom is real.

Humanity’s Last Exam, built by the Center for AI Safety with Scale AI, runs 2,500 expert-written, multi-modal questions. Without tools, the frontier cluster sits in the mid-forties on the official leaderboard, per Scale Labs. Domain experts score around 90 percent. That gap is the point.

Three teams, three tests, one signal: the era of the ace-able benchmark is over.

Who Gains From a Harder Test

The test-makers just became kingmakers. Scale AI now sits behind two of the three benchmarks that matter, sharing Humanity’s Last Exam with the Center for AI Safety; the ARC Prize Foundation holds the third. Whoever defines the test defines the frontier.

Labs that post strong numbers on contamination-resistant tests get something a saturated leaderboard could never sell: a credible claim. A near-perfect Verified score is a press release. A leading SWE-bench Pro score is a moat.

Open-weight models gain the most. On the standardized Pro harness, open releases — GLM-5.2 among them — are running near the top of the coding tier, per the Morph tracker. When the test can’t be gamed by memorization, the gap to closed models narrows on merit.

You’re either competing on the honest test or you’re selling the reflection in a broken mirror.

Who Just Got Exposed

Any model whose entire pitch was a ceiling-scraping Verified score just lost its headline.

The vendor-versus-standardized gap is now visible to everyone. Self-reported SWE-bench Pro figures run higher than what third-party harnesses reproduce, per the Morph tracker. That spread used to be invisible. Now it’s a credibility test.
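
For teams that want to track that spread themselves, the bookkeeping is simple. The sketch below is a hypothetical helper, not any tracker's API: the class, the field names, the 5-point tolerance, and the model names and scores in the usage example are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProScore:
    model: str
    self_reported: float  # vendor-published score, in percent
    standardized: float   # third-party harness score, in percent

    @property
    def gap(self) -> float:
        """Points by which the vendor number exceeds the standardized one."""
        return round(self.self_reported - self.standardized, 1)


def flag_gaps(scores: list[ProScore], max_gap: float = 5.0) -> list[str]:
    """Return the models whose gap exceeds the (assumed) tolerance."""
    return [s.model for s in scores if s.gap > max_gap]


# Hypothetical models and numbers, for illustration only.
scores = [
    ProScore("model-a", self_reported=64.0, standardized=58.5),
    ProScore("model-b", self_reported=57.0, standardized=56.0),
]
print(flag_gaps(scores))
```

The tolerance is a judgment call: harness settings alone can move a score by a few points, so a small gap is not evidence of gaming.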

Teams picking models off saturated leaderboards are flying blind. If your selection process still ranks on Verified, you’re optimizing for a test the field already retired.

The saturated leaderboard was a mirror. The new ones are a window.

What Happens Next

Base case (most likely): Labs converge on SWE-bench Pro, ARC-AGI-2, and Humanity’s Last Exam as the default trio on model cards, and the vendor-versus-standardized gap becomes a standing scrutiny point. Signal to watch: More labs follow OpenAI and quietly drop SWE-bench Verified from their reporting. Timeline: Through the second half of 2026.

Bull case: Contamination-resistant evaluation restores trust in published scores, and ARC-AGI-3’s interactive-reasoning track becomes the next proving ground. Signal: Third-party standardized harnesses become the citation everyone defaults to, not vendor numbers. Timeline: Late 2026.

Bear case: The new tests get gamed or fragment across harness settings, until no two reported scores are comparable and the trust problem simply moves addresses. Signal: Self-reported Pro numbers drift far above standardized results with no reconciliation. Timeline: Into 2027.

Frequently Asked Questions

Q: Which benchmarks are AI labs reporting on model cards in 2026?
A: Frontier labs increasingly report SWE-bench Pro for coding, ARC-AGI-2 for abstract reasoning, and Humanity’s Last Exam for expert knowledge. OpenAI dropped SWE-bench Verified in February 2026, pushing the field toward contamination-resistant tests.

Q: How did frontier models score on SWE-bench Verified and ARC-AGI-2 in 2026?
A: On SWE-bench Verified, top models bunched above 80 percent, the saturation that got it deprecated. On ARC-AGI-2, scores climbed from zero at its 2025 launch but stayed contested and far from the 85-percent prize bar, according to third-party trackers.

Q: Will new benchmarks like ARC-AGI-2 and Humanity’s Last Exam restore meaningful AI evaluation in 2026?
A: They restore headroom: domain experts still beat top models by more than 40 points on Humanity’s Last Exam. But durability depends on resisting contamination as training data catches up. Expect the cycle to repeat unless harnesses stay standardized.

The Bottom Line

The number on the model card matters less than the test behind it. In 2026, a benchmark nobody can ace is worth more than one everybody tops. You’re either tracking which labs post contamination-resistant scores — or you’re trusting a ruler their peers already threw out.

Stay ahead, Dan.

AI-assisted content, human-reviewed. Images AI-generated.
