DAN Analysis

Claude Opus 4.7 vs GPT-5.3 Codex: 2026 Agent Race on GAIA, SWE-bench

[Image: Two model leaderboards for GAIA and SWE-bench splitting along an agent scaffolding boundary in 2026]
Before you dive in

This article is a specific deep-dive within our broader topic of Agent Planning and Reasoning.

TL;DR

  • The shift: Agent benchmark leadership in 2026 is scaffolding-coupled — the same model swings tens of points based on what wraps it.
  • Why it matters: If you’re picking your Agent Planning and Reasoning stack on raw model rankings, you’re optimizing the wrong variable.
  • What’s next: Buyers will start asking “which scaffold?” before “which model?” by Q3.

Three labs shipped flagship reasoning models in the past ninety days. None of them changed the GAIA leaderboard. That’s the story.

The story isn’t who released what. The story is what the leaderboards are measuring — and what they’re hiding.

Benchmark Leadership Is Now Scaffolding-Coupled

Thesis: In 2026, the model isn’t the agent — the scaffold is the agent, and the leaderboards finally make that visible.

Look at GAIA. Princeton’s HAL leaderboard puts Anthropic models in every one of the top six positions, led by Claude Sonnet 4.5 at 74.55% accuracy on the HAL Generalist Agent scaffold (Princeton HAL). The bare-model contender — GPT-5 Mini without scaffolding — sits at 44.8% on the same questions (PricePerToken). That’s a thirty-point swing on identical evaluation data.

Same model class. Different stack. Different planet.

Now look at SWE-bench Verified. Claude Opus 4.7 hit 87.6% at launch, up 6.8 points over Opus 4.6 (Vellum). GPT-5.3-Codex landed at 85.0% (llm-stats leaderboard). GPT-5.5 followed late April at 88.7% on the public leaderboard (marc0.dev leaderboard). Three releases. Two labs. Same benchmark. Lead changes monthly.

The pattern: leadership is migrating from model architecture to runtime scaffold. That’s where the money will follow.

Three Releases, One Direction

The releases that defined Q1 and Q2 weren’t isolated launches. They’re three points on the same line.

GPT-5.3-Codex shipped February 6, 2026, with new highs on SWE-Bench Pro and Terminal-Bench, and nearly double 5.2-Codex’s score on OSWorld-Verified (OpenAI announcement). It wasn’t sold as a model. It was sold as a “general work agent.”

Claude Opus 4.7 followed April 16, holding price flat at $5/$25 per million input/output tokens (Anthropic). The benchmark sweep was selective: SWE-bench Verified up 6.8 points, OSWorld-Verified at 78.0%, BrowseComp regressing 4.4 points (Vellum). Anthropic chose what to optimize. The chosen targets were agent benchmarks, not chat benchmarks.

GPT-5.5 hit the API April 24, taking the public SWE-bench Verified crown at 88.7%. Two weeks later, Anthropic’s “Mythos Preview” appeared at 93.9% — preview-only, not generally available, but a direct reply.

Two labs, four releases, one consistent target: the agent.

The pure-chat benchmark era just ended.

Who Wins When Scaffolding Beats Model Choice

The winners are the teams that stopped treating “which model” as a one-shot procurement decision.

Anthropic wins twice. Sonnet 4.5 still holds GAIA #1 because no newer Sonnet has been re-submitted to HAL — older sibling, current crown. Sonnet 4.6 is now the production default for Multi Agent Systems, with Anthropic’s recommended 2026 tiering placing Opus 4.7 as planner, Sonnet 4.6 as balanced default, and Haiku 4.5 for high-volume narrow tasks (DataCamp).
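That tiering is easy to encode as a routing layer. The sketch below is illustrative only: the model identifier strings and task labels are assumptions inferred from the tiering described above, not confirmed API model names or an official SDK.

```python
# Hypothetical router for the three-tier setup described above.
# Model ID strings are assumptions, not confirmed Anthropic API names.
TIERS = {
    "plan": "claude-opus-4-7",       # planner: decomposes goals into sub-tasks
    "execute": "claude-sonnet-4-6",  # balanced production default
    "bulk": "claude-haiku-4-5",      # high-volume narrow tasks (classify, extract)
}

def pick_model(task_kind: str) -> str:
    """Return the model ID for a task kind, falling back to the default tier."""
    return TIERS.get(task_kind, TIERS["execute"])
```

The point of routing at this layer is the compounding play described below: when the tier assignments change with the next release cycle, only the table changes, not the scaffold.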

Princeton wins by accident. The HAL Generalist Agent — open scaffolding — is the de facto kingmaker for Anthropic’s GAIA dominance. Every “Claude leads GAIA” headline is really a “Claude on HAL leads GAIA” headline.

OpenAI wins on throughput. GPT-5.3-Codex-Spark cleared 1,000 tokens per second (OpenAI announcement). Speed at production load is its own moat.

Buyers who already invested in scaffold engineering — orchestration, error recovery, Agent Memory Systems — win the most. Their stack absorbs each new model release without a rewrite.

That’s the compounding play.

Who Gets Left Behind by the Scaffold Pivot

Anyone benchmarking on bare-model leaderboards is reading the wrong tape.

Companies that wrote contracts referencing GPT-5.3-Codex’s exact behavior in February are renegotiating now. Claude 4.6 removed prefilling — a breaking API change that broke LiveKit’s voice-agent pipeline within hours of release (Chanl Blog). If your agent calls the Claude API and depends on prefilling, that pipeline is already dead.

Vendors selling “best model” instead of “best stack” lose next. The procurement question stops being “which LLM tops your eval?” and becomes “which scaffold will my team be running in eighteen months?”

Researchers publishing leaderboard wins without naming the scaffold lose credibility too. Berkeley RDI’s 2026 study showed that several major agent benchmarks — GAIA included — could be exploited to near-perfect scores via leaked reference answers and unsanitized eval() calls (MarkTechPost). The “X model leads GAIA” headline is now footnote-required.

You’re either pricing the scaffold or you’re shopping at last year’s prices.

Security & compatibility notes:

  • Claude prefilling removal (BREAKING): Removed in Claude 4.6 (early 2026). Broke LiveKit’s voice/video agent pipeline on release per GitHub issue #4907. Action: migrate any code path that relied on prefilling before pinning to a 4.x release.
  • Agent benchmark gameability (WARNING): Berkeley RDI 2026 demonstrated several major agent benchmarks — GAIA included — could be exploited via leaked reference answers and unsanitized eval(). Action: treat HAL leaderboard as a directional signal, not ground truth.
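For teams still carrying prefill-dependent code paths, the migration can be mechanical. This is a minimal sketch of one approach, not the Anthropic API: the function name, message shapes, and the move-prefill-into-system-prompt strategy are all illustrative assumptions about the common pre-4.6 pattern of appending a partial assistant message to steer output.

```python
def strip_prefill(messages, system_prompt=""):
    """Move a trailing assistant prefill into the system prompt.

    Illustrative sketch only. Assumes the pre-4.6 pattern of appending
    a partial assistant message to steer the model's reply; message
    dicts use the common {"role", "content"} shape.
    """
    if messages and messages[-1].get("role") == "assistant":
        prefill = messages[-1]["content"]
        messages = messages[:-1]  # drop the prefill turn entirely
        # Steer via instruction instead of a partial assistant turn.
        system_prompt += f"\nBegin your reply exactly with: {prefill!r}"
    return messages, system_prompt

# Example: a request that relied on prefilling "1." to force a list.
msgs = [
    {"role": "user", "content": "List three deployment risks."},
    {"role": "assistant", "content": "1."},
]
new_msgs, new_system = strip_prefill(msgs, "You are terse.")
```

Instruction-based steering is weaker than a hard prefill, so any code path migrated this way should be re-evaluated against its original output contract before pinning to a 4.x release.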

What Happens Next

Base case (most likely): Hyperscalers add scaffold-aware procurement clauses within two quarters. Buyers benchmark Anthropic-on-HAL against GPT-5.x-on-Codex-runtime, not “Anthropic vs OpenAI” in the abstract. Signal to watch: First Fortune 500 RFP that names a specific scaffold version, not just a model family. Timeline: Q3 2026.

Bull case: Open scaffolds (HAL, LangGraph) become the standard, and model providers compete on price within scaffold-defined tiers. Margins on raw inference compress. Margins on tooling expand. Signal: A second top-tier lab publishes its agent scaffold in the open. Timeline: End of 2026.

Bear case: Berkeley’s gameability finding propagates. Buyers lose faith in agent leaderboards entirely, and procurement reverts to internal evals — slowing the agent purchasing cycle by six to nine months. Signal: A second independent reproduction of leaderboard exploits. Timeline: Mid-2027.

Frequently Asked Questions

Q: Which LLMs lead GAIA and SWE-bench Verified for agent planning in 2026? A: As of May 2026, Claude Sonnet 4.5 leads GAIA on Princeton’s HAL Generalist Agent scaffold at 74.55%, with all top six HAL spots held by Anthropic models. GPT-5.5 leads public SWE-bench Verified at 88.7%, with Claude Opus 4.7 second at 87.6%.

Q: How is Claude Sonnet 4.5 used as a planning agent backbone in production? A: Sonnet 4.5 still tops Princeton’s HAL GAIA leaderboard, but Anthropic’s current production default is Sonnet 4.6. The recommended 2026 tiering is Opus 4.7 as planner, Sonnet 4.6 as balanced default, and Haiku 4.5 for high-volume narrow tasks.

Q: What real-world agent systems use ReAct and Reflexion patterns in 2026? A: ReAct is the most production-tested of the seven 2026 agent patterns, running at roughly $0.15 per call on GPT-class models. Reflexion extends it across five phases and is used when failures repeat. LangGraph — the most-adopted multi-agent framework — ships prebuilt scaffolds for both.
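The ReAct pattern itself is small enough to show in full. The sketch below stubs the model with a canned Thought/Action/Final script so it runs standalone; in production the `stub_model` call would be an LLM request, and the tool registry would be real tools. The parsing format and helper names are assumptions for illustration, not LangGraph's API.

```python
def calculator(expr: str) -> str:
    # Toy tool: evaluate simple arithmetic via the AST, never eval().
    # (Unsanitized eval() on model output is exactly the exploit class
    # the Berkeley RDI finding flagged.)
    import ast, operator as op
    ops = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul}
    def ev(node):
        if isinstance(node, ast.Constant):
            return node.value
        return ops[type(node.op)](ev(node.left), ev(node.right))
    return str(ev(ast.parse(expr, mode="eval").body))

TOOLS = {"calculator": calculator}

def stub_model(history: str) -> str:
    # Stand-in for an LLM: one tool call, then a final answer.
    if "Observation:" not in history:
        return "Thought: I need arithmetic.\nAction: calculator[2 * 21]"
    return "Thought: I have the result.\nFinal: 42"

def react(question: str, model=stub_model, max_steps: int = 5) -> str:
    """Reason-then-act loop: model emits an Action, scaffold runs the
    tool, and the Observation is appended for the next step."""
    history = f"Question: {question}"
    for _ in range(max_steps):
        out = model(history)
        if "Final:" in out:
            return out.split("Final:")[1].strip()
        # Parse "Action: tool[arg]" and execute the named tool.
        action = out.split("Action:")[1].strip()
        name, arg = action.split("[", 1)
        observation = TOOLS[name.strip()](arg.rstrip("]"))
        history += f"\n{out}\nObservation: {observation}"
    return "gave up"
```

Reflexion wraps a loop like this in an outer retry that feeds a self-critique of the failed trajectory back into `history`, which is why it earns its cost only when failures repeat.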

The Bottom Line

The 2026 agent race isn’t a model race. It’s a scaffolding race wrapped in model branding. Buyers who price both layers will out-execute buyers who optimize one. The leaderboards are louder than ever — and less reliable than ever. Read them like trade journals, not gospel.

Disclaimer

This article discusses financial topics for educational purposes only. It does not constitute financial advice. Consult a qualified financial advisor before making investment decisions.

AI-assisted content, human-reviewed. Images AI-generated.