DAN Analysis

Claude Opus 4.7 vs GPT-5.3 Codex: 2026 Agent Race on GAIA, SWE-bench

[Image: Two model leaderboards for GAIA and SWE-bench splitting along an agent scaffolding boundary in 2026]
Before you dive in

This article is a specific deep-dive within our broader topic of Agent Planning and Reasoning.

TL;DR

  • The shift: Agent benchmark leadership in 2026 is scaffolding-coupled — the same model swings tens of points based on what wraps it.
  • Why it matters: If you’re picking your Agent Planning and Reasoning stack on raw model rankings, you’re optimizing the wrong variable.
  • What’s next: Buyers will start asking “which scaffold?” before “which model?” by Q3.

Three labs shipped flagship reasoning models in the past ninety days. None of them changed the GAIA leaderboard. That’s the story.

The story isn’t who released what. The story is what the leaderboards are measuring — and what they’re hiding.

Benchmark Leadership Is Now Scaffolding-Coupled

Thesis: In 2026, the model isn’t the agent — the scaffold is the agent, and the leaderboards finally make that visible.

Look at GAIA. Princeton’s HAL leaderboard puts Anthropic models in every one of the top six positions, led by Claude Sonnet 4.5 at 74.55% accuracy on the HAL Generalist Agent scaffold (Princeton HAL). The bare-model contender — GPT-5 Mini without scaffolding — sits at 44.8% on the same questions (PricePerToken). That’s a thirty-point swing on identical evaluation data.

Same model class. Different stack. Different planet.

Now look at SWE-bench Verified. Claude Opus 4.7 hit 87.6% at launch, up 6.8 points over Opus 4.6 (Vellum). GPT-5.3-Codex landed at 85.0% (llm-stats leaderboard). GPT-5.5 followed late April at 88.7% on the public leaderboard (marc0.dev leaderboard). Three releases. Two labs. Same benchmark. Lead changes monthly.

The pattern: leadership is migrating from model architecture to runtime scaffold. That’s where the money will follow.

Three Releases, One Direction

The releases that defined Q1 and Q2 weren’t isolated launches. They’re three points on the same line.

GPT-5.3-Codex shipped February 6, 2026, with new highs on SWE-Bench Pro and Terminal-Bench, and nearly double 5.2-Codex’s score on OSWorld-Verified (OpenAI announcement). It wasn’t sold as a model. It was sold as a “general work agent.”

Claude Opus 4.7 followed April 16, holding price flat at $5/$25 per million input/output tokens (Anthropic). The benchmark sweep was selective: SWE-bench Verified up 6.8 points, OSWorld-Verified at 78.0%, BrowseComp regressing 4.4 points (Vellum). Anthropic chose what to optimize. The chosen targets were agent benchmarks, not chat benchmarks.

GPT-5.5 hit the API April 24, taking the public SWE-bench Verified crown at 88.7%. Two weeks later, Anthropic’s “Mythos Preview” appeared at 93.9% — preview-only, not generally available, but a direct reply.

Two labs, four releases, one consistent target: the agent.

The pure-chat benchmark era just ended.

Who Wins When Scaffolding Beats Model Choice

The winners are the teams that stopped treating “which model” as a one-shot procurement decision.

Anthropic wins twice. Sonnet 4.5 still holds GAIA #1 because no newer Sonnet has been re-submitted to HAL — older sibling, current crown. Sonnet 4.6 is now the production default for Multi Agent Systems, with Anthropic’s recommended 2026 tiering placing Opus 4.7 as planner, Sonnet 4.6 as balanced default, and Haiku 4.5 for high-volume narrow tasks (DataCamp).
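That tiering is easy to encode as a routing layer. The sketch below is illustrative only: the model identifier strings and task labels are assumptions inferred from the tiering described above, not confirmed API model names or an official SDK.

```python
# Hypothetical router for the three-tier setup described above.
# Model ID strings are assumptions, not confirmed Anthropic API names.
TIERS = {
    "plan": "claude-opus-4-7",       # planner: decomposes goals into sub-tasks
    "execute": "claude-sonnet-4-6",  # balanced production default
    "bulk": "claude-haiku-4-5",      # high-volume narrow tasks (classify, extract)
}

def pick_model(task_kind: str) -> str:
    """Return the model ID for a task kind, falling back to the default tier."""
    return TIERS.get(task_kind, TIERS["execute"])
```

The point of routing at this layer is the compounding play described below: when the tier assignments change with the next release cycle, only the table changes, not the scaffold.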

Princeton wins by accident. The HAL Generalist Agent — open scaffolding — is the de facto kingmaker for Anthropic’s GAIA dominance. Every “Claude leads GAIA” headline is really a “Claude on HAL leads GAIA” headline.

OpenAI wins on throughput. GPT-5.3-Codex-Spark cleared 1,000 tokens per second (OpenAI announcement). Speed at production load is its own moat.

Buyers who already invested in scaffold engineering — orchestration, error recovery, Agent Memory Systems — win the most. Their stack absorbs each new model release without a rewrite.

That’s the compounding play.

Who Gets Left Behind by the Scaffold Pivot

Anyone benchmarking on bare-model leaderboards is reading the wrong tape.

Companies that wrote contracts referencing GPT-5.3-Codex’s exact behavior in February are renegotiating now. Claude 4.6 removed prefilling — a breaking API change that broke LiveKit’s voice-agent pipeline within hours of release (Chanl Blog). If your agent calls the Claude API and depends on prefilling, that pipeline is already dead.

Vendors selling “best model” instead of “best stack” lose next. The procurement question stops being “which LLM tops your eval?” and becomes “which scaffold will my team be running in eighteen months?”

Researchers publishing leaderboard wins without naming the scaffold lose credibility too. Berkeley RDI’s 2026 study showed that several major agent benchmarks — GAIA included — could be exploited to near-perfect scores via leaked reference answers and unsanitized eval() calls (MarkTechPost). The “X model leads GAIA” headline is now footnote-required.

You’re either pricing the scaffold or you’re shopping at last year’s prices.

Security & compatibility notes:

  • Claude prefilling removal (BREAKING): Removed in Claude 4.6 (early 2026). Broke LiveKit’s voice/video agent pipeline on release per GitHub issue #4907. Action: migrate any code path that relied on prefilling before pinning to a 4.x release.
  • Agent benchmark gameability (WARNING): Berkeley RDI 2026 demonstrated several major agent benchmarks — GAIA included — could be exploited via leaked reference answers and unsanitized eval(). Action: treat HAL leaderboard as a directional signal, not ground truth.
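For teams still carrying prefill-dependent code paths, the migration can be mechanical. This is a minimal sketch of one approach, not the Anthropic API: the function name, message shapes, and the move-prefill-into-system-prompt strategy are all illustrative assumptions about the common pre-4.6 pattern of appending a partial assistant message to steer output.

```python
def strip_prefill(messages, system_prompt=""):
    """Move a trailing assistant prefill into the system prompt.

    Illustrative sketch only. Assumes the pre-4.6 pattern of appending
    a partial assistant message to steer the model's reply; message
    dicts use the common {"role", "content"} shape.
    """
    if messages and messages[-1].get("role") == "assistant":
        prefill = messages[-1]["content"]
        messages = messages[:-1]  # drop the prefill turn entirely
        # Steer via instruction instead of a partial assistant turn.
        system_prompt += f"\nBegin your reply exactly with: {prefill!r}"
    return messages, system_prompt

# Example: a request that relied on prefilling "1." to force a list.
msgs = [
    {"role": "user", "content": "List three deployment risks."},
    {"role": "assistant", "content": "1."},
]
new_msgs, new_system = strip_prefill(msgs, "You are terse.")
```

Instruction-based steering is weaker than a hard prefill, so any code path migrated this way should be re-evaluated against its original output contract before pinning to a 4.x release.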

What Happens Next

Base case (most likely): Hyperscalers add scaffold-aware procurement clauses within two quarters. Buyers benchmark Anthropic-on-HAL against GPT-5.x-on-Codex-runtime, not “Anthropic vs OpenAI” in the abstract. Signal to watch: First Fortune 500 RFP that names a specific scaffold version, not just a model family. Timeline: Q3 2026.

Bull case: Open scaffolds (HAL, LangGraph) become the standard, and model providers compete on price within scaffold-defined tiers. Margins on raw inference compress. Margins on tooling expand. Signal: A second top-tier lab publishes its agent scaffold in the open. Timeline: End of 2026.

Bear case: Berkeley’s gameability finding propagates. Buyers lose faith in agent leaderboards entirely, and procurement reverts to internal evals — slowing the agent purchasing cycle by six to nine months. Signal: A second independent reproduction of leaderboard exploits. Timeline: Mid-2027.

Frequently Asked Questions

Q: Which LLMs lead GAIA and SWE-bench Verified for agent planning in 2026? A: As of May 2026, Claude Sonnet 4.5 leads GAIA on Princeton’s HAL Generalist Agent scaffold at 74.55%, with all top six HAL spots held by Anthropic models. GPT-5.5 leads public SWE-bench Verified at 88.7%, with Claude Opus 4.7 second at 87.6%.

Q: How is Claude Sonnet 4.5 used as a planning agent backbone in production? A: Sonnet 4.5 still tops Princeton’s HAL GAIA leaderboard, but Anthropic’s current production default is Sonnet 4.6. The recommended 2026 tiering is Opus 4.7 as planner, Sonnet 4.6 as balanced default, and Haiku 4.5 for high-volume narrow tasks.

Q: What real-world agent systems use ReAct and Reflexion patterns in 2026? A: ReAct is the most production-tested of the seven 2026 agent patterns, running at roughly $0.15 per call on GPT-class models. Reflexion extends it across five phases and is used when failures repeat. LangGraph — the most-adopted multi-agent framework — ships prebuilt scaffolds for both.
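The ReAct pattern itself is small enough to show in full. The sketch below stubs the model with a canned Thought/Action/Final script so it runs standalone; in production the `stub_model` call would be an LLM request, and the tool registry would be real tools. The parsing format and helper names are assumptions for illustration, not LangGraph's API.

```python
def calculator(expr: str) -> str:
    # Toy tool: evaluate simple arithmetic via the AST, never eval().
    # (Unsanitized eval() on model output is exactly the exploit class
    # the Berkeley RDI finding flagged.)
    import ast, operator as op
    ops = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul}
    def ev(node):
        if isinstance(node, ast.Constant):
            return node.value
        return ops[type(node.op)](ev(node.left), ev(node.right))
    return str(ev(ast.parse(expr, mode="eval").body))

TOOLS = {"calculator": calculator}

def stub_model(history: str) -> str:
    # Stand-in for an LLM: one tool call, then a final answer.
    if "Observation:" not in history:
        return "Thought: I need arithmetic.\nAction: calculator[2 * 21]"
    return "Thought: I have the result.\nFinal: 42"

def react(question: str, model=stub_model, max_steps: int = 5) -> str:
    """Reason-then-act loop: model emits an Action, scaffold runs the
    tool, and the Observation is appended for the next step."""
    history = f"Question: {question}"
    for _ in range(max_steps):
        out = model(history)
        if "Final:" in out:
            return out.split("Final:")[1].strip()
        # Parse "Action: tool[arg]" and execute the named tool.
        action = out.split("Action:")[1].strip()
        name, arg = action.split("[", 1)
        observation = TOOLS[name.strip()](arg.rstrip("]"))
        history += f"\n{out}\nObservation: {observation}"
    return "gave up"
```

Reflexion wraps a loop like this in an outer retry that feeds a self-critique of the failed trajectory back into `history`, which is why it earns its cost only when failures repeat.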

The Bottom Line

The 2026 agent race isn’t a model race. It’s a scaffolding race wrapped in model branding. Buyers who price both layers will out-execute buyers who optimize one. The leaderboards are louder than ever — and less reliable than ever. Read them like trade journals, not gospel.

Disclaimer

This article discusses financial topics for educational purposes only. It does not constitute financial advice. Consult a qualified financial advisor before making investment decisions.

AI-assisted content, human-reviewed. Images AI-generated.