
Claude Code, OpenHands, and Devin: How the 2026 SWE-bench Race Is Reshaping Code Execution Agents

[Image: Three competing code execution agents racing along diverging scaffolding paths above a benchmark leaderboard]
Before you dive in

This article is a specific deep-dive within our broader topic of Code Execution Agents.

TL;DR

  • The shift: The same base model now produces wildly different SWE-bench scores depending on which agent scaffold wraps it.
  • Why it matters: The buying decision for Code Execution Agents stopped being about the LLM and started being about the harness.
  • What’s next: First-party scaffolds (Claude Code, Codex) pull away from generic integrations, and broader evals replace pure Verified scoring.

The leaderboard is misleading you. Anthropic’s models hold five of the top ten slots on SWE-bench Verified. GPT-5.5 sits at 88.7%. Headlines keep crowning a new king every six weeks. But the actual purchasing decision — which coding agent your team should bet on — has stopped being a model decision. The scaffold around the LLM now swings results by roughly 12 points on the same base weights. That’s the whole story.

The Real Race Isn’t Between Models — It’s Between Scaffolds

Thesis: The 2026 SWE-bench race has split — frontier models keep climbing, but the agent framework wrapped around them now decides whether code ships or stalls.

For two years the assumption was simple. Pick the strongest model, get the strongest agent. That assumption is broken.

Look at Claude Opus 4.6 alone. On a bare scaffold it posts 80.8% on SWE-bench Verified. Wrap that same model in OpenHands with CodeAct v3 and the score lands at 68.4% (Awesome Agents). Same weights. Same benchmark. A 12-point gap.

That’s not noise. That’s architecture.

The benchmark is no longer measuring the model. It’s measuring how well the harness plans, retries, recovers, and reads the file system. That work is Workflow Orchestration For AI, and the agent is the product now — not the model.
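What does that harness work actually look like? Below is a minimal sketch, in Python, of the loop every scaffold implements some version of: the model proposes a step, the harness executes it, and failures get fed back as context instead of ending the run. Everything here (the stubbed call_model, the tool shape, the 20-step budget) is illustrative, not any vendor’s actual scaffold.

```python
# Minimal agent-harness loop: plan -> act -> observe -> recover.
# Illustrative sketch only; call_model stands in for a real LLM API call.
import subprocess

MAX_STEPS = 20  # the retry/step budget is itself a scaffold design choice

def call_model(history):
    """Stand-in for an LLM call: list the workspace, then declare done."""
    if len(history) == 1:
        return {"type": "tool", "cmd": "ls"}
    return {"type": "done", "summary": "inspected workspace"}

def run_tool(action):
    """Run a shell command and capture output so the model can recover."""
    result = subprocess.run(action["cmd"], shell=True,
                            capture_output=True, text=True, timeout=120)
    return {"stdout": result.stdout, "stderr": result.stderr,
            "exit_code": result.returncode}

def agent_loop(task):
    history = [{"role": "user", "content": task}]
    for _ in range(MAX_STEPS):
        action = call_model(history)      # plan: model proposes a step
        if action["type"] == "done":
            return action["summary"]
        obs = run_tool(action)            # act: execute in the sandbox
        history.append({"role": "tool", "content": obs})
        # recover: failures go back into context, so the next model call
        # sees the actual stderr instead of stalling blind.
    return "step budget exhausted"

print(agent_loop("fix the failing test"))
```

The scaffold gap lives in exactly these choices: the step budget, what gets fed back on failure, and how much of the file system the model is allowed to read.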

Three Numbers, One Pattern

The evidence stacks in the same direction from three independent angles.

Anthropic’s first-party scaffold leads. Claude Code on Opus 4.7 posts 87.6% on SWE-bench Verified (llm-stats.com). The same Opus weights under a generic scaffold drop materially. Anthropic isn’t winning because Opus is best in isolation. Anthropic is winning because Anthropic built the scaffold.

OpenAI ran the same playbook. GPT-5.5 hit 88.7% on April 23, 2026 (OpenAI), but the agent shipping into real workflows is GPT-5.3-Codex — purpose-built tooling, Codex-tuned, leading Terminal-Bench 2.0 at 77.3% and SWE-bench Pro at 56.8%. The Codex variant exists because raw GPT-5.x without a scaffold is not the same product.

Then the production reality check. Devin scores 45.8% on SWE-bench Verified unassisted (Cognition). It also posts a 67% PR merge rate on real codebases, up from 34% the prior year (Cognition Blog). That gap — benchmark vs production — isn’t a Devin problem. METR found maintainer merge decisions run about 24 percentage points below SWE-bench grader pass rates across the field.

Three labs. Three approaches. One signal: the benchmark and the product are diverging.

The Winners

Anthropic owns the moment. Five of the top ten SWE-bench Verified entries are Claude variants (llm-stats.com). Claude Code ships everywhere — terminal, VS Code, JetBrains, desktop, web, iOS — and powers the 87.6% Opus 4.7 result (techjacksolutions.com). Subscription tiers sit at $20/mo Pro, $100–$200/mo Max, and $100/seat/mo Premium with a 5-seat minimum (Claude pricing page). Teams paying API rates pay $5/MTok input and $25/MTok output for Opus 4.7 (Anthropic’s pricing page) — but get the scaffold and the model as one stack.
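For teams weighing API rates against those subscription tiers, the arithmetic is short. A back-of-envelope sketch using the per-token rates quoted above; the monthly token volumes are hypothetical placeholders, not usage data:

```python
# Back-of-envelope Opus 4.7 API cost at the rates quoted above.
# Token volumes below are hypothetical examples.
INPUT_RATE = 5.0 / 1_000_000    # $5 per million input tokens
OUTPUT_RATE = 25.0 / 1_000_000  # $25 per million output tokens

def monthly_api_cost(input_tokens, output_tokens):
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# e.g. a heavy user pushing 40M input / 4M output tokens a month:
cost = monthly_api_cost(40_000_000, 4_000_000)
print(f"${cost:,.0f}/mo vs. the $100-$200/mo Max tier")  # -> $300/mo
```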

OpenAI’s Codex line is the other commercial winner. GPT-5.3-Codex leads SWE-bench Pro and Terminal-Bench 2.0. Not because GPT-5.3 is stronger than GPT-5.5 — it isn’t — but because Codex is engineered for the loop, not the leaderboard.

OpenHands wins the open-source category. Roughly 68–72% on SWE-bench Verified with a Claude 4 backend (OpenHands blog), and the OpenHands Index, launched January 28, 2026, expanded evaluation beyond Verified into issue resolution, greenfield, frontend, and testing. Right move at the right time.

The Losers

Teams treating SWE-bench Verified as the buying signal lost the plot. The score predicts grader pass rates. It does not predict maintainer merge decisions. METR’s ~24-point production gap means an 80% benchmark agent can still see nearly half of its PRs rejected.
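To see what that implies in practice, here is a crude adjustment assuming METR’s roughly 24-point average gap applies uniformly. That assumption is a simplification (Devin’s numbers above show the gap can run the other way for agents tuned on real PRs), but it reframes the leaderboard usefully:

```python
# Crude benchmark-to-production adjustment. Assumes METR's ~24-point
# field-wide gap applies uniformly, which is a simplification.
METR_GAP = 24  # percentage points below grader pass rates

def expected_merge_rate(verified_score):
    return max(verified_score - METR_GAP, 0)

for score in (88.7, 87.6, 68.4):
    print(f"Verified {score:.1f}% -> ~{expected_merge_rate(score):.1f}% merged")
# An 80% benchmark agent nets out near 56%: close to a coin flip per PR.
```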

Bare-model integrators are next. Anyone wiring Opus 4.7 or GPT-5.5 into a custom in-house harness is competing against Anthropic and OpenAI’s purpose-built scaffolds — and starting 10 to 20 points down before they ship a feature.

Then the frontier-as-marketing crowd. Claude Mythos Preview posts 93.9% on SWE-bench Verified (Anthropic) but isn’t a shipping coding agent — it’s gated inside Project Glasswing for cybersecurity-only access across a dozen partner firms. You cannot pick it. Stop benchmarking against it.

What Happens Next

Base case (most likely): The scaffold-vs-model gap widens through Q4 2026. The meaningful buying question shifts from “best model” to “best agent on which model.” Signal to watch: A second-tier base model topping SWE-bench Verified inside a stronger scaffold than its frontier sibling. Timeline: Six to nine months.

Bull case: The OpenHands Index and SWE-bench Pro become the standard. Verified loses authority as scaffolds optimize for production-shaped work. Signal: A frontier lab headlines results on OpenHands Index or SWE-bench Pro before Verified. Timeline: Twelve months.

Bear case: The gap stays unfixed. Teams keep buying on Verified, hit the production wall, and the category cools for two quarters before broader evals reset expectations. Signal: A wave of “we tried it and rolled back” enterprise post-mortems. Timeline: Three to six months.

Frequently Asked Questions

Q: Which code execution agent tops SWE-bench Verified in 2026? A: Among shipping coding agents, Claude Code on Opus 4.7 leads at 87.6% (llm-stats.com). GPT-5.3-Codex tracks near 85%. OpenHands with a Claude 4 backend sits at 68–72%. Claude Mythos Preview (93.9%) is restricted access, not a buyable product.

Q: What is the future of code execution agents? A: Scaffolding decides outcomes more than the underlying LLM. Expect first-party agents (Claude Code, Codex) and open-source frameworks (OpenHands) to converge on broader evaluations like the OpenHands Index and SWE-bench Pro, while pure-Verified scoring loses authority.

Q: Where are code execution agents headed after Claude Mythos and GPT-5.2? A: GPT-5.2 was the inflection point at 80.0%. GPT-5.3-Codex and GPT-5.5 already moved the ceiling, and Mythos sits above the shipping market without being available for it. The frontier keeps climbing — but the buying decision moves to scaffold quality and production merge rates.

The Bottom Line

The 2026 race isn’t Claude vs GPT-5. It’s first-party scaffolds vs everything else. If your evaluation framework still treats SWE-bench Verified as the answer, you’re shopping with last year’s map. You’re either testing agents on your own codebase or you’re trusting a leaderboard that doesn’t predict your production outcomes.
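If you run that in-house test, the bookkeeping is simple. A minimal sketch of tracking per-task merge outcomes so agents get compared on your merge rate rather than a public leaderboard; the structure and identifiers here are hypothetical:

```python
# Minimal in-house eval tracker. Hypothetical structure, not a standard
# tool; the point is to score agents on YOUR merge outcomes, not Verified.
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class TaskResult:
    agent: str    # e.g. "claude-code", "codex", "openhands"
    task_id: str  # internal issue/ticket identifier
    merged: bool  # did a human maintainer actually merge the PR?

def merge_rates(results):
    merged, total = defaultdict(int), defaultdict(int)
    for r in results:
        total[r.agent] += 1
        merged[r.agent] += r.merged
    return {agent: merged[agent] / total[agent] for agent in total}

results = [
    TaskResult("claude-code", "JIRA-101", True),
    TaskResult("claude-code", "JIRA-102", False),
    TaskResult("openhands", "JIRA-101", True),
    TaskResult("openhands", "JIRA-102", True),
]
print(merge_rates(results))  # -> {'claude-code': 0.5, 'openhands': 1.0}
```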

