
Claude Code, OpenHands, and Devin: How the 2026 SWE-bench Race Is Reshaping Code Execution Agents

[Image: Three competing code execution agents racing along diverging scaffolding paths above a benchmark leaderboard]
Before you dive in

This article is a specific deep-dive within our broader topic of Code Execution Agents.

TL;DR

  • The shift: The same base model now produces wildly different SWE-bench scores depending on which agent scaffold wraps it.
  • Why it matters: The buying decision for Code Execution Agents stopped being about the LLM and started being about the harness.
  • What’s next: First-party scaffolds (Claude Code, Codex) pull away from generic integrations, and broader evals replace pure Verified scoring.

The leaderboard is misleading you. Anthropic’s models hold five of the top ten slots on SWE-bench Verified. GPT-5.5 sits at 88.7%. Headlines keep crowning a new king every six weeks. But the actual purchasing decision — which coding agent your team should bet on — has stopped being a model decision. The scaffold around the LLM now swings results by roughly 12 points on the same base weights. That’s the whole story.

The Real Race Isn’t Between Models — It’s Between Scaffolds

Thesis: The 2026 SWE-bench race has split — frontier models keep climbing, but the agent framework wrapped around them now decides whether code ships or stalls.

For two years the assumption was simple. Pick the strongest model, get the strongest agent. That assumption is broken.

Look at Claude Opus 4.6 alone. On a bare scaffold it posts 80.8% on SWE-bench Verified. Wrap that same model in OpenHands with CodeAct v3 and the score lands at 68.4% (Awesome Agents). Same weights. Same benchmark. A 12-point gap.

That’s not noise. That’s architecture.

The benchmark is no longer measuring the model. It’s measuring how well the harness plans, retries, recovers, and reads the file system. That work is Workflow Orchestration For AI, and the agent is the product now — not the model.
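What does that harness work actually look like? Below is a minimal sketch, in Python, of the loop every scaffold implements some version of: the model proposes a step, the harness executes it, and failures get fed back as context instead of ending the run. Everything here (the stubbed call_model, the tool shape, the 20-step budget) is illustrative, not any vendor’s actual scaffold.

```python
# Minimal agent-harness loop: plan -> act -> observe -> recover.
# Illustrative sketch only; call_model stands in for a real LLM API call.
import subprocess

MAX_STEPS = 20  # the retry/step budget is itself a scaffold design choice

def call_model(history):
    """Stand-in for an LLM call: list the workspace, then declare done."""
    if len(history) == 1:
        return {"type": "tool", "cmd": "ls"}
    return {"type": "done", "summary": "inspected workspace"}

def run_tool(action):
    """Run a shell command and capture output so the model can recover."""
    result = subprocess.run(action["cmd"], shell=True,
                            capture_output=True, text=True, timeout=120)
    return {"stdout": result.stdout, "stderr": result.stderr,
            "exit_code": result.returncode}

def agent_loop(task):
    history = [{"role": "user", "content": task}]
    for _ in range(MAX_STEPS):
        action = call_model(history)      # plan: model proposes a step
        if action["type"] == "done":
            return action["summary"]
        obs = run_tool(action)            # act: execute in the sandbox
        history.append({"role": "tool", "content": obs})
        # recover: failures go back into context, so the next model call
        # sees the actual stderr instead of stalling blind.
    return "step budget exhausted"

print(agent_loop("fix the failing test"))
```

The scaffold gap lives in exactly these choices: the step budget, what gets fed back on failure, and how much of the file system the model is allowed to read.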

Three Numbers, One Pattern

The evidence stacks in the same direction from three independent angles.

Anthropic’s first-party scaffold leads. Claude Code on Opus 4.7 posts 87.6% on SWE-bench Verified (llm-stats.com). The same Opus weights under a generic scaffold drop materially. Anthropic isn’t winning because Opus is best in isolation. Anthropic is winning because Anthropic built the scaffold.

OpenAI ran the same playbook. GPT-5.5 hit 88.7% on April 23, 2026 (OpenAI), but the agent shipping into real workflows is GPT-5.3-Codex — purpose-built tooling, Codex-tuned, leading Terminal-Bench 2.0 at 77.3% and SWE-bench Pro at 56.8%. The Codex variant exists because raw GPT-5.x without a scaffold is not the same product.

Then the production reality check. Devin scores 45.8% on SWE-bench Verified unassisted (Cognition). It also posts a 67% PR merge rate on real codebases, up from 34% the prior year (Cognition Blog). That gap — benchmark vs production — isn’t a Devin problem. METR found maintainer merge decisions run about 24 percentage points below SWE-bench grader pass rates across the field.

Three labs. Three approaches. One signal: the benchmark and the product are diverging.

The Winners

Anthropic owns the moment. Five of the top ten SWE-bench Verified entries are Claude variants (llm-stats.com). Claude Code ships everywhere — terminal, VS Code, JetBrains, desktop, web, iOS — and powers the 87.6% Opus 4.7 result (techjacksolutions.com). Subscription tiers sit at $20/mo Pro, $100–$200/mo Max, and $100/seat/mo Premium with a 5-seat minimum (Claude pricing page). Teams paying API rates pay $5/MTok input and $25/MTok output for Opus 4.7 (Anthropic’s pricing page) — but get the scaffold and the model as one stack.
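For teams weighing API rates against those subscription tiers, the arithmetic is short. A back-of-envelope sketch using the per-token rates quoted above; the monthly token volumes are hypothetical placeholders, not usage data:

```python
# Back-of-envelope Opus 4.7 API cost at the rates quoted above.
# Token volumes below are hypothetical examples.
INPUT_RATE = 5.0 / 1_000_000    # $5 per million input tokens
OUTPUT_RATE = 25.0 / 1_000_000  # $25 per million output tokens

def monthly_api_cost(input_tokens, output_tokens):
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# e.g. a heavy user pushing 40M input / 4M output tokens a month:
cost = monthly_api_cost(40_000_000, 4_000_000)
print(f"${cost:,.0f}/mo vs. the $100-$200/mo Max tier")  # -> $300/mo
```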

OpenAI’s Codex line is the other commercial winner. GPT-5.3-Codex leads SWE-bench Pro and Terminal-Bench 2.0. Not because GPT-5.3 is stronger than GPT-5.5 — it isn’t — but because Codex is engineered for the loop, not the leaderboard.

OpenHands wins the open-source category. Roughly 68–72% on SWE-bench Verified with a Claude 4 backend (OpenHands blog), and the OpenHands Index, launched January 28, 2026, expanded evaluation beyond Verified into issue resolution, greenfield, frontend, and testing. Right move at the right time.

The Losers

Teams treating SWE-bench Verified as the buying signal lost the plot. The score predicts grader pass rates. It does not predict maintainer merge decisions. METR’s ~24-point production gap means an 80% benchmark agent can still see nearly half of its PRs rejected.
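To see what that implies in practice, here is a crude adjustment assuming METR’s roughly 24-point average gap applies uniformly. That assumption is a simplification (Devin’s numbers above show the gap can run the other way for agents tuned on real PRs), but it reframes the leaderboard usefully:

```python
# Crude benchmark-to-production adjustment. Assumes METR's ~24-point
# field-wide gap applies uniformly, which is a simplification.
METR_GAP = 24  # percentage points below grader pass rates

def expected_merge_rate(verified_score):
    return max(verified_score - METR_GAP, 0)

for score in (88.7, 87.6, 68.4):
    print(f"Verified {score:.1f}% -> ~{expected_merge_rate(score):.1f}% merged")
# An 80% benchmark agent nets out near 56%: close to a coin flip per PR.
```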

Bare-model integrators are next. Anyone wiring Opus 4.7 or GPT-5.5 into a custom in-house harness is competing against Anthropic and OpenAI’s purpose-built scaffolds — and starting 10 to 20 points down before they ship a feature.

Then the frontier-as-marketing crowd. Claude Mythos Preview posts 93.9% on SWE-bench Verified (Anthropic) but isn’t a shipping coding agent — it’s gated inside Project Glasswing for cybersecurity-only access across a dozen partner firms. You cannot pick it. Stop benchmarking against it.

What Happens Next

Base case (most likely): The scaffold-vs-model gap widens through Q4 2026. The meaningful buying question shifts from “best model” to “best agent on which model.” Signal to watch: A second-tier base model topping SWE-bench Verified inside a stronger scaffold than its frontier sibling. Timeline: Six to nine months.

Bull case: The OpenHands Index and SWE-bench Pro become the standard. Verified loses authority as scaffolds optimize for production-shaped work. Signal: A frontier lab headlines results on OpenHands Index or SWE-bench Pro before Verified. Timeline: Twelve months.

Bear case: The gap stays unfixed. Teams keep buying on Verified, hit the production wall, and the category cools for two quarters before broader evals reset expectations. Signal: A wave of “we tried it and rolled back” enterprise post-mortems. Timeline: Three to six months.

Frequently Asked Questions

Q: Which code execution agent tops SWE-bench Verified in 2026? A: Among shipping coding agents, Claude Code on Opus 4.7 leads at 87.6% (llm-stats.com). GPT-5.3-Codex tracks near 85%. OpenHands with a Claude 4 backend sits at 68–72%. Claude Mythos Preview (93.9%) is restricted access, not a buyable product.

Q: What is the future of code execution agents? A: Scaffolding decides outcomes more than the underlying LLM. Expect first-party agents (Claude Code, Codex) and open-source frameworks (OpenHands) to converge on broader evaluations like the OpenHands Index and SWE-bench Pro, while pure-Verified scoring loses authority.

Q: Where are code execution agents headed after Claude Mythos and GPT-5.2? A: GPT-5.2 was the inflection point at 80.0%. GPT-5.3-Codex and GPT-5.5 already moved the ceiling, and Mythos sits above the shipping market without being available for it. The frontier keeps climbing — but the buying decision moves to scaffold quality and production merge rates.

The Bottom Line

The 2026 race isn’t Claude vs GPT-5. It’s first-party scaffolds vs everything else. If your evaluation framework still treats SWE-bench Verified as the answer, you’re shopping with last year’s map. You’re either testing agents on your own codebase or you’re trusting a leaderboard that doesn’t predict your production outcomes.
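If you run that in-house test, the bookkeeping is simple. A minimal sketch of tracking per-task merge outcomes so agents get compared on your merge rate rather than a public leaderboard; the structure and identifiers here are hypothetical:

```python
# Minimal in-house eval tracker. Hypothetical structure, not a standard
# tool; the point is to score agents on YOUR merge outcomes, not Verified.
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class TaskResult:
    agent: str    # e.g. "claude-code", "codex", "openhands"
    task_id: str  # internal issue/ticket identifier
    merged: bool  # did a human maintainer actually merge the PR?

def merge_rates(results):
    merged, total = defaultdict(int), defaultdict(int)
    for r in results:
        total[r.agent] += 1
        merged[r.agent] += r.merged
    return {agent: merged[agent] / total[agent] for agent in total}

results = [
    TaskResult("claude-code", "JIRA-101", True),
    TaskResult("claude-code", "JIRA-102", False),
    TaskResult("openhands", "JIRA-101", True),
    TaskResult("openhands", "JIRA-102", True),
]
print(merge_rates(results))  # -> {'claude-code': 0.5, 'openhands': 1.0}
```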

