Claude Opus 4.6, GPT-5.4 Operator, and Project Mariner: The 2026 Browser Agent Leaderboard Race

TL;DR
- The shift: A three-way browser-agent headline collapsed into a two-lab frontier duel inside fifteen months.
- Why it matters: Buyers planning around “Mariner vs Operator vs Opus 4.6” are budgeting for products that no longer ship as standalone.
- What’s next: The next leaderboard fight is Mythos Preview vs GPT-5.5 — with open-source agent stacks closing on WebArena from below.
The “three-way race” headline was true for about a quarter. Then it stopped being true. Project Mariner is gone. Operator no longer ships as a separate product. The original Opus 4.6 is already two generations back. If you bought the narrative, you bought a snapshot — and the picture moved.
The Race Ended Before Most Buyers Saw the Starting Gun
Thesis: The 2026 browser and computer-use agent race is no longer a three-way contest. It is a two-lab frontier duel, with open-source stacks closing the gap on WebArena from below.
Three names went into 2026. By May, only two labs were still in the race.
Google retired Project Mariner on May 4, 2026, folding its visual web-agent tech into Gemini Agent and Chrome’s auto-browse (Wikipedia). OpenAI had made the same move a year earlier: Operator merged into ChatGPT agent in 2025, and “Operator” today is a capability label inside the agent, not a separate product (OpenAI Help Center). Anthropic kept the model-name cadence going: Claude Opus 4.6 shipped in early February; Opus 4.7 took the flagship slot within months; a research preview called Mythos now sits at the top of the public leaderboards (Anthropic Blog).
That’s not three competing products. That’s a market that consolidated faster than its own press cycle.
You’re either planning around the new two-lab frontier or you’re budgeting for SKUs that vendors are deliberately collapsing.
Three Releases, Two Survivors, One Wildcard
The reported OSWorld-Verified scores tell the story. Claude Mythos Preview sits at 79.6%, GPT-5.5 at 78.7%, and Claude Opus 4.7 at 78.0%, per LLM Stats. GPT-5.4 trails at 75.0%; the standard OSWorld result for Opus 4.6 is 72.7%.
Every entry on the OSWorld-Verified board is currently self-reported, not independently audited — read the numbers as direction, not law. Mythos itself is a research preview, not a shipping product.
WebArena tells the same story from another angle. Mythos leads the closed-model board at 68.7%, with GPT-5.4 Pro at 65.8% and Opus 4.6 at 64.5% (BenchLM).
The human baseline is around 78% (EmergentMind). Closed frontier models are still below human on web-task completion.
Now the wildcard: framework stacks. A DeepSeek v3.2 agent stack reportedly hits 74.3% on WebArena, per Steel.dev’s leaderboard; OpAgent, a planner-grounder-reflector pipeline from CodeFuse AI, posted 71.6% earlier this year.
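To make the comparison concrete, the sketch below computes how many percentage points separate each reported WebArena score from the human baseline. It uses only the figures quoted above; the baseline is taken as 78.0 because the source says "around 78%", and the dictionary labels and the function name `gap_to_human` are illustrative assumptions, not part of any leaderboard API.

```python
# Reported WebArena scores quoted in this article (percent of tasks completed).
# These are vendor- or leaderboard-reported figures, not independently audited.
HUMAN_BASELINE = 78.0  # assumption: the article cites "around 78%"

reported_scores = {
    "Claude Mythos Preview (closed model)": 68.7,
    "GPT-5.4 Pro (closed model)": 65.8,
    "Claude Opus 4.6 (closed model)": 64.5,
    "DeepSeek v3.2 agent stack (framework)": 74.3,
    "OpAgent (framework)": 71.6,
}


def gap_to_human(score: float, baseline: float = HUMAN_BASELINE) -> float:
    """Percentage points separating a score from the human baseline."""
    return round(baseline - score, 1)


gaps = {name: gap_to_human(score) for name, score in reported_scores.items()}

# Print entries from the smallest gap to the largest.
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name}: {gap} points below the human baseline")
```

On these figures the framework stacks sit closer to the human baseline than any closed model does, which is the point of the "from below" framing.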
These aren’t bigger base models. They’re workflow orchestration layered on commodity models, the same pattern that let retrieval-augmented agents displace first-generation chatbot stacks. Orchestration is doing what brute scale used to do.
The frontier is two labs. The catch-up lane is the open-source stack.
Who Moves Up
Anthropic is the obvious winner of the consolidation. Cowork’s enterprise launch on February 24, 2026, bundled Deep Connectors for Drive, Gmail, DocuSign, and FactSet, moving Claude from “model API” to “agent inside your existing tools.”
Ramp’s AI Index reportedly put Anthropic at 34.4% of tracked enterprise AI spend versus OpenAI’s 32.3% in April (a single-vendor index, so treat it as directional). The EPAM partnership, which plans to certify 10,000 Claude architects and has 1,300 certified already, is the kind of consulting-channel build-out that compounds (EPAM Newsroom).
OpenAI is the other winner, even though the press cycle didn’t always read it that way. Folding Operator into ChatGPT agent looked like a retreat at the time. With GPT-5.5 reportedly leading on BrowseComp-style tasks, the consolidation now reads as concentration — one product surface, one capability label, one billing relationship.
Open-source framework vendors moved up too. Steel.dev, BrowserBase, and the agent-stack scaffolding around DeepSeek and Qwen aren’t winning the OSWorld board. They’re winning the cost curve.
Who Got Left Behind
Google’s web-agent ambitions didn’t die — they got absorbed. But “absorbed” is not “leading.” Project Mariner held headline space; Gemini Agent and Chrome auto-browse don’t yet hold leaderboard space (Wikipedia; Android Headlines).
Anyone who pitched their stack around “Mariner integration” or a standalone “Operator API” through last quarter is reskinning slide decks. Anyone building procurement plans around standalone browser-agent SKUs is buying a category that vendors are deliberately collapsing.
And the system-card caveat is worth naming. Anthropic’s own Opus 4.6 documentation notes the model is “at times overly agentic in coding and computer use settings, taking risky actions without first seeking user permission” (Anthropic System Card). Pilots that skipped the guardrail layer are about to learn what code-execution agents actually do when you hand them live credentials.
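What a guardrail layer means in practice varies by deployment. As a minimal sketch, assume the agent proposes each action as a small record and a wrapper decides whether a human must approve it before it runs. Every name here (`ProposedAction`, `RISKY_KINDS`, `run_with_gate`) is hypothetical and does not correspond to any vendor's API.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical action categories that should never run without sign-off.
RISKY_KINDS = {"purchase", "send_email", "delete", "submit_credentials"}


@dataclass(frozen=True)
class ProposedAction:
    kind: str         # e.g. "click", "purchase" (illustrative labels)
    description: str  # human-readable summary shown to the approver


def run_with_gate(
    action: ProposedAction,
    execute: Callable[[ProposedAction], str],
    approve: Callable[[ProposedAction], bool],
) -> str:
    """Run low-risk actions directly; ask for approval before risky ones."""
    if action.kind in RISKY_KINDS and not approve(action):
        # The action is dropped and execute() is never called.
        return f"blocked: {action.description}"
    return execute(action)


if __name__ == "__main__":
    def deny_all(action: ProposedAction) -> bool:
        return False  # stand-in for a real human-approval prompt

    def fake_execute(action: ProposedAction) -> str:
        return f"executed: {action.description}"

    print(run_with_gate(ProposedAction("click", "open pricing page"), fake_execute, deny_all))
    print(run_with_gate(ProposedAction("purchase", "buy 12 seats"), fake_execute, deny_all))
```

The design choice is to keep the gate outside the model: the allow or deny decision depends on the action's category rather than on the agent's own judgment about whether to ask.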
What Happens Next
Base case (most likely): The leaderboard race tightens to Anthropic vs OpenAI, with one open-source stack reaching 80%+ on WebArena before year-end. Signal to watch: Mythos Preview moves from research preview to GA pricing. Timeline: Q3 2026.
Bull case: Open-source agent stacks pass the WebArena human baseline by late 2026, breaking frontier-lab pricing power for browser-agent inference. Signal: A second framework stack (not just DeepSeek v3.2) clears the 75% line with reproducible runs. Timeline: Q4 2026.
Bear case: A high-profile computer-use incident — data exfiltration, an unauthorized purchase, an irreversible workflow action — triggers an enterprise pause. Procurement freezes on agent deployments for two quarters. Signal: First named-enterprise rollback published in a major outlet. Timeline: Possible inside six months.
Frequently Asked Questions
Q: Which browser and computer use agents lead OSWorld and WebArena in 2026?
A: As of May 2026, Claude Mythos Preview reportedly leads OSWorld-Verified at 79.6% and WebArena closed-model at 68.7%, with GPT-5.5 close behind on OSWorld-Verified at 78.7%. At the framework level, a DeepSeek v3.2 agent stack reportedly leads WebArena at 74.3% per Steel.dev.
Q: How is Anthropic Computer Use being deployed in real enterprise workflows?
A: Through Anthropic Cowork, launched February 24, 2026, Claude now connects to Drive, Gmail, DocuSign, and FactSet via Deep Connectors. EPAM is training 10,000 Claude-certified architects to deploy these workflows at large enterprises, with 1,300 already certified.
The Bottom Line
The three-way framing aged out before buyers caught up. Plan around two frontier labs, one closing open-source lane, and one enterprise-deployment surface that already moved from API to embedded workflow. The next leaderboard print isn’t a benchmark — it’s a procurement signal.
Disclaimer
This article discusses financial topics for educational purposes only. It does not constitute financial advice. Consult a qualified financial advisor before making investment decisions.
AI-assisted content, human-reviewed. Images AI-generated.