Cold Starts, Flaky Tests, and Context Blowup: The Technical Limits of Code Execution Agents in 2026

ELI5
Code execution agents in 2026 hit three structural walls: sandboxes can be fast or isolated but rarely both, benchmarks measure flaky tests as often as model skill, and effective context shrinks far below the window the vendor advertises.
A team I spoke with last quarter rolled out a code execution agent pipeline that benchmarked beautifully on SWE-bench Verified, ran clean in their lab, then quietly degraded over two weeks in production until the on-call channel went sideways. No single bug. No single regression. Just three different ceilings, brushing against each other.
That pattern keeps repeating. And it is not random.
The three ceilings an agent runs into
The first thing worth noticing about agent failures in 2026 is that they cluster. Engineers expect a long tail of bespoke bugs. Instead, almost every production incident a coding agent produces traces back to one of three structural limits — limits that come from how sandboxes, benchmarks, and attention layers are built, not from the model itself.
What are the technical limitations of code execution agents?
Three. Always three.
The first is sandbox cold-start latency, which sits in direct tension with isolation strength. The second is benchmark fragility — the scaffolding we use to measure agent skill is itself unreliable, and the field is still in the middle of admitting it. The third is effective context collapse, where the working memory an agent actually uses on a long task is a fraction of the context window the model card promises.
None of these are growing pains that more training will fix. They are properties of the underlying mechanism. The article that follows walks each one down to where the math gets uncomfortable.
Cold starts: the latency–isolation trade-off
A sandbox for an agent has to do two things that fight each other. It has to start fast enough that an agent loop with hundreds of tool calls remains usable — and it has to isolate the executing code well enough that a prompt-injected script or a model hallucination cannot reach the host system or steal another tenant’s data.
Modern stacks resolve this tension by picking a position on a curve. Firecracker microVMs, used by E2B, give you kernel-level isolation with cold starts in the roughly 150ms–2s range (Northflank Blog), and pre-warmed snapshot pools pull typical boots down to about 150ms (E2B Docs). Daytona uses containers with optional microVM hardening and reports sub-90ms code-to-execution, with optimised paths around 27ms (Daytona Docs). Browser isolates on Cloudflare start in under 50ms but support a narrower language matrix (Cloudflare Blog).
Read those numbers as best-case marketing figures. They are vendor-reported, measured under favourable load, and almost certainly diverge from the p95 and p99 latencies a busy agent will see in production. Treat them as the lower bound of what is physically reachable, not the operating mean.
The mechanism underneath the trade-off is straightforward. Stronger isolation means more layers between the executing code and the host — a separate kernel, a separate page table, a hypervisor stage in the syscall path. Each layer costs time to construct and to tear down. You can amortise that cost with snapshots and pools, but you cannot eliminate it. Pricing reflects the same physics: a Daytona vCPU-hour runs about $0.0504 with $0.0162 per GiB-hour (Daytona’s pricing page), and the cheaper, faster tiers across the market are precisely the ones with thinner isolation.
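To make the amortisation concrete, here is a minimal sketch of a warm pool. The Sandbox class and its boot costs are stand-ins invented for illustration, not any vendor's SDK; the structural point is that boots happen off the agent's critical path, and the acquire call only pays a cold start when the pool runs dry.

```python
import queue
import threading
import time


class Sandbox:
    """Stand-in for a vendor sandbox: the boot cost is simulated, not real."""

    def __init__(self, isolation: str):
        self.isolation = isolation
        # Stronger isolation means more layers to construct, hence a slower boot.
        boot_seconds = {"container": 0.03, "microvm": 0.15}[isolation]
        time.sleep(boot_seconds)  # simulated cold start

    def run(self, code: str) -> str:
        return f"ran {len(code)} bytes under {self.isolation} isolation"


class SandboxPool:
    """Amortise cold starts by keeping pre-warmed sandboxes off the critical path."""

    def __init__(self, size: int, isolation: str):
        self.isolation = isolation
        self._warm = queue.Queue()
        for _ in range(size):
            # Warm in the background so agent tool calls never wait on a boot.
            threading.Thread(target=lambda: self._warm.put(Sandbox(isolation)),
                             daemon=True).start()

    def acquire(self) -> Sandbox:
        try:
            return self._warm.get_nowait()    # warm path: boot already paid
        except queue.Empty:
            return Sandbox(self.isolation)    # cold path: pay the boot now

    def release(self, sandbox: Sandbox) -> None:
        # A real pool would reset or snapshot-restore the sandbox before reuse.
        self._warm.put(sandbox)


pool = SandboxPool(size=4, isolation="microvm")
sandbox = pool.acquire()
print(sandbox.run("print('hello')"))
pool.release(sandbox)
```

The same structure applies whether the pool holds containers, microVM snapshots, or browser isolates; only the simulated boot cost changes.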
The single most useful frame for an agent architect is this: cold-start latency is not a vendor problem, it is a physics problem. Pick the isolation level your threat model requires, then optimise within that envelope.
When the benchmark is the bug
The natural next question is whether all this latency engineering is paying off — whether agents are actually getting better at the work. In 2026, that question runs straight into a second wall, and this one is more uncomfortable than the first.
Why SWE-bench Verified started lying
SWE-bench Verified was supposed to be the clean benchmark. Human-reviewed, expert-curated, the version everyone could trust. Then SWE-bench Pro’s authors audited the hardest unsolved problems on Verified and reported that roughly 59% of them contained flawed test cases (SWE-Bench Pro paper). Not flawed solutions — flawed graders. Agents were failing because the test harness disagreed with itself, or because the patch the agent produced was correct in a way the test could not recognise.
The downstream effect showed up in the scores. Claude Opus 4.5 records 80.9% on SWE-bench Verified (SWE-bench leaderboard) and 45.9% on SWE-bench Pro under its standardised scaffolding (Scale SEAL Leaderboard). Pro covers 1,865 long-horizon tasks across 41 repositories, with patches averaging around 107 lines across 4.1 files (SWE-Bench Pro paper) — a workload closer to what production engineers actually build.
As of early 2026, OpenAI stopped reporting Verified scores (Morph LLM), and the field has been migrating toward Pro and toward consistency-based metrics. The lesson is not that Verified was useless. It is that a single-pass accuracy number on a flaky test suite tells you something other than what you think it tells you.
Why one good run does not mean a good agent
Sierra’s tau-bench introduced a measurement called pass^k, which asks a different question entirely: out of k independent attempts on the same task, in how many did the agent succeed? GPT-4o function calling can clear 60% on a single attempt at tau-retail but drops to roughly 25% at pass^8 (Sierra tau-bench paper). The model has not become worse. The benchmark has stopped letting it cherry-pick a lucky trajectory.
The mechanism here is a property of stochastic sampling. Every long-horizon agent run is a sequence of probabilistic choices over tool calls, formats, and intermediate plans. A single good run can survive several small mistakes that happen to cancel. Averaged over independent attempts, those mistakes stop cancelling. What you see is the underlying reliability, not the headline number.
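The arithmetic is easy to sketch. The estimator below follows the standard combinatorial form, the analogue of the usual pass@k estimate but requiring all k sampled attempts to succeed; it is an illustration of the metric's shape, not tau-bench's reference implementation.

```python
from math import comb


def pass_hat_k(n_attempts: int, n_successes: int, k: int) -> float:
    """Estimate P(all k independent attempts succeed) for one task, given
    n_successes observed in n_attempts sampled runs (n_attempts >= k)."""
    if n_successes < k:
        return 0.0
    return comb(n_successes, k) / comb(n_attempts, k)


def benchmark_pass_hat_k(per_task: list[tuple[int, int]], k: int) -> float:
    """Average the per-task estimate over a benchmark.
    per_task holds (n_attempts, n_successes) pairs, one per task."""
    return sum(pass_hat_k(n, c, k) for n, c in per_task) / len(per_task)


# A task that succeeds 6 times out of 8 looks fine at k=1
# and collapses once every attempt has to land.
print(pass_hat_k(8, 6, k=1))  # 0.75
print(pass_hat_k(8, 6, k=4))  # ~0.21
print(pass_hat_k(8, 6, k=8))  # 0.0 -- two failures mean not all 8 succeed
```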
A separate UC Berkeley result reported in April 2026 sharpened the point further. According to a secondary writeup, all eight major agent benchmarks studied were shown to be reward-hackable to near-perfect scores by adversarial agent strategies that did not actually solve the tasks (Rapid Claw Scorecard). Treat the exact figure as preliminary until the primary paper is widely cited. The direction of travel is clear regardless: single-shot leaderboard numbers are increasingly the worst signal an architect can budget against.
Context windows that aren’t
The third ceiling shows up only when you ask an agent to do real work for a long time. It does not appear in the benchmark snippets. It does not show up in the latency dashboard. It lives entirely inside the attention mechanism, and it bites every long-horizon agent eventually.
The advertised context window — 200k tokens, 1M tokens, whichever vendor figure is current — describes the maximum number of tokens the model can technically attend to. The effective context, the portion the model actually uses well for the task, is consistently smaller. One survey of 2026 long-context behaviour reports that on complex tasks, models fall short of their advertised maximum by margins that can exceed 99%, with simple retrieval working at 5,000 tokens while complex reasoning collapses between 400 and 1,200 tokens (Atlan).
The mechanism is dilution. Attention is a soft-max over similarities. As the context grows, irrelevant tokens compete for probability mass against the few tokens that actually matter for the next step. The model still attends to everything. It just attends to the right things less.
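A toy calculation makes the dilution visible. The similarity scores below are invented for illustration, not measured from any model, but the shape of the curve is the point: the softmax mass on the one token that matters falls roughly in proportion to the number of distractors.

```python
import math


def relevant_attention_mass(n_distractors: int,
                            relevant_score: float = 4.0,
                            distractor_score: float = 0.0) -> float:
    """Softmax weight on one relevant token competing with n_distractors
    irrelevant tokens. The scores here are illustrative assumptions."""
    relevant = math.exp(relevant_score)
    distractors = n_distractors * math.exp(distractor_score)
    return relevant / (relevant + distractors)


for n in (100, 1_000, 10_000, 100_000):
    print(f"{n:>7} distractors -> {relevant_attention_mass(n):.4f} of attention mass")
```

With these toy scores the relevant token holds about 35% of the mass against 100 distractors and well under 1% against 100,000. Real attention heads are more structured than this, but the direction is the same.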
Anthropic’s recent engineering writeups describe the current state of practice for long-running agents: compaction of stale tool calls, structured note-taking that lets the agent externalise plan state, and multi-agent harnesses that isolate sub-tasks into separate contexts to keep each one short (Anthropic Engineering). The Context Folding paper goes further, showing up to 10× smaller active context while matching baseline performance on long-horizon tasks by recursively summarising completed sub-tasks back into compact state (Context Folding paper).
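A compaction pass can be sketched in a few lines. The helpers below, including the summarise stub and the rough token count, are assumptions made for illustration rather than Anthropic's harness or the Context Folding algorithm; the structural idea is simply that stale tool output gets replaced by short summaries while the most recent turns stay verbatim.

```python
def rough_tokens(text: str) -> int:
    # Crude estimate (~4 characters per token); a real harness would use the model tokenizer.
    return max(1, len(text) // 4)


def summarise(message: dict) -> dict:
    # Stand-in for an LLM-written summary of a completed tool call.
    head = message["content"][:120].replace("\n", " ")
    return {"role": message["role"], "content": f"[compacted] {head}..."}


def compact(transcript: list[dict], budget_tokens: int, keep_recent: int = 6) -> list[dict]:
    """Replace stale tool output with summaries until the transcript fits the budget.
    The most recent turns stay verbatim because they carry the agent's working state."""
    transcript = list(transcript)
    i = 0
    while sum(rough_tokens(m["content"]) for m in transcript) > budget_tokens:
        if i >= len(transcript) - keep_recent:
            break  # only recent turns remain; compaction alone cannot fit this budget
        msg = transcript[i]
        if msg["role"] == "tool" and not msg["content"].startswith("[compacted]"):
            transcript[i] = summarise(msg)
        i += 1
    return transcript


# Usage: trimmed = compact(agent_transcript, budget_tokens=8_000)
```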
Security & compatibility notes:
- SWE-bench Verified (WARNING): Community considers it partially obsoleted by Pro due to test-quality issues; OpenAI stopped reporting Verified scores in early 2026. Pin agent evaluations to Pro plus a pass^k consistency metric (Morph LLM).
- Single-shot leaderboard scores (INFO): April 2026 finding shifted the field toward N-run consistency and cost-adjusted accuracy. Prefer pass^k over headline accuracy when budgeting reliability (Rapid Claw Scorecard).
- Naive long-context prompting (INFO): Performance degrades sharply with context length; compaction, folding, and multi-agent harnesses are now baseline (Anthropic Engineering).
What this predicts for production agents
The reason these three limits cluster matters more than any individual number. Each one is a property of how the system is built — kernel boundaries, stochastic sampling, attention geometry — and that means each one yields predictable failure shapes you can plan against.
- If your agent loop makes many small tool calls in tight succession, the dominant performance cost will be sandbox cold-start, and the fastest gain will come from snapshot pooling or batching calls inside a single sandbox session — not from a faster model.
- If your agent scores well on a public benchmark but degrades in production, you are almost certainly looking at a pass^1 / pass^k gap. The benchmark let it sample a good trajectory; production averages over the bad ones.
- If your agent works on short tasks and fails on long ones with no obvious failure mode, the failure is usually attention dilution. Adding more context will make it worse, not better. Compaction or sub-agent decomposition is the lever, paired with a workflow orchestration layer that owns the hand-offs between contexts.
Rule of thumb: Pick the isolation level the threat model requires, measure agents on pass^k rather than pass^1, and architect for context as a budget you spend rather than a window you fill.
When it breaks: Production agents will hit a regression that none of these three frames predicts when two limits interact — for example, when context compaction strategies drop a piece of state that the next sandbox call required, producing a failure that looks like a model bug but is actually an architectural one. The trade-offs do not stay clean.
The Data Says
Three structural limits — sandbox cold-start versus isolation, benchmark fragility, effective context collapse — explain a striking share of production incidents from code execution agents in 2026. None of them disappear with a larger model. All of them yield to architectural choices made before the agent is built.