Prerequisites for Code Execution Agents: From ReAct Loops to microVM Isolation

ELI5
A code execution agent is an LLM that writes code, runs it, reads the output, and decides what to do next. Building one safely requires three layers: a reasoning loop, a sandbox runtime, and hardware-level isolation underneath.
There is a tempting one-paragraph version of this technology. The LLM writes Python, you call exec(), you feed the result back into the prompt, and you have built an agent. People build this. People are surprised when their agent learns to delete the parent directory.
What looks like a single capability — “the model can run code” — is actually a stack of three independent decisions made under three different sets of constraints. Skip any of them and the whole thing collapses in a way that is either embarrassing, expensive, or audit-relevant.
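To make the failure concrete, here is a minimal sketch of that naive loop: model output goes straight into exec() on the host, with no sandbox and no isolation boundary. model_call is a hypothetical stand-in for whatever LLM client you use; nothing here is a recommended design.

```python
import contextlib
import io

def model_call(prompt: str) -> str:
    """Hypothetical placeholder: returns Python source generated by an LLM."""
    raise NotImplementedError

def naive_agent(task: str, max_turns: int = 5) -> str:
    # The whole "agent": generate code, exec() it in this process, feed output back.
    prompt = f"Task: {task}\nWrite Python that makes progress. Output only code."
    for _ in range(max_turns):
        code = model_call(prompt)
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(code, {})  # untrusted code, host process, host filesystem
            observation = buffer.getvalue()
        except Exception as exc:
            observation = f"Error: {exc!r}"
        prompt += f"\n# Output of your last code:\n{observation}"
    return prompt
```

Everything that follows is about replacing pieces of this sketch: the loop's guardrails, the interpreter it shares with your application, and the kernel underneath.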
The Three Layers Underneath Every Code Execution Agent
Before naming the layers, it helps to be precise about what a code execution agent actually does. It does not “have code execution.” It alternates between two distinct modes — generating tokens that happen to be syntactically valid code, and observing the side effects of running those tokens through an interpreter. The agent is the glue between the two modes. Everything else — the SDK, the sandbox, the isolation primitive — is plumbing that makes the alternation safe and fast enough to be useful.
What do I need to know before building a code execution agent?
You need to know three things: the reasoning loop that decides when to call code, the runtime that executes it, and the isolation boundary that contains the blast radius. Each layer has a default that looks adequate in a notebook and turns out to be wrong in production. The reasoning-loop default is “let the model decide everything.” The runtime default is “spin up a Docker container.” The isolation default is “Docker is enough.” All three defaults fail under load, under attack, or under audit.
The interesting thing is that the failures stack. A poorly bounded reasoning loop will generate weirder code; weirder code is more likely to escape a weak sandbox; a weak sandbox shares the host kernel with everything else. The shape of the stack matters as much as any individual choice.
The reasoning loop: ReAct as the conceptual ancestor
The conceptual ancestor of every modern code-execution agent is ReAct, introduced by Yao and colleagues in October 2022 and published at ICLR 2023 (arXiv). ReAct interleaved reasoning traces with actions — “reason to act, act to reason” — and evaluated the pattern on HotpotQA, Fever, ALFWorld, and WebShop. On the harder action-heavy benchmarks the absolute success rate jumped by 34 percentage points on ALFWorld and 10 points on WebShop versus imitation and reinforcement-learning baselines (arXiv).
The mechanism is unremarkable in retrospect. The model emits a thought, then an action, then receives an observation, then emits the next thought. The prompt is the entire state machine. There is no separate planner — and that is the structural difference between ReAct and classical Workflow Orchestration For AI, where the graph of steps lives outside the model. Here the language model is the planner, the executor, and the critic, distinguished only by which token slot it is currently filling.
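A schematic of that alternation, as a minimal sketch rather than the paper's actual prompts: the Thought/Action/Observation format and the model_call and run_tool helpers are illustrative assumptions.

```python
def model_call(prompt: str) -> str:
    """Hypothetical LLM call; returns text ending in a line like 'Action: <tool>[<input>]'."""
    raise NotImplementedError

def run_tool(action_text: str) -> str:
    """Hypothetical dispatcher that parses the Action line and runs the named tool."""
    raise NotImplementedError

def react_loop(question: str, max_steps: int = 8) -> str:
    # The prompt is the entire state machine: thoughts, actions, and observations
    # accumulate in it, and the model sees the full trace on every step.
    prompt = f"Question: {question}\n"
    for _ in range(max_steps):  # hard budget so the loop cannot spin forever
        completion = model_call(prompt)  # "Thought: ..." followed by "Action: ..."
        prompt += completion + "\n"
        if "Action: finish" in completion:
            return completion.split("Action: finish", 1)[1].strip(" []\n")
        observation = run_tool(completion)
        prompt += f"Observation: {observation}\n"
    return "Step budget exhausted without an answer."
```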
That is the conceptual ancestor. It is not the current implementation pattern. Modern systems — the Responses API, Claude’s code execution tool, agent SDKs — wrap tool-use protocols around the same loop but with structured boundaries between reasoning and action. ReAct, in the form Yao et al. described, predates these APIs and should be read as the architectural argument for why the loop exists, not as a runnable design today.
The runtime: sandbox SDKs as the convenience layer
Once you commit to the loop, you need somewhere to actually run the code. The market in May 2026 splits into two tiers.
The foundation-model providers ship code execution as a built-in tool. OpenAI’s Code Interpreter, available through the Responses API, runs Python in a sandboxed environment at $0.03 per session (OpenAI API Docs). Anthropic’s code execution tool exposes server-side Python, bash, and file operations, with a “Programmatic Tool Calling” mode where Claude writes a script that orchestrates other tools rather than calling them one at a time (Claude API Docs). These are the path of least resistance. You get a working sandbox without thinking about the runtime.
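A sketch of the built-in path via the Responses API. The tool schema here (the code_interpreter type and its container setting) is an assumption based on the documentation at the time of writing; check the current API reference before relying on it.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Assumed tool schema for the hosted Code Interpreter; verify against current docs.
response = client.responses.create(
    model="gpt-4.1",  # any tool-capable model; the name here is illustrative
    tools=[{"type": "code_interpreter", "container": {"type": "auto"}}],
    input="Generate 1,000 random numbers and report their mean and standard deviation.",
)

# The model writes and runs Python server-side; you only see the final text.
print(response.output_text)
```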
The dedicated providers ship sandboxes as a primitive. E2B is the most established — its open-source SDK reached version 2.21.0 for Python and 2.20.0 for JavaScript in early May 2026 (E2B Docs). Daytona advertises sub-90 ms cold starts (vendor marketing cites a 27 ms best case — treat that as a best case, not a typical figure) and charges $0.0504 per vCPU-hour plus $0.0162 per GiB-hour, with $200 of free compute (Daytona). Modal sandboxes run at roughly 3× the standard Modal compute rate, with the multiplier reflecting the per-sandbox overhead rather than the underlying instance (Northflank Blog).
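For comparison, a sketch of the dedicated-provider path using E2B's Python SDK (pip install e2b-code-interpreter, with E2B_API_KEY set). Method names follow the SDK as documented at the time of writing and should be treated as illustrative.

```python
from e2b_code_interpreter import Sandbox

# Boots an isolated cloud sandbox; the context manager tears it down on exit.
with Sandbox() as sandbox:
    execution = sandbox.run_code("import sys\nprint(sys.version)")
    print(execution.logs)  # stdout/stderr captured inside the sandbox, not on your host
```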
The choice between the two tiers reduces to a question about control. The built-in tools are correct for agents that just need to run pandas. The dedicated providers are correct when you need GPU support, custom base images, persistent filesystems, or transparent control over how the sandbox is torn down.
The Isolation Layer: Containers Are Not Enough
The interesting choices live underneath the sandbox SDK, at the layer where one piece of code is prevented from observing another. Vendors describe this in marketing as “isolation,” and the word covers radically different mechanisms.
A Linux container is a process — a normal one, viewed through the lens of cgroups and namespaces. It shares the host kernel. Kernel vulnerabilities, kernel-level side channels, and entire classes of escape are not theoretical: they are the operational history of multi-tenant container platforms.
Not isolation. Bookkeeping.
Two technologies sit at a level genuinely worth calling isolation, and both have entered the agent stack within the last twelve months.
How Firecracker reframes the unit of trust
Firecracker is a virtual machine monitor — a KVM-based VMM — that AWS open-sourced as the substrate for Lambda and Fargate. The numbers are the design statement: roughly 125 ms to boot a microVM, less than 5 MiB of memory overhead per VM, and up to 150 microVMs launched per second per host (AWS Open Source Blog). The codebase powers what AWS describes as trillions of executions per month across Lambda and Fargate (Firecracker GitHub).
The architectural move is to give every untrusted workload its own kernel. The agent’s sandboxed process is not a tenant in a shared kernel — it is the only thing running in a tiny dedicated kernel that boots, executes, and dies inside the same second. The blast radius of an exploit shrinks from “host kernel” to “throwaway VM that no longer exists.”
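What that looks like in practice, as a sketch: Firecracker is driven either over a REST API on a Unix socket or from a JSON config file at launch. The paths below are placeholders; the config keys mirror Firecracker's documented machine configuration.

```python
import json
import subprocess
import tempfile

# Placeholder artifacts: a guest kernel image and a root filesystem you provide.
config = {
    "boot-source": {
        "kernel_image_path": "/path/to/vmlinux",
        "boot_args": "console=ttyS0 reboot=k panic=1 pci=off",
    },
    "drives": [{
        "drive_id": "rootfs",
        "path_on_host": "/path/to/rootfs.ext4",
        "is_root_device": True,
        "is_read_only": False,
    }],
    "machine-config": {"vcpu_count": 1, "mem_size_mib": 128},
}

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(config, f)
    config_path = f.name

# One disposable kernel per untrusted workload: boot, execute, discard.
# This call blocks until the microVM exits; orchestrators run many of these per host.
subprocess.run(
    ["firecracker", "--api-sock", "/tmp/fc.sock", "--config-file", config_path],
    check=True,
)
```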
How gVisor swaps the kernel for a user-space replacement
gVisor takes a different route. Instead of giving the workload a real kernel, it gives it a fake one — a user-space “guest kernel” written in Go that intercepts system calls and re-implements them in memory-safe code (gVisor Docs). It ships as runsc, an OCI runtime that slots into Docker and Kubernetes wherever a runtime is expected.
The implication for agents is the MAGI pattern — Multi-Agent gVisor Isolation — published April 15, 2026, which proposes one sandboxed container per agent so that one agent’s hallucinated rm -rf cannot reach another agent’s working directory (gVisor Blog). The pattern treats the agent loop as a unit of trust rather than the host as a unit of trust.
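A sketch of that per-agent boundary using Docker with gVisor's runsc runtime. It assumes runsc is already registered with the Docker daemon; the image name and workspace paths are placeholders.

```python
import subprocess

def run_agent_code(agent_id: str, code: str) -> str:
    """Run one agent's generated Python inside its own gVisor-backed container."""
    result = subprocess.run(
        [
            "docker", "run", "--rm",
            "--runtime=runsc",                            # gVisor user-space kernel
            "--network=none",                             # no outbound network by default
            "-v", f"/srv/agents/{agent_id}:/workspace",   # per-agent working directory
            "-w", "/workspace",
            "python:3.12-slim",                           # placeholder base image
            "python", "-c", code,
        ],
        capture_output=True, text=True, timeout=60,
    )
    return result.stdout or result.stderr

# A hallucinated destructive command from one agent stays inside that agent's
# container and filesystem; other agents never see it.
```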

What the Stack Predicts
The structure of the stack is not academic. It maps directly onto observable failure modes and operational choices.
- If your agent’s reasoning loop has no budget on tool calls, expect the loop to either flatline on a single bug or burn through tokens on a recursion the model finds funny.
- If your runtime is a notebook server you stood up on a single VM, expect one agent’s misbehavior to take down every other agent’s session, because they all share the kernel and the filesystem.
- If your isolation layer is “we run it in Docker,” expect your security team to ask exactly one question — “do you trust the LLM’s output to never include a kernel exploit?” — and to be unimpressed by the answer.
The architectural rule of thumb that falls out of this is simple: the unit of isolation should be tighter than the unit of failure you can tolerate. If you can tolerate losing a single user’s session, isolate per session. If you cannot, isolate per call.
Rule of thumb: Use the foundation-model built-ins for prototypes, dedicated sandbox SDKs for production workloads, and never let untrusted LLM output touch a shared kernel.
When it breaks: The stack breaks at the seams between layers — when the reasoning loop assumes the runtime is stateful and it isn’t, when the runtime assumes the isolation layer can survive a malicious filesystem write and it can’t, or when a “sandbox” is actually just a container sharing the host kernel. Most production incidents in this space are seam failures, not single-component failures.
Security & compatibility notes:
- vm2 sandbox-escape CVEs: Multiple critical vulnerabilities in the vm2 Node.js sandbox — CVE-2026-43999 (CVSS 9.9), CVE-2026-43997 (CVSS 10.0), and CVE-2026-26332 (CVSS 9.8) — allow sandboxed code to escape to the host. Patches landed in the 3.11.x series. Do not use vm2 to sandbox AI-generated JavaScript.
- OpenAI Assistants API shutdown: The Assistants API beta — /v1/assistants, /v1/threads, and related endpoints — reaches end-of-life on August 26, 2026. Migrate to the Responses API.
- Containers are not a sandbox for untrusted code: Both gVisor’s April 2026 MAGI post and multiple 2026 vendor write-ups argue that vanilla containers are insufficient isolation for AI-generated code. Use a microVM (Firecracker) or user-space kernel (gVisor) layer underneath.
The Data Says
Code execution agents are not a feature. They are an integration of three independent systems — a reasoning loop, a runtime, and an isolation primitive — each of which has its own threat model and its own pricing curve. The teams that build safely are the ones that make a deliberate choice at every layer rather than inheriting whatever default came with the first SDK they installed.