Code Execution Agents

Also known as: CodeAct agent, code-as-action agent, executable-code agent

A code execution agent is an LLM-driven AI agent whose primary action format is executable code, typically Python run inside a sandboxed interpreter, rather than structured JSON function calls. The model writes a snippet, the sandbox runs it, and the result feeds the next reasoning step.

What It Is

Most agents you have already met take action through function calls: the model writes a tidy JSON blob saying “call search_database with these arguments,” and your application is responsible for dispatching that call, formatting the result, and feeding it back. A code execution agent flips this. The model writes a short program — usually Python — and a sandbox actually runs it. The return value becomes the next observation in the reasoning loop. The action space is the language itself.
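To make that loop concrete, here is a minimal sketch of a code-as-action cycle. The helpers `call_model` and `run_in_sandbox` are hypothetical stand-ins for a model API and an isolated interpreter, and the FINAL: convention for ending the loop is just one illustrative choice:

```python
# Minimal code-as-action loop (illustrative sketch, not a production agent).
# `call_model` and `run_in_sandbox` are hypothetical stand-ins for your
# model API and your isolated interpreter.

def call_model(messages: list[dict]) -> str:
    """Ask the LLM for its next action: a Python snippet or a final answer."""
    raise NotImplementedError  # e.g. a chat-completions call to your provider

def run_in_sandbox(code: str) -> str:
    """Execute untrusted code in an isolated interpreter and return its output."""
    raise NotImplementedError  # e.g. a microVM, container, or hosted interpreter

def agent_loop(task: str, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = call_model(messages)
        if action.startswith("FINAL:"):            # model decides it is done
            return action.removeprefix("FINAL:").strip()
        observation = run_in_sandbox(action)       # the code itself is the action
        messages.append({"role": "assistant", "content": action})
        messages.append({"role": "user", "content": f"Observation:\n{observation}"})
    return "Step limit reached without a final answer."
```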

This shift solves a specific pain point. JSON tool calls compose badly. If you need to fetch three tables, join them on a key, filter by date, and average a column, the agent has to chain four or five separate tool calls, with the model re-reading intermediate JSON each time. A few lines of Python do the same work in one step and one round trip. According to the CodeAct paper (Wang et al., ICML 2024), this pattern produced up to 20% higher task success and roughly 30% fewer steps than JSON-style actions across 17 evaluated LLMs.
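To see the compression, here is roughly what that single code action could look like in pandas; the file and column names are invented for illustration:

```python
# One code action replacing a chain of fetch / join / filter / average tool calls.
# File and column names are invented for illustration.
import pandas as pd

orders = pd.read_csv("orders.csv")
customers = pd.read_csv("customers.csv")
regions = pd.read_csv("regions.csv")

merged = (
    orders
    .merge(customers, on="customer_id")   # join on a key
    .merge(regions, on="region_id")       # second join
)
recent = merged[pd.to_datetime(merged["order_date"]) >= "2024-01-01"]  # filter by date
print(recent["order_total"].mean())       # average a column
```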

The runtime side matters as much as the model side. Because the agent is generating real code, that code has to execute somewhere that cannot harm the host machine. Modern implementations rely on isolated sandboxes — Firecracker microVMs, container-based environments, or vendor-managed interpreters — that boot fast, expose a controlled filesystem, restrict network access, and tear down cleanly between sessions. The pairing of “code as the action format” plus “interpreter that is safe to run untrusted code in” is what makes the whole pattern viable in production.

How It’s Used in Practice

The most common entry point for a non-engineer is a chat interface that quietly runs Python for you. When you upload a CSV to ChatGPT and ask “which region had the largest drop in conversions last quarter?”, a code execution agent is what answers. The model writes a pandas snippet, the sandbox runs it on your file, and you get back a chart and a sentence. According to the OpenAI API Docs, this lives under the Code Interpreter tool on the Responses API. According to the Claude API Docs, Anthropic exposes an equivalent code execution tool that runs Python and Bash in a sandboxed container.

Developers building products on top of these APIs use the same pattern for data analysis features, document parsing, on-the-fly calculations, generating downloadable files, and any task where the cleanest expression of the work is a few lines of code rather than a forced sequence of JSON calls.
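As a rough sketch of the hosted flavor, invoking the Code Interpreter tool through the OpenAI Python SDK looks approximately like this. Treat the model name and tool configuration as illustrative, and check the current docs for the exact shapes, including how to attach uploaded files to the interpreter's container:

```python
# Sketch of calling the hosted Code Interpreter tool via the Responses API.
# Model name and tool configuration are illustrative; file upload into the
# container is omitted for brevity (see the OpenAI API docs for that flow).
from openai import OpenAI

client = OpenAI()
response = client.responses.create(
    model="gpt-4.1",
    tools=[{"type": "code_interpreter", "container": {"type": "auto"}}],
    input="Generate 10,000 samples from a normal distribution and report "
          "their mean and standard deviation.",
)
print(response.output_text)
```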

Pro Tip: Treat the sandbox as a hostile environment that happens to live inside your product. Set per-call timeouts, cap memory, restrict network egress to an explicit allowlist, and never share a sandbox session across two end users. Most production incidents with code execution agents come from skipping one of these four steps.
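A minimal sketch of what those controls can look like with a throwaway container per execution. The image name and limits are placeholders, and a production network allowlist usually means an egress proxy rather than cutting the network entirely:

```python
# Illustrative per-execution isolation using a disposable container.
# Image name and limits are placeholders; real deployments add seccomp or
# AppArmor profiles, user namespaces, and an egress proxy for the allowlist.
import subprocess

def run_in_container(code: str, timeout_s: float = 30.0) -> str:
    cmd = [
        "docker", "run",
        "--rm",               # tear the container down when the run ends
        "--network", "none",  # no outbound network from untrusted code
        "--memory", "512m",   # hard memory cap
        "--pids-limit", "128",
        "--read-only",        # read-only root filesystem
        "python:3.12-slim",   # pinned interpreter image (placeholder)
        "python", "-c", code,
    ]
    result = subprocess.run(
        cmd, capture_output=True, text=True, timeout=timeout_s  # per-call timeout
    )
    return result.stdout + result.stderr
```

A fresh container per call also enforces the fourth rule from the tip above: no sandbox session is ever shared across two end users.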

When to Use / When Not

Use it for:
- Multi-step data manipulation on a user-supplied file
- Chained math, transformations, or library-flavored work (pandas, numpy)
- Prototyping a workflow before you know which discrete tools you actually need

Avoid it for:
- Triggering a state-changing action in a system of record (charge, refund, ticket)
- Strict, schema-bound integrations where you need an audit trail of typed calls
- Single-shot API calls with no chaining and no computation

Common Misconception

Myth: A code execution agent gives the model “shell access” to your servers and is therefore inherently dangerous. Reality: The model never touches your servers. Generated code runs inside an isolated sandbox — a microVM or a locked-down container — that is created for the session and destroyed afterward. The security posture is determined by sandbox configuration (network, filesystem, timeouts, quotas), not by the model itself.

One Sentence to Remember

Code execution agents replace the “model picks a tool” loop with a “model writes a program, sandbox runs it” loop, which compresses multi-step work into single round trips — useful for analysis and computation, less appropriate for high-stakes, schema-bound business actions.

FAQ

Q: Is a code execution agent the same thing as a coding assistant like Cursor or Claude Code? A: No. Coding assistants help a human write code in their editor. A code execution agent writes code so a sandbox can run it as the agent’s own next action, then keeps reasoning over the result.

Q: Why is this safer than letting the model “run commands” directly? A: Because the code never executes on your infrastructure. It runs inside an ephemeral sandbox with restricted filesystem, restricted network, fixed timeouts, and no persistence — destroyed when the session ends.

Q: Do I still need traditional function calling if I use code execution? A: Usually yes. Code execution is great for computation and chained logic. Typed function calls remain the right pattern for high-stakes business actions where you want a strict schema and a clean audit trail.

Expert Takes

The interesting move here is not “let the model write code.” It is changing the action space from a small, discrete menu of tool calls to a Turing-complete language. That single decision is what unlocks composition: loops, conditionals, intermediate variables, library calls. Empirically, that expressiveness translates into shorter trajectories and higher task success — because the agent stops emulating control flow inside a JSON dialect and just writes it.

Specification still wins. A code execution agent without a clear context file is just a faster way to generate the wrong answer. Tell the agent exactly which libraries are available, what the input schema looks like, where outputs go, and what counts as “done.” The sandbox should be boring — fixed Python version, pinned dependencies, no network by default. Determinism comes from the spec, not from the model.

The action layer of agents is consolidating around code, not JSON. The major vendors all ship a hosted interpreter, and open-source frameworks default to code-as-action. For product teams, the strategic read is simple: if your roadmap depends on agents doing real work over user data, the question is no longer whether to use code execution — it is which sandbox provider you trust and what guardrails you wrap around it.

A sandbox is a promise, and promises break. The model writes code that the user never reads, on data the user uploaded in a hurry, in an environment whose exact boundaries the user cannot inspect. When something goes sideways — a leaked file, a runaway loop, an unexpected outbound call — the user has no mental model for what just happened. Who explains it to them, and in language they understand?