Multi-Agent Systems: Supervisor, Debate, and Swarm Patterns

ELI5
A multi-agent system is a group of language model agents that split a job between them. One agent plans and delegates, others execute, and a memory layer keeps the conversation coherent across turns.
When Anthropic rebuilt its Research feature, it stopped trusting a single model with the whole task. A lead Claude Opus 4 instance now spawns parallel Claude Sonnet 4 sub-agents that each chase a different thread of a query. The shape — one planner, many specialists, a shared memory — outperformed single-agent Opus 4 by 90.2% on Anthropic’s internal research evaluation (Anthropic Engineering). The interesting part is not that two heads beat one. It is that the geometry of coordination changes what the system can actually compute.
The illusion of “just add more agents”
There is a tempting mental model: if one agent is smart, ten agents must be smarter. Slot a few personas into a chat room, let them argue, harvest the consensus.
That model is wrong in a useful way.
A multi-agent system is not a panel of pundits. It is a control structure — a graph of who calls whom, who holds state, who breaks ties, and who is allowed to terminate. Change the graph and you change the kinds of failures the system produces. The supervisor, debate, and swarm patterns are not flavors of the same idea; they are different geometries that solve different coordination problems.
How a multi-agent system actually computes
Before mapping the three architectures, we need to be precise about what a multi-agent system is doing under the hood and why developers reach for one in the first place.
What are multi-agent systems in AI?
A multi-agent system in AI is a runtime in which two or more LLM-driven Agent Orchestration loops cooperate to solve a task that any single loop would handle poorly. Each agent has its own role, its own toolset, and often its own model. They communicate through structured messages — handoffs, tool calls, or shared state — rather than free-form conversation.
The standard architectural taxonomy separates centralized supervisors (orchestrator-worker), peer-to-peer or decentralized topologies, hierarchical multi-supervisor trees, debate, and swarm-style handoff patterns. The differences are not cosmetic. A supervisor tree concentrates planning in one node; a swarm pushes routing into the agents themselves; a debate forces agreement through repeated exposure to dissenting drafts.
The reason you reach for any of them is usually mechanical, not philosophical. A single agent has a finite context window, a single tool-calling policy, and a single model with a single set of biases. The moment your task asks the model to plan AND search AND critique AND format, you are asking the same parameters to specialize in opposite directions. Splitting the task across agents lets each one keep a narrow, well-conditioned role.
Not magic. Specialization.
How do multi-agent systems coordinate and delegate tasks between agents?
Coordination in modern frameworks happens through three mechanisms — message passing, tool-style handoffs, and shared memory — usually layered together.
The first mechanism is structured message passing. Frameworks like LangGraph 1.1.0 model the system as a graph: each node is an agent, edges describe legal transitions, and messages flow along the edges with type-safe payloads. The graph is the contract. An agent that wants to escalate cannot improvise — it must take an edge that exists.
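To make the contract concrete, here is a minimal LangGraph sketch. The node bodies are stubbed out; a real system would call a model inside each one:

```python
from typing import Annotated, TypedDict

from langchain_core.messages import AIMessage
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages


class State(TypedDict):
    # shared state: every node reads and appends to one message list
    messages: Annotated[list, add_messages]


def researcher(state: State) -> dict:
    # stub: a real node would run a model with a retrieval toolset here
    return {"messages": [AIMessage(content="findings: ...")]}


def writer(state: State) -> dict:
    # stub: a real node would synthesize the findings into a draft
    return {"messages": [AIMessage(content="draft: ...")]}


builder = StateGraph(State)
builder.add_node("researcher", researcher)
builder.add_node("writer", writer)
builder.add_edge(START, "researcher")
builder.add_edge("researcher", "writer")  # the only legal transition out
builder.add_edge("writer", END)
graph = builder.compile()
```

The researcher cannot skip to END and the writer cannot loop back; the edges are the whole routing policy.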
The second mechanism is the handoff-as-tool. The OpenAI Agents SDK exposes another agent as if it were a function the current agent can call; handoff() accepts tool_name_override, tool_description_override, an on_handoff callback, and an input_filter that lets the receiving agent see only the relevant slice of conversation history (OpenAI Agents SDK Docs). The handoff becomes part of the action space the model already knows how to use.
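In code, the pattern looks roughly like this. The agent names and instructions are illustrative; handoff() and the remove_all_tools filter come from the SDK surface described above:

```python
from agents import Agent, handoff
from agents.extensions import handoff_filters

billing = Agent(
    name="Billing agent",
    instructions="Resolve billing questions.",
)

triage = Agent(
    name="Triage agent",
    instructions="Route the user to the right specialist.",
    handoffs=[
        handoff(
            billing,
            tool_name_override="transfer_to_billing",
            tool_description_override="Send billing questions to the specialist.",
            # strip prior tool-call noise so the receiver sees only the
            # relevant slice of conversation history
            input_filter=handoff_filters.remove_all_tools,
        )
    ],
)
```

To the triage model, transfer_to_billing is just another tool in its action space; delegation and tool use share one decision.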
Microsoft Agent Framework — the v1.0 GA successor that merged AutoGen and Semantic Kernel — exposes the same idea alongside sequential, concurrent, group-chat, and Magentic-One orchestration patterns (Microsoft Learn).
Google ADK, introduced at Google Cloud NEXT 2025, builds the hierarchy directly into the type system: every BaseAgent can declare sub_agents, and delegation walks the parent-child tree (Google’s ADK site).
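A minimal ADK sketch, with illustrative agent names and model ids, shows the parent-child declaration:

```python
from google.adk.agents import LlmAgent

greeter = LlmAgent(
    name="greeter",
    model="gemini-2.0-flash",  # illustrative model id
    description="Greets the user and collects the question.",
    instruction="Greet the user and restate their question clearly.",
)
researcher = LlmAgent(
    name="researcher",
    model="gemini-2.0-flash",
    description="Answers research questions with sources.",
    instruction="Answer the question and cite sources.",
)

# The parent declares its children; delegation walks this tree.
coordinator = LlmAgent(
    name="coordinator",
    model="gemini-2.0-flash",
    instruction="Route each request to the right sub-agent.",
    sub_agents=[greeter, researcher],
)
```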
The third mechanism is shared memory. Without it, each handoff would amount to amnesia, and the system would behave like a relay race where every runner forgets the baton. We will come back to memory in the next section.
What is happening underneath all three mechanisms is the same thing: a single token budget is being divided across multiple inference passes that each maintain their own posterior over what to do next. Each agent samples from a distribution conditioned on its narrow role and recent memory, instead of from one diluted distribution conditioned on everything at once. That is the actual reason coordination beats concatenation.
The anatomy of a coordinated agent
Coordination only works when each agent has the right organs. There are four — orchestrator, workers, memory, and tools — and the most common failure modes come from getting one of them wrong.
What are the core components of a multi-agent system: orchestrator, workers, memory, and tools?
The orchestrator is the agent that owns the plan. It decomposes the user request into sub-tasks, decides which worker handles each one, and decides when the task is finished. In LangGraph, it is typically a node with a routing function or a Supervisor Agent Pattern, and LangChain now recommends building supervisors directly via tool-calling rather than through the dedicated supervisor library for most cases — it gives the developer more control over the context the supervisor sees. In CrewAI 1.14.4, the orchestrator is implicit in the Crew’s hierarchical process, with Agents defined by role, goal, and backstory and Tasks routed sequentially or hierarchically.
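In CrewAI terms, a hierarchical crew looks roughly like this; the roles, goals, and manager model are illustrative:

```python
from crewai import Agent, Task, Crew, Process

researcher = Agent(
    role="Researcher",
    goal="Find primary sources on the topic",
    backstory="A meticulous literature reviewer.",
)
writer = Agent(
    role="Writer",
    goal="Turn findings into a clear brief",
    backstory="A concise technical writer.",
)

brief = Task(
    description="Produce a one-page brief on {topic}",
    expected_output="A one-page markdown brief with sources",
)

crew = Crew(
    agents=[researcher, writer],
    tasks=[brief],
    process=Process.hierarchical,  # the implicit orchestrator
    manager_llm="gpt-4o",          # illustrative manager model
)
```

With process=Process.hierarchical, the manager model plans, delegates to the agents, and validates the result; no explicit supervisor agent appears in the code.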
The workers are specialized agents with narrow roles — a researcher, a coder, a critic, a formatter. They typically have access to a strict subset of tools and a model chosen for the task: faster and cheaper for retrieval, slower and more capable for synthesis. Anthropic’s Research system follows this exact split, with Opus 4 as the lead and Sonnet 4 as the parallel sub-agents (Anthropic Engineering); the Claude Agent SDK exposes the same orchestrator-worker primitives for developers building on Claude.
The memory layer is the connective tissue. The standard taxonomy in Agent Memory Systems distinguishes four kinds: short-term (the working context window), long-term episodic (what happened across past sessions), long-term semantic (facts the agent knows), and procedural (how it does things). Long-term memory is bridged into the short-term window via vector retrieval (Redis Engineering Blog). A swarm or supervisor with no memory layer collapses into a stateless function call; a system with poorly scoped memory leaks irrelevant context into every turn and pays for it in latency and confusion.
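The bridge itself is small. This is a hedged sketch, with a hypothetical embedder and store interface standing in for whatever retrieval stack you use (Redis vector search, for example):

```python
from typing import Callable, Protocol, Sequence


class EpisodeStore(Protocol):
    # hypothetical interface over a vector index
    def search(self, query_vec: Sequence[float], top_k: int) -> list[str]: ...


def build_handoff_context(
    task_summary: str,
    embed: Callable[[str], Sequence[float]],  # hypothetical embedder
    store: EpisodeStore,
    k: int = 5,
) -> list[str]:
    """Pull the k most relevant past episodes into the receiver's context."""
    query_vec = embed(task_summary)
    # Only this retrieved slice enters the short-term window; everything
    # else stays in long-term storage.
    return store.search(query_vec, top_k=k)
```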
The tools are how agents act on the outside world — APIs, code execution, file I/O, other agents. The interesting design choice is which tools each agent can see. Hide the tools you do not want the model to use. Expose handoffs as tools when you want delegation to feel native. The action space is the policy.
When you read a multi-agent framework’s docs, this is the four-organ skeleton you should be looking for. The vocabulary differs — Crew/Agent/Task in CrewAI, Graph/Node/State in LangGraph, Agent/Handoff/Guardrail/Tracing in the OpenAI Agents SDK — but the anatomy is the same.
Three architectures, three geometries of control
The choice of architecture is a choice about where the planning effort lives.
In supervisor or orchestrator-worker systems, control lives at the top. One agent decomposes the problem; workers execute leaves of the plan; results bubble up; the supervisor decides when to stop. This is the shape Anthropic uses for parallel research, and it is the shape LangGraph’s supervisor reference implementation models with create_handoff_tool, including multi-level hierarchies — supervisors of supervisors. It is the safest default when the task decomposes cleanly into independent sub-tasks. It fails when the sub-tasks are not actually independent, because the supervisor cannot see what the workers learned mid-task without explicit feedback edges.
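The handoff tool at the heart of that reference implementation is a small amount of code. This sketch follows its shape: the tool returns a Command that routes control through the parent graph to the named worker:

```python
from typing import Annotated

from langchain_core.tools import tool, InjectedToolCallId
from langgraph.prebuilt import InjectedState
from langgraph.types import Command


def create_handoff_tool(*, agent_name: str):
    name = f"transfer_to_{agent_name}"

    @tool(name)
    def handoff_tool(
        state: Annotated[dict, InjectedState],
        tool_call_id: Annotated[str, InjectedToolCallId],
    ) -> Command:
        # record the transfer in the shared message history
        tool_message = {
            "role": "tool",
            "content": f"Transferred to {agent_name}",
            "name": name,
            "tool_call_id": tool_call_id,
        }
        return Command(
            goto=agent_name,  # which worker runs next
            update={"messages": state["messages"] + [tool_message]},
            graph=Command.PARENT,  # route via the parent graph
        )

    return handoff_tool
```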
In Agent Debate systems, control is symmetric. Multiple agents propose answers, then re-read each other’s drafts and refine over several rounds. The Du et al. multi-agent debate paper showed this pattern improves factuality and reasoning on benchmarks where a single chain-of-thought run falls short. The mechanism is not “the agents convince each other.” It is that exposing each agent to dissenting drafts shifts the conditional distribution over the next token toward answers that survive cross-examination. The cost is real — multiple rounds across multiple agents — and the benefit is task-dependent. Debate earns its keep on hard reasoning and factual claims; on simple lookups, it just multiplies the bill.
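Stripped of any framework, the debate loop is a few lines. ask_model below is a hypothetical helper wrapping a single LLM call:

```python
from typing import Callable


def debate(
    question: str,
    ask_model: Callable[[str], str],  # hypothetical single-call helper
    n_agents: int = 3,
    n_rounds: int = 2,
) -> list[str]:
    # Round 0: independent drafts, one per agent
    drafts = [ask_model(f"Answer concisely: {question}") for _ in range(n_agents)]
    for _ in range(n_rounds):
        revised = []
        for i, own in enumerate(drafts):
            others = "\n\n".join(d for j, d in enumerate(drafts) if j != i)
            # each agent is re-conditioned on its peers' dissenting drafts
            revised.append(ask_model(
                f"Question: {question}\n\n"
                f"Your previous answer:\n{own}\n\n"
                f"Other agents' answers:\n{others}\n\n"
                "Revise your answer, keeping only claims that survive "
                "cross-examination."
            ))
        drafts = revised
    return drafts  # aggregate with a majority vote or a final judge pass
```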
In Swarm Architecture or handoff systems, control is distributed. There is no global supervisor; each agent decides when to hand the conversation to a peer, usually via a handoff tool. The shape is closer to a relay than a hierarchy. The OpenAI Agents SDK is the production successor to OpenAI’s experimental Swarm, whose repository is no longer actively maintained (OpenAI’s Swarm GitHub). The word “swarm” is overloaded: there is OpenAI’s archived library, the general decentralized peer-to-peer pattern from the multi-agent systems literature, and the langgraph-swarm library — they are not the same artifact. Swarms shine when the routing logic is itself task-dependent and you do not want a single planner to be the bottleneck.
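A minimal langgraph-swarm sketch (prompts and model ids are illustrative) shows the peer-to-peer shape:

```python
from langgraph.prebuilt import create_react_agent
from langgraph_swarm import create_handoff_tool, create_swarm

# Each peer carries a handoff tool pointing at the other; there is no
# supervisor node anywhere in this graph.
alice = create_react_agent(
    "openai:gpt-4o",
    [create_handoff_tool(agent_name="Bob")],
    prompt="You are Alice, the planning specialist.",
    name="Alice",
)
bob = create_react_agent(
    "openai:gpt-4o",
    [create_handoff_tool(agent_name="Alice")],
    prompt="You are Bob, the coding specialist.",
    name="Bob",
)

app = create_swarm([alice, bob], default_active_agent="Alice").compile()
```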
The three architectures are not competitors. They are different answers to the question: where should the planning effort live?

What the geometry predicts
Once you see the architecture as a control graph, several practical predictions fall out.
- If your task decomposes into independent sub-tasks, expect the supervisor pattern to dominate on latency, because workers run in parallel.
- If your task is a hard reasoning or factuality problem, expect debate to help — and expect the help to come at multiplied compute.
- If your task has dynamic routing that depends on intermediate results, expect a swarm or handoff-style architecture to outperform a static supervisor, because the planner does not have to predict the routing in advance.
- If your handoffs feel like amnesia, the failure is almost always in the memory layer, not the agents — check that long-term episodic memory is being retrieved into short-term context at handoff time.
Rule of thumb: the shape of the agent graph should mirror the shape of the task — flat tasks deserve flat coordination, tree-shaped tasks deserve trees.
When it breaks: Multi-agent systems multiply both capability and cost. Token usage scales with the number of agents and rounds, debugging spans multiple inference traces, and a misrouted handoff can cascade silently because each agent looks locally correct. Anthropic’s own guidance is to use multi-agent systems for tasks worth the token bill — research, complex coding, long-horizon planning — not for tasks a single well-prompted agent can handle (When to use multi-agent systems).
Framework status notes:
- OpenAI Swarm: Repository not actively maintained since the OpenAI Agents SDK launched in March 2025. Use the Agents SDK for new work.
- AutoGen: In maintenance mode — bug and security fixes only. Microsoft directs users to Microsoft Agent Framework v1.0 for new projects.
- LangGraph supervisor library: LangChain now recommends building the supervisor directly via tool-calling for most cases; the dedicated library is still available but offers less context-engineering control.
The Data Says
A multi-agent system is a control graph, not a chat room. Anthropic’s orchestrator-worker shape outperformed single-agent Opus 4 by 90.2% on its internal research eval, but the gain is conditional on the task actually decomposing. Pick the architecture — supervisor, debate, or swarm — that matches where the planning effort needs to live, and put real engineering into the memory layer or none of it works.