Build Multi-Agent Systems with LangGraph, CrewAI, and OpenAI SDK in 2026

TL;DR
- Pick the framework based on workflow shape: graph for stateful production, role-based crew for fast prototypes, handoff for transfer-of-control flows.
- The topology — supervisor, swarm, or hierarchy — is your spec. Choose it before you choose a framework.
- Multi-agent debate is a pattern you build on top of a framework, not a feature you toggle inside one.
Six agents. Three frameworks. A demo that worked Friday and broke Monday. The bug isn’t in the framework — it’s in the topology you picked before you picked the framework.
Before You Start
You’ll need:
- An AI coding tool (Cursor, Claude Code, or Codex)
- Familiarity with multi-agent systems: separate processes that exchange messages and divide a task
- A clear picture of which agent owns which decision, not just which task
This guide teaches you how to decompose a multi-agent system into a supervisor, specialists, and handoffs before you commit to LangGraph, CrewAI, or the OpenAI Agents SDK.
The Demo That Worked Once
Here is the failure mode I see every week. A team picks a framework on Monday because they liked a tutorial. By Wednesday they have three agents calling each other through a planner. By Friday the demo works. By the next sprint, nobody can explain why agent B sometimes returns control to the planner and sometimes calls agent C directly.
The framework didn’t break. The team never wrote down the topology. They specified roles (“research agent”, “writer agent”) but not boundaries (which agent decides when the task is done, which agent owns the user-facing reply, which agent is allowed to call which other agent). The framework filled in the missing decisions with its defaults — and those defaults differ across agent-orchestration libraries that look superficially similar.
It worked Friday. On Monday, the writer agent started looping back to the researcher because a prompt change shifted the planner’s preference, and the topology was implicit, not specified. That’s the spec gap that costs you a sprint.
Step 1: Map the Decision Boundaries
Before you compare frameworks, decompose the problem into agents and decisions. Frameworks compete on primitives — graphs, crews, handoffs — but those primitives only help if you already know which decisions belong to which agent.
Your system has these parts:
- The supervisor — owns the routing decision. Reads the user request and current state, picks which specialist runs next. Never produces user-facing output directly.
- The specialists — each owns exactly one capability (search, draft, validate, summarize). They produce work, not routing decisions. They never call each other unless your topology explicitly allows it.
- The shared state — what every agent can read (the original task, intermediate artifacts) versus what only the supervisor writes (the routing log, the “is this done” flag).
- The exit condition — who decides the run is finished, and what they hand back to the caller.
This is the supervisor-worker shape, the most common production topology. It dominates precisely because it isolates the routing brain from the doing brains, and that separation is what makes the system debuggable.
There is a second shape — swarm architecture — where agents hand off directly to each other without a central planner. The LangChain team recommends swarm for exploratory and research-mode systems, with graph or hierarchy patterns as the production default (LangGraph Swarm GitHub). Pick swarm only when you cannot define a supervisor’s job description in one sentence.
The Architect’s Rule: If you cannot draw the supervisor → specialist → handoff diagram on a whiteboard in five minutes, the framework will absorb the ambiguity and convert it into runtime confusion.
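One way to force that whiteboard clarity is to write the topology down as data before writing any framework code. A minimal, framework-agnostic sketch (all names here are hypothetical):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    source: str       # who hands off
    target: str       # who receives
    condition: str    # the state value that triggers this route

@dataclass
class Topology:
    supervisor: str
    specialists: list[str]
    edges: list[Edge]
    exit_condition: str   # who decides "done", and on what signal

topology = Topology(
    supervisor="planner",
    specialists=["research", "draft", "validate"],
    edges=[
        Edge("planner", "research", "no notes yet"),
        Edge("planner", "draft", "notes present, no draft"),
        Edge("planner", "validate", "draft present"),
    ],
    exit_condition="planner sets done=True after validate passes",
)
```

If filling in `edges` surfaces a route you cannot justify in one sentence, you have found the ambiguity before the framework did.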
Step 2: Specify the Topology and the Handoffs
Now you turn the diagram into a contract the AI tool can implement. This is where the framework choice starts to matter — because each framework expresses topology in different primitives.
Context checklist (a typed sketch follows the list):
- Topology shape declared — supervisor, hierarchy (supervisor of supervisors), or swarm
- Per-agent input contract — what the agent receives (JSON schema or typed model)
- Per-agent output contract — what the agent returns, including the structured signal that means “I’m done with my piece”
- Per-handoff condition — under what state value the supervisor routes to which specialist
- Per-handoff timeout — how long before a stuck specialist gets cancelled
- Forbidden routes — which agents are not allowed to call which other agents
- Shared memory model — read-only fields, write-only fields, append-only logs
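Here is what the per-agent contracts and route rules might look like as typed models. A hypothetical sketch using Pydantic; the field names are yours to define:

```python
from pydantic import BaseModel

class ResearchInput(BaseModel):
    task: str
    prior_notes: list[str] = []

class ResearchOutput(BaseModel):
    notes: list[str]
    done: bool                  # the structured "I'm done with my piece" signal
    error: str | None = None    # set on failure so the supervisor can reroute

HANDOFF_TIMEOUT_S = 60                       # cancel a stuck specialist after this
FORBIDDEN_ROUTES = {("research", "draft")}   # specialists never call each other
```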
Now match the topology to the framework primitives.
LangGraph models the system as an explicit state machine over a graph — nodes are agents, edges are conditional routes, and state is checkpointed (LangGraph Docs). LangGraph 1.1.10 shipped on April 27, 2026 (LangGraph PyPI). The supervisor pattern is now built directly via tools rather than the dedicated langgraph-supervisor library, which still works but is no longer the recommended default (LangChain Multi-Agent Docs).
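A minimal supervisor-worker graph, with the routing LLM call stubbed out as a plain function. This is a sketch against the LangGraph graph API; the agent names and state fields are illustrative:

```python
from typing import Literal, TypedDict

from langgraph.graph import END, START, StateGraph

class State(TypedDict):
    task: str
    draft: str
    next: str  # routing decision; written only by the supervisor

def supervisor(state: State) -> dict:
    # Real version: an LLM call that returns a structured routing decision.
    return {"next": "writer" if not state["draft"] else "done"}

def writer(state: State) -> dict:
    return {"draft": f"Draft for: {state['task']}"}

def route(state: State) -> Literal["writer", "__end__"]:
    return "writer" if state["next"] == "writer" else END

builder = StateGraph(State)
builder.add_node("supervisor", supervisor)
builder.add_node("writer", writer)
builder.add_edge(START, "supervisor")
builder.add_conditional_edges("supervisor", route)
builder.add_edge("writer", "supervisor")  # workers always report back
app = builder.compile()

result = app.invoke({"task": "agent topologies", "draft": "", "next": ""})
```

The edge from writer back to supervisor is the point: workers report to the router, and the router alone decides what happens next.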
CrewAI models the system as a crew: agents with role descriptions and a task list, plus flows for event-driven structured workflows with explicit state (CrewAI Docs). CrewAI 1.14.4 shipped three days later, on April 30, 2026 (CrewAI GitHub). It is independent of LangChain and supports both MCP and A2A protocols.
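A minimal two-agent crew, as a sketch assuming the core Agent/Task/Crew API; the roles and task text are illustrative:

```python
from crewai import Agent, Crew, Process, Task

researcher = Agent(
    role="Researcher",
    goal="Gather sources for the brief",
    backstory="Finds and summarizes relevant material.",
)
writer = Agent(
    role="Writer",
    goal="Turn research notes into a draft",
    backstory="Writes concise, sourced prose.",
)
research = Task(
    description="Collect three sources on {topic}.",
    expected_output="A bulleted list of sources with one-line summaries.",
    agent=researcher,
)
draft = Task(
    description="Write a 200-word brief from the research notes.",
    expected_output="A 200-word brief citing the sources.",
    agent=writer,
)
crew = Crew(
    agents=[researcher, writer],
    tasks=[research, draft],
    process=Process.sequential,  # the task order is the topology here
)
result = crew.kickoff(inputs={"topic": "agent topologies"})
```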
The OpenAI Agents SDK models the system as a chain of handoffs. Calling transfer_to_<agent> is itself a tool call — the runtime immediately switches execution and transfers conversation state (OpenAI Agents SDK Docs). Version 0.15.3 shipped on May 6, 2026 (openai-agents PyPI). The SDK is provider-agnostic and runs against 100-plus model providers, not just OpenAI.
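A minimal handoff chain, sketched against the openai-agents package; the agent names and instructions are illustrative:

```python
from agents import Agent, Runner

billing = Agent(
    name="Billing agent",
    instructions="Handle billing questions only. Never answer general queries.",
)
triage = Agent(
    name="Triage agent",
    instructions="Route the user to the right specialist.",
    handoffs=[billing],  # exposes a transfer_to_billing_agent tool to the model
)

result = Runner.run_sync(triage, "I was charged twice this month.")
print(result.final_output)
```

Note that a handoff is a one-way transfer of control, not a subroutine call: once triage transfers, billing owns the conversation.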
The Microsoft Agent Framework reached 1.0 General Availability on April 3, 2026 for both .NET and Python (Microsoft Foundry Blog). It supports sequential handoffs, group chat, and the Magentic-One task-oriented planner, with native MCP and A2A. It supersedes the legacy AutoGen experimental package and Semantic Kernel — greenfield code should target Agent Framework, not those older tracks.
The Spec Test: If your topology is a graph with conditional edges and you pick a handoff-first framework, you will simulate edges in prose and your routing logic will leak into agent prompts. Match the topology to the primitive.
Step 3: Build the Supervisor Before the Specialists
In single-agent work, you build the brain first. In multi-agent work, you build the router first and stub the brains. This inverts most developers’ instincts and is the single biggest reason teams ship working systems faster.
Build order:
- Supervisor with stubbed specialists — the supervisor is real, the specialists return canned responses (sketched at the end of this step). Verify the routing graph: does the right specialist get called for the right input? Does the exit condition fire?
- One specialist at a time — replace one stub with a real agent. Validate that the contract holds (it returns the shape the supervisor expects). Move to the next.
- Shared state and agent memory — add memory after the routing graph is stable. Memory amplifies routing bugs; if your routing is wrong, memory makes the bug harder to reproduce.
- Cross-cutting concerns last — guardrails, tracing, human-in-the-loop, retry policies. These belong in the framework’s primitives (LangSmith for LangGraph, built-in tracing for OpenAI Agents SDK) rather than in agent prompts.
For each agent, your context must specify:
- What it receives (typed input)
- What it returns (typed output, including a `done` signal)
- What it must NOT do (forbidden tool calls, forbidden routes)
- How to handle failure (retry once, then route to supervisor with error context)
This order works because routing bugs masquerade as model bugs. If you build the specialists first and they look smart, you will blame them when the system misbehaves — and the actual bug will be in the supervisor’s conditional edge that you never tested in isolation.
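Here is what step one of that build order might look like with no framework at all. A hypothetical sketch: real routing, canned specialists:

```python
# Canned outputs keyed to specialist name; swap one for a real agent at a time.
STUBS = {
    "research": lambda task: {"notes": [f"canned note for {task}"], "done": True},
    "write":    lambda task: {"draft": f"canned draft for {task}", "done": True},
}

def supervisor(task: str, state: dict) -> str:
    """The real routing brain: returns the next specialist, or 'done'."""
    if "notes" not in state:
        return "research"
    if "draft" not in state:
        return "write"
    return "done"

def run(task: str, max_turns: int = 5) -> dict:
    state: dict = {}
    for _ in range(max_turns):            # hard loop bound, per Step 4
        route = supervisor(task, state)
        if route == "done":
            return state
        state.update(STUBS[route](task))
    raise RuntimeError("turn budget exhausted: a routing bug, not a model bug")
```

If this loop misroutes on canned responses, no amount of model quality will save the real system.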
Step 4: Validate the Routes, Not the Outputs
Most agent test suites assert on final answers. They miss the entire class of failures specific to multi-agent systems: wrong-route bugs, infinite-loop bugs, and silent-handoff-state-loss bugs. Validate the topology directly.
Validation checklist (a test sketch follows the list):
- Route assertions per scenario — failure looks like: the right answer was produced, but by the wrong specialist (lucky output, broken topology). Caught only by asserting the routing trace.
- Loop bound — failure looks like: the system answers correctly but takes 14 turns when 3 was the budget. Set a hard turn limit and assert it.
- Handoff state integrity — failure looks like: an agent receives partial context after a handoff and quietly invents the missing fields. The OpenAI Agents SDK’s opt-in `RunConfig.nest_handoff_history` collapses prior transcripts into a `<CONVERSATION HISTORY>` block (OpenAI Agents SDK Docs); if you use it, assert that the receiving agent actually reads it.
- Failure-route coverage — failure looks like: a specialist crash takes down the whole run. Inject a simulated failure and assert that the supervisor’s fallback fires.
- Cost ceiling per run — failure looks like: the system passes correctness tests but burns far more tokens than budgeted. Assert max-tokens-per-run as a regression test.
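Here is what those assertions might look like in practice. A hypothetical pytest sketch; `run_scenario` stands in for whatever harness replays a scenario and records the routing trace:

```python
def test_billing_scenario_routes_and_budgets():
    trace = run_scenario("I was charged twice this month.")
    # Route assertion, not output assertion: the right specialist must answer.
    assert trace.routes == ["supervisor", "billing", "supervisor"]
    assert trace.turns <= 3               # loop bound
    assert trace.total_tokens <= 20_000   # cost ceiling as a regression test

def test_specialist_crash_triggers_fallback():
    trace = run_scenario("I was charged twice.", fail_agent="billing")
    assert "fallback" in trace.routes     # the documented fallback route fired
```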

Common Pitfalls
| What You Did | Why AI Failed | The Fix |
|---|---|---|
| Started with framework choice | Framework primitives shaped the topology you should have chosen yourself | Map decision boundaries first, then pick the framework whose primitives match |
| Single supervisor for many specialists | Routing decisions hit a token-budget ceiling and the supervisor confused itself | Add a second tier of sub-supervisors or split into separate crews |
| No timeout on handoffs | One stuck specialist froze the entire chain silently | Specify a per-handoff timeout and a fallback route in the topology |
| Treated multi-agent debate as a setting | Debate is a pattern, not a framework toggle — single-agent calls happened anyway | Implement debate as an explicit loop on top of the framework, with N agents and K critique rounds (sketched below) |
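That explicit loop is short. A hypothetical sketch, assuming each agent object exposes `answer()` and `critique()` methods; resolution here is a simple majority vote, with a judge agent as the common alternative:

```python
from collections import Counter

def debate(agents, task, rounds=2):
    """N agents answer, see each other's critiques, and revise for K rounds."""
    answers = [agent.answer(task) for agent in agents]
    for _ in range(rounds):
        critiques = [agent.critique(task, answers) for agent in agents]
        answers = [agent.answer(task, critiques=critiques) for agent in agents]
    return Counter(answers).most_common(1)[0][0]  # majority vote
```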
Pro Tip
The framework is downstream of the topology, and the topology is downstream of the decision boundaries. Sketch the topology first, then choose the framework whose primitives match it most cleanly. Every spec you write at the topology layer makes every framework choice downstream cheaper and more reversible. When the framework changes again next year — and it will — your topology diagram migrates. Your code does not.
Frequently Asked Questions
Q: How to build a multi-agent system step by step in 2026?
A: Map decision boundaries first — which agent owns which call. Lock the topology (supervisor, swarm, or hierarchy). Pick the framework whose primitives match. Build the supervisor with stubbed specialists, swap one stub at a time, then add memory and guardrails last. The hardest step is the first: most teams skip decomposition and pay for it in week three.
Q: How to use LangGraph for supervisor-worker agent orchestration?
A: Define the supervisor as a graph node that returns the next worker’s name as a structured tool call, then route via conditional edges. As of LangGraph 1.1.10, the LangChain team recommends building the supervisor directly via tools rather than the langgraph-supervisor library — both still work, but the tools-first approach gives you fewer abstractions to debug at 2 AM.
Q: When should you use CrewAI vs OpenAI Agents SDK vs Microsoft Agent Framework for multi-agent workflows?
A: CrewAI fits role-based crew prototypes where agents collaborate by job description. OpenAI Agents SDK fits handoff-heavy flows where state transfers between agents like phone-tree calls. Microsoft Agent Framework fits enterprise .NET stacks with native A2A and MCP — pick by deployment target and team stack, not by feature checklist.
Your Spec Artifact
By the end of this guide, you should have:
- A topology diagram naming each agent, its decision scope, and its handoff partners
- A constraint list per agent: input contract, output contract, error fallback, per-handoff timeout
- A route validation table that asserts which agent must handle which test scenario, plus loop bounds and cost ceilings
Your Implementation Prompt
Use this prompt in your AI coding tool (Claude Code, Cursor, or Codex) once you have your topology diagram from Step 1. Replace each bracketed placeholder with values from your decomposition. Do not delete sections — empty sections are themselves a signal that you skipped a step.
You are pair-programming a multi-agent system in [framework: LangGraph | CrewAI | OpenAI Agents SDK | Microsoft Agent Framework].
TOPOLOGY (from Step 1):
- Supervisor: [name and one-sentence decision scope]
- Specialist agents: [list each with one-line responsibility]
- Handoff edges: [list each as: from-agent → to-agent, condition]
- Shared state: [what every agent reads; what only the supervisor writes]
- Exit condition: [the state value that signals the run is done]
CONSTRAINTS (from Step 2):
- Tech stack and version: [language + framework version]
- Per-agent input contract: [JSON schema or typed model]
- Per-agent output contract: [same, plus the `done` field shape]
- Error fallback per agent: retry once, then route to supervisor with error context
- Per-handoff timeout: [seconds]
- Forbidden routes: [e.g., specialists never call each other; only supervisor writes to external systems]
- Memory model: [none | conversation-only | shared scratchpad | persistent store]
BUILD ORDER (from Step 3):
1. Implement the supervisor and stub each specialist with canned responses keyed to test inputs.
2. Swap one stub at a time for a real agent; validate routing before output quality.
3. Add shared state and memory only after the routing graph passes its tests.
4. Add guardrails, tracing, and human-in-the-loop last.
VALIDATION (from Step 4):
For each scenario in [list test scenarios], assert:
- The supervisor routed to the expected specialist (route assertion, not output assertion)
- The specialist's output matches its contract
- A simulated specialist failure triggers the documented fallback route
- Total turn count stays under [N]
- Total token cost per run stays under [M]
Generate the framework-specific code for the supervisor and the routing logic only. Stub the specialists. Do not add multi-agent debate, persistent memory, or human-in-the-loop yet — those come after route validation passes.
Ship It
You now know that the topology — supervisor, swarm, or hierarchy — is the spec, and the framework is the runtime that executes it. Decompose first. Lock the contracts. Build the routing before the brains. Validate routes, not just outputs. Every multi-agent project you spec this way gets shorter to debug and easier to migrate when frameworks shift again.
AI-assisted content, human-reviewed. Images AI-generated.