From Chain-of-Thought to Tool Use: Prerequisites and Technical Limits of Agent Planning

[Figure: Layered diagram of an agent loop showing thought, action, and observation stages with branching planning paths]
Before you dive in

This article is a specific deep-dive within our broader topic of Agent Planning and Reasoning.

ELI5

Agent planning is what happens when a language model decomposes a task into steps, calls tools to execute them, and decides what to do next. It rests on three primitives: chain-of-thought, tool use, and the loop that fuses them.

A demo agent runs five steps and lands on the right answer. The same agent runs twenty steps and lands somewhere unrecognizable. Nothing about the model changed — only the chain length. Agent Planning and Reasoning is built on top of three older primitives, and each of those primitives carries a failure mode that compounds the moment you stack one on the next. The interesting question is not what agents can do. It is why they reliably stop working at a length that anyone running them in production has already encountered.

The Three Layers Beneath Every Agent Loop

Think of an agent the way you would think of a circuit with feedback. There is a signal generator (the language model), a set of effectors (the tools it can call), and a control loop that decides which effector fires next based on what just came back. Every named pattern in 2026 — ReAct, Plan-and-Execute, ReWOO, Reflexion, Tree of Thoughts — is a different way of wiring this loop. None of them invents new physics. They reorganize three primitives that already existed.

What do you need to understand before learning agent planning and reasoning?

You need to understand three things in order, because each one only makes sense once the previous one is in your head.

The first primitive is chain-of-thought (CoT). In 2022, Wei et al. showed that prompting a model with a few examples of intermediate reasoning steps unlocks multi-step arithmetic, commonsense, and symbolic tasks that the same model fails at when asked for an answer directly (Chain-of-Thought paper). The mechanism is not “the model thinks.” The mechanism is that the printed intermediate tokens become part of the conditional context for the next token, which biases sampling toward outputs that satisfy intermediate constraints. CoT made multi-step reasoning visible and promptable. Without it, no agent loop has anything to inspect between steps.
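
Stripped to its promptable form, CoT is nothing more than exemplars that show their work. Here is a minimal sketch, assuming a complete() callable that wraps whatever model client you use; the exemplar is illustrative, not copied from the paper:

```python
# Minimal few-shot chain-of-thought prompt in the style of Wei et al. (2022).
# `complete` is a placeholder for whatever LLM client you use; the exemplar
# is illustrative, not taken verbatim from the paper.

FEW_SHOT_COT = """\
Q: Roger has 5 tennis balls. He buys 2 cans with 3 balls each. How many balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11. The answer is 11.

Q: {question}
A:"""

def answer_with_cot(question: str, complete) -> str:
    # The exemplar's intermediate steps sit in the conditioning context,
    # biasing sampling toward outputs that satisfy intermediate constraints.
    return complete(FEW_SHOT_COT.format(question=question))
```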

The second primitive is tool use. Toolformer (Schick et al., 2023) showed that a model could be trained to insert API calls — calculator, search, translation — into its own generated text wherever the call’s result would reduce the perplexity of the surrounding tokens (Toolformer paper). Soon after, OpenAI and Anthropic converged on a contract layer: declare each tool with a JSON Schema, and the model returns a structured argument object the runtime can execute (OpenAI Function-Calling Docs). This is the production form of tool use. CoT lets the model reason about a step. Function calling lets the step touch the world.
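
In practice the contract is a schema plus a thin dispatch layer. The sketch below follows the OpenAI-style shape; exact field names, and whether arguments arrive as a JSON string or an already-parsed object, vary by provider, and get_weather / fetch_weather are hypothetical stand-ins:

```python
# A tool declared as a JSON Schema contract, in the OpenAI-style function-calling
# shape. Field names (and whether arguments arrive as a JSON string or a dict)
# vary by provider; get_weather / fetch_weather are hypothetical stand-ins.
import json

WEATHER_TOOL = {
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}

def fetch_weather(city: str, unit: str = "celsius") -> str:
    # Stub standing in for a real weather API.
    return f"18 degrees {unit} in {city}"

def execute_tool_call(tool_call: dict) -> str:
    # The model returns a structured argument object; the runtime validates,
    # executes, and feeds the result back to the model as an observation.
    args = json.loads(tool_call["arguments"])
    if tool_call["name"] == "get_weather":
        return fetch_weather(args["city"], args.get("unit", "celsius"))
    raise ValueError(f"unknown tool: {tool_call['name']}")
```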

The third primitive is the agent loop itself. ReAct (Yao et al., ICLR 2023) interleaves Thought, Action, and Observation in one trajectory: the model writes a thought, emits an action the runtime executes, receives the observation as a token sequence, and then writes the next thought conditioned on what came back (ReAct paper). On HotpotQA and FEVER it beat vanilla action models; on ALFWorld and WebShop, one-shot ReAct beat imitation and reinforcement-learning baselines trained on more than a hundred thousand tasks. Every modern agent SDK — LangGraph’s create_react_agent, the OpenAI Agents SDK, CrewAI — is a re-skin of this loop. If you understand the loop, you understand the substrate. Everything else is a wiring diagram on top of it.
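
Written out, the loop fits in a dozen lines. This is a sketch, not any SDK's implementation: complete wraps a model call, tools maps action names to Python callables, and the string parsing is a toy stand-in for structured function calling:

```python
# A bare ReAct loop: the model writes a Thought and an Action, the runtime
# executes the Action, and the Observation is appended to the transcript that
# conditions the next step. `complete` wraps a model call; `tools` maps action
# names to Python callables. A sketch, not any SDK's implementation.

def react_loop(task: str, complete, tools: dict, max_steps: int = 8) -> str:
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        step = complete(transcript + "Thought:")        # model emits thought + action
        transcript += "Thought:" + step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:")[-1].strip()
        name, arg = parse_action(step)                  # e.g. 'Action: Search[...]'
        observation = tools[name](arg)                  # the step touches the world
        transcript += f"Observation: {observation}\n"   # result conditions the next thought
    return "stopped: hit max_steps without a final answer"

def parse_action(step: str):
    # Toy parser for 'Action: Name[argument]' lines; production agents use
    # structured function calling instead of string matching.
    line = next(l for l in step.splitlines() if l.strip().startswith("Action:"))
    name, _, rest = line.partition(":")[2].strip().partition("[")
    return name.strip(), rest.rstrip("]")
```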

The reason these three primitives form a stack is that each one solves a problem the previous one created. CoT made reasoning visible but disconnected from the world. Tool use connected reasoning to the world but did not say when to call which tool. ReAct said when to call which tool but did not say how to plan more than one step ahead.

That last gap is where the named planning patterns live.

Four Patterns, Four Bets on the Same Loop

Once a community has a working agent loop, the next question is always the same: what is the right amount of structure to impose on it? Pure ReAct interleaves thought and action one step at a time, which is responsive but expensive in tokens and prone to drifting. The four canonical 2026 patterns each pick a different point on the structure-versus-flexibility curve.

Plan-and-Solve (Wang et al., ACL 2023) is the upfront-decomposition bet. A planner module reads the task and writes out a sequenced plan; an executor then runs the steps (Plan-and-Solve paper). It was originally designed to address three failure modes of zero-shot CoT: calculation errors, missing-step errors, and semantic misunderstanding. LangChain ships this pattern as the Plan-and-Execute tutorial — the names are used interchangeably, but the paper is Plan-and-Solve and LangGraph’s tutorial is Plan-and-Execute. It is the right pattern when the steps are knowable in advance and the world will not surprise the plan.
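
The shape of the pattern is two roles and no feedback from execution back into the plan. A minimal sketch, again assuming a placeholder complete() callable rather than any particular framework:

```python
# Plan-and-Solve / Plan-and-Execute split: one call writes the whole plan,
# an executor works through the steps, and a final call synthesizes the answer.
# `complete` is a placeholder model callable; nothing here replans on surprises.

def plan_and_execute(task: str, complete) -> str:
    plan = complete(f"Write a short numbered plan to accomplish this task: {task}")
    steps = [line for line in plan.splitlines() if line.strip()]
    results = []
    for step in steps:
        # Each step sees the task, the fixed plan, and prior results, but the
        # plan itself never changes mid-run.
        context = f"Task: {task}\nPlan:\n{plan}\nCompleted so far: {results}\n"
        results.append(complete(context + f"Carry out this step: {step}"))
    return complete(f"Task: {task}\nStep results: {results}\nFinal answer:")
```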

ReWOO — Reasoning Without Observation — is the token-efficiency bet (Xu et al., 2023). It splits the agent into Planner, Worker, and Solver, and crucially generates the entire plan before any tool calls happen, so the long system prompt is not repeated on every step (ReWOO paper). The original 2023 evaluation reports five-times-better token efficiency and a four-percent accuracy gain on HotpotQA versus ReAct, with the reasoning offloaded from a 175B model to a 7B one — though those gains come precisely from eliminating prompt repetition, and modern function-calling APIs with shorter system prompts narrow the gap. ReWOO’s trade-off is structural: a plan that cannot see observations cannot adapt to surprising tool outputs.
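
The mechanism that buys the savings is placeholder variables: the plan references evidence that does not exist yet, so no tool output ever re-enters a planning prompt. A sketch under the same placeholder assumptions as above, with an illustrative '#E1 = ToolName[input]' plan format:

```python
# ReWOO-style split: the Planner emits every tool call up front with evidence
# placeholders (#E1, #E2, ...), Workers fill them in, and the Solver reasons
# over the compact evidence rather than a long trajectory. The plan format is
# illustrative; `complete` and `tools` are placeholders as before.
import re

def rewoo(task: str, complete, tools: dict) -> str:
    plan = complete(
        f"Task: {task}\n"
        "Write a plan as lines of the form '#E1 = ToolName[input]'. "
        "Later lines may reference earlier evidence as #E1, #E2, ..."
    )
    evidence = {}
    for var, tool, arg in re.findall(r"(#E\d+)\s*=\s*(\w+)\[(.*?)\]", plan):
        for name, value in evidence.items():
            arg = arg.replace(name, value)      # substitute earlier evidence
        evidence[var] = tools[tool](arg)        # Worker executes one call
    # Solver sees only plan + evidence, so the long system prompt is paid once.
    return complete(f"Task: {task}\nPlan:\n{plan}\nEvidence: {evidence}\nAnswer:")
```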

Reflexion (Shinn et al., NeurIPS 2023) is the self-critique bet. The agent attempts a task, an evaluator gives binary or scalar feedback, a reflector module writes a verbal critique, and that critique is stored in Agent Memory Systems as episodic memory before the agent retries (Reflexion paper). No weight updates — Shinn et al. called it “verbal reinforcement learning.” The original paper demonstrated 91 percent pass@1 on HumanEval against a GPT-4-class baseline of 80 percent; current frontier reasoners exceed those numbers without Reflexion at all. The pattern still matters; the specific 2023 percentages are a snapshot of one model.
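
The loop is worth seeing because the "memory" is just strings fed back into context. A sketch assuming a placeholder complete() callable and an external check() such as a unit-test runner:

```python
# Reflexion loop: attempt, check, write a verbal critique, retry with the
# critique in context. No weights change; the learning lives in the memory
# strings. `complete` is a placeholder model callable and `check` is an
# external evaluator such as a unit-test runner.

def reflexion(task: str, complete, check, max_trials: int = 3) -> str:
    memory = []                                   # episodic memory of self-critiques
    attempt = ""
    for _ in range(max_trials):
        context = f"Task: {task}\nLessons from earlier attempts: {memory}\n"
        attempt = complete(context + "Attempt a solution:")
        if check(attempt):                        # binary feedback from the evaluator
            return attempt
        critique = complete(
            f"Task: {task}\nFailed attempt: {attempt}\n"
            "In two sentences, explain what went wrong and what to try differently:"
        )
        memory.append(critique)                   # fed back into the next attempt
    return attempt
```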

Tree of Thoughts (Yao et al., NeurIPS 2023) is the deliberative-branching bet. Instead of a single chain, the model generates several candidate “thoughts” per step, scores them itself, and explores with breadth-first or depth-first search and backtracking (Tree of Thoughts paper). On the Game of 24, GPT-4 with CoT solved four percent of problems; GPT-4 with Tree of Thoughts solved 74 percent. The cost is brutal: branching factor times depth blows up the token bill, and newer reasoning models like o1 and DeepSeek-R1 already do something like internal tree search at inference time, which makes explicit ToT prompting less of a lift on them.
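
Mechanically, ToT is a beam search in which the model plays both move generator and evaluator. A breadth-first sketch with hypothetical propose() and score() wrappers around model calls; the paper's depth-first variant and backtracking are omitted:

```python
# Tree of Thoughts as a small breadth-first beam search: propose several
# candidate thoughts per partial chain, let the model score them, keep the
# best few, repeat. `propose` and `score` are hypothetical wrappers around
# model calls; DFS and backtracking are omitted for brevity.

def tree_of_thoughts(task: str, propose, score,
                     depth: int = 3, branch: int = 4, beam: int = 2) -> str:
    frontier = [""]                                      # partial thought chains
    for _ in range(depth):
        candidates = []
        for chain in frontier:
            for thought in propose(task, chain, n=branch):
                candidates.append(chain + "\n" + thought)
        # Self-evaluation prunes the tree. Note the bill: every level produces
        # up to beam * branch candidates, each needing generation plus scoring.
        candidates.sort(key=lambda c: score(task, c), reverse=True)
        frontier = candidates[:beam]
    return frontier[0]
```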

A useful way to read these four is as a decision matrix. If your steps are knowable, choose Plan-and-Execute. If your tool overhead dominates, ReWOO. If success is verifiable and worth retrying, Reflexion. If the search space is wide and the answer is checkable, Tree of Thoughts. None of them is universally best. Each pattern fixes one limit and creates another. This is also why Anthropic’s “Building Effective Agents” guidance recommends the simplest design that works — most reliable production agentic systems are workflows on predefined paths, not autonomous agents (Anthropic Engineering). The same logic applies inside Multi Agent Systems, where coordination overhead amplifies whichever ceiling each sub-agent already has.

[Figure: Side-by-side comparison of ReAct, Plan-and-Execute, ReWOO, Reflexion, and Tree of Thoughts showing their distinct control flows over a Thought-Action-Observation loop]
Four named planning patterns, each a different bet on how much structure to impose on the underlying ReAct loop.

Where Each Pattern’s Ceiling Lives

Once you see the patterns as bets on the same loop, the next question is mechanical: where does the loop break? The answer is four physics, and every named pattern hits at least one of them.

What are the technical limitations of ReAct, Plan-and-Execute, and Reflexion patterns?

The first physics is error compounding. If a single step succeeds with probability p, a chain of n independent steps succeeds with probability p^n. At 0.95 per step, a five-step chain runs at roughly 77 percent, a ten-step chain at 60 percent, and a twenty-step chain at 36 percent (Reliability compounding, MindStudio). This is illustrative arithmetic, not a measurement of any specific system — but every Thought→Action→Observation cycle multiplies again, which is why aggressive plans almost always degrade faster than their authors expect. ReAct multiplies the cost most directly. Plan-and-Execute amortizes by reducing the number of LLM calls per step. ReWOO does the same more aggressively. Reflexion adds a layer that can recover from failure but costs another LLM call to do it.
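
The arithmetic is three lines, which is part of why it is so easy to underestimate; the independence assumption is the illustrative simplification, not a measured property:

```python
# The compounding arithmetic from the paragraph above: per-step success p,
# chain length n, whole-chain success p**n, assuming independent steps.

def chain_success(p: float, n: int) -> float:
    return p ** n

for n in (5, 10, 20):
    print(n, round(chain_success(0.95, n), 2))   # -> 0.77, 0.6, 0.36
```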

The second physics is long-horizon collapse. Recent context-folding research (2025) reports that even the strongest models’ accuracy approaches zero past about 120 sequential steps; on harder variants, performance collapses inside 15 steps. Independently, Chroma’s Context Rot study finds that performance degrades with input length even when retrieval is 100 percent perfect — the length itself, not the noise inside it, hurts the model (Context Rot, Chroma). Plan-and-Execute and ReWOO both try to flatten this by replacing a long trajectory with a short plan plus short executions. They do not abolish the effect.

The third physics is the CoT faithfulness gap. Multiple 2025 studies suggest that the printed chain is often not a faithful representation of the model’s internal reasoning. A study on DeepSeek-R1 reports that the model acknowledges a strong harmful hint in 94.6 percent of cases but reports under two percent of the helpful hints it demonstrably uses (DeepSeek-R1 faithfulness study). The “Mirages of Logic” survey enumerates four hallucination types in CoT — premise, operation, logic, and conclusion errors (Mirages of Logic). This is an active research debate, not settled science. The practical upshot is that the printed chain is evidence about the answer, not a transcript of how the answer was reached.

The fourth physics is the reliability versus capability gap. The τ-bench paper introduced pass^k — does the agent solve the same task correctly k times in a row — and the original paper reported that even state-of-the-art function-calling agents passed under 50 percent of retail tasks single-shot, and under 25 percent on pass^8 (τ-bench paper). These are 2024 model snapshots, and Sierra has since released τ²-bench and τ³-bench; the headline gap remains, but the specific percentages should be read as a 2024 baseline, not as today’s number. The companion signal is GAIA: at launch in November 2023, humans scored 92 percent and GPT-4 with plugins 15 percent (GAIA paper). By 2026, top agents close that gap considerably — Claude Sonnet 4.5 leads HAL GAIA at 74.6 percent — but “easy for humans, hard for agents” remains the framing.
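
If you log repeated trials per task, pass^k is cheap to estimate. The sketch below uses the combinatorial estimator analogous to unbiased pass@k; treat the exact bookkeeping as an assumption and check it against the τ-bench paper before reporting numbers:

```python
# pass^k: a task counts only if the agent solves it k times in a row, averaged
# over tasks. Given n logged trials per task with c successes, the estimator
# below (assumed here by analogy with unbiased pass@k) uses C(c, k) / C(n, k).
from math import comb

def pass_hat_k(trials: dict, k: int) -> float:
    per_task = []
    for outcomes in trials.values():             # outcomes: list of True/False trials
        n, c = len(outcomes), sum(outcomes)
        per_task.append(comb(c, k) / comb(n, k) if n >= k else 0.0)
    return sum(per_task) / len(per_task)

# Illustrative data: mostly reliable on "refund", flaky on "exchange".
print(pass_hat_k({"refund": [True] * 7 + [False], "exchange": [True, False] * 4}, k=4))
```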

A 2025 failure-mode taxonomy (arXiv 2509.25370) finds that memory and reflection errors are the most common sources of error propagation, and that they typically arise in early or mid-trajectory steps and become hard to reverse once they begin (Where LLM Agents Fail). That finding is what supports the engineering rule below.

Rule of thumb: Cap any agent chain at fewer than five sequential steps without a verifier, and choose the planning pattern by which physics you are trying to escape — token cost (ReWOO), unknown step structure (ReAct), recoverable failure (Reflexion), or wide search (Tree of Thoughts).

When it breaks: Agent loops break when long-horizon collapse and error compounding meet a task whose verifier is missing or weak. Past roughly fifteen steps on hard tasks and roughly 120 steps in general, accuracy approaches zero — and because failures cascade through memory, no single pattern recovers gracefully without an external check that decides whether the trajectory is still on track.

Security & compatibility notes:

  • LangChain Core (CVE-2025-68664): Serialization injection in load()/loads() defaults affects every LangGraph ReAct, Plan-and-Execute, and Reflexion stack. Patched in 1.2.5 / 0.3.81 with breaking changes — review and pin before upgrading.
  • AutoGen / Semantic Kernel: In maintenance mode as of Microsoft Agent Framework 1.0 GA (April 3, 2026). Older Plan-and-Execute and Reflexion tutorials still run but get no new features.
  • SWE-bench Verified: Training contamination confirmed in early 2026; OpenAI stopped reporting on it. Treat any 2026+ Verified score above 85 percent with caution and prefer SWE-bench Pro for comparisons.

The Data Says

Agent planning is not one capability. It is a stack of three primitives — chain-of-thought, tool use, the ReAct loop — and a vocabulary of four patterns that trade structure for flexibility. The patterns do not abolish the four physics that govern the loop. They reroute around one of them while exposing another. Choose the pattern by the failure mode you are trying to escape, not by the demo it produced.

AI-assisted content, human-reviewed. Images AI-generated.