How to Build Retry, Fallback, and Self-Correction in AI Agents (2026)

TL;DR
- Treat every LLM call as a network call that can fail — design the retry contract before you write the first tool.
- Persistence is a separate concern from retries. Pick a checkpointer for state, a workflow engine for orchestration, and a validator for output shape.
- Self-correction is just a retry with a better error message — let the validator tell the model exactly what was wrong.
The agent worked all week. Monday morning the orchestrator wakes up, the upstream starts returning rate-limit errors, and the whole graph dies on step seven of twelve. Nobody saved the state. Nobody specified what “retry” meant. The pager goes off. This is the default outcome when Agent Error Handling And Recovery is bolted on at the end instead of designed in.
Before You Start
You’ll need:
- A Python agent stack — LangGraph, Pydantic AI, or the OpenAI Agents SDK with Temporal
- Familiarity with Agent Evaluation And Testing so you can prove recovery actually works
- A clear picture of which failures are transient (retry), which are structural (fallback), and which need a human
This guide teaches you how to decompose agent failure into four independent contracts — retry, validation, persistence, escalation — so each layer can be specified, tested, and replaced on its own.
Your Agent Won’t Survive a Bad Friday
You ship the agent. It demos beautifully. Then a real Friday hits — the upstream API returns 503 for ninety seconds, your model provider throttles you mid-call, one tool returns malformed JSON. The agent doesn’t recover. It crashes, or worse, it loops forever burning tokens on the same broken state.
Or it works all Friday, and on Monday the run dies at step seven because the checkpoint was in memory and the pod restarted.
That is the failure pattern this guide fixes. The model is not broken. The orchestration is unspecified.
Step 1: Map the Failure Domains
Before you pick a library, name the failures. Agents fail in four distinct ways, and each one needs a different fix. Mixing them up is why “add retries” never solves the problem.
Your system has these failure domains:
- Transient transport failures — rate limits, timeouts, 5xx errors. These resolve themselves. Retry with backoff.
- Output validation failures — the model returned text where you needed JSON, or hallucinated a field. Re-prompt with the error.
- State loss failures — the process dies, the pod restarts, the workflow times out. Resume from a checkpoint.
- Structural failures — the API contract changed, the tool is gone, the task is genuinely impossible. Escalate to a human.
The Architect’s Rule: If you can’t tell which domain a failure belongs to from the log line, your recovery code can’t either.
Each domain has its own contract. Transient failures need a retry budget. Validation failures need a schema and a re-prompt message. State loss needs a persistence layer. Structural failures need an escalation path — usually Human In The Loop For Agents. Build all four. Skip any one and the gap shows up at the worst possible time.
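Here is what that map can look like in code: a minimal sketch with placeholder exception classes. Substitute the typed exceptions your own stack actually raises.

```python
# Minimal sketch of the four-domain taxonomy. The exception classes below are
# illustrative placeholders, not a real framework's hierarchy.
from enum import Enum


class Domain(Enum):
    TRANSIENT = "retry with backoff"
    VALIDATION = "re-prompt with the error"
    STATE_LOSS = "resume from checkpoint"
    STRUCTURAL = "escalate to a human"


DOMAIN_MAP: dict[type[BaseException], Domain] = {
    TimeoutError: Domain.TRANSIENT,
    ConnectionError: Domain.TRANSIENT,  # stands in for transport-level 5xx wrappers
    ValueError: Domain.VALIDATION,      # stands in for schema parse failures
}


def classify(exc: BaseException) -> Domain:
    for exc_type, domain in DOMAIN_MAP.items():
        if isinstance(exc, exc_type):
            return domain
    # Unknown failures escalate by default: never retry what you can't name.
    return Domain.STRUCTURAL
```

If `classify` can name the domain from the exception type alone, your logs can too, and the Architect’s Rule above is satisfied.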
The industry baseline for transient LLM failures sits around one to five percent of calls (NeuralWired, ballpark only). At even one percent per call, a ten-step agent run without retries hits at least one failure in roughly one run out of ten; at five percent it is closer to two runs in five. Add a noisy upstream and the number gets worse fast.
Step 2: Specify Your Retry Contract
A retry policy is a spec, not a flag. Before you turn one on, write down what it actually means.
Retry contract checklist:
- Which exception classes count as transient — typed, not stringly matched
- Maximum retry count per call — not unlimited
- Backoff curve — exponential with jitter, starting around 100ms, capped at a ceiling
- Idempotency — every retried operation must produce the same outcome the second time
- Circuit breaker state — Closed, Open, Half-Open, with a clear threshold for tripping
- Fallback path — what happens after the budget runs out
Exponential backoff with jitter is the default pattern: double the wait each retry, add a small random offset so a fleet of agents doesn’t thunder back at the upstream the moment it recovers (Fastio). Pair it with a circuit breaker so a sustained outage trips the whole call site instead of hammering it for an hour.
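A minimal sketch of both patterns, with placeholder exception classes, thresholds, and timings you should tune per call site:

```python
import random
import time

RETRYABLE = (TimeoutError, ConnectionError)  # typed, never string-matched


def call_with_backoff(fn, *, max_attempts=5, base_ms=100, cap_ms=10_000):
    """Exponential backoff with full jitter; non-transient errors re-raise at once."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RETRYABLE:
            if attempt == max_attempts - 1:
                raise  # budget spent, hand off to the fallback path
            delay_ms = min(cap_ms, base_ms * (2 ** attempt))
            time.sleep(random.uniform(0, delay_ms) / 1000)  # jitter de-syncs the fleet


class CircuitBreaker:
    """Closed -> Open after `threshold` failures; Half-Open after `cooldown` seconds."""

    def __init__(self, threshold=5, cooldown=30.0):
        self.threshold, self.cooldown = threshold, cooldown
        self.failures, self.opened_at = 0, None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open, failing fast")
            self.opened_at = None  # Half-Open: let one probe call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()  # trip (or re-trip) the breaker
            raise
        self.failures = 0  # success closes the circuit
        return result
```

Note that the breaker wraps the backoff loop, not the other way around: a sustained outage should fail fast at the call site instead of burning five slow retries per request.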
For output validation — the second domain — the cleanest primitive in 2026 is the Pydantic AI output validator. Decorate an async function with `@agent.output_validator` and it receives the parsed output plus a `RunContext` (Pydantic AI Docs). If the output is wrong, raise `ModelRetry(f'Invalid: {e}')` and the LLM is re-prompted with that error message. Pydantic AI 1.88.0 (released April 29, 2026, Pydantic AI Changelog) treats this as a first-class loop — `output_retries` defaults to one, settable per agent, per run, or per tool.
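A minimal sketch of that loop. The `Invoice` model and the model name are placeholders; `output_validator`, `ModelRetry`, and `output_retries` are the Pydantic AI primitives described above.

```python
from pydantic import BaseModel
from pydantic_ai import Agent, ModelRetry, RunContext


class Invoice(BaseModel):
    total: float
    currency: str


agent = Agent('openai:gpt-4o', output_type=Invoice, output_retries=2)


@agent.output_validator
async def check_totals(ctx: RunContext[None], output: Invoice) -> Invoice:
    if output.total <= 0:
        # The message becomes the re-prompt: tell the model exactly what was wrong
        raise ModelRetry(f'total must be positive, got {output.total}')
    return output
```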
The Spec Test: If your output validator can’t tell the model why it failed, you don’t have self-correction. You have a retry counter dressed up as one.
That is the difference between “the call failed, try again” and an Agent Guardrails layer that turns each failure into a corrective signal.
Step 3: Layer Persistence and Workflow Durability
Retries handle the call. Persistence handles the run. They are separate concerns and they belong in separate libraries.
Build order:
- Output validators first — Pydantic AI’s `@agent.output_validator` with a tight retry budget. This is the cheapest layer to add and it catches the most common failure class in isolation.
- In-graph persistence next — LangGraph checkpointers save state at every superstep so a crash mid-graph doesn’t replay successful sibling nodes. Pick the saver that matches your storage: `InMemorySaver` for tests, `SqliteSaver`/`AsyncSqliteSaver` for local, `PostgresSaver`/`AsyncPostgresSaver` or `CosmosDBSaver` for production, and AWS’s maintained `DynamoDBSaver` if you live in that stack (LangChain Docs).
- Workflow durability last — Temporal wraps the whole agent in a workflow process that survives infrastructure failure. The OpenAI Agents SDK integration went GA in the Python SDK on March 23, 2026 (Temporal Blog), and Pydantic AI ships a `TemporalAgent` wrapper plus a `PydanticAIWorkflow` base class for the same pattern.
For each layer, your spec must answer:
- What state must be durable across a process restart
- What inputs map to a stable key — for LangGraph that is `thread_id` inside the `configurable` dict; without it, the checkpointer cannot save or resume (LangChain Docs)
- What gets re-executed on resume versus replayed from history — LangGraph’s “pending writes” semantics make successful sibling-node outputs durable so they don’t re-run
- What must NOT happen on retry — non-idempotent side effects, double-charging, duplicate emails
The two layers complement each other. Checkpointers give you replayable graph state. Temporal gives you a workflow that survives the host. If your run is short and lives inside one process, LangGraph alone is enough. If the run spans hours, calls external services, or has to survive a deploy, wrap it in a Temporal workflow and expose your activities as tools through activity_as_tool, which auto-generates the OpenAI-compatible tool schema from the activity signature (Temporal Docs).
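A minimal sketch of the `thread_id` contract from the list above, using the in-memory saver (swap in a production saver for real durability). The toy `State` and node are placeholders; the compile-and-invoke pattern is LangGraph’s own.

```python
from typing import TypedDict

from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import InMemorySaver


class State(TypedDict):
    step: int


def work(state: State) -> State:
    return {"step": state["step"] + 1}


builder = StateGraph(State)
builder.add_node("work", work)
builder.add_edge(START, "work")
builder.add_edge("work", END)

# InMemorySaver is for tests only; use PostgresSaver or similar in production
graph = builder.compile(checkpointer=InMemorySaver())

# Without thread_id in configurable, the checkpointer is a silent no-op
config = {"configurable": {"thread_id": "run-42"}}
graph.invoke({"step": 0}, config)
# After a crash, invoking again with the same thread_id resumes from the checkpoint
```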
There is a real industry debate here — Diagrid argues that checkpointers alone are not durable execution because they recover state but not the in-flight call. That is a vendor viewpoint, not consensus, but it points at the right question: what happens to the half-finished tool call when the pod dies? Decide before you ship.
Security & compatibility notes:
- `langgraph-prebuilt` 1.0.2: introduced a `runtime` parameter without proper version constraints and broke custom prebuilt implementations. Pin to a known-good version and review the tracking issue before upgrading.
- Pydantic AI `NativeOutput` retries: currently include the full `ValidationError` input on retry, which can balloon token usage on large outputs. Open issue, not yet fixed — keep validation messages compact until it lands.
- API rename — `result` → `output`: Pydantic AI deprecated the `result_*` attribute names in favor of `output_*`. Aliases still work but emit warnings — use the new names in new code.
- Temporal & LangGraph versions: no specific minor-version pinning is verifiable from docs alone. Track “current stable” and lock in CI; do not name a version in your spec.
Step 4: Verify Recovery, Not Just Success
Most teams test the happy path and ship. That is how you discover at 3 AM that the retry decorator was applied to the wrong function. Recovery is its own surface and it needs its own tests.
Validation checklist:
- Kill-the-pod test — start a long run, terminate the worker mid-superstep, restart it. Failure looks like: the run restarts from scratch instead of resuming, or it replays a successful tool call that had side effects.
- 503-storm test — inject a sequence of upstream 503s shorter than the retry budget. Failure looks like: the agent gives up before the budget is spent, or it retries on a non-retryable error class.
- Malformed-output test — force the model to return invalid JSON once, then succeed. Failure looks like: the validator catches it but the run fails anyway, meaning `output_retries` is set to zero.
- Circuit-breaker trip — sustain a real outage past the breaker threshold. Failure looks like: the breaker never opens, or it opens and never closes after recovery.
- Escalation test — feed the agent a genuinely impossible task. Failure looks like: it loops on the retry budget instead of escalating to a human.
Pair these tests with Agent Observability so you can actually see which contract fired on which call. A retry counter with no tracing is a black box. A retry counter wired to traces tells you whether your spec is matching reality.
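As a concrete starting point, here is a sketch of the 503-storm test from the checklist above. It assumes the `call_with_backoff` sketch from Step 2 lives in a local `retry_utils` module (a hypothetical path).

```python
import pytest

from retry_utils import call_with_backoff  # the Step 2 sketch, hypothetical module


class Flaky:
    """Raises a transient error `failures` times, then succeeds."""

    def __init__(self, failures: int):
        self.failures = failures
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("503 from upstream")
        return "ok"


def test_storm_shorter_than_budget():
    fn = Flaky(failures=3)
    assert call_with_backoff(fn, max_attempts=5, base_ms=1) == "ok"
    assert fn.calls == 4  # budget honored, not abandoned early


def test_non_retryable_is_not_retried():
    def structural():
        raise ValueError("validation failure, not transport")

    with pytest.raises(ValueError):
        call_with_backoff(structural, max_attempts=5, base_ms=1)
```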

Common Pitfalls
| What You Did | Why AI Failed | The Fix |
|---|---|---|
| Wrapped the whole agent in one try/except and called it “retries” | No retry budget, no backoff, no exception filtering — non-transient errors retried forever | Specify exception classes, max attempts, and exponential backoff per call site |
| Used `InMemorySaver` in production | State lost on every pod restart; “durable” runs were not durable | Move to `PostgresSaver`, `DynamoDBSaver`, or a Temporal workflow for real persistence |
| Set `output_retries=0` because retries felt risky | Validator catches malformed output, run fails immediately — no self-correction | Use the default `output_retries` budget and let `ModelRetry` re-prompt with the error |
| Mixed transient and structural failures | Real outages tripped the breaker; impossible tasks burned the retry budget | Separate exception hierarchies — retry transient, escalate structural |
| Skipped the `thread_id` in `configurable` | LangGraph silently ran without persistence; checkpointer was a no-op | Always pass `thread_id`; assert it is set in your graph entrypoint |
Pro Tip
Every retry contract has a partner: the idempotency contract. Before you add a retry anywhere, ask one question — if this call runs twice, does anything bad happen? If the answer is yes, you do not have a retry problem. You have a side-effect problem. Make the operation idempotent first — natural keys, deduplication tokens, conditional writes — and only then add the retry on top. This single discipline prevents the most expensive class of agent bug: the duplicate-action incident that no log will explain.
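A minimal sketch of that discipline, with a hypothetical `gateway` payment client and the order ID doubling as the natural dedup key:

```python
# In production the dedup record would be a conditional write to a durable store,
# not an in-process dict; this only illustrates the shape of the contract.
receipts: dict[str, str] = {}


def charge_once(gateway, order_id: str, amount_cents: int) -> str:
    """Safe to retry: a second run with the same order_id returns the first receipt."""
    if order_id in receipts:
        return receipts[order_id]  # retry path: same outcome, no double charge
    receipt = gateway.charge(amount_cents, idempotency_key=order_id)
    receipts[order_id] = receipt
    return receipt
```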
Frequently Asked Questions
Q: How to implement retry with backoff and self-correction loops in an AI agent?
A: Treat them as two layers. Wrap transport calls in exponential backoff with jitter, filtered by typed exception classes. For self-correction, use a Pydantic AI @agent.output_validator that raises ModelRetry(error_message) so the model re-runs with the exact failure reason.
Q: How to use LangGraph checkpointers and Temporal for durable agent execution?
A: Use them in layers, not as alternatives. LangGraph checkpointers (Postgres, DynamoDB) persist graph state between supersteps so crashes resume cleanly. Temporal wraps the whole agent in a workflow that survives infrastructure failure. Always set thread_id or the checkpointer silently no-ops.
Q: How to use Pydantic AI validators as a pre-execution shield for agent outputs?
A: Define a Pydantic model for the output and decorate an async function with @agent.output_validator. The validator receives the parsed output plus a RunContext and can hit databases or check business rules. Raise ModelRetry for recoverable errors, plain exceptions for structural ones.
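A sketch of that pattern, with hypothetical `Database` and `OrderPlan` types standing in for your own; the `deps_type`/`RunContext` mechanism is Pydantic AI’s own.

```python
from dataclasses import dataclass

from pydantic import BaseModel
from pydantic_ai import Agent, ModelRetry, RunContext


@dataclass
class Database:
    valid_skus: set[str]


class OrderPlan(BaseModel):
    sku: str
    quantity: int


agent = Agent('openai:gpt-4o', deps_type=Database, output_type=OrderPlan)


@agent.output_validator
async def sku_exists(ctx: RunContext[Database], output: OrderPlan) -> OrderPlan:
    if output.sku not in ctx.deps.valid_skus:
        # Recoverable business-rule failure: the model gets the reason and retries
        raise ModelRetry(f'Unknown SKU {output.sku!r}; pick one that exists')
    return output
```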
Your Spec Artifact
By the end of this guide, you should have:
- A failure-domain map naming which exceptions in your code are transient, which are validation, which are state loss, and which are structural
- A retry contract per call site with backoff curve, budget, idempotency strategy, and circuit-breaker threshold
- A validation matrix listing the recovery tests above with pass criteria for each
Your Implementation Prompt
Drop this into Claude Code, Cursor, or Codex once you have your failure-domain map. The prompt mirrors the four steps and asks the AI to fill in the specification, not write the framework code for you.
You are designing the error-handling layer for a production AI agent.
Stack: [your agent framework — LangGraph / Pydantic AI / OpenAI Agents SDK]
Persistence: [your checkpointer — Postgres / DynamoDB / Cosmos / Temporal workflow]
Observability: [your tracing layer — Logfire / OpenTelemetry / Langfuse]
Step 1 — Failure domains
Classify each of the following exceptions into one domain (transient,
validation, state loss, structural):
[paste your current exception list]
Step 2 — Retry contract per call site
For each call site below, specify:
- exception classes that count as transient: [list]
- max attempts: [number]
- backoff: exponential with jitter, base [ms], cap [ms]
- idempotency strategy: [natural key / dedup token / conditional write]
- circuit-breaker threshold: [failures over window]
Call sites: [paste your list]
Step 3 — Persistence layer
Specify:
- checkpointer implementation: [class name]
- thread_id derivation: [how it is computed from input]
- state that must survive process restart: [list]
- state that must NOT be replayed on resume: [list — non-idempotent effects]
Step 4 — Validation
Write recovery tests for each of:
- kill-the-pod mid-superstep
- 503 storm shorter than budget
- malformed model output → ModelRetry → success
- circuit-breaker trip and recovery
- impossible-task escalation to a human
Output: a single markdown spec document with one section per step.
Do NOT generate framework code. Generate the specification only.
Ship It
You now have a four-layer model for agent failure: retry, validate, persist, escalate. Each layer is a separate contract you can specify, test, and replace. The next time someone tells you their agent “just needs retries,” you will know which layer they actually mean — and which three they forgot.