LLM Fallback And Retry Patterns
Also known as: LLM retry logic, AI API fault tolerance, model fallback chains
LLM fallback and retry patterns are fault-tolerance strategies for production AI apps that automatically recover from AI API failures. Retry logic re-attempts a failed call on the same model with exponential backoff; fallback logic routes to an alternative model when retries are exhausted, so the app keeps running through a single-model outage instead of letting the failure cascade.
What It Is
When an AI app calls a model API and receives an error, two paths exist: the user sees the failure, or the app handles it without surfacing it. Fallback and retry patterns make the second path the default. They intercept errors between application code and the model provider, deciding how to respond before the failure reaches the user.
Model APIs fail in two distinct ways. Transient failures are temporary: a rate-limit error (HTTP 429) means the provider is throttling requests; a server error (500, 502, 503) usually means the provider is overloaded or briefly unavailable. Both typically resolve on their own, often within seconds. Permanent failures are not recoverable by retrying: a bad request (400) means the application sent something malformed; authentication errors (401, 403) mean the credentials are missing, invalid, or not authorized for the request. According to LiteLLM Docs, permanent error codes should never be retried, because the same request will fail every time.
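The transient/permanent split above can be expressed as a small lookup. This is a minimal sketch, not any library's API: the names `TRANSIENT_STATUS`, `PERMANENT_STATUS`, and `should_retry` are hypothetical, and treating unlisted codes as non-retryable is an assumption made here to stay conservative.

```python
# Status codes the text describes as transient (worth retrying).
TRANSIENT_STATUS = {429, 500, 502, 503}
# Status codes the text describes as permanent (never retry).
PERMANENT_STATUS = {400, 401, 403}


def should_retry(status_code: int) -> bool:
    """Return True only when the status code signals a transient failure.

    Assumption: codes in neither set are treated as non-retryable, so an
    unknown error surfaces instead of being retried blindly.
    """
    if status_code in PERMANENT_STATUS:
        return False
    return status_code in TRANSIENT_STATUS


if __name__ == "__main__":
    for code in (429, 503, 400, 401):
        print(code, "retry" if should_retry(code) else "do not retry")
```

Keeping the decision keyed on status codes rather than on error message text means the rule stays stable when a provider rewords its messages.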
Retry logic targets transient failures. The standard approach is exponential backoff: the wait time between retries doubles after each attempt. According to fast.io, a typical implementation starts with a 1–2 second wait, doubles through 5–7 attempts, and adds jitter — a small random offset — to each interval. Jitter is not cosmetic. Without it, multiple clients that failed at the same moment retry in sync, sending a spike of traffic at an already-strained API. Exponential backoff with jitter spreads retries over time, giving the provider room to stabilize instead of amplifying the load. This is what prevents a brief outage from cascading into a prolonged one.
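The backoff schedule described above can be sketched in a few lines. This is an illustrative implementation under stated assumptions: `TransientError`, `backoff_delay`, and `call_with_retry` are hypothetical names, the 1-second base and 5-attempt cap follow the typical values quoted in the text, and the 25% jitter ceiling and 60-second maximum interval are arbitrary choices for the example.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for a retryable failure (e.g. HTTP 429 or 5xx)."""


def backoff_delay(attempt: int, base: float = 1.0,
                  max_delay: float = 60.0, jitter: float = 0.25) -> float:
    """Wait before the next try: the interval doubles per attempt (0-based),
    is capped at max_delay, and gets a random offset of up to
    jitter * interval added on top."""
    interval = min(max_delay, base * 2 ** attempt)
    return interval + random.uniform(0, jitter * interval)


def call_with_retry(fn: Callable[[], T], max_attempts: int = 5,
                    base: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying on TransientError with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted; the caller decides what happens next
            sleep(backoff_delay(attempt, base=base))
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter exists only so the wait can be swapped out in tests; in normal use the default `time.sleep` applies. Any exception other than `TransientError` propagates on the first failure, which matches the rule that permanent errors are not retried.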
Fallback logic activates when retries are exhausted. Rather than returning an error, the system routes the request to a backup model — a cheaper model from the same provider, or a model from a different provider entirely. According to LiteLLM Docs, there are three distinct fallback types: a standard fallback switches the model on any failure; a content policy fallback activates only when the primary model refuses the request for policy reasons; a context window fallback routes to a model with a larger context when the input exceeds the primary model’s limit. Each type responds to a different failure signal rather than treating every error as equivalent.
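One way to picture the three fallback types is as a routing table keyed by failure signal. The sketch below is library-free and does not reproduce LiteLLM's API: the exception classes, the `routes` dictionary keys, and the model names are all hypothetical placeholders chosen for illustration.

```python
from typing import Dict, List


class ModelError(Exception):
    """Hypothetical generic model failure (outage, exhausted retries)."""


class ContentPolicyError(ModelError):
    """Hypothetical: the primary model refused the request for policy reasons."""


class ContextWindowError(ModelError):
    """Hypothetical: the input exceeded the primary model's context limit."""


def pick_fallbacks(error: ModelError, routes: Dict[str, List[str]]) -> List[str]:
    """Choose the fallback list that matches the failure signal."""
    if isinstance(error, ContentPolicyError):
        return routes.get("content_policy", [])
    if isinstance(error, ContextWindowError):
        return routes.get("context_window", [])
    return routes.get("standard", [])


if __name__ == "__main__":
    # Placeholder model names, not real identifiers.
    routes = {
        "standard": ["backup-model"],
        "context_window": ["long-context-model"],
        "content_policy": [],  # empty: surface the refusal instead of rerouting
    }
    print(pick_fallbacks(ContextWindowError(), routes))
```

Leaving a list empty is a way to opt out of one fallback type while keeping the others, which is relevant when a policy refusal should be logged rather than rerouted.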
Think of it like a card payment network: when your preferred processor is unavailable, the bank routes the transaction through a secondary network. The payment completes; the routing decision is invisible.
How It’s Used in Practice
Most developers encounter these patterns through LLM gateway libraries or proxies — tools that sit in front of the model API and handle reliability automatically. LiteLLM, for example, lets you configure retry behavior and fallback model chains in a config file. According to OpenRouter Docs, OpenRouter supports a defined list of fallback models per request, routing automatically when the primary model is unavailable.
The common production scenario: a team building a user-facing AI feature on a primary model adds a gateway layer. They configure it to retry on 429 and 5xx errors with exponential backoff, then fall back to a secondary model when retries run out. During a provider incident, users notice slightly longer response times but the app keeps working. Without the gateway, a single provider outage returns raw error messages directly to users.
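The scenario above, retry on transient errors and then fall back, can be sketched without any gateway library. Everything here is an assumption for illustration: `call_model` stands in for whatever client function the application uses, `TransientError` and `PermanentError` are hypothetical exception classes, and the retry count and delays are example values rather than any tool's defaults.

```python
import random
import time
from typing import Callable, Optional, Sequence, Tuple


class TransientError(Exception):
    """Hypothetical: a 429 or 5xx response from the provider."""


class PermanentError(Exception):
    """Hypothetical: a 400, 401, or 403 response; retrying cannot help."""


def complete_with_fallback(
    call_model: Callable[[str, str], str],
    prompt: str,
    models: Sequence[str],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[str, str]:
    """Try each model in order and return (model_used, response_text)."""
    last_error: Optional[Exception] = None
    for model in models:
        for attempt in range(max_attempts):
            try:
                return model, call_model(model, prompt)
            except TransientError as exc:
                last_error = exc
                if attempt < max_attempts - 1:
                    # Exponential backoff with up to 25% jitter (assumed value).
                    interval = base_delay * 2 ** attempt
                    sleep(interval + random.uniform(0, 0.25 * interval))
        # Retries exhausted for this model: move on to the next one.
    raise RuntimeError("all models failed") from last_error
```

`PermanentError` is deliberately not caught, so a malformed request or bad credential surfaces immediately instead of being retried or rerouted. Returning the model name alongside the text lets the caller log when a fallback answered.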
Pro Tip: Always set a retry cap (5 to 7 attempts is a typical range) and add jitter to your backoff intervals. Without a cap, a stuck request keeps retrying indefinitely. Without jitter, clients that failed together retry together, turning a brief provider outage into a thundering herd problem that keeps the API from recovering.
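A retry cap also bounds how long a user can be left waiting. The arithmetic below is a rough sketch that assumes a 1-second base interval, pure doubling, no jitter, and no per-call latency; `total_backoff` is a hypothetical helper name.

```python
def total_backoff(attempts: int, base: float = 1.0) -> float:
    """Total time spent sleeping across `attempts` tries.

    There is one wait between consecutive tries, so `attempts` tries
    involve `attempts - 1` waits of base, 2*base, 4*base, and so on.
    """
    return sum(base * 2 ** k for k in range(attempts - 1))


if __name__ == "__main__":
    print(total_backoff(5))  # 1 + 2 + 4 + 8 = 15.0 seconds
    print(total_backoff(7))  # 1 + 2 + 4 + 8 + 16 + 32 = 63.0 seconds
```

Under these assumptions, moving from 5 to 7 attempts roughly quadruples the worst-case wait, which is worth weighing against how long a user-facing request can reasonably stay open.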
When to Use / When Not
| Scenario | Use | Avoid |
|---|---|---|
| Rate-limit errors (429) from a temporarily throttled provider | ✅ | |
| Server errors (500, 502, 503) from a provider under load | ✅ | |
| Authentication failures (401, 403) | | ❌ |
| Malformed request errors (400) | | ❌ |
| Primary model unavailable mid-production run | ✅ (fallback) | |
| Content policy refusal on a compliance-sensitive request | | ❌ (log and review instead) |
Common Misconception
Myth: Retrying an error will eventually succeed if you wait long enough.
Reality: According to LiteLLM Docs, errors like 400, 401, and 403 are permanent — the request will fail every time regardless of how long you wait. Only transient errors (429, 500, 502, 503, 504, and network timeouts) benefit from retry logic. Retrying a permanent error wastes time and consumes API quota.
One Sentence to Remember
Retry handles “failed for now”; fallback handles “still failing after retries”; and exponential backoff with jitter is what keeps retries from turning a brief API outage into a cascade that prevents recovery.
FAQ
Q: What is the difference between retry and fallback in LLM apps?
A: Retry re-attempts the same model after a transient failure, spacing attempts with exponential backoff. Fallback activates when retries are exhausted, routing the request to a different model or provider instead of returning an error to the user.
Q: Which HTTP error codes should trigger a retry?
A: According to LiteLLM Docs, retry on transient codes: 429, 500, 502, 503, 504, and network timeouts. Do not retry permanent errors (400, 401, and 403), because the same request will fail on every subsequent attempt.
Q: Why does exponential backoff prevent cascade failures?
A: Without backoff, all clients that failed simultaneously retry at once and keep retrying at full rate, amplifying load on an already-strained API. Exponential backoff lengthens the interval after each attempt, which reduces traffic during recovery, and jitter staggers clients so their retries do not arrive in synchronized waves.
Sources
- LiteLLM Docs: Reliability — Retries, Fallbacks - authoritative reference for retry codes, fallback types, and backoff behavior
- OpenRouter Docs: Model Fallbacks — Reliable AI with Automatic Failover - reference for multi-provider fallback configuration
Expert Takes
Retry and fallback are error-classification systems before they are scheduling systems. The retry decision depends on whether a status code signals a transient state or a permanent one — that distinction must be encoded in the gateway config, not inferred from error text. Exponential backoff solves a classic distributed systems problem: preventing synchronized retries from amplifying load on a recovering service. Distinct fallback types — standard, content policy, context window — exist because “failure” is not a single signal.
The spec decision that kills most retry implementations is treating all timeouts as equivalent. A connection timeout (the connection was never established, so the request did not reach the provider) and a read timeout (the connection was established, but the response did not arrive in time) both surface as “timeout” in generic error handling, but they have different causes and different retry safety profiles: after a read timeout the provider may already be processing the request, so a retry can duplicate work and cost. Separate them in your gateway config. Fallback chains also need a hop limit: without one, cascading failures across providers look like success to the caller, with only elevated latency as the signal that something went wrong.
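Both points, separating timeout types and capping fallback hops, can be sketched together. This is one possible policy under stated assumptions, not a recommendation from the sources: `ConnectTimeout`, `ReadTimeout`, `ChainResult`, and `run_chain` are hypothetical names, a connection timeout is resent to the same model, and a read timeout moves to the next model rather than resending.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


class ConnectTimeout(Exception):
    """Hypothetical: the connection was never established."""


class ReadTimeout(Exception):
    """Hypothetical: connected, but no response arrived in time."""


@dataclass
class ChainResult:
    text: str
    model: str
    hops: int  # 0 means the primary model answered


def run_chain(call_model: Callable[[str, str], str], prompt: str,
              models: Sequence[str], max_hops: int = 1,
              connect_retries: int = 1) -> ChainResult:
    """Walk the fallback chain, visiting at most max_hops models after the primary."""
    for hops, model in enumerate(models[: max_hops + 1]):
        for _ in range(connect_retries + 1):
            try:
                return ChainResult(call_model(model, prompt), model, hops)
            except ConnectTimeout:
                continue  # request never reached the provider; resend as-is
            except ReadTimeout:
                break  # provider may still be working; do not resend to it
    raise RuntimeError("fallback chain exhausted within hop limit")
```

Returning `hops` with the result gives the caller an explicit signal that a fallback answered, instead of leaving elevated latency as the only clue.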
Every AI product that shipped without retry logic learned the same lesson when their first real traffic spike hit: model APIs fail at the moment they matter most. Fallback patterns aren’t optional for teams moving past prototype. The gateway layer — routing across models, handling backoff, logging failure patterns — is the minimum viable reliability layer. Teams that treat it as something to add later consistently ship it after their first production incident.
Fallback chains introduce a question that retry logic sidesteps: when you swap a model mid-request because the primary refused for policy reasons, who reviewed whether the secondary’s policies match the use case? Content policy fallbacks exist to keep applications running, but they can route sensitive requests to a model with different — or less documented — refusal behavior. The engineering decision to keep the app up at any cost is also a values decision that rarely surfaces in code review.