From RAG to Agents: Prerequisites and Hard Limits of Agentic RAG

ELI5
Agentic RAG wraps a classic retrieve-then-generate pipeline inside a reasoning loop. An agent decides when to retrieve, judges what it got, and rewrites the query if the answer is weak — instead of retrieving once and hoping.
Take a retrieval pipeline that works ninety-five times out of a hundred. Wrap an agent around it, add a reranker, plug in an evaluator, give the model permission to rewrite the query and try again. The end-to-end success rate quietly slides toward four-in-five. The system did not get worse. It got longer. That is the part most teams discover after the demo, in production, with traffic.
The Stack You Inherit Before You Touch an Agent
Agentic RAG is not a single technique you can learn in isolation. It is the fifth floor of a building, and most production failures happen because someone tried to install the elevator before the second floor existed. The Singh et al. (2025) survey on agentic RAG treats it explicitly as a generalisation of three earlier layers — and skipping any of them turns the agent loop into theatre.
What do you need to know before learning agentic RAG?
Three concentric layers, each a hard prerequisite.
The first layer is classic Retrieval Augmented Generation — the original retrieve-then-generate architecture introduced in Lewis et al. (2020). You embed a corpus, fetch the top-k chunks for a query, and stuff them into the model’s context. Naive in shape, but every piece of the agentic stack still depends on it. If you cannot reason about chunk size, embedding model choice, or top-k cutoff, no orchestrator on top will save you.
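To make the shape concrete, here is a minimal layer-one sketch. The toy bag-of-words embedding, the cosine scorer, and the `llm` callable are placeholders for whatever embedding model, vector store, and chat model you actually run; the point is the single retrieve-then-generate pass, not the components.

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding', a stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Layer one retrieval: score every chunk against the query, keep the top-k."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def naive_rag(query: str, chunks: list[str], llm) -> str:
    """Retrieve once, stuff the top-k chunks into the prompt, generate once. No loop."""
    context = "\n\n".join(retrieve(query, chunks))
    prompt = f"Answer from the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
    return llm(prompt)  # `llm` is whatever chat-completion callable you use
```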
The second layer is advanced RAG — the techniques the field stacked on top of the naive pipeline once it became obvious that one-shot retrieval underperforms. The Naive / Advanced / Modular taxonomy in the Gao et al. (2023) RAG survey is the canonical reference here, and three primitives matter:
- Hybrid Search combines dense vector similarity with sparse lexical matching (typically BM25). Dense retrieval finds semantic neighbours; sparse retrieval pins down exact identifiers, rare tokens, and acronyms the embedding space dilutes. Most production retrieval systems run both and fuse the scores (one common fusion method is sketched just after this list).
- Reranking takes the top-k from retrieval (say, fifty candidates) and rescores them with a stronger cross-encoder that actually reads each candidate against the query. The retrieval index is built for recall; the reranker is built for precision. They are different jobs.
- Query Transformation rewrites, decomposes, or expands the user’s question before it ever touches the index. User queries are usually shorter, vaguer, and more under-specified than the documents that answer them; bridging that gap is one of the highest-leverage interventions in the stack.
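The fusion step in hybrid search is worth seeing in code. Reciprocal rank fusion is one common way to merge the two ranked lists; the sketch below assumes you already have a dense ranking and a BM25 ranking as lists of document ids, and the k=60 constant is the conventional default rather than anything tuned.

```python
def reciprocal_rank_fusion(dense: list[str], sparse: list[str], k: int = 60) -> list[str]:
    """Merge a dense (vector) ranking and a sparse (BM25) ranking: each document
    earns 1 / (k + rank) per list it appears in, and the fused order sorts by the
    summed score. Documents both retrievers agree on float to the top."""
    scores: dict[str, float] = {}
    for ranking in (dense, sparse):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Typical wiring: fuse the top-50 from each retriever, hand the fused list to a
# cross-encoder reranker, keep the top handful for the prompt.
fused = reciprocal_rank_fusion(
    dense=["doc_12", "doc_7", "doc_31"],    # hypothetical ids from the vector index
    sparse=["doc_7", "doc_99", "doc_12"],   # hypothetical ids from BM25
)
```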
A fourth advanced technique, Contextual Retrieval, prepends document-level context to each chunk before embedding it — so a chunk reading “The latency was 200ms” carries enough surrounding text to be retrievable as “Service X showed 200ms latency in the Q4 review.” Without it, retrieval over real corpora becomes a guessing game across stripped-of-context fragments.
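A minimal sketch of that prepend step, assuming an `llm` completion callable; the prompt wording is illustrative, and production versions typically cache the full-document portion of the prompt so the per-chunk cost stays manageable.

```python
def contextualize_chunk(chunk: str, full_document: str, llm) -> str:
    """Generate a short situating sentence for a chunk and prepend it before
    embedding, so the chunk carries its document-level context into the index."""
    prompt = (
        "Document:\n" + full_document + "\n\n"
        "Chunk from that document:\n" + chunk + "\n\n"
        "Write one short sentence that situates this chunk within the document, "
        "so it can be understood on its own. Answer with the sentence only."
    )
    context_sentence = llm(prompt).strip()
    return context_sentence + "\n" + chunk   # embed this string, not the bare chunk
```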
The third layer is agent primitives — the patterns that turn a one-shot pipeline into a loop. Yao et al. (2022) introduced the ReAct pattern, the interleaved Thought → Action → Observation cycle every agentic-RAG implementation derives from. Anthropic’s engineering team catalogues the six composable patterns most teams build from: prompt chaining, routing, parallelization, orchestrator-workers, evaluator-optimizer, and the autonomous agent itself.
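A compressed view of that loop follows, with deliberately naive string parsing; real implementations use structured tool calls and stricter stop conditions, and the `tools` dict (mapping action names like "retrieve" to callables) is an assumption about how you expose your pipeline to the model.

```python
def react_loop(question: str, llm, tools: dict, max_steps: int = 6) -> str:
    """Thought -> Action -> Observation, interleaved until the model commits to
    a final answer or the step budget runs out."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript + "Thought:")               # model reasons and names an action
        transcript += f"Thought:{step}\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        if "Action:" in step:
            action = step.split("Action:", 1)[1].strip()
            name, _, arg = action.partition(" ")           # e.g. "retrieve q4 latency report"
            observation = tools.get(name, lambda a: "unknown tool")(arg)
            transcript += f"Observation: {observation}\n"  # the observation feeds the next thought
    return "No answer within the step budget."
```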
Sitting between the second and third layers are the bridge papers. Self-RAG (Asai et al., 2023) trains the language model to emit reflection tokens that decide whether retrieval is needed at all and whether the generation is grounded. CRAG (Yan et al., 2024) bolts a separate lightweight retrieval evaluator onto the pipeline, scoring each retrieved document and routing to one of three corrective actions: use as-is, refine, or fall back to web search. Both are often described as “agentic,” but be precise — Self-RAG fine-tunes the generator, CRAG adds an external evaluator. Neither is a multi-agent system on its own. They are the conceptual bridge.
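To make the bridge concrete, here is the CRAG-side idea reduced to a router: a lightweight evaluator scores each retrieved document and the score picks one of the three corrective actions. The thresholds are illustrative, and the paper's knowledge-strip refinement is collapsed into a simple filter; this is the routing shape, not a faithful reimplementation.

```python
def corrective_route(query: str, docs: list[str], evaluator, web_search,
                     upper: float = 0.7, lower: float = 0.3) -> list[str]:
    """Score each retrieved document, then route: use as-is, refine, or fall back."""
    scores = [evaluator(query, d) for d in docs]            # relevance scores in [0, 1]
    best = max(scores, default=0.0)
    if best >= upper:                                        # confident: keep the strong documents
        return [d for d, s in zip(docs, scores) if s >= upper]
    if best <= lower:                                        # wrong: discard and search the web instead
        return web_search(query)
    kept = [d for d, s in zip(docs, scores) if s > lower]    # ambiguous: refine and augment
    return kept + web_search(query)
```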
The reason this stack order matters: an agent loop is only as good as the primitives it calls. If hybrid search is misconfigured, the agent will obediently rewrite the query and retrieve the same garbage with different syntax. Compounding zero is still zero.
Where the Pipeline Stops Behaving Like a Pipeline
Once you have all three layers and you wire an agent loop on top, the system gains a property classical RAG does not have: it can decide. That sounds like an upgrade. It is also a new physics, and the physics has four hard limits that show up the moment you touch production traffic.
What are the technical limitations of agentic RAG architectures?
Four constraints, all of them compounding — and the operational catalogue (latency, cost, reliability, complexity, observability/overhead) lines up with what production teams actually report.
Latency stacks linearly with steps. A naive RAG call is one retrieval plus one generation. An agentic loop with a planner, two retrieval rounds, a grader, and a synthesis step easily multiplies the wall-clock budget several times over. Each step waits for the previous one’s tokens. There is no parallelism inside a sequential trajectory. Users feel this immediately, and no amount of model speed-up rescues the order-of-operations cost.
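Rough arithmetic makes the point; the per-step latencies below are placeholders, not benchmarks.

```python
# Illustrative wall-clock budgets in seconds; substitute your own measurements.
RETRIEVE, GENERATE, PLAN, GRADE = 0.2, 1.5, 1.0, 0.8

naive = RETRIEVE + GENERATE                                    # one retrieval, one generation
agentic = PLAN + 2 * (RETRIEVE + GENERATE) + GRADE + GENERATE  # planner, two rounds, grader, synthesis

print(f"naive ~{naive:.1f}s, agentic ~{agentic:.1f}s")         # ~1.7s vs ~6.7s on these assumptions
```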
Cost scales with the trajectory, not the request. The order-of-magnitude rule of thumb in circulation is roughly $0.001 per query for naive RAG, $0.005 with hybrid + rerank, and $0.02 to $0.10 per query for full agentic RAG (Lushbinary 2026 RAG Production Guide), because a multi-step trajectory makes multiple LLM calls — often with the full retrieved context attached each time. Treat that ladder as illustrative arithmetic, not a price quote — but treat the shape of it as real. A ten-cent query is fine for a knowledge-worker copilot and ruinous for consumer search.
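The same arithmetic, in dollars. Every number below is an assumption: token counts for a hypothetical trajectory, and per-token prices you should replace with your model's actual rates.

```python
PRICE_IN, PRICE_OUT = 0.0003, 0.0015      # assumed $ per 1K input / output tokens

def call_cost(tokens_in: int, tokens_out: int) -> float:
    return tokens_in / 1000 * PRICE_IN + tokens_out / 1000 * PRICE_OUT

naive = call_cost(3_000, 500)             # one generation with retrieved context attached
agentic = sum(call_cost(t_in, t_out) for t_in, t_out in [
    (1_000, 300),                         # planner
    (4_000, 400),                         # first answer attempt, full context attached
    (4_500, 200),                         # grader reads the context plus the draft
    (4_000, 600),                         # query rewrite and second attempt
])
print(f"naive ~${naive:.4f}, agentic ~${agentic:.4f}")  # the trajectory, not the request, sets the bill
```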
Reliability multiplies, it does not add. This is the most underrated property of the architecture. If retrieval succeeds 95% of the time, reranking is right 95% of the time, and the final generation is grounded 95% of the time, end-to-end success is roughly 0.95 × 0.95 × 0.95 ≈ 0.86: a three-stage pipeline already failing about one call in seven. Each new agent step (a query rewrite, a tool router, a critic) multiplies again; a fourth 95%-reliable step drops the product to about 0.81, failing roughly one call in five. The arithmetic is not a measurement; it is what compounding does to independent probabilities.
Not a glitch. An emergent property of multi-step composition.
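The compounding fits in one line of code; the 95% per-step figure is the only assumption.

```python
# End-to-end success of a chain of independent steps, each 95% reliable.
for n_steps in range(3, 7):
    print(n_steps, round(0.95 ** n_steps, 2))   # 3 -> 0.86, 4 -> 0.81, 5 -> 0.77, 6 -> 0.74
```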
Evaluator fragility cascades errors instead of catching them. Every “corrective” step in an agentic-RAG pipeline depends on a critic — a grader model, a reflection prompt, a confidence threshold. When the critic is well-calibrated, it filters bad retrievals and rescues the answer. When the critic is miscalibrated — too lenient, too strict, or systematically wrong on a class of inputs — the corrective step actively makes the pipeline worse, because it confidently routes good answers into a rewrite loop and rubber-stamps bad ones (Meilisearch Blog). Operators most often misdiagnose this: they assume a corrective system is robust by construction, when in fact it inherits the brittleness of its evaluator.
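One way to see whether your critic deserves the trust the loop places in it: compare its pass/fail verdicts against a small set of human-labelled retrievals and look at the two error directions separately. The helper below is hypothetical and framework-agnostic.

```python
def grader_error_profile(labels: list[bool], verdicts: list[bool]) -> dict[str, float]:
    """Compare a critic's pass/fail verdicts with human labels for the same retrievals.
    The two failure directions hurt differently: false accepts rubber-stamp bad
    context, false rejects send good answers into needless rewrite loops."""
    n = len(labels)
    false_accept = sum(1 for ok, v in zip(labels, verdicts) if not ok and v) / n
    false_reject = sum(1 for ok, v in zip(labels, verdicts) if ok and not v) / n
    return {"false_accept_rate": false_accept, "false_reject_rate": false_reject}
```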
The Lewis-style retrieve-then-generate pipeline is a function. The agentic version is a state machine. They have different debugging tools, different cost models, and different failure modes — and that is before you account for the operational complexity and observability overhead that sit underneath the four physics.

What This Predicts About Your Build
The mechanism makes specific predictions you can test before you commit to the architecture.
- If your retrieval recall on a held-out test set is weak, an agent loop will not fix it — it will just spend more tokens to confirm the bad retrievals. Fix retrieval first.
- If your evaluator (grader / critic) has not been measured against a labelled set, treat the corrective step as a coin flip with extra steps. The agent will follow whatever the evaluator says, including when the evaluator is wrong.
- If your latency budget is sub-second end-to-end, cycle-based architectures (LangGraph-style) will not fit. Pick a router pattern with at most one retrieval round, or pre-compute.
- If you are running consumer-scale traffic, model the per-query cost ceiling first; agentic trajectories make individual queries an order of magnitude more expensive than the advanced-RAG baseline.
Rule of thumb: add an agent loop only after you have measured retrieval recall, reranker precision, and end-to-end latency on the non-agentic baseline.
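Measuring the baseline that rule of thumb asks for is not exotic; recall@k against a labelled held-out set is a few lines. The `results` and `relevant` shapes below are assumptions about how you store your evaluation data.

```python
def recall_at_k(results: list[list[str]], relevant: list[set[str]], k: int = 5) -> float:
    """Fraction of held-out queries where at least one labelled-relevant chunk
    appears in the retriever's top-k. Run this on the non-agentic baseline first."""
    hits = sum(1 for docs, gold in zip(results, relevant) if set(docs[:k]) & gold)
    return hits / len(results)

# Usage sketch: `results` holds the top-k chunk ids your retriever returns per
# query, `relevant` holds the labelled chunk ids that actually answer it.
```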
When it breaks: the typical failure mode is not a single broken component; it is the compounding of four 95%-reliable steps into a roughly 81%-reliable system, with an undiagnosed evaluator that masks which step is responsible for the slide.
Security & compatibility notes:
- LangChain Serialization Injection (CVE-2025-68664, “LangGrinch”): Critical vulnerability allowing secret extraction from agentic stacks built on LangChain Core. Patched in 1.2.5 / 0.3.81 with breaking changes: load()/loads() now enforce an allowlist, environment-variable interpolation defaults to off, and Jinja2 templates are blocked. Audit any custom serializers before upgrading.
- LangChain Path Traversal (CVE-2026-34070): High-severity. Legacy load_prompt()/load_prompt_from_config() are deprecated; removal is scheduled for 2.0.0. Migrate prompt loading off the legacy helpers now.
The Data Says
Agentic RAG is a generalisation of classic RAG, not a replacement for it — and the generalisation costs latency, dollars, and reliability in measurable, multiplicative ways. Build the three prerequisite layers first; measure the baseline; only then add the loop. The math under the loop does not care about the demo.