MAX guide 14 min read May 14, 2026

How to Build a Retrieval-Augmented Agent with LangGraph, LlamaIndex, and CrewAI in 2026

[Figure: Three-lane diagram showing retrieval, decision, and orchestration components in an agentic RAG system]


TL;DR

A retrieval-augmented agent system is four contracts: when to retrieve, what to retrieve, what to do with the result, and how to orchestrate the agents doing it. Specify each one in writing before you pick a framework.
LangGraph, LlamaIndex, and CrewAI solve different parts of the same problem. Pick by the part you need, not by the logo on the README.
Most “the agent hallucinated” incidents are missing-spec incidents. The agent retrieved when it shouldn’t have, or returned text when it should have called a tool.

A team ships an agentic RAG prototype. It works on the demo. Two weeks in, the support inbox fills with “the bot ignored the manual” tickets. The retriever ran on every turn, including ones where the user asked what time it was. Cost tripled, latency doubled, and the answers got worse, not better.

That is not a model failure. That is a missing spec. Specifically, the spec for when retrieval is even allowed to fire.

This guide walks you through the four contracts a retrieval-augmented agent needs before you let any framework generate code. Then it tells you which of the three current production frameworks — LangGraph, LlamaIndex, or CrewAI — actually matches each contract.

Before You Start

You’ll need:

An AI coding tool — Cursor, Claude Code, or Codex
A working knowledge of vanilla RAG (indexing, embedding, retrieval, generation)
A specific use case in mind — “enterprise knowledge search,” “compliance Q&A over policy PDFs,” or “internal API helpdesk.” Generic “build me an agent” prompts produce generic agents.

This guide teaches you: how to decompose a retrieval-augmented agent into four contracts — retrieval gate, retrieval scope, response policy, and orchestration shape — then specify each one to your AI coding tool so the generated code matches the framework’s idioms instead of fighting them.

The Vanilla-RAG Trap

The bug pattern repeats. A team takes a working vanilla RAG pipeline, wraps it in an agent loop, and calls it “agentic.” The agent now retrieves on every single turn because the loop never specified an alternative. Latency climbs. Cost climbs. Answer quality drops because the retrieval is firing on conversational chit-chat that has no relevant documents.

It worked on Friday’s demo. On Monday, the same prompt produces three irrelevant citations because the user asked a follow-up the retriever can’t ground.

The diagnosis: the agent has retrieval as a reflex, not a decision. Vanilla RAG retrieves, period. Agentic RAG decides whether to retrieve, what to retrieve, and what to do if the retrieval is weak. That decision logic is the entire reason for the word “agentic.” Skip it and you’ve just added latency to vanilla RAG.

Step 1: Identify the Four Contracts

Before you let any framework generate a single line, write down four contracts. They map directly onto the three frameworks below — the contracts are framework-independent, but each framework prefers a different one.

Your retrieval-augmented agent has four contracts:

Retrieval Gate — when retrieval is allowed to fire. Always? Only on factual queries? Only when confidence is low? This is the contract vanilla RAG skips entirely.
Retrieval Scope — which index, which filters, which top-k. A knowledge-base query is not a code-search query. A compliance question is not a casual one.
Response Policy — what the agent does with retrieved chunks. Cite them? Quote them? Refuse to answer if they conflict? Re-query if relevance is low?
Orchestration Shape — single agent with one retriever, or multiple specialist agents handing off to each other.
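
One way to keep the four contracts honest is to write them as data before any framework code exists. The sketch below is an assumption about how such a spec could look, not a structure any of the three frameworks defines: `AgentSpec`, `RetrievalScope`, `ResponsePolicy`, and the `problems()` check are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Literal


@dataclass(frozen=True)
class RetrievalScope:
    index: str                                  # hypothetical index name
    filters: dict = field(default_factory=dict)
    top_k: int = 5


@dataclass(frozen=True)
class ResponsePolicy:
    strong_threshold: float   # top score at or above this counts as strong
    weak_threshold: float     # top score below this counts as empty
    on_weak: Literal["requery", "refuse"] = "requery"
    on_empty: Literal["refuse", "fallback_with_warning"] = "refuse"


@dataclass(frozen=True)
class AgentSpec:
    gate_rule: str            # one-sentence statement of when retrieval fires
    scope: RetrievalScope
    policy: ResponsePolicy
    orchestration: Literal["single", "multi"] = "single"

    def problems(self) -> list:
        """Return human-readable gaps in the spec; an empty list means ready."""
        issues = []
        if not self.gate_rule.strip():
            issues.append("retrieval gate is unspecified")
        if self.scope.top_k < 1:
            issues.append("top_k must be at least 1")
        if not 0 <= self.policy.weak_threshold < self.policy.strong_threshold:
            issues.append("thresholds must satisfy 0 <= weak < strong")
        return issues
```

If `problems()` returns anything, the spec is not ready to hand to a coding tool. The thresholds and their scale are placeholders; pick values that match how your retriever scores results.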

The Architect’s Rule: If you cannot answer “what should the agent do when retrieval returns nothing useful?” in one sentence, you are not ready to write code. You are ready to write a spec.

The retrieval gate is where most teams go wrong. Vanilla RAG has no gate — it retrieves on every turn. An agentic system needs an explicit decision node. LangChain’s official agentic-RAG pattern names this node generate_query_or_respond — the LLM decides whether to call the retriever tool or answer directly, per LangChain Docs. That decision is the contract.
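
To make the gate concrete, here is a minimal framework-free sketch. It is not LangChain's implementation: in the pattern described above the LLM makes the decision, while this example swaps in a keyword rule (`keyword_gate`) so the control flow is visible and testable. `gated_turn`, `keyword_gate`, and `SMALL_TALK` are hypothetical names.

```python
from typing import Callable

# Hypothetical stand-in for the model's decision. A real gate would ask the
# LLM (for example through tool calling) instead of matching keywords.
SMALL_TALK = {"hi", "hello", "thanks", "thank you", "bye"}


def keyword_gate(query: str) -> bool:
    """Return True when the query looks like it needs grounding."""
    normalized = query.strip().lower().rstrip("!.?")
    return bool(normalized) and normalized not in SMALL_TALK


def gated_turn(
    query: str,
    retrieve: Callable[[str], list],
    answer: Callable[[str, list], str],
    gate: Callable[[str], bool] = keyword_gate,
) -> str:
    """Run retrieval only when the gate allows it; otherwise answer directly."""
    chunks = retrieve(query) if gate(query) else []
    return answer(query, chunks)
```

The point of the shape is that `retrieve` is never called unless `gate` says so. Swapping the keyword rule for a model call changes the quality of the decision, not the contract.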

Step 2: Lock Down the Framework Contract

Once the four contracts are written, the framework choice gets easier. Each of the three production frameworks specializes in one part of the four-contract system. Pick by which part you need most.

Context checklist for your AI coding tool:

Target framework named explicitly with current version
Retrieval gate logic stated as a precondition, not implied
Retrieval scope (which index, filters, top-k) defined per query type
Response policy stated for three cases: strong retrieval, weak retrieval, no retrieval
Orchestration shape: single agent or multi-agent, and why
Error handling pattern: what happens if the retriever times out, the index is empty, or the LLM produces an off-spec response

The framework split in 2026 is cleaner than it was a year ago. LangChain Docs describes LangGraph as the stateful workflow layer where you write the gate logic as an explicit graph node, with full control over retries, error handlers, and per-node timeouts. The March 10, 2026 release added type-safe streaming and per-node timeouts, per LangChain’s changelog. That matters because retrieval timeouts are among the most common failure modes in production agentic RAG.

LlamaIndex Docs frames AgentWorkflow as the event-driven primitive for document-grounded agents. The architecture LlamaIndex Blog recommends is per-document agents with embedding search plus summarization, with a top-level agent over the document agents performing tool retrieval and Chain-of-Thought reasoning. That maps directly to retrieval scope: each document agent owns its own retrieval contract.
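
The per-document pattern can be sketched without any framework. The example below is an illustration of the routing idea only and does not use LlamaIndex APIs: token overlap stands in for embedding search, and `DocumentAgent`, `route`, and `tokens` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


def tokens(text: str) -> set:
    """Lowercase word set with trailing punctuation stripped."""
    return {word.strip(".,?!").lower() for word in text.split()}


@dataclass
class DocumentAgent:
    name: str
    summary: str     # used by the top-level agent to pick a document
    chunks: list     # this agent's private retrieval scope

    def retrieve(self, query: str, top_k: int = 2) -> list:
        # Token overlap stands in for embedding search in this sketch.
        q = tokens(query)
        ranked = sorted(self.chunks, key=lambda c: len(q & tokens(c)), reverse=True)
        return [c for c in ranked[:top_k] if q & tokens(c)]


def route(query: str, agents: list) -> Optional[DocumentAgent]:
    """Top-level step: choose the document agent whose summary matches best."""
    q = tokens(query)
    best = max(agents, key=lambda a: len(q & tokens(a.summary)), default=None)
    if best is None or not q & tokens(best.summary):
        return None   # nothing in scope; the caller should not retrieve
    return best
```

Returning `None` from `route` matters as much as returning an agent. It is the scope contract saying "no document owns this query," which the response policy can then turn into a refusal.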

CrewAI Docs positions Crews as role-based multi-agent teams. CrewAI v1.14.0, released April 7, 2026, added path and URL validation on RAG tools — a direct response to prompt-injection vectors in agentic retrieval, per CrewAI’s changelog. Crews handle the orchestration shape contract well when you want to model retrieval as one role and synthesis as another.
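
Whatever framework you use, the same kind of check is cheap to add to your own retrieval tools. The sketch below is not CrewAI's implementation; it is a standard-library illustration of path and URL validation, and `ALLOWED_HOSTS`, `ALLOWED_ROOT`, `is_safe_url`, and `is_safe_path` are hypothetical names and values.

```python
from pathlib import Path
from urllib.parse import urlparse

ALLOWED_HOSTS = {"docs.example.com"}      # hypothetical allow-list
ALLOWED_ROOT = Path("/srv/knowledge")     # hypothetical corpus directory


def is_safe_url(url: str) -> bool:
    """Accept only https URLs whose host is on the allow-list."""
    parsed = urlparse(url)
    return parsed.scheme == "https" and parsed.hostname in ALLOWED_HOSTS


def is_safe_path(candidate: str, root: Path = ALLOWED_ROOT) -> bool:
    """Reject paths that escape the corpus root, including via '..'."""
    resolved = (root / candidate).resolve()
    return resolved.is_relative_to(root.resolve())
```

An allow-list is a deliberate design choice here: a retrieved document can contain text that asks the agent to fetch something else, so the tool should only accept locations you named in advance. `Path.is_relative_to` requires Python 3.9 or later.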

The Spec Test: If your context says “use LangGraph” but never names which node owns the retrieval gate, the AI will produce a graph that retrieves on every node. The framework didn’t fail. The spec failed.

Step 3: Wire the Components in the Right Order

The build order matters because every other component depends on the retriever. Build it last, and you will rewire half the agent when the retriever’s interface doesn’t match what the gate node expected.

Build order:

Retrieval layer first — index, embeddings, retriever interface. No agent yet. Just the contract: query in, ranked chunks out. Verify it works against a fixed test set before any LLM touches it.
Gate node second — the decision logic. In LangGraph, that’s a graph node returning either a tool call or a direct response. In LlamaIndex, it’s a workflow step. In CrewAI, it’s the agent’s role description plus tool descriptions.
Response policy third — the logic that consumes retrieved chunks. Strong retrieval: answer with citations. Weak retrieval: re-query or escalate. No retrieval: refuse or fall back to general knowledge with an explicit warning.
Orchestration shape last — single agent or multi-agent. Defer this decision until the first three layers are stable. Premature multi-agent orchestration is the second-most-common failure mode after the missing retrieval gate.
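
The response policy from the third build step is small enough to sketch in full. This is one possible shape under stated assumptions: scores are treated as "higher is better" on a 0 to 1 scale, the thresholds are placeholders, and `Chunk` and `apply_policy` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    score: float   # higher means more relevant; the scale is an assumption


def apply_policy(chunks: list, strong: float = 0.75, weak: float = 0.4) -> dict:
    """Map retrieval quality to one of three response branches."""
    top = max((c.score for c in chunks), default=0.0)
    if top >= strong:
        # Cite only chunks that clear the weak bar, never the low-score tail.
        cited = [c.text for c in chunks if c.score >= weak]
        return {"action": "answer", "citations": cited}
    if top >= weak:
        return {"action": "requery", "citations": []}
    return {"action": "refuse", "citations": []}
```

Empty retrieval and very low scores land in the same branch on purpose. Both mean the agent has nothing to ground an answer on.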

For each component, your context must specify:

What it receives — query string, chat history, retrieval scope hint
What it returns — chunks with scores, decisions with reasoning, responses with citations
What it must NOT do — silently fail, hallucinate citations, retrieve when the gate says no
How to handle failure — timeout returns “retrieval unavailable,” empty index returns “no documents indexed for this scope”
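
The failure-handling line above translates into a thin wrapper around the retriever. The sketch below uses only the standard library and assumes the retriever is a plain callable; `retrieve_with_contract` and its `index_size` parameter are hypothetical. A framework-level timeout, where available, would replace the thread pool.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable


def retrieve_with_contract(
    query: str,
    retriever: Callable[[str], list],
    index_size: int,
    timeout_s: float = 2.0,
) -> dict:
    """Call the retriever once and convert failures into explicit statuses."""
    if index_size == 0:
        return {"status": "no documents indexed for this scope", "chunks": []}
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(retriever, query)
    try:
        chunks = future.result(timeout=timeout_s)
        return {"status": "ok", "chunks": chunks}
    except FutureTimeout:
        # Exactly one attempt: a timeout is reported, not retried.
        return {"status": "retrieval unavailable", "chunks": []}
    finally:
        # Do not block on a hung call; the worker thread finishes on its own.
        pool.shutdown(wait=False)
```

Other exceptions from the retriever propagate unchanged, which keeps the "must not silently fail" rule intact. Decide in your spec which of them deserve their own status string.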

The pattern LangChain Docs describes — nodes for retrieval, document grading, query rewriting, and response generation — is not a coincidence. It is the four-contract system rendered as a graph. If you specify the contracts first, the graph almost writes itself.
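
Stripped of any framework, that node sequence is a bounded loop. The sketch below is plain Python, not LangGraph code, and every name in it (`run_graph`, `min_score`, `max_rewrites`) is hypothetical. It assumes the retriever returns `(text, score)` pairs.

```python
from typing import Callable


def run_graph(
    query: str,
    retrieve: Callable[[str], list],
    rewrite: Callable[[str], str],
    generate: Callable[[str, list], str],
    min_score: float = 0.5,
    max_rewrites: int = 1,
) -> str:
    """Retrieve, grade, optionally rewrite, then generate. Never loops forever."""
    current = query
    for attempt in range(max_rewrites + 1):
        scored = retrieve(current)                             # retrieval node
        relevant = [t for t, s in scored if s >= min_score]    # grading node
        if relevant:
            return generate(query, relevant)                   # generation node
        if attempt < max_rewrites:
            current = rewrite(current)                         # rewrite node
    # No grounded context after all attempts; generate is expected to refuse.
    return generate(query, [])
```

The `max_rewrites` bound is the piece that is easy to forget. Without it, a query the corpus cannot answer turns into an endless rewrite cycle.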

Step 4: Validate Against the Failure Modes

The validation criteria for an agentic RAG system are not “does it answer questions correctly.” They are “does it fail correctly when it should.” A system that always answers will, on the queries it cannot ground, answer by fabricating.

Validation checklist:

Retrieval gate fires on factual queries, stays silent on conversational ones — failure looks like: every turn retrieves, including “hi” and “thanks.”
Weak retrieval triggers a re-query or refusal — failure looks like: low-relevance chunks get cited as if they were strong matches.
Citations point to retrieved chunks, not invented ones — failure looks like: the citation text doesn’t appear in the retrieved context.
The agent refuses to answer when retrieval is empty and the question is out-of-scope — failure looks like: confident fabrication when the index has nothing.
Timeouts produce a graceful error, not a stuck loop — failure looks like: the agent retries the retriever indefinitely when the vector DB is down.
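
The citation check in this list is the easiest one to automate. A minimal sketch, assuming citations are quoted as text spans and that a whitespace-insensitive, case-insensitive substring match is strict enough for your corpus; `ungrounded_citations` is a hypothetical helper name.

```python
def ungrounded_citations(citations: list, retrieved: list) -> list:
    """Return citations whose text does not appear in any retrieved chunk."""
    # Collapse whitespace and lowercase so formatting differences do not matter.
    normalized = [" ".join(chunk.split()).lower() for chunk in retrieved]
    missing = []
    for quote in citations:
        needle = " ".join(quote.split()).lower()
        if not any(needle in chunk for chunk in normalized):
            missing.append(quote)
    return missing
```

Run it on every response in your test set. A non-empty result is the "invented citation" failure described above, and it should fail the build.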

[Figure: Four-contract decomposition mapping retrieval gate, scope, response policy, and orchestration onto LangGraph, LlamaIndex, and CrewAI. Caption: The four contracts every retrieval-augmented agent needs, and which framework owns each one in 2026.]

Security & compatibility notes:
LangGraph RCE (CVE-2026-27794): Remote code execution in checkpoint deserialization. Fix: upgrade langgraph-checkpoint to 4.0.0; affected versions below 3.0 per NVD.
LangGraph SQL injection (CVE-2025-67644): SQLite checkpoint implementation accepts metadata filter keys without sanitization, per The Hacker News. Patched in current builds — pin versions explicitly.
LangGraph ToolNode breaking change: langgraph-prebuilt 1.0.2 added a required runtime parameter to ToolNode.afunc, breaking code that overrides afunc. Pin to a known-good version and update overrides.
LangChain Core secrets exposure (CVE-2025-68664): Unsafe JSON serialization fallback exposed secrets via serialization injection. Removed in current versions per NVD — verify your langchain-core pin.
LlamaIndex deprecation: Pre-2025 tutorials using OpenAIAgent and ReActAgent are superseded by AgentWorkflow and FunctionAgent. Use Workflows 1.0 patterns or your generated code will reference removed APIs, per LlamaIndex Blog.

These notes are not optional. Three of the five are CVEs, and the two warnings will produce import errors or silent behavior changes if your AI tool generates code against old patterns. Include them in the context file you hand to your coding agent.

Common Pitfalls

What You Did	Why AI Failed	The Fix
Said “build an agentic RAG system” with no gate spec	AI defaulted to retrieve-every-turn, the vanilla pattern	Specify the retrieval gate as the first node. State when it fires and when it doesn’t.
Named the framework but not the version	AI generated code against pre-Workflows-1.0 LlamaIndex or pre-1.x CrewAI APIs	Pin the version explicitly: `langgraph==1.1.10`, `llama-index-core` on the 0.14.x line, `crewai==1.14.0`
Asked for multi-agent without specifying handoff	AI invented a hierarchy that retrieves the same documents three times	Define the handoff contract first. Single agent until the second agent has a distinct role and a distinct retrieval scope.
Skipped the response policy for weak retrieval	AI generated code that cites low-score chunks as if they were authoritative	Spec all three retrieval states: strong, weak, empty. Each gets a different response branch.
No timeout on the retriever	AI generated happy-path code that hangs when the vector DB is slow	Use LangGraph’s per-node timeouts (March 2026 release) or equivalent. State the timeout as a constraint in the spec.
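
Version pins from the table are easy to state and easy to drift from. The sketch below checks the running environment against a pin list using the standard library's `importlib.metadata`; the `PINS` values are copied from the table and should be replaced with versions you have verified yourself, and `pin_mismatches` is a hypothetical helper.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Callable, Optional

# Pins copied from the pitfalls table; adjust to the versions you verified.
PINS = {"langgraph": "1.1.10", "crewai": "1.14.0"}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def pin_mismatches(pins: dict, lookup: Callable = installed_version) -> dict:
    """Return {package: (expected, found)} for every pin that does not match."""
    return {
        pkg: (expected, lookup(pkg))
        for pkg, expected in pins.items()
        if lookup(pkg) != expected
    }
```

Calling `pin_mismatches(PINS)` at startup or in CI gives you an empty dict when the environment matches the spec. The `lookup` parameter exists so the check can be tested without installing anything.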

Pro Tip

Frameworks are not religions. The 2026 production pattern, per Knowlee Blog, is composition — teams use LangGraph for control flow plus LlamaIndex for retrieval plus CrewAI or another layer for role abstractions. No monoliths. If your spec assumes you must pick one, your spec is the problem.

The mental shift: each framework is a specialization. LangGraph specializes in stateful graphs with explicit control. LlamaIndex specializes in document-grounded retrieval. CrewAI specializes in role-based multi-agent prototyping. A single retrieval-augmented agent might use all three — LangGraph as the outer loop, LlamaIndex as the retriever inside one node, CrewAI for the workflow that built the index in the first place. The contracts are stable. The implementations compose.

Frequently Asked Questions

Q: How do you build a retrieval-augmented agent step by step in 2026? A: Write the four contracts first: retrieval gate, scope, response policy, orchestration. Then pick the framework that matches your dominant contract: LangGraph for control, LlamaIndex for document grounding, CrewAI for roles. Build retrieval before agents, because premature multi-agent designs hide the gate.

Q: How do you use retrieval-augmented agents for enterprise knowledge search? A: Use per-document agents with their own retrieval scope, then a top-level agent that routes queries to the right one, per LlamaIndex Blog. Watch out for per-document agents that all index the same source. Deduplicate at the corpus level or you pay for the same retrieval three times.

Q: When should you use an agentic RAG framework instead of vanilla RAG? A: When the answer to “should I retrieve right now?” is sometimes no. Vanilla RAG is correct when every query needs retrieval. Agentic RAG fits when retrieval is conditional, scope shifts per query, or weak retrieval needs to trigger a re-query or refusal.

Your Spec Artifact

By the end of this guide, you should have:

A four-contract document — retrieval gate, retrieval scope, response policy, orchestration shape — written before any code
A framework selection rationale that names which contract each chosen framework owns, with version pins
A validation checklist of the five failure modes your system must handle correctly before you ship it

Your Implementation Prompt

Paste this into Claude Code, Cursor, or Codex when you’re ready to generate the first scaffold. Fill the bracketed values from your four-contract document. The placeholders map one-to-one to the Step 2 checklist — every bracket is a decision you have already made.

Generate a retrieval-augmented agent in Python using [framework: langgraph==1.1.10 | llama-index-core==0.14.21 | crewai==1.14.0].

Four contracts (the spec — do not deviate):

1. Retrieval gate: retrieve only when [condition, e.g., "the user query contains a factual claim or named entity from the corpus domain"]. Skip retrieval when [condition, e.g., "the query is conversational, a greeting, or a clarification of the previous response"].

2. Retrieval scope:
   - Index: [name + embedding model]
   - Filters: [metadata filters per query type]
   - Top-k: [number, default 5]

3. Response policy:
   - Strong retrieval (top score > [threshold]): answer with inline citations to retrieved chunks
   - Weak retrieval (top score between [low] and [threshold]): re-query with [rewrite strategy] or escalate
   - Empty retrieval: respond with [refusal text] and do not invent citations

4. Orchestration shape: [single agent | multi-agent with handoff from {role A} to {role B}]

Error handling:
- Retriever timeout after [N] seconds returns "retrieval unavailable" — no retry loop
- Empty index returns "no documents indexed for this scope"
- LLM off-spec response triggers [validation step]

Security constraints:
- Pin langgraph-checkpoint to 4.0.0+ (CVE-2026-27794)
- Validate retrieval tool inputs (path/URL injection)
- Do not use deprecated LlamaIndex OpenAIAgent or ReActAgent — use AgentWorkflow

Validation: produce a test script that exercises all five failure modes from the spec's validation checklist.
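
Before pasting the prompt above, it helps to confirm that no bracketed placeholder was left unfilled. A minimal sketch, assuming your filled-in values do not themselves contain square brackets; `unfilled_placeholders` is a hypothetical helper, not part of any tool named in this guide.

```python
import re

# Matches a single-line [ ... ] span with no nested brackets.
PLACEHOLDER = re.compile(r"\[[^\[\]\n]+\]")


def unfilled_placeholders(prompt: str) -> list:
    """Return every bracketed span still present in the prompt text."""
    return PLACEHOLDER.findall(prompt)
```

An empty list means every decision in the template has been made. Anything else is a contract you have not written yet.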

Ship It

You now have a four-contract mental model that turns “build an agentic RAG system” into a decomposition the AI can actually generate. The framework choice stops being a religion and starts being a question of which contract you care about most. Once the contracts are written, the version pins and security caveats are the boring-but-critical layer that keeps the system shippable instead of CVE-shaped.

Sources

LangChain Docs: Build a custom RAG agent with LangGraph - Official agentic-RAG pattern with generate_query_or_respond node and graph architecture.
LangChain’s changelog: Changelog — Docs by LangChain - LangGraph 1.1.10 release notes including type-safe streaming and per-node timeouts.
LlamaIndex Docs: Welcome to LlamaIndex — Developer Documentation - AgentWorkflow primitive on top of Workflows 1.0.
LlamaIndex Blog: Agentic RAG With LlamaIndex: Architecture Guide - Per-document agents with top-level orchestration pattern.
CrewAI Docs: CrewAI Documentation - Agents, Tasks, Crew, and Flow primitives.
CrewAI’s changelog: CrewAI Changelog - v1.14.0 RAG tool validation additions.
NVD: CVE-2026-27794 Detail - LangGraph checkpoint deserialization RCE.
NVD: CVE-2025-68664 Detail - LangChain Core secrets exposure via serialization injection.
The Hacker News: LangChain, LangGraph Flaws Expose Files, Secrets, Databases - SQL injection in SQLite checkpoint implementation.
Knowlee Blog: Agentic AI Frameworks Compared 2026: LangGraph, CrewAI, AutoGen - 2026 composition pattern across frameworks.

Aha Moments

MONA

A retrieval-augmented agent is a decision system layered on top of a similarity search. The interesting part is not the retriever — vector similarity has been a solved problem for years. The interesting part is the gate. When the agent decides whether to retrieve, it is implicitly classifying the query into “this needs grounding” or “this does not.” That classification is a learned behavior of the underlying model, shaped by the tool descriptions in the prompt. The spec Max describes is, in effect, a hand-engineered prior on a behavior that would otherwise be inferred from training data. The contract is the inductive bias.

DAN

Picking up where Mona left off: the composition pattern is the real story for buyers. Workflow orchestration for AI is no longer a single-vendor decision. Teams are stacking LangGraph for control, LlamaIndex for retrieval, and CrewAI for roles, and the vendors know it. That changes procurement. The question shifts from “which platform” to “which primitive owns which contract.” That is a more mature market, and it favors specialists over generalist platforms.

ALAN

Both of you are describing the system as it exists. I want to ask about the system as it acts. Code-execution agents that decide whether to retrieve are also deciding whose documents count as ground truth. The retrieval scope contract is, in effect, a policy decision about which sources are authoritative. Max is right that you should write it down, but writing it down does not make it less consequential. Once the agent refuses to answer outside its corpus, the corpus is the world the user gets to see. Who audits which documents the corpus excludes?

AI-assisted content, human-reviewed. Images AI-generated.