Retrieval-Augmented Generation

Building retrieval-augmented generation systems end to end — chunking, embeddings and vector search, hybrid retrieval, reranking, query transformation, and grounding and faithfulness guardrails.

91 articles, 981 min total read. Updated May 4, 2026

This theme is curated by our AI council.

What topics does this domain cover?

15 topics

Each topic below is a key concept in this domain. Pick any for the full picture: foundations, implementation, what's changing, and risks to consider.

Agentic RAG →

Agentic RAG is a retrieval-augmented generation pattern where an LLM agent decides what to retrieve, when to retrieve …

5 articles

Contextual Retrieval →

Contextual retrieval is a set of techniques that enrich document chunks with surrounding context before indexing them …

5 articles

Embedding →

Embeddings are dense vector representations that map words, sentences, or other data into continuous numerical spaces …

6 articles

Hybrid Search →

Hybrid search combines two ways of finding documents: dense vector search, which matches by meaning, and sparse keyword …

7 articles

Long-Context vs RAG →

Long-Context vs RAG is the architectural choice between loading whole documents into a model's expanded context window …

6 articles

Multi-Vector Retrieval →

Multi-vector retrieval is a search approach that represents each document as multiple vectors rather than a single …

5 articles

Query Transformation →

Query transformation is the set of techniques that rewrite, expand, or decompose a user's question before it reaches the …

8 articles

RAG Evaluation →

RAG Evaluation is the practice of measuring how well a retrieval-augmented generation pipeline performs across two …

7 articles

RAG Guardrails and Grounding →

RAG guardrails and grounding are the techniques that keep generated answers tied to retrieved evidence rather than model …

7 articles

Reranking →

Reranking is a second-stage step in retrieval systems where a more accurate model rescores the top candidates returned …

6 articles

Retrieval-Augmented Generation →

Retrieval-Augmented Generation (RAG) is an architecture pattern that connects a large language model to an external …

7 articles

Sentence Transformers →

Sentence Transformers is a framework that uses contrastive learning and siamese networks to produce sentence-level …

5 articles

Similarity Search Algorithms →

Similarity search algorithms are the core mathematical methods used to find the nearest matching vectors in …

6 articles

Sparse Retrieval →

Sparse retrieval finds documents by matching weighted terms rather than dense vectors. Classic methods like BM25 score …

5 articles

Vector Indexing →

Vector indexing encompasses the data structures and algorithms that make approximate nearest-neighbor search practical …

6 articles

Four perspectives on this domain

MONA's articles build your mental model — how things work, why they work that way, and what intuition to develop.

Updated May 4, 2026

Concepts covered

Side-by-side diagram contrasting a long-context KV-cache stack with a RAG vector-index pipeline.

MONA explainer 13 min May 4, 2026

Inside Long-Context vs RAG: KV-Cache, Vector Indexes, and the Stack You Need to Compare Them

Long-context models and RAG pipelines compete for the same job with different parts. A component-by-component map of KV caches, vector indexes, and trade-offs.

Two diverging pathways representing long-context windows and retrieval-augmented generation handling knowledge in large language models

MONA explainer 10 min May 4, 2026

Long-Context vs RAG: How Each Handles Knowledge in 2026

Long-context and RAG sound interchangeable. They are not. The mechanics, failure modes, and cost curves diverge — see what each does in 2026.

Diagram of long-context attention dispersion vs RAG retrieval — accuracy degrades in the middle of a long input window

MONA explainer 12 min May 4, 2026

Lost in the Middle, 1,250x Cost: The Limits of Long-Context vs RAG

Long-context windows promise simplicity, but lost-in-the-middle, 1,250x cost gaps, and effective-context collapse at 32K make RAG indispensable at scale.

Three-layer diagram of RAG faithfulness: citation generation, confidence scoring, and abstention as separable stages

MONA explainer 13 min May 4, 2026

Citation, Confidence, and Abstention: The 3 Layers of RAG Faithfulness

RAG grounding splits into three layers: citation generation, confidence scoring, and abstention. See how each fails differently and what each actually measures.

Diagram of sparse retrieval: documents represented as weighted term vectors over a vocabulary, scored against a query through an inverted index

MONA explainer 12 min May 4, 2026

What Is Sparse Retrieval and How BM25 and SPLADE Represent Documents as Weighted Term Vectors

Sparse retrieval encodes documents as weighted term vectors. Here is how BM25 and SPLADE produce those weights and why they beat dense models on exact terms.

Layered diagram showing retrieval metrics like Recall and MRR feeding into generation metrics like Faithfulness for RAG evaluation

MONA explainer 11 min May 4, 2026

From Recall and MRR to Faithfulness: RAG Evaluation Prerequisites

RAG evaluation needs more than one accuracy score. Learn the IR and generation metrics — Recall, MRR, Faithfulness, Answer Relevancy — you need first.

MONA presenting a split RAG pipeline diagram where retrieval and generation stages are scored by separate evaluation metrics

MONA explainer 13 min May 4, 2026

RAG Evaluation Explained: Faithfulness, Relevance, Context Metrics

RAG evaluation splits your pipeline into retriever and generator and scores each. Learn how Faithfulness, Relevance, and Context metrics expose silent failures.

Visualization of sparse vector retrieval comparing lexical token matches against learned token expansions over an inverted index

MONA explainer 11 min May 4, 2026

From TF-IDF to Learned Sparse: Prerequisites and Hard Limits of BM25, SPLADE, and ELSER

Sparse retrieval starts with BM25 and ends with ELSER and SPLADE-v3. Learn the math, the prerequisites, and where each method actually breaks down.

A judge evaluating a retrieval pipeline that is also generating the judge's evidence — recursive RAG evaluation loop

MONA explainer 12 min May 4, 2026

LLM-as-Judge Bias and the Technical Limits of RAG Evaluation

RAG evaluation frameworks like RAGAS rely on LLM judges with documented biases. Why faithfulness and answer relevancy scores are softer than they look.

Diagram of a RAG pipeline split into three measurement points — retrieval relevance, generation faithfulness, answer relevance — with a triangle overlay

MONA explainer 12 min May 4, 2026

Prerequisites for RAG Grounding: Retrieval Quality, the RAG Triad, and Faithfulness Metrics

Before you bolt guardrails onto a RAG pipeline, learn the RAG Triad — context relevance, groundedness, answer relevance — and how faithfulness gets measured.

Diagram showing retrieved document chunks anchoring an LLM's generated tokens to verified evidence in a RAG pipeline

MONA explainer 11 min May 4, 2026

What Are RAG Guardrails and How Grounding Stops Hallucinations

RAG guardrails and grounding force generated answers to stay tied to retrieved sources. Learn how the mechanism works in 2026 — and why it still leaks.

Hallucination detection ceiling concept showing scored citations passing through layered RAG guardrail filters

MONA explainer 9 min May 4, 2026

Why RAG Grounding Still Fails: The Hallucination Detection Ceiling

RAG hallucination detection has a certified ceiling. Why HHEM, Lynx, TruLens, and NeMo Guardrails miss the hardest reasoning-model failures in 2026.

Layered prerequisite stack of retrieval primitives feeding an agent loop with branching reliability paths

MONA explainer 11 min May 3, 2026

From RAG to Agents: Prerequisites and Hard Limits of Agentic RAG

Agentic RAG is a stack with new failure modes, not an upgrade. Learn the prerequisites and the four physics that limit multi-step retrieval pipelines.

Diagram of document chunks with prepended context strings flowing into a hybrid retrieval index

MONA explainer 9 min May 3, 2026

Contextual Retrieval: How Prepended Context Reduces RAG Failures

Contextual retrieval prepends 50-100 tokens of LLM-generated context to each chunk before indexing. Anthropic reports a 67% drop in retrieval failures.

Diagram of chunking, hybrid search, and reranking layered into contextual retrieval, with hard scaling limits highlighted

MONA explainer 11 min May 3, 2026

Contextual Retrieval: Prerequisites and Hard Limits at Scale

Contextual Retrieval cuts RAG failure rates, but at a cost. Learn the prerequisites — chunking, hybrid search, reranking — and where it breaks at scale.

Diagram of an LLM agent routing a query across multiple retrieval sources before answering

MONA explainer 9 min May 3, 2026

What Is Agentic RAG and How LLM Agents Decide What to Retrieve

Agentic RAG turns retrieval into a decision: an LLM agent chooses whether to retrieve, which source to query, and whether the answer is good enough.

Diagram of query transformation closing the embedding-space gap between short user questions and long document passages

MONA explainer 11 min Apr 30, 2026

How HyDE, Multi-Query, and Step-Back Improve RAG Retrieval Recall

Query transformation rewrites user prompts before retrieval. Learn how HyDE, Multi-Query, and Step-Back Prompting close the question-answer geometry gap.

Cross-encoder reranker scaling: latency grows with candidate count and token length, plus MS MARCO domain drift

MONA explainer 14 min Apr 30, 2026

Cross-Encoder Reranker Limits: Latency Walls and Domain Drift

Cross-encoder rerankers hit two architectural walls: latency scales linearly with candidates and quadratically with tokens, plus MS-MARCO domain drift.

Two-stage retrieve-and-rerank pipeline where a fast bi-encoder retrieves candidates and a cross-encoder reorders them

MONA explainer 12 min Apr 30, 2026

Cross-Encoders, Bi-Encoders, and Listwise Scoring in Reranking

A reranker reorders the top candidates from vector search using a heavier model. Cross-encoders, bi-encoders, and listwise scoring explained.

Diagram of a compound query splitting into parallel retrievable sub-queries via decomposition, routing, and RAG-Fusion

MONA explainer 11 min Apr 30, 2026

From Recall Failures to RAG-Fusion: Prerequisites and Inner Workings of Query Decomposition and Routing

Vector retrievers lose compound questions to a single point. Query decomposition, routing, and RAG-Fusion fix it by reshaping retrieval geometry.

Three structural limits of query transformation: latency tax, query drift, hallucinated documents from LLM rewriters

MONA explainer 12 min Apr 30, 2026

Query Transformation Limits: Latency Tax, Drift, Hallucinated Documents

Query transformation in RAG hits three hard limits: latency tax from extra LLM calls, query drift on simple inputs, and hallucinated documents from HyDE.

Two-stage retrieval diagram showing bi-encoder candidate selection followed by cross-encoder reranking for higher precision

MONA explainer 11 min Apr 30, 2026

What Is Reranking and Why Cross-Encoders Rescore RAG Retrieval

Reranking splits recall and precision into two stages. See how cross-encoders rescore retrieved documents and why a bi-encoder alone cannot match them.

Diagram of hybrid search: BM25 lexical index and dense vector index merged by reciprocal rank fusion into one ranked list

MONA explainer 11 min Apr 29, 2026

BM25, SPLADE, and Reciprocal Rank Fusion: The Building Blocks of Production Hybrid Search

BM25, SPLADE, and reciprocal rank fusion each solve a different retrieval problem. Here's how the three combine into a production hybrid search system.

Two ranked retrieval lists — keyword and semantic — fusing into a single hybrid result for RAG pipelines

MONA explainer 12 min Apr 29, 2026

What Is Hybrid Search and How BM25 Plus Dense Vectors Beat Either Alone in RAG

Hybrid search fuses BM25 keyword retrieval with dense vector search using reciprocal rank fusion. Why two ranked lists beat either alone in RAG pipelines.
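As a rough illustration of the fusion step named in this summary, here is a minimal sketch of reciprocal rank fusion over two ranked lists. The document ids, the function name, and the constant k=60 (a commonly used default) are assumptions for illustration and are not taken from the article itself.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in;
    k=60 is an assumed, commonly used default, not a tuned value.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists from a keyword retriever and a dense retriever.
bm25_hits = ["doc_a", "doc_b", "doc_c"]
dense_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

Because the fusion uses only ranks, the raw BM25 and cosine scores never have to be put on a common scale, which is the usual reason this method is chosen over weighted score sums.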

RAG pipeline as a chain of transformations: chunking, embedding, vector storage, retrieval, and reranking

MONA explainer 12 min Apr 29, 2026

From Chunking to Reranking: RAG Pipeline Components and Prerequisites

Every RAG pipeline runs five components — chunker, embedder, vector store, retriever, reranker. Here is what each one does and where each one breaks.

Hybrid search fusion: BM25 and vector score distributions colliding in a merge step that yields inconsistent rankings

MONA explainer 13 min Apr 29, 2026

Score Mismatch, Tuning Hell: The Hard Limits of Hybrid Search Fusion

Hybrid search merges BM25 and vector results, but the fusion step has hard limits. Score mismatch, RRF blindness, and tuning hell — explained.

Particles forming a knowledge retrieval graph that grounds an LLM response in source documents

MONA explainer 10 min Apr 29, 2026

What Is RAG and How LLMs Use Vector Search to Ground Their Answers

Retrieval-augmented generation pairs an LLM with a vector index so answers are grounded in real documents — not just training data. The mechanism, explained.

Three structural failure surfaces in production RAG: retrieval misses, position bias on long context, grounding conflicts

MONA explainer 11 min Apr 29, 2026

Why RAG Still Fails in Production: Retrieval, Chunking, Grounding

RAG fails in production because retrieval, chunking, and grounding hit structural limits — not because of bugs. Why correct retrieval still hallucinates.

Geometric visualization of sentence embedding vectors collapsing into a narrow cone in high-dimensional space

MONA explainer 11 min Mar 24, 2026

From Cosine Similarity to Anisotropy: Prerequisites and Hard Limits of Sentence-Level Embeddings

Sentence Transformers encode meaning as geometry. Learn the prerequisites, token limits, and anisotropy traps that silently cap your retrieval quality.

Geometric visualization of sentence vectors converging in embedding space through contrastive learning

MONA explainer 9 min Mar 24, 2026

What Is Sentence Transformers and How Contrastive Learning Produces Sentence-Level Embeddings

Sentence Transformers turns transformers into sentence encoders via contrastive learning. Covers bi-encoders, loss functions, pooling, and hard negative mining.

Comparison of single-vector and token-level multi-vector retrieval showing storage and latency cost explosion

MONA explainer 9 min Mar 24, 2026

From Embeddings to Token-Level Matching: Prerequisites and Hard Limits of Multi-Vector Search

Multi-vector retrieval trades storage and latency for token-level precision. Learn the prerequisites, storage math, and scaling bottlenecks before you commit.

Geometric grid of per-token vectors with MaxSim scoring paths connecting query and document token matrices

MONA explainer 10 min Mar 24, 2026

What Is Multi-Vector Retrieval and How Late Interaction Replaces Single-Embedding Search

Multi-vector retrieval stores per-token embeddings instead of one vector per document. Learn how ColBERT MaxSim scoring preserves nuance dense search destroys.
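The MaxSim scoring mentioned in this summary can be sketched in a few lines. This is a toy version under stated assumptions: the two-dimensional vectors are invented for illustration, they are not normalized, and a real late-interaction system would use learned per-token embeddings and batched matrix operations.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score.

    For each query token vector, take its best dot product against any
    document token vector, then sum those maxima over the query tokens.
    """
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Hypothetical per-token embeddings: two query tokens, two candidate docs.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_1 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # has a close match per query token
doc_2 = [[0.5, 0.5]]                          # one blended vector only

print(maxsim_score(query, doc_1))
print(maxsim_score(query, doc_2))
```

In this toy case doc_1 scores higher because each query token finds its own best-matching document token, while doc_2's single blended vector matches both only partially.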

Geometric visualization of distance metrics converging into layered graph structures for nearest neighbor search

MONA explainer 10 min Mar 24, 2026

From Distance Metrics to Graph Traversal: Prerequisites for Understanding Vector Index Internals

Distance metrics, high-dimensional geometry, exact vs approximate search — the prerequisites you need before HNSW and IVF parameters make sense.

Abstract visualization of expanding graph nodes consuming memory while search accuracy fractures at scale

MONA explainer 10 min Mar 24, 2026