MAX guide 14 min read April 30, 2026

HyDE vs Multi-Query vs Step-Back: Choosing RAG Query Transforms

Q: When should you use HyDE instead of multi-query retrieval?

Use HyDE when the query is generic or exploratory and you have no labeled query–document pairs to tune on. Use multi-query when the query is ambiguous and could be phrased five different ways. The tell: if a colleague would ask “do you mean X or Y?” — that’s multi-query territory. If they’d say “I’m not sure what to ask, write me an example,” that’s HyDE.

Q: When does step-back prompting work better than query decomposition?

Step-back wins on single questions that are too narrow — a precise date, a specific model number, a technical clause where literal terms don’t appear in any passage. Decomposition wins on compound questions that genuinely span topics or sources. The diagnostic: can the question be answered from one document? If yes, step-back. If no, decompose.

Q: How do you choose between query routing and RAG-Fusion for production RAG?

They solve different problems. Routing picks which index to query when you have multiple sources — one cheap LLM call, one selector. RAG-Fusion picks which phrasing wins by running multiple retrievals against one index and merging via Reciprocal Rank Fusion. Use routing first if you have two or more sources. Reach for RAG-Fusion only inside a chosen index when ambiguity is the dominant failure mode and your latency budget can absorb N retrievals.

Decision tree for selecting a RAG query transformation: HyDE, multi-query, step-back, routing, and decomposition.

Table of Contents

TL;DR

The wrong query transformation costs you latency without buying recall — pick the one that matches your failure mode, not the trendy one.
HyDE wants generic, exploratory queries. Multi-query wants ambiguity. Step-back wants over-specific facts. Decomposition wants compound questions. Routing wants multiple sources.
In production, the choice should be made at runtime by an agent — not baked into the pipeline.

A team I worked with last quarter had stacked HyDE, multi-query rewriting, and a reranker on top of every retrieval call. Their p95 latency had tripled. Recall on the evaluation set had barely moved. They were paying for three transformations and getting the value of none.

The fix wasn’t a better library. It was a decision tree. Each transformation solves a specific class of query failure. Use the wrong one and you add overhead. Use the right one and the retriever stops missing.

Before You Start

You’ll need:

A working Retrieval Augmented Generation pipeline you can swap retrievers in and out of (LlamaIndex or LangChain v1 are the easy paths).
Understanding of Query Transformation as a category — and why dense retrieval misses on certain query shapes.
A held-out evaluation set of real user queries with known relevant documents. Without one, you cannot tell which transformation is helping versus hurting.

This guide teaches you: how to map your retrieval failures to a specific query transformation rather than stacking everything and hoping recall climbs.

The “Stack Everything” Anti-Pattern

You wire HyDE in front of your retriever because a blog said it lifts recall. You add a multi-query rewriter because another blog said it covers ambiguity. You bolt on RAG-Fusion because the demo looked clean. Now every user query fires several LLM calls before a single embedding lookup.

It worked on Friday. On Monday, p95 latency doubled because traffic shifted from broad search queries to short factoid lookups — exactly the queries where HyDE generates hallucinated specifics that the encoder still has to filter out.

Step 1: Map the Five Query Transformation Patterns

Before you pick anything, name the five patterns and what each one solves. Mixing them up is the most common failure mode I see.

Your transformation toolbox has these parts:

HyDE (Hypothetical Document Embeddings) — the LLM writes a plausible answer document, you embed that document, then retrieve against it. Best for zero-shot or low-supervision domains where you have no labeled query–document pairs and the corpus shifts faster than you can fine-tune. Per the HyDE paper, the encoder filters most hallucinated specifics, but the latency cost is real.
Multi-query rewriting — the LLM generates several alternative phrasings of the user’s question, you retrieve for each, you union the results. LangChain’s MultiQueryRetriever defaults to three rewrites (LangChain Reference). Best for ambiguous queries where the user’s wording underspecifies their intent.
Step-back prompting — the LLM rewrites an over-specific question into a higher-level abstraction, retrieves on the abstraction, then answers the original question grounded on those broader passages. Best when the literal query terms are too narrow for any single passage to match.
Query decomposition — the LLM splits one compound question into sub-questions, each routed to a tool or sub-index, then synthesizes the answers. LlamaIndex’s SubQuestionQueryEngine is the canonical implementation. Best for multi-hop questions that span sources.
Query routing — the LLM (or a small classifier) picks which index or tool to retrieve from before any embedding lookup happens. Selectors like LLMSingleSelector and LLMMultiSelector in LlamaIndex give you explicit control. Best when you have multiple distinct knowledge sources.

The Architect’s Rule: if you can’t name the failure mode each transformation solves, you’ll stack all five and pay for none of them.

A sixth pattern — RAG-Fusion — is multi-query plus Reciprocal Rank Fusion (Raudaschl’s GitHub repository). Treat it as a flavor of multi-query, not a separate axis.

Step 2: Lock Down the Decision Inputs

The transformation is downstream of three things you should specify before touching code.

Decision input checklist:

Query shape distribution — what fraction of real user queries are short factoids, what fraction are broad/exploratory, what fraction are compound? Sample your logged queries and tag them. This is the highest-signal input.
Corpus characteristics — is the domain well-specified (legal, medical records, internal docs) or fast-evolving (news, research, product changelogs)? HyDE’s value is highest when no labeled relevance data exists.
Latency budget — every transformation costs at least one extra LLM call. Multi-query and RAG-Fusion cost N retrievals. If your p95 budget is tight, HyDE on a small LLM alone can blow it.
Number of distinct knowledge sources — one corpus or many? Routing only earns its place when you have two or more sources where the right index needs to be picked first.
Confidence signal availability — can you compute a query–document confidence score (BM25 score, top-1 dense score, hybrid score from Hybrid Search)? Without one, the “fall back to plain RAG when confidence is high” pattern from HyDE research stays theoretical.

The Spec Test: if your transformation choice doesn’t change when you swap your query shape distribution or your latency budget, you haven’t actually made a decision — you’ve copied a default.

Step 3: Match the Transformation to the Failure Mode

Now wire each transformation to the specific failure it fixes. Build order matters because each later choice depends on inputs from the earlier ones.

Build order:

Routing first — if you have multiple sources, route before transforming. A bad transformation on the wrong index wastes more than it saves. Routing is also the cheapest layer to debug: one LLM call, one selector decision, traceable.
Decomposition second — if your query distribution shows compound questions (“compare X and Y”), add a SubQuestionQueryEngine-style decomposer above your retrievers (LlamaIndex Docs). Each sub-question becomes a routable unit, which is why decomposition slots above routing in the decision graph.
Step-back or HyDE third — pick one, not both. Step-back when queries are too specific (precise dates, narrow facts, model numbers). HyDE when queries are too generic and you have no labeled pairs to tune on. The original Step-Back paper reports gains on PaLM-2L of +27% on TimeQA and +7% on MuSiQue — those numbers are dataset-specific, but the pattern (abstraction lifts narrow-fact recall) generalizes.
Multi-query last — only add multi-query rewriting when your evaluation set shows ambiguity-driven misses after the previous layers are tuned. Otherwise you are paying N×retrieval cost for marginal recall lift.

For each transformation, your context must specify:

What it receives (raw user query? routed sub-query? abstracted query?)
What it returns (single transformed query? N queries? a fused ranked list?)
What it must NOT do (HyDE: don’t run on fact-bound personal queries without a confidence gate; multi-query: don’t run on already-disambiguated routed queries)
How to handle failure (LLM rewrite returns nonsense → fall back to original query and log the event)

Compatibility & caveats note:
LangChain v1 migration: MultiQueryRetriever moved from langchain.retrievers into the langchain_classic package, and the parser_key parameter was removed. Update import paths in any code generated from older tutorials (LangChain v1 migration guide).
LlamaIndex docs URL: canonical docs moved from docs.llamaindex.ai to developers.llamaindex.ai (301 redirect). Use the new domain in citations and bookmarks.
HyDE in fact-bound domains: a 2025 study on small Gemma models documented 25–60% latency overhead and elevated hallucination rates (“Assessing RAG and HyDE on Gemma”). Treat HyDE as “do not use for personal or private fact retrieval without a confidence gate.”

Step 4: Validate the Choice With Real Queries

A transformation either lifts your evaluation metrics or it doesn’t. Decide before you ship, not after.

Validation checklist:

Recall@k on held-out queries — run the same queries with and without the transformation. Failure looks like: recall@10 barely moves. That’s noise; rip the transformation out.
Per-query-shape segmentation — break recall down by query type (factoid, exploratory, compound). Failure looks like: HyDE lifts exploratory recall but tanks factoid recall. Add a query classifier and gate HyDE behind it.
Latency cost vs. recall delta — plot p95 latency on x, recall@10 on y, before vs. after. Failure looks like: latency up, recall flat or up by less than your SLA tolerance. Cut.
Hallucination check on transformed queries — log the LLM-generated documents (HyDE) or rewrites (multi-query) for a random sample. Failure looks like: rewrites change the user’s intent. Tighten the rewrite prompt or drop the transformation for that query class.
Self-consistency across reruns — run the same query several times. Failure looks like: top-3 documents differ across runs. Lower the temperature on the rewrite LLM or pin the seed.

Decision tree mapping query shape, corpus, and latency budget to HyDE, multi-query, step-back, decomposition, or routing — Each transformation solves a specific failure mode — pick by mapping your query distribution to the right branch.

Common Pitfalls

What You Did	Why AI Failed	The Fix
Stacked HyDE + multi-query + reranker on every query	Three transformations on a query that needed zero — pure latency tax	Route first; pick at most one transformation per query class
Used HyDE on personal/private fact retrieval	LLM hallucinated specifics that didn’t filter out on a small encoder	Add a confidence gate — fall back to plain RAG when query–doc score is high
Used multi-query on already-precise queries	Three rewrites of “what is X version Y” all retrieved the same docs	Gate multi-query behind an ambiguity classifier
Added step-back without changing the answer-time prompt	Retrieved on the abstraction, but answered without grounding	Specify the two-stage contract: retrieve on abstraction, answer on original question with abstracted documents as context
Skipped routing because “the LLM can figure it out”	Wrong index queried, recall was a lottery	Add an explicit `LLMSingleSelector` — one cheap call, one auditable decision

Pro Tip

The mature 2026 pattern isn’t picking one transformation. It’s wrapping the choice in an agent loop. Route the query, classify its shape, dispatch to the right transformation, retrieve, grade the retrieved set with a small judge model, and self-correct if the grade is below threshold. Per dmflow.chat’s RAG transformation guide, this is what production Agentic RAG systems are converging on. You design the decision graph; the agent walks it at runtime.

Frequently Asked Questions

Q: When should you use HyDE instead of multi-query retrieval? A: Use HyDE when the query is generic or exploratory and you have no labeled query–document pairs to tune on. Use multi-query when the query is ambiguous and could be phrased five different ways. The tell: if a colleague would ask “do you mean X or Y?” — that’s multi-query territory. If they’d say “I’m not sure what to ask, write me an example,” that’s HyDE.

Q: When does step-back prompting work better than query decomposition? A: Step-back wins on single questions that are too narrow — a precise date, a specific model number, a technical clause where literal terms don’t appear in any passage. Decomposition wins on compound questions that genuinely span topics or sources. The diagnostic: can the question be answered from one document? If yes, step-back. If no, decompose.

Q: How do you choose between query routing and RAG-Fusion for production RAG? A: They solve different problems. Routing picks which index to query when you have multiple sources — one cheap LLM call, one selector. RAG-Fusion picks which phrasing wins by running multiple retrievals against one index and merging via Reciprocal Rank Fusion. Use routing first if you have two or more sources. Reach for RAG-Fusion only inside a chosen index when ambiguity is the dominant failure mode and your latency budget can absorb N retrievals.

Your Spec Artifact

By the end of this guide, you should have:

A query shape distribution — your logged queries tagged as factoid, exploratory, compound, or multi-source.
A transformation decision tree — one branch per failure mode, with explicit constraints on which transformation runs and which is forbidden for that branch.
A validation harness — held-out evaluation set, recall@k baseline, latency p95 baseline, and a per-query-shape breakdown.

Your Implementation Prompt

Drop the prompt below into Claude Code, Cursor, or Codex when you’re ready to wire the decision tree into your retrieval layer. Fill the bracketed placeholders with values from Step 2.

You are designing the query transformation layer for a RAG system. Build a
decision graph that selects exactly one transformation per query, based on
the inputs below.

# Decision inputs (from Step 2)
- Query shape distribution: [% factoid] / [% exploratory] / [% compound]
- Corpus characteristics: [domain stability — stable | fast-evolving]
- Latency budget (p95 end-to-end): [budget_ms] ms
- Number of distinct knowledge sources: [n_sources]
- Confidence signal: [hybrid_score | bm25_only | none]

# Required components (from Step 1)
1. Router (only if n_sources >= 2): use LlamaIndex LLMSingleSelector or
   LLMMultiSelector.
2. Decomposer (only if % compound >= [threshold]): use SubQuestionQueryEngine.
3. One of {step-back, HyDE} — pick at most one:
   - step-back if the dominant failure is over-specific queries
   - HyDE if domain is fast-evolving AND no labeled pairs exist
   - HyDE MUST be gated by a confidence threshold on [hybrid_score]; fall back
     to plain RAG when score >= [confidence_threshold]
4. Multi-query rewriter: only as a last layer, gated by an ambiguity classifier.

# Build order (from Step 3)
Routing -> Decomposition -> (Step-back OR HyDE) -> Multi-query.

# Output contract for each transformation
- Receives: [raw_query | sub_query | abstracted_query]
- Returns: [single_query | n_queries | fused_ranked_list]
- On failure: fall back to the previous layer's output and log the event.

# Validation harness (from Step 4)
On a held-out set of [n_eval] queries, report:
- recall@10, segmented by query shape
- p95 latency delta vs. plain RAG
- hallucination rate on a sample of transformed queries
- self-consistency across N reruns

# Hard constraints
- LangChain v1: import MultiQueryRetriever from langchain_classic, NOT from
  langchain.retrievers.
- LlamaIndex docs canonical host is developers.llamaindex.ai.
- Do NOT stack HyDE and multi-query on the same query class.
- Do NOT run HyDE on personal/private fact retrieval without a confidence gate.

Generate: (1) the decision graph as code, (2) the validation harness, (3) a
config file binding the bracketed values above to runtime parameters.

Ship It

You now have a decision graph instead of a transformation stack. The next time someone says “let’s add HyDE on top of multi-query,” you have a checklist that names the failure mode each one solves and an evaluation harness that decides whether either earns its place. The transformations stop being decoration and start being engineering choices.

Sources

the HyDE paper: Precise Zero-Shot Dense Retrieval without Relevance Labels (arXiv 2212.10496) - HyDE mechanism, zero-shot retrieval claim, encoder-as-filter argument.
the Step-Back paper: Take a Step Back: Evoking Reasoning via Abstraction (arXiv 2310.06117) - Step-back prompting mechanism and PaLM-2L benchmark gains on TimeQA, MMLU, and MuSiQue.
Raudaschl’s GitHub repository: Raudaschl/rag-fusion - Canonical RAG-Fusion implementation and Reciprocal Rank Fusion recipe.
LangChain Reference: MultiQueryRetriever (langchain_classic) - Multi-query rewriter API and the default rewrite count.
LangChain v1 migration guide: Retrievers moved to langchain_classic - v1 migration: package move, removed parameters.
LlamaIndex Docs: Query Transform Cookbook - HyDE, sub-question, and selector implementations.
“Assessing RAG and HyDE on Gemma” (arXiv 2506.21568): Assessing RAG and HyDE on 1B vs. 4B-Parameter Gemma LLMs - HyDE latency overhead and hallucination evidence on small LLMs.
dmflow.chat’s RAG transformation guide: Six Advanced Query Transformation Architectures - Production trend toward agent-loop transformation selection.

Aha Moments

MONA

Each transformation is doing geometry on the query, not magic. HyDE moves the query toward a richer embedding region by generating an answer-shaped document — the dense encoder then matches on that richer point instead of the impoverished original. Step-back does the inverse: it lifts the query out of an over-specific corner and lands somewhere with denser passage neighbors. Multi-query covers a small ball around the original query because the LLM’s paraphrases scatter slightly in vector space. Once you see the transformations as movement in embedding space, the choice between them stops being mystery and becomes a question of which direction your queries need to move.

DAN

Mona is right that this is geometry, but the operational read is different. Stacking transformations is the RAG equivalent of feature creep — it looks like progress until the latency budget collapses and the support tickets start. The teams winning in production are the ones who treat each transformation as a billable line item. They measure recall lift per millisecond and they cut anything that doesn’t earn its slot. The pattern that scales is adaptive selection at runtime — the agent picks the cheapest transformation that solves the failure mode for this query, not a default applied to every query. That’s where the next round of differentiation lives.

ALAN

Both of you are describing optimization within a system whose decisions are increasingly opaque to the user. When the agent picks a transformation at runtime, the user has no idea their question was rewritten — or that the document the model is grounding on was synthesized by another LLM call before retrieval ever happened. HyDE in particular embeds a hallucinated answer and then retrieves against it, a procedure most users would not consent to if asked plainly. Reranking adds another opaque layer on top. The engineering is sound. The transparency story is not. So I’ll leave the question with you both: when the retrieval pipeline rewrites the user’s question without telling them, who is responsible for the answer they receive — the user, the model, or the engineer who wired the loop?

AI-assisted content, human-reviewed. Images AI-generated. Editorial Standards · Our Editors