
Query Transformation Limits: Latency Tax, Drift, Hallucinated Documents

Three structural limits of query transformation: latency tax, query drift, hallucinated documents from LLM rewriters
Before you dive in

This article is a specific deep-dive within our broader topic of Query Transformation.

This article assumes familiarity with the fundamentals of Query Transformation and the basic wiring of a RAG retrieval pipeline.

Coming from software engineering? Read the bridge first: RAG Pipelines for Developers: What Maps from Search, What Breaks →

ELI5

Query Transformation rewrites a user query before retrieval. It works — but every rewrite pays three structural costs: an extra LLM call, drift that narrows what it tried to expand, and hallucinated content that grounds retrieval in fabricated text.

Query transformation sells the idea that retrieval fails because the user’s query is poorly phrased — rewrite it, and recall rises. Teams build that pattern, run it on real traffic, and watch latency climb on cheap queries while the difficult ones still miss. The bottleneck did not vanish; it moved. Three structural limits explain where it went.

Vector search is fast. The transformer call that decides what to vector-search for is not. That asymmetry is the first structural limit, and the LangChain default makes it concrete: a MultiQueryRetriever issues N parallel rewrites, runs N searches, then unions and de-duplicates the results (LangChain Docs). Any retrieval gain has to be large enough to justify at least one extra LLM round-trip, because that round-trip lands directly in user-visible latency.
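As a concrete sketch, the default wiring looks roughly like this. The retriever class and factory method come from the LangChain docs cited above; the FAISS index, embedding model, and rewriter model are illustrative stand-ins, and exact import paths vary by LangChain version.

```python
# Sketch of the LangChain default described above: one LLM call fans out into
# N query variants, each variant hits the vector store, and the results are
# unioned and de-duplicated. Index name and models are placeholders.
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# Assume an existing local index; FAISS here is illustrative.
vectorstore = FAISS.load_local(
    "my_index", OpenAIEmbeddings(), allow_dangerous_deserialization=True
)

retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(),
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),  # the rewrite call on the critical path
)

# One user query -> one extra LLM round-trip -> N vector searches -> deduped union.
docs = retriever.invoke("How do I tune HNSW recall for a 10M-vector index?")
```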

What are the technical limitations and failure modes of query transformation?

There are three, and they compose. The first is the latency tax — every transformation puts at least one LLM call on the critical path before the vector search runs. The second is query drift: rewriting can over-refine simple queries, narrow on ambiguous ones, or fan out into redundant variants. The third is hallucinated documents — methods like HyDE generate fabricated content that the retriever then treats as ground truth. Each is a property of how the rewriter is wired into the pipeline, not a defect in any one technique, and the same three surfaces show up across HyDE, multi-query, step-back, decomposition, and trained rewriters.

The latency cost has a measurable floor. A 2025 production analysis of enterprise RAG deployments reported default agent configurations over-retrieving on 42% of simple factoid queries, adding 300–800 ms of latency that produced no recall gain (Mudassar Hakim — directional, single-source engineering analysis). Step-Back Prompting from Google DeepMind illustrates the same shape at the high end: gains of +7% on MMLU Physics, +11% on MMLU Chemistry, +27% on TimeQA, and +7% on MuSiQue (Zheng et al. 2023) — but every one of those queries paid for an additional abstraction-step LLM call before retrieval ran, and those benchmarks are reasoning-heavy multi-hop or temporal QA, not retrieval recall on its own. The pattern generalizes: end-to-end answer quality lifts on hard inputs, while every cheap input pays the same fixed tax.

The architecture, in one sentence: the rewrite runs before the search. Production patterns have evolved around it — a small, fast rewriter (GPT-4o-mini, Haiku, or Llama-3.1-8B) feeds a slower reader; an iteration cap of three cycles resolves roughly 95% of queries that benefit from re-retrieval before runaway cost (DEV Community, Kuldeep Paul). Both are workarounds for a tax that does not go away. They make it bearable.
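A minimal sketch of that shape, not any particular library's API: `search`, `rewrite`, and `good_enough` are placeholders for your own vector search, small rewriter call, and retrieval-confidence check, and the three-cycle cap mirrors the guidance above.

```python
from typing import Callable, List

def retrieve_with_cap(
    query: str,
    search: Callable[[str], List[str]],        # your vector search
    rewrite: Callable[[str, List[str]], str],  # small, fast rewriter (GPT-4o-mini / Haiku class)
    good_enough: Callable[[List[str]], bool],  # retrieval-confidence check (reranker score, threshold)
    max_cycles: int = 3,                       # iteration cap cited above
) -> List[str]:
    """Rewrite-and-retry loop with a hard iteration cap."""
    docs = search(query)                # first pass on the raw query: no LLM on the path
    for _ in range(max_cycles):
        if good_enough(docs):           # stop as soon as retrieval looks confident
            return docs
        query = rewrite(query, docs)    # each extra cycle pays the latency tax again
        docs = search(query)
    return docs
```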

Query Drift: Rewriting Refines What It Should Have Left Alone

The second limit is subtler because it sounds like the opposite of a problem. The whole point of the rewriter is to refine the query — except refinement, applied indiscriminately, is its own failure mode. The Q-PRM team named the pattern explicitly: for simple queries, rewriting methods “frequently introduce unnecessary steps, leading to over-refinement”; for complex ones, the same methods under-refine (Q-PRM 2025). The system has no internal signal about which side of that line the current query sits on.

Abe et al. mapped where the failure surfaces. Their two-failure-regime analysis identifies knowledge deficiency — where the LLM lacks the relevant domain knowledge and emits incorrect expansions — and query ambiguity, where the rewriter biases its refinements in ways that narrow search coverage instead of broadening it (Abe et al. 2025). The conclusion is sharp: query expansion can significantly degrade retrieval effectiveness when either condition holds. The rewriter has no signal that it is in those regimes.

Rewriting can narrow as easily as broaden — and the production version of this is what ZenML calls rewrite drift: small, unnoticed query errors that silently degrade end-to-end accuracy as the user-query distribution shifts beneath a static rewriting prompt (ZenML Blog). The rewriter keeps producing fluent, plausible reformulations; the retrieval keeps missing in ways no one notices until the eval set catches up. Bias amplification is the adversarial sibling: simple LLM-based rewriting cuts aggregate retriever bias materially under clean conditions but fails when multiple biases combine (bias-rewriting study, 2026). Each rewrite is a chance to over-correct.

Hallucinated Documents: HyDE Grounds Retrieval in Fabricated Text

The third limit is the one most teams underestimate, because the original Hypothetical Document Embeddings paper from Gao et al. argued the architecture defends against it. HyDE generates a hypothetical answer to the user’s query, encodes that synthetic document with an unsupervised contrastive encoder, then retrieves real corpus documents by vector similarity (Gao et al. 2022). The authors’ framing is that the encoder’s dense bottleneck “filters out incorrect/fabricated details” — the embedding flattens the false specifics and preserves the topical signal. That is the original claim. It is not consensus.
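A minimal sketch of that flow, assuming a current OpenAI-style client rather than the paper's original InstructGPT-plus-Contriever setup; the model names and the cosine-similarity search are illustrative, not the paper's exact configuration.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def hyde_retrieve(query: str, index_vectors: np.ndarray, k: int = 5) -> np.ndarray:
    """HyDE: retrieve real documents via the embedding of a generated answer."""
    # 1. Generate a hypothetical document that *answers* the query.
    #    Nothing here checks whether its facts are true; that is the third limit.
    hypo = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Write a short passage answering: {query}"}],
    ).choices[0].message.content

    # 2. Embed the hypothetical document, not the raw query.
    emb = np.asarray(
        client.embeddings.create(model="text-embedding-3-small",
                                 input=hypo).data[0].embedding
    )

    # 3. Nearest-neighbour search over the real corpus embeddings (cosine similarity).
    sims = index_vectors @ emb / (
        np.linalg.norm(index_vectors, axis=1) * np.linalg.norm(emb)
    )
    return np.argsort(-sims)[:k]        # indices of the retrieved real documents
```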

When the LLM lacks domain knowledge, the premise is wrong in ways the encoder cannot filter. HyDE retrieves real documents from a fake premise, which is exactly the failure mode Lei et al. mapped on a corpus of 3.4 million Stack Overflow Java and Python posts: standard HyDE struggled in 25% of sampled cases on concept-focused queries — retrieval fetched off-topic content, or the model addressed a broader variant of the question (an AngularJS-syntax query, for instance, returned service-pattern retrieval rather than syntax-level results) (Lei et al. 2025). For specific numerical or factual queries — population of city X, exact API rate limits, model parameter counts — HyDE generates a confidently wrong hypothetical, and the embedding of that wrong answer pulls the retrieval toward documents about the wrong fact (Mudassar Hakim).

The same study makes the case for not abandoning HyDE either. With a similarity-threshold fallback that routes around HyDE when the rewritten query is unlikely to help, Adaptive HyDE materially outperformed accepted Stack Overflow answers as a baseline, with mean LLM-as-judge scores of 6.05 versus 5.04 (p<10⁻⁸²) on developer-support questions (Lei et al. 2025). The threshold itself becomes a coverage-quality dial: at 0.9, only 0.7% of queries pass through HyDE but quality reaches 6.44; at 0.5, every query passes through HyDE and quality drops to 5.13. The architecture’s honesty is in the threshold knob — it admits HyDE is good for some query classes and harmful for others, and forces the system to decide.

Not a fix. A configuration.
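A hedged sketch of that dial: the gating signal below (accept the HyDE path only when its score clears the threshold, otherwise fall back to the raw query) is an assumption for illustration, and Lei et al.'s exact routing criterion may differ. The coverage-quality behavior is the same, though: raise the threshold and fewer queries go through HyDE.

```python
from typing import Callable, List

def adaptive_retrieve(
    query: str,
    direct_search: Callable[[str], List[str]],  # retrieval on the raw query
    hyde_search: Callable[[str], List[str]],    # the HyDE path: fabricate, embed, retrieve
    gate_score: Callable[[str], float],         # assumed gating signal, e.g. similarity of the
                                                # hypothetical to its best corpus hit
    threshold: float = 0.9,                     # 0.9: almost nothing routes through HyDE; 0.5: everything does
) -> List[str]:
    """Similarity-threshold gate around HyDE, as a configuration knob."""
    if gate_score(query) >= threshold:
        return hyde_search(query)
    return direct_search(query)                 # fallback: retrieve on the raw query
```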

[Diagram: the three structural limits of query transformation. A latency tax from at least one LLM call before vector search, query drift from over-refinement on simple queries and narrowing on ambiguous ones, and hallucinated documents from HyDE grounding retrieval in fabricated text.]
Latency tax, query drift, and hallucinated documents stack into the three failure surfaces every query-transformation method pays for at the architectural level.

What These Limits Predict

Once the three surfaces are in mind, the symptoms in production resolve into something predictable rather than mysterious.

  • If your latency p95 jumped after adding a rewriter and your retrieval recall did not move, the queries paying the tax are mostly cheap factoids — the fix is a router that decides whether to rewrite at all, not a faster rewriter.
  • If retrieval recall on simple factoid queries got worse after rewriting, you are in over-refinement territory; classify query complexity before sending anything to the rewriter.
  • If retrieval misses cluster on unfamiliar-domain queries, you are watching the knowledge-deficiency failure mode of expansion (Abe et al. 2025) — verify the LLM has any signal about the topic before letting it refine the query.
  • If multi-query or decomposition expanded the candidate pool but answer quality dropped, the second-order penalty is in play. Liu et al. documented the U-shape: models use relevant information well when it sits at the start or end of a long context, while relevant information in the middle is systematically under-attended (Liu et al. 2023). Over-filling context with redundant rewrites pushes the gold chunk into the dead zone.
  • If naive sub-query decomposition retrieves more documents than fit your context budget, you are seeing the tradeoff Petcu et al. quantified: a bandit that selects which decompositions to actually run delivered roughly 35% better doc precision and 15% better α-nDCG than uniform decomposition (Petcu et al. 2025). Selection beats fan-out.

The architectural answer the field converged on between 2024 and 2026 is not better rewrites. It is a layer that decides whether to rewrite. Agentic RAG systems route queries through a lightweight intent classifier first, fall back to direct retrieval when the raw query is good enough, and only spend the LLM budget on inputs that demonstrably need it. Corrective-RAG scores retrieved documents and triggers a rewrite-and-retry only when retrieval confidence is low. The router is the architectural fix; the rewriter is the tool the router decides whether to call.
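A sketch of that wiring under stated assumptions: `classify`, `score_docs`, and the confidence floor are placeholders, and the corrective-RAG document-scoring step is reduced to a single callable rather than any specific implementation.

```python
from typing import Callable, List

def routed_retrieve(
    query: str,
    classify: Callable[[str], str],            # lightweight intent classifier: "simple" | "complex"
    search: Callable[[str], List[str]],        # direct vector search
    rewrite: Callable[[str], str],             # the rewriter, called only when the router says so
    score_docs: Callable[[List[str]], float],  # corrective-RAG-style retrieval-confidence score
    min_confidence: float = 0.5,
) -> List[str]:
    """Router-first wiring: the rewriter is a tool the router may or may not call."""
    if classify(query) == "simple":            # cheap factoid: skip the latency tax entirely
        return search(query)

    docs = search(query)                       # try the raw query first even on complex intents
    if score_docs(docs) >= min_confidence:     # retrieval already confident: no rewrite
        return docs

    return search(rewrite(query))              # spend the LLM budget only where it is needed
```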

Compatibility & known anti-patterns:

  • LangChain MultiQueryRetriever: Still maintained, but naive N-query fan-out without a router is a known anti-pattern for simple factoid queries — production guidance now favors adaptive routing (LangChain Docs).
  • Naive query decomposition (no selection): Demonstrated to hurt downstream generation when the retrieved sub-document set exceeds the context budget; bandit-based selection (Petcu et al. 2025) is the current SOTA pattern.
  • Static rewriting prompts (no eval loop): Silently degrade as the user-query distribution shifts (“rewrite drift”); production rewriters need online evals on representative traffic, not snapshot prompt tuning.
  • InstructGPT (text-davinci-003) for HyDE: The original 2022 prompts ran on a deprecated endpoint; modern HyDE uses GPT-4o-mini, Haiku, or Llama-3.1-8B as the rewriter, and older tutorials referencing davinci will fail at runtime.

Rule of thumb: treat the rewriter as a cost center first — every transformation pays a latency tax and risks drift; only spend the cost where retrieval recall on the raw query is demonstrably weak, and only behind a router that can fall back when the rewriter has nothing useful to add.

When it breaks: static rewriting prompts have no internal mechanism to detect their own drift. Without an online eval loop and a router that decides whether to rewrite, the system silently degrades as the user-query distribution shifts — the rewriter keeps sounding fluent while retrieval keeps missing on exactly the queries the eval set never anticipated. Wrap the rewriter in Hybrid Search and Reranking as parallel defenses, not as substitutes for the routing decision.
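What an online check might look like, as a sketch rather than a prescription: shadow-compare raw and rewritten retrieval on a small sample of live traffic, with `judge` standing in for whatever hit-rate or relevance signal your eval harness already computes.

```python
import random
from typing import Callable, List

def sample_rewrite_eval(
    queries: List[str],
    search: Callable[[str], List[str]],
    rewrite: Callable[[str], str],
    judge: Callable[[str, List[str]], bool],   # did retrieval contain what the query needed?
    sample_rate: float = 0.02,                 # small slice of live traffic
) -> dict:
    """Shadow-compare raw vs rewritten retrieval so drift shows up before users do."""
    raw_hits = rewritten_hits = n = 0
    for q in queries:
        if random.random() > sample_rate:
            continue
        n += 1
        raw_hits += judge(q, search(q))                 # raw-query retrieval
        rewritten_hits += judge(q, search(rewrite(q)))  # rewritten-query retrieval
    return {
        "sampled": n,
        "raw_hit_rate": raw_hits / max(n, 1),
        "rewritten_hit_rate": rewritten_hits / max(n, 1),
    }
```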

The Data Says

Query transformation in Retrieval Augmented Generation is not a free upgrade. The latency tax, query drift, and hallucinated documents are properties of the architecture, not bugs in any one method, and they compose to produce the second-order penalty of lost-in-the-middle when over-fanned-out context floods the model. The systems that work in production treat the rewriter as one tool among many, gated behind a router that decides per-query whether the LLM call is worth its cost.
