DAN Analysis 9 min read April 29, 2026

Notion, Perplexity, and Glean: How Hybrid Search Powers Production RAG at Scale

Three branching retrieval pipelines converging into a unified ranking gate against a dark gradient background

Table of Contents

TL;DR

The shift: Production-scale RAG has standardized on hybrid retrieval — vector-only systems are now treated as a starting point, not a finish line.
Why it matters: Three independent leaders, in three different domains, converged on multi-signal architectures that combine lexical and semantic retrieval with multi-stage ranking.
What’s next: The retrieval layer is becoming infrastructure. Teams still running pure-vector setups will pay for it in answer quality, latency, or both.

For two years, the RAG playbook had a single default: embed everything, query a vector database, return the top-k. Cheap to build. Easy to demo. Brutal in production. Now the largest deployments in the field have moved on — and the pattern they all converged on is the same.

The Vector-Only Era Is Already Behind Us

Thesis: Hybrid Search has become the production default for Retrieval Augmented Generation at scale, and the leaders in three different domains — web search, enterprise search, and workspace AI — have all moved past pure-vector retrieval.

This isn’t an opinion. It’s a convergence.

Perplexity runs hybrid retrieval on its own Vespa.ai-based index, fusing lexical and embedding-based scorers with multi-stage ranking that ends in cross-encoder rerankers (Perplexity Research). Glean ships enterprise search built on lexical retrieval plus vector embeddings, layered over a proprietary knowledge graph (Glean Blog). Notion runs vector search at production scale on Turbopuffer while feeding a parallel keyword index from inside the same data lake (Notion Blog).

Three companies. Three domains. The same direction.

That’s not a coincidence. That’s a market signal.

Three Architectures, One Pattern

The evidence doesn’t organize by who shipped first. It organizes by what each system proves.

Signal 1 — Perplexity at web scale. The retrieval engine handles 200 million queries per day across an index of over 200 billion unique URLs (Perplexity Research). The architecture queries the search index via both modalities and merges results into a hybrid candidate set, then runs multi-stage ranking ending in cross-encoder rerankers. Median ranking latency lands at 358ms; p95 at 763ms. At that volume, a vector-only setup is not viable. The lexical signal isn’t an addition — it’s load-bearing.

Signal 2 — Glean inside the enterprise. Glean’s hybrid pipeline combines BM25-style lexical retrieval with vector embeddings, fused with Reciprocal Rank Fusion, layered on top of a self-learning model and a proprietary knowledge graph (Glean Blog). The company reports up to 30% reduction in irrelevant results compared to traditional methods (Glean perspectives). Note the framing: not “vector search wins” — the combination wins.

Signal 3 — Notion at workspace scale. Notion AI’s vector pipeline migrated from a dedicated-pod cluster to Turbopuffer, cutting query latency from 70-100ms to 50-70ms while delivering roughly 10x capacity at 90% lower cost over two years (Notion Blog). Inside the same data lake architecture, an Inverted Index-style keyword search runs on ElasticSearch, fed from the identical Hudi-on-S3 source pipeline (Notion Blog). The vector index and keyword index are physically separate but architecturally siblings.

The structural lesson lands the same way each time: at production scale, no single retrieval mode handles the full query distribution. Lexical and semantic are complementary, not competitive.

The Winners Are Already Built In

The companies positioned to capture this shift fall into three groups.

Search infrastructure providers. Vespa.ai is now the production substrate behind Perplexity’s traffic — cited as the only platform Perplexity evaluated as production-proven for real-time hybrid RAG (Vespa case study). Open-source vector databases like Weaviate and Qdrant face the same buyer pressure: customers stopped asking for pure-vector products.

Enterprise platforms with knowledge graphs. Glean’s combination of lexical, vector, and graph layers gives it a moat that pure embedding shops cannot match. Retrieval quality compounds with usage data, not with model upgrades alone.

Vector DB vendors that pivoted fast. Turbopuffer’s object-storage-native model is the architecture that let Notion 10x its scale at a fraction of the cost. Notion runs OpenAI’s zero-retention embedding API for vector generation, then keeps retrieval on dedicated infrastructure. The split is the future: third-party model APIs for embeddings, dedicated infrastructure for serving.

You’re either shipping hybrid retrieval or you’re shipping last year’s stack.

Pure-Vector RAG Just Became a Liability

Vendors that built their pitch around “embed everything, ship fast” now face a buyer with a different question: what do you do with rare keywords, exact-match queries, and acronyms your embedding model never saw?

Teams running embedding-only retrieval inside production applications are quietly leaking quality. Lexical-precise queries — product SKUs, error codes, compliance terminology — are exactly where vector similarity underperforms. Users notice. Engineering teams blame the model. The retrieval layer is the actual bottleneck.

And anyone treating retrieval as a static layer, separate from generation, is about to be flanked. Agentic RAG systems are absorbing retrieval into the orchestration loop — agents call retrieval as one tool among many, with reranking and query rewriting handled inside the reasoning step. Pure-vector backends without a hybrid path forward become legacy components inside that stack.

The vector-only retrieval pitch hasn’t died. It just got demoted.

What Happens Next

Base case (most likely): Hybrid retrieval becomes table stakes for production RAG within the next two quarters. Vector-only systems remain in prototypes, internal tools, and low-stakes use cases. Cross-encoder rerankers move from “advanced optimization” to default architectural assumption. Signal to watch: New vector DB vendors shipping native lexical-plus-vector fusion in their core product, not as an enterprise add-on. Timeline: Through end of 2026.

Bull case: Hybrid retrieval gets absorbed into agent frameworks as a primitive. Agents call a unified retrieval tool that handles lexical, semantic, and graph fusion automatically. Quality improvements compound with multi-stage ranking and learned rerankers tuned per domain. Signal: Major agent frameworks shipping built-in hybrid retrieval modules with cross-encoder rerankers as defaults. Timeline: Mid-to-late 2026.

Bear case: Hybrid retrieval pipelines prove harder to operate at scale than vendor marketing suggests. Multi-stage ranking introduces latency and tuning complexity that smaller teams cannot absorb. A wave of teams quietly retreats to vector-only setups for simplicity, accepting the quality trade-off. Signal: Public postmortems describing rerank-stage outages or hybrid pipeline rollbacks. Timeline: Could surface within six months for early adopters.

Security & compatibility notes:
Langflow RCE (CVE-2026-33017): Unauthenticated remote code execution in the flow-build endpoint, exploited within 20 hours of disclosure (Sysdig). Not a hybrid-search component, but a reminder that adjacent RAG-stack tooling is now an active attack surface. Patch promptly and audit exposed endpoints.

Frequently Asked Questions

Q: Which companies use hybrid search in production RAG systems? A: Perplexity runs hybrid retrieval on its Vespa.ai-based index at 200 million queries per day. Glean ships hybrid (lexical + vector + knowledge graph) for enterprise search. Notion runs production vector search on Turbopuffer alongside an ElasticSearch keyword index in the same data lake.

Q: How does Perplexity use hybrid search to improve answer quality? A: Perplexity queries its index via both lexical and embedding-based scorers, merges results into a hybrid candidate set, then runs multi-stage ranking ending in cross-encoder rerankers. Median ranking latency is 358ms across an index of over 200 billion unique URLs (Perplexity Research).

The Bottom Line

The hybrid retrieval debate is over. Three production leaders in three different domains all settled it the same way. Teams still running pure-vector RAG aren’t behind on a paper — they’re behind on the production architecture the rest of the field already validated.

Disclaimer

This article discusses financial topics for educational purposes only. It does not constitute financial advice. Consult a qualified financial advisor before making investment decisions.

Sources

Perplexity Research: Architecting and Evaluating an AI-First Search API - Perplexity’s hybrid retrieval architecture, scale, and ranking latency
Vespa case study: How Perplexity uses Vespa.ai - Perplexity’s substrate choice and traffic scale
Glean Blog: Hybrid Search vs. RAG and Vector Search - Glean’s hybrid architecture combining lexical, vector, and knowledge graph
Glean perspectives: Best RAG features in enterprise search - Customer-reported relevance improvements
Notion Blog: Two years of vector search at Notion - Vector search migration to Turbopuffer
Notion Blog: How Notion built and grew our data lake - Data lake architecture feeding both vector and keyword indexes
Sysdig: CVE-2026-33017: How attackers compromised Langflow AI pipelines in 20 hours - Adjacent RAG-stack security incident

Aha Moments

MONA

The mathematical reason hybrid retrieval wins is that lexical and semantic scores capture different failure modes. Vector embeddings collapse rare tokens — uncommon acronyms, domain-specific identifiers, compound technical terms — into the nearest dense neighborhood, which is often the wrong one. Lexical matching, by contrast, fails on synonymy and paraphrase, where embeddings excel. Fusion methods like Reciprocal Rank Fusion don’t just average the two signals — they exploit the orthogonality of their errors. When two retrieval modes disagree on the long tail and agree on the obvious, the merged ranking gets stronger precisely where each method alone gets weaker. The convergence Dan describes is downstream of that mathematical fact. The architecture didn’t win because it was fashionable. It won because the error surfaces of the two retrieval modes complement each other.

MAX

Mona’s right that the math justifies the architecture. The harder problem is operating it. Production hybrid retrieval is a specification challenge before it’s an engineering one — teams need defined contracts for which modality handles which query class, fallback rules when one stage degrades, and monitoring on rerank latency separate from candidate-set generation. The cross-encoder reranker stage is the most fragile component: it’s compute-heavy, it has its own model lifecycle, and its failures show up as silent quality drops, not loud outages. If your spec doesn’t define what “rerank confidence below threshold” should do — fall back to lexical, fall back to vector, fall back to cached — you don’t have a hybrid pipeline. You have three retrieval systems hoping nothing breaks.

ALAN

Both of you describe the technical case beautifully. But there’s a quieter implication in the convergence Dan documents. When the production leaders in web search, enterprise search, and workspace AI all settle on architectures whose ranking decisions live inside cross-encoder rerankers and proprietary knowledge graphs, the answer-quality layer becomes harder to inspect than the model itself. We can audit a generation prompt. We can probe a vector embedding. But the multi-stage ranking pipeline that decides which sources reach the model — that complexity is becoming invisible infrastructure, owned by a small number of vendors. If retrieval is the new gatekeeper of what large language models know, who audits the gatekeeper?

AI-assisted content, human-reviewed. Images AI-generated. Editorial Standards · Our Editors