Sparse Retrieval

Also known as: lexical retrieval, bag-of-words retrieval, inverted-index retrieval

Sparse retrieval is a family of search methods that represent queries and documents as high-dimensional vectors over a vocabulary, where almost every coordinate is zero. Matching runs through an inverted index for efficient top-k lookup, which makes keyword-style search fast. The family spans classical scoring functions like BM25 and learned encoders like SPLADE.

What It Is

Search engines face one fundamental problem: given a user’s query, find the most relevant documents from a collection of millions or billions, in milliseconds. Sparse retrieval solves this by treating both queries and documents as bags of weighted terms. A document about “machine learning frameworks” gets non-zero weights for the words “machine”, “learning”, and “framework”, and zero for every other word in the vocabulary. That sparsity is the whole point — it lets the system store everything in an inverted index and skip directly to documents containing the query’s terms, ignoring the rest of the corpus.
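To make the representation concrete, here is a toy sketch (the vocabulary size, whitespace tokenizer, and weights are all invented for illustration): a document stores only its handful of non-zero term weights, never a full vocabulary-length array.

```python
# Toy sparse bag-of-words representation. The vocabulary size and whitespace
# tokenizer are placeholders, not how a production engine tokenizes text.
from collections import Counter

VOCAB_SIZE = 50_000  # pretend vocabulary; only a few entries will be non-zero

def to_sparse_vector(text: str) -> dict[str, int]:
    """Return only the non-zero coordinates: term -> raw count."""
    return dict(Counter(text.lower().split()))

doc = "machine learning frameworks simplify machine learning workflows"
vec = to_sparse_vector(doc)
print(vec)                                    # {'machine': 2, 'learning': 2, ...}
print(f"{len(vec)} non-zero entries out of {VOCAB_SIZE}")
```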

The classical version is BM25, a probabilistic scoring function that weights each term by how often it appears in a document, how rare it is across the collection, and how long the document is. According to Robertson & Zaragoza (2009), BM25 emerged from the Probabilistic Relevance Framework and remains the default baseline against which every modern retrieval system is measured. It has no learned parameters, runs on commodity hardware, and works in any language without retraining.
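The scoring function is compact enough to sketch in full. The snippet below is a minimal, unoptimized implementation of a common BM25 variant (Lucene-style IDF, the usual k1 and b free parameters, a toy corpus and tokenization), not the tuned code paths that Lucene or Elasticsearch actually ship.

```python
# Minimal BM25: per query term, IDF times a saturated, length-normalized term frequency.
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N              # average document length
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)         # how many docs contain the term
        if df == 0:
            continue
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))  # rarer terms weigh more
        norm = k1 * (1 - b + b * len(doc_terms) / avgdl) # penalize long documents
        score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
    return score

corpus = [d.split() for d in ["machine learning frameworks",
                              "deep learning with gpus",
                              "gardening tips for spring"]]
query = "learning frameworks".split()
for doc in corpus:
    print(" ".join(doc), "->", round(bm25_score(query, doc, corpus), 3))
```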

The newer version is learned sparse retrieval. Models like SPLADE use a transformer (typically BERT-based) to predict which terms a document should be indexed under — including terms that don’t literally appear in the text. According to arXiv 2107.05720, SPLADE keeps the inverted-index substrate but lets a neural network expand and reweight the terms. A document about “cardiac arrest” might get non-zero weight for “heart attack” even if those exact words never appear, closing the lexical gap that hurts pure keyword matching. The output stays sparse — most vocabulary entries still get zero weight — so existing search infrastructure works unchanged.
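A rough sketch of how a SPLADE-style encoder produces those expanded weights, assuming the Hugging Face transformers and torch packages and one published SPLADE checkpoint (the model name and the log(1 + ReLU) max-pooling follow the paper's recipe; details vary across releases, so treat this as illustrative rather than definitive):

```python
# SPLADE-style sparse encoding sketch (assumes `pip install torch transformers`).
# "naver/splade-cocondenser-ensembledistil" is one published checkpoint; swap in
# whichever learned sparse model you actually use.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

name = "naver/splade-cocondenser-ensembledistil"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name)

def encode_sparse(text: str) -> dict[str, float]:
    """Map text to vocabulary-sized weights: log(1 + ReLU(logits)), max-pooled over tokens."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits                  # (1, seq_len, vocab_size)
    weights = torch.log1p(torch.relu(logits)).max(dim=1).values.squeeze(0)
    nonzero = weights.nonzero().squeeze(1)
    return {tokenizer.convert_ids_to_tokens(i.item()): weights[i].item() for i in nonzero}

vec = encode_sparse("cardiac arrest")
# Expect non-zero weight on related terms such as "heart", which never appear in the input.
print(sorted(vec.items(), key=lambda kv: -kv[1])[:10])
```

Because the ReLU zeroes most of the vocabulary, the output is still a sparse vector that drops straight into an inverted index.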

Both approaches share the same operational shape: vocabulary-sized vectors, inverted index, top-k by sum of matching term weights. What differs is who decides the weights — a handful of statistics (BM25) or a neural network (SPLADE).
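The shared lookup is easy to see in code. The toy index below (documents and weights invented for illustration) walks the postings lists for the query's terms, sums matching weights, and returns the top k; whether the weights came from BM25 statistics or a learned encoder changes nothing about this step.

```python
# Toy inverted index plus top-k scoring, the substrate shared by BM25 and SPLADE.
import heapq
from collections import defaultdict

doc_vectors = {
    "d1": {"machine": 1.8, "learning": 1.5, "framework": 2.1},
    "d2": {"gardening": 2.4, "spring": 1.9},
    "d3": {"learning": 1.2, "rate": 2.0},
}

index: dict[str, list[tuple[str, float]]] = defaultdict(list)   # term -> postings
for doc_id, vec in doc_vectors.items():
    for term, weight in vec.items():
        index[term].append((doc_id, weight))

def search(query_vec: dict[str, float], k: int = 2):
    """Score = dot product of sparse query and document vectors, via postings lists."""
    scores: dict[str, float] = defaultdict(float)
    for term, q_weight in query_vec.items():
        for doc_id, d_weight in index.get(term, []):    # only touch docs sharing a term
            scores[doc_id] += q_weight * d_weight
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

print(search({"learning": 1.0, "framework": 1.0}))      # d1 ranks first, then d3
```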

How It’s Used in Practice

Most readers encounter sparse retrieval without knowing it. The search bar in your documentation site, the “find similar tickets” feature in your support tool, the keyword filter in an e-commerce app — all of these likely run BM25 underneath. In retrieval-augmented generation (RAG), sparse retrieval often handles the first pass: pull the top candidate chunks by BM25, then rerank them with a dense embedding model or a cross-encoder. This hybrid pattern often beats either method alone on out-of-domain queries, which is why production search stacks default to it.
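One common shape for that first-pass-then-rerank pattern is sketched below using the sentence-transformers CrossEncoder class; the bm25_search helper is a hypothetical stand-in for whatever sparse retriever your stack provides, and the checkpoint name is one widely used MS MARCO reranker.

```python
# Hybrid sketch: sparse first pass, cross-encoder rerank (assumes `pip install sentence-transformers`).
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def bm25_search(query: str, k: int) -> list[tuple[str, str]]:
    """Hypothetical stand-in for your sparse retriever (Pyserini, Elasticsearch, ...)."""
    return [("d1", "machine learning frameworks"), ("d2", "gardening tips for spring")][:k]

def retrieve(query: str, k: int = 5) -> list[str]:
    candidates = bm25_search(query, k=100)                       # cheap, wide sparse pass
    scores = reranker.predict([(query, text) for _, text in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: -pair[1])
    return [doc_id for (doc_id, _text), _score in ranked[:k]]   # expensive, narrow rerank

print(retrieve("which frameworks support machine learning?"))
```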

Sparse retrieval is also the easiest baseline to set up. Pyserini wraps the Lucene/Anserini stack and gives you a working BM25 index in a few lines of Python. No GPU, no embedding service, no vector database. For teams shipping their first RAG prototype, that’s usually the quickest path to a working system.
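A minimal sketch of that path, assuming a recent Pyserini release (the LuceneSearcher import path has moved between versions) and an index already built with Pyserini's indexing CLI:

```python
# BM25 search with Pyserini (needs Java installed; `pip install pyserini`).
# Assumes a Lucene index built beforehand, e.g. via `python -m pyserini.index.lucene ...`.
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher("indexes/my-corpus")   # path to your prebuilt index
searcher.set_bm25(k1=0.9, b=0.4)                 # optional: tune the BM25 parameters

hits = searcher.search("machine learning frameworks", k=10)
for hit in hits:
    print(f"{hit.docid}\t{hit.score:.3f}")
```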

Pro Tip: Before reaching for a vector database, run BM25 on your corpus and measure recall@20. If it’s already strong, your problem isn’t retrieval quality — it’s reranking, chunking, or the prompt. Adding a dense model on top of a healthy sparse baseline gives you incremental lift. Replacing a broken sparse baseline with a dense one rarely fixes the underlying issue.
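A quick way to run that check, given a small labeled set (the run_bm25 function and qrels labels below are placeholders for your own retriever and relevance judgments):

```python
# Recall@k over a labeled query set. `run_bm25(query, k)` is a hypothetical function
# returning ranked doc ids; `qrels` maps each query to its set of relevant doc ids.
def recall_at_k(qrels: dict[str, set[str]], run_bm25, k: int = 20) -> float:
    total = 0.0
    for query, relevant in qrels.items():
        retrieved = set(run_bm25(query, k=k))
        total += len(retrieved & relevant) / len(relevant)
    return total / len(qrels)

# recall_at_k(qrels, run_bm25, k=20) == 0.85 means that, on average, 85% of the
# known-relevant documents for each query appear in the top 20 results.
```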

When to Use / When Not

Use sparse retrieval for:
- Keyword-heavy queries (product codes, names, exact phrases)
- Cold-start search with no labeled training data
- Production systems needing low-latency retrieval on commodity CPUs

Avoid relying on it alone for:
- Cross-lingual or paraphrased queries needing synonym matching
- Very short documents where term statistics are unreliable
- Deep semantic similarity tasks like long-form question answering

Common Misconception

Myth: Sparse retrieval is obsolete — dense vector embeddings replaced it years ago. Reality: Sparse retrieval remains the strongest single-method baseline on out-of-domain tasks, which is why the BEIR benchmark uses BM25 as its reference. Production search stacks at major vendors (Elasticsearch, OpenSearch, Vespa) ship with BM25 as the default and add dense retrieval as a complement, not a replacement. Most modern RAG systems use both, fused with reciprocal rank fusion (RRF), because each method catches relevance signals the other misses.
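Reciprocal rank fusion itself is only a few lines: each document's fused score is the sum of 1/(c + rank) over every ranked list it appears in, with c conventionally set to 60 (the example rankings below are placeholders).

```python
# Reciprocal rank fusion: merge ranked lists without comparing their raw scores.
from collections import defaultdict

def rrf(ranked_lists: list[list[str]], c: int = 60) -> list[str]:
    fused: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (c + rank)          # high ranks contribute the most
    return sorted(fused, key=fused.get, reverse=True)

sparse_hits = ["d3", "d1", "d7"]          # e.g. a BM25 ranking
dense_hits  = ["d1", "d9", "d3"]          # e.g. an embedding ranking
print(rrf([sparse_hits, dense_hits]))     # documents both methods like rise to the top
```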

One Sentence to Remember

Sparse retrieval is the workhorse of search — fast, cheap, language-agnostic, and still strongest on the queries that matter most: rare words, exact matches, and out-of-distribution vocabulary; treat it as your default and add dense retrieval where evaluation shows it helps, not by reflex.

FAQ

Q: Is BM25 still used today? A: Yes. BM25 remains the default scoring function in Elasticsearch, OpenSearch, and Lucene, and it’s the standard baseline in retrieval research. Most production search systems use it directly or in hybrid setups alongside dense retrieval.

Q: What’s the difference between sparse and dense retrieval? A: Sparse vectors have one dimension per vocabulary term, mostly zeros, and match exact words. Dense vectors are short, fully populated, and match semantic similarity. Sparse is faster and more interpretable; dense handles paraphrases better.

Q: Do learned sparse models like SPLADE replace BM25? A: No. SPLADE typically beats BM25 in-domain but can degrade on unfamiliar collections. Most teams use BM25 as a robust default and add SPLADE or dense retrieval where evaluation on their own data shows clear lift.

Sources

Robertson, S., & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval, 3(4), 333-389.
Formal, T., Piwowarski, B., & Clinchant, S. (2021). SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking. arXiv:2107.05720.

Expert Takes

Sparse retrieval works because most words don’t appear in most documents — that’s an empirical fact about language, not a design choice. The inverted index exploits this skew directly. What learned sparse encoders add is a way to predict terms a document should be indexed under, even ones the author never typed. The substrate stays the same; the weighting function gets a neural upgrade. Sparsity is structure, not compression.

Spec the retrieval layer before the model. If your pipeline doesn’t define what relevance means for your queries, swapping a lexical scorer for a dense model just shifts the failure mode. Start with sparse, measure recall against a labeled set, then add dense retrieval only where evaluation shows lift. Most teams skip the measurement step and end up debugging a vector database when the real problem was chunking.

The vector-database hype cycle convinced engineers that lexical retrieval was legacy infrastructure. It isn’t. Every serious search vendor ships hybrid retrieval as the default today because the benchmarks forced them to. The teams that treat sparse retrieval as a second-class citizen lose on cost, latency, and out-of-domain queries — things their users actually feel.

Sparse retrieval has a property dense embeddings lack: you can read why a document was retrieved. Term weights are visible, attributable, debuggable. When a search system surfaces the wrong result, sparse lets you trace the failure to specific tokens. Dense vectors hide that reasoning inside an opaque space. As retrieval becomes the foundation of agentic systems, that auditability stops being a developer convenience and starts being a governance requirement.