From TF-IDF to Learned Sparse: Prerequisites and Hard Limits of BM25, SPLADE, and ELSER

ELI5
Sparse retrieval scores documents by matching tokens, not concepts. BM25 weights raw words. SPLADE and ELSER let a transformer expand which words count. All three obey the same physics: posting lists and vocabulary mismatch.
BM25 was published in 1994. SPLADE-v3 dropped in March 2024. On paper, three decades of NLP progress should have buried the older algorithm. Yet on out-of-domain benchmarks, BM25 still beats most dense neural retrievers, and even SPLADE's margin over it depends entirely on what kind of corpus you feed it. The interesting question is not which method wins. It is which physical constraint each one accepts.
What You Need Before You Touch a Sparse Retriever
Every sparse retriever, classical or neural, lives inside the same data structure: an inverted index. Tokens point to lists of documents that contain them. The retriever’s only job is to assign weights to those tokens and sum them at query time. If you understand that one sentence, the rest of the field is a series of design decisions about which weights.
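A minimal sketch of that one sentence, with hypothetical document IDs and weights; BM25, SPLADE, and ELSER differ only in how each token's weight is computed:

```python
from collections import defaultdict

# Inverted index: token -> postings list of (doc_id, weight).
index = defaultdict(list)

def add_document(doc_id, token_weights):
    for token, weight in token_weights.items():
        index[token].append((doc_id, weight))

def score(query_tokens):
    scores = defaultdict(float)
    for token in query_tokens:
        for doc_id, weight in index[token]:
            scores[doc_id] += weight  # retrieval is a weighted sum over postings
    return sorted(scores.items(), key=lambda kv: -kv[1])

add_document("d1", {"heart": 2.1, "attack": 1.7})
add_document("d2", {"myocardial": 2.4, "infarction": 2.6})
print(score(["heart", "attack"]))  # only d1 matches: token overlap is all that counts
```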
What do you need to know before learning sparse retrieval methods?
Three layers, in order — history, parameters, evaluation.
The first layer is where the weights came from. Karen Spärck Jones introduced “term specificity” in 1972, the idea that rare terms carry more information than common ones. Salton and McGill later formalized this as TF-IDF inside the vector-space model.
TF-IDF is not a single 1972 invention — it is a combination that took twenty years to settle. BM25 came in 1994 from Robertson, Walker, and the Okapi system at TREC-3, then was formalized as the probabilistic relevance framework by Robertson and Zaragoza in 2009. The shift from TF-IDF to BM25 was not better embeddings. It was a saturation function on term frequency and a length normalization term — k1 ≈ 1.2 to 2.0 for saturation, b ≈ 0.75 for length, per Stanford IR Book.
The second layer is what those parameters do. k1 controls how quickly extra repetitions of a term stop helping. With k1 = 0, every match scores the same regardless of frequency. As k1 grows, the score keeps climbing but with diminishing returns — a hyperbolic curve, not a line. b controls how harshly long documents are penalized for length. At b = 0, length is ignored. At b = 1, scores are fully normalized by document length. The defaults are empirical, not derived — they happen to work across a wide range of TREC and MS MARCO settings.
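A toy per-term scorer makes both knobs concrete. This follows the standard Robertson-Zaragoza formulation; the IDF value is passed in precomputed, and the numbers are purely illustrative:

```python
def bm25_term_score(tf, k1=1.2, b=0.75, dl=100, avgdl=100, idf=1.0):
    """Per-term BM25 contribution.

    tf: term frequency in the document
    dl / avgdl: document length relative to the corpus average
    idf: inverse document frequency of the term (precomputed)
    """
    length_norm = 1 - b + b * (dl / avgdl)  # b=0 ignores length, b=1 fully normalizes
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)  # saturates as tf grows

print([round(bm25_term_score(tf), 2) for tf in (1, 2, 5, 10)])
# -> [1.0, 1.38, 1.77, 1.96]: going from tf=5 to tf=10 adds far less
#    than the first occurrence did. That is the saturation curve.
```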
The third layer is what you evaluate against. Two corpora dominate the literature: MS MARCO, with about 8.8 million passages from Bing-derived web documents and roughly one million anonymized real queries, and BEIR, a zero-shot benchmark of eighteen IR datasets across nine task types, with query lengths from three to 192 words and corpora ranging from 3,600 to 15 million documents. MS MARCO is the supervised training distribution. BEIR measures what happens when you take the model out of that distribution. The two tell different stories about the same models, and conflating them is the most common mistake in sparse-retrieval blog posts.
For tooling, Pyserini is the bridge between papers and replication. Castorini’s v2.0.0 release on April 19, 2026 supports BM25, uniCOIL, SPLADE / SPLADEv2, DeepImpact, and SLIM on top of Anserini and Lucene 9 (Pyserini GitHub). If you cannot run a paper’s numbers in Pyserini, you do not yet have a baseline.
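As a sketch of what "run a paper's numbers" looks like, the following uses Pyserini's pre-2.0 module paths and a prebuilt-index name from its documentation; k1 = 0.9, b = 0.4 are the values Anserini tunes for MS MARCO rather than the textbook defaults:

```python
from pyserini.search.lucene import LuceneSearcher

# Downloads the prebuilt MS MARCO v1 passage index on first use.
searcher = LuceneSearcher.from_prebuilt_index("msmarco-v1-passage")
searcher.set_bm25(k1=0.9, b=0.4)  # MS MARCO-tuned, not the 1.2/0.75 defaults

hits = searcher.search("what is a myocardial infarction", k=10)
for hit in hits[:3]:
    print(hit.docid, round(hit.score, 2))
```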
What “Learned Sparse” Actually Learns
The conceptual leap from BM25 to learned sparse models like SPLADE and ELSER is not “neural networks read meaning.” It is a much narrower claim: a transformer can predict, at index time, which tokens should have been in a document but were not.
How SPLADE turns BERT into a sparse encoder
SPLADE — Sparse Lexical and Expansion Model for First Stage Ranking, originally proposed by Formal, Piwowarski, and Clinchant at SIGIR 2021 — uses a BERT masked language modeling head to score every token in BERT’s WordPiece vocabulary against a document, then a FLOPS regularizer to push most of those scores to zero (Pinecone Learn). What survives is a sparse, weighted bag of tokens over roughly thirty thousand dimensions. Some of those tokens appeared in the original document. Others did not — the model adds them because the masked-LM head expects them to belong there. That is the expansion mechanism.
The latest checkpoint, SPLADE-v3 from Lassance, Déjean, and colleagues in March 2024, reports more than 40 MRR@10 on the MS MARCO development set and roughly two percent improvement over SPLADE++ on BEIR (arXiv).
Not magic. A 30,000-dimensional sparse projection of a transformer’s beliefs about token presence.
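A sketch of that projection using Hugging Face transformers; the checkpoint is one of naver's published SPLADE weights (non-commercial, per the licensing notes below), and the activation follows the SPLADE papers (ReLU, log saturation, max-pooling over token positions):

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

name = "naver/splade-cocondenser-ensembledistil"  # CC-BY-NC-SA weights
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name).eval()

def splade_encode(text: str) -> torch.Tensor:
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits            # (1, seq_len, ~30k vocab)
    # ReLU then log-saturation, masked to real tokens,
    # max-pooled over positions -> one weight per vocabulary entry
    sat = torch.log1p(torch.relu(logits)) * inputs["attention_mask"].unsqueeze(-1)
    return sat.max(dim=1).values.squeeze(0)

vec = splade_encode("heart attack symptoms")
top = vec.topk(10)
print([(tok.decode([int(i)]), round(float(w), 2)) for i, w in zip(top.indices, top.values)])
# Tokens the input never contained (e.g. clinical synonyms) can surface
# with nonzero weight: that is the learned expansion.
```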
How ELSER differs
ELSER is Elastic’s production-oriented learned-sparse encoder. The architecture is similar in spirit to SPLADE — sparse vector with weighted token expansions over a fixed vocabulary, stored in Elasticsearch’s sparse_vector or rank_features field types. The differences are operational, not theoretical. ELSER v2 is generally available; v1 is still labeled technical preview. The optimized v2 model ingests around twenty-six documents per second on the minimum 4 GB ML node; the cross-platform v2 ingests around sixteen, and v1 around fourteen (Elastic Docs). It is English-only — non-English content needs the multilingual E5 model, or, since Elasticsearch 9.3, Jina v5 via Elastic Inference Service.
The line between SPLADE and ELSER is not “research vs production.” It is licensing and language scope. SPLADE’s reference repository (naver/splade) is released under CC-BY-NC-SA 4.0, which means its weights cannot be deployed commercially without a separate license. ELSER ships under the Elastic license inside the search engine you are presumably already paying for.
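For orientation, a hedged sketch of querying ELSER through the Python client, using the sparse_vector query available since Elasticsearch 8.15; the index name and field are placeholders, and ELSER v2 must already be deployed on an ML node:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical cluster

resp = es.search(
    index="my-index",  # placeholder index with a sparse_vector mapping
    query={
        "sparse_vector": {
            "field": "content_embedding",            # placeholder field name
            "inference_id": ".elser-2-elasticsearch",  # preconfigured ELSER v2 endpoint
            "query": "how do I treat a myocardial infarction?",
        }
    },
)
for hit in resp["hits"]["hits"][:3]:
    print(hit["_id"], hit["_score"])
```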

Where Each Method Snaps
The reason no sparse retriever has retired the others is that each one fails differently. BM25 fails on vocabulary. SPLADE fails on cost and license. ELSER fails on language and lock-in.
What are the technical limitations of BM25 and learned sparse retrievers?
BM25 fails at vocabulary mismatch. BM25 matches strings, not concepts. If a query says “heart attack” and the document says “myocardial infarction,” BM25 sees no overlap. The Qdrant Article calls this the “vocabulary mismatch problem,” and it is structural, not tunable. You cannot fix it with k1 or b. You can only fix it by adding synonym lists, query expansion, or by switching to a model that learned synonyms during training.
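The failure is easy to reproduce with the rank-bm25 package; the corpus here is invented for illustration:

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

docs = [
    "myocardial infarction treatment guidelines".split(),
    "heart attack first aid steps".split(),
    "seasonal influenza vaccination schedule".split(),
]
bm25 = BM25Okapi(docs)
print(bm25.get_scores("heart attack".split()))
# doc 0 ("myocardial infarction ...") scores 0.0 despite describing the same
# condition; only doc 1 has literal token overlap with the query.
```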
SPLADE fails at latency and license. Learned sparse retrievers are slower than BM25 at both ends of the pipeline. At index time, every document must run through a transformer encoder; at query time, the expanded posting lists are longer, which means more list traversals during scoring. The Two-Step SPLADE paper from Lassance and colleagues exists precisely because this latency was treated as a research problem worth a paper. Static pruning, guided traversal, and two-step scoring all reduce the cost — but none eliminate it. On top of that, the reference SPLADE weights are non-commercial. A team that built a prototype on naver/splade cannot release it commercially without switching to a derivative encoder.
ELSER fails at language scope. ELSER does not handle non-English content. If your corpus is multilingual, ELSER alone will under-retrieve in every language except English. Elastic’s own recommendation is to switch to the E5 multilingual encoder, or to the newer Jina v5 path in Elasticsearch 9.3+. Treating ELSER as “Elastic’s general-purpose semantic search” is a misreading of the documentation.
The shared limit cuts across all three. BEIR is a zero-shot benchmark. It tells you how a retriever behaves on corpora it never saw. SPLADE wins zero-shot on average; on in-domain MS MARCO, dense retrievers and learned sparse models are usually competitive or stronger than BM25 (BEIR paper, arXiv). The framing “sparse beats dense” is overstated: the per-task, per-corpus picture determines what wins in practice.
Compatibility & licensing notes:
- SPLADE weights — CC-BY-NC-SA 4.0: naver/splade is non-commercial. For production, use derivative open-licensed encoders (e.g., naver/efficient-splade-VI-BT-large-doc) or engine-provided alternatives (Qdrant miniCOIL, Elastic ELSER).
- SPLADE repo maintenance: last release October 2023, last code activity November 2023 (naver/splade GitHub). Treat as a research codebase, not a maintained library.
- ELSER v1 — technical preview: v2 is the supported choice for new deployments; v1 tutorials are stale.
What This Means for Your Index
The mechanism predicts a few things you can use without re-running benchmarks.
- If your corpus is in-domain and narrow, and your queries match document vocabulary closely, BM25 with default k1 and b will be a brutally hard baseline to beat. Most “we replaced BM25 with embeddings” projects regress here and never publish the result.
- If your queries paraphrase heavily and your latency budget allows for a transformer pass at index time, learned sparse will recover the synonyms that BM25 misses — but you pay for it on every retrieval.
- If your content is multilingual and you reach for ELSER, expect English documents to dominate your top-k and other languages to under-retrieve until you switch encoders.
- If you tune k1 from the default toward zero, expect rare-term (high-IDF) matches to dominate; toward higher values, expect repeated-term matches to keep climbing. The curve is not linear; see the sweep below.
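A quick sweep, reusing the hypothetical bm25_term_score sketch from earlier (dl = avgdl, so length normalization drops out), shows the shape of that curve:

```python
for k1 in (0.0, 0.6, 1.2, 2.0):
    row = [round(bm25_term_score(tf, k1=k1), 2) for tf in (1, 2, 5, 10)]
    print(f"k1={k1}: {row}")
# k1=0.0: [1.0, 1.0, 1.0, 1.0]    every match scores the same; IDF alone decides
# k1=2.0: [1.0, 1.5, 2.14, 2.5]   repetitions keep paying, with diminishing returns
```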
Rule of thumb: Use BM25 as your floor, log a RAG Evaluation run with both BM25 and a learned sparse model on your real query distribution, and only adopt the learned model where it wins on metrics you measured — not metrics from someone else’s corpus.
When it breaks: Learned sparse retrievers fail when latency budgets are tight, the corpus is multilingual, or the license forbids the use case — and BM25 fails when your users paraphrase faster than your synonym lists can keep up.
The Data Says
BM25 is the strongest sparse baseline because its parameters describe two physical phenomena — term saturation and document length — that no amount of pretraining can erase. SPLADE and ELSER win on metrics that reward synonym recovery and lose on metrics that reward speed and licensing flexibility. Picking the right sparse retriever is picking which failure mode you can absorb.