BEIR Benchmark
Also known as: BEIR, Benchmarking IR, BEIR evaluation suite
- BEIR (Benchmarking Information Retrieval) is a heterogeneous zero-shot evaluation benchmark of 18 publicly available datasets across 9 task types. Models train on MS MARCO and are tested out-of-domain on BEIR using nDCG@10, measuring how well retrieval methods generalize beyond their training distribution.
The BEIR benchmark is a zero-shot information retrieval evaluation suite of 18 datasets across 9 task types, used to measure how well search models generalize beyond their training data.
What It Is
When a vendor demos a sparse or dense retrieval model that “beats BM25 by 30%,” the next question should always be: on which data? BEIR exists to answer that question honestly. Almost every retrieval model looks great on the dataset it was trained on. BEIR strips that advantage away — it forces models to perform on data they never saw during training, which is much closer to how search behaves the moment you point it at your own corpus.
According to arXiv, BEIR collects 18 publicly available retrieval datasets covering 9 task types: fact-checking (FEVER, Climate-FEVER), question answering (Natural Questions, HotpotQA), citation prediction (SCIDOCS), biomedical search (TREC-COVID, NFCorpus), argument retrieval, news search, duplicate-question detection, tweet retrieval, and entity retrieval. Corpus sizes range from roughly 3,600 documents up to 15 million. Average query length spans 3 to 192 words. The standard protocol is straightforward: train your model on MS MARCO, evaluate zero-shot on each BEIR dataset, and report nDCG@10 — a ranking metric that rewards placing relevant documents inside the top 10 results.
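To make the metric concrete, here is a minimal nDCG@10 sketch in plain Python. It uses the linear-gain form of DCG (BEIR's toolkit delegates scoring to pytrec_eval, so the exact gain function it applies may differ slightly), and the query, document ids, and relevance judgments below are invented for illustration.

```python
import math

def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the top-k results, linear gain."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_doc_ids, judgments, k=10):
    """nDCG@k for a single query.

    ranked_doc_ids: doc ids in the order the retriever returned them.
    judgments: {doc_id: graded relevance} labels for this query.
    """
    gains = [judgments.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal = sorted(judgments.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

# Toy query with two judged-relevant documents: one ranked first, one ranked fifth.
judgments = {"d3": 2, "d7": 1}
ranking = ["d3", "d1", "d2", "d9", "d7"]
print(round(ndcg_at_k(ranking, judgments), 3))  # ~0.907
```

A retriever that pushes both relevant documents into the first two positions would score 1.0 on this query; one that misses them inside the top 10 scores 0.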
The reference Python toolkit is the beir-cellar/beir repository on GitHub, which handles data loading, scoring, and aggregation so different research groups produce comparable numbers. Models report per-dataset scores plus an average across the suite. The original paper’s headline finding — BM25 is a surprisingly tough zero-shot baseline, while many dense retrievers underperform it out-of-domain — shaped the next several years of IR research. By 2026, learned-sparse and dense models such as SPLADE, ELSER, E5, and BGE match or exceed BM25 on the BEIR mean, but the per-dataset picture stays mixed and BM25 still wins on several individual tasks.
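As a rough sketch of what running one dataset looks like, the snippet below follows the toolkit's documented quickstart pattern: download a dataset, retrieve with an MS MARCO-trained dense encoder, and report nDCG@10. Module paths, the model name, and the dataset mirror URL are taken from the project's README and may have changed in newer releases.

```python
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# Download and unzip one BEIR dataset (SciFact is small and quick to run).
dataset = "scifact"
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

# Wrap an MS MARCO-trained dense encoder and retrieve candidates for every test query.
model = DRES(models.SentenceBERT("msmarco-distilbert-base-tas-b"), batch_size=64)
retriever = EvaluateRetrieval(model, score_function="dot")
results = retriever.retrieve(corpus, queries)

# Score against the dataset's relevance judgments; nDCG@10 is the headline number.
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg["NDCG@10"])
```

Repeating this loop over all 18 datasets and averaging the nDCG@10 values is what produces the single-number BEIR score that papers and vendor tables report.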
How It’s Used in Practice
When you read a paper or vendor whitepaper about a new retrieval model — BM25 variants, SPLADE-v3, ELSER, ColBERT, dense bi-encoders — the BEIR average nDCG@10 is usually the first headline number reported. It functions as the IR equivalent of MMLU for language models: a single-number sanity check against a respected, multi-domain test set that everyone in the field already trusts.
If you’re evaluating retrieval for your own product, BEIR scores tell you which model class generalizes well across domains; they do not tell you which model wins on your specific data. A model that tops the BEIR average can still lose on your corpus because your documents and query patterns are not part of the BEIR mix. Treat BEIR as a shortlist filter, then run an in-domain evaluation on labeled queries from your actual users before committing to a production choice.
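One way to run that in-domain check is to reuse the same scorer the benchmark uses, so your numbers are computed the same way as the published ones. The sketch below assumes the toolkit's evaluate helper accepts plain dictionaries of judgments and run scores, as in its documented usage; the query and document ids are hypothetical.

```python
from beir.retrieval.evaluation import EvaluateRetrieval

# Your own relevance judgments, gathered from real user queries:
# {query_id: {doc_id: graded relevance}}
qrels = {
    "q1": {"doc_204": 2, "doc_977": 1},
    "q2": {"doc_318": 1},
}

# Whatever your candidate retriever returned for those queries:
# {query_id: {doc_id: score}}
results = {
    "q1": {"doc_204": 12.3, "doc_551": 9.8, "doc_977": 4.1},
    "q2": {"doc_042": 7.6, "doc_318": 7.2},
}

# Score the in-domain run with the same metric the benchmark reports.
ndcg, _map, recall, precision = EvaluateRetrieval.evaluate(qrels, results, k_values=[10])
print(ndcg)  # e.g. {"NDCG@10": ...}
```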
Pro Tip: Don’t trust the BEIR average alone. Open the per-dataset breakdown — some models are excellent on TREC-COVID but mediocre on FiQA. Find the BEIR datasets closest to your domain (scientific, financial, biomedical, news) and weigh those scores higher than the headline mean.
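A back-of-the-envelope way to apply that advice is to re-average the per-dataset scores with weights that reflect your own domain. The scores below are made up for illustration, not published results.

```python
# Hypothetical per-dataset nDCG@10 scores for two candidate models (not real published numbers).
scores = {
    "model_a": {"trec-covid": 0.74, "nfcorpus": 0.34, "scifact": 0.69, "fiqa": 0.33},
    "model_b": {"trec-covid": 0.65, "nfcorpus": 0.37, "scifact": 0.72, "fiqa": 0.41},
}

# Weight the datasets closest to your domain (here: biomedical) more heavily.
weights = {"trec-covid": 3.0, "nfcorpus": 3.0, "scifact": 2.0, "fiqa": 1.0}

def weighted_mean(per_dataset, weights):
    total = sum(weights.values())
    return sum(per_dataset[name] * w for name, w in weights.items()) / total

for model, per_dataset in scores.items():
    print(model, round(weighted_mean(per_dataset, weights), 3))
```

A model that wins the unweighted BEIR mean can lose this weighted comparison, which is exactly the signal the headline average hides.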
When to Use / When Not
| Scenario | Use | Avoid |
|---|---|---|
| Comparing zero-shot generalization across retrieval models | ✅ | |
| Final go/no-go decision for search on a proprietary corpus | | ❌ |
| Sanity-checking a vendor’s “we beat BM25” claim | ✅ | |
| Measuring multi-turn conversational or chat-style retrieval | | ❌ |
| Building a shortlist before in-domain evaluation | ✅ | |
| Benchmarking on a corpus where you already have labeled queries | | ❌ |
Common Misconception
Myth: The model with the highest BEIR average is the best retrieval model for your project. Reality: BEIR measures out-of-domain generalization across 18 specific public datasets. Your corpus is not one of them. A strong BEIR score makes a model worth shortlisting, but the only honest evaluation is to test on your own labeled queries against your own documents.
One Sentence to Remember
Treat BEIR as a generalization stress test, not a verdict — it tells you which retrievers travel well across domains, but only your own labeled data tells you which one wins on the search problem you actually have.
FAQ
Q: What does BEIR stand for? A: BEIR stands for Benchmarking Information Retrieval. It is a heterogeneous zero-shot evaluation suite published at NeurIPS 2021 that bundles 18 retrieval datasets covering 9 distinct task types.
Q: What metric does BEIR use? A: According to arXiv, BEIR reports nDCG@10 across all its datasets — a ranking metric that scores how well a retriever places relevant documents inside the top 10 results for each query.
Q: Does BM25 still beat dense retrievers on BEIR? A: Not on the average anymore. Modern dense and learned-sparse models match or exceed BM25 on the BEIR mean, but BM25 still wins on several individual datasets, so per-task results matter.
Sources
- arXiv: BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models - The original 2021 paper introducing BEIR by Thakur, Reimers, Rücklé, Srivastava, and Gurevych.
- BEIR GitHub: beir-cellar/beir - Reference Python toolkit for loading datasets and evaluating retrieval models on the BEIR suite.
Expert Takes
The interesting move in BEIR is the methodological insistence on zero-shot evaluation. Train on one distribution, test on many. That single design choice exposed an uncomfortable truth: a lot of dense retrievers were memorizing MS MARCO regularities rather than learning generalizable ranking. Statistics, not magic. The benchmark forces an honest separation between in-domain fitting and out-of-domain capability — the same separation that distinguishes a useful retrieval model from a brittle one.
BEIR is essentially a contract between model authors and users: declare your retriever, train on the agreed source distribution, evaluate on the named datasets, report the named metric. The value is not the score itself; it is the shared specification. When you open a retrieval paper, the BEIR table tells you whether the authors followed the contract. Without that shared spec, claims like “we beat BM25” would be untestable across different labs and different evaluation pipelines.
BEIR became the unofficial entry ticket for retrieval models the moment vendors started quoting average nDCG@10 in their whitepapers. If your model isn’t on the BEIR table, buyers assume you are hiding something. That dynamic flipped the field: instead of competing on a single proprietary dataset, retrieval vendors now compete on out-of-domain generalization. The companies that internalized this shift early own the learned-sparse and embedding-API conversation today.
BEIR is built from public datasets — that is its strength and its blind spot. The benchmark cannot tell you how a retriever behaves on legal contracts, medical records, or internal corporate prose, because that data cannot be released. So the leaderboard rewards models that generalize across what is publishable, not across what people actually search at work. Whose retrieval problem does this benchmark really represent?