BEIR

Beir Benchmark Evaluation Neural-Ir Zero-Shot Needs-Review

What it is

BEIR (Benchmarking-IR; Thakur et al., 2021) is a heterogeneous benchmark of 18 retrieval datasets spanning 9 task types (fact checking, question answering, biomedical retrieval, news retrieval, argument retrieval, duplicate-question retrieval, citation prediction, tweet retrieval, and entity retrieval) and a range of domains. In the standard setup, models are trained on MS MARCO and then evaluated on the BEIR datasets without any domain-specific fine-tuning, which measures zero-shot generalization. BEIR showed that strong MS MARCO performance does not reliably predict performance on other domains, a finding that changed how the field evaluates retrieval.

[illustrate: Model trained on MS MARCO → evaluated zero-shot on 18 BEIR datasets; results show dense models can underperform BM25 on specialized domains]

Datasets included

Biomedical:  TREC-COVID, BioASQ, NFCorpus
Science:     SciFact, SCIDOCS
News:        TREC-NEWS, Robust04
Tweet:       Signal-1M
Finance:     FiQA-2018
Argument:    ArguAna, Touché-2020
Duplicates:  Quora, CQADupStack
Fact-check:  FEVER, Climate-FEVER
Entity:      DBPedia-Entity
Q&A:         NQ, HotpotQA
In-domain:   MS MARCO (training source; not one of the 18 zero-shot datasets)

Key findings from the original paper

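BM25 is a strong zero-shot baseline: dense models trained on MS MARCO often underperform BM25 on BEIR datasets, especially in specialized domains
Domain gap matters: models with high MS MARCO MRR@10 do not necessarily generalize to other domains
Re-ranking and late-interaction models (BM25 + cross-encoder, ColBERT) give the best average zero-shot results in the paper, at a higher computational cost
Later work (not in the original paper): Contriever, trained without supervision, is competitive with BM25 on many BEIR datasets and improves further after fine-tuning on MS MARCO
Later work (not in the original paper): SPLADE reports strong BEIR results using learned sparse expansions, which in the reported results transfer better than MS MARCO-trained dense embeddings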
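
For reference, the BM25 baseline mentioned in the findings above can be sketched in a few lines. This is a minimal illustration and not the implementation used in the paper: the whitespace tokenizer, the Lucene-style IDF, the parameter values k1=1.2 and b=0.75, and the toy documents are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, no stemming or stopwords.
    return text.lower().split()


class BM25:
    """Tiny in-memory BM25 index over a dict of doc_id -> text."""

    def __init__(self, docs, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
        self.doc_ids = list(docs)
        # Term frequencies and length of each document.
        self.tfs = [Counter(tokenize(docs[d])) for d in self.doc_ids]
        self.lens = [sum(tf.values()) for tf in self.tfs]
        self.avg_len = sum(self.lens) / len(self.lens)
        # Document frequency: number of documents containing each term.
        df = Counter(term for tf in self.tfs for term in tf)
        n = len(self.doc_ids)
        # Lucene-style IDF, which is always positive.
        self.idf = {t: math.log(1 + (n - c + 0.5) / (c + 0.5))
                    for t, c in df.items()}

    def score(self, query, i):
        """BM25 score of document number i for the query."""
        tf = self.tfs[i]
        # Length normalization: longer-than-average documents are penalized.
        norm = 1 - self.b + self.b * self.lens[i] / self.avg_len
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f:
                total += self.idf[term] * f * (self.k1 + 1) / (f + self.k1 * norm)
        return total

    def search(self, query, k=10):
        """Return up to k doc ids with a non-zero score, best first."""
        scored = [(self.score(query, i), d) for i, d in enumerate(self.doc_ids)]
        scored = [(s, d) for s, d in scored if s > 0]
        scored.sort(key=lambda pair: -pair[0])
        return [d for _, d in scored[:k]]


# Hypothetical toy corpus, for illustration only.
toy_docs = {
    "d1": "aspirin reduces fever in adults",
    "d2": "stock markets fell sharply today",
    "d3": "fever and cough are common covid symptoms",
}
index = BM25(toy_docs)
print(index.search("covid fever"))
```

Because BM25 relies only on term statistics from the target corpus, it needs no training data, which is one reason it transfers across domains without adaptation.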

Evaluation protocol

Metric: NDCG@10 (primary)
Split: test set (no dev set tuning allowed for zero-shot evaluation)
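Aggregation: unweighted average of NDCG@10 across the datasets evaluated. Several datasets (BioASQ, Signal-1M, TREC-NEWS, Robust04) are not freely redistributable, so many papers average over a public subset; averages are comparable only when computed over the same set of datasets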
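
The protocol above can be sketched as follows. This is a minimal illustration under stated assumptions: it uses linear gain with a log2 rank discount (the trec_eval convention), treats unjudged documents as non-relevant, and the function names and toy data are hypothetical rather than taken from the BEIR codebase.

```python
import math


def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the first k graded relevance values."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_doc_ids, qrels_for_query, k=10):
    """NDCG@k for one query.

    ranked_doc_ids: doc ids in the order the system returned them.
    qrels_for_query: dict of doc_id -> graded relevance; missing ids count as 0.
    """
    gains = [qrels_for_query.get(d, 0) for d in ranked_doc_ids]
    ideal = sorted(qrels_for_query.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


def dataset_ndcg(run, qrels, k=10):
    """Mean NDCG@k over all judged queries; queries missing from run score 0."""
    scores = [ndcg_at_k(run.get(q, []), rels, k) for q, rels in qrels.items()]
    return sum(scores) / len(scores)


def benchmark_average(per_dataset_scores):
    """Unweighted mean across datasets: each dataset counts equally."""
    return sum(per_dataset_scores.values()) / len(per_dataset_scores)


# Hypothetical toy judgments and system output for one dataset.
toy_qrels = {"q1": {"d1": 1, "d2": 1}, "q2": {"d3": 2}}
toy_run = {"q1": ["d1", "d9", "d2"], "q2": ["d8", "d3"]}
toy_score = dataset_ndcg(toy_run, toy_qrels)
print(round(toy_score, 4))
```

Averaging per dataset rather than per query keeps large datasets from dominating the headline number, but it also means the average shifts whenever the set of datasets changes.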

Limitations

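Some datasets have small corpora (SciFact has roughly 5K documents) or sparse relevance annotations, so relevant but unjudged documents are scored as misses
“Zero-shot” is not always cleanly defined (some BEIR datasets overlap with MS MARCO in topic or source, and models may have seen similar text during pretraining)
English only; see MIRACL for multilingual retrieval evaluation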

Variants and history

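BEIR (2021) became the standard benchmark for zero-shot retrieval generalization. MTEB (Massive Text Embedding Benchmark) extends the approach beyond retrieval to 8 task types, with 58 datasets in its original release (56 of them in the English set), and draws its retrieval task from BEIR datasets. Later methods commonly report BEIR results, including GPL (unsupervised domain adaptation via generative pseudo labeling) and TART (instruction-tuned, task-aware retrieval). The BEIR finding that domain shift degrades dense retrieval helped motivate much of the subsequent work on unsupervised pretraining and domain adaptation for retrieval.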