BEIR
What it is
BEIR (Benchmarking IR, Thakur et al., 2021) is a heterogeneous benchmark of 18 retrieval datasets spanning 9 task types and a range of domains (biomedical, scientific, news, finance, argument retrieval, fact checking, and more). In the standard setup, models are trained on MS MARCO and then evaluated on the BEIR datasets without any domain-specific fine-tuning, which measures zero-shot generalization. BEIR showed that strong MS MARCO performance does not reliably predict performance on other domains, a finding that changed how the field evaluates retrieval.
[illustrate: Model trained on MS MARCO → evaluated zero-shot on 18 BEIR datasets; results show dense models can underperform BM25 on specialized domains]
Datasets included
Biomedical: TREC-COVID, BioASQ, NFCorpus
Science: SciFact, SCIDOCS
News: TREC-News, Robust04
Finance: FiQA-2018
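Argument: ArguAna, Touché-2020
Fact-check: FEVER, Climate-FEVER
Entity: DBPedia (DBPedia-Entity)
Q&A: NQ, HotpotQA
Duplicate question: Quora, CQADupStack
Tweet: Signal-1M
In-domain reference: MS MARCO (the usual training set, not counted among the 18 zero-shot datasets)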
Key findings from the original paper
- BM25 is a strong zero-shot baseline: Dense models trained on MS MARCO often underperform BM25 on BEIR datasets, especially specialized domains
- Domain gap matters: Models with high MS MARCO MRR@10 don’t necessarily generalize
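- Later work (not in the original paper): Contriever, trained without labeled data, reports results competitive with BM25 and with supervised dense models on many out-of-domain BEIR tasks
- Later work (not in the original paper): SPLADE reports strong BEIR performance by learning sparse term expansions, which appear to generalize better than single-vector dense embeddings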
Evaluation protocol
Metric: NDCG@10 (primary)
Split: test set (no dev set tuning allowed for zero-shot evaluation)
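Aggregation: unweighted mean of per-dataset NDCG@10 across the datasets evaluated (several datasets require separate licenses, so many papers report the average over the publicly downloadable subset; check which subset a reported average covers before comparing numbers)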
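The protocol above can be sketched in a few lines. This is a minimal illustration, not the official evaluation code: it assumes judgments and retrieval results are stored as nested dictionaries keyed by query ID and document ID, uses linear gain with a log2 rank discount, and scores a query as 0 when it has no results or no relevant documents. The names `ndcg_at_k` and `macro_average` are hypothetical helpers introduced here. Tie-breaking and handling of unjudged queries may differ from standard evaluation tools, so use an established evaluator for reported numbers.

```python
import math
from typing import Dict, Iterable

Qrels = Dict[str, Dict[str, int]]   # query_id -> {doc_id: graded relevance}
Run = Dict[str, Dict[str, float]]   # query_id -> {doc_id: retrieval score}


def dcg(gains: Iterable[float]) -> float:
    """Discounted cumulative gain with linear gain and log2 discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg_at_k(qrels: Qrels, run: Run, k: int = 10) -> float:
    """Mean NDCG@k over all judged queries (assumption: missing queries score 0)."""
    scores = []
    for qid, judged in qrels.items():
        # Rank the retrieved documents by descending score and keep the top k.
        ranked = sorted(run.get(qid, {}).items(), key=lambda kv: kv[1], reverse=True)[:k]
        gains = [judged.get(doc_id, 0) for doc_id, _ in ranked]
        # The ideal ranking places the most relevant judged documents first.
        ideal = sorted(judged.values(), reverse=True)[:k]
        idcg = dcg(ideal)
        scores.append(dcg(gains) / idcg if idcg > 0 else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


def macro_average(per_dataset: Dict[str, float]) -> float:
    """Unweighted mean of per-dataset scores; every dataset counts equally."""
    return sum(per_dataset.values()) / len(per_dataset)


if __name__ == "__main__":
    # Toy data for illustration only; these are not real BEIR results.
    toy_qrels = {"q1": {"d1": 1, "d2": 0}}
    toy_run = {"q1": {"d1": 0.9, "d2": 0.1}}
    per_dataset = {"dataset_a": ndcg_at_k(toy_qrels, toy_run)}
    print(macro_average(per_dataset))
```

Because the average is unweighted, a small dataset with a few dozen queries moves the headline number as much as a large one, which is one reason per-dataset scores are usually reported alongside the mean.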
Limitations
- Some datasets have small corpora or limited annotations
- “Zero-shot” is not always cleanly defined (some BEIR datasets overlap with MS MARCO topics)
- English-only; see MIRACL for multilingual retrieval evaluation
Variants and history
BEIR (2021) is now the standard zero-shot generalization benchmark for retrieval. MTEB (Massive Text Embedding Benchmark) extends the BEIR approach to 8 task types covering 58 datasets, 56 of them in the commonly reported English subset, and reuses BEIR datasets for its retrieval task. Later methods such as GPL (unsupervised domain adaptation) and TART (instruction-tuned retrieval) are evaluated on BEIR. The BEIR finding that domain shift degrades dense retrieval helped motivate much of the subsequent research on unsupervised and domain-adaptive retrieval.