MS MARCO
What it is
MS MARCO (Microsoft MAchine Reading COmprehension, Nguyen et al., 2016) is the most widely used benchmark dataset for passage retrieval and document ranking. Its passage collection contains 8.8 million passages extracted from web documents, paired with roughly 1 million queries sampled from Bing search logs, of which about 502,000 training queries have at least one annotated positive passage. Relevance annotations are sparse (typically 1 positive passage per query). Most modern neural retrieval models are trained on MS MARCO, evaluated on it, or both.
[illustrate: 8.8M passage corpus; ~1M queries overall, ~502K training queries with an annotated positive; dev set of 6,980 queries; each query has ~1 annotated relevant passage; evaluation by MRR@10 and Recall@1000]
Dataset structure
Passage corpus: 8,841,823 passages (avg ~60 words)
- Sourced from web documents indexed by Bing
- Passage-level (not full documents)
Training queries: ~502,000 (with at least 1 annotated positive)
Dev queries: 6,980
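Test queries: labels withheld (the held-out eval set is scored through MS MARCO leaderboard submissions; the TREC Deep Learning Track uses its own, separate test queries)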
Relevance labels:
- Binary: the qrels files list only positives (label 1); any passage not listed is treated as not relevant (0)
- Sparse: typically 1 positive per query
- NOT exhaustively annotated (false negatives are common)
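To make the label format concrete, here is a minimal sketch of reading relevance labels into a query-to-positives map. It assumes the TREC-style qrels layout of four whitespace-separated fields (query ID, an unused iteration field, passage ID, label); the function name `load_qrels` and the sample IDs are hypothetical and chosen only for illustration.

```python
import io
from collections import defaultdict


def load_qrels(lines):
    """Parse TREC-style qrels lines of the form 'qid iteration pid label'.

    Returns a dict mapping each query ID to the set of passage IDs
    labelled relevant. Passages that never appear are implicitly
    treated as not relevant.
    """
    qrels = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        qid, _iteration, pid, label = parts
        if int(label) > 0:
            qrels[qid].add(pid)
    return dict(qrels)


# Tiny made-up sample; these IDs are illustrative, not real MS MARCO entries.
sample = io.StringIO("q1\t0\tp10\t1\nq2\t0\tp20\t1\nq2\t0\tp21\t1\n")
qrels = load_qrels(sample)

# Average number of annotated positives per query (close to 1 on MS MARCO).
positives_per_query = sum(len(p) for p in qrels.values()) / len(qrels)
print(qrels)
print(positives_per_query)
```

With real data, the same function can be passed an open file handle instead of the in-memory sample.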
Standard tasks
Passage ranking: given a query and top-1000 candidates (from BM25), re-rank by relevance. Metric: MRR@10.
Passage retrieval: retrieve directly from all 8.8M passages. Metric: Recall@1000 (first stage), MRR@10 (after reranking).
Document ranking: the same tasks on full documents (about 3.2M documents in the original document corpus). Metric: MRR@100 on the MS MARCO dev set; NDCG@10 on TREC Deep Learning queries.
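The two passage-level metrics can be sketched as follows. This is an illustrative implementation, not the official evaluation script: `run` (query ID to ranked list of passage IDs) and `qrels` (query ID to set of relevant passage IDs) are assumed in-memory structures, and the toy data is invented. Both metrics are averaged over every query in `qrels`, so a query with no relevant passage in the cutoff contributes 0.

```python
def mrr_at_k(run, qrels, k=10):
    """Mean reciprocal rank of the first relevant passage within the top k."""
    total = 0.0
    for qid, positives in qrels.items():
        ranked = run.get(qid, [])[:k]
        for rank, pid in enumerate(ranked, start=1):
            if pid in positives:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(qrels)


def recall_at_k(run, qrels, k=1000):
    """Mean fraction of annotated positives found within the top k."""
    total = 0.0
    for qid, positives in qrels.items():
        retrieved = set(run.get(qid, [])[:k])
        total += len(retrieved & positives) / len(positives)
    return total / len(qrels)


# Toy example with one annotated positive per query (IDs are made up).
qrels = {"q1": {"p1"}, "q2": {"p7"}, "q3": {"p9"}}
run = {
    "q1": ["p1", "p2", "p3"],  # positive at rank 1 -> reciprocal rank 1.0
    "q2": ["p4", "p7", "p5"],  # positive at rank 2 -> reciprocal rank 0.5
    "q3": ["p6", "p8", "p2"],  # positive missing   -> reciprocal rank 0.0
}

mrr = mrr_at_k(run, qrels, k=10)       # (1.0 + 0.5 + 0.0) / 3 = 0.5
recall = recall_at_k(run, qrels, k=2)  # (1 + 1 + 0) / 3
print(mrr, recall)
```

Because most queries have a single annotated positive, per-query Recall@k is usually either 0 or 1, and MRR@10 rewards placing that one passage as high as possible.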
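Representative results (passage ranking, dev set, MRR@10 × 100)
Figures are approximate. MonoBERT and MonoT5 are rerankers applied to a first-stage candidate list, while BM25, ColBERT, and SPLADE++ are typically reported as retrievers over the full collection, so the rows are not strictly like-for-like.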
BM25 baseline: ~18.4
MonoBERT (BERT-large): ~37.2
ColBERT: ~36.0
SPLADE++: ~38.5
MonoT5-3B: ~39.0
Limitations
- Sparse judgments: only ~1 positive per query; true recall cannot be measured
- Domain: web search queries; may not represent specialized domains
- Short passages: 60-word average may not reflect real document retrieval
- English only: the original dataset is English; for multilingual work, use mMARCO (machine-translated MS MARCO) or MIRACL
Variants and history
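The original MS MARCO release (2016) was a reading comprehension (question answering) dataset; the passage ranking task was later derived from its queries and passages. The TREC Deep Learning Track (from 2019) provides small sets of MS MARCO queries with much denser, graded relevance judgments obtained by pooling system runs; these are far more complete than the sparse MS MARCO labels but still not exhaustive. MS MARCO v2 (2021) replaced the corpora with much larger ones: about 138M passages and 12M documents. mMARCO provides machine-translated multilingual versions of the passage dataset.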