Contriever
What it is
Contriever (Izacard et al., 2021) is a dense retrieval model trained without labeled relevance data. It trains a bi-encoder with contrastive learning on unlabeled text, creating positive pairs by sampling two independent spans from the same document. The model learns that two passages from the same source should be closer in embedding space than passages from different documents, which uses document-level co-occurrence as a proxy for relevance.
[illustrate: Two spans from same document as positive pair; spans from different documents as negatives; contrastive training objective]
How it works
Unsupervised positive pair construction:
- Sample a document at random
- Draw two independent random spans (50–200 tokens each)
- These form the positive (query, passage) pair
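The construction above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the helper names `sample_span` and `make_positive_pair` are hypothetical, the document is assumed to be already tokenized into a list, and the 50–200 token bounds are taken from the description above.

```python
import random

def sample_span(tokens, min_len, max_len, rng):
    """Return one contiguous random span of tokens (hypothetical helper)."""
    # Clamp the span length so short documents still work.
    upper = min(max_len, len(tokens))
    lower = min(min_len, upper)
    length = rng.randint(lower, upper)
    start = rng.randint(0, len(tokens) - length)
    return tokens[start:start + length]

def make_positive_pair(tokens, min_len=50, max_len=200, rng=None):
    """Draw two spans independently from the same document."""
    rng = rng or random.Random()
    # The two draws are independent, so the spans may overlap.
    query = sample_span(tokens, min_len, max_len, rng)
    passage = sample_span(tokens, min_len, max_len, rng)
    return query, passage

# Toy document: 300 placeholder tokens.
doc = [f"tok{i}" for i in range(300)]
query, passage = make_positive_pair(doc, rng=random.Random(0))
```

Because the two spans are drawn independently, they can overlap or even coincide. The sketch makes no attempt to prevent that.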
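Training objective:
- Contrastive loss (InfoNCE) that scores each positive pair against a set of negatives
- Spans from other documents serve as negatives, taken from the same batch or, in the released model, from a MoCo-style queue of embeddings from previous batches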
- No human relevance labels required
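As a rough sketch of the objective, the following pure-Python function computes an InfoNCE loss in which `passages[i]` is the positive for `queries[i]` and every other passage in the batch acts as a negative. The function name `info_nce`, the dot-product scoring, and the temperature value of 0.05 are assumptions made for illustration. A real training loop would use a tensor library and learned encoder outputs rather than hand-written vectors.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def info_nce(queries, passages, temperature=0.05):
    """Mean InfoNCE loss; passages[i] is the positive for queries[i]."""
    losses = []
    for i, q in enumerate(queries):
        # Score the query against every passage in the batch.
        logits = [dot(q, p) / temperature for p in passages]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of the matching passage.
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)

# Toy embeddings: each query matches the passage at the same index.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
shuffled = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = info_nce(queries, aligned)
loss_shuffled = info_nce(queries, shuffled)
```

The loss is near zero when each query is most similar to its own passage, and large when the positives are mismatched. That difference is the signal that pulls same-document spans together.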
Data augmentation:
- Inverse cloze task and independent (random) cropping compared, along with token-level augmentations such as random word deletion
- Independent cropping performs best in practice
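To make the difference between the two cropping strategies concrete, here is a small sketch. It assumes a fixed span length no longer than the document, and the function names are hypothetical. In the inverse cloze task the query span is removed from the passage, so the two never share tokens; with independent cropping both spans are drawn freely and may overlap.

```python
import random

def inverse_cloze_pair(tokens, span_len, rng):
    """Query is a span; the passage is the document with that span removed."""
    # Assumes span_len <= len(tokens).
    start = rng.randint(0, len(tokens) - span_len)
    query = tokens[start:start + span_len]
    context = tokens[:start] + tokens[start + span_len:]
    return query, context

def independent_crop_pair(tokens, span_len, rng):
    """Query and passage are cropped independently, so they may overlap."""
    def crop():
        start = rng.randint(0, len(tokens) - span_len)
        return tokens[start:start + span_len]
    return crop(), crop()

# Toy document of 20 unique placeholder tokens.
doc = [f"tok{i}" for i in range(20)]
rng = random.Random(0)
ict_query, ict_context = inverse_cloze_pair(doc, 5, rng)
crop_a, crop_b = independent_crop_pair(doc, 5, rng)
```

Overlap between independently cropped spans gives the model some lexical matching signal. That is one plausible reading of why this strategy works well, though the sketch itself does not demonstrate it.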
mContriever:
- Multilingual variant trained on 29 languages
- Particularly strong for cross-lingual retrieval without labeled translation pairs
Example
Document: "The Amazon rainforest covers over 5.5 million km²
and is home to 10% of all species on Earth..."
Positive pair:
span_1 (query proxy): "The Amazon rainforest covers over 5.5 million km²"
span_2 (passage proxy): "home to 10% of all species on Earth"
Negative: any span from a different document in the batch
Training signal: pull span_1 and span_2 together,
push both away from all other-document spans
Variants and history
Contriever (2021, Meta AI) demonstrated that unsupervised dense retrieval can be competitive with DPR on several BEIR tasks, which was surprising since DPR is trained on tens of thousands of labeled question-passage pairs. mContriever extended the approach to multilingual retrieval. Later work (DRAGON, E5) combines unsupervised pre-training with supervised fine-tuning. Contriever is particularly valuable for domain adaptation where labeled data is scarce.
When to use it
Use Contriever when:
- No labeled query-document pairs are available for the target domain
- Domain adaptation is needed (biomedical, legal, code)
- Multilingual retrieval across low-resource languages is required (mContriever)
- You want a strong initialization before supervised fine-tuning
When in-domain labeled data is available, supervised models such as DPR or ANCE typically outperform Contriever used without fine-tuning.