Contriever

Contriever Dense-Retrieval Unsupervised Contrastive-Learning Neural-Ir Needs-Review

What it is

Contriever (Izacard et al., 2021) is a dense retrieval model that requires no labeled relevance data. It trains a bi-encoder with contrastive learning on unlabeled text, building positive pairs by sampling two independent spans from the same document. The model learns to place two passages from the same source closer together in embedding space than passages from different documents, using document-level co-occurrence as a proxy for relevance.

[illustrate: Two spans from same document as positive pair; spans from different documents as negatives; contrastive training objective]

How it works

Unsupervised positive pair construction:
- Sample a document at random
- Draw two independent random spans (50–200 tokens each)
- These form the positive (query, passage) pair
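
A minimal sketch of the sampling steps above, assuming the document is already tokenized into a list. The function names, the default length bounds (taken from the 50–200 range quoted above), and the clamping for short documents are illustrative assumptions, not the reference implementation.

```python
import random
from typing import List, Optional, Tuple


def sample_span(tokens: List[str], min_len: int, max_len: int,
                rng: random.Random) -> List[str]:
    """Return one contiguous random span of tokens."""
    # Clamp the bounds so documents shorter than min_len still yield a span.
    hi = min(max_len, len(tokens))
    lo = min(min_len, hi)
    length = rng.randint(lo, hi)
    start = rng.randint(0, len(tokens) - length)
    return tokens[start:start + length]


def sample_positive_pair(tokens: List[str], min_len: int = 50, max_len: int = 200,
                         rng: Optional[random.Random] = None
                         ) -> Tuple[List[str], List[str]]:
    """Draw two spans independently from the same document."""
    if rng is None:
        rng = random.Random()
    query = sample_span(tokens, min_len, max_len, rng)    # query proxy
    passage = sample_span(tokens, min_len, max_len, rng)  # passage proxy
    return query, passage
```

Because the two spans are drawn independently, they may overlap or even coincide; nothing in this sketch prevents that.
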

Training objective:
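- Contrastive loss (InfoNCE) that scores each positive pair against a set of negatives
- Spans from other documents serve as negatives: in-batch in the simplest setup, while the paper's final model draws them from a MoCo-style queue of embeddings from earlier batches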
- No human relevance labels required
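
The objective can be sketched in plain Python for a batch where passages[i] is the positive for queries[i] and every other passage in the batch acts as a negative. The temperature default and the raw dot product as similarity are assumptions made for illustration; a real training loop would use a tensor library and learned embeddings.

```python
import math
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def info_nce(queries: List[Sequence[float]],
             passages: List[Sequence[float]],
             temperature: float = 0.05) -> float:
    """Mean InfoNCE loss with in-batch negatives.

    passages[i] is the positive for queries[i]; all other passages
    in the batch are treated as negatives for that query.
    """
    if len(queries) != len(passages):
        raise ValueError("queries and passages must have the same length")
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, p) / temperature for p in passages]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the true positive.
        total += log_denom - logits[i]
    return total / len(queries)
```

The loss approaches zero when each query scores its own passage far above the others, and grows when a negative outscores the positive.
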

Data augmentation:
- Inverse cloze task and independent (random) cropping compared as ways to build positive pairs
- Independent cropping performs best in practice
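
For contrast with independent cropping, the inverse cloze task can be sketched as follows: one span becomes the query and the remainder of the document becomes the passage, so the two views never share a token position. The function name and the fixed span_len parameter are hypothetical, chosen only for illustration.

```python
import random
from typing import List, Tuple


def inverse_cloze_pair(tokens: List[str], span_len: int,
                       rng: random.Random) -> Tuple[List[str], List[str]]:
    """Inverse cloze task: a span is the query, its complement is the passage."""
    # Clamp so a document shorter than span_len is used whole.
    span_len = min(span_len, len(tokens))
    start = rng.randint(0, len(tokens) - span_len)
    query = tokens[start:start + span_len]
    # Everything outside the sampled span forms the context passage.
    context = tokens[:start] + tokens[start + span_len:]
    return query, context
```

Independent cropping, by contrast, allows the two spans to overlap, which gives the encoder some exact-match signal between query and passage.
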

mContriever:
- Multilingual variant trained on 29 languages
- Particularly strong for cross-lingual retrieval without labeled translation pairs

Example

Document: "The Amazon rainforest covers over 5.5 million km²
           and is home to 10% of all species on Earth..."

Positive pair:
  span_1 (query proxy):  "The Amazon rainforest covers over 5.5 million km²"
  span_2 (passage proxy): "home to 10% of all species on Earth"

Negative: any span from a different document in the batch

Training signal: push span_1 and span_2 together,
                 push apart from all other-document spans

Variants and history

Contriever (2021, Meta AI) demonstrated that unsupervised dense retrieval can be competitive with DPR on several BEIR tasks, a surprising result given that DPR is trained on thousands of labeled question-passage pairs. mContriever extended the approach to multilingual retrieval. Later work (DRAGON, E5) combines unsupervised pre-training with supervised fine-tuning. Contriever remains particularly valuable for domain adaptation where labeled data is scarce.

When to use it

Use Contriever when:

- No labeled query-document pairs are available for the target domain
- Domain adaptation is needed (biomedical, legal, code)
- Multilingual retrieval across low-resource languages (mContriever)
- You want a strong initialization before supervised fine-tuning

For domains with labeled data, supervised retrievers such as DPR or ANCE typically outperform Contriever used without fine-tuning.
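
As a starting point for the scenarios above, the sketch below embeds text with a pre-trained checkpoint and ranks passages by dot product. It assumes the torch and transformers packages are installed and that the checkpoint is available on the Hugging Face Hub under the id facebook/contriever; the helper names embed and rank are illustrative, and mean pooling over token embeddings is the assumed pooling strategy.

```python
from typing import List, Sequence


def embed(texts: List[str], model_name: str = "facebook/contriever") -> List[List[float]]:
    """Encode texts into dense vectors via mean pooling (assumed pooling strategy)."""
    # Imported lazily so the ranking helper below works without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state        # (batch, seq, dim)
    mask = inputs["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    # Average only over real tokens, ignoring padding positions.
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return pooled.tolist()


def rank(query_vec: Sequence[float],
         passage_vecs: List[Sequence[float]]) -> List[int]:
    """Return passage indices sorted by descending dot-product score."""
    scores = [sum(q * p for q, p in zip(query_vec, vec)) for vec in passage_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

A typical call would embed the query together with the candidate passages, then pass the first vector as query_vec and the rest as passage_vecs. For the multilingual case, a different model_name would be supplied in place of the default.
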