Two-Stage Retrieval
What it is
Two-stage retrieval is the dominant architecture for neural search systems. Stage 1 (the retriever) scans the full corpus and efficiently returns the top-k candidates, typically using BM25, a dense bi-encoder, or a learned sparse model. Stage 2 (the reranker) sees only those k candidates and applies a more expensive, more accurate model (typically a cross-encoder) to produce the final ranking. The design decouples recall from precision: stage 1 maximizes recall (don't miss relevant documents), and stage 2 maximizes precision (rank the retrieved documents correctly). Stage 1 recall therefore caps end-to-end quality, because the reranker cannot recover a document that was never retrieved.
[illustrate: Full corpus → fast retriever → top-1000 candidates → slow reranker → top-10 results; latency budget split between stages]
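The retrieve-then-rerank flow described above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the function names (`two_stage_search`, `term_overlap`, `jaccard`) are hypothetical, and the two toy scorers only stand in for BM25 and a cross-encoder so that the example runs without any model.

```python
from typing import Callable, List, Sequence

Scorer = Callable[[str, str], float]


def two_stage_search(
    query: str,
    corpus: Sequence[str],
    cheap_score: Scorer,
    expensive_score: Scorer,
    k: int = 100,
    final_n: int = 10,
) -> List[int]:
    """Return indices of the top final_n documents after retrieve-then-rerank."""
    # Stage 1: score every document with the cheap function, keep the top k.
    candidates = sorted(
        range(len(corpus)),
        key=lambda i: cheap_score(query, corpus[i]),
        reverse=True,
    )[:k]
    # Stage 2: rescore only the k candidates with the expensive function.
    reranked = sorted(
        candidates,
        key=lambda i: expensive_score(query, corpus[i]),
        reverse=True,
    )
    return reranked[:final_n]


def term_overlap(query: str, doc: str) -> float:
    """Toy stand-in for BM25 (assumption): number of shared lowercase terms."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def jaccard(query: str, doc: str) -> float:
    """Toy stand-in for a cross-encoder (assumption): Jaccard term similarity."""
    q_terms, d_terms = set(query.lower().split()), set(doc.lower().split())
    union = q_terms | d_terms
    return len(q_terms & d_terms) / len(union) if union else 0.0
```

The expensive scorer is called at most k times per query regardless of corpus size, which is the point of the split: only the cheap scorer ever touches the full corpus.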
How it works
Stage 1: retrieval
- BM25: fast, lexical, no GPU needed; typical k = 100–1000
- Dense bi-encoder + FAISS: semantic; corpus encoding usually needs a GPU; typical k = 100–500
- Hybrid (BM25 + dense, merged with RRF): combines lexical and semantic recall
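The RRF merge used by the hybrid option can be sketched as follows. Reciprocal rank fusion scores each document as the sum of 1 / (rrf_k + rank) over the rankings it appears in. The function name and the default `rrf_k = 60` are assumptions for illustration (60 is a commonly used constant, not something this section specifies).

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[Hashable]],
    rrf_k: int = 60,  # assumed smoothing constant; tune for your data
) -> List[Hashable]:
    """Fuse several ranked lists of document ids into one ranked list."""
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top document contributes 1 / (rrf_k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rrf_k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
# fused == ["d1", "d3", "d2", "d4"]
```

RRF uses only rank positions, so BM25 scores and dense similarities never need to be put on a common scale before merging.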
Stage 2: reranking
- Cross-encoder (MonoBERT, MonoT5): joint query-document encoding; high accuracy, O(k) encoder calls
- ColBERT: late interaction (MaxSim over token embeddings); between bi-encoder and cross-encoder in accuracy and speed
- LLM reranker (RankGPT): listwise LLM scoring; often the highest accuracy, at the highest latency
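To make the late-interaction option concrete, here is a small sketch of MaxSim scoring: each query token embedding is matched to its most similar document token embedding, and those maxima are summed. The embeddings are hand-written toy vectors and the function names are hypothetical; a real system would take them from a trained encoder.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: Sequence[Vector], doc_vecs: Sequence[Vector]) -> float:
    """Sum over query tokens of the best-matching document token similarity."""
    if not doc_vecs:
        return 0.0  # an empty document matches nothing
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy 2-dimensional token embeddings (assumed values, for illustration only).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]  # covers both query tokens
doc_b = [[1.0, 0.0], [1.0, 0.0]]  # covers only the first query token
```

Here `maxsim_score(query, doc_a)` is 2.0 and `maxsim_score(query, doc_b)` is 1.0. Document token embeddings can be computed offline, which is why this sits between a bi-encoder and a cross-encoder in cost: only the cheap max-and-sum step depends on the query.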
Latency budget allocation
Total latency budget: e.g., 100ms
- Stage 1 (BM25 or ANN): 5–20ms
- Stage 2 (reranking k=100 candidates):
  - MonoBERT-base: ~50ms (GPU)
  - MonoT5-3B: ~200ms (GPU), which exceeds the budget; reduce k or use a smaller model
  - ColBERT (PLAID): ~15ms
Typical production: BM25 (k=1000) → dense reranker (k=100) → cross-encoder (k=20)
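The budget arithmetic above can be turned into a quick estimate of how many candidates a reranker can afford. This sketch assumes reranking cost grows linearly with k and ignores batching effects and fixed overhead, which is a simplification; the per-candidate costs are derived from the approximate figures above (about 50ms and 200ms at k=100), and the function name is hypothetical.

```python
def max_rerank_depth(
    total_budget_ms: float,
    stage1_ms: float,
    ms_per_candidate: float,
) -> int:
    """Largest k that fits in the budget left over after stage 1.

    Assumes reranker latency is ms_per_candidate * k (linear, no overhead).
    """
    if ms_per_candidate <= 0:
        raise ValueError("ms_per_candidate must be positive")
    remaining_ms = total_budget_ms - stage1_ms
    if remaining_ms <= 0:
        return 0  # stage 1 alone already uses the whole budget
    return int(remaining_ms // ms_per_candidate)


# ~50ms at k=100 gives 0.5ms per candidate; ~200ms at k=100 gives 2.0ms.
monobert_k = max_rerank_depth(total_budget_ms=100, stage1_ms=20, ms_per_candidate=0.5)
monot5_k = max_rerank_depth(total_budget_ms=100, stage1_ms=20, ms_per_candidate=2.0)
# monobert_k == 160, monot5_k == 40
```

Under these assumptions, a 100ms budget with a 20ms first stage leaves room for roughly 160 candidates with the smaller model but only about 40 with the larger one, which matches the advice to reduce k or switch models.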
Three-stage pipelines
Many production systems add a third stage:
- BM25 or sparse → k=1000
- Dense bi-encoder rerank → k=100
- Cross-encoder rerank → k=10
Each stage reduces the candidate set for the next. This allows the most expensive model to operate on very few candidates.
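The narrowing cascade generalizes to any number of stages. The sketch below is a hypothetical helper, not a reference implementation: each stage is a (scorer, keep) pair, and every scorer is any function that maps a query and a document to a number.

```python
from typing import Callable, List, Sequence, Tuple

Scorer = Callable[[str, str], float]


def cascade_rank(
    query: str,
    corpus: Sequence[str],
    stages: Sequence[Tuple[Scorer, int]],
) -> List[int]:
    """Run scorers in order, each one pruning the candidate set for the next.

    stages is a list of (scorer, keep) pairs, ordered cheapest first.
    Returns document indices ranked by the final stage.
    """
    candidates = list(range(len(corpus)))
    for scorer, keep in stages:
        # Score only the survivors of the previous stage, then keep the best.
        candidates = sorted(
            candidates,
            key=lambda i: scorer(query, corpus[i]),
            reverse=True,
        )[:keep]
    return candidates
```

With stages such as `[(bm25_score, 1000), (dense_score, 100), (cross_encoder_score, 10)]` (hypothetical scorer names), the last and most expensive scorer is called on at most 100 documents, however large the corpus is.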
Variants and history
The retrieve-then-rerank pipeline was formalized for neural IR by MonoBERT (2019). MS MARCO established the standard benchmarking setup (BM25 top-1000 → reranker). TREC Deep Learning evaluates both stages separately. Modern RAG systems add a final generation stage after reranking: an LLM reads the top-k reranked passages to generate an answer.
When to use it
Almost always. The main reasons not to use two stages:
- Corpus is small enough that a cross-encoder over all documents is feasible
- Latency is too tight for any neural model, even a bi-encoder first stage (use BM25 only)
- ColBERT PLAID provides sufficient accuracy without a separate reranker