Two-Stage Retrieval

What it is

Two-stage retrieval is the dominant architecture for neural search systems. Stage 1 (the retriever) scans the full corpus and efficiently returns the top-k candidates, typically using BM25, a dense bi-encoder, or a learned sparse model. Stage 2 (the reranker) sees only those k candidates and applies a more expensive, more accurate model (typically a cross-encoder) to produce the final ranking. The design decouples recall from precision: stage 1 maximizes recall (don’t miss relevant documents), and stage 2 maximizes precision (rank the retrieved documents correctly). The reranker cannot recover a document that stage 1 missed, so stage 1 recall caps end-to-end quality.

[illustrate: Full corpus → fast retriever → top-1000 candidates → slow reranker → top-10 results; latency budget split between stages]

How it works

Stage 1: retrieval

BM25: fast, lexical, no GPU needed; typical k = 100–1000
Dense bi-encoder + FAISS: semantic; a GPU is typically used for encoding, especially for the corpus; typical k = 100–500
Hybrid (BM25 + dense, merged with RRF): combines lexical and semantic recall
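The hybrid option merges two ranked lists with reciprocal rank fusion (RRF), which scores each document by summing 1 / (constant + rank) over the lists it appears in. The following is a minimal sketch, not a production implementation: the function name, the `rrf_k` parameter name, and the document ids are illustrative assumptions, and 60 is only the commonly used default for the constant.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, rrf_k=60, top_n=None):
    """Fuse several ranked lists of doc ids into one list (best first).

    rrf_k is the RRF smoothing constant (60 is a common default); it is
    unrelated to the candidate count k used elsewhere in this article.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rrf_k + rank)
    # Sort by fused score, breaking ties by doc id for determinism.
    fused = sorted(scores, key=lambda d: (-scores[d], d))
    return fused if top_n is None else fused[:top_n]


# Hypothetical stage 1 outputs (doc ids only, best first).
bm25_hits = ["d1", "d2", "d3", "d4"]
dense_hits = ["d3", "d1", "d5", "d2"]

fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

RRF uses only ranks, so the BM25 and dense scores never need to be put on a common scale, which is one reason it is a popular way to merge lexical and semantic results.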

Stage 2: reranking

Cross-encoder (MonoBERT, MonoT5): joint query-document encoding; high accuracy, O(k) encoder calls
ColBERT: late interaction MaxSim; between bi-encoder and cross-encoder in accuracy/speed
LLM reranker (RankGPT): listwise LLM scoring; often the highest accuracy, at the highest latency
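To make the late-interaction idea concrete, here is a small pure-Python sketch of MaxSim scoring: each query token embedding is matched to its most similar document token embedding, and those maxima are summed. The two-dimensional vectors are made-up toy values, not real ColBERT embeddings, and the helper names are assumptions for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: for every query token, take the best
    similarity against any document token, then sum over query tokens."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy token embeddings (hypothetical values, one vector per token).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # matches both query tokens exactly
doc_b = [[0.5, 0.5], [0.6, 0.0]]              # only partial matches

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```

Because document token vectors can be computed offline, only the query has to be encoded at search time. A cross-encoder, by contrast, must run the full model once per query-document pair.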

Latency budget allocation

Total latency budget: e.g., 100ms

Stage 1 (BM25 or ANN): 5–20ms
Stage 2 (cross-encoder, k=100):
  MonoBERT-base: ~50ms (GPU)
  MonoT5-3B: ~200ms (GPU) → too slow, reduce k or use smaller model
  ColBERT (PLAID): ~15ms
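The budget arithmetic can be sketched as follows. The per-candidate costs are derived from the k=100 figures above, under the simplifying assumption that reranking latency scales linearly with k. Real systems batch candidates on the GPU, so treat the result as a rough planning estimate. The function name is hypothetical.

```python
def max_rerank_depth(total_budget_ms, stage1_ms, ms_per_candidate):
    """Largest k the reranker can afford after stage 1 has spent its share.

    Assumes latency grows linearly with k, which ignores batching effects.
    """
    remaining = total_budget_ms - stage1_ms
    if remaining <= 0:
        return 0
    return int(remaining // ms_per_candidate)


# Per-candidate cost implied by the figures above (latency at k=100 / 100).
monobert_base_ms = 50 / 100
monot5_3b_ms = 200 / 100

# 100ms total budget, worst-case 20ms spent in stage 1.
k_monobert = max_rerank_depth(100, 20, monobert_base_ms)
k_monot5 = max_rerank_depth(100, 20, monot5_3b_ms)
print(k_monobert, k_monot5)
```

Under these assumptions MonoT5-3B fits the budget only at a much smaller k than MonoBERT-base, which is the trade-off behind "reduce k or use smaller model".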

Typical production cascade: BM25 (k=1000) → dense bi-encoder rerank (k=100) → cross-encoder (k=20)

Three-stage pipelines

Many production systems add a third stage:

BM25 or sparse → k=1000
Dense bi-encoder rerank → k=100
Cross-encoder rerank → k=10

Each stage reduces the candidate set for the next. This allows the most expensive model to operate on very few candidates.
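A cascade like this reduces to a loop in which each stage rescores the survivors and truncates the list. The sketch below is illustrative only: `term_overlap` and `phrase_aware_score` are toy stand-ins (assumptions, not real models) for a cheap retriever and an expensive reranker, and the corpus is invented.

```python
def run_cascade(query, corpus, stages):
    """Run (scorer, keep) stages in order; each stage rescores the current
    candidates and passes only its top `keep` documents to the next stage."""
    candidates = list(corpus)
    for scorer, keep in stages:
        candidates.sort(key=lambda doc: scorer(query, doc), reverse=True)
        candidates = candidates[:keep]
    return candidates


def term_overlap(query, doc):
    """Cheap stand-in for stage 1: number of shared terms."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def phrase_aware_score(query, doc):
    """Toy stand-in for an expensive reranker: exact-phrase bonus plus
    overlap density. A real system would call a cross-encoder here."""
    q, d = query.lower(), doc.lower()
    overlap = len(set(q.split()) & set(d.split()))
    return (2.0 if q in d else 0.0) + overlap / len(d.split())


corpus = [
    "retrieval in two stage systems",
    "two stage rockets reach orbit",
    "neural search with two stage retrieval",
    "cooking pasta at home",
    "stage lighting design",
]

results = run_cascade(
    "two stage retrieval",
    corpus,
    stages=[(term_overlap, 3), (phrase_aware_score, 2)],
)
print(results)
```

The expensive scorer here runs on three documents instead of five. With the k values listed above, the same structure lets a cross-encoder score 100 candidates rather than the full corpus.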

Variants and history

The retrieve-then-rerank pipeline was formalized for neural IR by MonoBERT (2019). MS MARCO established the standard benchmarking setup (BM25 top-1000 → reranker). The TREC Deep Learning track evaluates the stages separately, with a full-ranking subtask and a reranking subtask. Modern RAG systems add a generation stage at the end: an LLM reads the top-k reranked passages to generate an answer.

When to use it

Almost always. The main reasons not to use two stages:

Corpus is small enough that a cross-encoder over all documents is feasible
Latency is too tight for any neural model, even a bi-encoder first stage (use BM25 only)
ColBERT PLAID provides sufficient accuracy without a separate reranker