DPR (Dense Passage Retrieval)

What it is

DPR (Dense Passage Retrieval) is a bi-encoder retrieval model introduced by Karpukhin et al. (2020) that encodes queries and passages independently into dense vectors and retrieves by maximum inner product search. It demonstrated that dense retrieval could outperform BM25 for open-domain question answering, making it the foundational blueprint for modern neural retrieval.

[illustrate: Two BERT towers encoding query and passage independently; inner product score; FAISS ANN retrieval]

How it works

  1. Dual encoder architecture:

    • Two separate BERT encoders (question encoder + passage encoder)
    • Each produces a single vector from the [CLS] token
    • In the original DPR setup the two encoders are trained with independent weights; sharing weights between them is a variant used in later work
  2. Training:

    • Positive pair: (question, relevant passage)
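    • Negatives: BM25 hard negatives (high-scoring passages that do not contain the answer) + in-batch negatives (the positive passages of the other questions in the same batch)
    • Loss: negative log-likelihood of the positive passage, i.e. softmax cross-entropy over dot-product scores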
  3. Indexing:

    • Pre-encode all passages at index time
    • Store dense vectors in FAISS index (flat or IVF)
    • Typical dimension: 768 (BERT-base)
  4. Retrieval:

    • Encode query at query time
    • FAISS MIPS (Maximum Inner Product Search) for top-k passages
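    • Cost: a flat index scans every vector, so search is O(N) in the number of passages; an approximate (ANN) index such as IVF makes search sublinear at some cost in recall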
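
The training objective in step 2 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not DPR's actual training code: it uses plain Python lists instead of tensors, and the function name in_batch_nll and the toy 2-d embeddings are made up for this example. Passages listed beyond the first len(q_vecs) entries act as extra negatives, which is where BM25 hard negatives would go.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_nll(q_vecs, p_vecs):
    """Mean negative log-likelihood of each question's positive passage.

    p_vecs[i] is the positive passage for q_vecs[i]. Every other entry of
    p_vecs (other questions' positives, plus any extra hard negatives
    appended at the end) is treated as a negative for question i.
    """
    total = 0.0
    for i, q in enumerate(q_vecs):
        scores = [dot(q, p) for p in p_vecs]
        # Numerically stable log-sum-exp over all candidate passages
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        # -log softmax(scores)[i]
        total += log_z - scores[i]
    return total / len(q_vecs)


# Toy embeddings (hypothetical values, chosen only to show the behaviour)
questions = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[2.0, 0.0], [0.0, 2.0]]   # each positive points the same way as its question
swapped = [[0.0, 2.0], [2.0, 0.0]]   # positives assigned to the wrong questions

loss_aligned = in_batch_nll(questions, aligned)
loss_swapped = in_batch_nll(questions, swapped)
```

Here loss_aligned comes out smaller than loss_swapped, because the loss falls as each question scores its own passage above the others in the batch. In a tensor framework the same quantity is typically obtained by applying a cross-entropy function to the question-by-passage score matrix, with the diagonal as targets.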

Example

from transformers import DPRQuestionEncoder, DPRContextEncoder, AutoTokenizer
import torch

# Pretrained DPR checkpoints; the question and passage encoders have separate weights
q_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
p_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
# Both encoders are BERT-base models over the same uncased vocabulary, so one tokenizer is reused here
tokenizer = AutoTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")

query = "Who wrote Hamlet?"
passage = "Hamlet is a tragedy written by William Shakespeare."

q_inputs = tokenizer(query, return_tensors="pt")
p_inputs = tokenizer(passage, return_tensors="pt")

# Inference only, so skip gradient tracking
with torch.no_grad():
    q_vec = q_encoder(**q_inputs).pooler_output  # [1, 768]
    p_vec = p_encoder(**p_inputs).pooler_output  # [1, 768]

# Relevance score is the inner product of the two embeddings
score = torch.dot(q_vec[0], p_vec[0]).item()
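
Scoring a single pair extends directly to retrieval over a corpus: score the query against every stored passage vector and keep the top k. The sketch below shows exact (exhaustive) maximum inner product search, which is the computation a flat inner-product index performs. It is an illustration only: the function name top_k_mips and the 3-d vectors are hypothetical, and plain Python lists stand in for the 768-d encoder outputs and for a real FAISS index.

```python
import heapq


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k_mips(query_vec, passage_vecs, k):
    """Exact maximum inner product search by scanning every passage vector.

    Returns (score, passage_index) pairs, highest score first.
    Cost grows linearly with the number of stored passages.
    """
    scored = ((dot(query_vec, p), idx) for idx, p in enumerate(passage_vecs))
    return heapq.nlargest(k, scored)


# Pre-encoded passage vectors (hypothetical toy values)
index = [
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.1],
    [0.7, 0.0, 0.3],
    [0.0, 0.2, 0.9],
]
query_vec = [1.0, 0.0, 0.2]

hits = top_k_mips(query_vec, index, k=2)
top_ids = [idx for _, idx in hits]
```

With these toy vectors, top_ids contains passages 0 and 2, in that order. An approximate index trades this full scan for a search over a subset of the vectors, which is faster but may miss some of the true top-k passages.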

Variants and history

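DPR (2020) was trained and evaluated on open-domain QA datasets including Natural Questions, TriviaQA, and WebQuestions (WebQ). It showed that, given supervised question-passage pairs, a dense retriever can beat BM25 on open-domain QA. Follow-on work: ANCE improved negatives by mining hard negatives dynamically from an ANN index built with the model being trained; TAS-B added distillation from stronger teacher models together with topic-aware batch sampling; Contriever removed the supervised data requirement through unsupervised contrastive pre-training. DPR’s dual-encoder blueprint underlies most subsequent single-vector dense retrieval models.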

When to use it

Use DPR when:

  • The task is open-domain QA or passage retrieval and supervised training data is available
  • Semantic recall matters more than lexical precision
  • You have infrastructure for ANN indexing (e.g. FAISS or an HNSW-based index)

Prefer BM25 or hybrid search when lexical matching is critical (entity names, codes, rare terms) or training data is scarce.

See also