DocT5Query
What it is
DocT5Query (Nogueira et al., 2019) improves the recall of BM25 and other sparse retrievers by expanding documents at index time. A T5 model fine-tuned on (passage, query) pairs generates synthetic queries, typically 5–40 per passage, that the passage plausibly answers. These queries are appended to the document before indexing. As a result, BM25 can match on predicted query vocabulary that does not appear in the original text, which improves recall without any change to the retrieval infrastructure.
[illustrate: Document → T5 → synthetic queries → append to document → BM25 index; query at search time matches against expanded document]
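A minimal sketch of the generation and append step, assuming the Hugging Face `transformers` library and a doc2query-style T5 checkpoint. The checkpoint name `castorini/doc2query-t5-base-msmarco`, the sampling settings (`top_k=10`, `max_length=64`), and the helper names `make_t5_generator` and `expand_passage` are assumptions chosen for illustration; substitute whatever model and settings you actually use.

```python
from typing import Callable, List

# Assumed checkpoint name; replace with your own fine-tuned model.
MODEL_NAME = "castorini/doc2query-t5-base-msmarco"


def make_t5_generator(model_name: str = MODEL_NAME) -> Callable[[str, int], List[str]]:
    """Return a function that samples k queries for a passage.

    torch and transformers are imported lazily so that expand_passage
    can be used with any other generator without installing them.
    """
    import torch
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name)
    model.eval()

    def generate(passage: str, k: int) -> List[str]:
        inputs = tokenizer(passage, return_tensors="pt",
                           truncation=True, max_length=512)
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_length=64,
                do_sample=True,          # sample k different queries
                top_k=10,                # assumed sampling setting
                num_return_sequences=k,
            )
        return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

    return generate


def expand_passage(passage: str, k: int,
                   generate: Callable[[str, int], List[str]]) -> str:
    """Append k generated queries to the passage text."""
    queries = generate(passage, k)
    return passage + " " + " ".join(queries)
```

Passing the generator in as a function keeps the expansion step independent of the model: `expand_passage(text, 20, make_t5_generator())` uses T5, while any other callable with the same signature (an LLM call, a cached lookup) works unchanged.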
How it works
- T5 fine-tuning (offline, once per domain):
  - Input: passage text
  - Output: a relevant query
  - Training data: (passage, query) pairs, for example from MS MARCO
- Document expansion (at index time):
  - For each document, generate k synthetic queries (typically 5–40)
  - Append them to the document text before indexing
  - Index the expanded document with BM25 or another sparse retriever
- Retrieval (unchanged):
  - Standard BM25 or sparse retrieval over the expanded index
  - No change to query processing or scoring
- Why it works:
  - Vocabulary gap: users often phrase queries with different words than the documents use
  - Predicted queries add terms that bridge this gap
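The vocabulary-gap effect can be shown with a toy example. The sketch below uses a small hand-written BM25 scorer, not a production implementation. The tokenizer, the parameter values `k1=1.2` and `b=0.75`, the two sample documents, and the hand-written synthetic queries (standing in for model output) are all assumptions for illustration.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, docs: List[str],
                k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score every document against the query with a basic BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "Myocardial infarction results from coronary artery occlusion.",
    "The stock market closed higher on Friday.",
]
# Hand-written stand-ins for what a query generation model might produce.
synthetic = {0: ["what causes a heart attack", "heart attack blocked artery"]}

# Expansion: append the synthetic queries to the document text.
expanded = [d + " " + " ".join(synthetic.get(i, [])) for i, d in enumerate(docs)]

query = "what causes a heart attack"
before = bm25_scores(query, docs)      # no shared terms, so every score is 0.0
after = bm25_scores(query, expanded)   # document 0 now matches the query terms
print(before, after)
```

The query and scoring code are identical in both calls; only the indexed text changes, which is why the retrieval side needs no modification.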
Variants and history
DocT5Query (2019) proved surprisingly effective: BM25 over the expanded documents matched early dense retrieval results. uniCOIL combined with doc2query expansion stacked learned impact scoring on top of the expanded documents for further gains. The approach generalizes, since any query generation model can be used to expand documents. In RAG pipelines, generating synthetic questions for document chunks is the same idea applied to LLM contexts.
When to use it
Use DocT5Query when:
- Dense retrieval infrastructure is unavailable
- You want to improve BM25 recall cheaply at index time
- Vocabulary mismatch between queries and documents is a known problem
- Offline expansion cost (GPU inference over corpus) is acceptable
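Whether the offline cost is acceptable can be checked with back-of-the-envelope arithmetic before committing to a run. The helper below is a rough sketch; the function name and the throughput figure in the usage example are hypothetical placeholders, and the real throughput should be measured on your own hardware and model.

```python
def expansion_gpu_hours(num_docs: int, queries_per_doc: int,
                        queries_per_second: float) -> float:
    """Estimate GPU hours needed to generate queries for a whole corpus."""
    if queries_per_second <= 0:
        raise ValueError("queries_per_second must be positive")
    total_queries = num_docs * queries_per_doc
    return total_queries / queries_per_second / 3600


# Hypothetical numbers: 1M documents, 20 queries each, 500 queries/s measured.
hours = expansion_gpu_hours(1_000_000, 20, 500.0)
print(f"{hours:.1f} GPU hours")
```

The cost scales linearly with both corpus size and k, so halving the number of queries per document halves the expansion time.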