DocT5Query

Doct5query Document-Expansion Query-Expansion Neural-Ir Sequence-to-Sequence Needs-Review

What it is

DocT5Query (Nogueira et al., 2019) improves the recall of BM25 and other sparse retrievers by expanding documents at index time. A T5 model fine-tuned on (passage, query) pairs generates synthetic queries, typically 5–40 per passage, that the passage plausibly answers. These queries are appended to the document before indexing. As a result, BM25 can match on predicted query vocabulary that doesn’t appear in the original text, which improves recall without any change to the retrieval infrastructure.

[illustrate: Document → T5 → synthetic queries → append to document → BM25 index; query at search time matches against expanded document]

How it works

T5 fine-tuning (offline, once per domain):
- Input: passage text
- Output: a relevant query
- Train on MS MARCO (passage, query) pairs

Document expansion (at index time):
- For each document, generate k synthetic queries (typically 5–40)
- Append them to the document text before indexing
- Index the expanded document with BM25 or another sparse retriever

Retrieval (unchanged):
- Standard BM25 or other sparse retrieval over the expanded index
- No change to query processing or scoring

Why it works:
- Vocabulary gap: users often phrase queries with different words than the documents use
- Predicted queries add the terms that bridge this gap
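
The expansion and retrieval steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the original implementation: `fake_generator` is a hypothetical stand-in for a fine-tuned T5 model and simply returns canned queries, the `BM25` class is a small in-memory scorer written for this example, and the `k1` and `b` values are illustrative. The point is only to show that a query with no term overlap with the original passage can still match once synthetic queries are appended.

```python
import math
import re
from collections import Counter
from typing import Callable, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it on non-word characters."""
    return re.findall(r"\w+", text.lower())


def expand_document(
    doc: str, generate_queries: Callable[[str, int], List[str]], k: int = 5
) -> str:
    """Index-time step: append up to k synthetic queries to the document text."""
    return doc + " " + " ".join(generate_queries(doc, k))


class BM25:
    """Small in-memory BM25 scorer; k1 and b are illustrative values."""

    def __init__(self, docs: List[str], k1: float = 0.9, b: float = 0.4) -> None:
        self.k1, self.b = k1, b
        self.tfs = [Counter(tokenize(d)) for d in docs]
        self.lengths = [sum(tf.values()) for tf in self.tfs]
        self.avg_len = sum(self.lengths) / len(docs)
        # Document frequency: each term is counted once per document.
        self.df = Counter(term for tf in self.tfs for term in tf)
        self.n_docs = len(docs)

    def score(self, query: str, doc_id: int) -> float:
        tf, length = self.tfs[doc_id], self.lengths[doc_id]
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f == 0:
                continue
            df = self.df[term]
            idf = math.log(1 + (self.n_docs - df + 0.5) / (df + 0.5))
            norm = f + self.k1 * (1 - self.b + self.b * length / self.avg_len)
            total += idf * f * (self.k1 + 1) / norm
        return total


# Hypothetical stand-in for a fine-tuned T5 query generator (canned output).
CANNED_QUERIES = {
    "Acetylsalicylic acid inhibits cyclooxygenase enzymes.": [
        "what is aspirin",
        "how does aspirin work",
    ],
}


def fake_generator(doc: str, k: int) -> List[str]:
    return CANNED_QUERIES.get(doc, [])[:k]


docs = [
    "Acetylsalicylic acid inhibits cyclooxygenase enzymes.",
    "BM25 ranks documents by term frequency and document length.",
]
expanded = [expand_document(d, fake_generator, k=2) for d in docs]

query = "how does aspirin work"
original_score = BM25(docs).score(query, 0)      # no shared terms with the passage
expanded_score = BM25(expanded).score(query, 0)  # matches the appended queries
```

The query path is identical in both cases; only the indexed text differs, which is why the technique needs no change at search time. In a real pipeline the generator would wrap model inference, and the expanded text would be handed to whatever BM25 index is already in use.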

Variants and history

DocT5Query (2019) proved a surprisingly effective way to improve BM25: BM25 over the expanded documents matched early dense retrieval results. uniCOIL + doc2query then stacked learned impact scoring on top of expanded documents for further gains. The approach generalizes, since any query generation model can be used to expand documents. In RAG pipelines, synthetic question generation for chunked documents is the same idea applied to LLM contexts.

When to use it

Use DocT5Query when:

- Dense retrieval infrastructure is unavailable
- You want to improve BM25 recall cheaply at index time
- Vocabulary mismatch between queries and documents is a known problem
- Offline expansion cost (GPU inference over corpus) is acceptable
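
Whether the offline cost is acceptable can be checked with a back-of-the-envelope estimate before committing to a full run. The sketch below is only arithmetic: `expansion_gpu_hours` is a hypothetical helper, and the corpus size and throughput figures are made-up placeholders that should be replaced with numbers measured on your own hardware and model.

```python
def expansion_gpu_hours(
    num_docs: int, queries_per_doc: int, queries_per_second: float
) -> float:
    """GPU-hours needed to generate queries_per_doc queries for every document."""
    if queries_per_second <= 0:
        raise ValueError("queries_per_second must be positive")
    total_queries = num_docs * queries_per_doc
    return total_queries / queries_per_second / 3600.0


# Placeholder inputs (assumptions, not measurements):
# 1,000,000 passages, 10 queries each, 50 generated queries per second on one GPU.
hours = expansion_gpu_hours(1_000_000, 10, 50.0)
print(f"{hours:.1f} GPU-hours")  # prints "55.6 GPU-hours"
```

The cost scales linearly with both corpus size and k, so halving the number of queries per document halves the bill.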
