DSI (Differentiable Search Index)

What it is

DSI (Differentiable Search Index, Tay et al., 2022) is a generative retrieval approach in which a T5 model memorizes the entire document corpus in its parameters and retrieves documents by generating their identifiers directly from a query. There is no external index: the “index” is the model weights. This is a sharp departure from sparse (BM25) and dense (ANN) retrieval pipelines, which keep the index in a data structure separate from the model.

[illustrate: Query → T5 → generates document ID string → look up document; no FAISS, no BM25, just token generation]

How it works

Document ID representation:
- Each document is assigned a unique ID
- Atomic IDs (one arbitrary token per document), naive string IDs (e.g., “d1234”), or semantic IDs (hierarchical cluster labels)
- Semantic IDs tend to perform better because they encode document meaning into the ID structure

Indexing (training):
- Train T5 to map document text → document ID
- Also train to map queries → document IDs using relevance labels
- Both tasks in a single model via multi-task training
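The two training tasks above can be sketched as one mixed list of (source, target) pairs. This is a minimal illustration, not the paper's data pipeline: the `index:` and `query:` task prefixes, the `Example` class, and the word-level truncation via `max_doc_words` are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    source: str  # text fed to the encoder
    target: str  # docid string the decoder must generate


def build_training_examples(docs, relevance, max_doc_words=32):
    """Build multi-task seq2seq examples.

    docs: {docid: document text}
    relevance: [(query, relevant docid), ...]
    The task prefixes below are hypothetical; any consistent scheme works.
    """
    examples = []
    # Indexing task: (truncated) document text -> docid
    for docid, text in docs.items():
        truncated = " ".join(text.split()[:max_doc_words])
        examples.append(Example(source=f"index: {truncated}", target=docid))
    # Retrieval task: query -> docid, taken from relevance labels
    for query, docid in relevance:
        if docid not in docs:
            raise KeyError(f"relevance label points at unknown docid {docid!r}")
        examples.append(Example(source=f"query: {query}", target=docid))
    return examples
```

Both kinds of example share the same target space (docids), which is what lets a single seq2seq model be trained on the mixture.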

Retrieval:
- Beam search to generate top-k document IDs
- Constrained decoding to ensure only valid IDs are generated
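Constrained beam search can be sketched with a prefix trie over the valid IDs: at each step only tokens that extend some valid ID are considered. The sketch below is pure Python and assumes a hypothetical `log_prob_fn(prefix)` that returns next-token log-probabilities; `toy_log_probs` is a fixed stand-in for a real decoder, and `END` is an assumed end-of-sequence token.

```python
import math

END = "<eos>"  # assumed end-of-sequence token


def build_trie(valid_ids):
    """Nested-dict prefix trie over docid token sequences."""
    root = {}
    for tokens in valid_ids:
        node = root
        for tok in tokens:
            node = node.setdefault(tok, {})
        node[END] = {}  # mark that a complete ID ends here
    return root


def constrained_beam_search(log_prob_fn, trie, beam_size):
    """Return up to beam_size (docid tokens, log-prob) pairs, best first."""
    beams = [((), 0.0, trie)]  # (prefix, cumulative log-prob, trie node)
    finished = []
    while beams:
        candidates = []
        for prefix, score, node in beams:
            next_scores = log_prob_fn(prefix)
            # Only tokens present in the trie node keep the prefix valid.
            for tok, child in node.items():
                lp = next_scores.get(tok)
                if lp is None:
                    continue
                if tok == END:
                    finished.append((prefix, score + lp))
                else:
                    candidates.append((prefix + (tok,), score + lp, child))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]  # keep the best partial IDs
    finished.sort(key=lambda f: f[1], reverse=True)
    return finished[:beam_size]


def toy_log_probs(prefix):
    """Stand-in for a decoder: a fixed token preference, ignoring the prefix."""
    probs = {"2": 0.4, "14": 0.2, "3": 0.15, "5": 0.1,
             "7": 0.05, "1": 0.05, END: 0.05}
    return {tok: math.log(p) for tok, p in probs.items()}
```

With the valid IDs 2.14.3, 2.14.7 and 5.1.1 and a beam of 2, the search can only ever return IDs from that set, however the stand-in model scores other token sequences.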

Semantic document IDs:
- Hierarchically cluster documents by content
- Assign IDs based on cluster path (e.g., “2.14.3”)
- Model learns the document taxonomy as part of ID generation
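The cluster-path idea can be illustrated with a small recursive k-means. This is a sketch under stated assumptions: documents are already embedded as short float vectors, the tiny `kmeans` helper (with deterministic initialization from the first k points) stands in for a real clustering library, and the parameters `k` and `max_leaf` are illustrative values rather than recommended settings.

```python
def kmeans(points, k, iters=10):
    """Tiny k-means on lists of floats; returns one cluster index per point."""
    k = min(k, len(points))
    centroids = [list(p) for p in points[:k]]  # deterministic init
    assignment = [0] * len(points)
    for _ in range(iters):
        assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # leave an empty cluster's centroid unchanged
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def assign_semantic_ids(embeddings, k=2, max_leaf=2, prefix=()):
    """embeddings: {doc name: vector}. Returns {doc name: tuple of ints}."""
    names = sorted(embeddings)
    if len(names) <= max_leaf:
        # Small enough: number the documents within this cluster.
        return {name: prefix + (i,) for i, name in enumerate(names)}
    labels = kmeans([embeddings[n] for n in names], k)
    groups = {}
    for name, label in zip(names, labels):
        groups.setdefault(label, []).append(name)
    if len(groups) == 1:
        # Clustering could not split (e.g. duplicate vectors): stop recursing.
        return {name: prefix + (i,) for i, name in enumerate(names)}
    ids = {}
    for label, members in sorted(groups.items()):
        sub = {m: embeddings[m] for m in members}
        ids.update(assign_semantic_ids(sub, k, max_leaf, prefix + (label,)))
    return ids


def semantic_id_strings(embeddings, k=2, max_leaf=2):
    """Render each cluster path as a dotted string such as "1.0.1"."""
    ids = assign_semantic_ids(embeddings, k, max_leaf)
    return {name: ".".join(str(i) for i in path) for name, path in ids.items()}
```

Documents that land in the same cluster share an ID prefix, so the first generated tokens narrow the search to a region of the corpus before the final token picks a single document.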

Variants and history

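DSI (2022, Google) established generative retrieval over a whole document corpus as a research direction. GENRE (Generative ENtity REtrieval, De Cao et al., 2021) predates it and had already generated entity names autoregressively for entity retrieval and linking. NCI (Neural Corpus Indexer) added a prefix-aware weight-adaptive decoder for more consistent ID generation. MINDER moved to multi-view identifiers, describing each document by several textual identifiers rather than a single ID. The approach is theoretically appealing, but scaling to millions of documents remains an open challenge: memorization capacity grows with model size, so larger corpora demand larger models.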

When to use it

DSI is primarily a research direction. Current limitations:

- Does not scale gracefully to millions of documents without very large models
- Adding new documents requires retraining, unlike ANN indexes, which can be updated incrementally
- Beam search retrieval is slower than ANN lookup for large corpora

Watch this space: improvements in semantic IDs and corpus scale are active research areas.
