DSI (Differentiable Search Index)
What it is
DSI (Differentiable Search Index, Tay et al., 2022) is a generative retrieval approach in which a sequence-to-sequence Transformer (T5 in the original paper) memorizes the document corpus in its parameters and retrieves documents by generating their identifiers directly from a query. There is no external index: the “index” is the model weights. This departs from the usual retrieval pipeline, in which queries are scored against a separately built sparse or dense index.
[illustrate: Query → T5 → generates document ID string → look up document; no FAISS, no BM25, just token generation]
How it works
- Document ID representation:
  - Each document is assigned a unique ID
  - Options: unstructured atomic IDs (one dedicated output token per document), naive string IDs (e.g., “1234”, decoded token by token), or semantic IDs (hierarchical cluster labels)
  - Semantic IDs generally perform best: they encode document meaning into the ID structure
- Indexing (training):
  - Train T5 to map document text → document ID
  - Also train it to map queries → document IDs using relevance labels
  - Both tasks are learned by a single model via multi-task training
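The two training tasks can be cast as one pool of text-to-text pairs. The following is a minimal sketch, not the paper's pipeline: the task prefixes (“index:” and “query:”), the `Example` and `build_examples` names, the toy corpus, and the choice to truncate each document to its first few tokens are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    source: str  # text fed to the encoder
    target: str  # document ID the decoder should generate


def build_examples(corpus, labeled_queries, max_doc_tokens=32):
    """Build seq2seq pairs for both tasks (hypothetical helper).

    corpus: dict mapping document ID -> document text
    labeled_queries: list of (query, relevant document ID) pairs
    """
    examples = []
    # Indexing task: (truncated) document text -> document ID.
    for docid, text in corpus.items():
        truncated = " ".join(text.split()[:max_doc_tokens])
        examples.append(Example(f"index: {truncated}", docid))
    # Retrieval task: query -> document ID, from relevance labels.
    for query, docid in labeled_queries:
        if docid not in corpus:
            raise KeyError(f"query labeled with unknown docid: {docid}")
        examples.append(Example(f"query: {query}", docid))
    return examples


# Toy data, invented for this sketch.
corpus = {
    "0.1.0": "Cats are small domesticated carnivores kept as pets.",
    "0.1.1": "Dogs were domesticated from wolves thousands of years ago.",
    "1.0.0": "A bond is a fixed-income instrument issued by a borrower.",
}
labeled_queries = [
    ("when were dogs domesticated", "0.1.1"),
    ("what is a bond", "1.0.0"),
]

examples = build_examples(corpus, labeled_queries)
for ex in examples:
    print(f"{ex.source!r} -> {ex.target!r}")
```

Mixing both kinds of pairs in one training set is what lets a single model serve as both the index and the retriever; the ratio between the two tasks is a tunable choice that this sketch does not address.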
- Retrieval:
  - Beam search generates the top-k document IDs
  - Constrained decoding ensures that only valid IDs are generated
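Constrained beam search is commonly implemented with a prefix trie over the valid IDs: at each step, only tokens that extend some real ID are allowed. The sketch below assumes IDs are tokenized one character at a time and uses a hypothetical `toy_score_fn` in place of the decoder's log-probabilities; a real system would query the trained model instead.

```python
import math

EOS = "<eos>"


def build_trie(docids):
    """Nested-dict prefix trie over docid token sequences (one token per character)."""
    root = {}
    for docid in docids:
        node = root
        for tok in list(docid) + [EOS]:
            node = node.setdefault(tok, {})
    return root


def allowed_tokens(trie, prefix):
    """Tokens that keep `prefix` on a path towards a valid docid."""
    node = trie
    for tok in prefix:
        node = node[tok]
    return list(node.keys())


def constrained_beam_search(trie, score_fn, beam_size=2):
    """Return up to `beam_size` (docid, log-score) pairs, best first."""
    beams = [((), 0.0)]
    finished = []
    while beams:
        candidates = []
        for prefix, score in beams:
            for tok in allowed_tokens(trie, prefix):
                hyp = (prefix + (tok,), score + score_fn(prefix, tok))
                (finished if tok == EOS else candidates).append(hyp)
        candidates.sort(key=lambda h: h[1], reverse=True)
        beams = candidates[:beam_size]  # prune to the best partial IDs
    finished.sort(key=lambda h: h[1], reverse=True)
    return [("".join(p[:-1]), s) for p, s in finished[:beam_size]]


def toy_score_fn(prefix, tok):
    """Hypothetical stand-in for the model: favours the digits of "213"."""
    preferred = "213"
    pos = len(prefix)
    if tok == EOS:
        return math.log(0.9)
    if pos < len(preferred) and tok == preferred[pos]:
        return math.log(0.7)
    return math.log(0.1)


docids = ["213", "214", "120", "305"]  # toy set of valid IDs
trie = build_trie(docids)
results = constrained_beam_search(trie, toy_score_fn, beam_size=2)
for docid, score in results:
    print(docid, round(score, 3))
```

Because every hypothesis is grown along trie edges, each returned string is guaranteed to be an existing ID. With Hugging Face Transformers, a similar trie lookup can be supplied through the `prefix_allowed_tokens_fn` argument of `generate`, although wiring that up is outside this sketch.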
- Semantic document IDs:
  - Hierarchically cluster documents by content
  - Assign IDs based on cluster path (e.g., “2.14.3”)
  - The model learns the document taxonomy as part of ID generation
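The cluster-path construction can be sketched as recursive k-means over document embeddings. Everything concrete here is an assumption for illustration: the 2-D toy vectors stand in for real embeddings, `kmeans` is a tiny deterministic implementation rather than a library call, and the branching factor `k=2` is chosen only to keep the example small.

```python
import math


def kmeans(points, k, iters=20):
    """Tiny deterministic k-means; returns one cluster index per point."""
    # Farthest-point initialisation avoids random seeds in this sketch.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return assign


def semantic_ids(embeddings, k=2, prefix=()):
    """Map each document name to its cluster path, recursing while a cluster holds more than k docs."""
    names = list(embeddings)
    if len(names) <= k:
        # Small enough: number the documents within this leaf cluster.
        return {name: prefix + (i,) for i, name in enumerate(names)}
    assign = kmeans([embeddings[n] for n in names], k)
    groups = [[n for n, a in zip(names, assign) if a == j] for j in range(k)]
    if max(len(g) for g in groups) == len(names):
        # Degenerate split (e.g. identical vectors): stop recursing.
        return {name: prefix + (i,) for i, name in enumerate(names)}
    ids = {}
    for j, group in enumerate(groups):
        sub = {n: embeddings[n] for n in group}
        ids.update(semantic_ids(sub, k, prefix + (j,)))
    return ids


# Toy 2-D "embeddings": two well-separated topics.
embeddings = {
    "cats": (0.0, 0.1), "dogs": (0.1, 0.0), "birds": (0.2, 0.2),
    "stocks": (5.0, 5.1), "bonds": (5.1, 5.0), "loans": (5.2, 5.3),
}
ids = semantic_ids(embeddings, k=2)
for name, path in ids.items():
    print(name, ".".join(map(str, path)))
```

Documents that are close in embedding space share a prefix, so the first generated tokens narrow the search to a topic and later tokens pick the document within it.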
Variants and history
DSI (2022, Google) popularized generative retrieval for document search. GENRE (generative entity retrieval, De Cao et al., 2021) had earlier applied the same idea to entity retrieval and entity linking, generating entity names token by token. NCI (Neural Corpus Indexer) added a prefix-aware weight-adaptive decoder for more consistent ID generation. MINDER extended the approach to multi-view document IDs, such as titles, n-grams, and pseudo-queries. The approach is exciting theoretically, but scaling to millions of documents remains an open challenge, because the number of documents a model can memorize is tied to its size.
When to use it
DSI is primarily a research direction. Current limitations:
- Does not scale gracefully to millions of documents without very large models
- Adding new documents requires further training of the model (unlike ANN indexes, which can be updated incrementally)
- Beam search retrieval is typically slower than ANN lookup for large corpora, since every query requires autoregressive decoding
Watch this space: improvements in semantic IDs and corpus scale are active research areas.