DSI (Differentiable Search Index)

What it is

DSI (Differentiable Search Index, Tay et al., 2022) is a generative retrieval approach in which a T5 model is trained to memorize a document corpus in its parameters and then retrieves documents by generating their identifiers directly from a query. There is no external index: the “index” is the model weights. This departs from index-based pipelines such as BM25 or dense retrieval with ANN search, which keep the corpus in a separate data structure.

[illustrate: Query → T5 → generates document ID string → look up document; no FAISS, no BM25, just token generation]

How it works

  1. Document ID representation:

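    • Each document is assigned a unique ID
    • Unstructured atomic IDs (one dedicated output token per document), naive string IDs (e.g., “1234”, tokenized like ordinary text), or semantic IDs (hierarchical cluster labels)
    • Semantic IDs generally performed best in the original experiments: the ID structure itself encodes document similarity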
  2. Indexing (training):

    • Train T5 to map document text → document ID
    • Also train to map queries → document IDs using relevance labels
    • Both tasks in a single model via multi-task training
  3. Retrieval:

    • Beam search to generate top-k document IDs (with atomic IDs, a top-k over the output logits is enough)
    • Constrained decoding to ensure only valid IDs are generated
  4. Semantic document IDs:

    • Hierarchically cluster documents by content (the original paper applies hierarchical k-means to document embeddings)
    • Assign IDs based on cluster path (e.g., “2.14.3”)
    • Model learns the document taxonomy as part of ID generation
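The retrieval step (beam search restricted to valid IDs) can be sketched in plain Python. This is a toy illustration under stated assumptions, not the DSI implementation: `toy_log_probs` is a hypothetical stand-in for the trained decoder, the `leaning` tuple stands in for an encoded query, and `DOCS`, `EOS`, and `VOCAB` are invented for the example. The constraint is enforced with a prefix trie over the valid docids, so at each step only tokens that extend some real ID are considered.

```python
import math

# Hypothetical toy corpus keyed by semantic-style docids (cluster paths).
DOCS = {
    (2, 14, 3): "intro to beam search",
    (2, 14, 7): "beam search variants",
    (2, 5, 1): "greedy decoding",
    (9, 0, 4): "sourdough recipes",
}
EOS = -1                         # assumed end-of-ID marker
VOCAB = list(range(20)) + [EOS]  # assumed decoder vocabulary


def build_trie(doc_ids):
    """Nested-dict prefix trie; an EOS child marks a complete docid."""
    root = {}
    for doc_id in doc_ids:
        node = root
        for tok in doc_id:
            node = node.setdefault(tok, {})
        node[EOS] = {}
    return root


def toy_log_probs(leaning, prefix):
    """Stand-in for the trained decoder: next-token log-probs over VOCAB.

    `leaning` is the token sequence this fake model favours for the query;
    a real system would run the fine-tuned seq2seq model here instead.
    """
    pos = len(prefix)
    if pos < len(leaning):
        logits = {t: -10.0 if t == EOS else -float(abs(t - leaning[pos]))
                  for t in VOCAB}
    else:
        logits = {t: 0.0 if t == EOS else -10.0 for t in VOCAB}
    log_z = math.log(sum(math.exp(v) for v in logits.values()))
    return {t: v - log_z for t, v in logits.items()}


def constrained_beam_search(leaning, trie, beam_size=3):
    """Return up to beam_size (docid, log-prob) pairs, best first."""
    beams = [((), 0.0, trie)]  # (prefix, cumulative log-prob, trie node)
    finished = []
    while beams:
        candidates = []
        for prefix, score, node in beams:
            log_probs = toy_log_probs(leaning, prefix)
            # Only tokens that are children in the trie are ever expanded.
            for tok, child in node.items():
                new_score = score + log_probs[tok]
                if tok == EOS:
                    finished.append((prefix, new_score))
                else:
                    candidates.append((prefix + (tok,), new_score, child))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    finished.sort(key=lambda f: f[1], reverse=True)
    return finished[:beam_size]


trie = build_trie(DOCS)
leaning = (3, 13, 3)  # not a valid docid; unconstrained decoding would emit it
for doc_id, score in constrained_beam_search(leaning, trie, beam_size=3):
    print(doc_id, round(score, 2), DOCS[doc_id])
```

The fake model's favourite sequence is deliberately not a real ID, so the example shows the trie steering generation to the nearest valid IDs. The final dictionary lookup is the only "index" access, and it happens after generation.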

Variants and history

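DSI (2022, Google) popularized generative retrieval over document identifiers. GENRE (generative entity retrieval, De Cao et al., 2021) had earlier applied the same idea to entity linking and entity retrieval, generating entity names token by token under constrained decoding. NCI (Neural Corpus Indexer) added a prefix-aware weight-adaptive decoder for more consistent ID generation. MINDER extended generative retrieval to multi-view identifiers (e.g., titles, substrings, and pseudo-queries). The approach is appealing in theory, but scaling to millions of documents remains an open challenge, because the capacity to memorize a corpus is tied to model size.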

When to use it

DSI is primarily a research direction. Current limitations:

  • Does not scale gracefully to millions of documents without very large models
  • Adding new documents requires further training, unlike ANN indexes, which can be updated incrementally
  • Beam search needs one decoder step per generated ID token, which is typically slower than an ANN lookup on large corpora
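A back-of-envelope calculation gives a feel for the scaling pressure. The sketch below assumes that atomic IDs cost one output embedding per document and that semantic IDs come from a perfectly balanced cluster tree; both are simplifications, and the function names are made up for this example. The hidden size of 768 is that of T5-base.

```python
def atomic_id_output_params(num_docs, d_model):
    """Extra output-embedding parameters if every docid is its own token."""
    return num_docs * d_model


def min_id_length(num_docs, branching):
    """Shortest ID length L such that branching**L >= num_docs.

    Assumes a perfectly balanced cluster tree (a simplification).
    """
    length, capacity = 0, 1
    while capacity < num_docs:
        capacity *= branching
        length += 1
    return length


# One million documents with T5-base's hidden size of 768:
print(atomic_id_output_params(1_000_000, 768))  # 768000000
# The same corpus with 10-way clustering at every level:
print(min_id_length(1_000_000, 10))             # 6
```

Under these assumptions, atomic IDs for a million documents add more parameters than T5-base has in total (roughly 220M). Structured IDs keep the vocabulary fixed but need several decoding steps per beam, and the model still has to memorize which ID belongs to which document.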

Watch this space: improvements in semantic IDs and corpus scale are active research areas.

See also