PLAID

What it is

PLAID (Performance-optimized Late Interaction Driver, Santhanam et al., 2022) is a retrieval engine for ColBERTv2. The core problem with ColBERT at scale is that computing MaxSim between a query and every document in a large corpus is prohibitively expensive. PLAID addresses this by scoring documents against a small set of centroid vectors first, using those scores as a cheap approximation to filter candidates, and computing exact MaxSim only for the survivors. This reduces the number of documents that need full scoring by orders of magnitude.
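
For reference, MaxSim takes, for each query token, the highest similarity to any document token and sums those maxima over the query tokens. A minimal NumPy sketch follows; the function name `maxsim` and the toy embeddings are illustrative and not taken from any library.

```python
import numpy as np

def maxsim(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Late-interaction score between one query and one document.

    query_emb: (num_query_tokens, dim), doc_emb: (num_doc_tokens, dim).
    """
    sim = query_emb @ doc_emb.T            # token-to-token similarities
    return float(sim.max(axis=1).sum())    # best doc token per query token, summed

# Toy example: 2 query tokens, 3 document tokens, 2-dimensional embeddings.
q = np.array([[1.0, 0.0], [0.0, 1.0]])
d = np.array([[1.0, 0.0], [0.0, 0.5], [0.5, 0.5]])
score = maxsim(q, d)   # 1.0 (first query token) + 0.5 (second) = 1.5
```

Running this for every document in the corpus is the cost that PLAID's centroid filtering is meant to avoid.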

[illustrate: Query token embeddings → centroid similarity scores → candidate set → decompression → exact MaxSim → reranked top-k]

How it works

  1. Centroid lookup (stage 1):

    • During index construction, cluster all document token embeddings into k centroids
    • For a query, compute similarity between each query token and all centroids
    • Retrieve posting lists for the top-t centroids per query token
    • Union these lists to get a candidate document set
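  2. Centroid interaction (stages 2 and 3):

    • Approximate each candidate's MaxSim score using only the centroids its tokens are assigned to, with no residual decompression
    • Ignore centroids whose best query similarity falls below a threshold (centroid pruning)
    • Keep only the highest-scoring candidates for the final stage
  3. Decompressed scoring (stage 4):

    • For the surviving candidates, decompress token embeddings from (centroid ID, quantized residual)
    • Compute MaxSim scores between query tokens and decompressed document tokens
    • Final ranking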
  4. Optimized kernels:

    • Batched matrix operations for centroid lookup
    • Custom C++ and CUDA kernels for residual decompression and MaxSim aggregation
    • Enables retrieval in tens of milliseconds on a GPU over millions of documents
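
The flow above can be sketched end to end on a toy corpus. This is a simplified illustration under stated assumptions, not PLAID's implementation: it keeps only centroid-based candidate generation and MaxSim over decompressed embeddings, leaves out any intermediate pruning or further reranking, uses a crude k-means, stores residuals as float16 in place of real low-bit quantization, and runs in plain NumPy rather than custom kernels. All names (`search`, `nprobe`, `postings`, and so on) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def maxsim(query_emb, doc_emb):
    return float((query_emb @ doc_emb.T).max(axis=1).sum())

# Toy corpus: 50 documents, 8 tokens each, 16-dim unit-norm token embeddings.
num_docs, toks_per_doc, dim, k = 50, 8, 16, 12
doc_tokens = normalize(rng.normal(size=(num_docs * toks_per_doc, dim)))
token_doc_id = np.repeat(np.arange(num_docs), toks_per_doc)

# Index construction: cluster token embeddings into k centroids
# (a few iterations of spherical k-means, enough for a toy example).
centroids = doc_tokens[rng.choice(len(doc_tokens), size=k, replace=False)].copy()
for _ in range(5):
    assign = np.argmax(doc_tokens @ centroids.T, axis=1)
    for c in range(k):
        members = doc_tokens[assign == c]
        if len(members):
            centroids[c] = normalize(members.mean(axis=0))
assign = np.argmax(doc_tokens @ centroids.T, axis=1)

# Each token is stored as (centroid ID, residual).
# Assumption: float16 stands in for a real low-bit residual codec.
residuals = (doc_tokens - centroids[assign]).astype(np.float16)

# Posting lists: centroid ID -> IDs of documents with a token in that cluster.
postings = {c: np.unique(token_doc_id[assign == c]) for c in range(k)}

def search(query_emb, nprobe=2, top_k=5):
    # Centroid lookup: score every query token against every centroid,
    # then union the posting lists of the top-nprobe centroids per token.
    centroid_scores = query_emb @ centroids.T
    top_centroids = np.argsort(-centroid_scores, axis=1)[:, :nprobe]
    candidates = np.unique(
        np.concatenate([postings[int(c)] for c in top_centroids.ravel()])
    )
    # Decompressed scoring: rebuild candidate embeddings and run MaxSim.
    scored = []
    for doc in candidates:
        mask = token_doc_id == doc
        emb = centroids[assign[mask]] + residuals[mask].astype(np.float32)
        scored.append((maxsim(query_emb, emb), int(doc)))
    scored.sort(reverse=True)
    return [(doc, score) for score, doc in scored[:top_k]]

# Query built from the first four tokens of document 7.
query = doc_tokens[token_doc_id == 7][:4]
results = search(query)   # list of (doc_id, score), best first
```

On a corpus this small the candidate set covers most documents, so the filtering saves little; the benefit appears only when the number of centroids and documents is large enough that most posting lists are never touched.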

Variants and history

PLAID (2022) made ColBERTv2 practical for production deployment; before it, serving ColBERT at scale required significant custom engineering. The PLAID paper reports that centroid-based filtering preserves ColBERTv2’s retrieval quality while cutting search latency by up to 7x on a GPU and 45x on a CPU. RAGatouille wraps ColBERT + PLAID behind a simpler interface for Python users. The Stanford ColBERT library (stanford-futuredata/ColBERT, distributed on PyPI as colbert-ai) ships PLAID as its default serving path.

When to use it

PLAID is the standard serving engine for ColBERTv2 deployments, and there is rarely a practical reason to serve ColBERTv2 in production without it. The broader choice is between:

  • ColBERTv2 + PLAID (high accuracy, larger index, millisecond queries)
  • Bi-encoder + FAISS (slightly lower accuracy, smaller index, faster)
  • Cross-encoder reranking on top of a first-stage retriever (highest accuracy, highest latency)

See also