Product Quantisation

What it is

Product Quantisation (PQ) is a vector compression method that reduces memory footprint and accelerates distance computation, typically by a factor of 10–100. Each d-dimensional vector is decomposed into m subspaces of d/m dimensions, and each subspace is quantized independently to b bits, enabling approximate similarity search at a small fraction of the original storage cost.

[illustrate: 768-dim vector split into 8 subspaces (96-dim each); each subspace quantized with 256 centroids (8-bit); distance computation from lookup tables]
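The storage arithmetic behind these figures, as a quick check in plain Python:

d, m, b = 768, 8, 8             # dimensions, subspaces, bits per subspace code
raw_bytes = d * 4               # float32 storage per vector -> 3072
pq_bytes = m * b // 8           # compressed code per vector -> 8
print(raw_bytes // pq_bytes)    # 384x compression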

How it works

  1. Training:

    • Split each d-dimensional vector into m subspaces (typically 8–64)
    • Train an independent k-means clustering per subspace (k=256 for 8-bit codes)
    • Store the m resulting codebooks; each vector is then stored as m one-byte codes
  2. Compression:

    • For each subspace in a vector, find nearest centroid
    • Replace each subvector with its centroid ID (1 byte for 256 centroids)
    • Compression: 768-dim float32 (3072 bytes) → 8 bytes per vector (384x reduction)
  3. Distance computation:

    • Create lookup tables: distance(query_subspace_i, centroid_j) for all centroids
    • Sum the per-subspace distances (for m=8: 8 table lookups + 7 additions, vs. 768 multiply-adds for an exact dot product); see the sketch after this list
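A minimal NumPy sketch of all three stages. The function names (train_pq, encode, adc_distances) and the plain Lloyd's k-means are illustrative choices for this sketch, not a library API:

import numpy as np

def _sq_dists(x, c):
    # Pairwise squared L2 distances between rows of x and rows of c.
    return (x ** 2).sum(1)[:, None] - 2.0 * x @ c.T + (c ** 2).sum(1)[None, :]

def kmeans(data, k=256, iters=20, seed=0):
    # Plain Lloyd's k-means; production trainers are faster but equivalent in spirit.
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iters):
        assign = _sq_dists(data, centroids).argmin(1)      # nearest centroid per point
        for j in range(k):
            members = data[assign == j]
            if len(members):
                centroids[j] = members.mean(0)             # recenter each cluster
    return centroids

def train_pq(vectors, m=8, k=256):
    # Training: one independent codebook per subspace, a list of m (k, d/m) arrays.
    return [kmeans(sub, k) for sub in np.split(vectors, m, axis=1)]

def encode(vectors, codebooks):
    # Compression: replace each subvector with its nearest centroid's ID
    # (1 byte per subspace when k=256).
    subs = np.split(vectors, len(codebooks), axis=1)
    ids = [_sq_dists(s, cb).argmin(1) for s, cb in zip(subs, codebooks)]
    return np.stack(ids, axis=1).astype(np.uint8)          # shape (n, m)

def adc_distances(query, codes, codebooks):
    # Distance computation (asymmetric): one (k,)-entry lookup table per subspace,
    # then each database vector is scored with m lookups plus m-1 additions.
    m = len(codebooks)
    tables = np.stack([((cb - q) ** 2).sum(1)              # shape (m, k)
                       for q, cb in zip(np.split(query, m), codebooks)])
    return tables[np.arange(m), codes].sum(1)              # shape (n,)

# Usage: toy demo on random data.
rng = np.random.default_rng(0)
xb = rng.standard_normal((10_000, 768)).astype(np.float32)
codebooks = train_pq(xb, m=8)               # 8 codebooks of 256 x 96-dim centroids
codes = encode(xb, codebooks)               # (10000, 8) uint8: 8 bytes per vector
q = rng.standard_normal(768).astype(np.float32)
top10 = np.argsort(adc_distances(q, codes, codebooks))[:10]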

Example

# Original vector: 768 floats = 3072 bytes
# PQ(m=8, nbits=8): split into 8 × 96-dim subspaces
#   Each subspace quantized to 1 byte (256 centroids)
# Compressed: 8 bytes per vector

# Distance computation:
#   Exact: dot(query, document) = 768 multiply-adds
#   PQ: 8 table lookups + 7 additions = 15 operations (~50x fewer)

# Memory: 1M vectors
#   Exact: 1M × 3072 bytes ≈ 3GB
#   PQ: 1M × 8 bytes = 8MB (384x reduction)
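The same configuration reproduced with a library. This sketch assumes Faiss (pip install faiss-cpu) is available; the random data is for illustration only:

import numpy as np
import faiss

d, m, nbits = 768, 8, 8
xb = np.random.default_rng(0).standard_normal((100_000, d)).astype(np.float32)

index = faiss.IndexPQ(d, m, nbits)   # 8 subspaces, 256 centroids each
index.train(xb)                      # learn the 8 codebooks
index.add(xb)                        # store 8-byte codes, not 3072-byte floats
D, I = index.search(xb[:5], 10)      # ADC search: top-10 distances and ids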

Variants and history

PQ was introduced by Jégou et al. (2010) for large-scale image retrieval. OPQ (Optimized Product Quantization) learns a rotation of the input space so that subspaces align better with the data. Composite Quantisation replaces the product structure with additive codes. Learned quantisation methods use neural networks to train the codebooks end to end. PQ remains fundamental in production retrieval systems, typically combined with IVF or HNSW for massive-scale search.

When to use it

Use PQ when:

  • Vector memory is the bottleneck
  • You need 100+ million or billion-scale indexes
  • A modest recall loss (typically 1–3%) is acceptable
  • Query latency is critical
  • It is combined with partitioning (IVF+PQ) or graph indexes such as HNSW (see the sketch below)

PQ sacrifices 1–3% recall in typical configurations but reduces memory by roughly 100x and greatly accelerates distance computation. The trade-off: distances are approximate, so rankings can occasionally be wrong.
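A sketch of the IVF+PQ combination mentioned above, again assuming Faiss; the 4096-cell partition, training-sample size, and nprobe=16 are illustrative settings, not tuned recommendations:

import numpy as np
import faiss

d = 768
xb = np.random.default_rng(0).standard_normal((1_000_000, d)).astype(np.float32)

# "IVF4096,PQ8": partition into 4096 cells, then 8-byte PQ codes per vector.
index = faiss.index_factory(d, "IVF4096,PQ8")
index.train(xb[:200_000])   # train coarse quantizer + PQ codebooks on a sample
index.add(xb)               # 8 bytes of codes per vector (plus ids and cell metadata)
index.nprobe = 16           # cells scanned per query: the recall/latency knob
D, I = index.search(xb[:5], 10)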

See also