Locality-Sensitive Hashing

What it is

Locality-sensitive hashing (LSH) is a probabilistic technique that hashes similar items into the same bucket with high probability while dissimilar items collide rarely. It enables fast approximate nearest-neighbor search without examining all candidates.

How it works

Framework:

  1. Define a hash family that preserves similarity: P(h(a) = h(b)) ≈ sim(a, b)
  2. Concatenate multiple hash functions to form compound keys (bands)
  3. Index items by these keys
  4. At query time, hash the query, fetch items in the same buckets
  5. Candidates are likely similar; rank by actual similarity

LSH guarantees: similar items collide with high probability; dissimilar items rarely collide.

[illustrate: Items in high-dimensional space with hash buckets grouping similar items; show query retrieval from same bucket]

Example

For Jaccard similarity with MinHash: concatenate k hash functions into signature. Collision probability = Jaccard similarity. Use multiple hash tables (bands) to tune recall-precision tradeoff.

Variants and history

Introduced by Indyk and Motwani (1998). Specific LSH families for Hamming distance, cosine similarity, L2 distance, etc. Foundation for approximate nearest-neighbor systems at billion-item scale (search engines, recommendation). Modern variants: learned LSH, neural hashing.

When to use it

Billion-scale similarity search without scanning all items. Approximate nearest neighbors in high dimensions. Fast duplicate detection and clustering. Search-engine and e-commerce recommendation systems. Configure hash families and band/row parameters for target accuracy and speed.

See also