Locality-Sensitive Hashing

What it is

Locality-sensitive hashing (LSH) is a probabilistic technique that hashes similar items into the same bucket with high probability while dissimilar items collide rarely. It enables fast approximate nearest-neighbor search without examining all candidates.

How it works

Framework:

Define a hash family that preserves similarity: P(h(a) = h(b)) ≈ sim(a, b)
Concatenate multiple hash functions to form compound keys (bands)
Index items by these keys
At query time, hash the query, fetch items in the same buckets
Candidates are likely similar; rank by actual similarity

LSH guarantees: similar items collide with high probability; dissimilar items rarely collide.

[illustrate: Items in high-dimensional space with hash buckets grouping similar items; show query retrieval from same bucket]

Example

For Jaccard similarity with MinHash: concatenate k hash functions into signature. Collision probability = Jaccard similarity. Use multiple hash tables (bands) to tune recall-precision tradeoff.

Variants and history

Introduced by Indyk and Motwani (1998). Specific LSH families for Hamming distance, cosine similarity, L2 distance, etc. Foundation for approximate nearest-neighbor systems at billion-item scale (search engines, recommendation). Modern variants: learned LSH, neural hashing.

When to use it

Billion-scale similarity search without scanning all items. Approximate nearest neighbors in high dimensions. Fast duplicate detection and clustering. Search-engine and e-commerce recommendation systems. Configure hash families and band/row parameters for target accuracy and speed.