Locality-Sensitive Hashing
What it is
Locality-sensitive hashing (LSH) is a probabilistic technique that hashes similar items into the same bucket with high probability while dissimilar items collide rarely. It enables fast approximate nearest-neighbor search without examining all candidates.
How it works
Framework:
- Define a hash family that preserves similarity: P(h(a) = h(b)) ≈ sim(a, b)
- Concatenate multiple hash functions to form compound keys (bands)
- Index items by these keys
- At query time, hash the query, fetch items in the same buckets
- Candidates are likely similar; rank by actual similarity
LSH guarantees: similar items collide with high probability; dissimilar items rarely collide.
[illustrate: Items in high-dimensional space with hash buckets grouping similar items; show query retrieval from same bucket]
Example
For Jaccard similarity with MinHash: concatenate k hash functions into signature. Collision probability = Jaccard similarity. Use multiple hash tables (bands) to tune recall-precision tradeoff.
Variants and history
Introduced by Indyk and Motwani (1998). Specific LSH families for Hamming distance, cosine similarity, L2 distance, etc. Foundation for approximate nearest-neighbor systems at billion-item scale (search engines, recommendation). Modern variants: learned LSH, neural hashing.
When to use it
Billion-scale similarity search without scanning all items. Approximate nearest neighbors in high dimensions. Fast duplicate detection and clustering. Search-engine and e-commerce recommendation systems. Configure hash families and band/row parameters for target accuracy and speed.