MinHash
What it is
MinHash (minimum hash) is a probabilistic sketching technique approximating Jaccard similarity between sets. It generates a compact signature (sketch) from each set such that sketch similarity approximates set Jaccard similarity, enabling rapid estimation without computing actual overlap.
How it works
Algorithm:
- Define k independent hash functions
- For each set, hash all elements and keep the minimum hash value from each function
- Signature is the k-tuple of minimum values
- Estimated Jaccard = (fraction of hash functions where sketches match) / k
For sets with large intersection, many hash functions will return matching minima. This fraction approximates true Jaccard.
Time: O(n log k) per set (n elements, k hash functions).
[illustrate: Two sets with hash function computations showing minimum values and signature comparison; explain approximation quality]
Example
Sets A and B with 5 hash functions. Suppose signatures match in 3 of 5 functions. Estimated Jaccard ≈ 3/5 = 0.6
Actual Jaccard may be 0.58 or 0.62; error depends on k.
Variants and history
Developed by Broder (1997). Foundation for large-scale near-duplicate document detection (TREC, search engines). Generalises to Jaccard-like measures. Superseded by more efficient sketches (HyperLogLog for cardinality) in some applications but remains standard for similarity.
When to use it
Large-scale document clustering, near-duplicate detection, and stream processing. Approximate Jaccard when exact computation is infeasible. Combine with LSH for fast similarity search. Memory-efficient: constant-size signatures independent of set size. Probabilistic; configure k for acceptable error.