MinHash

What it is

MinHash (minimum hash) is a probabilistic sketching technique approximating Jaccard similarity between sets. It generates a compact signature (sketch) from each set such that sketch similarity approximates set Jaccard similarity, enabling rapid estimation without computing actual overlap.

How it works

Algorithm:

  1. Define k independent hash functions
  2. For each set, hash all elements and keep the minimum hash value from each function
  3. Signature is the k-tuple of minimum values
  4. Estimated Jaccard = (fraction of hash functions where sketches match) / k

For sets with large intersection, many hash functions will return matching minima. This fraction approximates true Jaccard.

Time: O(n log k) per set (n elements, k hash functions).

[illustrate: Two sets with hash function computations showing minimum values and signature comparison; explain approximation quality]

Example

Sets A and B with 5 hash functions. Suppose signatures match in 3 of 5 functions. Estimated Jaccard ≈ 3/5 = 0.6

Actual Jaccard may be 0.58 or 0.62; error depends on k.

Variants and history

Developed by Broder (1997). Foundation for large-scale near-duplicate document detection (TREC, search engines). Generalises to Jaccard-like measures. Superseded by more efficient sketches (HyperLogLog for cardinality) in some applications but remains standard for similarity.

When to use it

Large-scale document clustering, near-duplicate detection, and stream processing. Approximate Jaccard when exact computation is infeasible. Combine with LSH for fast similarity search. Memory-efficient: constant-size signatures independent of set size. Probabilistic; configure k for acceptable error.

See also