SimHash

What it is

SimHash generates a compact hash (fingerprint) of a document such that documents with similar content have similar hashes. Unlike cryptographic hashing, SimHash preserves proximity in hash space, enabling fast approximate similarity via Hamming distance.

How it works

Algorithm:

  1. Extract ngrams or features from document
  2. For each feature, compute hash (e.g., SHA-1)
  3. Weight each hash bit: if feature is important, add/subtract from accumulator
  4. Final fingerprint: hash accumulator to bit string (sign of each bit position)
  5. Similarity: Hamming distance between fingerprints approximates cosine similarity

Result: k-bit hash where Hamming distance ≤ d correlates with cosine similarity threshold.

[illustrate: Document feature extraction, hash bit accumulation, final fingerprint generation; show Hamming distance between two similar fingerprints]

Example

Document 1 features: ngrams {the, cat, sat, on, mat} Document 2 features: ngrams {the, cat, sat, on, mat, red} (95% overlap)

SimHashes differ in only 1-2 bits (out of 64), Hamming distance ≈ 2. Documents with 95% ngram overlap have small Hamming distance.

Variants and history

Developed at Google for near-duplicate detection in web search (Charikar, 2002). Efficient and effective for document deduplication at scale. Variants with different hash lengths and weighting schemes. Related to semantic hashing and learned similarity-preserving hashes.

When to use it

Near-duplicate document detection at billion-scale. Web crawling and search indexing. Efficient similarity threshold filtering: exact Hamming distance computation for candidate pruning. Fast approximate similarity check before expensive ranking. Balance fingerprint size (64-256 bits) for accuracy vs. speed.

See also