SimHash
What it is
SimHash generates a compact hash (fingerprint) of a document such that documents with similar content have similar hashes. Unlike cryptographic hashing, SimHash preserves proximity in hash space, enabling fast approximate similarity via Hamming distance.
How it works
Algorithm:
- Extract ngrams or features from document
- For each feature, compute hash (e.g., SHA-1)
- Weight each hash bit: if feature is important, add/subtract from accumulator
- Final fingerprint: hash accumulator to bit string (sign of each bit position)
- Similarity: Hamming distance between fingerprints approximates cosine similarity
Result: k-bit hash where Hamming distance ≤ d correlates with cosine similarity threshold.
[illustrate: Document feature extraction, hash bit accumulation, final fingerprint generation; show Hamming distance between two similar fingerprints]
Example
Document 1 features: ngrams {the, cat, sat, on, mat} Document 2 features: ngrams {the, cat, sat, on, mat, red} (95% overlap)
SimHashes differ in only 1-2 bits (out of 64), Hamming distance ≈ 2. Documents with 95% ngram overlap have small Hamming distance.
Variants and history
Developed at Google for near-duplicate detection in web search (Charikar, 2002). Efficient and effective for document deduplication at scale. Variants with different hash lengths and weighting schemes. Related to semantic hashing and learned similarity-preserving hashes.
When to use it
Near-duplicate document detection at billion-scale. Web crawling and search indexing. Efficient similarity threshold filtering: exact Hamming distance computation for candidate pruning. Fast approximate similarity check before expensive ranking. Balance fingerprint size (64-256 bits) for accuracy vs. speed.