Jaccard Similarity
What it is
Jaccard similarity, also called the Jaccard index or Tanimoto coefficient, measures the overlap between two sets as the ratio of the size of their intersection to the size of their union. It ranges from 0 (disjoint sets) to 1 (identical sets).
How it works
For sets A and B:
Jaccard(A, B) = |A ∩ B| / |A ∪ B|
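It can be computed on token sets, shingle sets, or any other discrete representation. The measure is symmetric: Jaccard(A, B) = Jaccard(B, A). When both sets are empty the ratio is 0/0 and therefore undefined, so an implementation has to pick a convention for that case.
It can be computed efficiently with hash-based set membership tests, or with a single merge pass over two sorted element lists.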
[illustrate: Two sets with intersection and union regions highlighted; demonstrate ratio calculation]
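The formula and the two computation strategies above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function names `jaccard` and `jaccard_sorted` are hypothetical, and returning 1.0 for two empty sets is an assumed convention, since the ratio is 0/0 in that case.

```python
from typing import Hashable, Iterable, Sequence


def jaccard(a: Iterable[Hashable], b: Iterable[Hashable]) -> float:
    """Jaccard similarity of two collections, each treated as a set."""
    set_a, set_b = set(a), set(b)
    union_size = len(set_a | set_b)
    if union_size == 0:
        # Assumed convention: two empty sets count as identical.
        return 1.0
    return len(set_a & set_b) / union_size


def jaccard_sorted(a: Sequence, b: Sequence) -> float:
    """Same ratio via one merge pass.

    Assumes both inputs are sorted ascending and contain no duplicates.
    """
    i = j = intersection = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            intersection += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    # |A ∪ B| = |A| + |B| - |A ∩ B|
    union_size = len(a) + len(b) - intersection
    return 1.0 if union_size == 0 else intersection / union_size
```

The hash-based version is usually the simpler choice; the merge version avoids building hash sets when the inputs are already stored as sorted lists.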
Example
A = {“the”, “quick”, “brown”, “fox”}
B = {“the”, “brown”, “dog”}
Intersection: {“the”, “brown”} (size 2)
Union: {“the”, “quick”, “brown”, “fox”, “dog”} (size 5)
Jaccard = 2/5 = 0.4
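The same arithmetic can be checked with Python's built-in set operators. The variable names below are chosen only for illustration.

```python
a = {"the", "quick", "brown", "fox"}
b = {"the", "brown", "dog"}

intersection = a & b  # words present in both sets
union = a | b         # words present in either set

score = len(intersection) / len(union)
print(len(intersection), len(union), score)  # prints: 2 5 0.4
```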
Variants and history
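Named after the botanist Paul Jaccard. It is widely used in text similarity, document clustering, and information retrieval, and it is the quantity that MinHash approximates. It generalises to weighted sets and to probabilistic variants. Closely related measures include the Sørensen-Dice coefficient, 2|A ∩ B| / (|A| + |B|), and the overlap coefficient, |A ∩ B| / min(|A|, |B|); both are symmetric. Asymmetric variants also exist, such as the Tversky index, which weights the two set differences A − B and B − A unequally.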
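A few of these relatives can be sketched side by side to show how they differ from the plain ratio. The definitions used here are the standard ones: Sørensen-Dice as 2|A ∩ B| / (|A| + |B|), the overlap coefficient as |A ∩ B| / min(|A|, |B|), the Tversky index as |A ∩ B| / (|A ∩ B| + α|A − B| + β|B − A|), and weighted Jaccard as the sum of element-wise minimum weights over the sum of element-wise maximum weights. Function names and the return values chosen for empty inputs are assumptions made for this sketch.

```python
from typing import Mapping


def dice(a: set, b: set) -> float:
    """Sørensen-Dice coefficient: 2|A ∩ B| / (|A| + |B|). Symmetric."""
    total = len(a) + len(b)
    return 1.0 if total == 0 else 2 * len(a & b) / total


def overlap(a: set, b: set) -> float:
    """Overlap coefficient: |A ∩ B| / min(|A|, |B|). Symmetric."""
    smaller = min(len(a), len(b))
    # Assumed convention: 0.0 when either set is empty.
    return 0.0 if smaller == 0 else len(a & b) / smaller


def tversky(a: set, b: set, alpha: float, beta: float) -> float:
    """Tversky index; asymmetric whenever alpha != beta."""
    inter = len(a & b)
    denom = inter + alpha * len(a - b) + beta * len(b - a)
    return 1.0 if denom == 0 else inter / denom


def weighted_jaccard(x: Mapping[str, float], y: Mapping[str, float]) -> float:
    """Weighted Jaccard for non-negative weights: sum(min) / sum(max)."""
    keys = x.keys() | y.keys()
    num = sum(min(x.get(k, 0.0), y.get(k, 0.0)) for k in keys)
    den = sum(max(x.get(k, 0.0), y.get(k, 0.0)) for k in keys)
    return 1.0 if den == 0 else num / den
```

With alpha = beta = 1 the Tversky index reduces to Jaccard, and with alpha = beta = 0.5 it reduces to Dice. Weighted Jaccard with all weights equal to 1 gives the same value as the plain set version.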
When to use it
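Use it for document similarity via shingles, tag-set matching, and clustering. Because the union sits in the denominator, the score can never exceed the ratio of the smaller set's size to the larger one's, which is useful when a difference in set size should count against a match. The measure is symmetric, so the result does not depend on argument order. Exact computation on high-dimensional sparse data can be slower than cosine similarity; use MinHash to approximate it at scale.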
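A rough sketch of the shingle-plus-MinHash workflow mentioned above follows. Everything here is illustrative: the function names, the choice of word-level shingles with k = 3, the number of hash functions, and the use of salted `hashlib.blake2b` as the hash family are assumptions made for this example, not a prescribed design. The estimate is the fraction of signature positions where the two minimum hash values agree.

```python
import hashlib


def shingles(text: str, k: int = 3) -> set:
    """Set of k-word shingles (contiguous word windows) from lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        # Assumption: short texts yield a single shingle, empty texts none.
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def seeded_hash(item: str, seed: int) -> int:
    """64-bit hash of item; the seed is passed as a blake2b salt."""
    digest = hashlib.blake2b(
        item.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items: set, num_hashes: int = 128) -> list:
    """Minimum hash value per seed. Requires a non-empty set."""
    return [min(seeded_hash(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of positions where two equal-length signatures agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


doc_a = shingles("the quick brown fox jumps over the lazy dog")
doc_b = shingles("the quick brown fox leaps over the lazy dog")
estimate = estimate_jaccard(minhash_signature(doc_a), minhash_signature(doc_b))
exact = len(doc_a & doc_b) / len(doc_a | doc_b)
print(f"exact={exact:.3f} estimate={estimate:.3f}")
```

Signatures have a fixed length regardless of document size, so comparing two documents costs the same however many shingles each contains. More hash functions reduce the variance of the estimate at the cost of longer signatures.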