Jaccard Similarity

What it is

Jaccard similarity measures the overlap between two sets as the ratio of the size of their intersection to the size of their union. It ranges from 0 (disjoint sets) to 1 (identical sets). Also called the Jaccard index or Tanimoto coefficient.

How it works

For sets A and B:

Jaccard(A, B) = |A ∩ B| / |A ∪ B|

It can be computed on token sets, shingle sets, or any other discrete representation. It is symmetric: Jaccard(A, B) = Jaccard(B, A).

It can be computed efficiently with hash-set membership tests, or with a single merge-style pass over two sorted lists.
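Both strategies can be sketched in Python. This is a minimal illustration, not a reference implementation: the function names jaccard and jaccard_sorted are hypothetical, and returning 1.0 when both inputs are empty is an assumed convention (the ratio is 0/0 there, and implementations differ).

```python
from typing import Hashable, Iterable, Sequence


def jaccard(a: Iterable[Hashable], b: Iterable[Hashable]) -> float:
    """Jaccard similarity using hash-set membership tests."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # assumption: two empty sets are treated as identical
    intersection = len(set_a & set_b)
    # |A ∪ B| = |A| + |B| - |A ∩ B|, so the union never has to be built
    union = len(set_a) + len(set_b) - intersection
    return intersection / union


def jaccard_sorted(a: Sequence, b: Sequence) -> float:
    """Jaccard similarity by one pass over two sorted, duplicate-free sequences."""
    i = j = intersection = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            intersection += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    union = len(a) + len(b) - intersection
    return intersection / union if union else 1.0  # same empty-set assumption
```

The set version takes time roughly proportional to |A| + |B| on average; the sorted version needs no hashing, which may help when the inputs are already stored in sorted order.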

[illustrate: Two sets with intersection and union regions highlighted; demonstrate ratio calculation]

Example

A = {“the”, “quick”, “brown”, “fox”}

B = {“the”, “brown”, “dog”}

Intersection: {“the”, “brown”} (size 2)

Union: {“the”, “quick”, “brown”, “fox”, “dog”} (size 5)

Jaccard = 2/5 = 0.4
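The same arithmetic can be reproduced from raw text. The sketch below assumes simple lowercase whitespace tokenisation, which is only one of many possible choices; token_set is a hypothetical helper name.

```python
def token_set(text: str) -> set:
    """Split on whitespace and lowercase (an assumed, deliberately simple tokeniser)."""
    return set(text.lower().split())


a = token_set("the quick brown fox")
b = token_set("the brown dog")

intersection = a & b
union = a | b
score = len(intersection) / len(union)

print(sorted(intersection))  # ['brown', 'the']
print(len(union))            # 5
print(score)                 # 0.4
```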

Variants and history

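Named after Paul Jaccard. Widely used in text similarity, document clustering, and information retrieval, and the quantity that MinHash approximates. It generalises to weighted sets and to probabilistic variants. Related measures include the Dice coefficient, 2|A ∩ B| / (|A| + |B|), and the overlap coefficient, |A ∩ B| / min(|A|, |B|); both are symmetric. Asymmetric variants also exist, such as the Tversky index, which weights the two set differences unequally.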
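To make the relationship between these measures concrete, here is a small sketch of a few related coefficients. All function names are hypothetical, the set functions assume non-empty inputs, and weighted_jaccard assumes non-negative weights with at least one positive value; the weighted form shown (sum of minima over sum of maxima) is one common generalisation, not the only one.

```python
def dice(a: set, b: set) -> float:
    """Dice coefficient: 2|A ∩ B| / (|A| + |B|). Symmetric."""
    return 2 * len(a & b) / (len(a) + len(b))


def overlap(a: set, b: set) -> float:
    """Overlap coefficient: |A ∩ B| / min(|A|, |B|). Symmetric."""
    return len(a & b) / min(len(a), len(b))


def tversky(a: set, b: set, alpha: float, beta: float) -> float:
    """Tversky index; asymmetric whenever alpha != beta.

    alpha = beta = 1 gives Jaccard, alpha = beta = 0.5 gives Dice.
    """
    common = len(a & b)
    return common / (common + alpha * len(a - b) + beta * len(b - a))


def weighted_jaccard(x: dict, y: dict) -> float:
    """Weighted Jaccard: sum of per-key minima over sum of per-key maxima."""
    keys = x.keys() | y.keys()
    numerator = sum(min(x.get(k, 0.0), y.get(k, 0.0)) for k in keys)
    denominator = sum(max(x.get(k, 0.0), y.get(k, 0.0)) for k in keys)
    return numerator / denominator


a = {"the", "quick", "brown", "fox"}
b = {"the", "brown", "dog"}
print(tversky(a, b, 1.0, 1.0))  # 0.4, matching the Jaccard example
print(tversky(a, b, 1.0, 0.0), tversky(b, a, 1.0, 0.0))  # differ: order matters
```

With alpha = 1 and beta = 0 the Tversky index reduces to the fraction of the first set that is contained in the second, which is why swapping the arguments changes the result.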

When to use it

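Use it for document similarity via shingles, tag-set matching, and clustering. It suits cases where differences in set size should lower the score, since elements present in only one set enlarge the union. Because it is symmetric, the result does not depend on argument order. It can be slower than cosine similarity on high-dimensional sparse data, and exact pairwise comparison becomes expensive on large collections; use MinHash to approximate it.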
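A rough sketch of shingling plus MinHash estimation follows. It is an illustration under stated assumptions rather than a production design: the helper names (shingles, minhash_signature, estimate_jaccard), the character shingle length of 5, the 128 hash functions, and the use of salted BLAKE2b from hashlib as the hash family are all choices made for this example.

```python
import hashlib
from typing import Iterable, List, Set


def shingles(text: str, k: int = 5) -> Set[str]:
    """Character k-shingles of a string (k = 5 is an arbitrary example value)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def _seeded_hash(item: str, seed: int) -> int:
    """64-bit hash of item; each seed acts as a different hash function."""
    digest = hashlib.blake2b(
        item.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items: Iterable[str], num_hashes: int = 128) -> List[int]:
    """For each hash function, keep the minimum hash value over the set."""
    items = list(items)
    if not items:
        raise ValueError("MinHash signature is undefined for an empty set")
    return [min(_seeded_hash(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of signature positions that agree; estimates Jaccard similarity."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


doc_a = shingles("the quick brown fox jumps over the lazy dog")
doc_b = shingles("the quick brown fox leaps over the lazy dog")
exact = len(doc_a & doc_b) / len(doc_a | doc_b)
approx = estimate_jaccard(minhash_signature(doc_a), minhash_signature(doc_b))
print(exact, approx)  # the estimate should be close to, not equal to, the exact value
```

The signature has fixed length regardless of set size, so comparing two documents costs a constant number of integer comparisons. More hash functions reduce the estimation error at the cost of longer signatures.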

See also