Overlap Coefficient

What it is

The overlap coefficient measures set similarity as the intersection size divided by the smaller set’s cardinality. It captures containment: if A is a subset of B, the overlap is 1 regardless of B’s size. The measure is symmetric, and it is useful when containment semantics matter.

How it works

For sets A and B:

Overlap(A, B) = |A ∩ B| / min(|A|, |B|)

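Symmetric: Overlap(A, B) = Overlap(B, A), because neither the intersection nor the minimum depends on argument order. Ranges from 0 (disjoint sets) to 1 (one set is a subset of the other). Undefined when either set is empty, since the denominator is 0.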

Interpretation: “What fraction of the smaller set is shared?”

[illustrate: Sets of different sizes with overlap region highlighted; show how formula captures smaller set coverage]
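The formula translates directly into a few lines of Python. The following is a minimal sketch, not a reference implementation: the function name `overlap_coefficient` is illustrative, and returning 0.0 for an empty input is an assumed convention, since the formula itself is undefined in that case.

```python
def overlap_coefficient(a, b):
    """Return |a ∩ b| / min(|a|, |b|) for two iterables treated as sets."""
    a, b = set(a), set(b)
    if not a or not b:
        # Assumption: the formula is undefined for an empty set;
        # returning 0.0 is one possible convention, not the only one.
        return 0.0
    return len(a & b) / min(len(a), len(b))


x = {1, 2, 3, 4}
y = {3, 4, 5}

print(overlap_coefficient(x, y))  # 2 shared elements / smaller size 3
print(overlap_coefficient(x, y) == overlap_coefficient(y, x))  # True
```

Swapping the arguments gives the same value, because both the intersection and the minimum ignore order.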

Example

A = {“the”, “quick”, “brown”, “fox”} (|A| = 4)
B = {“the”, “brown”} (|B| = 2)

Intersection: {“the”, “brown”} (size 2)
min(|A|, |B|) = 2

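Overlap(A, B) = 2/2 = 1.0 (B is contained in A)
Overlap(B, A) = 2/2 = 1.0 (same value; argument order does not matter)
By contrast, the directed fraction of A that appears in B, |A ∩ B| / |A| = 2/4 = 0.5, does depend on direction.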

Variants and history

Also called the Szymkiewicz-Simpson coefficient. Useful when subset relationships matter. Less commonly used than Jaccard or Dice in general similarity tasks, because it reaches 1 whenever one set contains the other, however different their sizes. Relevant in document clustering when containment is semantically meaningful.
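To see how the overlap coefficient differs from Jaccard and Dice, the sketch below evaluates all three on the example sets A and B. The function names are illustrative, and the code assumes both sets are non-empty.

```python
def overlap(a, b):
    # |a ∩ b| / min(|a|, |b|)
    return len(a & b) / min(len(a), len(b))


def jaccard(a, b):
    # |a ∩ b| / |a ∪ b|
    return len(a & b) / len(a | b)


def dice(a, b):
    # 2 |a ∩ b| / (|a| + |b|)
    return 2 * len(a & b) / (len(a) + len(b))


A = {"the", "quick", "brown", "fox"}
B = {"the", "brown"}

print(overlap(A, B))  # 1.0, because B is a subset of A
print(jaccard(A, B))  # 0.5, because the union has 4 elements
print(dice(A, B))     # 4/6, roughly 0.667
```

Only the overlap coefficient reaches 1.0 here; Jaccard and Dice both penalize the size difference between the two sets.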

When to use it

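Matching where one set is naturally smaller (e.g., query terms vs. document terms). Capturing “is the small set covered by the large set” semantics. Useful in clustering when subset relations matter. Use with caution when set sizes differ greatly: a small set fully contained in a large one scores 1.0 even if the large set has many unrelated elements. Because the coefficient is symmetric, it cannot tell which set contains the other; if direction matters, divide by |A| or |B| explicitly instead of the minimum.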
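As a sketch of the query-versus-document case, the snippet below scores a small query against a few term sets and ranks them. The document names and contents are made up for illustration, and treating empty sets as scoring 0.0 is an assumed convention.

```python
def overlap_coefficient(a, b):
    """Return |a ∩ b| / min(|a|, |b|), or 0.0 if either set is empty (assumed convention)."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


# Hypothetical data: a short query and three documents as term sets.
query = {"quick", "fox"}
documents = {
    "doc1": {"the", "quick", "brown", "fox"},
    "doc2": {"the", "lazy", "dog"},
    "doc3": {"quick", "brown", "dog"},
}

scores = {name: overlap_coefficient(query, terms) for name, terms in documents.items()}

# Highest score first.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for name, score in ranked:
    print(name, score)
```

Here doc1 scores 1.0 because it contains every query term, even though half of its own terms are unrelated to the query. This is the behavior to keep in mind when the sets differ greatly in size.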