Dice Coefficient
What it is
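The Dice coefficient (also called the Sørensen-Dice coefficient, or the F1 measure on sets) quantifies the similarity of two sets as twice the size of their intersection divided by the sum of their cardinalities. It ranges from 0 (disjoint sets) to 1 (identical sets). When one set holds the predicted items and the other the reference items, it equals the harmonic mean of precision and recall.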
How it works
For sets A and B:
Dice(A, B) = 2 * |A ∩ B| / (|A| + |B|)
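Symmetric: Dice(A, B) = Dice(B, A). Undefined (0/0) when both sets are empty. Note that 1 - Dice does not satisfy the triangle inequality, so it is a similarity measure rather than a metric in the strict sense. Can be applied to token sets, n-grams, or any other discrete elements. For text, it is usually computed on sets of words or n-grams.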
Relationship to Jaccard: Dice = 2 * Jaccard / (1 + Jaccard)
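The relationship follows from the identity |A ∪ B| = |A| + |B| - |A ∩ B|. A short derivation, writing J for the Jaccard index |A ∩ B| / |A ∪ B| and assuming A ∪ B is non-empty:

```latex
\begin{align*}
\mathrm{Dice}(A, B)
  &= \frac{2\,|A \cap B|}{|A| + |B|} \\
  &= \frac{2\,|A \cap B|}{|A \cup B| + |A \cap B|}
     && \text{since } |A| + |B| = |A \cup B| + |A \cap B| \\
  &= \frac{2J}{1 + J}
     && \text{dividing numerator and denominator by } |A \cup B|
\end{align*}
```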
[illustrate: Two sets with intersection and cardinalities marked; show Dice vs Jaccard comparison]
Example
A = {“the”, “quick”, “brown”, “fox”} (|A| = 4)
B = {“the”, “brown”, “dog”} (|B| = 3)
Intersection: {“the”, “brown”} (size 2)
Dice = (2 * 2) / (4 + 3) = 4/7 ≈ 0.571
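Compare: Jaccard = |A ∩ B| / |A ∪ B| = 2/5 = 0.4. The union has 5 elements. Dice counts the intersection twice in the numerator, so it scores the same pair higher.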
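The worked example can be checked with a few lines of Python. This is a minimal sketch, not a reference implementation: the function names `dice` and `jaccard` are illustrative, and returning 1.0 for two empty sets is an assumed convention, since the formula itself gives 0/0 in that case.

```python
def dice(a: set, b: set) -> float:
    """Dice coefficient: 2 * |A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        # Assumed convention: treat two empty sets as identical (formula is 0/0).
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def jaccard(a: set, b: set) -> float:
    """Jaccard index: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        # Same assumed convention as in dice().
        return 1.0
    return len(a & b) / len(a | b)


A = {"the", "quick", "brown", "fox"}
B = {"the", "brown", "dog"}

d = dice(A, B)
j = jaccard(A, B)

print(round(d, 3))  # 0.571
print(j)            # 0.4
```

The two values also satisfy Dice = 2 * Jaccard / (1 + Jaccard): 2 * 0.4 / 1.4 = 4/7.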
Variants and history
Introduced independently by Dice (1945) and Sørensen (1948). Equivalent to the F1 score when one set is the predicted positives and the other the actual positives: Dice = 2TP / (2TP + FP + FN). Used in ecology, bioinformatics, and information retrieval. Often applied to bigram-based string similarity and document comparison.
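The F1 equivalence can be illustrated numerically. In the sketch below, the sets `predicted` and `actual` and the function names are made up for illustration, and both sets are assumed to be non-empty.

```python
def dice(a: set, b: set) -> float:
    """Dice coefficient; assumes at least one of the sets is non-empty."""
    return 2 * len(a & b) / (len(a) + len(b))


def f1_from_sets(predicted: set, actual: set) -> float:
    """F1 computed from precision and recall; assumes both sets are non-empty."""
    tp = len(predicted & actual)       # true positives
    if tp == 0:
        return 0.0                     # avoid 0/0 in the harmonic mean
    precision = tp / len(predicted)    # TP / (TP + FP)
    recall = tp / len(actual)          # TP / (TP + FN)
    return 2 * precision * recall / (precision + recall)


# Hypothetical item IDs: 2 true positives, 2 false positives, 1 false negative.
predicted = {1, 2, 3, 4}
actual = {2, 3, 5}

f1 = f1_from_sets(predicted, actual)
d = dice(predicted, actual)
print(round(f1, 3), round(d, 3))  # 0.571 0.571
```

Here precision is 2/4 and recall is 2/3, so F1 = 2TP / (2TP + FP + FN) = 4/7, the same value the Dice formula gives.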
When to use it
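Document and text similarity via tokens or bigrams. Weights the intersection more heavily than Jaccard, so partially overlapping sets score higher: Dice ≥ Jaccard, with equality only at 0 and 1. Both reach 1 only for identical sets, and because Dice is a monotonic function of Jaccard, the two rank pairs identically; the choice matters mainly when scores are compared against a fixed threshold. Common in duplicate detection and near-duplicate document clustering.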
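A rough sketch of bigram-based near-duplicate detection follows. The helper names, the lowercasing step, the use of sets rather than multisets of bigrams, and the 0.8 threshold are all assumptions chosen for illustration. The all-pairs loop is quadratic and only suitable for small collections.

```python
from itertools import combinations


def char_bigrams(text: str) -> set:
    """Set of adjacent character pairs (repeated bigrams are counted once)."""
    s = text.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice(a: set, b: set) -> float:
    """Dice coefficient; two empty sets are treated as identical (assumption)."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0


def near_duplicates(docs: list, threshold: float = 0.8) -> list:
    """Return (i, j, score) for every pair of documents scoring >= threshold."""
    grams = [char_bigrams(d) for d in docs]
    pairs = []
    for i, j in combinations(range(len(docs)), 2):
        score = dice(grams[i], grams[j])
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs


# "night" and "nacht" share only the bigram "ht": 2 * 1 / (4 + 4)
print(dice(char_bigrams("night"), char_bigrams("nacht")))  # 0.25

docs = [
    "the quick brown fox",
    "the quick brown fix",
    "completely different",
]
result = near_duplicates(docs)
print([(i, j) for i, j, _ in result])  # [(0, 1)]
```

The first two documents differ by one character, share 16 of their 18 bigrams each, and score 32/36 ≈ 0.889; the third shares no bigrams with either.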