Jaro Similarity
What it is
The Jaro similarity metric ranges from 0 (no similarity) to 1 (identical strings). It scores two strings by how many characters they share within a limited positional window and by how many of those shared characters appear in a different order (transpositions).
How it works
Algorithm:
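- Compute the matching window:
floor(max(len(a), len(b)) / 2) - 1
- Mark matching characters in both strings: two characters match if they are equal and their positions differ by at most the window; each character can be matched only once
- Count the matches (m) and the matched characters that appear in a different order in the two strings (t)
- Compute:
jaro = (m/|a| + m/|b| + (m - t/2)/m) / 3
- m: number of matches
- t: number of matched characters that are out of order; t/2 is the number of transpositions (swapped pairs)
- |a|, |b|: string lengths
If m = 0, the score is defined as 0. The result lies in [0, 1], where 1 means the strings are identical.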
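The steps above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function name `jaro` and the variable names are chosen here, the comparison is case-sensitive, and treating two empty strings as identical is an assumed convention. Here `t` counts matched characters that are out of order, so the `t / 2` term is the number of swapped pairs.

```python
def jaro(a: str, b: str) -> float:
    """Jaro similarity of two strings, in [0, 1] (illustrative sketch)."""
    if a == b:
        return 1.0  # assumed convention: also covers two empty strings
    if not a or not b:
        return 0.0

    # Characters may match only if their positions differ by at most `window`.
    window = max(max(len(a), len(b)) // 2 - 1, 0)

    a_matched = [False] * len(a)
    b_matched = [False] * len(b)
    m = 0
    for i, ch in enumerate(a):
        lo = max(0, i - window)
        hi = min(len(b), i + window + 1)
        for j in range(lo, hi):
            # Each character of b can be claimed by at most one match.
            if not b_matched[j] and b[j] == ch:
                a_matched[i] = b_matched[j] = True
                m += 1
                break
    if m == 0:
        return 0.0

    # Read the matched characters in order; t counts positions that disagree.
    a_seq = [ch for ch, hit in zip(a, a_matched) if hit]
    b_seq = [ch for ch, hit in zip(b, b_matched) if hit]
    t = sum(x != y for x, y in zip(a_seq, b_seq))

    return (m / len(a) + m / len(b) + (m - t / 2) / m) / 3
```

For example, `jaro("MARTHA", "MARHTA")` gives about 0.944: all six characters match, and "T" and "H" are out of order, so t = 2.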
[illustrate: Two strings with matching window highlighted, matching characters marked, transposition count shown]
Example
Jaro(“algorithm”, “altruism”):
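- Matching window: floor(9 / 2) - 1 = 3
- Matches: a, l, r, i, m (5 matches); "t" does not match because it is the 7th character of one string and the 3rd of the other, 4 positions apart and outside the window
- Transpositions: 0 (the matched characters appear in the same order in both strings)
- Score: (5/9 + 5/8 + (5 - 0/2)/5) / 3 ≈ 0.73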
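To check by hand which characters pair up in an example like this one, a small helper can list the matched positions. This is a sketch under the same window rule as above; `matched_pairs` is a hypothetical name introduced here, indices are 0-based, and the built-in generic type hints assume Python 3.9 or later.

```python
def matched_pairs(a: str, b: str) -> list[tuple[int, int, str]]:
    """List (index in a, index in b, character) for each Jaro match."""
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    used = [False] * len(b)  # characters of b already claimed by a match
    pairs = []
    for i, ch in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not used[j] and b[j] == ch:
                used[j] = True
                pairs.append((i, j, ch))
                break
    return pairs


for i, j, ch in matched_pairs("algorithm", "altruism"):
    print(f"{ch!r}: index {i} in a, index {j} in b")
```

The number of pairs is m, and comparing the order of the `j` values against the order of the `i` values shows whether any matched characters are out of order.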
Variants and history
Developed by Jaro (1989) for record linkage in census data. It is effective for short strings but can be oversensitive to character order. Winkler later extended it with a bonus for strings that share a common prefix (the Jaro-Winkler similarity). It is widely used in data quality tools and entity matching.
When to use it
Use it for record linkage, duplicate detection, and name matching, especially when transposition errors are common. It is effective for short strings such as names and addresses and less effective for longer strings. Pair it with the Winkler prefix bonus for better name matching.
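The Winkler prefix bonus mentioned above can be sketched as a small adjustment applied to an already computed Jaro score. The function name `winkler_boost` is hypothetical, and the defaults used here (scaling factor `p = 0.1`, at most 4 prefix characters counted) are the commonly cited values, assumed rather than taken from this text.

```python
def winkler_boost(jaro_score: float, a: str, b: str,
                  p: float = 0.1, max_prefix: int = 4) -> float:
    """Apply the Winkler prefix bonus to an existing Jaro score (sketch)."""
    # Length of the common prefix, capped at max_prefix characters.
    prefix = 0
    for x, y in zip(a, b):
        if x != y or prefix == max_prefix:
            break
        prefix += 1
    # Move the score toward 1 in proportion to the shared prefix length.
    return jaro_score + prefix * p * (1.0 - jaro_score)
```

For "MARTHA" and "MARHTA", whose Jaro score is 17/18 (about 0.944), the shared prefix "MAR" raises the score to about 0.961. Keeping `p * max_prefix` at or below 1 ensures the result stays within [0, 1].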