Jaro Similarity

What it is

Jaro similarity scores a pair of strings from 0 (no matching characters) to 1 (identical strings). The score depends on how many characters the two strings share within a limited distance of each other, and on how many of those shared characters appear in a different order (transpositions).

How it works

Algorithm:

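  1. Compute matching window: floor(max(len(a), len(b)) / 2) - 1
  2. Mark matching characters in both strings: equal characters whose positions differ by at most the window; each character can be matched only once
  3. Count matches (m) and matched characters that appear in a different order in the two strings (t)
  4. Compute: jaro = (m/|a| + m/|b| + (m - t/2)/m) / 3
    • m: number of matches
    • t: number of matched characters that are out of order; one swapped pair counts as 2, so t/2 is the number of transpositions
    • |a|, |b|: string lengths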

Returns a value in [0, 1], where 1 means the strings are identical. If there are no matches (m = 0), the score is defined as 0.
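
The steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name jaro and the choice to return 1.0 for two empty strings are assumptions, and t is counted as the number of matched characters that are out of order, so that t/2 is the number of transpositions.

```python
def jaro(a: str, b: str) -> float:
    """Jaro similarity of two strings, in [0, 1] (illustrative sketch)."""
    if a == b:
        return 1.0  # assumption: identical strings, including two empty ones
    if not a or not b:
        return 0.0

    # Step 1: matching window; integer division gives the floor.
    window = max(max(len(a), len(b)) // 2 - 1, 0)

    # Step 2: mark matches; each character can be matched only once.
    a_matched = [False] * len(a)
    b_matched = [False] * len(b)
    m = 0
    for i, ch in enumerate(a):
        lo = max(0, i - window)
        hi = min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_matched[j] and b[j] == ch:
                a_matched[i] = b_matched[j] = True
                m += 1
                break
    if m == 0:
        return 0.0

    # Step 3: compare the matched characters in each string's own order.
    a_seq = [ch for ch, hit in zip(a, a_matched) if hit]
    b_seq = [ch for ch, hit in zip(b, b_matched) if hit]
    t = sum(x != y for x, y in zip(a_seq, b_seq))  # out-of-order characters

    # Step 4: combine the three ratios.
    return (m / len(a) + m / len(b) + (m - t / 2) / m) / 3


print(round(jaro("MARTHA", "MARHTA"), 4))  # 0.9444
```

In the usage line, the swapped T and H both match within the window but sit at different positions in the matched sequences, so t = 2 and exactly one transposition is charged.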

[illustrate: Two strings with matching window highlighted, matching characters marked, transposition count shown]

Example

Jaro(“algorithm”, “altruism”):

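  • Window: floor(9 / 2) - 1 = 3
  • Matches: a, l, r, i, m (5 matches); t does not match because its positions (7th in “algorithm”, 3rd in “altruism”) are 4 apart, outside the window
  • Transpositions: 0 (the matched characters appear in the same order in both strings)
  • Score: (5/9 + 5/8 + (5 - 0/2)/5) / 3 ≈ 0.727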
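
A short check of this pair in Python, assuming the window is floor(max(|a|, |b|) / 2) - 1. Every letter occurs at most once in each of these two strings, so a plain position comparison is enough here; this shortcut is an assumption that holds for this pair only and is not a general matching routine.

```python
a, b = "algorithm", "altruism"
window = max(len(a), len(b)) // 2 - 1       # 9 // 2 - 1 = 3

# 't' is at index 6 in a and index 2 in b: too far apart to match.
t_gap = abs(a.index("t") - b.index("t"))    # 4
t_in_window = t_gap <= window               # False

# Shared letters whose positions differ by at most the window.
matched = [ch for ch in a
           if ch in b and abs(a.index(ch) - b.index(ch)) <= window]
m = len(matched)

# Same letters in b's order; positions that disagree are out of order.
b_order = sorted(matched, key=b.index)
t = sum(x != y for x, y in zip(matched, b_order))

score = (m / len(a) + m / len(b) + (m - t / 2) / m) / 3
print(matched)                # ['a', 'l', 'r', 'i', 'm']
print(m, t, round(score, 3))  # 5 0 0.727
```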

Variants and history

Developed by Jaro (1989) for record linkage in census data. Effective for short strings, though reordered characters can lower the score more than expected. Winkler extended it with a bonus for strings that share a common prefix (Jaro-Winkler). Widely used in data quality tools and entity matching.

When to use it

Record linkage, duplicate detection, and name matching. Effective for short strings (names, addresses); less effective for longer strings. Pair with the Winkler prefix bonus when strings are likely to agree at the start, as names often do. Suitable when transposition errors (characters typed in the wrong order) are common.
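
The Winkler prefix bonus can be sketched as a small adjustment applied to an already computed Jaro score. The function name jaro_winkler is hypothetical, and the scaling factor p = 0.1 and the 4-character prefix cap are the commonly used defaults, assumed here rather than taken from the text above.

```python
def jaro_winkler(jaro_score: float, a: str, b: str,
                 p: float = 0.1, max_prefix: int = 4) -> float:
    """Apply a Winkler-style prefix bonus to a Jaro score (sketch)."""
    # Length of the common prefix, capped at max_prefix characters.
    prefix = 0
    for x, y in zip(a[:max_prefix], b[:max_prefix]):
        if x != y:
            break
        prefix += 1
    # Move the score toward 1 in proportion to the shared prefix.
    return jaro_score + prefix * p * (1.0 - jaro_score)


# 0.9444 is the Jaro score of this pair; the shared prefix is "MAR".
print(round(jaro_winkler(0.9444, "MARTHA", "MARHTA"), 4))  # 0.9611
```

With p = 0.1 and a cap of 4, the bonus can never push the result above 1, and strings with no common prefix keep their Jaro score unchanged.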

See also