Edit Distance

What it is

Edit distance quantifies string similarity as the minimum number of atomic edit operations needed to transform one string into another. The fewer operations required, the more similar the strings.

How it works

Classic edit distance (Levenshtein) allows three operations:

  1. Insertion: add a character
  2. Deletion: remove a character
  3. Substitution: replace one character with another

Compute it via dynamic programming: build a matrix where entry (i, j) holds the minimum edit distance between the first i characters of string A and the first j characters of string B. The base cases are dp[i][0] = i (delete all i characters) and dp[0][j] = j (insert all j characters); fill the rest bottom-up using the recurrence:

dp[i][j] = min(
  dp[i-1][j] + 1,      // deletion
  dp[i][j-1] + 1,      // insertion
  dp[i-1][j-1] + cost  // substitution (cost=0 if chars match, 1 otherwise)
)

[illustrate: DP table for strings “kitten” and “sitting” showing cell values and traceback path]
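The table and recurrence above translate directly to code. A minimal Python sketch (the function name `levenshtein` is our choice, not from the text):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    turning string a into string b."""
    m, n = len(a), len(b)
    # dp[i][j] = distance between a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # base case: delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j          # base case: insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution (or free match)
            )
    return dp[m][n]

print(levenshtein("kitten", "sitting"))  # → 3
```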

Example

Distance(“kitten”, “sitting”) = 3:

  • Substitute k → s: “sitten”
  • Substitute e → i: “sittin”
  • Insert g: “sitting”
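An edit script like the one above can be recovered mechanically: fill the table, then trace back from cell (m, n), recording which of the three moves produced each cell's value. A sketch (the helper name `edit_ops` and the wording of the recorded operations are our own):

```python
def edit_ops(a: str, b: str) -> list:
    """Return one minimal edit script, found by tracing back
    through the DP table from (m, n) to (0, 0)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,
                           dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + cost)
    # Traceback: prefer the diagonal move (match/substitution),
    # then insertion, then deletion.
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if a[i - 1] == b[j - 1] else 1
            if dp[i][j] == dp[i - 1][j - 1] + cost:
                if cost:
                    ops.append(f"substitute {a[i - 1]} -> {b[j - 1]}")
                i, j = i - 1, j - 1
                continue
        if j > 0 and dp[i][j] == dp[i][j - 1] + 1:
            ops.append(f"insert {b[j - 1]}")
            j -= 1
        else:
            ops.append(f"delete {a[i - 1]}")
            i -= 1
    return ops[::-1]

print(edit_ops("kitten", "sitting"))
# → ['substitute k -> s', 'substitute e -> i', 'insert g']
```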

Variants and history

Restricted edit distance allows only a subset of the operations. Weighted edit distance assigns a different cost to each operation (useful for OCR or phonetic matching). Damerau-Levenshtein adds transposition of adjacent characters. Introduced by Vladimir Levenshtein in 1965, edit distance became a foundational metric in record linkage, spell checking, and bioinformatics. The standard DP runs in O(mn) time and space for strings of length m and n; keeping only two rows reduces space to O(min(m, n)).
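To sketch one variant: the optimal-string-alignment form of Damerau-Levenshtein adds a fourth case to the same recurrence for adjacent transpositions. The function name `osa_distance` is our choice; unrestricted Damerau-Levenshtein, which also allows later edits between transposed characters, needs a more elaborate table.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment: Levenshtein's three operations plus
    transposition of two adjacent characters."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
            # Extra case: swap of two adjacent characters counts as 1.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                dp[i][j] = min(dp[i][j], dp[i - 2][j - 2] + 1)
    return dp[m][n]

print(osa_distance("ca", "ac"))  # → 1 (plain Levenshtein would give 2)
```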

When to use it

Edit distance is a fundamental similarity metric for typo tolerance and fuzzy matching. Use Damerau-Levenshtein if transpositions are common. For large datasets, pairwise DP is too slow: index with a BK-tree or fall back to approximate methods such as LSH. Not suitable for semantic similarity; it is purely syntactic.
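A BK-tree exploits the fact that edit distance is a metric: the triangle inequality lets a query skip entire subtrees. A minimal sketch (class and method names are our own, not a library API; the compact two-row `levenshtein` is included so the example is self-contained):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with a rolling single-row table (O(min-row) space)."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                      # deletion
                        dp[j - 1] + 1,                  # insertion
                        prev + (a[i - 1] != b[j - 1]))  # substitution
            prev = cur
    return dp[n]

class BKTree:
    """Metric-tree index over a vocabulary, keyed by edit distance."""

    def __init__(self, words):
        it = iter(words)
        self.root = (next(it), {})  # node = (word, {distance: child node})
        for w in it:
            self.add(w)

    def add(self, word):
        node, children = self.root
        while True:
            d = levenshtein(word, node)
            if d == 0:
                return                   # already present
            if d in children:
                node, children = children[d]
            else:
                children[d] = (word, {})
                return

    def query(self, word, max_dist):
        """All stored words within max_dist of `word`."""
        results, stack = [], [self.root]
        while stack:
            node, children = stack.pop()
            d = levenshtein(word, node)
            if d <= max_dist:
                results.append(node)
            # Triangle inequality: only subtrees whose edge label lies
            # within max_dist of d can contain a match.
            for k, child in children.items():
                if d - max_dist <= k <= d + max_dist:
                    stack.append(child)
        return results

tree = BKTree(["book", "books", "cake", "boo", "cape"])
print(sorted(tree.query("bok", 1)))  # → ['boo', 'book']
```

With a small edit-distance bound, each query touches only a fraction of the vocabulary instead of comparing against every word.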

See also