Edit Distance
What it is
Edit distance quantifies string similarity as the minimum number of atomic edit operations required to transform one string into another. The fewer operations required, the more similar the strings.
How it works
Classic edit distance (Levenshtein) allows three operations:
- Insertion: add a character
- Deletion: remove a character
- Substitution: replace one character with another
Compute via dynamic programming: build a matrix where entry (i, j) represents the minimum edit distance between the first i characters of string A and the first j characters of string B. The base cases are dp[i][0] = i (delete all i characters) and dp[0][j] = j (insert all j characters). Fill bottom-up using the recurrence:
dp[i][j] = min(
dp[i-1][j] + 1, // deletion
dp[i][j-1] + 1, // insertion
dp[i-1][j-1] + cost // substitution (cost=0 if chars match, 1 otherwise)
)
[illustrate: DP table for strings “kitten” and “sitting” showing cell values and traceback path]
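The recurrence and base cases above translate directly into a short dynamic-programming routine. A minimal sketch in Python (the function name `levenshtein` is ours):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    turning string a into string b (classic Levenshtein distance)."""
    m, n = len(a), len(b)
    # dp[i][j] = edit distance between a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all i characters of a
    for j in range(n + 1):
        dp[0][j] = j          # insert all j characters of b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution
            )
    return dp[m][n]
```

Keeping the full matrix makes traceback (recovering the actual edit script) straightforward: walk from dp[m][n] back to dp[0][0], at each cell choosing the neighbor that produced its value.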
Example
Distance(“kitten”, “sitting”) = 3:
- Substitute k → s: “sitten”
- Substitute e → i: “sittin”
- Insert g: “sitting”
Variants and history
Restricted edit distance allows only a subset of the three operations. Weighted edit distance assigns different costs per operation (useful for OCR or phonetic matching). Damerau-Levenshtein adds transposition of adjacent characters as a fourth operation. Edit distance is a foundational metric in record linkage, spell checking, and bioinformatics. The standard dynamic program takes O(mn) time and space for strings of length m and n; space can be reduced to O(min(m, n)) by keeping only two rows of the matrix.
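The Damerau-Levenshtein variant is commonly implemented in its "optimal string alignment" form, which extends the recurrence with one extra case for swapping two adjacent characters. A sketch (the function name `osa_distance` is ours):

```python
def osa_distance(a: str, b: str) -> int:
    """Damerau-Levenshtein distance in its 'optimal string alignment'
    form: Levenshtein's three operations plus transposition of two
    adjacent characters."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution
            )
            # transposition: a[..xy] vs b[..yx] counts as one edit
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                dp[i][j] = min(dp[i][j], dp[i - 2][j - 2] + 1)
    return dp[m][n]
```

For example, "ab" → "ba" costs 1 here (one swap) versus 2 under plain Levenshtein.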
When to use it
Edit distance is the fundamental similarity metric for typo tolerance and fuzzy matching. Use Damerau-Levenshtein when transpositions are common, as in typing errors. For long strings or large datasets the quadratic cost becomes prohibitive; use approximate or indexed methods (BK-trees, locality-sensitive hashing). It is purely syntactic and not suitable for semantic similarity.
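To illustrate the indexed-search idea, here is a minimal BK-tree sketch (the class name `BKTree` and the sample words are ours): because edit distance is a metric, the triangle inequality lets a query of radius r skip any subtree whose branch distance falls outside [d - r, d + r].

```python
def levenshtein(a: str, b: str) -> int:
    """Two-row Levenshtein distance, O(min space) version."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        prev = cur
    return prev[n]

class BKTree:
    """Metric tree over edit distance: each child is keyed by its
    distance to the parent, so the triangle inequality prunes search."""
    def __init__(self, words):
        it = iter(words)
        self.root = (next(it), {})  # node = (word, {distance: child})
        for w in it:
            self._insert(w)

    def _insert(self, word):
        node = self.root
        while True:
            d = levenshtein(word, node[0])
            if d == 0:
                return              # duplicate word, nothing to do
            if d in node[1]:
                node = node[1][d]   # descend along the matching branch
            else:
                node[1][d] = (word, {})
                return

    def query(self, word, radius):
        """All stored words within `radius` edits of `word`."""
        results, stack = [], [self.root]
        while stack:
            node = stack.pop()
            d = levenshtein(word, node[0])
            if d <= radius:
                results.append(node[0])
            # only branches keyed in [d - radius, d + radius] can match
            for k, child in node[1].items():
                if d - radius <= k <= d + radius:
                    stack.append(child)
        return results
```

Usage: build once from a dictionary, then each query touches only a fraction of the words instead of computing the distance to every entry.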