String-Similarity
-
Trigram Similarity
Jaccard similarity over the sets of character trigrams (three-character substrings) of two strings. Used by the PostgreSQL pg_trgm extension for fast approximate matching.
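
A minimal sketch of the idea in Python, assuming plain unpadded trigrams. The function names are illustrative, and pg_trgm itself normalizes and pads words before extracting trigrams, so its scores will differ from this sketch.

```python
def trigrams(text: str) -> set[str]:
    """Return the set of 3-character substrings of text (no padding)."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two trigram sets: |A & B| / |A | B|."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        # Assumption: two strings with no trigrams count as identical.
        return 1.0
    return len(ta & tb) / len(ta | tb)
```

Strings shorter than three characters yield an empty trigram set here, which is one reason real implementations pad the input.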
-
Longest Common Substring
Longest contiguous character sequence common to two strings. Useful for plagiarism detection and similarity measurement.
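
One common way to compute this is dynamic programming over suffix-match lengths. The sketch below assumes that approach (suffix-tree methods are an alternative) and uses an illustrative function name.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return one longest contiguous run shared by a and b."""
    best_len, best_end = 0, 0
    # prev[j] = length of the common suffix of a[:i-1] and b[:j]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]
```

This runs in O(len(a) * len(b)) time and keeps only two rows of the table in memory.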
-
Longest Common Subsequence
Longest sequence of characters common to two strings in order (not necessarily contiguous). Foundation for sequence alignment and diff algorithms.
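
The standard dynamic-programming recurrence for the length of the longest common subsequence can be sketched as follows. The function name is illustrative, and only the length is returned, not the subsequence itself.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest (not necessarily contiguous) common subsequence."""
    # prev[j] = LCS length of the processed prefix of a and b[:j]
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]
```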
-
Levenshtein Distance
Edit distance allowing insertions, deletions, and substitutions. Canonical metric for string similarity and typo tolerance.
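
A short sketch of the usual two-row dynamic-programming computation, assuming unit cost for every operation. The function name is illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    # prev[j] = distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(
                prev[j] + 1,         # delete ch_a
                curr[j - 1] + 1,     # insert ch_b
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]
```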
-
Jaro-Winkler Similarity
Jaro similarity with a bonus for a shared prefix, conventionally capped at four characters. Favors strings that agree at the start, which helps name and record matching.
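
The prefix adjustment can be sketched on its own, taking an already computed Jaro score as input. The function name is hypothetical, and the scaling factor p = 0.1 and the four-character prefix cap are the conventional defaults, assumed here.

```python
def winkler_adjust(jaro_score: float, s1: str, s2: str,
                   p: float = 0.1, max_prefix: int = 4) -> float:
    """Boost a Jaro score according to the length of the shared prefix."""
    prefix = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2 or prefix == max_prefix:
            break
        prefix += 1
    # The bonus shrinks as the base score approaches 1.
    return jaro_score + prefix * p * (1.0 - jaro_score)
```

With p at most 0.25 and a cap of four characters, the adjusted score cannot exceed 1.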
-
Jaro Similarity
String similarity metric for short strings based on matching characters and transpositions. Commonly used in record linkage and data quality.
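
A sketch of the usual formulation: characters match if they are equal and lie within a window of floor(max(len) / 2) - 1 positions, and a transposition is counted for every two matched characters that appear in a different order. The function name is illustrative.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Compare matched characters in order; half the mismatches are transpositions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(x != y for x, y in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3
```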
-
Hamming Distance
Number of positions at which two equal-length strings differ. Efficient metric for fixed-length codes and binary data.
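
A minimal sketch, with an illustrative function name. Raising an error on unequal lengths is a choice made here, since the distance is undefined in that case.

```python
def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length inputs")
    return sum(x != y for x, y in zip(a, b))
```

The same function works on any pair of equal-length sequences, for example two `bytes` objects.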
-
Edit Distance
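Minimum number of single-character operations (insertions, deletions, substitutions) needed to transform one string into another. With this operation set it is the Levenshtein distance; other variants change the allowed operations. Foundation for many similarity metrics.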
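
One common way to turn an edit distance into a similarity score is to normalize by the longer string's length. The sketch below assumes a unit-cost distance has already been computed by some other routine; the function name is hypothetical.

```python
def edit_similarity(distance: int, a: str, b: str) -> float:
    """Map a unit-cost edit distance to a similarity in [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        # Assumption: two empty strings are treated as identical.
        return 1.0
    # A unit-cost distance never exceeds the longer length,
    # so the result stays within [0, 1].
    return 1.0 - distance / longest
```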
-
Damerau-Levenshtein Distance
Edit distance that also allows transpositions (swapping two adjacent characters) as a single operation. Because a swap costs one edit rather than two, it models common typing errors better than Levenshtein alone.
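
The sketch below implements the restricted variant, usually called optimal string alignment (OSA), in which no substring is edited more than once. It is simpler than the unrestricted Damerau-Levenshtein distance and can give a larger value on some inputs. The function name is illustrative.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: Levenshtein plus adjacent swaps."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition: a[i-2:i] equals b[j-2:j] reversed.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

For "ca" and "abc" this variant yields 3, whereas the unrestricted distance is 2, which illustrates the difference between the two definitions.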