Trigram Similarity
What it is
Trigram similarity applies Jaccard similarity to the set of 3-character ngrams (trigrams) extracted from strings. It provides fast approximate string matching without computing edit distance, leveraging PostgreSQL’s ngram indexing.
How it works
Algorithm:
- Extract all trigrams from both strings (with padding: “abc” → " ab", “abc”, “bc “)
- Compute Jaccard similarity over trigram sets
- Optional: apply similarity threshold for filtering
Trigrams are position-independent; order is ignored. Similar strings share many trigrams even if globally different.
[illustrate: Two strings with extracted trigrams shown as sets; Jaccard computation over trigram intersection and union]
Example
String A: “algorithm” Trigrams: " al”, “alg”, “lgo”, “gor”, “ori”, “rit”, “ith”, “thm”, “hm "
String B: “altruism” Trigrams: " al”, “alt”, “ltr”, “tru”, “rui”, “uis”, “ism”, “sm "
Shared: " al” (similarity = 1/15 ≈ 0.067)
Strings with prefix “al” share that trigram; longer commonalities increase similarity.
Variants and history
PostgreSQL pg_trgm extension uses trigrams for fast substring and fuzzy matching. Index support enables efficient queries without sequential scan. Threshold typically 0.3–0.5 for practical matching. Can use bigrams or larger ngrams; trigrams balance recall and speed.
When to use it
Fast approximate string matching when exact matching is too strict. Database-native support via pg_trgm indices. Efficient for substring matching and typo tolerance. Less accurate than edit distance but orders of magnitude faster. Suitable for large-scale string search.