Trigram Similarity

What it is

Trigram similarity applies Jaccard similarity to the set of 3-character ngrams (trigrams) extracted from strings. It provides fast approximate string matching without computing edit distance, leveraging PostgreSQL’s ngram indexing.

How it works

Algorithm:

  1. Extract all trigrams from both strings (with padding: “abc” → " ab", “abc”, “bc “)
  2. Compute Jaccard similarity over trigram sets
  3. Optional: apply similarity threshold for filtering

Trigrams are position-independent; order is ignored. Similar strings share many trigrams even if globally different.

[illustrate: Two strings with extracted trigrams shown as sets; Jaccard computation over trigram intersection and union]

Example

String A: “algorithm” Trigrams: " al”, “alg”, “lgo”, “gor”, “ori”, “rit”, “ith”, “thm”, “hm "

String B: “altruism” Trigrams: " al”, “alt”, “ltr”, “tru”, “rui”, “uis”, “ism”, “sm "

Shared: " al” (similarity = 1/15 ≈ 0.067)

Strings with prefix “al” share that trigram; longer commonalities increase similarity.

Variants and history

PostgreSQL pg_trgm extension uses trigrams for fast substring and fuzzy matching. Index support enables efficient queries without sequential scan. Threshold typically 0.3–0.5 for practical matching. Can use bigrams or larger ngrams; trigrams balance recall and speed.

When to use it

Fast approximate string matching when exact matching is too strict. Database-native support via pg_trgm indices. Efficient for substring matching and typo tolerance. Less accurate than edit distance but orders of magnitude faster. Suitable for large-scale string search.

See also