Character N-Gram
What it is
A character n-gram is an n-gram computed over individual characters rather than words. Instead of sliding a window across a sequence of tokens, the window slides across the raw character sequence of a string — spaces, punctuation, and all.
From the word "colour" with n = 3, the character trigrams are: col, olo, lou, our.
From the phrase "ice cream" with n = 2, the character bigrams are: "ic", "ce", "e ", " c", "cr", "re", "ea", "am" — the space between words becomes a character like any other.
This single decision — operating on characters, not words — has a large practical consequence: character n-grams require no tokeniser, no language-specific word boundary rules, and no dictionary. A pipeline built on character n-grams works identically on English, Arabic, Finnish, Japanese, and code, because all of those are ultimately sequences of characters.
How it works
The extraction procedure is identical to word n-gram extraction, with the token sequence replaced by the string's characters (Python strings slice directly, so no conversion is needed):
def char_ngrams(text: str, n: int) -> list[str]:
    """Return every length-n substring of text, spaces and punctuation included."""
    return [text[i : i + n] for i in range(len(text) - n + 1)]
char_ngrams("colour", 3)
# ['col', 'olo', 'lou', 'our']
char_ngrams("ice cream", 2)
# ['ic', 'ce', 'e ', ' c', 'cr', 're', 'ea', 'am']
A string of length L produces L − n + 1 character n-grams. For a 6-character word with n = 3, that is 4 trigrams. Whether to include spaces and punctuation is a design choice: including them encodes boundary information (the trigram " co", with its leading space, can only occur at a word start, so word-initial and word-internal occurrences of "co" stay distinct); stripping them makes the index slightly more compact and reduces noise from spacing inconsistencies.
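Both variants are easy to express on top of char_ngrams. The sketch below is illustrative: the helper names are invented here, and single-space padding is just one common convention for generating boundary grams.

def char_ngrams_padded(text: str, n: int) -> list[str]:
    # Surround the string with spaces so explicit boundary grams
    # like " co" and "ur " are produced.
    return char_ngrams(" " + text + " ", n)

def char_ngrams_stripped(text: str, n: int) -> list[str]:
    # Drop all whitespace first: a more compact index, but word
    # boundary information is lost.
    return char_ngrams("".join(text.split()), n)

char_ngrams_padded("colour", 3)
# [' co', 'col', 'olo', 'lou', 'our', 'ur ']

char_ngrams_stripped("ice cream", 3)
# ['ice', 'cec', 'ecr', 'cre', 'rea', 'eam']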
Example
Fuzzy search via trigram index. The word "brownie" has character trigrams: bro, row, own, wni, nie.
A user types "browni" (missing the final e). Its trigrams are: bro, row, own, wni.
Shared trigrams: bro, row, own, wni — all four query trigrams match. A trigram index lookup retrieves "brownie" as a candidate, which an edit-distance ranker then confirms as the closest match. No full vocabulary scan was needed.
[illustrate: two token chips — “brownie” and “browni” — each expanded into their trigram sets shown as rows of chips below; shared trigrams highlighted with connecting lines between the two rows, unmatched trigrams in muted colour]
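A minimal end-to-end sketch of this pipeline, reusing char_ngrams from above. The toy vocabulary, the candidate threshold of two shared trigrams, and the hand-rolled Levenshtein function are illustrative choices rather than fixed parts of the technique:

from collections import defaultdict

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Write time: index every vocabulary term under each of its trigrams.
vocab = ["brownie", "browser", "crown", "blondie"]
index = defaultdict(set)
for term in vocab:
    for gram in char_ngrams(term, 3):
        index[gram].add(term)

# Query time: count shared trigrams, keep candidates, rerank by distance.
def fuzzy_lookup(query: str, min_shared: int = 2) -> list[str]:
    counts = defaultdict(int)
    for gram in set(char_ngrams(query, 3)):
        for term in index.get(gram, ()):
            counts[term] += 1
    candidates = [t for t, c in counts.items() if c >= min_shared]
    return sorted(candidates, key=lambda t: levenshtein(query, t))

fuzzy_lookup("browni")
# ['brownie', 'crown', 'browser']  (edit distances 1, 2, 3)

Only terms sharing at least one trigram with the query are ever touched, which is what makes the lookup cheap relative to a full vocabulary scan.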
Variants and history
Trigram indexes for fuzzy search. The use of character trigrams to support approximate string matching was established by Ukkonen (1992) and later extended by Gravano et al. (2001). The approach is now standard in PostgreSQL (pg_trgm), search engines, and spell-checkers: index every term’s trigram set at write time; at query time, retrieve terms sharing enough trigrams with the query; rerank by edit distance. Bigrams match too broadly; 4-grams are brittle, since a single mid-word typo can destroy most or all of a short word’s shared grams. Trigrams occupy the practical sweet spot for most Western-script vocabularies.
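pg_trgm ranks candidates by a trigram similarity score rather than a raw shared count. A rough Python analogue (ignoring pg_trgm's lowercasing and space-padding details) is the Jaccard ratio of the two trigram sets:

def trigram_similarity(a: str, b: str) -> float:
    # Shared trigrams divided by the union of both trigram sets.
    ga, gb = set(char_ngrams(a, 3)), set(char_ngrams(b, 3))
    return len(ga & gb) / len(ga | gb) if ga or gb else 0.0

trigram_similarity("brownie", "browni")
# 0.8  (4 shared trigrams, 5 in the union)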
Language identification. Each language has a characteristic fingerprint of character n-gram frequencies. Cavnar and Trenkle (1994) demonstrated that ranking the most frequent character trigrams and bigrams in a sample, then comparing that ranked profile against pre-built per-language profiles, reliably identifies language from as few as 50–100 characters. The technique requires no word segmentation, making it equally applicable to scripts where word boundaries are unmarked (Thai, Japanese) and to noisy or code-mixed text.
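A toy sketch of the Cavnar–Trenkle comparison. The functions follow the paper's idea, but the profile size, the single n-gram size, and the commented-out corpora are placeholders (the original mixes n-gram sizes from 1 to 5):

from collections import Counter

def rank_profile(text: str, n: int = 3, top: int = 300) -> list[str]:
    # Most-frequent-first list of character n-grams: the "fingerprint".
    return [g for g, _ in Counter(char_ngrams(text.lower(), n)).most_common(top)]

def out_of_place(sample: list[str], reference: list[str]) -> int:
    # Sum of rank displacements; grams absent from the reference
    # profile receive the maximum penalty.
    ref_rank = {g: i for i, g in enumerate(reference)}
    return sum(abs(i - ref_rank.get(g, len(reference)))
               for i, g in enumerate(sample))

# profiles = {"en": rank_profile(english_corpus), "fi": rank_profile(finnish_corpus)}
# guess = min(profiles, key=lambda lang: out_of_place(rank_profile(sample), profiles[lang]))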
Morphologically rich languages. In languages such as Finnish, Turkish, or Arabic, a single word may carry the meaning of an entire English phrase through inflectional suffixes and prefixes. Word n-gram models struggle here because vocabulary size explodes. Character n-gram models are naturally robust: they capture shared substrings across inflected forms (talk, talker, talking, talked all share tal, alk) without requiring a morphological analyser.
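The effect is easy to verify with char_ngrams from above; the inflected English forms here stand in for the richer paradigms of Finnish or Turkish:

forms = ["talk", "talker", "talking", "talked"]
# Every form contributes its trigram set; the intersection is the
# shared morphological core.
print(set.intersection(*(set(char_ngrams(f, 3)) for f in forms)))
# {'tal', 'alk'}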
Subword models. Byte pair encoding can be understood as learning which character n-grams are productive enough to be promoted to vocabulary entries. FastText word embeddings take a more direct approach: they represent each word as the sum of its character n-gram embeddings (typically 3–6 grams), making the model intrinsically robust to misspellings and morphological variation.
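A stripped-down sketch of the fastText composition step. The dimension, bucket count, random vectors, and Python's built-in hash are placeholders (real fastText uses trained vectors, an FNV hash, boundary markers, and a dedicated whole-word vector):

import numpy as np

DIM, BUCKETS = 8, 2**16
rng = np.random.default_rng(0)
gram_embeddings = rng.normal(size=(BUCKETS, DIM))  # stand-in for trained vectors

def word_vector(word: str) -> np.ndarray:
    # Sum the embeddings of the word's character 3- to 6-grams.
    # hash() is process-salted: consistent within one run, fine for a sketch.
    grams = [g for n in range(3, 7) for g in char_ngrams(word, n)]
    return sum(gram_embeddings[hash(g) % BUCKETS] for g in grams)

Because "browni" and "brownie" share most of their grams, their vectors land close together; misspellings inherit most of the correct word's representation.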
When to use it
Fuzzy and typo-tolerant search. Build a trigram index at index time; query it to retrieve candidates, then rerank by Levenshtein distance. PostgreSQL’s pg_trgm extension implements this out of the box; Elasticsearch exposes the same idea through its ngram tokeniser and token filter.
Language identification. Character n-gram profiles (typically bigrams and trigrams) are a fast, dependency-free signal — useful when you cannot assume a language before processing, or when text is too short for a word-level model.
Morphologically rich or script-diverse text. If your corpus spans multiple languages, contains dense affixation (medical, legal, chemical), or comes from a domain with many proper nouns, character n-grams tolerate vocabulary explosion better than word n-grams.
Tradeoffs:
- Character n-grams produce far more index terms than word n-grams. A 10-character word generates 8 trigrams; a 1 000-character document generates 998. Storage grows with string length, not token count.
- They encode surface similarity only. "colour" and "hue" share no character trigrams despite being near-synonyms; dense embeddings are necessary for semantic matching.
- Cross-script matching is unreliable. A query in Latin script shares no character n-grams with an equivalent term in Cyrillic or Arabic; transliteration or embedding-based approaches are required.