Trigram
What it is
A trigram is an n-gram of length 3: a window of three consecutive tokens extracted from a sequence. From the sentence "the quick brown fox", the word trigrams are the quick brown and quick brown fox. From the word "brown", the character trigrams are bro, row, own.
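Both extractions are the same sliding-window operation. A minimal sketch in Python (the function names are illustrative, not from any library):

```python
def word_trigrams(tokens):
    """Slide a width-3 window over a list of tokens."""
    return [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]

def char_trigrams(word):
    """Slide a width-3 window over the characters of a string."""
    return [word[i:i + 3] for i in range(len(word) - 2)]

print(word_trigrams("the quick brown fox".split()))
# [('the', 'quick', 'brown'), ('quick', 'brown', 'fox')]
print(char_trigrams("brown"))
# ['bro', 'row', 'own']
```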
The trigram is the natural next step up from the bigram. It widens the context window by one token — a small change numerically, but one that resolves ambiguities that a one-token history cannot.
How it works
The trigram language model extends the bigram formulation by conditioning each token on the two tokens that precede it:
P(wᵢ | wᵢ₋₂, wᵢ₋₁) = count(wᵢ₋₂, wᵢ₋₁, wᵢ) / count(wᵢ₋₂, wᵢ₋₁)
This is a second-order Markov model. The probability table now has at most |V|³ entries: for a vocabulary of 50,000 words, that is 125 trillion possible trigrams, the vast majority of which never appear in any finite corpus. This is the sparsity problem: counts are too thin to estimate probabilities reliably without smoothing.
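As a concrete sketch, both counts and the conditional probability can be estimated directly from a token list. This is a minimal maximum-likelihood version with no smoothing; the toy corpus and function names are assumptions for illustration:

```python
from collections import Counter

def train_trigram_lm(tokens):
    """Count trigrams, then derive each two-token context count from them
    so probabilities sum to 1 within every context."""
    tri_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
    ctx_counts = Counter()
    for (w2, w1, _), c in tri_counts.items():
        ctx_counts[(w2, w1)] += c
    return tri_counts, ctx_counts

def prob(tri_counts, ctx_counts, w2, w1, w):
    """P(w | w2, w1) = count(w2, w1, w) / count(w2, w1); 0.0 if context unseen."""
    ctx = ctx_counts[(w2, w1)]
    return tri_counts[(w2, w1, w)] / ctx if ctx else 0.0

tokens = "the quick brown fox jumps over the quick brown dog".split()
tri, ctx = train_trigram_lm(tokens)
print(prob(tri, ctx, "quick", "brown", "fox"))  # 0.5
print(prob(tri, ctx, "quick", "brown", "dog"))  # 0.5
```

Unseen trigrams get probability zero here, which is exactly the sparsity problem described above; a real model would smooth these counts.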
[illustrate: step-by-step trigram extraction over [“the”, “quick”, “brown”, “fox”, “jumps”] — sliding window of width 3, each step showing the two conditioning tokens (wᵢ₋₂, wᵢ₋₁) shaded in one colour and the predicted token (wᵢ) in another, with the resulting trigram emitted below]
Example
Input: "new york times square"
Word trigrams: new york times, york times square
Comparing bigram vs trigram context for predicting the word after "york":
| Model | Context seen | Candidates for next word |
|---|---|---|
| Bigram | york | times, city, based, … (many words follow “york”) |
| Trigram | new york | times, city (far fewer follow “new york”) |
The extra token of history rules out most of the bigram’s candidate set. Two tokens of context are often enough to pin down the likely completion; one token frequently is not.
[illustrate: before/after showing a bigram prediction fan-out from “york” with many labelled arrows vs a trigram prediction fan-out from “new york” with a smaller, tighter set of arrows — illustrating how the wider context window narrows the distribution]
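The narrowing effect is easy to reproduce on a toy corpus (the corpus below is an invented example, chosen so both context widths have observed continuations):

```python
def next_word_candidates(tokens, context):
    """Set of words observed immediately after a given context tuple."""
    n = len(context)
    return {tokens[i + n] for i in range(len(tokens) - n)
            if tuple(tokens[i:i + n]) == context}

corpus = ("new york times square new york city "
          "york minster a york based firm").split()
print(sorted(next_word_candidates(corpus, ("york",))))
# ['based', 'city', 'minster', 'times']
print(sorted(next_word_candidates(corpus, ("new", "york"))))
# ['city', 'times']
```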
Variants and history
Language model trigrams. Trigram models were the workhorse of production language modelling from the 1980s through to the early 2010s — used in speech recognition, machine translation, and spelling correction. Kneser-Ney smoothing became the standard technique for redistributing probability mass from observed trigrams to unseen ones, and remains a competitive baseline for count-based approaches.
Character trigrams. Character-level trigrams have two well-established applications independent of language modelling:
- Fuzzy search. A query term such as "colour" yields character trigrams col, olo, lou, our. A misspelling like "collour" shares col, lou, our with the correct form, a three-way overlap that a trigram index can retrieve cheaply before edit distance reranks the candidates. Character trigrams strike a practical balance: bigrams match too broadly; 4-grams miss many one-character typos entirely.
- Plagiarism and near-duplicate detection. Representing documents as sets of character trigrams and comparing those sets with Jaccard similarity, or approximating it with MinHash, is a fast, language-agnostic method for finding copied or near-identical passages. This is the same shingling approach used with word n-grams, applied at character level, where it is more robust to minor edits. A sketch of both ideas follows this list.
When to use it
Use trigrams when:
- A bigram language model underfits your task — trigrams are the standard first upgrade when you have enough data to populate the larger table. Rule of thumb: aim for at least 10 observed counts per trigram type before relying on raw frequencies; below that, lean on smoothing.
- You are building a character trigram index for typo-tolerant search and want a window width that balances recall (catching one-character edits) against precision (avoiding too many false candidates).
- You need a fast, language-agnostic document fingerprint for near-duplicate detection; character trigram sets are cheap to compute and compare.
Prefer bigrams when your corpus is small and sparsity is already a concern. Prefer neural language models when you need context longer than two tokens, or when you need to handle unseen sequences gracefully without explicit smoothing.