Bigram
What it is
A bigram is an n-gram of length 2: a pair of consecutive tokens extracted from a sequence. From the sentence "the quick brown fox", the word bigrams are the quick, quick brown, and brown fox. From the word "fox", the character bigrams are fo and ox.
The bigram is the simplest n-gram that encodes order. Where a unigram model sees only which tokens exist, a bigram model also sees which tokens appear next to each other — a distinction that matters for phrase detection, collocations, fuzzy matching, and language identification.
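In code, extraction is just a two-wide sliding window over the sequence; a minimal Python sketch (the function names are illustrative):

```python
# Minimal sketch: word and character bigrams via a two-wide sliding window.
def word_bigrams(sentence):
    tokens = sentence.split()
    return list(zip(tokens, tokens[1:]))

def char_bigrams(word):
    return [word[i:i + 2] for i in range(len(word) - 1)]

print(word_bigrams("the quick brown fox"))
# [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
print(char_bigrams("fox"))
# ['fo', 'ox']
```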
How it works
The bigram language model assigns a probability to each token conditioned on the token immediately before it:
P(wᵢ | wᵢ₋₁) = count(wᵢ₋₁, wᵢ) / count(wᵢ₋₁)
The probability of an entire sequence is the product of these conditional probabilities:
P(w₁ w₂ … wₙ) = P(w₁) × P(w₂|w₁) × P(w₃|w₂) × … × P(wₙ|wₙ₋₁)
This is a first-order Markov model: the probability of the next token depends only on the current token, not on any earlier context. That constraint keeps the probability table tractable — at most |V|² entries for a vocabulary of size |V| — while still capturing local word order that the unigram assumption discards entirely.
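A minimal sketch of the maximum-likelihood estimate above: count adjacent pairs, then divide by how often the preceding token occurs in that position (the function and variable names are illustrative):

```python
from collections import Counter

# Maximum-likelihood bigram table:
# P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}).
def train_bigram_model(tokens):
    context_counts = Counter(tokens[:-1])            # count(w_{i-1})
    pair_counts = Counter(zip(tokens, tokens[1:]))   # count(w_{i-1}, w_i)
    return {pair: c / context_counts[pair[0]]
            for pair, c in pair_counts.items()}

model = train_bigram_model("the cat sat on the mat".split())
print(model[("the", "cat")])   # 0.5: count(the, cat) = 1, count(the) = 2
print(model[("cat", "sat")])   # 1.0
```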
[illustrate: step-by-step probability table for a small corpus — rows are wᵢ₋₁, columns are wᵢ, cells show bigram counts and resulting conditional probabilities P(wᵢ|wᵢ₋₁); highlight the row for “not” to show how “not good” and “not bad” receive different probabilities]
Example
Corpus: "the cat sat on the mat"
Word bigrams extracted: the cat, cat sat, sat on, on the, the mat
| wᵢ₋₁ | wᵢ | count(wᵢ₋₁, wᵢ) | P(wᵢ \| wᵢ₋₁) |
|---|---|---|---|
| the | cat | 1 | 0.5 |
| the | mat | 1 | 0.5 |
| cat | sat | 1 | 1.0 |
| sat | on | 1 | 1.0 |
| on | the | 1 | 1.0 |
Scoring the sequence "the cat sat":
P(the cat sat) = P(the) × P(cat|the) × P(sat|cat)
= 1.0 × 0.5 × 1.0
= 0.5
Here P(the) is taken to be 1.0 because "the" is the only sentence-initial token in this one-sentence corpus; padding with a start-of-sentence marker <s> and using P(the | <s>) = 1/1 gives the same value.
A unigram model would score "the sat cat" identically. The bigram model does not — P(sat|the) = 0 — because "the sat" never appeared in the corpus.
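Both orderings can be scored with the `train_bigram_model` sketch above; the convention here (first-token probability taken as 1.0, unseen bigrams scoring 0) simply mirrors the worked calculation:

```python
def score(model, tokens):
    # First-token probability taken as 1.0, matching the worked example;
    # bigrams never seen in training contribute probability 0.
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        p *= model.get((prev, cur), 0.0)
    return p

print(score(model, "the cat sat".split()))   # 0.5
print(score(model, "the sat cat".split()))   # 0.0, since "the sat" was never observed
```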
Variants and history
Character bigrams. Applying the bigram window at character level produces overlapping two-character slices with two practical applications:
- Fuzzy matching. Two strings that share many character bigrams are likely similar in spelling. Indexing character bigrams lets a search engine retrieve candidates for a mistyped query by intersection, before a costlier edit-distance pass reranks them (a sketch of the matching step follows this list).
- Language identification. Each language has a characteristic distribution of character bigrams. A classifier trained on bigram frequency profiles can identify a language from as few as 20–50 characters.
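One common way to make the fuzzy-matching idea concrete is the Sørensen–Dice coefficient over character-bigram sets; a minimal sketch, with illustrative inputs:

```python
# Character-bigram fuzzy matching: compare two strings by the overlap of
# their bigram sets (Sørensen–Dice coefficient).
def char_bigrams(s):
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice_similarity(a, b):
    ba, bb = char_bigrams(a), char_bigrams(b)
    if not ba or not bb:
        return 0.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))

print(dice_similarity("color", "colour"))   # ~0.67: most bigrams shared
print(dice_similarity("color", "letter"))   # 0.0: no bigrams in common
```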
Collocations. A collocation is a bigram that occurs more frequently than chance would predict: "machine learning", "New York", "heavy rain". Pointwise mutual information (PMI) is a standard measure for surfacing them: high PMI flags pairs where the two tokens strongly prefer each other’s company. Phrase detection in word-embedding pipelines (e.g., Word2Vec’s word2phrase step) uses PMI-style bigram scores to merge recurring pairs into single tokens before training.
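A rough sketch of computing PMI over a small token stream; the toy corpus, the frequency cutoff, and all names here are illustrative rather than taken from any particular library:

```python
import math
from collections import Counter

# PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ), estimated from counts.
def pmi_scores(tokens, min_count=2):
    n = len(tokens)
    unigram = Counter(tokens)
    bigram = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (x, y), c in bigram.items():
        if c < min_count:          # drop one-off pairs, which inflate PMI
            continue
        p_xy = c / (n - 1)
        scores[(x, y)] = math.log2(p_xy / ((unigram[x] / n) * (unigram[y] / n)))
    return scores

tokens = "machine learning makes machine translation and machine learning popular".split()
print(pmi_scores(tokens))   # {('machine', 'learning'): ~1.75}
```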
When to use it
Use bigrams when you need the cheapest model that is sensitive to local word order:
- Phrase detection and collocation extraction — identifying multi-word units that should be treated as a single token.
- Classical language modelling on small corpora where trigrams would be too sparse without aggressive smoothing.
- Character-level indexing for typo-tolerant search or language identification, where the two-character window is wide enough to be distinctive but narrow enough to keep the index compact.
Prefer unigrams when order is genuinely irrelevant (keyword search, topic classification). Prefer trigrams or neural approaches when you need context longer than one token; for generation tasks, even the two tokens of history a trigram provides are rarely enough.