N-Gram Language Model

What it is

An n-gram language model estimates P(t_i | t_{i-n+1}…t_{i-1}) by counting occurrences of n-grams in a corpus and normalizing by lower-order counts. Simple, interpretable, and computationally fast, n-gram models dominated NLP before neural approaches.

[illustrate: Corpus with n-gram counts; probability table for bigrams; Markov assumption showing how context window simplifies computation]

How it works

For a bigram model (n=2):

P(t_i | t_{i-1}) = count(t_{i-1}, t_i) / count(t_{i-1})

For a trigram model (n=3):

P(t_i | t_{i-2}, t_{i-1}) = count(t_{i-2}, t_{i-1}, t_i) / count(t_{i-2}, t_{i-1})
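
As a concrete sketch, the bigram estimate can be computed directly from counters. The following is a minimal Python illustration, not a reference implementation; the names train_bigram and bigram_prob are placeholders, and tokenization is assumed to have happened already:

    from collections import Counter

    def train_bigram(tokens):
        # Count every adjacent pair, and every context token: all tokens
        # except the last, since each of those starts exactly one bigram.
        bigram_counts = Counter(zip(tokens, tokens[1:]))
        context_counts = Counter(tokens[:-1])
        return bigram_counts, context_counts

    def bigram_prob(bigram_counts, context_counts, prev, word):
        # Maximum-likelihood estimate of P(word | prev); returns 0.0 for
        # contexts never seen in training.
        if context_counts[prev] == 0:
            return 0.0
        return bigram_counts[(prev, word)] / context_counts[prev]

The same pattern extends to trigrams by counting (t_{i-2}, t_{i-1}, t_i) triples and normalizing by (t_{i-2}, t_{i-1}) pair counts.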

Markov assumption: the current token depends only on the previous n-1 tokens, not on the entire history. This simplification is linguistically unrealistic but makes estimation and inference tractable.
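
To make the simplification concrete: the chain rule factorizes a sequence probability exactly as

P(t_1 … t_N) = P(t_1) · P(t_2 | t_1) · P(t_3 | t_1, t_2) · … · P(t_N | t_1 … t_{N-1})

and a bigram model approximates each factor by conditioning on the previous token only:

P(t_1 … t_N) ≈ P(t_1) · P(t_2 | t_1) · P(t_3 | t_2) · … · P(t_N | t_{N-1})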

Smoothing: under maximum-likelihood counting, any n-gram absent from the training corpus gets probability zero, which makes every sequence containing it impossible. Smoothing techniques (Laplace/add-one, Kneser-Ney) reassign a small amount of probability mass to rare and unseen n-grams.
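
For example, add-one (Laplace) smoothing adds 1 to every bigram count and compensates by adding the vocabulary size V to the denominator:

P(t_i | t_{i-1}) = (count(t_{i-1}, t_i) + 1) / (count(t_{i-1}) + V)

A minimal sketch reusing the counters from the code above (laplace_bigram_prob is an illustrative name):

    def laplace_bigram_prob(bigram_counts, context_counts, vocab_size,
                            prev, word):
        # Add-one (Laplace) smoothing: pretend every (prev, word) pair
        # occurred once more than observed, so no estimate is exactly zero.
        return (bigram_counts[(prev, word)] + 1) / \
               (context_counts[prev] + vocab_size)

Laplace smoothing is simple but heavy-handed; Kneser-Ney redistributes mass more carefully, discounting observed counts and backing off based on how often a word appears in novel contexts.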

Example

Corpus: "the cat sat on the mat. the dog ran."

Bigram counts (with the period treated as a token):
(the, cat): 1
(cat, sat): 1
(sat, on): 1
(on, the): 1
(the, mat): 1
(mat, .): 1
(., the): 1
(the, dog): 1
(dog, ran): 1
(ran, .): 1

P(cat | the) = count(the, cat) / count(the) = 1 / 3 ≈ 0.33
P(dog | the) = count(the, dog) / count(the) = 1 / 3 ≈ 0.33
P(mat | the) = count(the, mat) / count(the) = 1 / 3 ≈ 0.33
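
Treating the period as its own token, the sketch from "How it works" reproduces these numbers:

    tokens = "the cat sat on the mat . the dog ran .".split()
    bigram_counts, context_counts = train_bigram(tokens)
    print(bigram_prob(bigram_counts, context_counts, "the", "cat"))  # 0.333...
    print(bigram_prob(bigram_counts, context_counts, "the", "dog"))  # 0.333...
    print(bigram_prob(bigram_counts, context_counts, "the", "mat"))  # 0.333...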

Variants and history

N-gram models emerged in the 1970s–80s as the primary method for language modeling, speech recognition, and machine translation. Kneser-Ney smoothing (1995) improved handling of rare events. Interpolation and backoff strategies combined n-grams of different orders. Neural language models (2000s) eventually superseded n-grams due to superior generalization and handling of long dependencies. N-grams remain useful for lightweight applications and baselines.

When to use it

Use n-gram models when:

  • Simplicity and interpretability are critical
  • Computational resources are limited
  • You need a quick baseline
  • You are modeling structured, domain-specific language (e.g., code or SQL)
  • Real-time inference on edge devices is required

N-gram models are fast and transparent but capture only shallow local context. Neural models outperform on diverse, open-domain text.

See also