Unigram

What it is

A unigram is an n-gram of length 1: a single token considered without reference to its neighbours. The word "fox" extracted from "the quick brown fox" is a unigram; so is the character "f" extracted from a character-level tokenisation.

The term matters most as a shorthand for the unigram assumption — the modelling choice that each token in a document is generated independently of every other token. That assumption is almost never literally true, but it is enormously useful: it collapses an intractable joint probability over all token sequences into a simple product of per-token probabilities.

How it works

Under the unigram language model, the probability of a document is the product of the probabilities of its individual tokens:

P(document) = P(t₁) × P(t₂) × … × P(tₙ)

Each token probability is estimated from its frequency in a training corpus:

P(t) = count(t) / total_tokens_in_corpus

Because tokens are treated as independent, word order is discarded entirely. The documents "dog bites man" and "man bites dog" are identical under a unigram model — they have the same token frequencies and therefore the same probability.
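
The estimation and scoring above fit in a few lines. Below is a minimal Python sketch, assuming whitespace tokenisation and raw maximum-likelihood counts with no smoothing; the function names and the toy corpus are invented for illustration.

```python
from collections import Counter
import math

def train_unigram(corpus):
    """Maximum-likelihood unigram estimates: P(t) = count(t) / total tokens."""
    counts = Counter(token for doc in corpus for token in doc.split())
    total = sum(counts.values())
    return {token: count / total for token, count in counts.items()}

def log_prob(document, probs):
    """Log-probability of a document: the sum of per-token log P(t).
    A token unseen in training gets probability zero, hence -inf."""
    total = 0.0
    for token in document.split():
        p = probs.get(token, 0.0)
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
    return total

# Word order is invisible: any permutation of the same tokens scores the same.
model = train_unigram(["dog bites man", "man bites dog", "the dog barks"])
print(math.isclose(log_prob("dog bites man", model),
                   log_prob("man bites dog", model)))   # True: order is ignored
```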

Example

Corpus of two documents:

ID Text
D1 "the cat sat"
D2 "the dog sat"

Vocabulary frequencies across both documents: the:2, cat:1, sat:2, dog:1 (6 tokens total).

Unigram probabilities: P(the) = 2/6 ≈ 0.333, P(sat) ≈ 0.333, P(cat) = 1/6 ≈ 0.167, P(dog) ≈ 0.167.

Scoring each document under the model, as the product of its token probabilities:

P(D1) = P(the) × P(cat) × P(sat) = 0.333 × 0.167 × 0.333 ≈ 0.019

D2 scores P(the) × P(dog) × P(sat) ≈ 0.019, an identical value, because "cat" and "dog" occur with the same corpus frequency and the model sees nothing beyond frequencies. A query fares no better: "cat sat" and "sat cat" both receive P(cat) × P(sat) ≈ 0.056. This is the bag-of-words limitation: term position and co-occurrence are invisible.
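
A short Python sketch reproducing these numbers; the whitespace split and the variable names are illustrative choices only.

```python
from collections import Counter

corpus = {"D1": "the cat sat", "D2": "the dog sat"}
counts = Counter(t for text in corpus.values() for t in text.split())
total = sum(counts.values())                       # 6 tokens
probs = {t: c / total for t, c in counts.items()}  # the, sat: 1/3; cat, dog: 1/6

def doc_prob(text):
    """Probability of a document as the product of its token probabilities."""
    p = 1.0
    for t in text.split():
        p *= probs[t]
    return p

print(doc_prob(corpus["D1"]), doc_prob(corpus["D2"]))  # both 1/54 ≈ 0.0185
```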

Variants and history

The unigram language model is the simplest member of the n-gram language model family. Bigram models condition each token on the one before it — P(wᵢ | wᵢ₋₁) — capturing local order at the cost of a larger, sparser probability table.
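
For comparison, a minimal sketch of the bigram estimate under the same assumptions (whitespace tokens, no smoothing); the <s> start-of-document symbol is a common convention rather than anything prescribed above.

```python
from collections import Counter

def train_bigram(corpus):
    """P(w_i | w_(i-1)) = count(w_(i-1), w_i) / count of w_(i-1) as a history."""
    pair_counts, history_counts = Counter(), Counter()
    for doc in corpus:
        tokens = ["<s>"] + doc.split()   # <s> marks the start of a document
        for prev, cur in zip(tokens, tokens[1:]):
            pair_counts[(prev, cur)] += 1
            history_counts[prev] += 1
    return {pair: c / history_counts[pair[0]] for pair, c in pair_counts.items()}

bigrams = train_bigram(["dog bites man", "man bites dog"])
print(bigrams[("bites", "man")])  # 0.5: "bites" is followed by "man" in one of its two occurrences
```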

TF-IDF and BM25 as unigram models. Both scoring functions operate under the unigram assumption. Each query term is scored against a document independently; term order is not modelled. This is why a BM25 query for "new york" does not inherently distinguish a document about "New York" from one that happens to contain both words separately. Phrase queries and positional indexes exist precisely to compensate for this limitation.
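
To make the term independence concrete, here is a deliberately simplified TF-IDF scorer (raw term frequency times log inverse document frequency, one common weighting among several, not the exact formula of any particular engine). Both candidate documents get the same score for the query "new york" even though only one contains the phrase.

```python
import math
from collections import Counter

# Three toy documents; only D1 contains the phrase "new york".
docs = {"D1": "new york pizza", "D2": "york has a new mayor", "D3": "deep dish pizza"}
tf = {d: Counter(text.split()) for d, text in docs.items()}
df = Counter(t for counts in tf.values() for t in counts)   # document frequency
N = len(docs)

def tfidf(query, doc_id):
    # Each query term is scored on its own; adjacency and order never enter.
    return sum(tf[doc_id][t] * math.log(N / df[t]) for t in query.split() if t in df)

print(tfidf("new york", "D1"), tfidf("new york", "D2"))   # equal scores, ≈ 0.81
```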

Unigram tokenisation in subword models. In the context of subword tokenisation, “unigram” refers to a specific algorithm — the Unigram Language Model tokeniser used by SentencePiece — which selects a vocabulary by iteratively pruning the subword inventory to maximise a unigram likelihood objective. This is a distinct (and more recent) use of the term.
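
In that setting the unigram probabilities are over subword pieces, and segmenting a word means choosing the split that maximises the product of piece probabilities. The sketch below shows only that segmentation step, as a Viterbi search over a hand-made vocabulary; the EM training and iterative vocabulary pruning that SentencePiece actually performs are omitted, and the piece probabilities are invented for illustration.

```python
import math

# Hypothetical piece probabilities; a real unigram tokeniser learns these.
vocab = {"un": 0.05, "i": 0.02, "gram": 0.04, "unigram": 0.08,
         "u": 0.01, "n": 0.01, "g": 0.01, "r": 0.01, "a": 0.01, "m": 0.01}

def segment(text):
    """Viterbi search for the segmentation maximising the sum of log P(piece)."""
    n = len(text)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in vocab and best[start][0] > -math.inf:
                score = best[start][0] + math.log(vocab[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    # Walk back through the best split points.
    pieces, end = [], n
    while end > 0:
        start = best[end][1]
        pieces.append(text[start:end])
        end = start
    return list(reversed(pieces))

print(segment("unigram"))   # ['unigram']: one probable piece beats several rarer ones
```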

When to use it

The unigram model is the right default when:

  • You need a simple, fast baseline for document scoring with no training overhead.
  • Token order is not meaningful for your task — topic classification and keyword search both tolerate the bag-of-words simplification well.

Reach for bigrams or higher-order models when:

  • Phrase semantics matter and you cannot use positional indexing — "not good" has the opposite meaning to "good".
  • You are building a language model for generation or perplexity scoring, where predicting the next token requires context.

See also