Unigram Language Model Tokeniser

What it is

The Unigram Language Model tokeniser (Unigram LM) is a subword segmentation algorithm introduced by Kudo (2018). Unlike Byte Pair Encoding and WordPiece, which build a vocabulary bottom-up by merging characters into progressively larger units, Unigram LM works top-down: it starts with a large over-complete vocabulary and repeatedly discards entries until a target size is reached. At each step it discards the entries whose removal costs the corpus the least log-likelihood.

The algorithm is the default model type in SentencePiece and underlies T5, mT5, and ALBERT. Its distinguishing capability, unavailable in BPE or WordPiece, is that it produces a probability distribution over all valid segmentations of a string, not a single deterministic output. This enables a training-time technique called subword regularisation.

How it works

Training — the EM-style pruning loop

Initialisation. Seed the vocabulary with all single characters in the corpus plus the most frequent substrings up to some maximum length. A typical seed is 2–3× the target vocabulary size V. Every candidate entry is assigned an initial probability proportional to its raw frequency in the corpus.
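
A brute-force sketch of this seeding step, assuming a pre-tokenised word-frequency map; max_len and seed_size are arbitrary here, and production trainers such as SentencePiece use suffix arrays to keep substring counting tractable:

    import math
    from collections import Counter

    def seed_vocab(words, max_len=8, seed_size=60):
        """All single characters plus the most frequent multi-character
        substrings; initial probabilities proportional to raw frequency,
        returned as log-probabilities."""
        counts = Counter()
        for word, freq in words.items():
            for i in range(len(word)):
                for j in range(i + 1, min(i + max_len, len(word)) + 1):
                    counts[word[i:j]] += freq
        chars = {s for s in counts if len(s) == 1}          # always retained
        multi = [s for s, _ in counts.most_common() if len(s) > 1]
        vocab = chars | set(multi[:max(0, seed_size - len(chars))])
        total = sum(counts[s] for s in vocab)
        return {s: math.log(counts[s] / total) for s in vocab}

    # seed = seed_vocab({"▁low": 5, "▁lower": 3, "▁lowest": 2})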

The objective. Given a vocabulary with probabilities, the probability of a segmentation x = (x₁, x₂, …, xₘ) of a word w is:

P(x) = P(x₁) × P(x₂) × … × P(xₘ)

The total corpus log-likelihood under the current vocabulary is:

ℒ = Σ_w  freq(w) × log P(best segmentation of w)

Expectation step. For each word in the corpus, compute the best segmentation under the current vocabulary and probabilities, and accumulate the counts of each subword unit along it. (The full algorithm marginalises over all segmentations of each word with the forward–backward algorithm; counting only the best path is a common Viterbi-EM simplification.)

Maximisation step. Re-estimate each subword’s probability as its expected count divided by the total expected count.

Pruning step. For each vocabulary entry, compute how much ℒ would drop if that entry were removed. Discard the η% of entries whose removal hurts ℒ least (typically η = 10–20%), always retaining single characters to guarantee full coverage. Repeat from the expectation step.

This is an EM-style loop: the E and M steps refine probabilities given a fixed vocabulary; the pruning step shrinks the vocabulary; the loop repeats until the vocabulary reaches V.
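
A deliberately naive Python sketch of one round. It uses the Viterbi approximation throughout: expected counts come from each word's single best path rather than from forward–backward marginals, pieces that fall off every best path vanish at the M step, and the prune scores are recomputed from scratch, which is far too slow for real corpora. Here words maps each distinct word to its corpus frequency and logp maps each piece to its log-probability; all names are illustrative.

    import math
    from collections import Counter

    def viterbi(word, logp, max_len=10):
        """Best (highest log-probability) segmentation of `word` into pieces."""
        n = len(word)
        best = [0.0] + [-math.inf] * n   # best score of any segmentation of word[:j]
        back = [0] * (n + 1)             # start index of the last piece on that path
        for j in range(1, n + 1):
            for i in range(max(0, j - max_len), j):
                score = best[i] + logp.get(word[i:j], -math.inf)
                if score > best[j]:
                    best[j], back[j] = score, i
        pieces, j = [], n
        while j > 0:
            pieces.append(word[back[j]:j])
            j = back[j]
        return pieces[::-1], best[n]

    def corpus_ll(words, logp):
        """Corpus log-likelihood under best-path segmentations."""
        return sum(freq * viterbi(w, logp)[1] for w, freq in words.items())

    def em_prune_round(words, logp, eta=0.1):
        """One expectation/maximisation pass followed by one prune step."""
        # E step (Viterbi approximation): count pieces along each best path
        counts = Counter()
        for w, freq in words.items():
            for piece in viterbi(w, logp)[0]:
                counts[piece] += freq
        # M step: probability = expected count / total expected count
        total = sum(counts.values())
        logp = {s: math.log(counts[s] / total) for s in counts}
        # Prune step: drop the eta fraction of multi-character pieces whose
        # removal costs the least likelihood; single characters are exempt
        # so that every string stays segmentable
        base = corpus_ll(words, logp)
        loss = {s: base - corpus_ll(words, {k: v for k, v in logp.items() if k != s})
                for s in logp if len(s) > 1}
        for s, _ in sorted(loss.items(), key=lambda kv: kv[1])[:int(eta * len(loss))]:
            del logp[s]   # probabilities are renormalised by the next M step
        return logp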

[illustrate: EM pruning loop across four iterations — seed vocabulary of 60 entries shrinking to 40, 30, and finally 20 target entries; bar chart per iteration showing retained vs pruned subwords; corpus log-loss plotted as a line that rises slightly at each prune step then recovers after the EM re-estimation; single characters highlighted as always-retained anchors]

Inference — Viterbi segmentation

Given the trained vocabulary and its probabilities, segmenting a new string is a shortest-path problem on a lattice. Every substring of the input that exists in the vocabulary is a candidate edge. Viterbi finds the path from the start to the end of the string that maximises the product of edge probabilities — equivalently, minimises the sum of negative log-probabilities.

For the string "▁unhooked", Viterbi constructs a lattice of candidate paths:

Position: 0 ──── 3 ─────── 7 ──── 9
           ▁un   │  hook   │  ed
           ───────────────────────
           ▁unhooked (single edge, if present)

The winning path is whichever combination yields the highest joint probability. If ▁unhooked is in the vocabulary with high probability it wins outright; if not, ["▁un", "hook", "ed"] wins if that product beats alternatives. No rule list to consult, no fallback logic to write.

[illustrate: Viterbi lattice for "▁unhooked" — nodes at character positions 0 through 9; directed edges labelled with candidate subwords and their log-probabilities; the winning path ["▁un", "hook", "ed"] highlighted; a competing path ["▁u", "n", "hook", "e", "d"] shown faded with its lower cumulative score]
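
With log-probabilities invented purely to make the example concrete, the viterbi helper from the training sketch above recovers exactly this path:

    # Hypothetical log-probabilities, for illustration only
    logp = {"▁un": -3.0, "hook": -4.2, "ed": -2.5,
            "▁u": -5.0, "n": -6.1, "e": -5.8, "d": -5.5}

    pieces, score = viterbi("▁unhooked", logp)   # helper from the training sketch
    print(pieces, round(score, 1))               # ['▁un', 'hook', 'ed'] -9.7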

Subword regularisation — stochastic sampling

Because Unigram LM defines a full probability distribution over segmentations, it can sample from that distribution rather than always returning the single best path. During model training, each input sentence is re-tokenised stochastically at every epoch: common segmentations appear more often, but lower-probability paths are occasionally drawn.

This is subword regularisation (Kudo, 2018). The effect is similar to dropout: the model cannot rely on one fixed tokenisation, making it more robust to inputs that would be segmented differently at test time. A smoothing exponent α controls how sharply peaked the sampling distribution is: α = 1 samples proportionally to each segmentation's probability, α → ∞ collapses to deterministic Viterbi, and α → 0 approaches uniform sampling over candidates.
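
SentencePiece exposes both modes on the same encoder; the model file name below is a placeholder for a trained unigram model:

    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file="unigram.model")  # hypothetical path

    # Deterministic Viterbi segmentation
    print(sp.encode("unhooked", out_type=str))

    # Stochastic segmentation: nbest_size=-1 samples from the whole lattice;
    # alpha is the smoothing exponent (larger = closer to Viterbi, 0 = uniform)
    for _ in range(3):
        print(sp.encode("unhooked", out_type=str,
                        enable_sampling=True, alpha=0.1, nbest_size=-1))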

[illustrate: two stochastic samples of "▁unhooked" drawn from the segmentation distribution — sample A: ["▁un", "hook", "ed"]; sample B: ["▁u", "nhook", "ed"]; probability bars next to each; contrast with the deterministic Viterbi output shown above both]

Example

A Unigram LM vocabulary (simplified) with log-probabilities:

Subword    log P
▁lower     −2.1
▁low       −2.6
er         −3.4
▁lo        −4.0
w          −5.1

Segmenting "▁lower":

Path                  log P sum   Wins?
["▁lower"]            −2.1        yes
["▁low", "er"]        −6.0        no
["▁lo", "w", "er"]    −12.5       no

Viterbi selects ["▁lower"] as a single token because the whole-word entry dominates. For an unseen word "▁lowered" where no whole-word entry exists, the algorithm finds the best decomposition from available pieces automatically.
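
The sums in the table can be checked mechanically with a few lines of standalone Python:

    logp = {"▁lower": -2.1, "▁low": -2.6, "er": -3.4, "▁lo": -4.0, "w": -5.1}

    for path in (["▁lower"], ["▁low", "er"], ["▁lo", "w", "er"]):
        print(path, round(sum(logp[p] for p in path), 1))
    # ['▁lower'] -2.1            <- Viterbi winner
    # ['▁low', 'er'] -6.0
    # ['▁lo', 'w', 'er'] -12.5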

Variants and history

Kudo (2018) introduced Unigram LM in Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. The algorithm was motivated by the observation that a single deterministic tokenisation is a strong inductive bias that hurts robustness when test-time text differs from training text.

The EM-style training procedure is related to standard Expectation-Maximisation for mixture models, with the vocabulary acting as a discrete latent structure learnt jointly with the subword probabilities. Unlike standard EM it interleaves a hard pruning step, so it is a greedy search wrapped around EM rather than a pure EM procedure.

Comparison with BPE and WordPiece. BPE is deterministic and order-dependent at inference; it cannot natively produce multiple segmentations. WordPiece is also deterministic (greedy longest-match). Unigram LM is the only major subword algorithm that natively supports stochastic segmentation — its primary practical advantage.

When to use it

Prefer Unigram LM when:

  • Training a sequence-to-sequence model (translation, summarisation) where subword regularisation improves generalisation. Use model_type="unigram" in SentencePiece and call SampleEncode (encode(..., enable_sampling=True) in Python) during training; a minimal sketch follows this list.
  • Your text includes morphologically rich languages. The probabilistic objective tends to produce segmentations that align better with morpheme boundaries than frequency-driven merges.
  • You need robustness to spelling variation and domain shift — stochastic sampling acts as implicit data augmentation over the tokenisation space.
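
A minimal training-and-sampling round trip with the SentencePiece Python API; the corpus path and vocabulary size are placeholders:

    import sentencepiece as spm

    # Train a unigram model; paths and sizes are illustrative
    spm.SentencePieceTrainer.train(
        input="corpus.txt",
        model_prefix="unigram",
        vocab_size=8000,
        model_type="unigram",
    )

    sp = spm.SentencePieceProcessor(model_file="unigram.model")
    # During model training, draw a fresh segmentation per example
    print(sp.encode("lowered the boat", out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))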

Prefer BPE when:

  • Training speed is a priority. BPE training is faster than Unigram LM’s iterative EM loop on large corpora.
  • You need byte-level coverage guarantees. Byte-level BPE has no unknown-token risk; Unigram LM's character-level fallback is not byte-level by default (SentencePiece offers a byte_fallback option to close this gap).
  • You are fine-tuning an existing checkpoint. Use the tokeniser that ships with the checkpoint — mixing tokenisers silently corrupts model inputs.

See also