Subword Tokenisation

What it is

Subword tokenisation is a family of strategies that segment text at a granularity smaller than the word but larger than the individual character. Instead of treating "tokenisation" as a single atomic unit or breaking it into twelve characters, a subword tokeniser might produce ["token", "isation"] — two fragments that appear frequently enough in the training corpus to earn their own vocabulary entries.

Every major transformer model — BERT, GPT, T5, LLaMA — uses subword tokenisation. The reason is the out-of-vocabulary (OOV) problem.

How it works

The OOV problem

A word-level vocabulary is bounded. If you fix a vocabulary of 50 000 words and your corpus contains "hospitalisation", "dehospitalisation", and a dozen domain-specific coinages, most of them will be absent. The usual fix is an [UNK] token standing in for anything unknown, but this destroys information. A model that sees [UNK] cannot infer that the missing word shares morphology with "hospital" or the suffix "-isation".

Subword tokenisation removes the need for [UNK] entirely. Every string, however unusual, can be decomposed into fragments that exist in the vocabulary. In the worst case, each character is itself a vocabulary entry, so the fallback is always available without a special sentinel.
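
A minimal sketch of the contrast, using made-up vocabularies: the word-level lookup collapses an unseen word to [UNK], while greedy longest-match over a subword vocabulary that includes single characters always finds some segmentation.

```python
# Toy contrast between word-level lookup and a subword-style fallback.
# Both vocabularies are invented for illustration only.

word_vocab = {"the", "patient", "was", "admitted", "to", "hospital"}
subword_vocab = {"hospital", "isation", "de"} | set("abcdefghijklmnopqrstuvwxyz")

def word_level(tokens):
    # Anything outside the fixed word list collapses to [UNK].
    return [t if t in word_vocab else "[UNK]" for t in tokens]

def subword_greedy(word):
    # Greedy longest-match segmentation; single characters guarantee a fallback.
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in subword_vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # Character outside the toy vocabulary; a tokeniser with full
            # character or byte coverage never reaches this branch.
            pieces.append(word[i])
            i += 1
    return pieces

print(word_level(["the", "patient", "dehospitalisation"]))
# ['the', 'patient', '[UNK]']
print(subword_greedy("dehospitalisation"))
# ['de', 'hospital', 'isation']
```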

The three main algorithms

All three dominant approaches share the same goal — learn a vocabulary of subword units from a text corpus — but differ in how they select which units to keep.

Byte Pair Encoding (BPE) starts from individual characters and iteratively merges the most frequent adjacent symbol pair until a target vocabulary size is reached. Merge rules are replayed in training order at inference time. Used in GPT-2, GPT-3, GPT-4, LLaMA, and RoBERTa. See Byte Pair Encoding for the full algorithm.
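
A minimal sketch of the merge loop described above, on a toy corpus of word frequencies; the corpus and the number of merges are illustrative, not any model's actual settings.

```python
from collections import Counter

def get_pair_counts(corpus):
    # Count adjacent symbol pairs, weighted by word frequency.
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    # Rewrite every word with the chosen pair merged into one symbol.
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: each word starts as a tuple of characters.
corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6, tuple("wider"): 3}
merges = []
for _ in range(10):                      # stand-in for "until target vocab size"
    pairs = get_pair_counts(corpus)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)     # most frequent adjacent pair
    merges.append(best)
    corpus = merge_pair(corpus, best)

print(merges[0])   # ('e', 'r') is the first merge on this toy corpus
print(merges)      # merge rules, stored in training order for replay at inference
```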

WordPiece uses the same bottom-up merge strategy as BPE but selects each merge by maximising the likelihood of the training corpus under a unigram language model, rather than by raw pair frequency. Continuation fragments are prefixed with ## to distinguish them from word-initial tokens (play vs ##ing). Used in BERT and its derivatives.
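
A sketch of how a WordPiece vocabulary is applied at inference time: greedy longest-match-first over an invented toy vocabulary, with the ## prefix marking continuation pieces.

```python
# Toy WordPiece-style inference. The vocabulary is made up for illustration;
# real models ship their own (much larger) vocabularies.
vocab = ({"un", "happiness", "##happiness", "play", "##ing"}
         | set("abcdefghijklmnopqrstuvwxyz")
         | {"##" + c for c in "abcdefghijklmnopqrstuvwxyz"})

def wordpiece(word, unk="[UNK]"):
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece       # continuation pieces carry the prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]                   # whole word falls back to [UNK]
        pieces.append(match)
        start = end
    return pieces

print(wordpiece("unhappiness"))   # ['un', '##happiness']
print(wordpiece("playing"))       # ['play', '##ing']
```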

Unigram Language Model (Unigram LM) works top-down: begin with a large over-complete vocabulary, then iteratively prune the entries whose removal causes the least increase in corpus loss. The result supports a probability distribution over multiple valid segmentations of the same string. It is implemented in the SentencePiece library and underlies T5 and ALBERT; earlier LLaMA variants also use SentencePiece, but in its BPE mode rather than Unigram.
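
A sketch of inference under a unigram vocabulary, with invented piece probabilities: a simple Viterbi search picks the segmentation with the highest total log-probability among all valid decompositions.

```python
import math

# Invented piece probabilities for illustration only.
piece_logp = {p: math.log(pr) for p, pr in {
    "un": 0.05, "happiness": 0.02, "happy": 0.03, "ness": 0.04,
    "u": 0.01, "n": 0.01, "h": 0.01, "a": 0.01, "p": 0.01,
    "i": 0.01, "e": 0.01, "s": 0.01,
}.items()}

def viterbi_segment(text):
    # best[i] = (score, segmentation) for the prefix text[:i]
    best = [(-math.inf, [])] * (len(text) + 1)
    best[0] = (0.0, [])
    for end in range(1, len(text) + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in piece_logp and best[start][0] > -math.inf:
                score = best[start][0] + piece_logp[piece]
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [piece])
    return best[-1][1]

print(viterbi_segment("unhappiness"))
# ['un', 'happiness'] — scores higher than any character-level decomposition
```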

Vocabulary size as a hyperparameter

Every subword tokeniser exposes a vocabulary size V that must be chosen before training.

Model        Vocabulary size
BERT-base    30 522
GPT-2        50 257
LLaMA 2      32 000
T5           32 100

A larger vocabulary means more whole words are represented directly, producing shorter sequences. A smaller vocabulary pushes more words into multi-fragment representations: longer sequences, but fewer embedding parameters. Self-attention cost scales quadratically with sequence length, so shorter sequences are faster; but a vocabulary that is too large spreads probability mass thinly over rare entries, which are then poorly learnt.
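
A back-of-envelope sketch of the trade-off; the hidden size and the average-fertility figures below are assumptions chosen only to show the direction of the effect.

```python
# Rough trade-off between vocabulary size and sequence length.
# d_model and the fertility figures are illustrative assumptions.

d_model = 768                      # hidden size, roughly BERT-base scale

for vocab_size, avg_fertility in [(8_000, 1.9), (32_000, 1.4), (100_000, 1.2)]:
    embedding_params = vocab_size * d_model
    # Self-attention cost grows with the square of sequence length,
    # and sequence length scales with fertility for the same text.
    relative_attention_cost = avg_fertility ** 2
    print(f"V={vocab_size:>7,}  embedding params={embedding_params:>11,}  "
          f"relative attention cost={relative_attention_cost:.2f}x")
```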

The fertility problem

Fertility is the number of subword tokens a word produces. A word with fertility 1 maps to a single token; a word with fertility 5 requires five tokens. High-fertility inputs are a known pathology of fixed subword vocabularies.

The problem is most acute for morphologically rich languages. Finnish, Turkish, and Hungarian agglutinate meaning into long compound words. A vocabulary trained predominantly on English will segment these into many small, semantically opaque fragments, inflating sequence length and degrading model performance.

High fertility also arises for rare proper nouns, technical jargon, and code identifiers. A vocabulary of 32 000 general-purpose tokens will segment "deserialization" into three or four pieces; a domain-specific vocabulary trained on software corpora might keep it whole.
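
Fertility is straightforward to measure: count subword pieces per whitespace word and average. A sketch, where `tokenise` stands in for whatever subword tokeniser is being evaluated.

```python
def average_fertility(text, tokenise):
    # Average number of subword pieces per whitespace-separated word.
    words = text.split()
    total_pieces = sum(len(tokenise(w)) for w in words)
    return total_pieces / len(words) if words else 0.0

# Trivial stand-in tokeniser that cuts every 4 characters, for demonstration.
toy_tokenise = lambda w: [w[i:i + 4] for i in range(0, len(w), 4)]
print(average_fertility("short words versus hospitalisation", toy_tokenise))  # 2.5
```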

Example

Vocabulary trained on general English. Input: "unhappiness".

Algorithm     Tokens                                            Fertility
Word-level    [UNK]                                             1 (information lost)
BPE           ["un", "happiness"]                               2
WordPiece     ["un", "##happiness"]                             2
Character     ["u","n","h","a","p","p","i","n","e","s","s"]     11

Both BPE and WordPiece recover the prefix un- as a productive morpheme without any explicit linguistic rules — it emerges from corpus statistics.

Variants and history

The modern resurgence of subword tokenisation began with Sennrich, Haddow, and Birch (2016), who adapted BPE to reduce unknown-word rates in neural machine translation. WordPiece is older, originating in Google's voice-search and translation systems (Schuster and Nakajima, 2012; Wu et al., 2016), and was later popularised by BERT (Devlin et al., 2018). The SentencePiece library (Kudo and Richardson, 2018) made both BPE and Unigram LM available as language-agnostic tools operating on raw Unicode, removing the dependency on whitespace pre-tokenisation and making the tokeniser fully invertible.

When to use it

Subword tokenisation is the correct choice when training or fine-tuning transformer models. It is the wrong choice for classic IR pipelines.

Use it when:

  • Training a language model that must handle open-vocabulary text — neologisms, misspellings, code, multilingual input.
  • Fine-tuning a pretrained checkpoint: use the tokeniser that ships with the checkpoint. It is calibrated to the model’s embedding matrix; substituting a different tokeniser corrupts model inputs silently (see the sketch after this list).
  • Your text contains morphologically complex languages. SentencePiece with a multilingual vocabulary (mBERT, XLM-R) degrades far more gracefully than word-level tokenisation across typologically diverse languages.
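
One common way to keep tokeniser and checkpoint paired is to load both from the same identifier, as in the Hugging Face transformers sketch below; the checkpoint name is only an example.

```python
# Load tokeniser and model from the same checkpoint so the token IDs
# line up with the embedding matrix they were trained against.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "bert-base-uncased"                      # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)  # the vocab the model was trained with
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

batch = tokenizer(["subword tokenisation handles unseen words"],
                  padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch)
```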

Avoid it when:

  • Building a full-text search index. Subword tokens are model artefacts, not IR terms. They do not align with stemming conventions, stop-word lists, or phrase queries.
  • Interpretability matters at the token level. High-fertility splits obscure morphological and semantic units that are obvious to humans.

Vocabulary size guidance: 32 000–50 000 is a reasonable default for monolingual general-purpose models. Multilingual models benefit from larger vocabularies (100 000+). Domain-specific models can use smaller vocabularies (8 000–16 000) if the text is constrained.
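
As a sketch of putting this guidance into practice, the SentencePiece library takes the vocabulary size as an explicit training argument; the corpus path, model prefix, and sizes below are placeholders.

```python
# Train a domain-specific tokeniser with a small, explicit vocabulary size.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",          # one sentence per line (placeholder path)
    model_prefix="domain_tok",   # writes domain_tok.model / domain_tok.vocab
    vocab_size=16_000,           # smaller vocabulary for constrained domain text
    model_type="unigram",        # or "bpe"
)

sp = spm.SentencePieceProcessor(model_file="domain_tok.model")
print(sp.encode("dehospitalisation", out_type=str))
```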

See also