Byte Pair Encoding
What it is
Byte pair encoding (BPE) is a subword segmentation algorithm. Given a text corpus, it learns a fixed vocabulary of subword units — fragments that sit between individual characters and whole words — by repeatedly merging the most frequent pair of adjacent symbols in the training data. At inference time, the same merge rules are applied in order to encode any input string into a sequence of tokens drawn from that vocabulary.
BPE originated as a lossless data-compression scheme (Gage, 1994). Sennrich et al. adapted it for neural machine translation in 2016, and it became the tokenisation algorithm underlying GPT-2, GPT-3, GPT-4, LLaMA, and many other large language models.
The motivation is practical: a fixed word-level vocabulary cannot represent words it has never seen. A character-level vocabulary has no such gap but produces very long sequences. BPE occupies the middle ground — a compact, learnable vocabulary that handles rare and out-of-vocabulary words by decomposing them into known fragments.
How it works
BPE runs in two distinct phases: vocabulary training on a corpus, and encoding of individual strings at inference time.
Phase 1 — Vocabulary training
Initialisation. Split every word in the corpus into individual characters, inserting a special end-of-word marker (commonly </w>) after the final character. Count how many times each word appears. The initial symbol inventory is the character alphabet plus </w>.
For a tiny corpus where "low" appears 5 times, "lower" 2 times, "newest" 6 times, and "widest" 3 times:
l o w </w> × 5
l o w e r </w> × 2
n e w e s t </w> × 6
w i d e s t </w> × 3
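The initialisation step is small enough to sketch directly in Python. The sketch below assumes the toy word counts above; the space-separated-string representation of each symbol sequence is just a convenient encoding, not part of the algorithm itself.

```python
# Toy corpus word counts from the example above.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}

# Initialisation: split each word into characters and append the end-of-word
# marker </w>. Symbols are stored as a space-separated string for readability.
vocab = {" ".join(word) + " </w>": count for word, count in corpus.items()}

print(vocab)
# {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
```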
The merge loop. Count all adjacent symbol pairs across all word occurrences. Find the pair with the highest count. Create a new symbol by concatenating the pair. Replace every occurrence of that pair with the new symbol. Repeat for a fixed number of iterations V — the target vocabulary size minus the initial alphabet size.
Iteration 1: the pair (e, s) appears in newest (×6) and widest (×3) — total 9. Merge → es.
l o w </w> × 5
l o w e r </w> × 2
n e w es t </w> × 6
w i d es t </w> × 3
Iteration 2: (es, t) — total 9. Merge → est.
l o w </w> × 5
l o w e r </w> × 2
n e w est </w> × 6
w i d est </w> × 3
Iteration 3: (est, </w>), appearing in newest (×6) and widest (×3), totals 9 and beats (l, o) at 7. Merge → est</w>.
…and so on. Each iteration records one merge rule: an ordered pair (A, B) → AB. After V merges the training is complete. The final vocabulary is the initial character alphabet plus every merged symbol produced.
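A minimal sketch of the merge loop, close in spirit to the reference implementation accompanying Sennrich et al. (2016). Here num_merges stands in for V, and the space-separated strings reuse the representation from the initialisation sketch above; ties between equally frequent pairs are broken by whichever pair was counted first, so a real implementation may pick a different but equally valid winner.

```python
import collections
import re

# Character-split corpus from the initialisation step.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

def get_pair_counts(vocab):
    """Count every adjacent symbol pair across all word occurrences."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every adjacent occurrence of `pair` with the concatenated symbol."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

merges = []        # ordered list of learned merge rules
num_merges = 8     # stands in for V (target vocabulary size minus alphabet size)
for _ in range(num_merges):
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    merges.append(best)

print(merges[:3])
# e.g. [('e', 's'), ('es', 't'), ('est', '</w>')]
```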
[illustrate: step-by-step merge loop over the four-word corpus above — table showing each iteration with the current word representations, the winning pair highlighted in each word, its count, and the resulting new symbol; vocabulary size counter incrementing at each step]
Byte-level BPE. GPT-2 and its descendants use a byte-level variant: the initial alphabet is the 256 possible byte values (not Unicode characters), which guarantees that any input string — regardless of encoding or language — can be represented without an unknown-token fallback. Merges then operate on byte sequences rather than characters.
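To make the byte-level idea concrete, here is a rough illustration (not GPT-2's actual tokeniser): the initial symbols are the raw UTF-8 bytes of the input, so even characters outside any training alphabet still have a representation before a single merge is applied.

```python
# Byte-level initialisation: the symbol inventory is the 256 byte values,
# so any string can be represented without an unknown-token fallback.
text = "café 😀"
initial_symbols = [bytes([b]) for b in text.encode("utf-8")]

print(initial_symbols)
# [b'c', b'a', b'f', b'\xc3', b'\xa9', b' ', b'\xf0', b'\x9f', b'\x98', b'\x80']
```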
Phase 2 — Encoding at inference
Given the ordered list of merge rules learned during training, encode a new string as follows:
- Normalise and split the string into individual characters (or bytes); append </w> to each word.
- Apply merge rules in the order they were learned, from first to last. For each rule (A, B): scan the current symbol sequence and replace every adjacent A B with AB.
- The resulting symbol sequence is the token output.
Order matters. A pair that was merged early in training may itself be part of a later merge rule; applying rules out of order would produce a different — and incorrect — segmentation.
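The encoding phase can be sketched directly from the steps above. The toy function below assumes one pre-split word at a time and replays the merge list in training order; it is a sketch of the procedure described here, not any particular library's implementation.

```python
def bpe_encode(word, merges):
    """Encode a single word by replaying learned merge rules in training order."""
    symbols = list(word) + ["</w>"]          # start from characters plus the end-of-word marker
    for a, b in merges:                      # order matters: first-learned rule first
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]   # replace the adjacent pair with the merged symbol
            else:
                i += 1
    return symbols
```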
Example
Suppose training on the toy corpus above has produced the following merge rules (in order; ties broken by first occurrence):
1. (e, s) → es
2. (es, t) → est
3. (est, </w>) → est</w>
4. (l, o) → lo
5. (lo, w) → low
6. (n, e) → ne
7. (ne, w) → new
8. (new, est</w>) → newest</w>
Encoding "lowest":
| Step | Sequence | Rule applied |
|---|---|---|
| Init | l o w e s t </w> | — |
| 1 | l o w es t </w> | (e, s) → es |
| 2 | l o w est </w> | (es, t) → est |
| 3 | l o w est</w> | (est, </w>) → est</w> |
| 4 | lo w est</w> | (l, o) → lo |
| 5 | low est</w> | (lo, w) → low |
| — | — | Rules 6–8 don't match |
Output tokens: ["low", "est</w>"]
Encoding "lowest" yields a split the model has never explicitly seen as a whole word, but both fragments appear in the vocabulary as productive units. This is how BPE handles rare and out-of-vocabulary words: it decomposes them into whatever familiar pieces the merge rules support, falling back to individual characters if necessary.
Variants and history
Origin. Philip Gage described the original compression algorithm in a 1994 issue of C Users Journal. The idea is simple: find the most common byte pair in a file, replace it with an unused byte, record the substitution, and repeat. Sennrich, Haddow, and Birch (2016) applied the same inductive principle to text tokenisation for neural MT, replacing “unused byte” with “new vocabulary entry”.
WordPiece. Google’s WordPiece (used in BERT) applies the same iterative-merge principle but selects merges by maximising the likelihood of the training data under a language model, rather than by raw pair frequency. In practice the vocabularies produced are similar; the key surface difference is that WordPiece marks the start of a continuation fragment with ## (e.g. play, ##ing) rather than marking end-of-word.
Unigram LM. SentencePiece’s Unigram tokeniser (used in T5, ALBERT, and some LLaMA variants) takes the opposite approach: start with a large vocabulary and iteratively prune the entries that least reduce corpus likelihood, rather than building up from characters. This tends to produce probabilistic segmentations and supports multiple valid tokenisations of the same string.
SentencePiece. A library by Kudo and Richardson (2018) that implements both BPE and Unigram LM. SentencePiece operates directly on raw Unicode without a pre-tokenisation step, making it language-agnostic. It also treats whitespace as an explicit symbol (▁), meaning the tokeniser is fully reversible — the original string can be reconstructed exactly from the token sequence.
When to use it
BPE is the right choice when training or fine-tuning a transformer model that needs to handle a broad, open vocabulary, including misspellings, code identifiers, URLs, and non-English text.
Use BPE (or its relatives) when:
- You are building or adapting a language model and need a fixed vocabulary that scales predictably. Choose a vocabulary size between 32k and 100k for general-purpose models; smaller for domain-specific work with constrained text (see the training sketch after this list).
- Your corpus contains morphologically complex languages, technical jargon, or code. BPE will learn productive suffixes and prefixes (-tion, -ing, un-) as first-class tokens.
- You need guaranteed coverage with no unknown-token fallback: use byte-level BPE.
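In practice the tokeniser is usually trained with an off-the-shelf library rather than by hand. The sketch below uses the Hugging Face tokenizers library to train a byte-level BPE vocabulary; the corpus.txt path, the 32,000-entry vocabulary, and the <|endoftext|> special token are illustrative assumptions, not prescribed values.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.trainers import BpeTrainer

# Byte-level BPE: the 256 byte values form the initial alphabet, so no
# unknown-token fallback is needed.
tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = ByteLevel(add_prefix_space=False)

trainer = BpeTrainer(
    vocab_size=32_000,                      # illustrative general-purpose size
    special_tokens=["<|endoftext|>"],       # illustrative special token
    initial_alphabet=ByteLevel.alphabet(),  # seed the vocabulary with all byte symbols
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # hypothetical corpus path
tokenizer.save("bpe-32k.json")

print(tokenizer.encode("unfathomable").tokens)
```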
Do not use BPE for full-text search indexing. Search engines (Lucene, OpenSearch, Solr) work with term-based inverted indexes. BPE tokens are not terms in the information-retrieval sense — they are model-specific artefacts. A BPE vocabulary trained on a language model corpus will not align with query intent, phrase matching, or stemming conventions. Use a standard analysis chain instead.
Tradeoffs:
- BPE vocabularies are learned once and then fixed. Adding new domains with novel terminology requires retraining the tokeniser, not just the model.
- Subword tokens are not always linguistically meaningful. BPE may split "unfortunately" as ["un", "fort", "un", "ately"], correct in terms of reconstruction but opaque to humans inspecting model behaviour.
- Tokenisation is model-specific. The tokeniser trained with a checkpoint must be used with that checkpoint; mixing tokenisers silently corrupts model inputs.