WordPiece

What it is

WordPiece is a subword tokenisation algorithm developed at Google. Like Byte Pair Encoding, it starts from individual characters and iteratively merges adjacent symbol pairs to grow a vocabulary of subword units. The key difference is in how it selects which pair to merge at each step: WordPiece chooses the merge that maximises the likelihood of the training corpus under a unigram language model, rather than the merge that simply occurs most often.

WordPiece was introduced by Schuster and Nakajima (2012) for Google’s Japanese and Korean voice search system. It became widely known as the tokeniser underlying BERT (Devlin et al., 2018) and is used by DistilBERT, ELECTRA, and most other models in the BERT family.

How it works

WordPiece runs in two phases: vocabulary training on a corpus, and encoding of new strings at inference time.

Phase 1 — Vocabulary training

Initialisation. Split every word in the corpus into individual characters. Prefix every character that is not word-initial with ## — the continuation marker. Count word frequencies. The initial vocabulary is the character alphabet, each character also appearing in its ##-prefixed form.

For the word "playing" appearing 40 times, the initial representation is:

p ##l ##a ##y ##i ##n ##g    × 40
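As a minimal sketch of this step (plain Python; the helper name split_word is ours, not part of any library):

def split_word(word):
    # First character stays bare; every later character gets the ## continuation marker.
    return [word[0]] + ["##" + ch for ch in word[1:]]

split_word("playing")
# → ['p', '##l', '##a', '##y', '##i', '##n', '##g']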

The merge criterion. At each iteration, WordPiece scores every adjacent symbol pair (A, B) by:

score(A, B) = count(AB) / (count(A) × count(B))

This is a pointwise mutual information (PMI)-style ratio. A pair scores highly when AB appears together far more often than the independent frequencies of A and B would predict — i.e., when merging them genuinely increases the corpus likelihood beyond what treating them independently already achieves. Raw frequency alone would prefer merging common but independent symbols; this criterion prefers merges where the co-occurrence is statistically concentrated.
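To see the effect, compare two hypothetical pairs (the counts below are invented for illustration): t and h are very frequent and so is the pair th, whereas q and ##u are rare but almost always adjacent.

# Invented symbol and pair counts from a toy corpus.
sym_count  = {"t": 900, "h": 800, "q": 12, "##u": 15}
pair_count = {("t", "h"): 120, ("q", "##u"): 12}

def score(a, b):
    return pair_count[(a, b)] / (sym_count[a] * sym_count[b])

score("t", "h")    # ≈ 0.00017  (frequent pair, but t and h mostly occur independently)
score("q", "##u")  # ≈ 0.067    (rare pair, but nearly every q is followed by ##u)

Raw pair frequency would pick (t, h) first; the likelihood-style score picks (q, ##u).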

The merge loop. Select the highest-scoring pair, create a new symbol by concatenating the two, replace every occurrence in the working corpus, and record the merge. Repeat until the vocabulary reaches the target size.

[illustrate: step-by-step WordPiece merge loop over a short corpus — table showing each iteration with the current word representations, the candidate pair scored by count(AB)/(count(A)×count(B)) highlighted alongside a competing high-frequency pair that scores lower, and the winning merge applied; vocabulary size counter incrementing]
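Putting initialisation, scoring, and merging together, a compact sketch of the training loop (plain Python; train_wordpiece is our own name, and real trainers additionally handle normalisation, special tokens such as [UNK], and efficiency):

from collections import Counter

def train_wordpiece(word_freqs, vocab_size):
    # word_freqs: mapping of whole words to corpus counts, e.g. {"playing": 40, ...}
    # Start from per-word symbol sequences with ## continuation markers.
    splits = {w: [w[0]] + ["##" + c for c in w[1:]] for w in word_freqs}
    vocab = {s for seq in splits.values() for s in seq}
    while len(vocab) < vocab_size:
        sym, pair = Counter(), Counter()
        for w, freq in word_freqs.items():
            seq = splits[w]
            for s in seq:
                sym[s] += freq
            for a, b in zip(seq, seq[1:]):
                pair[(a, b)] += freq
        if not pair:
            break
        # score(A, B) = count(AB) / (count(A) * count(B))
        a, b = max(pair, key=lambda p: pair[p] / (sym[p[0]] * sym[p[1]]))
        merged = a + (b[2:] if b.startswith("##") else b)
        vocab.add(merged)
        # Apply the winning merge everywhere it occurs.
        for w, seq in splits.items():
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            splits[w] = out
    return vocab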

Phase 2 — Encoding at inference (greedy longest-match)

At inference time, WordPiece does not replay merge rules in order (as BPE does). Instead it uses a greedy longest-match algorithm, sometimes called MaxMatch:

  1. Normalise and whitespace-split the input string into words.
  2. For each word, starting from the left:
    • Find the longest prefix of the remaining string that exists in the vocabulary.
    • Emit that token. If it is not the first token of the word, prefix it with ##.
    • Advance past the matched prefix and repeat from the remaining characters.
  3. If at any point no single character matches a vocabulary entry, emit [UNK] for the whole word.

This is a greedy, deterministic procedure — it always produces the same segmentation for a given input without needing to store or replay an ordered rule list.
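A minimal sketch of this encoder (plain Python; wordpiece_encode is our own helper, not the Hugging Face implementation):

def wordpiece_encode(word, vocab, unk="[UNK]"):
    tokens, start = [], 0
    while start < len(word):
        end, match = len(word), None
        # Shrink the candidate prefix until it is found in the vocabulary.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation marker for non-initial pieces
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # nothing matched, not even a single character
        tokens.append(match)
        start = end
    return tokens

vocab = {"un", "play", "##play", "##ed", "##ing"}
wordpiece_encode("unplayed", vocab)
# → ['un', '##play', '##ed']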

Example

BERT’s uncased vocabulary (30 522 entries) contains play and ##ing as distinct entries. Assume for this walkthrough that playing itself is not a whole-word entry; if it were, the greedy matcher would simply emit it as a single token.

Encoding "playing":

Step  Remaining             Longest match  Token emitted
1     playing               play           play
2     ing (continuation)    ##ing          ##ing

Output: ["play", "##ing"]

Encoding "unplayed":

Step  Remaining               Longest match  Token emitted
1     unplayed                un             un
2     played (continuation)   ##play         ##play
3     ed (continuation)       ##ed           ##ed

Output: ["un", "##play", "##ed"]

The ## prefix is a recoverable marker. To reconstruct the original word, strip ## from every continuation token and concatenate: un + play + ed → unplayed.
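In code, the reconstruction is a one-liner (the Hugging Face tokenisers expose a comparable operation as convert_tokens_to_string):

tokens = ["un", "##play", "##ed"]
"".join(t[2:] if t.startswith("##") else t for t in tokens)
# → 'unplayed'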

Variants and history

WordPiece was first described by Schuster and Nakajima (2012) for syllable-based segmentation of East Asian scripts. Google later re-applied it to multilingual subword modelling for BERT.

The ## continuation convention is specific to WordPiece. BPE uses an end-of-word marker (</w>) appended to the final character of each word instead. The two conventions are not interchangeable: a BERT checkpoint trained with WordPiece tokenisation must always be used with a WordPiece tokeniser, not a BPE one.

BERT’s vocabulary is fixed at 30 522 tokens. The WordPiece algorithm itself can be trained to any target size (just as BPE can), but BERT’s specific vocabulary was trained once on BookCorpus and English Wikipedia. Fine-tuning BERT does not change its tokeniser: the vocabulary is frozen and shipped with the checkpoint.

When to use it

In practice, you will not train WordPiece from scratch. You will use the tokeniser that ships with a BERT-family checkpoint via the Hugging Face transformers library:

from transformers import BertTokenizer

tokeniser = BertTokenizer.from_pretrained("bert-base-uncased")
tokens = tokeniser.tokenize("unplayed")
# → ["un", "##play", "##ed"]

Use the shipped WordPiece tokeniser when:

  • Fine-tuning any BERT-family model. The vocabulary is calibrated to the model’s embedding matrix; substituting a different tokeniser corrupts inputs without raising an error.
  • Building a classification, NER, or question-answering pipeline on top of BERT. The [CLS] and [SEP] special tokens are part of the WordPiece vocabulary and carry positional meaning within the model.
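For instance, calling the tokeniser directly wraps the input in those special tokens (standard transformers API; the subword split shown is the one claimed in the example above):

enc = tokeniser("unplayed")
tokeniser.convert_ids_to_tokens(enc["input_ids"])
# → ['[CLS]', 'un', '##play', '##ed', '[SEP]']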

Do not use WordPiece for full-text search indexing. WordPiece tokens are model-specific artefacts — ##play is not a meaningful IR term and will not align with stemming conventions, phrase queries, or stop-word lists.

Tradeoffs vs BPE:

  • The likelihood-maximisation criterion produces vocabularies that are qualitatively similar to BPE in practice. The most visible surface difference is the ## prefix convention, not the merge scoring.
  • WordPiece’s greedy longest-match encoder is simpler to implement at inference time than BPE’s ordered-rule replay, but is tied to the fixed vocabulary — it cannot be extended without retraining.
  • BPE (especially byte-level BPE) guarantees no [UNK] output; WordPiece falls back to [UNK] for any word containing a character absent from the vocabulary.
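To see the [UNK] fallback concretely, run the wordpiece_encode sketch from earlier with a vocabulary that is missing a character:

wordpiece_encode("naïve", {"na", "n", "##a", "##ive", "##ve", "##e"})
# 'ï' is not in the vocabulary, so no prefix of the remainder matches
# and the whole word collapses to ['[UNK]']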

See also