SentencePiece

What it is

SentencePiece is a subword tokenisation library developed by Kudo and Richardson (2018) at Google. It differs from Byte Pair Encoding and WordPiece in one fundamental way: it requires no pre-tokenisation step. Where BPE and WordPiece split the input into whitespace-delimited words first, then segment those words into subwords, SentencePiece treats the raw Unicode character stream — spaces included — as its input. Whitespace is encoded as the special prefix symbol ▁ (U+2581, LOWER ONE EIGHTH BLOCK), which becomes a first-class part of the vocabulary rather than a delimiter that is silently discarded.

The practical consequences are significant. Because whitespace is preserved as a symbol, the tokenisation is fully reversible: the original string can be recovered exactly from the token sequence by concatenating the tokens and replacing ▁ with a space character. No information is lost at the tokenisation boundary.
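
A minimal sketch of that round trip, using an illustrative token sequence — the replace-then-strip logic is essentially all that detokenisation requires:

tokens = ["▁lower", "▁temp", "er", "ature"]

# Detokenise: concatenate, map ▁ back to a space, and drop the
# leading space produced by the word-initial ▁.
text = "".join(tokens).replace("▁", " ").lstrip()
assert text == "lower temperature"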

SentencePiece is used by T5, mT5, ALBERT, LLaMA (1 and 2), XLM-R, and many other multilingual and sequence-to-sequence models.

How it works

SentencePiece is a library, not a single algorithm. When training a model, you choose one of two underlying algorithms:

BPE inside SentencePiece

The merge procedure is the same as standard Byte Pair Encoding: begin from individual characters, iteratively merge the most frequent adjacent symbol pair, stop when the vocabulary reaches the target size. The trained artefact is an ordered list of merge rules plus the resulting vocabulary.

The only structural difference from text-preprocessing BPE is that the input to the merge loop is the raw character stream, not a pre-split word list. The ▁ symbol appears alongside ordinary characters from the first iteration and can itself participate in merges. Common words preceded by a space — ▁the, ▁and, ▁is — typically survive to the final vocabulary as whole tokens because they are highly frequent.
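
A toy sketch of that merge loop over the raw stream (illustrative only; the real trainer is an optimised C++ implementation that counts pairs across a whole corpus):

from collections import Counter

def train_bpe(text, num_merges):
    # Replace spaces with ▁ and start from individual characters.
    symbols = list(text.replace(" ", "▁"))
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        # Replay the winning merge over the stream.
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return merges, symbols

merges, symbols = train_bpe("the cat and the hat", 10)
# ▁ participates in merges like any character, so tokens such as "▁the" can emerge.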

Unigram Language Model inside SentencePiece

The Unigram LM algorithm works in the opposite direction from BPE. It begins with a large seed vocabulary (typically 2–3× the target size, seeded with all character unigrams and the most frequent substrings in the corpus) and iteratively prunes it. At each pruning step it removes the entries whose deletion increases total corpus log-loss the least, until the vocabulary reaches the target size.
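
A heavily simplified sketch of one pruning step, where viterbi_score is a hypothetical helper returning the log-probability of a string's best segmentation (the real trainer interleaves EM re-estimation of token probabilities with pruning and evaluates losses over a lattice rather than by this brute-force rescoring):

def prune_step(corpus, logp, drop_fraction=0.2):
    # logp: token -> log-probability; corpus: list of strings.
    def total(vocab):
        return sum(viterbi_score(s, vocab) for s in corpus)
    base = total(logp)
    damage = {}
    for tok in logp:
        if len(tok) == 1:
            continue                          # single characters are never pruned
        without = {t: p for t, p in logp.items() if t != tok}
        damage[tok] = base - total(without)   # log-loss increase if tok is removed
    n_drop = int(drop_fraction * len(damage))
    doomed = set(sorted(damage, key=damage.get)[:n_drop])
    return {t: p for t, p in logp.items() if t not in doomed}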

Because the algorithm assigns each segmentation a probability (the product of its tokens' unigram probabilities), it can compute the most probable segmentation of any string, and it can also sample from the distribution over valid segmentations, which is useful during training as a form of data augmentation.
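
Both modes are exposed through the Python API of a trained Unigram model. This assumes the mymodel.model file trained in the Example section below; alpha and nbest_size control the sampling sharpness and the hypothesis pool:

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="mymodel.model")

# Deterministic: the single most probable segmentation.
sp.encode("lower temperature", out_type=str)

# Stochastic: sample a segmentation from the lattice.
# nbest_size=-1 samples over all hypotheses; smaller alpha flattens
# the distribution, yielding more varied segmentations.
sp.encode("lower temperature", out_type=str,
          enable_sampling=True, alpha=0.1, nbest_size=-1)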

[illustrate: Unigram LM pruning — a large seed vocabulary shrinking over several iterations, with bars representing retained and pruned entries; corpus log-loss shown increasing slightly with each prune step, targeting a final vocabulary of N entries]

Encoding at inference

Once trained, SentencePiece encodes a string using the Viterbi algorithm over the vocabulary to find the segmentation with the maximum probability (Unigram LM) or by replaying merge rules in order (BPE). The output is a sequence of subword tokens drawn from the learned vocabulary.
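
A toy sketch of the Viterbi search for the Unigram case, over a hypothetical vocabulary with made-up log-probabilities (not the library's implementation):

import math

# Hypothetical unigram log-probabilities.
logp = {"▁lower": -3.0, "▁low": -5.0, "er": -2.5,
        "▁": -6.0, "l": -8.0, "o": -8.0, "w": -8.0, "e": -8.0, "r": -8.0}

def viterbi(s, logp, max_len=10):
    # best[i] = (log-prob of the best segmentation of s[:i], backpointer)
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(s)
    for i in range(1, len(s) + 1):
        for j in range(max(0, i - max_len), i):
            piece = s[j:i]
            score = best[j][0] + logp.get(piece, -math.inf)
            if score > best[i][0]:
                best[i] = (score, j)
    # Walk the backpointers to recover the token sequence.
    pieces, i = [], len(s)
    while i > 0:
        j = best[i][1]
        pieces.append(s[j:i])
        i = j
    return pieces[::-1]

viterbi("▁lower", logp)   # → ["▁lower"]: -3.0 beats ["▁low", "er"] at -7.5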

Example

Training a SentencePiece model and encoding a string using the Python API:

import sentencepiece as spm

# Train on a plain-text file, 8 000-token vocabulary, Unigram LM algorithm
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="mymodel",
    vocab_size=8000,
    model_type="unigram",   # or "bpe"
)

sp = spm.SentencePieceProcessor(model_file="mymodel.model")

tokens = sp.encode("lower temperature", out_type=str)
# → ["▁lower", "▁temp", "er", "ature"]

ids = sp.encode("lower temperature")
# → [142, 521, 89, 304]  (integer IDs; exact values depend on the trained vocab)

reconstructed = sp.decode(tokens)
# → "lower temperature"   (exact original string, spaces restored)

The decode call is lossless: ▁ carries the positional information needed to restore the original spacing exactly, with no heuristics required.

Variants and history

Origin. Kudo and Richardson presented SentencePiece at EMNLP 2018. The stated motivation was reproducibility: all prior subword systems depended on language-specific pre-tokenisers (Moses tokeniser for European languages, KyTea or MeCab for Japanese), making results difficult to compare across languages. Operating on raw Unicode removed that dependency entirely.

Unigram LM. Introduced by Kudo in a companion 2018 paper, its probabilistic framing makes it the only major subword method that natively supports subword regularisation: at training time, the model is exposed to multiple stochastic segmentations of the same string rather than a single deterministic one, improving robustness to out-of-distribution text.

Models that use SentencePiece.

Model          Algorithm     Vocabulary size
T5 / mT5       Unigram LM     32 100
ALBERT         Unigram LM     30 000
LLaMA 1 & 2    BPE            32 000
XLM-R          BPE           250 000

When to use it

SentencePiece is the right choice when you need a subword tokeniser that is reproducible across languages and fully reversible.

Use SentencePiece when:

  • Training a multilingual or cross-lingual model. The absence of a pre-tokeniser means SentencePiece handles scripts without whitespace word boundaries (Thai, Japanese, Chinese) the same way it handles English.
  • You need exact string reconstruction from tokens. The ▁ encoding guarantees this; WordPiece and standard BPE require stripping markers and re-joining, which can silently fail on non-standard whitespace.
  • You want subword regularisation during training. Use model_type="unigram" and sample at encode time (SampleEncode, or encode with enable_sampling=True, as sketched in the Unigram section above) to draw stochastic segmentations per batch.
  • Fine-tuning a T5, LLaMA, or ALBERT checkpoint. These models ship with a SentencePiece .model file; use it as-is. Do not substitute a different tokeniser.

Tradeoffs:

  • SentencePiece vocabularies include ▁-prefixed tokens as distinct entries. ▁lower and lower are different IDs — the same string encodes differently depending on whether it appears at word-start or mid-word, which is correct behaviour but can surprise users inspecting vocabulary files directly (see the check after this list).
  • Unigram LM training is slower than BPE training on large corpora. BPE is the better choice if training speed matters more than probabilistic segmentation.
  • Like all subword tokenisers, a SentencePiece model trained on a general corpus will produce high-fertility segmentations for rare technical terms. Train or fine-tune the tokeniser on domain data if fertility is a concern.
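
A quick check of the first tradeoff, reusing the sp processor from the Example section (IDs are illustrative):

sp.piece_to_id("▁lower")   # word-initial form, e.g. 142
sp.piece_to_id("lower")    # mid-word form: a different ID, or the <unk> ID
                           # if that piece never survived training

sp.encode("lowercase", out_type=str)
# mid-word pieces carry no ▁ prefix, e.g. ["▁lower", "case"]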

See also