Tokeniser Vocabulary

What it is

A tokeniser vocabulary is the fixed set of subword units (tokens) a tokeniser uses to segment text. Modern vocabularies contain 32k–128k tokens, learned from training data via algorithms such as byte-pair encoding (BPE) or the unigram model implemented in SentencePiece. Vocabulary size represents a compression-flexibility tradeoff: larger vocabularies encode more words as single tokens but increase model size and computation.

[illustrate: Vocabulary spectrum from character-level (26 units) to word-level (millions) to subword (50k tokens); word segmentation showing how text is broken into vocabulary tokens]

How it works

  1. Vocabulary construction:

    • Start with characters or raw bytes
    • Apply a merge algorithm such as BPE: count frequencies of adjacent unit pairs across the corpus
    • Iterate: merge the most frequent pair into a new unit until the target vocabulary size is reached
    • Result: vocabulary with 32k–128k tokens
  2. Encoding:

    • Segment new text greedily, preferring the longest matching vocabulary unit at each position
    • Example: “unbelievable” → [“un”, “believ”, “able”] if those units are in the vocab
    • OOV (out-of-vocabulary) handling: fall back to smaller subwords, characters, or raw bytes
  3. Properties:

    • Fixed size: enables consistent model architecture
    • Language-aware: characters/subwords learned from language patterns
    • Compression: longer tokens shorten sequences, reducing compute per text (at the cost of a larger embedding matrix)
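The construction loop in step 1 can be sketched as a toy BPE trainer (a minimal illustration, not a production implementation; `train_bpe` and the tiny corpus are invented for this example):

```python
from collections import Counter

def train_bpe(corpus, target_vocab_size):
    """Toy BPE: repeatedly merge the most frequent adjacent pair of units."""
    # Each word starts as a sequence of characters, weighted by frequency.
    words = Counter(tuple(word) for word in corpus.split())
    vocab = {ch for word in words for ch in word}

    while len(vocab) < target_vocab_size:
        # Count adjacent unit pairs across the whole corpus.
        pairs = Counter()
        for word, freq in words.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # nothing left to merge
        # Merge the most frequent pair into a single new unit.
        (a, b), _ = pairs.most_common(1)[0]
        vocab.add(a + b)
        merged = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == (a, b):
                    out.append(a + b)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] += freq
        words = merged
    return vocab

print(sorted(train_bpe("low lower lowest low low", 8)))
```

On this five-word corpus the seven distinct characters plus one merge reach the target size of 8; a real run continues for tens of thousands of merges.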

Example

# Vocabulary snippet (subword units; 51,200 entries in total):
["a", "b", "c", ..., "the", "and", "cat", "dog",
 "un", "ing", "ly", "able", "tion", ...]

# Encoding text: "I am unable to believe this"
"I" → ID 456
"am" → ID 234
"unable" → IDs [789, 790] ("un", "able")
"to" → ID 567
"believe" → ID 891
"this" → ID 890

# Result: Token sequence [456, 234, 789, 790, 567, 891, 890]
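The greedy longest-match segmentation from step 2 can be sketched as follows (a WordPiece-style max-match shown for clarity; real BPE encoders instead replay the learned merges in order, and the helper name `greedy_encode` is invented here):

```python
def greedy_encode(text, vocab):
    """Segment text by taking the longest vocabulary match at each position,
    falling back to single characters for out-of-vocabulary spans."""
    tokens, i = [], 0
    while i < len(text):
        match = None
        # Try the longest candidate substring first.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                match = text[i:j]
                break
        if match is None:
            match = text[i]  # OOV fallback: emit the raw character
        tokens.append(match)
        i += len(match)
    return tokens

vocab = {"un", "believ", "able", "cat"}
print(greedy_encode("unbelievable", vocab))  # → ['un', 'believ', 'able']
```

A real tokeniser would then map each unit to its integer ID via a lookup table, as in the example above.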

Variants and history

Character-level models are simple but require many tokens per word. Word-level vocabularies don’t generalize to OOV words or morphology. Byte-pair encoding (Sennrich et al., 2016) became the standard subword approach. WordPiece (Schuster & Nakajima, 2012), later used in BERT (Devlin et al., 2018), merges by likelihood gain rather than raw frequency. SentencePiece (Kudo & Richardson, 2018) added language-agnostic, whitespace-free handling. Modern transformers use SentencePiece or BPE with 50k–256k tokens.

When to use it

Choose vocabulary size when:

  • Building a new tokeniser for a domain
  • Balancing model size (larger vocab = larger embeddings)
  • Optimizing for language coverage (rarer languages need larger vocab)
  • Considering inference speed (larger vocab = fewer tokens per text, so shorter sequences)

Standard sizes: 32k for efficient models, 50k–128k for well-rounded performance, 256k+ for maximum flexibility on diverse languages. Empirically, around 50k tokens is a common sweet spot for English-centric models.
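The model-size side of the tradeoff is easy to estimate: the input embedding matrix alone holds vocab_size × d_model parameters (toy arithmetic; d_model = 4096 is an assumed hidden size, and untied output embeddings double the figure):

```python
def embedding_params(vocab_size, d_model=4096):
    """Parameters in a vocab_size x d_model input embedding matrix."""
    return vocab_size * d_model

for v in (32_000, 50_000, 128_000, 256_000):
    print(f"{v:>7} tokens -> {embedding_params(v) / 1e6:.1f}M embedding parameters")
```

At 256k tokens the input embeddings alone approach a billion parameters, which is why vocabulary size is a genuine architectural cost rather than a free knob.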

See also