Tokeniser Vocabulary
What it is
A tokeniser vocabulary is the fixed set of subword units (tokens) a tokeniser uses to break text into discrete units. Modern vocabularies typically contain 32k–128k tokens, learned from training data by algorithms such as byte-pair encoding (BPE) or the unigram model used in SentencePiece. Vocabulary size represents a compression-flexibility tradeoff: larger vocabularies cover more words as single tokens but enlarge the embedding table and output layer.
[illustrate: Vocabulary spectrum from character-level (26 units) to word-level (millions) to subword (50k tokens); word segmentation showing how text is broken into vocabulary tokens]
How it works
Vocabulary construction:
- Start with characters or raw bytes
- Apply BPE or SentencePiece to merge frequent units
- Iterate: keep merging the most frequent pair until the target vocabulary size is reached (see the sketch after this list)
- Result: vocabulary with 32k–128k tokens
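A minimal sketch of this merge loop in Python (a toy implementation for illustration; real BPE trainers also record the merge order and run far more efficiently):

from collections import Counter

def learn_bpe(corpus_words, target_vocab_size):
    # Represent each word as a tuple of single-character symbols, with counts.
    words = Counter()
    for word in corpus_words:
        words[tuple(word)] += 1
    # Initial vocabulary: the individual characters seen in the corpus.
    vocab = {ch for word in words for ch in word}
    while len(vocab) < target_vocab_size:
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Merge the single most frequent pair into one new symbol.
        (a, b), _ = pairs.most_common(1)[0]
        merged = a + b
        vocab.add(merged)
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return vocab

# Toy run: frequent pairs like "l"+"o" and "lo"+"w" are merged first.
print(learn_bpe(["low", "low", "lower", "lowest"], target_vocab_size=10))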
Encoding:
- For new text, greedily segment each word into the longest matching vocabulary entries
- Example: “unbelievable” → [“un”, “believ”, “able”] if those pieces are in the vocabulary
- OOV (out-of-vocabulary) handling: fall back to smaller subwords, characters, or bytes (see the sketch after this list)
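A minimal sketch of greedy longest-match segmentation with a fallback (illustrative Python; production tokenisers instead replay the learned merge rules or use a unigram model, but the effect is similar):

def segment(word, vocab, unk="<unk>"):
    # Greedy longest-match: repeatedly take the longest prefix of the
    # remaining text that appears in the vocabulary.
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # Fallback: not even the single character is in the vocabulary.
            tokens.append(unk)
            i += 1
    return tokens

vocab = {"un", "believ", "able", "b", "l", "e"}
print(segment("unbelievable", vocab))  # ['un', 'believ', 'able']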
Properties:
- Fixed size: the embedding matrix and output layer keep a constant shape
- Language-aware: subword units reflect the character and word patterns of the training text
- Compression: longer tokens mean fewer tokens per text, shortening sequences and reducing computation
Example
# Vocabulary snippet (subword units):
["a", "b", "c", ..., "the", "and", "cat", "dog",
 "un", "ing", "ly", "able", "tion", ...]   # 51,200 entries in total
# Encoding text: "I am unable to believe this"
"I"       → ID 456
"am"      → ID 234
"unable"  → IDs [789, 1012]  (split into "un" + "able")
"to"      → ID 567
"believe" → ID 891
"this"    → ID 890
# Result: token sequence [456, 234, 789, 1012, 567, 891, 890]
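The IDs above are illustrative. To inspect a real learned vocabulary, a library such as Hugging Face transformers can be used (assuming it is installed; the exact splits and IDs depend on the model's learned vocabulary):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")       # GPT-2's byte-level BPE
print(tok.vocab_size)                             # 50257
print(tok.tokenize("I am unable to believe this"))
# e.g. ['I', 'Ġam', 'Ġunable', 'Ġto', 'Ġbelieve', 'Ġthis']  (Ġ marks a leading space)
print(tok.encode("I am unable to believe this"))  # the corresponding integer IDs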
Variants and history
Character-level models are simple but need many tokens per word. Word-level vocabularies cannot handle out-of-vocabulary words or morphological variants. Byte-pair encoding (Sennrich et al., 2015) became the standard subword approach. SentencePiece (Kudo & Richardson, 2018) added language-agnostic handling that does not presume whitespace-delimited words. WordPiece, used by BERT (Devlin et al., 2018), selects merges by likelihood gain rather than raw frequency. Modern transformers use BPE or SentencePiece vocabularies of roughly 50k–256k tokens.
When to use it
Choose vocabulary size when:
- Building a new tokeniser for a domain
- Balancing model size (larger vocab = larger embeddings)
- Optimizing for language coverage (covering many or underrepresented languages needs a larger vocab)
- Considering inference speed (a larger vocab means fewer tokens per word, so shorter sequences)
Standard sizes: 32k for efficient models, 50k–128k for well-rounded performance, 256k+ for maximum flexibility across diverse languages. Empirically, around 50k tokens is a common sweet spot for English-centric models.
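A back-of-the-envelope comparison of the tradeoff (illustrative numbers; actual tokens-per-word rates vary by language, domain, and tokeniser):

# Embedding-table size grows linearly with vocabulary size.
d_model = 4096
for vocab_size in (32_000, 50_000, 128_000, 256_000):
    embed_params = vocab_size * d_model
    print(f"{vocab_size:>7} entries -> {embed_params / 1e6:.0f}M embedding parameters")
# 32k -> ~131M, 50k -> ~205M, 128k -> ~524M, 256k -> ~1049M.
# The larger tables buy shorter sequences: more words map to a single token,
# so the same text occupies fewer positions at inference time.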