Vocabulary
What it is
Vocabulary is the set of all unique terms (words, subwords, or characters) appearing in a corpus. Its size depends on the text collection and on how terms are defined (word-level, subword-level, or character-level). Understanding vocabulary size, growth, and distribution is fundamental to corpus analysis, information retrieval (IR), and NLP system design.
[illustrate: Vocabulary spectrum from small domain (medical corpus ~5k unique terms) to large general corpus (Wikipedia ~1M unique words)]
How it works
Definition:
- V = set of unique terms in corpus
- |V| = vocabulary size (unique term count)
- Contrasts with token count (total occurrences)
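The distinction is easy to see in code. A minimal sketch (the sentence and the whitespace tokenisation below are illustrative assumptions, not from the article):

```python
# Types (vocabulary) vs. tokens, on an illustrative sentence.
text = "to be or not to be"      # illustrative text
tokens = text.split()            # naive whitespace tokenisation (an assumption)

N = len(tokens)                  # token count: total occurrences -> 6
V = set(tokens)                  # vocabulary: unique terms -> {'to', 'be', 'or', 'not'}

print(f"N = {N} tokens, |V| = {len(V)} types")   # N = 6 tokens, |V| = 4 types
```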
Vocabulary growth:
- As corpus size increases, vocabulary grows (Heaps’ Law: V ≈ k × N^β)
- β ≈ 0.4–0.6 depending on language and domain
- Larger β: diverse vocabulary; smaller β: repetitive text
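A small sketch of using Heaps' Law as a predictor of vocabulary size; the constants k and β below are illustrative values in the range typically reported for English text, not fitted to any corpus mentioned here:

```python
# Heaps' Law: |V| ≈ k * N**beta, with N the corpus size in tokens.
# k = 44 and beta = 0.49 are illustrative, not fitted to a specific corpus.
def heaps_vocab_size(n_tokens: int, k: float = 44.0, beta: float = 0.49) -> int:
    return round(k * n_tokens ** beta)

for n in (1_000, 1_000_000, 1_000_000_000):
    print(f"{n:>13,} tokens -> ~{heaps_vocab_size(n):,} predicted unique terms")
```

Note that doubling the corpus does not double the vocabulary: with β ≈ 0.5, the corpus must grow roughly four-fold for the vocabulary to double.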
Measurement:
- Type-token ratio (TTR): |V| / token_count
- High TTR: lexically diverse (literary text)
- Low TTR: repetitive (children’s books, technical docs)
Example
Corpus: "The cat sat on the mat. The cat was fat."
Tokens: 12 (including the two full stops)
Types (vocabulary, case-folded): {the, cat, sat, on, mat, was, fat, .} = 8 unique terms
TTR: 8 / 12 ≈ 0.67
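A short sketch that reproduces these counts; the regex tokeniser and the case-folding are assumptions about how the example was tokenised:

```python
import re

# Reproduce the worked example: case-fold, then split into words and punctuation.
corpus = "The cat sat on the mat. The cat was fat."
tokens = re.findall(r"\w+|[^\w\s]", corpus.lower())   # assumed tokenisation

types = set(tokens)
ttr = len(types) / len(tokens)

print(len(tokens), len(types))        # 12 tokens, 8 types
print(f"TTR = {ttr:.2f}")             # TTR = 0.67
```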
Scaling:
Small corpus (1k tokens) → ~300 unique terms
Wikipedia (4B tokens) → ~170k unique words (English)
SQuAD dataset (100M tokens) → ~40k unique words
Variants and history
Vocabulary analysis dates back to early computational linguistics (1960s–70s); the type-token ratio and vocabulary-growth curves are classical measures. Zipf’s Law describes the distribution of term frequencies (a few words are very frequent, most are rare), and Heaps’ Law predicts vocabulary size from corpus size. Modern NLP systems use subword vocabularies (roughly 30k–256k tokens) rather than word-level ones, learned with BPE or SentencePiece.
When to use it
Understand vocabulary when:
- Designing tokenisers and preprocessing
- Analyzing corpus diversity and richness
- Estimating model capacity (embedding parameters ≈ vocabulary size × embedding dimension)
- Comparing corpora across languages or domains
- Understanding language learning curves
Vocabulary size influences model design: BERT uses a ~30k WordPiece vocabulary, while GPT-2 and GPT-3 use a ~50k BPE vocabulary. Larger vocabularies cover diverse text with fewer tokens per sequence; smaller vocabularies shrink the embedding table and overall model size.
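A quick way to check these numbers is to load the published tokenisers; this sketch assumes the Hugging Face transformers package is installed and the checkpoints can be downloaded (GPT-3 is not public, but it reuses GPT-2’s ~50k BPE vocabulary):

```python
# Inspect subword vocabulary sizes of public checkpoints.
# Assumes the `transformers` package and network access to the Hugging Face Hub.
from transformers import AutoTokenizer

for name in ("bert-base-uncased", "gpt2"):
    tok = AutoTokenizer.from_pretrained(name)
    print(f"{name}: vocab size = {tok.vocab_size}")   # ~30,522 and ~50,257
```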