Vocabulary
What it is
Vocabulary is the set of all unique terms (words, subwords, or characters) appearing in a corpus. Its size depends on the text collection and on how terms are defined (word-level, subword-level, or character-level). Understanding vocabulary size, growth, and distribution is fundamental to corpus analysis, information retrieval (IR), and NLP system design.
[illustrate: Vocabulary spectrum from small domain (medical corpus ~5k unique terms) to large general corpus (Wikipedia ~1M unique words)]
How it works
Definition:
- V = set of unique terms in corpus
- |V| = vocabulary size (unique term count)
- Contrasts with token count (total occurrences)
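The distinction is easy to see in code. A minimal sketch (the sentence and the whitespace tokenisation below are illustrative assumptions, not from the article):

```python
# Types (vocabulary) vs. tokens, on an illustrative sentence.
text = "to be or not to be"      # illustrative text
tokens = text.split()            # naive whitespace tokenisation (an assumption)

N = len(tokens)                  # token count: total occurrences -> 6
V = set(tokens)                  # vocabulary: unique terms -> {'to', 'be', 'or', 'not'}

print(f"N = {N} tokens, |V| = {len(V)} types")   # N = 6 tokens, |V| = 4 types
```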
Vocabulary growth:
- As corpus size increases, vocabulary grows (Heaps’ Law: V ≈ k × N^β)
- β ≈ 0.4–0.6 depending on language and domain
- Larger β: diverse vocabulary; smaller β: repetitive text
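A small sketch of using Heaps' Law as a predictor of vocabulary size; the constants k and β below are illustrative values in the range typically reported for English text, not fitted to any corpus mentioned here:

```python
# Heaps' Law: |V| ≈ k * N**beta, with N the corpus size in tokens.
# k = 44 and beta = 0.49 are illustrative, not fitted to a specific corpus.
def heaps_vocab_size(n_tokens: int, k: float = 44.0, beta: float = 0.49) -> int:
    return round(k * n_tokens ** beta)

for n in (1_000, 1_000_000, 1_000_000_000):
    print(f"{n:>13,} tokens -> ~{heaps_vocab_size(n):,} predicted unique terms")
```

Note that doubling the corpus does not double the vocabulary: with β ≈ 0.5, the corpus must grow roughly four-fold for the vocabulary to double.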
Measurement:
- Type-token ratio (TTR): |V| / token_count
- High TTR: lexically diverse (literary text)
- Low TTR: repetitive (children’s books, technical docs)
Example
Corpus: "The cat sat on the mat. The cat was fat."
Tokens: 12 (including the two full stops)
Types (vocabulary, case-folded): {the, cat, sat, on, mat, was, fat, .} = 8 unique terms
TTR: 8 / 12 ≈ 0.67
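A short sketch that reproduces these counts; the regex tokeniser and the case-folding are assumptions about how the example was tokenised:

```python
import re

# Reproduce the worked example: case-fold, then split into words and punctuation.
corpus = "The cat sat on the mat. The cat was fat."
tokens = re.findall(r"\w+|[^\w\s]", corpus.lower())   # assumed tokenisation

types = set(tokens)
ttr = len(types) / len(tokens)

print(len(tokens), len(types))        # 12 tokens, 8 types
print(f"TTR = {ttr:.2f}")             # TTR = 0.67
```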
Scaling:
Small corpus (1k tokens) → ~300 unique terms
Wikipedia (4B tokens) → ~170k unique words (English)
SQuAD dataset (100M tokens) → ~40k unique words
Variants and history
Vocabulary analysis dates back to early computational linguistics (1960s–70s); the type-token ratio and vocabulary-growth curves are classical measures. Zipf’s Law describes the distribution of term frequencies (a few words are very frequent, most are rare), and Heaps’ Law predicts vocabulary size from corpus size. Modern NLP systems use subword vocabularies (roughly 30k–256k tokens) rather than word-level ones, learned with BPE or SentencePiece.
When to use it
Understand vocabulary when:
- Designing tokenisers and preprocessing
- Analyzing corpus diversity and richness
- Estimating model capacity (embedding parameters ≈ vocabulary size × embedding dimension)
- Comparing corpora across languages or domains
- Understanding language learning curves
Vocabulary size influences model design: BERT uses a ~30k WordPiece vocabulary, while GPT-2 and GPT-3 use a ~50k BPE vocabulary. Larger vocabularies cover diverse text with fewer tokens per sequence; smaller vocabularies shrink the embedding table and overall model size.
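A quick way to check these numbers is to load the published tokenisers; this sketch assumes the Hugging Face transformers package is installed and the checkpoints can be downloaded (GPT-3 is not public, but it reuses GPT-2’s ~50k BPE vocabulary):

```python
# Inspect subword vocabulary sizes of public checkpoints.
# Assumes the `transformers` package and network access to the Hugging Face Hub.
from transformers import AutoTokenizer

for name in ("bert-base-uncased", "gpt2"):
    tok = AutoTokenizer.from_pretrained(name)
    print(f"{name}: vocab size = {tok.vocab_size}")   # ~30,522 and ~50,257
```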