Heaps' Law
What it is
Heaps’ Law describes how vocabulary size grows as corpus size increases. The law states that vocabulary size V (the number of distinct word types) grows as a power of corpus size N (the number of tokens): V = K × N^β, where K and β are empirically fitted constants (typically 10 ≤ K ≤ 100, 0.4 ≤ β ≤ 0.6). Because β < 1, vocabulary grows sub-linearly: adding more text yields diminishing returns in new words.
[illustrate: Plot of corpus size (x-axis) vs. vocabulary size (y-axis) showing power-law curve; comparison across languages and domains]
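The quantity V(N) can be measured directly by counting distinct types while reading a token stream. The sketch below is a minimal illustration, not a reference implementation: the function name `vocabulary_growth`, the toy sentence, and the lowercase-and-split tokenization are all assumptions made for the example.

```python
def vocabulary_growth(tokens, checkpoints):
    """Return {N: V(N)} for each checkpoint N, where V(N) is the number
    of distinct types among the first N tokens."""
    wanted = set(checkpoints)
    seen = set()
    growth = {}
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        if n in wanted:
            growth[n] = len(seen)
    return growth


# Toy input; lowercasing and whitespace splitting are simplifying assumptions.
text = "the cat sat on the mat and the dog sat on the log"
tokens = text.lower().split()
growth = vocabulary_growth(tokens, checkpoints=[4, 8, 13])
print(growth)
```

On a real corpus the checkpoints would usually be spaced logarithmically (100, 1k, 10k, ...), so that the resulting points are evenly spread on a log-log plot.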
How it works
Heaps’ Law:
- V(N) = K × N^β
- log(V) = log(K) + β × log(N)
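Because the log form is linear in log(N), K and β can be estimated with an ordinary least-squares line fit. The following is a sketch under stated assumptions: `fit_heaps` is a hypothetical helper name, and the input data are synthetic points generated from K = 44 and β = 0.5 rather than real measurements.

```python
import math


def fit_heaps(sizes, vocab):
    """Least-squares fit of log V = log K + beta * log N. Returns (K, beta)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in vocab]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx                           # slope of the fitted line
    k = math.exp(mean_y - beta * mean_x)       # intercept is log K
    return k, beta


# Synthetic measurements generated from K = 44, beta = 0.5 (illustrative values).
sizes = [10 ** e for e in range(2, 8)]
vocab = [44 * n ** 0.5 for n in sizes]
K, beta = fit_heaps(sizes, vocab)
print(f"K = {K:.2f}, beta = {beta:.3f}")
```

Fitting in log space weights relative errors equally at every corpus size. A nonlinear fit on the raw values would instead be dominated by the largest samples.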
Typical parameters:
- β ≈ 0.4–0.6 (sub-linear growth)
- β ≈ 0.4: highly repetitive text (children’s books, technical docs)
- β ≈ 0.6: diverse text (literary works, news)
Implications:
- Doubling corpus size multiplies vocabulary by 2^β, an increase of ~41% if β = 0.5
- Vocabulary is unbounded under the law: it keeps growing with N and has no asymptote, only an ever-slower growth rate
- Different languages have different β (English ~0.5, highly inflectional languages higher)
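The effect of scaling the corpus follows directly from the formula: V(cN) / V(N) = c^β, independent of K. A small sketch (the helper name `growth_factor` is an assumption for illustration) evaluates this for the typical range of β:

```python
def growth_factor(scale, beta):
    """Factor by which V grows when N is multiplied by `scale`:
    V(scale * N) / V(N) = scale ** beta."""
    return scale ** beta


for beta in (0.4, 0.5, 0.6):
    pct = (growth_factor(2, beta) - 1) * 100
    print(f"beta = {beta}: doubling N adds {pct:.1f}% to V")
```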
Example
Corpus growth (English Wikipedia):
Tokens | Vocabulary | Predicted (β=0.5, K=44)
100 | 420 | 440
1k | 1,900 | 1,400
10k | 5,300 | 4,400
100k | 11,300 | 13,900
1M | 28,300 | 44,000
10M | 72,000 | 139,000
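Observation: the observed vocabulary grows sub-linearly, as Heaps’ Law predicts, but the fixed fit (β=0.5, K=44) overshoots from 100k tokens onward, which suggests an effective exponent below 0.5 for this data.
After 1M tokens: 28.3k unique words (type/token ratio ≈ 2.8%)
After 10M tokens: 72k unique words (type/token ratio ≈ 0.72%)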
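The predicted column and the type/token ratios can be recomputed from the table. This sketch copies the observed values from the table above and uses its stated K = 44 and β = 0.5; the names `observed`, `predict`, and `rows` are illustrative only.

```python
# (tokens, observed vocabulary) pairs copied from the table above.
observed = [
    (100, 420),
    (1_000, 1_900),
    (10_000, 5_300),
    (100_000, 11_300),
    (1_000_000, 28_300),
    (10_000_000, 72_000),
]
K, BETA = 44, 0.5  # parameters stated in the table header


def predict(n, k=K, beta=BETA):
    """Heaps' Law prediction V = k * n ** beta."""
    return k * n ** beta


# Each row: tokens, observed V, predicted V (rounded), type/token ratio.
rows = [(n, v, round(predict(n)), v / n) for n, v in observed]

for n, v, pred, ttr in rows:
    print(f"{n:>10,} | {v:>7,} | {pred:>8,} | {ttr:.2%}")
```

The table rounds the predictions to two or three significant figures, so the recomputed values differ slightly in the last digits.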
Variants and history
Heaps’ Law was formulated by Harold Stanley Heaps (1978). The exponent β varies empirically, and different language families show different exponents. Proposed theoretical explanations draw on information theory, random graph models, and lognormal distributions. Refinements cover the treatment of stopwords, normalization by language, and morphology. The law has been reported to hold across natural languages, programming languages, and even biological sequences.
When to use it
Use Heaps’ Law when:
- Estimating vocabulary size for corpus of given size
- Predicting growth of vocabulary as text is added
- Comparing language richness across corpora
- Planning corpus annotation effort (diminishing returns)
- Understanding language learning curves
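Practical: reaching 95% vocabulary coverage requires a much larger corpus than reaching 80%. Inverting the law gives N = (V/K)^(1/β), so with β = 0.5 doubling the vocabulary takes about four times as much text (polynomial growth, not exponential).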
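For planning purposes the law can be solved for N, giving N = (V/K)^(1/β). The sketch below assumes the illustrative parameters K = 44 and β = 0.5 from the example table; `tokens_needed` is a hypothetical helper, and real corpora would need their own fitted constants.

```python
def tokens_needed(target_vocab, k=44.0, beta=0.5):
    """Invert V = k * N ** beta to estimate the corpus size N
    required to observe `target_vocab` distinct types."""
    return (target_vocab / k) ** (1.0 / beta)


n1 = tokens_needed(44_000)
n2 = tokens_needed(88_000)
print(f"44k types: about {n1:,.0f} tokens")
print(f"88k types: about {n2:,.0f} tokens ({n2 / n1:.1f}x as many)")
```

The estimate is only as good as the fitted parameters, and extrapolating far beyond the sizes used for fitting is unreliable when the effective exponent drifts with N.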