Heaps' Law

What it is

Heaps’ Law describes how vocabulary size grows as corpus size increases. The law states that vocabulary size V (the number of distinct word types) grows as a power law of corpus size N (the number of tokens): V = K × N^β, where K and β are empirically fitted constants (typically 10 ≤ K ≤ 100, 0.4 ≤ β ≤ 0.6). Because β < 1, vocabulary grows sub-linearly: adding more text yields diminishing returns in new words.
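To make V and N concrete, the following is a minimal sketch of how a vocabulary growth curve might be measured. The function name `vocabulary_growth`, the whitespace tokenizer, and the toy sentence are illustrative assumptions, not part of the law itself.

```python
def vocabulary_growth(tokens, checkpoints):
    """Return (tokens_read, distinct_types_seen) at each checkpoint.

    Hypothetical helper: `tokens` is any iterable of strings and
    `checkpoints` lists the token counts at which V is recorded.
    """
    targets = set(checkpoints)
    seen = set()
    curve = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        if n in targets:
            curve.append((n, len(seen)))
    return curve


# Toy input only; a real measurement would use a properly tokenized corpus.
text = "the cat sat on the mat and the dog sat on the log"
tokens = text.lower().split()
curve = vocabulary_growth(tokens, checkpoints=[4, 8, 13])
print(curve)  # [(4, 4), (8, 6), (13, 8)]
```

Even in this tiny sample, the number of new types per token falls as repeated words start to dominate.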

[illustrate: Plot of corpus size (x-axis) vs. vocabulary size (y-axis) showing power-law curve; comparison across languages and domains]

How it works

Heaps’ Law:

V(N) = K × N^β
log(V) = log(K) + β × log(N)
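
Because the logarithmic form is a straight line with slope β and intercept log(K), the constants can be estimated by ordinary least squares on (log N, log V) pairs. The sketch below assumes this simple fitting approach; `fit_heaps` is a hypothetical helper name, and the data is synthetic, generated from known constants so the fit can be checked.

```python
import math


def fit_heaps(sizes, vocab):
    """Least-squares fit of log V = log K + beta * log N; returns (K, beta)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in vocab]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx                           # slope of the log-log line
    k = math.exp(mean_y - beta * mean_x)       # intercept, mapped back from log space
    return k, beta


# Synthetic data from K = 30, beta = 0.5 (assumed values, for checking only).
sizes = [10 ** e for e in range(2, 8)]
vocab = [30 * n ** 0.5 for n in sizes]
k, beta = fit_heaps(sizes, vocab)
print(round(k, 3), round(beta, 3))
```

Fitting in log space weights relative errors equally across corpus sizes; a nonlinear fit on the raw values would emphasize the largest corpora instead.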

Typical parameters:

β ≈ 0.4–0.6 (sub-linear growth)
β ≈ 0.4: highly repetitive text (children’s books, technical docs)
β ≈ 0.6: diverse text (literary works, news)

Implications:

Doubling corpus size increases vocabulary by ~41% (if β = 0.5, since 2^0.5 ≈ 1.41)
Vocabulary is not bounded under the law: V keeps growing with N, only at a decreasing rate (there is no asymptote)
Different languages have different β (English ~0.5, highly inflectional languages higher)
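
The effect of scaling the corpus follows directly from the formula: V(sN) / V(N) = s^β, independent of K. A small sketch of that arithmetic, with `growth_factor` as an illustrative name:

```python
def growth_factor(scale, beta):
    """Factor by which V grows when N is multiplied by `scale` (equals scale**beta)."""
    return scale ** beta


for beta in (0.4, 0.5, 0.6):
    pct = (growth_factor(2, beta) - 1) * 100
    print(f"beta={beta}: doubling N adds about {pct:.0f}% more types")
```

For the typical range of β, doubling the corpus adds roughly a third to a half more distinct words.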

Example

Corpus growth (English Wikipedia):
Tokens (N) | Observed vocabulary (V) | Predicted V (β=0.5, K=44)
100        | 420                     | 440
1k         | 1,900                   | 1,400
10k        | 5,300                   | 4,400
100k       | 11,300                  | 13,900
1M         | 28,300                  | 44,000
10M        | 72,000                  | 139,000

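Observation: vocabulary grows sub-linearly. Each tenfold increase in tokens multiplies the vocabulary by only about 2 to 4.5.
After 1M tokens: 28k unique words (about 2.8% of tokens)
After 10M tokens: 72k unique words (0.72% of tokens)
The fixed β = 0.5 curve overestimates the observed vocabulary from 100k tokens onward; the growth from 1M to 10M tokens corresponds to β ≈ 0.4.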
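
As a rough check on the table, the sketch below recomputes the predicted column and estimates a local exponent between consecutive rows, i.e. the slope of log V against log N. The observed values are copied from the table; the helper names `predict` and `local_exponents` are illustrative assumptions.

```python
import math

# (tokens, observed vocabulary), copied from the table above.
rows = [(100, 420), (1_000, 1_900), (10_000, 5_300),
        (100_000, 11_300), (1_000_000, 28_300), (10_000_000, 72_000)]

K, BETA = 44, 0.5  # parameters used for the predicted column


def predict(n, k=K, beta=BETA):
    """Heaps' Law prediction V = K * N**beta."""
    return k * n ** beta


def local_exponents(rows):
    """Slope of log V against log N between each pair of consecutive rows."""
    return [math.log(v2 / v1) / math.log(n2 / n1)
            for (n1, v1), (n2, v2) in zip(rows, rows[1:])]


for n, v in rows:
    p = predict(n)
    print(f"{n:>10,}  observed={v:>7,}  predicted={p:>9,.0f}  ratio={v / p:.2f}")
print([round(b, 2) for b in local_exponents(rows)])
```

The local exponent is not constant across rows, which suggests that a single (K, β) pair fitted over the whole range is only an approximation for this data.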

Variants and history

Heaps’ Law was formulated by Harold Stanley Heaps (1978). The exponent β varies empirically, and different language families show different exponents. Proposed theoretical explanations draw on information theory, random graph models, and lognormal distributions. Refinements address the handling of stopwords, normalization by language, and morphology. The law has been reported to hold across natural languages, programming languages, and even biological sequences.

When to use it

Use Heaps’ Law when:

Estimating vocabulary size for a corpus of a given size
Predicting growth of vocabulary as text is added
Comparing language richness across corpora
Planning corpus annotation effort (diminishing returns)
Understanding language learning curves

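Practical: Reaching a larger target vocabulary requires disproportionately more text. Inverting the law gives N = (V/K)^(1/β), which is polynomial rather than exponential growth. With β = 0.5, collecting 95% of a target vocabulary takes about (0.95/0.80)^2 ≈ 1.4 times the text needed for 80%, and doubling the vocabulary requires about four times as many tokens.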
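
For planning purposes, the law can be inverted to estimate how many tokens a target vocabulary would require. This is a sketch under the assumption that a single (K, β) pair holds over the whole range; `tokens_needed` is a hypothetical helper, and K = 44, β = 0.5 are the illustrative values from the example.

```python
def tokens_needed(target_vocab, k, beta):
    """Invert V = K * N**beta to get N = (V / K) ** (1 / beta)."""
    return (target_vocab / k) ** (1 / beta)


K, BETA = 44, 0.5  # illustrative constants, not fitted to any particular corpus
for v in (10_000, 20_000, 40_000):
    print(f"V={v:>6,}  N ~ {tokens_needed(v, K, BETA):,.0f}")
```

Because small errors in β are amplified by the 1/β exponent, such estimates are best treated as order-of-magnitude guides.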