Heaps' Law

What it is

Heaps’ Law describes how vocabulary size grows as corpus size increases. The law states that vocabulary size V (the number of distinct words) grows as a power of corpus size N (the number of tokens): V = K × N^β, where K and β are empirically fitted constants (typically 10 ≤ K ≤ 100 and 0.4 ≤ β ≤ 0.6). Because β < 1, vocabulary grows sub-linearly: each additional block of text contributes fewer new words than the one before.

[illustrate: Plot of corpus size (x-axis) vs. vocabulary size (y-axis) showing power-law curve; comparison across languages and domains]
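The data behind such a plot can be collected by counting distinct words while reading through a text. The following is a minimal sketch, assuming lowercasing and whitespace tokenisation (both simplifications); `vocabulary_growth` is a hypothetical helper name, not a library function.

```python
def vocabulary_growth(tokens, step=1):
    """Return (tokens_seen, distinct_words_seen) pairs, sampled every `step` tokens."""
    seen = set()
    curve = []
    for i, token in enumerate(tokens, start=1):
        seen.add(token)
        if i % step == 0:
            curve.append((i, len(seen)))
    return curve


# Toy text for illustration only; real measurements need far larger corpora.
text = "the cat sat on the mat and the dog sat on the log"
tokens = text.lower().split()
curve = vocabulary_growth(tokens, step=4)
print(curve)  # [(4, 4), (8, 6), (12, 7)]
```

Even on this toy input the number of new words per block of four tokens falls from 4 to 2 to 1, which is the flattening the plot is meant to show.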

How it works

Heaps’ Law:

  • V(N) = K × N^β
  • log(V) = log(K) + β × log(N) (a straight line with slope β on log-log axes)
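The two forms above are equivalent, which can be checked numerically. This is a small sketch; the function names are hypothetical, and K = 44, β = 0.5 are simply the illustrative constants used in the example table of this entry.

```python
import math


def heaps_vocabulary(n_tokens, k, beta):
    """Power-law form: V(N) = K * N**beta."""
    return k * n_tokens ** beta


def heaps_log_vocabulary(n_tokens, k, beta):
    """Log-linear form: log V = log K + beta * log N."""
    return math.log(k) + beta * math.log(n_tokens)


k, beta = 44, 0.5  # illustrative constants, not universal values
n = 1_000_000
v = heaps_vocabulary(n, k, beta)

# Both forms describe the same curve.
assert math.isclose(math.log(v), heaps_log_vocabulary(n, k, beta))
print(round(v))  # 44000
```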

Typical parameters:

  • β ≈ 0.4–0.6 (sub-linear growth)
  • β ≈ 0.4: highly repetitive text (children’s books, technical docs)
  • β ≈ 0.6: diverse text (literary works, news)

Implications:

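  • Doubling corpus size increases vocabulary by ~41% (if β = 0.5, since 2^0.5 ≈ 1.41)
  • Vocabulary is unbounded: growth slows but never reaches an asymptote, because N^β keeps increasing for any β > 0
  • Different languages have different β (English ~0.5; highly inflected languages tend to be higher, since each lemma appears in many surface forms)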
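The effect of scaling the corpus follows directly from the formula: V(cN) / V(N) = c^β, so K cancels out. A short sketch (the helper name `growth_factor` is hypothetical) that evaluates this for a few exponents in the typical range:

```python
def growth_factor(corpus_multiplier, beta):
    """Ratio V(c*N) / V(N) under Heaps' Law; equals c**beta, independent of K."""
    return corpus_multiplier ** beta


for beta in (0.4, 0.5, 0.6):
    factor = growth_factor(2, beta)
    print(f"beta={beta}: doubling the corpus multiplies vocabulary by {factor:.2f}")
```

The factor stays above 1 for every corpus size, which is why the law predicts continued (if slowing) growth rather than a ceiling.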

Example

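Corpus growth (English Wikipedia):
Tokens  | Observed vocabulary | Predicted (β=0.5, K=44)
100     | 420                 | 440
1k      | 1,900               | 1,400
10k     | 5,300               | 4,400
100k    | 11,300              | 13,900
1M      | 28,300              | 44,000
10M     | 72,000              | 139,000

Caveat: a corpus of N tokens contains at most N distinct words, so the first two rows (vocabulary larger than token count) cannot be taken at face value. With K = 44 and β = 0.5 the formula itself exceeds N whenever N < K² = 1,936, so the fit is only meaningful for larger corpora.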
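Because the law is linear in log-log space, K and β can be estimated from rows like these by ordinary least squares. The sketch below is one simple way to do that, not a prescribed method. It uses only the rows from 10k tokens upward, an assumption made here because a power-law fit is not meaningful on very small samples. `fit_heaps` is a hypothetical helper name.

```python
import math

# (tokens, observed vocabulary) copied from the table, 10k tokens and up.
observations = [
    (10_000, 5_300),
    (100_000, 11_300),
    (1_000_000, 28_300),
    (10_000_000, 72_000),
]


def fit_heaps(points):
    """Least-squares fit of log V = log K + beta * log N; returns (K, beta)."""
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(v) for _, v in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    k = math.exp(mean_y - beta * mean_x)
    return k, beta


k_fit, beta_fit = fit_heaps(observations)
print(f"K = {k_fit:.0f}, beta = {beta_fit:.2f}")
```

The fitted constants need not match the K = 44, β = 0.5 used for the Predicted column; with only four points the estimate is rough and mainly useful as a sanity check.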

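Observation: vocabulary grows sub-linearly, as Heaps’ Law describes
After 1M tokens: 28k unique words (type-token ratio 2.8%)
After 10M tokens: 72k unique words (type-token ratio 0.72%)
The β = 0.5 prediction overshoots at 1M and 10M tokens, which suggests a smaller exponent for this data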
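The percentages here are type-token ratios: distinct words divided by total tokens. A brief sketch that recomputes them from the table rows (the function name is hypothetical):

```python
def type_token_ratio(n_tokens, n_types):
    """Fraction of tokens that are distinct word types."""
    return n_types / n_tokens


# (tokens, observed vocabulary) for the two largest corpus sizes in the table.
rows = [(1_000_000, 28_300), (10_000_000, 72_000)]

for n_tokens, n_types in rows:
    ratio = type_token_ratio(n_tokens, n_types)
    print(f"{n_tokens:,} tokens: {ratio:.2%}")
```

The ratio falls as the corpus grows, which is another way of stating sub-linear vocabulary growth.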

Variants and history

Heaps’ Law is named after Harold Stanley Heaps, who described it in 1978; it is also known as Herdan’s Law, after Gustav Herdan’s earlier formulation (1960). The exponent β varies empirically, and different language families show different values. Proposed theoretical explanations draw on information theory, random graph models, and lognormal distributions. Refinements include the treatment of stopwords, normalization by language, and accounting for morphology. Similar sub-linear growth has been reported for natural languages, programming languages, and even biological sequences.

When to use it

Use Heaps’ Law when:

  • Estimating vocabulary size for corpus of given size
  • Predicting growth of vocabulary as text is added
  • Comparing language richness across corpora
  • Planning corpus annotation effort (diminishing returns)
  • Understanding language learning curves

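Practical: To achieve 95% vocabulary coverage, a disproportionately larger corpus is needed than for 80% coverage. Inverting Heaps’ Law gives N = (V/K)^(1/β), so the required corpus grows as a power of the target vocabulary (polynomially, not exponentially): with β = 0.5, doubling the target vocabulary requires roughly four times as much text.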
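The corpus size needed for a target vocabulary can be estimated by solving V = K × N^β for N. The sketch below assumes the illustrative constants K = 44 and β = 0.5; `tokens_needed` is a hypothetical helper, and the targets of 20,000 and 40,000 words are arbitrary.

```python
def tokens_needed(target_vocab, k, beta):
    """Invert V = K * N**beta to get N = (V / K) ** (1 / beta)."""
    return (target_vocab / k) ** (1 / beta)


k, beta = 44, 0.5  # illustrative constants, not universal values
n_small = tokens_needed(20_000, k, beta)
n_large = tokens_needed(40_000, k, beta)
ratio = n_large / n_small

print(f"20k words: ~{n_small:,.0f} tokens")
print(f"40k words: ~{n_large:,.0f} tokens")
print(f"corpus must grow by a factor of {ratio:.1f}")
```

For a general exponent the factor is 2^(1/β), so smaller β makes each additional word more expensive to find.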

See also