Zipf's Law

What it is

Zipf’s Law states that in natural-language text, a word’s frequency is inversely proportional to its frequency rank: f(r) = k / r, where f is frequency, r is rank, and k is a constant. The law explains why a few very common words (the, and, to) dominate while most words are rare. It holds approximately across languages and corpora.

[illustrate: Log-log plot of rank vs. frequency showing straight line; “the” at top-left with high frequency, rare words at bottom-right with frequency ~1]

How it works

Zipf’s Law (rank-frequency):

  • f(r) ≈ k / r
  • Frequency = constant / rank
  • log(f) ≈ log(k) - log(r) (linear in log-log space)

Examples:

  • Rank 1 (most common): “the” ~7% of corpus
  • Rank 2: “and” ~3.6% of corpus
  • Rank 10: ~0.7% of corpus
  • Rank 100: ~0.07% of corpus
  • Rank 1000: ~0.007% of corpus
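The figures above can be reproduced with a quick sketch, assuming a pure Zipf model with k ≈ 0.07 (chosen to match the rank-1 figure):

```python
def zipf_proportion(rank, k=0.07):
    """Predicted share of the corpus taken by the word at a given rank,
    under the pure Zipf model f(r) = k / r (k = 0.07 is an assumption
    matching the ~7% figure for rank 1 above)."""
    return k / rank

for r in [1, 2, 10, 100, 1000]:
    print(f"rank {r:>4}: {zipf_proportion(r):.4%}")
```

Note the pure model predicts 3.5% at rank 2, slightly below the empirical 3.6% above; real corpora only follow the law approximately.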

Implications:

  • Few words (top ~1000) cover ~80% of text
  • Most word types are rare; many (hapax legomena) appear only once
  • Power-law distribution, not Gaussian
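The coverage implication can be examined with a toy calculation: under a pure Zipf distribution truncated at a finite vocabulary size, the coverage of the top n words is the ratio of harmonic numbers H(n) / H(V). The vocabulary size below is a hypothetical choice for illustration:

```python
from math import fsum

def zipf_coverage(top_n, vocab_size):
    """Fraction of tokens covered by the top_n most frequent words,
    assuming a pure Zipf distribution f(r) = k / r truncated at
    vocab_size (a hypothetical size, for illustration)."""
    harmonic = lambda n: fsum(1.0 / r for r in range(1, n + 1))
    return harmonic(top_n) / harmonic(vocab_size)

# Coverage of the top 1,000 words in a 50,000-word vocabulary
print(f"{zipf_coverage(1000, 50_000):.1%}")
```

The idealized 1/r model yields only about two-thirds coverage here, below the ~80% empirical figure; this gap is consistent with the deviations from pure 1/r noted under Variants and history.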

Example

Corpus: 1M token English text

Rank   Word   Frequency   Proportion
1      the    70,000      7%
2      and    36,000      3.6%
3      to     28,000      2.8%
10     that   7,000       0.7%
100    back   700         0.07%
1000   vain   70          0.007%

Log-log plot: straight line with slope −1 (Zipf exponent)
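The slope can be estimated from the table above by least-squares regression of log(frequency) on log(rank), a minimal sketch of how the Zipf exponent is fitted in practice:

```python
import math

# Frequency table from the example corpus above
ranks = [1, 2, 3, 10, 100, 1000]
freqs = [70_000, 36_000, 28_000, 7_000, 700, 70]

# Least-squares slope in log-log space
xs = [math.log(r) for r in ranks]
ys = [math.log(f) for f in freqs]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
print(round(slope, 2))  # close to -1, the Zipf exponent
```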

Variants and history

Zipf’s Law was formulated by George Kingsley Zipf (1949) from empirical observations. Later studies showed deviations from pure 1/r behavior, leading to modified forms f(r) ∝ 1/r^α with exponent α ≠ 1. The Zipf–Mandelbrot variant, f(r) ∝ (r + b)^−α, often fits better, especially at low ranks. The law holds across languages, corpora, and even non-linguistic systems (city sizes, income distributions). Its origin remains debated; proposed explanations include compression, least-effort optimization, and stochastic models of language dynamics.

When to use it

Remember Zipf’s Law when:

  • Designing IR indexes (few words carry most discriminative signal)
  • Building language models (smoothing crucial for rare words)
  • Estimating vocabulary coverage (top 1k words cover ~80%)
  • Corpus analysis (predicting vocabulary size from token count)
  • Understanding natural language distribution

Zipf’s Law has practical implications: the 80–20 rule (roughly 80% coverage from about 20% of the vocabulary) is useful for language learning and IR optimization.

See also