Zipf's Law
What it is
Zipf’s Law states that in natural language text, a word’s frequency is inversely proportional to its rank in the frequency table: f(r) = k / r, where f is frequency, r is rank, and k is a constant (roughly the frequency of the most common word). The law explains why a few very common words (the, and, to) dominate while most words are rare. It holds approximately across languages and corpora.
[illustrate: Log-log plot of rank vs. frequency showing straight line; “the” at top-left with high frequency, rare words at bottom-right with frequency ~1]
How it works
Zipf’s Law (rank-frequency):
- f(r) ≈ k / r
- Frequency = constant / rank
- log(f) ≈ log(k) - log(r) (linear in log-log space)
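The log-log linearity can be checked directly. A minimal sketch, assuming k = 70,000 (the rank-1 frequency in a hypothetical 1M-token corpus, matching the example below); `zipf_freq` is an illustrative helper, not a standard function:

```python
import math

k = 70_000  # assumed: frequency of the rank-1 word in a 1M-token corpus

def zipf_freq(rank, k=k):
    """Predicted frequency of the word at a given rank under f(r) = k / r."""
    return k / rank

# In log-log space the relation is linear: log f(r) = log k - log r,
# so the slope between any two ranks should be exactly -1.
r1, r2 = 10, 100
slope = (math.log(zipf_freq(r2)) - math.log(zipf_freq(r1))) / (math.log(r2) - math.log(r1))
print(slope)  # -1.0
```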
Examples:
- Rank 1 (most common): “the” ~7% of corpus
- Rank 2: “and” ~3.6% of corpus
- Rank 10: ~0.7% of corpus
- Rank 100: ~0.07% of corpus
- Rank 1000: ~0.007% of corpus
Implications:
- Few words (top ~1000) cover ~80% of text
- Most word types are rare; hapax legomena (words occurring exactly once) typically make up a large share of the vocabulary
- Power-law distribution, not Gaussian
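Even a toy corpus shows the skew described above: one word dominates and most word types occur only once. A small sketch using Python’s `collections.Counter` (the sentence is made up for illustration):

```python
from collections import Counter

# Tiny illustrative corpus: rank-frequency skew shows up even at this scale
text = "the cat sat on the mat and the dog sat by the door".split()
counts = Counter(text)

ranked = counts.most_common()
print(ranked[0])  # ('the', 4): the rank-1 word alone is ~31% of tokens

# Hapax legomena: word types occurring exactly once
hapaxes = [w for w, c in counts.items() if c == 1]
print(len(hapaxes), "of", len(counts), "types are hapaxes")
```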
Example
Corpus: 1M token English text
| Rank | Word | Frequency | Proportion |
|---|---|---|---|
| 1 | the | 70,000 | 7% |
| 2 | and | 36,000 | 3.6% |
| 3 | to | 28,000 | 2.8% |
| 10 | that | 7,000 | 0.7% |
| 100 | back | 700 | 0.07% |
| 1000 | vain | 70 | 0.007% |
Log-log plot: straight line with slope −1 (Zipf exponent)
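The slope can be estimated from the table itself with an ordinary least-squares fit in log-log space. A sketch in plain Python (no plotting library), using the rank/frequency pairs above:

```python
import math

# Rank/frequency pairs from the table above
data = [(1, 70_000), (2, 36_000), (3, 28_000), (10, 7_000), (100, 700), (1000, 70)]

# Least-squares fit of log(f) against log(r)
xs = [math.log(r) for r, _ in data]
ys = [math.log(f) for _, f in data]
n = len(data)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
print(round(slope, 2))  # slope near -1, the Zipf exponent
```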
Variants and history
Zipf’s Law was formulated by George Kingsley Zipf (1949) from empirical observations. Later studies found deviations from pure 1/r behavior, leading to a generalized form f(r) ∝ 1/r^α with exponent α ≠ 1. The Zipf–Mandelbrot variant, f(r) ∝ (r + b)^−α, fits real data better, especially at low ranks. The law holds across languages and corpora, and even in non-linguistic systems (city sizes, income distributions). Its origin remains debated; proposed explanations include least-effort/compression principles, optimization, and random-text models of language dynamics.
When to use it
Remember Zipf’s Law when:
- Designing IR indexes (few words carry most discriminative signal)
- Building language models (smoothing crucial for rare words)
- Estimating vocabulary coverage (top 1k words cover ~80%)
- Corpus analysis (predicting vocabulary size from token count)
- Understanding natural language distribution
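The coverage estimate in the list above can be sketched analytically: under a pure f(r) = k/r model with vocabulary size V, the fraction of tokens covered by the top-m ranks is the ratio of partial harmonic sums H_m / H_V. The vocabulary size below is an assumption for illustration; note the idealized 1/r model gives a lower figure than the empirical ~80%, one of the deviations that motivates the Mandelbrot variant:

```python
def harmonic(n):
    """Partial harmonic sum H_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / r for r in range(1, n + 1))

V = 100_000  # assumed vocabulary size (illustrative)

# Fraction of running text covered by the top 1,000 words under pure 1/r
coverage = harmonic(1_000) / harmonic(V)
print(round(coverage, 2))  # ~0.62 for the pure model; real corpora reach ~0.8
```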
Zipf’s Law has practical consequences, notably an 80–20-style rule: a small fraction of the vocabulary covers most running text, which is useful for prioritizing vocabulary in language learning and for IR optimization.