Zipf's Law

What it is

Zipf’s Law states that in natural-language text, a word’s frequency is inversely proportional to its frequency rank: f(r) = k / r, where f is frequency, r is rank, and k is a constant. The law explains why a few very common words (the, and, to) dominate while most words are rare. It holds approximately across languages and corpora.

[illustrate: Log-log plot of rank vs. frequency showing straight line; “the” at top-left with high frequency, rare words at bottom-right with frequency ~1]

How it works

Zipf’s Law (rank-frequency):

  • f(r) ≈ k / r
  • Frequency = constant / rank
  • log(f) ≈ log(k) - log(r) (linear in log-log space)

Examples:

  • Rank 1 (most common): “the” ~7% of corpus
  • Rank 2: “and” ~3.6% of corpus
  • Rank 10: ~0.7% of corpus
  • Rank 100: ~0.07% of corpus
  • Rank 1000: ~0.007% of corpus
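The figures above can be reproduced with a quick sketch, assuming a pure Zipf model with k ≈ 0.07 (chosen to match the rank-1 figure):

```python
def zipf_proportion(rank, k=0.07):
    """Predicted share of the corpus taken by the word at a given rank,
    under the pure Zipf model f(r) = k / r (k = 0.07 is an assumption
    matching the ~7% figure for rank 1 above)."""
    return k / rank

for r in [1, 2, 10, 100, 1000]:
    print(f"rank {r:>4}: {zipf_proportion(r):.4%}")
```

Note the pure model predicts 3.5% at rank 2, slightly below the empirical 3.6% above; real corpora only follow the law approximately.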

Implications:

  • Few words (top ~1000) cover ~80% of text
  • Most word types are rare; many (hapax legomena) appear only once
  • Power-law distribution, not Gaussian
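The coverage implication can be examined with a toy calculation: under a pure Zipf distribution truncated at a finite vocabulary size, the coverage of the top n words is the ratio of harmonic numbers H(n) / H(V). The vocabulary size below is a hypothetical choice for illustration:

```python
from math import fsum

def zipf_coverage(top_n, vocab_size):
    """Fraction of tokens covered by the top_n most frequent words,
    assuming a pure Zipf distribution f(r) = k / r truncated at
    vocab_size (a hypothetical size, for illustration)."""
    harmonic = lambda n: fsum(1.0 / r for r in range(1, n + 1))
    return harmonic(top_n) / harmonic(vocab_size)

# Coverage of the top 1,000 words in a 50,000-word vocabulary
print(f"{zipf_coverage(1000, 50_000):.1%}")
```

The idealized 1/r model yields only about two-thirds coverage here, below the ~80% empirical figure; this gap is consistent with the deviations from pure 1/r noted under Variants and history.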

Example

Corpus: 1M token English text

Rank   Word   Frequency   Proportion
1      the    70,000      7%
2      and    36,000      3.6%
3      to     28,000      2.8%
10     that   7,000       0.7%
100    back   700         0.07%
1000   vain   70          0.007%

Log-log plot: straight line with slope −1 (Zipf exponent)
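The slope can be estimated from the table above by least-squares regression of log(frequency) on log(rank), a minimal sketch of how the Zipf exponent is fitted in practice:

```python
import math

# Frequency table from the example corpus above
ranks = [1, 2, 3, 10, 100, 1000]
freqs = [70_000, 36_000, 28_000, 7_000, 700, 70]

# Least-squares slope in log-log space
xs = [math.log(r) for r in ranks]
ys = [math.log(f) for f in freqs]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
print(round(slope, 2))  # close to -1, the Zipf exponent
```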

Variants and history

Zipf’s Law was formulated by George Kingsley Zipf (1949) from empirical observations. Later studies showed deviations from pure 1/r behavior, leading to modified forms f(r) ∝ 1/r^α with exponent α ≠ 1. The Zipf–Mandelbrot variant, f(r) ∝ (r + b)^−α, often fits better, especially at low ranks. The law holds across languages, corpora, and even non-linguistic systems (city sizes, income distributions). Its origin remains debated; proposed explanations include compression, least-effort optimization, and stochastic models of language dynamics.

When to use it

Remember Zipf’s Law when:

  • Designing IR indexes (few words carry most discriminative signal)
  • Building language models (smoothing crucial for rare words)
  • Estimating vocabulary coverage (top 1k words cover ~80%)
  • Corpus analysis (predicting vocabulary size from token count)
  • Understanding natural language distribution

Zipf’s Law has practical implications: the 80–20 rule (roughly 80% coverage from about 20% of the vocabulary) is useful for language learning and IR optimization.

See also