Hapax Legomenon
What it is
A hapax legomenon (plural: hapax legomena; often shortened to "hapax", plural "hapaxes") is a word or term that occurs exactly once in a corpus. Hapaxes are common in large corpora and are one indicator of vocabulary breadth. They challenge language models, which have little to learn from a single occurrence, and information retrieval (IR) systems, where such terms are sparsely represented. Understanding how hapaxes are distributed matters for corpus analysis and for choosing a smoothing strategy.
[illustrate: Distribution of word frequencies showing hapax region (frequency = 1); proportion of vocabulary as hapaxes (~30–50% of vocabulary)]
How it works
Vocabulary distribution (Zipf’s Law):
- Most word types are rare: in large corpora, roughly 30–50% of the vocabulary occurs exactly once (hapaxes)
- Example (rough figures): English Wikipedia, ~1M unique words, of which ~500k are hapaxes
- Remainder: ~500k words occur two or more times
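Why roughly half? A back-of-the-envelope derivation, assuming an idealized Zipf distribution with exponent 1, gives a feel for it. Real corpora only approximate this, and the symbols f, r, C, V and V_m are introduced here purely for illustration:

```latex
% Zipf's law (idealized, exponent 1): the word of rank r has frequency
f(r) \approx \frac{C}{r}

% A word has frequency at least m when r \le C/m, so the number of
% types with frequency in [m, m+1) is roughly
V_m \approx \frac{C}{m} - \frac{C}{m+1} = \frac{C}{m(m+1)}

% The total vocabulary (frequency at least 1) is V \approx C, hence
\frac{V_m}{V} \approx \frac{1}{m(m+1)}, \qquad \frac{V_1}{V} \approx \frac{1}{2}
```

Under these assumptions about half of all types are hapaxes, which sits at the upper end of the 30–50% range; a different exponent or a small corpus shifts the share.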
Implications:
- Language models: hapaxes are hard to learn from; a single occurrence gives too little evidence to generalize
- Smoothing: models must reserve probability mass for rare and unseen words; backoff strategies help
- IR: a hapax is highly discriminative (it points to a single document) but sparsely represented, so there is little evidence to match on
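The smoothing point can be illustrated with add-one (Laplace) smoothing on a unigram model. This is a minimal sketch, not a production method: the function name `laplace_unigram`, the toy sentence, and the assumed vocabulary size of 10 are all hypothetical choices made for the example.

```python
from collections import Counter


def laplace_unigram(tokens, vocab_size, alpha=1.0):
    """Return a function giving add-alpha smoothed unigram probabilities.

    vocab_size is the assumed size of the full vocabulary, including
    words never seen in `tokens` (a modelling assumption).
    """
    counts = Counter(tokens)
    total = len(tokens)
    if vocab_size < len(counts):
        raise ValueError("vocab_size must cover all observed types")

    def prob(word):
        # Counter returns 0 for unseen words, so they still receive alpha.
        return (counts[word] + alpha) / (total + alpha * vocab_size)

    return prob


tokens = "the cat sat on the mat".split()
prob = laplace_unigram(tokens, vocab_size=10)

p_the = prob("the")  # seen twice
p_cat = prob("cat")  # hapax
p_dog = prob("dog")  # unseen in the corpus
print(p_the, p_cat, p_dog)
```

With add-one smoothing a hapax ends up only slightly more probable than a word that was never seen, which is one reason more refined schemes such as Kneser-Ney are usually preferred for n-gram models.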
Example
Corpus: "The quick brown fox jumps over the lazy dog. Sphinx of black quartz judge my vow."
Frequencies (case-insensitive, punctuation ignored):
the: 2
quick: 1 (hapax)
brown: 1 (hapax)
fox: 1 (hapax)
jumps: 1 (hapax)
over: 1 (hapax)
lazy: 1 (hapax)
dog: 1 (hapax)
Sphinx: 1 (hapax)
... (all remaining words are hapaxes)
Hapaxes: 14 of 15 unique words (~93% of vocabulary); in a corpus this small, almost every word is a hapax
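The counts for this toy corpus can be reproduced with a short script. This is a minimal sketch, assuming a simple tokenization (lowercasing, then keeping runs of letters and apostrophes) so that "The" and "the" count as one word; a different tokenizer would give different counts.

```python
import re
from collections import Counter

corpus = ("The quick brown fox jumps over the lazy dog. "
          "Sphinx of black quartz judge my vow.")

# Assumed tokenization: lowercase, keep runs of letters and apostrophes.
tokens = re.findall(r"[a-z']+", corpus.lower())
counts = Counter(tokens)

# A hapax is any type whose count is exactly 1.
hapaxes = sorted(w for w, c in counts.items() if c == 1)
hapax_ratio = len(hapaxes) / len(counts)

print(f"tokens={len(tokens)} types={len(counts)} hapaxes={len(hapaxes)}")
print(f"hapax share of vocabulary: {hapax_ratio:.0%}")
```

The same few lines work on a larger corpus once `corpus` is read from a file, although a real tokenizer is advisable there.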
Scaling to a real corpus (rough figures):
English Wikipedia: ~3.8B tokens, ~1M unique words (types)
Hapaxes: ~500k words (~50% of vocabulary)
Variants and history
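The term comes from Greek (hapax legomenon = “said once”) and originated in classical philology, where it marks words attested only once in a body of texts. Hapaxes have since been studied systematically in corpus linguistics and computational linguistics. Zipf’s Law (a few words are very frequent, most are rare) and Heaps’ Law (vocabulary keeps growing as the corpus grows) together explain why hapaxes are so prevalent. Smoothing techniques such as Laplace and Kneser-Ney address the hapax problem in n-gram models. Modern neural models handle hapaxes better through subword tokenization, which splits rare words into more frequent pieces.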
When to use it
Consider hapaxes when:
- Analyzing corpus vocabulary and richness
- Building language models (smoothing strategy matters)
- Understanding model generalization challenges
- Domain adaptation (many domain-specific hapaxes)
- Comparing corpora across languages
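The hapax proportion is an indicator of vocabulary richness: a higher hapax % suggests diverse, open-vocabulary text. Because the proportion also depends on corpus size, compare corpora of similar size.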