Corpus
-
Zipf's Law
Empirical observation that a term's frequency is roughly inversely proportional to its frequency rank; explains why a few words account for most of the tokens in a corpus.
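A minimal sketch of the rank-frequency table that Zipf's Law describes. The function name `rank_frequency` and the toy sentence are illustrative assumptions; a corpus this small only shows the mechanics, not the law itself.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, term, count) triples, most frequent term first."""
    counts = Counter(tokens)
    return [
        (rank, term, count)
        for rank, (term, count) in enumerate(counts.most_common(), start=1)
    ]

# Toy corpus (assumed for illustration only).
tokens = "the cat and the dog and the bird".split()
table = rank_frequency(tokens)

for rank, term, count in table:
    # Under an idealized Zipf distribution, rank * count stays roughly constant.
    print(rank, term, count, rank * count)
```

On a large corpus, plotting count against rank on log-log axes should give an approximately straight line if the law holds.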
-
Vocabulary
Set of unique terms (types) appearing in a corpus; fundamental to information retrieval and language analysis. Distinct from tokens, which are the individual occurrences of those terms.
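A short sketch of the type/token distinction, assuming whitespace tokenization and no normalization (both simplifications; real systems usually lowercase, stem, or otherwise normalize first). The helper name `vocabulary` is hypothetical.

```python
def vocabulary(tokens):
    """Return the set of unique terms (types) in a token sequence."""
    return set(tokens)

# Toy corpus (assumed for illustration only).
tokens = "to be or not to be".split()
vocab = vocabulary(tokens)

n_tokens = len(tokens)  # every occurrence counts
n_types = len(vocab)    # each distinct term counts once
print(n_tokens, n_types, sorted(vocab))
```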
-
Heaps' Law
Empirical law that vocabulary size grows sub-linearly with corpus size; predicts the number of distinct terms V from the token count N via a power law, V = K * N^β with 0 < β < 1.
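A hedged sketch of both sides of Heaps' Law: the power-law estimate, commonly written V = K * N^β, and the empirical growth curve it is fitted to. The names `heaps_estimate` and `vocabulary_growth` are hypothetical, and the values of `k` and `beta` used here are placeholders; in practice they are fitted per corpus.

```python
def heaps_estimate(n_tokens, k, beta):
    """Estimate vocabulary size from token count via V = k * N**beta."""
    return k * n_tokens ** beta

def vocabulary_growth(tokens):
    """Return the number of distinct terms seen after each token."""
    seen = set()
    growth = []
    for tok in tokens:
        seen.add(tok)
        growth.append(len(seen))
    return growth

# Placeholder parameters, not fitted values.
estimate = heaps_estimate(10_000, k=10.0, beta=0.5)

# Toy token stream: the curve flattens as terms start to repeat.
growth = vocabulary_growth("a b a c b d".split())
print(estimate, growth)
```

Fitting `k` and `beta` typically means a linear regression of log V on log N over the measured growth curve.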
-
Hapax Legomenon
Term occurring exactly once in a corpus; a high share of such terms indicates vocabulary richness, and their sparsity poses challenges for language models and IR systems.
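A minimal sketch of extracting hapax legomena from a token list. The function name `hapax_legomena` and the toy sentence are assumptions made for illustration.

```python
from collections import Counter

def hapax_legomena(tokens):
    """Return the set of terms that occur exactly once."""
    counts = Counter(tokens)
    return {term for term, count in counts.items() if count == 1}

# Toy corpus (assumed for illustration only).
tokens = "the cat and the dog and the bird".split()
hapaxes = hapax_legomena(tokens)

# Share of the vocabulary made up of single-occurrence terms.
hapax_ratio = len(hapaxes) / len(set(tokens))
print(sorted(hapaxes), hapax_ratio)
```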