Vocabulary
-
Vocabulary
Set of unique terms (types) appearing in a corpus; fundamental to information retrieval and language analysis. Distinguished from tokens, the individual occurrences of those terms.
-
Type
A unique word form in a corpus; distinguished from a token, which is a single occurrence of a word form. The vocabulary is the set of all types.
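
The type/token distinction can be illustrated with a short count. This is a minimal sketch that assumes plain whitespace splitting as the tokenisation rule; the example sentence and variable names are illustrative only.

```python
from collections import Counter

# Toy corpus; splitting on whitespace is an assumption made for this sketch.
text = "the cat sat on the mat the end"
tokens = text.split()            # every occurrence counts as a token

type_counts = Counter(tokens)    # maps each type to its number of occurrences
vocabulary = set(type_counts)    # the vocabulary is the set of all types

num_tokens = len(tokens)
num_types = len(vocabulary)

print(num_tokens, num_types)     # 8 tokens, 6 types
print(type_counts["the"])        # the type "the" has 3 tokens
```

Here "the" is one type realised by three tokens, so the corpus has more tokens than types.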
-
Tokeniser Vocabulary
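Fixed set of subword units, either learned from data or predefined, used for tokenisation; typically 32k–128k entries, balancing compression (a larger vocabulary yields fewer tokens per text) against flexibility (smaller units can compose unseen words).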
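
As a rough illustration of how a fixed subword vocabulary is applied, the sketch below segments a word by greedy longest-match lookup. The matching rule, the toy vocabulary, the `<unk>` fallback, and the function name `greedy_tokenise` are all assumptions for illustration; production tokenisers (for example BPE-based ones) learn their vocabularies from data and apply different segmentation rules.

```python
def greedy_tokenise(word, vocab, unk="<unk>"):
    """Split `word` into the longest vocabulary entries, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate span until it matches a vocabulary entry.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No entry covers this position: the whole word is unknown.
            return [unk]
        pieces.append(word[start:end])
        start = end
    return pieces

# Hypothetical toy vocabulary; real vocabularies hold tens of thousands of entries.
toy_vocab = {"token", "is", "ation", "un", "a", "t", "i", "o", "n"}

print(greedy_tokenise("tokenisation", toy_vocab))  # ['token', 'is', 'ation']
print(greedy_tokenise("xyz", toy_vocab))           # ['<unk>']
```

Including single characters in the vocabulary is what keeps most unseen words representable, at the cost of longer token sequences.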
-
Hapax Legomenon
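Term occurring exactly once in a corpus. A high proportion of hapax legomena indicates vocabulary richness; such terms pose challenges for language models and IR systems because a single occurrence gives little evidence to learn from.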
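
Hapax legomena can be found by counting occurrences and keeping the terms seen exactly once. A minimal sketch, assuming whitespace tokenisation; the helper name `hapax_legomena` and the ratio of hapaxes to types are illustrative choices, not a standard API.

```python
from collections import Counter

def hapax_legomena(tokens):
    """Return the sorted list of terms that occur exactly once."""
    counts = Counter(tokens)
    return sorted(term for term, count in counts.items() if count == 1)

# Toy corpus; splitting on whitespace is an assumption made for this sketch.
tokens = "the cat sat on the mat the end".split()

hapaxes = hapax_legomena(tokens)
# Share of the vocabulary (types) made up of hapax legomena.
hapax_ratio = len(hapaxes) / len(set(tokens))

print(hapaxes)  # ['cat', 'end', 'mat', 'on', 'sat']
```

In this toy corpus five of the six types are hapax legomena; only "the" occurs more than once.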