Type
What it is
A type is a unique word form (or term) in a corpus. For example, “cat” is a single type but can appear multiple times in text (multiple tokens of that type). The distinction between type and token is fundamental: vocabulary is the set of types; corpus size is measured in tokens.
[illustrate: Text with highlighted tokens of same type; vocabulary set showing unique types]
How it works
Type vs. Token:
- Type: Unique form (e.g., “cat” is one type)
- Token: Single occurrence (e.g., “The cat sat on the cat” has two tokens of type “cat”)
- Vocabulary: Set of all types in corpus
- Corpus size: Token count (total word occurrences)
Measurements:
- Type count: |V| (vocabulary size)
- Token count: N (corpus size)
- Type-token ratio: |V| / N (lexical diversity)
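The three measurements above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `tokenize` helper and its regex are assumptions (lowercase everything, keep letters, digits and apostrophes, drop punctuation), and a real pipeline may tokenize quite differently.

```python
import re
from collections import Counter

def tokenize(text):
    """Assumed tokenizer: lowercase the text and keep runs of letters, digits, apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def type_token_stats(text):
    """Return (type count |V|, token count N, type-token ratio, per-type counts)."""
    tokens = tokenize(text)
    counts = Counter(tokens)       # maps each type to its number of tokens
    n_tokens = len(tokens)         # N: corpus size
    n_types = len(counts)          # |V|: vocabulary size
    ttr = n_types / n_tokens if n_tokens else 0.0
    return n_types, n_tokens, ttr, counts

n_types, n_tokens, ttr, counts = type_token_stats("The cat sat on the cat")
print(n_types, n_tokens, round(ttr, 2))  # 4 6 0.67
print(counts["cat"])                     # 2
```

Note that the type-token ratio falls as a text gets longer, because frequent types keep recurring, so it is only comparable between texts of similar length.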
Example
Text: "The quick brown fox jumps over the lazy dog dog."
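Tokens: 10 words (11 if the final period is counted as a token)
Types: {the, quick, brown, fox, jumps, over, lazy, dog} = 8 types
(Note: "The" and "the" count as one type after lowercasing; "the" and "dog" each appear twice but are one type each)
Vocabulary: {the, quick, brown, fox, jumps, over, lazy, dog}
Type-token ratio: 8 / 10 = 0.80 (words only; counting the period as both a token and a type gives 9 / 11 ≈ 0.82)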
Contrast:
Word "dog": 1 type, 2 tokens
Word "the": 1 type, 2 tokens
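The counts in this example depend on two tokenization choices: whether case is folded and whether punctuation counts as a token. The sketch below makes both choices explicit for the example sentence. The function name `count_types_and_tokens` and its parameters are hypothetical, introduced only for illustration.

```python
import re

SENTENCE = "The quick brown fox jumps over the lazy dog dog."

def count_types_and_tokens(text, lowercase=True, keep_punctuation=False):
    """Return (type count, token count) under the chosen tokenization settings."""
    if lowercase:
        text = text.lower()
    # \w+ matches words; [^\w\s] matches a single punctuation character.
    pattern = r"\w+|[^\w\s]" if keep_punctuation else r"\w+"
    tokens = re.findall(pattern, text)
    return len(set(tokens)), len(tokens)

print(count_types_and_tokens(SENTENCE))                         # (8, 10)
print(count_types_and_tokens(SENTENCE, keep_punctuation=True))  # (9, 11)
print(count_types_and_tokens(SENTENCE, lowercase=False))        # (9, 10)
```

With lowercasing and no punctuation, the sentence has 8 types and 10 tokens. Keeping the period adds one token and one type, and keeping case makes "The" and "the" two distinct types.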
Variants and history
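The type-token distinction is classical in linguistics and information retrieval (IR). Vocabulary size (the type count) is a basic statistic in corpus linguistics. Hapax legomena are types that occur exactly once in a corpus. Core-vocabulary studies focus on the most frequent types; a commonly cited rule of thumb is that roughly 1,000 types cover about 80% of running text, though the exact figure depends on the language, the corpus, and how types are defined. In modern NLP, each entry in a model's vocabulary is a type, and models typically learn one embedding per vocabulary entry.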
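Hapax legomena and vocabulary coverage can both be read off a table of per-type counts. The following is a rough sketch on a toy token list; the helper names `hapax_legomena` and `coverage` are illustrative assumptions, and the toy data says nothing about real coverage figures.

```python
from collections import Counter

def hapax_legomena(tokens):
    """Return the types that occur exactly once, sorted alphabetically."""
    counts = Counter(tokens)
    return sorted(t for t, c in counts.items() if c == 1)

def coverage(tokens, k):
    """Fraction of all tokens accounted for by the k most frequent types."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(c for _, c in counts.most_common(k))
    return covered / len(tokens)

tokens = "the cat sat on the mat and the dog sat too".split()
print(hapax_legomena(tokens))          # ['and', 'cat', 'dog', 'mat', 'on', 'too']
print(round(coverage(tokens, 2), 2))   # 0.45 ("the" and "sat" cover 5 of 11 tokens)
```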
When to use it
The type concept is useful when:
- Analyzing vocabulary richness and diversity
- Studying rare and frequent words
- Building IR systems (the index is keyed on types)
- Planning language learning (core vocabulary size)
- Comparing corpora (vocabulary coverage)
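Type frequencies approximately follow Zipf’s law: a type's frequency is roughly inversely proportional to its frequency rank, so a few common types account for most tokens while most types are rare.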
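To see the Zipfian pattern in a corpus, one can rank types by frequency and compare the observed counts with an idealized curve. The sketch below assumes the simplest form of the law, f(r) = f(1) / r^s with exponent s = 1; the function names and the toy token list are illustrative only, and real corpora fit this curve only approximately.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, type, frequency) rows, most frequent type first."""
    counts = Counter(tokens)
    return [(rank, t, c)
            for rank, (t, c) in enumerate(counts.most_common(), start=1)]

def zipf_expected(top_frequency, rank, s=1.0):
    """Frequency predicted by an idealized Zipf curve f(r) = f(1) / r**s."""
    return top_frequency / rank ** s

tokens = "the the the the of of and a".split()
rows = rank_frequency(tokens)
top = rows[0][2]  # frequency of the rank-1 type
for rank, t, freq in rows:
    # Print observed frequency next to the idealized prediction.
    print(rank, t, freq, round(zipf_expected(top, rank), 2))
```

On a real corpus, plotting frequency against rank on log-log axes should give a roughly straight line if the distribution is Zipfian.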