Type

What it is

A type is a distinct word form (or term) in a corpus. For example, “cat” is a single type even if it appears many times in a text; each appearance is a token of that type. The distinction between type and token is fundamental: the vocabulary is the set of types, while corpus size is measured in tokens.

[illustrate: Text with highlighted tokens of same type; vocabulary set showing unique types]

How it works

Type vs. Token:

  • Type: Unique form (e.g., “cat” is one type)
  • Token: A single occurrence of a type (e.g., “The cat sat on the cat” has two tokens of type “cat”)
  • Vocabulary: The set of all types in the corpus
  • Corpus size: The token count (total word occurrences)

Measurements:

  • Type count: |V| (vocabulary size)
  • Token count: N (corpus size)
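  • Type-token ratio: |V| / N (a simple measure of lexical diversity; it tends to fall as N grows, so it is best compared across texts of similar length)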
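
The measurements above can be sketched in a few lines of Python. This is a minimal illustration, not a standard implementation: the function name type_token_stats is hypothetical, and the tokenization (lowercased runs of letters, digits, and apostrophes, with punctuation dropped) is an assumption that a real analysis would need to choose deliberately.

```python
import re

def type_token_stats(text):
    """Return (type_count, token_count, type_token_ratio) for a text.

    Assumption: tokens are lowercased runs of letters, digits, or
    apostrophes; punctuation is discarded.
    """
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    types = set(tokens)                      # the vocabulary V
    n = len(tokens)                          # the corpus size N
    ttr = len(types) / n if n else 0.0       # |V| / N, guarding empty input
    return len(types), n, ttr

v, n, ttr = type_token_stats("The cat sat on the cat")
print(v, n, round(ttr, 2))  # 4 6 0.67
```

Here the six tokens fall into four types (the, cat, sat, on), so the ratio is 4 / 6.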

Example

Text: "The quick brown fox jumps over the lazy dog dog."

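Tokens: 11 counting the final period, 10 counting words only
Types: {the, quick, brown, fox, jumps, over, lazy, dog} = 8 types
  (Note: with case folding, "The" and "the" are one type; "dog" appears twice but is also one type)

Vocabulary: {the, quick, brown, fox, jumps, over, lazy, dog}
Type-token ratio: 8 / 10 = 0.80 (words only; counting the period as both a token and a type gives 9 / 11 ≈ 0.82)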

Per-type counts:
Word "dog": 1 type, 2 tokens
Word "the": 1 type, 2 tokens
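
The counts in this example depend on two choices: whether case is folded and whether punctuation counts as a token. The sketch below makes both choices explicit. The helper tokenize and its keep_punct flag are illustrative names, not part of any library, and lowercasing is an assumption (a case-sensitive analysis would treat "The" and "the" as separate types).

```python
import re
from collections import Counter

TEXT = "The quick brown fox jumps over the lazy dog dog."

def tokenize(text, keep_punct=False):
    # Lowercase so "The" and "the" fall under one type (an assumption).
    # With keep_punct=True, each punctuation mark is its own token.
    pattern = r"\w+|[^\w\s]" if keep_punct else r"\w+"
    return re.findall(pattern, text.lower())

for keep_punct in (False, True):
    counts = Counter(tokenize(TEXT, keep_punct))
    n_tokens = sum(counts.values())   # total occurrences
    n_types = len(counts)             # distinct forms
    print(keep_punct, n_tokens, n_types, round(n_types / n_tokens, 2))

word_counts = Counter(tokenize(TEXT))
print(word_counts["dog"], word_counts["the"])  # 2 2
```

With words only this gives 10 tokens and 8 types; keeping the period gives 11 tokens and 9 types. Whichever convention is used, the type count and the token count should come from the same tokenization.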

Variants and history

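The type-token distinction is classical in linguistics and information retrieval (IR). Vocabulary size (the type count) is a basic quantity in corpus studies. Hapax legomena are types that occur exactly once in a corpus. Core-vocabulary studies focus on the most frequent types; it is often reported that roughly 1,000 types cover about 80% of running text, though the figure varies by language and corpus. Modern NLP uses types (words or subword units) as the vocabulary entries to which embeddings are assigned.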
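
Two of the notions above, hapax legomena and the coverage achieved by the most frequent types, are easy to compute from a frequency table. The following is a minimal sketch on a toy token list; the function names hapax_legomena and coverage are hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

def hapax_legomena(tokens):
    """Return the types that occur exactly once, sorted alphabetically."""
    counts = Counter(tokens)
    return sorted(t for t, c in counts.items() if c == 1)

def coverage(tokens, k):
    """Fraction of all tokens accounted for by the k most frequent types."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(c for _, c in counts.most_common(k))
    return covered / len(tokens)

# Toy corpus; whitespace splitting is an assumption for illustration.
tokens = "the cat sat on the mat and the dog sat on the cat".split()
print(hapax_legomena(tokens))          # ['and', 'dog', 'mat']
print(round(coverage(tokens, 4), 2))   # 0.77
```

In this toy corpus the four most frequent types (the, cat, sat, on) account for 10 of the 13 tokens. Coverage figures for real corpora have to be measured on that corpus rather than assumed.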

When to use it

The notion of a type matters when:

  • Analyzing vocabulary richness and diversity
  • Studying rare and frequent words
  • Building IR systems (the index is keyed on types)
  • Supporting language learning (estimating core vocabulary size)
  • Comparing corpora (vocabulary coverage)

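Type frequencies approximately follow Zipf’s law: a type’s frequency is roughly inversely proportional to its frequency rank, so a few common types account for most tokens while most types are rare.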
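
One common way to check for Zipf-like behaviour is to rank types by frequency and fit a line to log frequency against log rank; a slope near -1 corresponds to frequency proportional to 1 / rank. The sketch below is illustrative only: the names rank_frequency and zipf_slope are hypothetical, and the toy corpus is constructed to fit the law exactly, which real text will not do.

```python
import math
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, type, frequency) triples, most frequent type first."""
    counts = Counter(tokens)
    return [(rank, t, c)
            for rank, (t, c) in enumerate(counts.most_common(), start=1)]

def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank)."""
    table = rank_frequency(tokens)
    if len(table) < 2:
        raise ValueError("need at least two types to fit a slope")
    xs = [math.log(rank) for rank, _, _ in table]
    ys = [math.log(freq) for _, _, freq in table]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Toy corpus built so that frequency = 12 / rank (an idealized case).
tokens = ["the"] * 12 + ["of"] * 6 + ["and"] * 4 + ["to"] * 3
for rank, t, freq in rank_frequency(tokens):
    print(rank, t, freq, rank * freq)  # rank * freq is 12 on every row
print(round(zipf_slope(tokens), 3))    # -1.0
```

On real corpora the product of rank and frequency is only roughly constant, and the fitted slope usually deviates from -1, especially for the rarest types.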

See also