Collocation

What it is

A collocation is a sequence of words that co-occur more frequently than expected by chance. Collocations are conventional multi-word expressions in which one wording is preferred over a near-synonym: “strong tea” (not “powerful tea”), “black coffee” (not “dark coffee”). Identifying collocations matters for understanding meaning, building language models, and extracting information.

[illustrate: Examples of collocations (fixed phrases) vs. non-collocations (random combinations); frequency distribution showing collocations cluster above chance]

How it works

  1. Statistical significance:

    • The words in a collocation co-occur significantly more often than chance predicts
    • Association is scored with pointwise mutual information (PMI) or tested with a t-test, chi-square test, or log-likelihood ratio
  2. Extraction:

    • Extract candidate n-grams (bigrams, trigrams, etc.)
    • Score each candidate with an association measure
    • Keep candidates above a threshold, or rank them by score
  3. Examples:

    • “strong tea” (habitual collocation)
    • “catch my eye” (verb-noun collocation)
    • “break the ice” (idiom)
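
The extraction steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name rank_bigrams_by_pmi, the min_count cutoff, and the toy sentence are all assumptions made for the example, and the corpus size N is approximated by the token count.

```python
import math
from collections import Counter

def rank_bigrams_by_pmi(tokens, min_count=2):
    """Return (bigram, pmi) pairs sorted from strongest to weakest association."""
    n = len(tokens)  # corpus size N (approximation: also used for bigram positions)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scored = []
    for (w1, w2), observed in bigrams.items():
        if observed < min_count:  # PMI is unreliable for very rare pairs
            continue
        # Expected count if w1 and w2 were independent: c(w1) * c(w2) / N
        expected = unigrams[w1] * unigrams[w2] / n
        scored.append(((w1, w2), math.log2(observed / expected)))
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical toy corpus, far too small for real use
tokens = ("strong tea and strong tea with the milk and the sugar "
          "and the tea with the milk").split()

for bigram, pmi in rank_bigrams_by_pmi(tokens, min_count=2)[:3]:
    print(bigram, round(pmi, 2))
```

The min_count filter reflects a common practical choice: PMI tends to overrate pairs that occur only once or twice, so rare candidates are usually dropped or scored with a test such as the log-likelihood ratio instead.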

Example

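Corpus analysis (illustrative counts in a 1M-token corpus; "strong" occurs 10k times, "tea" 5k times, "coffee" 3k times):

Bigram: ("strong", "tea")
Observed: 50 times
Expected (if independent): (10k × 5k) / 1M = 50
Ratio: 50 / 50 = 1.0, so PMI = log2(1.0) = 0 (no association beyond chance)

Bigram: ("strong", "coffee")
Observed: 80 times
Expected (if independent): (10k × 3k) / 1M = 30
Ratio: 80 / 30 ≈ 2.67, so PMI = log2(2.67) ≈ 1.42 (positive association)

PMI("strong", "coffee") > PMI("strong", "tea")
→ With these counts, "strong coffee" is the stronger collocation candidate. The ratio alone is not a significance test, so a t-test, chi-square test, or log-likelihood ratio would normally be used to confirm it.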
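
The arithmetic in this example can be checked with a short Python sketch. The helper name association is an assumption made for illustration; the counts are the ones used above (10k for "strong", 5k for "tea", 3k for "coffee", in 1M tokens).

```python
import math

def association(observed, count_w1, count_w2, total_tokens):
    """Return (expected count, observed/expected ratio, PMI in bits)."""
    expected = count_w1 * count_w2 / total_tokens  # count under independence
    ratio = observed / expected
    return expected, ratio, math.log2(ratio)

N = 1_000_000
tea = association(50, 10_000, 5_000, N)      # ("strong", "tea")
coffee = association(80, 10_000, 3_000, N)   # ("strong", "coffee")

print("strong tea:    expected=%.0f ratio=%.2f PMI=%.2f" % tea)
# strong tea:    expected=50 ratio=1.00 PMI=0.00
print("strong coffee: expected=%.0f ratio=%.2f PMI=%.2f" % coffee)
# strong coffee: expected=30 ratio=2.67 PMI=1.42
```

PMI is the base-2 logarithm of the observed/expected ratio, so a ratio of 1.0 maps to 0 bits and any ratio above 1.0 gives a positive score.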

Variants and history

The concept of collocation dates to Firth (1957, “You shall know a word by the company it keeps”). Computational collocation detection emerged in the 1990s–2000s via association measures and statistical tests (PMI, t-test, chi-square, log-likelihood ratio). Phrasal verbs and idioms are special cases of collocations. More recent methods also draw on neural embeddings, which are trained on co-occurrence patterns and place words that appear in similar contexts close together in embedding space.

When to use it

Identify collocations for:

  • Building phrase-based IR indexes
  • Language learning (teaching natural phrases)
  • Text generation (using collocations improves fluency)
  • Lexicography (dictionary construction)
  • Machine translation (handling idiomatic phrases)

Collocation detection can improve language model quality because it captures semantic units larger than individual words.

See also