Collocation

What it is

A collocation is a sequence of words that co-occur more often than chance would predict. Collocations are conventional multi-word expressions in which one wording is preferred over a near-synonymous alternative: “strong tea” (not “powerful tea”), “black coffee” (not “dark coffee”). Identifying collocations is important for understanding meaning, building language models, and information extraction.

[illustrate: Examples of collocations (fixed phrases) vs. non-collocations (random combinations); frequency distribution showing collocations cluster above chance]

How it works

Statistical significance:
- Collocations co-occur more often than would be expected if the words were independent
- The strength of association is measured by PMI (pointwise mutual information), t-test, chi-square, or log-likelihood ratio
Extraction:
- Extract candidate n-grams (bigrams, trigrams, etc.) from the corpus
- Compute an association score for each candidate
- Apply a threshold or rank candidates by score, keeping the highest-scoring ones
Examples:
- “strong tea” (habitual collocation)
- “catch my eye” (fixed verb-noun expression)
- “break the ice” (idiom)
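
The extraction steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `rank_bigrams_by_pmi`, the `min_count` parameter, and the toy sentence are all assumptions made for the example, and PMI is only one of the measures listed above.

```python
import math
from collections import Counter

def rank_bigrams_by_pmi(tokens, min_count=2):
    """Score adjacent word pairs by PMI and return them strongest first.

    Hypothetical helper for illustration; expected counts use the same
    simplification as a hand calculation: count(w1) * count(w2) / N.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))  # adjacent pairs only
    total = len(tokens)

    scored = []
    for (w1, w2), observed in bigrams.items():
        if observed < min_count:
            continue  # drop rare pairs, whose PMI is unreliable
        expected = unigrams[w1] * unigrams[w2] / total
        scored.append(((w1, w2), math.log2(observed / expected)))

    # Rank from strongest to weakest association
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Toy corpus, far too small for real use
tokens = ("she drinks strong tea and he drinks strong tea but "
          "the tea and the cake are on the table").split()
ranked = rank_bigrams_by_pmi(tokens, min_count=2)
for pair, pmi in ranked:
    print(pair, round(pmi, 2))
```

The `min_count` filter is a common design choice because PMI tends to overrate pairs that occur only once or twice. On a corpus this small the ranking is noisy, so the output should be read as a demonstration of the mechanics rather than as evidence about English.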

Example

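Corpus analysis (illustrative counts, chosen to contrast a chance-level pair with an associated one):

Bigram: ("strong", "tea")
Counts: "strong" 10k, "tea" 5k
Observed: 50 times in 1M tokens
Expected (if independent): (10k × 5k) / 1M = 50
Ratio: 50 / 50 = 1.0 (no association beyond chance)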

Bigram: ("strong", "coffee")
Counts: "strong" 10k, "coffee" 3k
Observed: 80 times in 1M tokens
Expected (if independent): (10k × 3k) / 1M = 30
Ratio: 80 / 30 = 2.67 (well above chance; a collocation candidate)

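PMI("strong", "coffee") = log2(2.67) ≈ 1.42 > PMI("strong", "tea") = log2(1.0) = 0
→ with these counts, "strong coffee" is the stronger collocation candidate; a significance test (t-test, chi-square, log-likelihood ratio) would show whether the excess is reliable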
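
The arithmetic in this example can be checked with a short script. The sketch below is an illustration under stated assumptions: `association_scores` is a hypothetical helper, and the t-score uses the common corpus-linguistics approximation (O − E) / √O rather than a full t-test.

```python
import math

def association_scores(observed, count_w1, count_w2, total):
    """Compute simple association measures for one bigram (illustrative helper)."""
    # Expected co-occurrences if the two words were independent
    expected = count_w1 * count_w2 / total
    ratio = observed / expected
    return {
        "expected": expected,
        "ratio": ratio,
        "pmi": math.log2(ratio),  # pointwise mutual information, in bits
        # Approximate t-score: variance estimated by the observed count
        "t_score": (observed - expected) / math.sqrt(observed),
    }

N = 1_000_000  # corpus size in tokens, as in the example above
tea = association_scores(observed=50, count_w1=10_000, count_w2=5_000, total=N)
coffee = association_scores(observed=80, count_w1=10_000, count_w2=3_000, total=N)

print("strong tea:   ", tea)
print("strong coffee:", coffee)
```

With the example's counts, the first pair comes out at a ratio of 1.0 and a PMI of 0, while the second has a ratio of about 2.67 and a PMI of about 1.42. The t-score adds information that the ratio alone does not: it grows with the amount of evidence, so a high ratio based on very few observations scores lower than the same ratio based on many.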

Variants and history

The concept of collocation dates to Firth (1957, “You shall know a word by the company it keeps”). Computational collocation detection took shape in the 1990s–2000s, built on association measures and statistical tests (PMI, t-test, chi-square, log-likelihood ratio). Phrasal verbs and idioms are often treated as special cases of collocations. More recent methods draw on neural embeddings, in which words that occur in similar contexts end up with similar neighborhoods in embedding space.

When to use it

Identify collocations for:

- Building phrase-based IR indexes
- Language learning (teaching natural phrases)
- Text generation (using collocations improves fluency)
- Lexicography (dictionary construction)
- Machine translation (handling idiomatic phrases)

Collocation detection can improve language model quality because it captures semantic units that span more than one word.