Collocation
What it is
A collocation is a sequence of words that co-occur more often than chance would predict. Collocations are conventional multi-word expressions: speakers say “strong tea” (not “powerful tea”) and “black coffee” (not “dark coffee”), even though the alternatives are grammatical. Identifying collocations matters for understanding meaning, building language models, and extracting information.
[illustrate: Examples of collocations (fixed phrases) vs. non-collocations (random combinations); frequency distribution showing collocations cluster above chance]
How it works
- Statistical significance:
  - Collocations co-occur more often than chance would predict
  - Association strength is measured with pointwise mutual information (PMI), the t-test, the chi-square test, or the log-likelihood ratio
- Extraction:
  - Extract candidate n-grams (bigrams, trigrams, etc.)
  - Score each candidate with an association measure
  - Keep candidates above a threshold, or rank them by score
- Examples:
  - “strong tea” (habitual collocation)
  - “catch my eye” (idiomatic verb-object collocation)
  - “break the ice” (idiom)
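The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `score_bigrams`, the `min_count` cutoff, and the toy corpus are all assumptions made for the example, and PMI is used as the association measure only because it is the simplest of those listed.

```python
import math
from collections import Counter

def score_bigrams(tokens, min_count=2):
    """Return (bigram, PMI in bits) pairs, highest PMI first.

    Hypothetical helper: min_count drops rare bigrams, whose PMI
    estimates are unreliable.
    """
    n_tokens = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))  # adjacent word pairs
    scored = []
    for (w1, w2), observed in bigrams.items():
        if observed < min_count:
            continue
        # Count expected if the two words occurred independently.
        expected = unigrams[w1] * unigrams[w2] / n_tokens
        scored.append(((w1, w2), math.log2(observed / expected)))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored

# Toy corpus, invented for illustration only.
tokens = ("she drinks strong tea . he drinks strong tea . "
          "the tea is hot . the day is hot .").split()

ranked = score_bigrams(tokens, min_count=2)
for bigram, pmi in ranked:
    print(bigram, round(pmi, 2))
```

On a real corpus the frequency cutoff would usually be higher, and a significance test such as the log-likelihood ratio is often preferred because PMI tends to overrate rare pairs.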
Example
Corpus analysis (illustrative counts, 1M tokens):
Bigram: ("strong", "tea")
Observed: 50 times
Expected (if independent): (10k × 5k) / 1M = 50
Ratio: 50 / 50 = 1.0, so PMI = log2(1.0) = 0 (no association)
Bigram: ("strong", "coffee")
Observed: 80 times
Expected (if independent): (10k × 3k) / 1M = 30
Ratio: 80 / 30 ≈ 2.67, so PMI = log2(2.67) ≈ 1.42 (positive association)
PMI("strong", "coffee") > PMI("strong", "tea")
→ in this corpus, "strong coffee" is the stronger collocation candidate
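The arithmetic in this example can be checked with a short script. It is a sketch under one assumption about the figures: the 10k is read as the count of "strong", and the 5k and 3k as the counts of "tea" and "coffee". The helper names `expected_count` and `pmi_bits` are hypothetical.

```python
import math

def expected_count(count_w1, count_w2, n_tokens):
    """Bigram count expected if the two words occurred independently."""
    return count_w1 * count_w2 / n_tokens

def pmi_bits(observed, count_w1, count_w2, n_tokens):
    """Pointwise mutual information: log2 of observed over expected."""
    return math.log2(observed / expected_count(count_w1, count_w2, n_tokens))

N = 1_000_000  # corpus size in tokens

# Assumed unigram counts: strong = 10k, tea = 5k, coffee = 3k.
tea = pmi_bits(50, 10_000, 5_000, N)
coffee = pmi_bits(80, 10_000, 3_000, N)

print(f"strong tea:    PMI = {tea:.2f} bits")
print(f"strong coffee: PMI = {coffee:.2f} bits")
```

PMI is the base-2 logarithm of the observed-to-expected ratio, so a ratio of 1.0 gives zero bits and any ratio above 1.0 gives a positive score. A high ratio alone does not establish statistical significance; that requires one of the tests listed earlier.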
Variants and history
The concept of collocation dates to Firth (1957, “You shall know a word by the company it keeps”). Computational collocation detection emerged in the 1990s–2000s, based on association measures and statistical tests (PMI, t-test, chi-square). Phrasal verbs and idioms are often treated as special cases of collocations. Modern methods also use neural embeddings, which are trained on co-occurrence statistics, so words that appear in similar contexts end up with similar neighborhoods in embedding space.
When to use it
Identify collocations for:
- Building phrase-based IR indexes
- Language learning (teaching natural phrases)
- Text generation (using collocations improves fluency)
- Lexicography (dictionary construction)
- Machine translation (handling idiomatic phrases)
Collocation detection can improve language model quality, because it captures semantic units that span more than one word.