Decompounding
What it is
Decompounding is the process of splitting compound words into their constituent parts during text analysis. A compound word is formed by joining two or more free morphemes — words that can stand alone — into a single written form, without spaces or hyphens.
This is not an edge case. In German, Dutch, Swedish, Danish, Norwegian, and Finnish, compounding is one of the primary mechanisms for creating new vocabulary. Where English uses a noun phrase (“steam ship company”), German writes a single token: Dampfschifffahrtsgesellschaft. To a search engine, that token is opaque — a search for Schiff (ship) will not match it unless the compound has been split during indexing.
How it works
The core task is to find one or more split points in a token such that each resulting component is a valid word. Two main strategies exist.
Dictionary-based decompounding takes a word list and tries all possible binary or recursive splits of the input string. A split is accepted when every component meets a minimum length threshold and appears in the dictionary. Given Arbeitsplatz:
- Try A + rbeitsplatz — A is not a valid word; discard.
- Continue until Arbeit + splatz — splatz is not valid; discard.
- Try Arbeits + platz — Arbeits is not in the dictionary but Arbeit is, and the trailing s is a known linking morpheme (Fugen-s); strip it, check Arbeit + Platz — both valid. Accept.
This approach is linguistically principled and produces clean splits, but it requires a domain-appropriate dictionary and explicit rules for linking morphemes.
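The dictionary strategy above can be sketched in a few lines. This is a minimal illustration, assuming a toy four-word dictionary and only the Fugen-s as a linking element; a production filter would use a large domain word list and the full set of linking morphemes.

```python
# Minimal sketch of recursive dictionary-based decompounding.
# DICTIONARY and the single "s" linking element are illustrative assumptions.

MIN_SUBWORD = 3
DICTIONARY = {"arbeit", "platz", "dampf", "schiff"}

def split_compound(word, dictionary=DICTIONARY):
    """Return a list of valid components, or None if no full split exists."""
    w = word.lower()
    if len(w) >= MIN_SUBWORD and w in dictionary:
        return [w]
    for i in range(MIN_SUBWORD, len(w) - MIN_SUBWORD + 1):
        head, tail = w[:i], w[i:]
        # Accept the head as-is, or with a trailing Fugen-s stripped.
        candidates = [head]
        if head.endswith("s") and len(head) - 1 >= MIN_SUBWORD:
            candidates.append(head[:-1])
        for h in candidates:
            if h in dictionary:
                rest = split_compound(tail, dictionary)
                if rest is not None:
                    return [h] + rest
    return None

print(split_compound("Arbeitsplatz"))  # ['arbeit', 'platz']
```

The recursion mirrors the worked example: `Arbeits + platz` fails direct lookup, but stripping the Fugen-s recovers `Arbeit`, and the tail `platz` validates on its own.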
Statistical decompounding scores candidate splits using character n-gram frequencies or corpus word frequencies, without a dictionary. Splits where both halves are high-frequency words score higher. This generalises across languages and handles neologisms, but it is noisier — it will sometimes split words that should remain whole.
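A frequency-scoring sketch, using invented corpus counts for illustration; a real system estimates these from a large corpus and also compares the best split against the frequency of the unsplit word before accepting it.

```python
# Sketch of statistical split scoring. FREQ counts are invented for
# illustration; no dictionary is consulted.
import math

FREQ = {"dampf": 3000, "schiff": 5000, "boot": 800}
TOTAL = 1_000_000

def prob(w):
    # Add-one smoothing so unseen strings get a small nonzero probability.
    return (FREQ.get(w, 0) + 1) / TOTAL

def best_split(word, min_len=3):
    """Return the binary split whose components are jointly most probable."""
    w = word.lower()
    splits = [(w[:i], w[i:]) for i in range(min_len, len(w) - min_len + 1)]
    # Geometric mean of component probabilities; higher is better.
    return max(splits, key=lambda s: math.sqrt(prob(s[0]) * prob(s[1])))

print(best_split("Dampfschiff"))  # ('dampf', 'schiff')
```

Because the scorer always returns *some* maximum, a threshold against the unsplit word's own frequency is what prevents spurious splits of non-compounds.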
The Fugen-s problem
German compound formation frequently inserts a linking element between components. The most common is -s: Arbeit + s + Platz → Arbeitsplatz. Others include -es, -en, -er, and -e. A decompounding filter must strip these linking elements before dictionary lookup and must not confuse them with inflectional suffixes or plural forms.
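One way to handle linking elements is to generate every plausible base form of a component before dictionary lookup. This sketch uses the elements listed above; the list and its longest-first ordering are illustrative, and a real filter must also avoid stripping genuine inflectional endings.

```python
# Sketch: generate dictionary-lookup candidates for one component by
# stripping German linking elements. Element list is illustrative.

LINKING_ELEMENTS = ["es", "en", "er", "s", "e"]  # checked longest-first

def lookup_candidates(component, min_len=3):
    """Return the component itself plus forms with one linking element stripped."""
    forms = [component]
    for el in LINKING_ELEMENTS:
        if component.endswith(el) and len(component) - len(el) >= min_len:
            forms.append(component[: -len(el)])
    return forms

print(lookup_candidates("arbeits"))  # ['arbeits', 'arbeit']
print(lookup_candidates("bundes"))   # ['bundes', 'bund', 'bunde']
```

Note that `bundes` yields two stripped candidates; the dictionary lookup, not the stripper, decides which is the real base form.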
Over-decompounding
Without guardrails, a dictionary-based splitter will eagerly split words that are not compounds. Inflation contains in and flation — in is a valid preposition in many dictionaries. Two parameters control this:
- min_subword_size: the minimum character length a component must reach before it is accepted (typically 3–4 characters).
- max_subword_size: the maximum length of a component, used to limit the search space.
Both components must clear min_subword_size independently, which rules out splits like in + flation.
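The guard can be demonstrated directly. In this sketch the toy dictionary deliberately contains both in and flation, to show that it is the length check, not dictionary coverage, that blocks the bad split.

```python
# Sketch of the min/max subword-size guards with a toy dictionary.
# "in" and "flation" are included on purpose.

DICTIONARY = {"in", "flation", "dampf", "schiff"}

def valid_binary_splits(word, dictionary, min_subword_size=3, max_subword_size=15):
    """Return all splits where both components clear the size limits and the dictionary."""
    w = word.lower()
    splits = []
    for i in range(1, len(w)):
        head, tail = w[:i], w[i:]
        sizes_ok = all(min_subword_size <= len(p) <= max_subword_size
                       for p in (head, tail))
        if sizes_ok and head in dictionary and tail in dictionary:
            splits.append((head, tail))
    return splits

print(valid_binary_splits("Inflation", DICTIONARY))    # [] ("in" is too short)
print(valid_binary_splits("Dampfschiff", DICTIONARY))  # [('dampf', 'schiff')]
```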
Example
Input token: Donaudampfschifffahrtsgesellschaft
Dictionary split (greedy longest-prefix, Fugen-s aware):
| Component | Gloss |
|---|---|
| Donau | Danube |
| Dampf | steam |
| Schiff | ship |
| Fahrt | voyage |
| Gesellschaft | company |
The index receives six tokens for this one surface form: the original compound plus all five components. A query for Schiff now matches.
Variants and history
Lucene’s compound word token filters provide two implementations used by Elasticsearch and OpenSearch:
- DictionaryCompoundWordTokenFilter: accepts a word list; splits greedily on the longest matching prefix or suffix. Straightforward to configure; quality depends entirely on dictionary coverage.
- HyphenationCompoundWordTokenFilter: uses Liang hyphenation grammars (the same algorithm TeX uses for line breaking) to propose split points, then validates each candidate against a word list. This handles edge cases where greedy prefix matching would miss a valid split.
Both filters expose min_subword_size (default 2, recommended ≥ 3) and max_subword_size (default 15) parameters.
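In Elasticsearch the dictionary filter is exposed as the dictionary_decompounder token filter. The index-settings fragment below (shown as a Python dict for readability) is a sketch; the filter name, the word list, and the analyzer name are illustrative.

```python
# Sketch of Elasticsearch index settings wiring up a dictionary
# decompounder. The word_list here is a toy; real setups use word_list_path
# pointing at a full dictionary file.

settings = {
    "analysis": {
        "filter": {
            "german_decompounder": {
                "type": "dictionary_decompounder",
                "word_list": ["donau", "dampf", "schiff", "fahrt", "gesellschaft"],
                "min_subword_size": 3,   # raised from the default of 2
                "max_subword_size": 15,  # the default
            }
        },
        "analyzer": {
            "german_compound": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "german_decompounder"],
            }
        },
    }
}
```

Referencing the same analyzer in both the field mapping and the query (the default when no search_analyzer is set) keeps index-time and query-time decompounding in agreement.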
Kuromoji (Japanese) and Nori (Korean) build decompounding into their tokenisers, because Japanese and Korean also write compound and agglutinative forms as single tokens: Nori exposes a decompound_mode parameter (none, discard, or mixed), and Kuromoji's search mode similarly emits both compounds and their parts. These are discussed in the CJK Tokeniser entry.
Scandinavian languages — Swedish, Danish, Norwegian — exhibit the same productive compounding as German and Dutch. Standard Snowball stemmers for these languages offer no decompounding; a dictionary-based filter is the practical approach, though pre-built word lists are less readily available than for German.
When to use it
Decompounding is worth enabling whenever:
- Your corpus contains German, Dutch, Swedish, Danish, or Norwegian text.
- Users are likely to search for component words that appear only inside compounds in the index.
- Recall on noun-heavy domain vocabulary (engineering, legal, medical) is below expectations.
Apply decompounding at both index time and query time. If you decompound only at index time, a compound query term — Dampfschifffahrtsgesellschaft — will not match the split index tokens. If you decompound only at query time, a compound document term will not be found by a simple component query. Both sides of the pipeline must agree.
The cost is index bloat: each compound generates additional tokens. For most use cases this is acceptable. If storage is constrained, use discard mode (emit only the components, not the original) rather than inject mode (emit both); accept that you lose exact-compound matching in return.
Dictionary quality dominates output quality. A word list tuned to your domain — medical German, legal Dutch — will substantially outperform a generic one. Statistical methods are a pragmatic fallback when a suitable dictionary does not exist, at the cost of some spurious splits.