Hunspell
What it is
Hunspell is an open-source morphological analyser, spell checker, and lemmatiser. Given an inflected word form — “running”, “geese”, “unremarkable” — it strips affixes according to a set of language-specific rules and returns the base dictionary form (the lemma). Unlike algorithmic stemmers, which apply heuristic suffix-removal rules to approximate a root, Hunspell works against an actual dictionary: a word is only valid if the resulting base form exists in that dictionary.
It is the spell-checking engine embedded in LibreOffice, OpenOffice, Firefox, Chrome, and macOS, and is used as a lemmatiser in NLP pipelines — most notably as a token filter in Elasticsearch.
How it works
Hunspell loads two files at startup:
- `.dic`: the dictionary file. One entry per line, each consisting of a word stem followed by a `/` and a set of flag characters: `run/ABCDE`. The flags are compact identifiers that indicate which affix rules this word participates in.
- `.aff`: the affix file. Defines all PREFIX and SUFFIX rules. Each rule specifies: the flag it belongs to, a stripping pattern, an addition pattern, and a condition that must match the word before the rule fires.
Together, these two files let Hunspell expand a single dictionary entry into every valid inflected form for that word — and, critically, reverse the process during analysis.
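To make the format concrete, here is a minimal, invented pair of files in Hunspell's syntax (the first line of a `.dic` file is an approximate entry count; `SFX` lines read: flag, characters to strip, characters to add, condition):

```
# en_toy.dic
2
dog/S
bake/G

# en_toy.aff
SFX S Y 1
SFX S 0 s .

SFX G Y 1
SFX G e ing e
```

Here flag `S` adds `s` to any stem (condition `.` matches anything), so `dog/S` licenses “dogs”; flag `G` strips a trailing `e` and adds `ing`, so `bake/G` licenses “baking”. Real dictionaries contain tens of thousands of entries and far richer rule sets.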
Morphological analysis (lemmatisation) works as follows:
- Take the input word.
- Try every applicable suffix rule: strip the suffix, check whether the condition is met, and look up the resulting candidate stem in the `.dic` file.
- If the candidate stem exists and carries the flag that authorises the suffix rule that was stripped, the analysis succeeds — that stem is the lemma.
- Repeat for prefixes and compound rules.
- If no rule produces a known stem, the word is out-of-vocabulary and analysis fails.
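The suffix-stripping loop above can be sketched in a few lines of Python. This is a toy illustration of the idea, not Hunspell's actual implementation: the dictionary, flags, and rules are invented, and real `.aff` conditions are regex-like character classes rather than lambdas.

```python
# Toy Hunspell-style analysis over an in-memory ".dic" (stem -> flags)
# and ".aff" (suffix rules keyed by flag). All entries are invented.

# flag -> (strip, add, condition): a stem satisfying `condition` and
# carrying the flag may have `strip` removed and `add` appended.
SUFFIX_RULES = {
    "S": ("", "s", lambda stem: not stem.endswith("s")),  # plural: dog -> dogs
    "G": ("e", "ing", lambda stem: stem.endswith("e")),   # e-drop: bake -> baking
}

DICTIONARY = {
    "dog": {"S"},
    "bake": {"G"},
}

def lemmatise(word):
    """Return the lemma if some suffix rule maps `word` back to a known stem."""
    if word in DICTIONARY:          # already a base form
        return word
    for flag, (strip, add, cond) in SUFFIX_RULES.items():
        if word.endswith(add):
            # Reverse the rule: remove what it added, restore what it stripped.
            candidate = word[: len(word) - len(add)] + strip
            # The stem must exist AND carry the flag authorising this rule.
            if candidate in DICTIONARY and flag in DICTIONARY[candidate] and cond(candidate):
                return candidate
    return None  # out-of-vocabulary: analysis fails

print(lemmatise("dogs"))    # dog
print(lemmatise("baking"))  # bake
print(lemmatise("ran"))     # None (no irregular entry in this toy dictionary)
```

Note the double check in the lookup: it is not enough for the stripped candidate to exist in the dictionary — it must also carry the flag that licenses the rule, which is what prevents e.g. a mass noun from being analysed as a plural.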
Spell checking uses the same mechanism in reverse: generate all valid surface forms from dictionary entries and check whether the input matches any of them.
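The generation direction can be sketched the same way, again as a toy with invented entries. In practice Hunspell does not materialise every surface form up front (infeasible for agglutinative languages) — it strips affixes at lookup time — but set membership over the expanded forms is an equivalent way to picture spell checking:

```python
# Toy sketch of the generation direction: expand each stem into all
# surface forms its flags license, then spell-check by set membership.
# Rules and entries are invented for illustration.

SUFFIX_RULES = {
    "S": ("", "s"),     # plural: dog -> dogs
    "G": ("e", "ing"),  # gerund with e-drop: bake -> baking
}

DICTIONARY = {"dog": {"S"}, "bake": {"G", "S"}}

def expand(stem, flags):
    """Yield the stem plus every form licensed by its flags."""
    yield stem
    for flag in flags:
        if flag in SUFFIX_RULES:
            strip, add = SUFFIX_RULES[flag]
            if stem.endswith(strip):
                yield stem[: len(stem) - len(strip)] + add

ALL_FORMS = {form for stem, flags in DICTIONARY.items() for form in expand(stem, flags)}

def check(word):
    return word in ALL_FORMS

print(sorted(ALL_FORMS))             # ['bake', 'bakes', 'baking', 'dog', 'dogs']
print(check("dogs"), check("dogz"))  # True False
```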
Example
Given an English Hunspell dictionary, analysing the word “dogs”:
- Try the suffix rule for `-s`: strip “s”, candidate stem “dog”.
- Look up “dog” in `.dic`: found, with a flag that permits the plural suffix rule.
- Return lemma: `dog`.
Analysing “ran”:
- Try suffix rules — none strip to a candidate present in `.dic` as a regular inflection of “ran”.
- Many Hunspell dictionaries handle irregular forms by listing them explicitly in `.dic` with morphological data fields that name the stem (e.g. `ran st:run is:past`), returning lemma `run`.
- If the dictionary does not include “ran” as an irregular entry, analysis fails and the word is returned unchanged or as unknown.
Analysing a neologism like “tokenising” (not yet in the dictionary):
- Strip “-ing” → “tokenis” → not found.
- Strip “-ing” and restore “e” (the e-drop rule in reverse) → “tokenise” → not found.
- No rule succeeds. Result: no lemma returned.
Variants and history
Hunspell was created by László Németh and first released in 2002 as a successor to MySpell (itself a successor to Ispell, which dates to 1971). It introduced a more expressive affix rule system, support for compound words, and better handling of agglutinative languages. It is released under a tri-licence: GPL, LGPL, and MPL.
Spylls is a pure-Python reimplementation of Hunspell, useful for environments where compiling the C extension is impractical. PyHunspell provides Python bindings to the original C library.
Elasticsearch exposes Hunspell via the hunspell token filter, which requires a Hunspell dictionary to be installed on the filesystem of each node. The filter performs morphological analysis at index and query time, making it a genuine lemmatiser rather than a stemmer.
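An index configuration wiring this up might look roughly like the following (a sketch based on the documented filter parameters; the `en_US` dictionary files are assumed to be present under the `hunspell` directory of each node's config path, and analyzer/filter names are invented):

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "en_hunspell": {
          "type": "hunspell",
          "locale": "en_US"
        }
      },
      "analyzer": {
        "en_lemma": {
          "tokenizer": "standard",
          "filter": ["lowercase", "en_hunspell"]
        }
      }
    }
  }
}
```

Because the filter runs at both index and query time, documents and queries are reduced to the same lemmas, which is what makes matching work.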
Language coverage spans dozens of languages. Community-maintained dictionaries for English, French, German, and Spanish are mature and well-tested; minority language dictionaries vary in completeness and maintenance frequency.
When to use it
Use Hunspell when:
- Precision matters more than recall. Hunspell only returns a lemma when it is certain — the form is in the dictionary. Algorithmic stemmers will happily over-stem novel words into meaningless fragments; Hunspell will simply return nothing.
- You are working in Elasticsearch and need a lemmatiser without deploying a full ML model. The `hunspell` token filter is the standard route.
- You need spell checking alongside lemmatisation — both come from the same engine and the same dictionary files at no extra cost.
Prefer an algorithmic stemmer (Porter2, KStem) when:
- You need full recall, including for out-of-vocabulary words. Algorithmic stemmers produce approximate roots for any string; Hunspell fails silently on unknown words.
- You cannot ship or install dictionary files with your deployment.
- Stemming quality is sufficient for your retrieval task and you want lower operational overhead.
Prefer a model-based lemmatiser (spaCy, Stanford CoreNLP) when:
- Your corpus contains significant quantities of informal, novel, or domain-specific vocabulary that dictionary coverage cannot handle.
- You need part-of-speech-sensitive lemmatisation (“saw” as a verb → “see” vs “saw” as a noun → “saw” — context matters, and Hunspell without POS input cannot reliably disambiguate).
Operational note: dictionary quality is the ceiling on Hunspell’s performance. Deploying Hunspell without verifying dictionary coverage for your domain vocabulary is a common source of silent lemmatisation failures — words are returned unchanged with no error signal.