Hunspell
What it is
Hunspell is an open-source morphological analyser, spell checker, and lemmatiser. Given an inflected word form — “running”, “geese”, “unremarkable” — it strips affixes according to a set of language-specific rules and returns the base dictionary form (the lemma). Unlike algorithmic stemmers, which apply heuristic suffix-removal rules to approximate a root, Hunspell works against an actual dictionary: a word is only valid if the resulting base form exists in that dictionary.
It is the spell-checking engine embedded in LibreOffice, OpenOffice, Firefox, Chrome, and macOS, and is used as a lemmatiser in NLP pipelines — most notably as a token filter in Elasticsearch.
How it works
Hunspell loads two files at startup:
- `.dic`: the dictionary file. One entry per line, each consisting of a word stem followed by a `/` and a set of flag characters: `run/ABCDE`. The flags are compact identifiers that indicate which affix rules this word participates in.
- `.aff`: the affix file. Defines all PREFIX and SUFFIX rules. Each rule specifies: the flag it belongs to, a stripping pattern, an addition pattern, and a condition that must match the word before the rule fires.
Together, these two files let Hunspell expand a single dictionary entry into every valid inflected form for that word — and, critically, reverse the process during analysis.
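To make the format concrete, here is a minimal, invented pair of files in Hunspell's syntax (the first line of a `.dic` file is an approximate entry count; `SFX` lines read: flag, characters to strip, characters to add, condition):

```
# en_toy.dic
2
dog/S
bake/G

# en_toy.aff
SFX S Y 1
SFX S 0 s .

SFX G Y 1
SFX G e ing e
```

Here flag `S` adds `s` to any stem (condition `.` matches anything), so `dog/S` licenses “dogs”; flag `G` strips a trailing `e` and adds `ing`, so `bake/G` licenses “baking”. Real dictionaries contain tens of thousands of entries and far richer rule sets.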
Morphological analysis (lemmatisation) works as follows:
- Take the input word.
- Try every applicable suffix rule: strip the suffix, check whether the condition is met, and look up the resulting candidate stem in the `.dic` file.
- If the candidate stem exists and carries the flag that authorises the suffix rule that was stripped, the analysis succeeds — that stem is the lemma.
- Repeat for prefixes and compound rules.
- If no rule produces a known stem, the word is out-of-vocabulary and analysis fails.
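The suffix-stripping loop above can be sketched in a few lines of Python. This is a toy illustration of the idea, not Hunspell's actual implementation: the dictionary, flags, and rules are invented, and real `.aff` conditions are regex-like character classes rather than lambdas.

```python
# Toy Hunspell-style analysis over an in-memory ".dic" (stem -> flags)
# and ".aff" (suffix rules keyed by flag). All entries are invented.

# flag -> (strip, add, condition): a stem satisfying `condition` and
# carrying the flag may have `strip` removed and `add` appended.
SUFFIX_RULES = {
    "S": ("", "s", lambda stem: not stem.endswith("s")),  # plural: dog -> dogs
    "G": ("e", "ing", lambda stem: stem.endswith("e")),   # e-drop: bake -> baking
}

DICTIONARY = {
    "dog": {"S"},
    "bake": {"G"},
}

def lemmatise(word):
    """Return the lemma if some suffix rule maps `word` back to a known stem."""
    if word in DICTIONARY:          # already a base form
        return word
    for flag, (strip, add, cond) in SUFFIX_RULES.items():
        if word.endswith(add):
            # Reverse the rule: remove what it added, restore what it stripped.
            candidate = word[: len(word) - len(add)] + strip
            # The stem must exist AND carry the flag authorising this rule.
            if candidate in DICTIONARY and flag in DICTIONARY[candidate] and cond(candidate):
                return candidate
    return None  # out-of-vocabulary: analysis fails

print(lemmatise("dogs"))    # dog
print(lemmatise("baking"))  # bake
print(lemmatise("ran"))     # None (no irregular entry in this toy dictionary)
```

Note the double check in the lookup: it is not enough for the stripped candidate to exist in the dictionary — it must also carry the flag that licenses the rule, which is what prevents e.g. a mass noun from being analysed as a plural.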
Spell checking uses the same mechanism in reverse: generate all valid surface forms from dictionary entries and check whether the input matches any of them.
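The generation direction can be sketched the same way, again as a toy with invented entries. In practice Hunspell does not materialise every surface form up front (infeasible for agglutinative languages) — it strips affixes at lookup time — but set membership over the expanded forms is an equivalent way to picture spell checking:

```python
# Toy sketch of the generation direction: expand each stem into all
# surface forms its flags license, then spell-check by set membership.
# Rules and entries are invented for illustration.

SUFFIX_RULES = {
    "S": ("", "s"),     # plural: dog -> dogs
    "G": ("e", "ing"),  # gerund with e-drop: bake -> baking
}

DICTIONARY = {"dog": {"S"}, "bake": {"G", "S"}}

def expand(stem, flags):
    """Yield the stem plus every form licensed by its flags."""
    yield stem
    for flag in flags:
        if flag in SUFFIX_RULES:
            strip, add = SUFFIX_RULES[flag]
            if stem.endswith(strip):
                yield stem[: len(stem) - len(strip)] + add

ALL_FORMS = {form for stem, flags in DICTIONARY.items() for form in expand(stem, flags)}

def check(word):
    return word in ALL_FORMS

print(sorted(ALL_FORMS))             # ['bake', 'bakes', 'baking', 'dog', 'dogs']
print(check("dogs"), check("dogz"))  # True False
```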
Example
Given an English Hunspell dictionary, analysing the word “dogs”:
- Try the suffix rule for `-s`: strip “s”, candidate stem “dog”.
- Look up “dog” in `.dic`: found, with a flag that permits the plural suffix rule.
- Return lemma: `dog`.
Analysing “ran”:
- Try suffix rules — none strip to a candidate present in `.dic` as a regular inflection of “ran”.
- Many Hunspell dictionaries handle irregular forms by listing them explicitly in `.dic` with morphological data fields that name the stem (e.g. `ran st:run is:past`), returning lemma `run`.
- If the dictionary does not include “ran” as an irregular entry, analysis fails and the word is returned unchanged or as unknown.
Analysing a neologism like “tokenising” (not yet in the dictionary):
- Strip “-ing” → “tokenis” → not found.
- Strip “-ing” and restore “e” (the e-drop rule in reverse) → “tokenise” → not found.
- No rule succeeds. Result: no lemma returned.
Variants and history
Hunspell was created by László Németh and first released in 2002 as a successor to MySpell (itself a successor to Ispell, which dates to 1971). It introduced a more expressive affix rule system, support for compound words, and better handling of agglutinative languages. It is released under a tri-licence: GPL, LGPL, and MPL.
Spylls is a pure-Python reimplementation of Hunspell, useful for environments where compiling the C extension is impractical. PyHunspell provides Python bindings to the original C library.
Elasticsearch exposes Hunspell via the hunspell token filter, which requires a Hunspell dictionary to be installed on the filesystem of each node. The filter performs morphological analysis at index and query time, making it a genuine lemmatiser rather than a stemmer.
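An index configuration wiring this up might look roughly like the following (a sketch based on the documented filter parameters; the `en_US` dictionary files are assumed to be present under the `hunspell` directory of each node's config path, and analyzer/filter names are invented):

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "en_hunspell": {
          "type": "hunspell",
          "locale": "en_US"
        }
      },
      "analyzer": {
        "en_lemma": {
          "tokenizer": "standard",
          "filter": ["lowercase", "en_hunspell"]
        }
      }
    }
  }
}
```

Because the filter runs at both index and query time, documents and queries are reduced to the same lemmas, which is what makes matching work.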
Language coverage spans dozens of languages. Community-maintained dictionaries for English, French, German, and Spanish are mature and well-tested; minority language dictionaries vary in completeness and maintenance frequency.
When to use it
Use Hunspell when:
- Precision matters more than recall. Hunspell only returns a lemma when it is certain — the form is in the dictionary. Algorithmic stemmers will happily over-stem novel words into meaningless fragments; Hunspell will simply return nothing.
- You are working in Elasticsearch and need a lemmatiser without deploying a full ML model. The `hunspell` token filter is the standard route.
- You need spell checking alongside lemmatisation — both come from the same engine and the same dictionary files at no extra cost.
Prefer an algorithmic stemmer (Porter2, KStem) when:
- You need full recall, including for out-of-vocabulary words. Algorithmic stemmers produce approximate roots for any string; Hunspell fails silently on unknown words.
- You cannot ship or install dictionary files with your deployment.
- Stemming quality is sufficient for your retrieval task and you want lower operational overhead.
Prefer a model-based lemmatiser (spaCy, Stanford CoreNLP) when:
- Your corpus contains significant quantities of informal, novel, or domain-specific vocabulary that dictionary coverage cannot handle.
- You need part-of-speech-sensitive lemmatisation (“saw” as a verb → “see” vs “saw” as a noun → “saw” — context matters, and Hunspell without POS input cannot reliably disambiguate).
Operational note: dictionary quality is the ceiling on Hunspell’s performance. Deploying Hunspell without verifying dictionary coverage for your domain vocabulary is a common source of silent lemmatisation failures — words are returned unchanged with no error signal.