Lemmatisation

What it is

Lemmatisation is a text normalisation step that maps an inflected or derived word form to its canonical dictionary entry — its lemma. Where a stemmer chops suffixes by rule, a lemmatiser looks up (or derives) the grammatically correct base form: "better" → "good", "was" → "be", "corpora" → "corpus".

The lemma is always a real, dictionary-valid word. That distinguishes lemmatisation from stemming, which produces a stem that may not exist in any dictionary ("happiness" → "happi").

Words with very different surface forms can share a lemma: "am", "is", "are", "was", "were" all lemmatise to "be". This conflation is linguistically correct — they are forms of the same lexeme — and it is something rule-based stemmers cannot achieve.

How it works

A lemmatiser typically combines two resources:

  1. Part-of-speech (POS) tagging — the word’s grammatical role in the sentence must be known before the correct lemma can be chosen. The word "saw" lemmatises to "see" as a verb but to "saw" as a noun. Without POS information, a lemmatiser must guess or return multiple candidates.

  2. Lexicon lookup — the tagger’s decision is used to index into a morphological lexicon (such as WordNet for English) that maps (surface_form, POS) pairs to lemmas. Entries are compiled from hand-curated dictionaries and, in modern systems, statistical morphological analysers trained on annotated corpora.

The lookup proceeds roughly as follows:

# toy lexicon mapping (surface_form, POS) pairs to lemmas
lexicon = {("running", "VERB"): "run"}

token = "running"
pos   = "VERB"            # from the POS tagger

lemma = lexicon[(token, pos)]
# → "run"

When a form is absent from the lexicon, most lemmatisers fall back to rule-based morphological analysis — essentially a softer form of stemming applied only in the out-of-vocabulary (OOV) case.
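That two-stage scheme — irregular forms handled by an exception list, everything else by suffix-detachment rules tried in order — can be sketched in a few lines. The exception entries and rule pairs below are illustrative toys, not WordNet's actual tables:

```python
# Minimal morphy-style lemmatiser sketch: exception lookup first,
# then ordered suffix-detachment rules. Both tables are toy data.
EXCEPTIONS = {
    ("geese", "NOUN"): "goose",
    ("corpora", "NOUN"): "corpus",
    ("saw", "VERB"): "see",
}

# (suffix_to_remove, replacement) pairs, tried in order per POS
DETACHMENT_RULES = {
    "NOUN": [("ies", "y"), ("ses", "s"), ("s", "")],
    "VERB": [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")],
}

def lemmatise(token: str, pos: str) -> str:
    """Return the lemma for (token, pos): exceptions, then rules."""
    if (token, pos) in EXCEPTIONS:
        return EXCEPTIONS[(token, pos)]
    for suffix, replacement in DETACHMENT_RULES.get(pos, []):
        if token.endswith(suffix):
            return token[: len(token) - len(suffix)] + replacement
    return token  # nothing matched: return the form unchanged

print(lemmatise("geese", "NOUN"))    # → "goose"  (exception list)
print(lemmatise("studies", "NOUN"))  # → "study"  ("ies" → "y" rule)
print(lemmatise("walking", "VERB"))  # → "walk"   ("ing" → "" rule)
```

Note that real systems validate each rule-derived candidate against the lexicon before accepting it (naive detachment would turn "running" into "runn"); this sketch omits that check.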

Example

Input     POS    Lemma
running   VERB   run
better    ADJ    good
corpora   NOUN   corpus
geese     NOUN   goose
was       VERB   be
saw       VERB   see
saw       NOUN   saw
studies   VERB   study
studies   NOUN   study

The "saw" row illustrates why POS context is indispensable: the same surface form has two different lemmas depending on grammatical role.
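The table can be encoded directly as a (surface form, POS) → lemma lookup; the dictionary below is a hand-built toy for illustration, not a real lexicon:

```python
# Hand-built toy lexicon reproducing the example table;
# a real system would consult WordNet or a trained analyser.
LEXICON = {
    ("running", "VERB"): "run",
    ("better", "ADJ"): "good",
    ("corpora", "NOUN"): "corpus",
    ("geese", "NOUN"): "goose",
    ("was", "VERB"): "be",
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("studies", "VERB"): "study",
    ("studies", "NOUN"): "study",
}

# The same surface form yields different lemmas depending on POS:
print(LEXICON[("saw", "VERB")])  # → "see"
print(LEXICON[("saw", "NOUN")])  # → "saw"
```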

Variants and history

WordNet lemmatiser (morphy). WordNet’s built-in morphological engine, morphy, uses a list of exception forms combined with detachment rules for English. It is the lemmatiser exposed by NLTK’s WordNetLemmatizer. It is simple and fast, but it requires explicit POS input (NLTK defaults to noun) and degrades on verbs with complex inflections.

spaCy. Modern spaCy models include an integrated lemmatiser trained jointly with the tagger and parser. The POS tag is inferred automatically; the user receives lemmas directly from token.lemma_. Coverage and accuracy depend on the language model: en_core_web_sm vs en_core_web_trf produce meaningfully different results on difficult cases.

UDPipe and Stanza. Universal Dependencies-based pipelines (UDPipe, Stanford Stanza) produce lemmas as part of full morphosyntactic analysis. These are better choices for non-English languages or when downstream tasks need full morphological features.

Rule-based lemmatisers. Some systems — including older Lucene MorfologikFilter integrations — use compiled finite-state transducers (FSTs) rather than hash-table lookups, trading flexibility for speed and compact memory footprint.

Historical context. Lemmatisation in computational linguistics predates modern NLP toolkits; morphological analysers for Latin and Greek were built in the 1970s for corpus studies. English lemmatisation became common in information retrieval research in the 1990s as WordNet matured.

When to use it

Lemmatisation costs more than stemming — it requires a POS tagger, a lexicon, and correspondingly more runtime — but the payoff is accuracy and readability.

Use lemmatisation when:

  • Your application surfaces lemmatised terms to users (search suggestions, facets, analytics dashboards) and stems like "univers" would look broken.
  • You need linguistically correct base forms for downstream tasks — named entity recognition, relation extraction, or semantic parsing work better with valid words.
  • Your domain involves irregular forms that stemmers mishandle: medical terms, proper names, irregular plurals ("criteria" → "criterion").
  • You are building a pipeline for a morphologically rich language (German, Turkish, Finnish) where stemming heuristics generalise poorly and a trained morphological analyser is available.

Prefer stemming when:

  • Index-time throughput matters more than correctness — a Porter stemmer runs in microseconds per token; a full lemmatiser with POS tagging is an order of magnitude slower.
  • You are prototyping and do not yet have an annotated training corpus or a reliable POS model.
  • Your search engine’s built-in analysis chain already uses stemming and the retrieval quality is acceptable.

A note on consistency: the same rule that applies to stemming applies here — use the same lemmatiser at index time and at query time. A mismatch will silently break term matching.
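The consistency requirement can be made concrete: both sides of the match must pass through one shared normalisation function. The normalise function and the lemma table below are illustrative toys, not a real search API:

```python
# Shared normalisation applied at both index time and query time.
# LEMMAS is a toy table; a real pipeline would call a lemmatiser.
LEMMAS = {"studies": "study", "studying": "study", "studied": "study"}

def normalise(token: str) -> str:
    return LEMMAS.get(token.lower(), token.lower())

# Index time: store normalised terms.
index = {normalise(t) for t in ["She", "studies", "corpora"]}

# Query time: the query term matches only because the SAME
# function produced the stored term "study".
assert normalise("studying") in index

# A mismatched pipeline (raw tokens at query time) silently misses:
assert "studying" not in index
```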

See also