Stemming

What it is

Stemming is a text normalisation step that reduces inflected or derived word forms to a common base, called a stem. The stem is not necessarily a valid dictionary word — it is simply a truncated string that multiple surface forms share. "running" and "runs" both stem to "run" under most English stemmers; "happiness" and "happy" both reduce to "happi".

The goal is to collapse vocabulary: instead of indexing three separate terms for "run", "runs", and "running", a search engine that applies stemming stores them all under one entry. A query for "running shoes" then also retrieves documents mentioning "run shoes" or "runs shoes" without any extra query logic.

Stemming operates by rule, not by understanding. It does not consult a dictionary or parse morphology — it applies string-rewriting patterns, typically suffix-stripping rules, and lives or dies by how well those rules generalise.
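
A single string-rewriting rule makes the point — and shows why unguarded rules are dangerous. This is an invented one-rule "stemmer" for illustration only:

```python
def naive_stem(token):
    # One rewrite rule: strip a trailing "ing". No guard condition.
    return token[:-3] if token.endswith("ing") else token

print(naive_stem("motoring"))  # motor -- the rule generalises well here
print(naive_stem("sing"))      # s     -- over-stemmed: the rule is too broad
```

The over-stemming of "sing" is exactly the failure mode that the structural conditions described below are designed to prevent.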

How it works

A stemmer reads a token and applies a cascade of suffix-stripping rules. The Porter Stemmer — the most widely used English stemmer — organises its rules into five sequential passes, each targeting a different class of suffix. Rules fire only when a structural condition called the measure (a count of vowel-consonant alternations in the stem) is satisfied, which prevents over-stemming short words.

The general shape of a rule:

IF word ends with <suffix> AND stem has measure > <threshold>
THEN replace <suffix> with <replacement>

For example, one of Porter’s Step 1b rules:

(m > 0) EED → EE      "agreed" → "agree"
(*v*)   ED  → ""      "plastered" → "plaster"
(*v*)   ING → ""      "motoring" → "motor"

Here (m > 0) requires the stem's measure to be positive, and (*v*) requires the stem to contain a vowel — which is why "feed" and "sing" pass through unchanged.
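
Porter's Step 1b rules can be sketched in Python. This is an illustrative fragment, not the full algorithm: the measure computation is simplified (it treats "y" as a consonant throughout, where real Porter treats it contextually), and the clean-up rules Porter applies after ED/ING removal — consonant undoubling, restoring a final "e" — are omitted:

```python
def measure(stem):
    # Porter's measure m: the stem matches [C](VC)^m [V].
    # Collapse the stem into runs of V/C, then count VC transitions.
    runs = ""
    for ch in stem:
        kind = "V" if ch in "aeiou" else "C"
        if not runs or runs[-1] != kind:
            runs += kind
    return runs.count("VC")

def has_vowel(stem):
    return any(ch in "aeiou" for ch in stem)

def step1b(word):
    # (m > 0) EED -> EE
    if word.endswith("eed"):
        stem = word[:-3]
        return stem + "ee" if measure(stem) > 0 else word
    # (*v*) ED -> "", (*v*) ING -> "" : the stem must contain a vowel
    for suffix in ("ed", "ing"):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            return stem if has_vowel(stem) else word
    return word

print(step1b("agreed"))     # agree
print(step1b("plastered"))  # plaster
print(step1b("feed"))       # feed -- stem "f" has measure 0, rule blocked
```

Note how the measure guard leaves "feed" alone: stripping it to "fee" would conflate a verb with an unrelated noun.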

[illustrate: step-by-step Porter Stemmer transformation of “generalisation” — the token passing through five labelled stages (Step 1a, 1b, 1c, 2, 3, 4, 5), with the active suffix highlighted and the word fragment mutating at each stage until the final stem “generalis” is produced]

Example

Input            Stem
caresses         caress
ponies           poni
catering         cater
generalisation   generalis
relational       relat
national         nation
electrically     electr

The stem "poni" is not a real word — that is expected. The stemmer’s contract is consistency, not readability. Both "ponies" and "pony" reduce to "poni", so queries and documents align.
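
The "ponies"/"pony" alignment falls out of two early Porter steps: Step 1a strips plural -s suffixes (longest match first), and Step 1c rewrites a trailing "y" to "i" when the stem contains a vowel. A minimal sketch of just those two steps:

```python
def step1a(word):
    # Porter Step 1a: SSES -> SS, IES -> I, SS -> SS, S -> ""
    if word.endswith("sses"):
        return word[:-2]          # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]          # ponies -> poni
    if word.endswith("ss"):
        return word               # caress -> caress
    if word.endswith("s"):
        return word[:-1]          # cats -> cat
    return word

def step1c(word):
    # Porter Step 1c: (*v*) Y -> I -- the stem must contain a vowel
    if word.endswith("y") and any(ch in "aeiou" for ch in word[:-1]):
        return word[:-1] + "i"    # pony -> poni
    return word

for token in ("ponies", "pony", "caresses"):
    print(token, "->", step1c(step1a(token)))
```

Both the plural and the singular land on "poni", which is all the index needs.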

[illustrate: before/after showing six input tokens on the left and their stems on the right, connected by arrows, with stems that are not valid words flagged with a visual marker]

Variants and history

Porter Stemmer (1980). Martin Porter published the algorithm in the journal Program. It remains the default English stemmer in Lucene, Solr, Elasticsearch, and most IR toolkits. A revised version, Porter2 (the Snowball English stemmer), corrects known over- and under-stemming cases and is generally preferred for new systems.

Snowball. Porter later created the Snowball framework — a small string-processing language for writing stemming algorithms — and published stemmers for over a dozen languages: German, French, Spanish, Finnish, Russian, and others. These are exposed in Lucene as SnowballFilter and are the standard choice for multilingual search pipelines.

Lovins Stemmer (1968). The first published English stemming algorithm, using a single-pass suffix-stripping table. More aggressive than Porter; now largely of historical interest.

Paice/Husk (Lancaster) Stemmer. An iterative, table-driven stemmer that tends to over-stem more than Porter but achieves more uniform stem lengths.

Algorithmic vs dictionary stemming. All rule-based stemmers apply heuristics without reference to a lexicon. Dictionary-based approaches map each form explicitly to a canonical form — more accurate but require a maintained vocabulary. Systems that need dictionary accuracy should consider lemmatisation instead.
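
The contrast can be made concrete. A dictionary approach is just a lookup table plus a policy for unknown words — here, falling back to the surface form. The lemma table below is an invented illustration, not a real resource:

```python
# Hypothetical lemma table -- a real one holds tens of thousands of
# entries, maintained by hand or derived from a lexicon.
LEMMAS = {
    "ponies": "pony",
    "ran": "run",
    "filed": "file",
}

def dictionary_stem(token):
    # Exact lookup; unknown tokens pass through unchanged.
    return LEMMAS.get(token, token)

print(dictionary_stem("ran"))      # run -- irregular form, handled
print(dictionary_stem("running"))  # running -- not in the table
```

Mapping "ran" to "run" is something no suffix-stripping rule can do; that is the accuracy a lexicon buys, at the cost of maintaining it.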

When to use it

Stemming is most useful when recall matters more than precision — returning more results at the cost of some irrelevant matches. A customer support search is a good fit. A legal retrieval system where "filed" and "filing" have distinct meanings is not.

Use stemming when:

  • Building a full-text search index in a Lucene-based engine — add PorterStemFilter or SnowballFilter to the analysis chain.
  • Your corpus is in a morphologically rich language (Finnish, Turkish, German) with an appropriate Snowball stemmer available.
  • Index size is a constraint — stemming reduces vocabulary, shrinking the inverted index.
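
The index-size effect can be measured directly: count distinct terms before and after stemming. The two-rule stripper below is a crude stand-in for a real stemmer, used only to make the vocabulary reduction visible:

```python
corpus = "run runs running runner ran".split()

def strip_s_ing(token):
    # Crude illustrative stemmer: two suffix rules, no guard conditions.
    if token.endswith("ing"):
        return token[:-3]
    if token.endswith("s"):
        return token[:-1]
    return token

print(len(set(corpus)))                       # 5 distinct raw terms
print(len({strip_s_ing(t) for t in corpus}))  # 4 after stemming
```

Even this toy stemmer shrinks the vocabulary; over a large corpus the saving compounds into a measurably smaller inverted index.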

Prefer lemmatisation when:

  • Stemming accuracy degrades results. "universe" and "university" both stem to "univers" under Porter — a false conflation that lemmatisation avoids.
  • You need linguistically valid base forms for downstream tasks (NER, relation extraction).

Tradeoffs: Stemming is fast — a few microseconds per token — and requires no external resources. Apply the same stemmer at index time and query time; a mismatch silently breaks matching.
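
The index-time/query-time contract can be demonstrated with a toy inverted index. Here crude_stem is a stand-in for whichever stemmer the pipeline uses; the point is that the same function must sit on both paths:

```python
from collections import defaultdict

def crude_stem(token):
    # Stand-in stemmer: strip a common suffix, then undouble a
    # trailing consonant pair (runn -> run).
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            token = token[: -len(suffix)]
            break
    if len(token) >= 2 and token[-1] == token[-2] and token[-1] not in "aeiou":
        token = token[:-1]
    return token

def build_index(docs, stem):
    # Map each stemmed term to the set of documents containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[stem(token)].add(doc_id)
    return index

def search(index, query, stem):
    # AND semantics: every query term must match.
    terms = [stem(t) for t in query.lower().split()]
    results = [index.get(t, set()) for t in terms]
    return set.intersection(*results) if results else set()

docs = {1: "running shoes on sale", 2: "he runs daily"}
index = build_index(docs, crude_stem)
print(search(index, "run shoes", crude_stem))   # {1}    -- stemmers agree
print(search(index, "shoes", lambda t: t))      # set()  -- query unstemmed
```

The second query fails silently: "shoes" was stored under the stem "shoe", so the raw query term finds nothing — no error, just missing results.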

See also