Porter Stemmer
What it is
The Porter Stemmer is an English suffix-stripping algorithm published by Martin Porter in 1980 in the journal Program: Electronic Library and Information Systems. It is the most widely deployed stemming algorithm in existence — still the default English stemmer in Apache Lucene, and therefore in Solr and Elasticsearch — and it established the design pattern that most subsequent stemmers follow.
Unlike ad-hoc suffix lists, the Porter Stemmer is fully formalised: a finite set of rewriting rules, each gated by a structural condition, applied in a fixed order. The algorithm makes no dictionary lookups and has no runtime dependencies beyond the word itself.
How it works
The measure m
Every rule in the algorithm is conditioned on the measure of the candidate stem — written m — which counts the number of vowel-consonant (VC) transitions in the stem, ignoring any leading consonants and trailing vowels.
Define the sequence of a word using C (consonant) and V (vowel). Y is treated as a consonant when it begins the word or follows a vowel, and as a vowel when it follows a consonant. The measure is the count of VC pairs once optional leading consonants and trailing vowels are set aside:
[C](VC)^m [V]
Working through five examples:

| Word | CV pattern | m |
|---|---|---|
| be | CV | 0 |
| tree | CCVV | 0 |
| trouble | CCVVCCV | 1 |
| oaten | VVCVC | 2 |
| relational | CVCVCVVCVC | 4 |
A rule conditioned on (m > 0) will not fire on single-syllable stems, preventing the algorithm from over-aggressively stripping short words. A rule conditioned on (m > 1) requires at least two VC pairs, restricting it to longer stems.
[illustrate: step-by-step measure calculation for “relational” — characters laid out in a row, each labelled C or V, VC pairs bracketed and numbered, final count m=4 shown; then the same for “trouble” (m=1) and “be” (m=0) side by side for contrast]
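The measure can be computed directly from the C/V classification. A minimal Python sketch (not the full algorithm, just the m calculation, including Porter's y rule: y is a consonant at the start of the word or after a vowel, and a vowel after a consonant):

```python
def is_consonant(word, i):
    """Porter's consonant test: a, e, i, o, u are vowels;
    y is a consonant at the start or after a vowel, a vowel after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """m in [C](VC)^m[V]: the number of V-to-C transitions."""
    pattern = ["C" if is_consonant(stem, i) else "V" for i in range(len(stem))]
    return sum(1 for a, b in zip(pattern, pattern[1:]) if a == "V" and b == "C")
```

So `measure("be")` is 0, `measure("trouble")` is 1, and `measure("relational")` is 4.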
The five steps
Rules are written in the form:
(condition) suffix → replacement
Where condition is typically a measure threshold, optionally combined with tests such as *v* (stem contains a vowel) or *d (stem ends in a double consonant).
Step 1a — plural and third-person present forms:
SSES → SS "caresses" → "caress"
IES → I "ponies" → "poni"
SS → SS "caress" → "caress"
S → "" "cats" → "cat"
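Step 1a is small enough to transcribe in full. A direct Python sketch (longest suffix tried first; at most one rule fires):

```python
def step1a(word):
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress (SS -> SS is a deliberate no-op)
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word
```

The apparently pointless SS → SS rule is what stops the bare S rule from firing on words like "caress".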
Step 1b — past tense and progressive:
(m > 0) EED → EE "agreed" → "agree"
(*v*) ED → "" "plastered" → "plaster"
(*v*) ING → "" "motoring" → "motor"
If the ED or ING rule fires, a second sub-pass tidies up the resulting stem:
AT → ATE "conflat" → "conflate"
BL → BLE "troubl" → "trouble"
IZ → IZE "siz" → "size"
(The full sub-pass also reduces a final double consonant other than L, S, or Z, so "hopp" becomes "hop", and restores a final E when m = 1 and the stem ends consonant-vowel-consonant, so "fil" becomes "file".)
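The two-phase shape of step 1b can be sketched as follows. This is a simplified Python sketch: the vowel and measure tests ignore the y rule, and the double-consonant and E-restoration tidy rules are omitted for brevity:

```python
VOWELS = set("aeiou")

def simple_measure(stem):
    # V-to-C transition count, ignoring the y rule for brevity
    pattern = ["V" if ch in VOWELS else "C" for ch in stem]
    return sum(1 for a, b in zip(pattern, pattern[1:]) if a == "V" and b == "C")

def step1b(word):
    if word.endswith("eed"):
        stem = word[:-3]
        return stem + "ee" if simple_measure(stem) > 0 else word
    for suffix in ("ed", "ing"):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            if any(ch in VOWELS for ch in stem):   # the *v* condition
                # second sub-pass: repair the stem left behind
                for tail, repl in (("at", "ate"), ("bl", "ble"), ("iz", "ize")):
                    if stem.endswith(tail):
                        return stem[:-len(tail)] + repl
                return stem
            return word
    return word
```

So "agreed" becomes "agree", "motoring" becomes "motor", and "conflated" becomes "conflate" via the AT → ATE repair.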
Step 2 — longer derivational suffixes:
(m > 0) ATIONAL → ATE "relational" → "relate"
(m > 0) TIONAL → TION "conditional" → "condition"
(m > 0) ENCI → ENCE "valenci" → "valence"
(m > 0) IZER → IZE "digitizer" → "digitize"
(m > 0) ALISM → AL "formalism" → "formal"
Step 3 — medium-length suffixes:
(m > 0) ICATE → IC "triplicate" → "triplic"
(m > 0) ATIVE → "" "formative" → "form"
(m > 0) ALIZE → AL "formalize" → "formal"
(m > 0) ICITI → IC "electriciti" → "electric"
Step 4 — longer structural suffixes:
(m > 1) AL → "" "revival" → "reviv"
(m > 1) ANCE → "" "allowance" → "allow"
(m > 1) MENT → "" "adjustment" → "adjust"
(m > 1) ION → "" (only when stem ends in S or T) "adoption" → "adopt"
(m > 1) ISM → "" "homologism" → "homolog"
Step 5 — final cleanup:
(m > 1) E → "" "probate" → "probat"
(m = 1 and not *o) E → "" "cease" → "ceas"
(m > 1 and *d and *L) → remove one L "controll" → "control"
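Steps 2 through 4 all share one mechanical shape: scan an ordered suffix table, and if the longest matching suffix leaves a stem whose measure clears the threshold, rewrite it. A sketch of that shared machinery (simplified measure that ignores the y rule; rule tables abridged from the listings above):

```python
def simple_measure(stem):
    # V-to-C transition count, ignoring the y rule for brevity
    pattern = ["V" if ch in "aeiou" else "C" for ch in stem]
    return sum(1 for a, b in zip(pattern, pattern[1:]) if a == "V" and b == "C")

def apply_step(word, rules, threshold):
    # Rules are listed longest-suffix-first; at most one fires per step.
    for suffix, replacement in rules:
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            if simple_measure(stem) > threshold:
                return stem + replacement
            return word   # suffix matched but the measure condition failed
    return word

STEP2 = [("ational", "ate"), ("tional", "tion"), ("enci", "ence"), ("izer", "ize")]  # abridged
STEP4 = [("ance", ""), ("ment", ""), ("al", "")]                                      # abridged
```

With this, `apply_step("relational", STEP2, 0)` gives "relate" and `apply_step("adjustment", STEP4, 1)` gives "adjust".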
Example
Trace "generalizations" through all five steps:

| Step | Input | Rule fired | Output |
|---|---|---|---|
| 1a | generalizations | S → "" | generalization |
| 1b | generalization | no *v* ED/ING match | generalization |
| 2 | generalization | IZATION → IZE | generalize |
| 3 | generalize | ALIZE → AL | general |
| 4 | general | AL → "" (m = 2 > 1) | gener |
| 5 | gener | no final E | gener |

Final stem: gener.
Failure cases
The algorithm’s purely mechanical nature produces predictable failure modes.
Over-stemming (false conflation). Two words collapse to the same stem despite different meanings:
| Words | Stem | Problem |
|---|---|---|
| universe, university | univers | Unrelated concepts conflated |
| general, generalization | gener | Adjective conflated with abstract noun |
| experiment, experience | experi | False match in retrieval |
Under-stemming. Variants that should share a stem do not:
| Words | Stems | Problem |
|---|---|---|
| run, ran | run, ran | Irregular verb: vowel mutation not handled |
| go, went | go, went | Suppletive forms invisible to the algorithm |
| good, better, best | good, better, best | Irregular comparatives |
Irregular morphology is structurally outside the algorithm’s scope: suffix-stripping cannot handle vowel alternations or suppletive forms. These cases require a morphological lexicon — that is, lemmatisation.
Variants and history
Porter’s original paper — M.F. Porter, “An algorithm for suffix stripping,” Program: Electronic Library and Information Systems 14(3):130–137, 1980 — is one of the most cited papers in information retrieval. The algorithm was intentionally simple and fast; Porter described it as a heuristic, not a linguistically principled morphological analyser.
Porter2 (Snowball English stemmer). Porter revisited the algorithm in the 2000s when he developed Snowball, a string-processing language for writing stemmers. The Snowball English stemmer — sometimes called Porter2 — corrects around 200 known mis-stemmings from the original, handles y-as-vowel and double-letter edge cases more consistently, and is the version recommended for new systems. It is exposed in Lucene via SnowballFilter with the English language setting.
Snowball for other languages. The Snowball framework extended the same design to over fifteen languages — German (german2), French, Spanish, Russian, Finnish, and others — using the same measure-gated suffix-stripping model adapted to each language’s morphology.
When to use it
The Porter Stemmer is a reasonable default for English full-text search when deployment simplicity and throughput matter more than edge-case accuracy. Lucene’s PorterStemFilter adds it to an analysis chain in one line.
Prefer Porter2 / Snowball English for new systems — the accuracy improvements are free and require no configuration change beyond the filter class name.
Prefer lemmatisation when:
- False conflations (universe/university) noticeably degrade retrieval quality.
- Stems are surfaced to users in facets or suggestions and non-words like "univers" are unacceptable.
- Your pipeline already includes a POS tagger for other reasons and the additional cost of lemmatisation is negligible.
The same consistency rule applies as with any normaliser: the stemmer used at index time must be identical to the one used at query time. Mixing Porter and Porter2 between index and query silently breaks matching.
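In Elasticsearch, for example, the safest way to guarantee that consistency is to define a single analyzer and let it apply at both index and query time by default, rather than configuring a separate search_analyzer. An illustrative index-settings fragment (the field and analyzer names are made up; porter_stem is the built-in token filter wrapping Lucene's PorterStemFilter):

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "english_porter": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "porter_stem"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "body": { "type": "text", "analyzer": "english_porter" }
    }
  }
}
```

Because no search_analyzer is set, queries against the body field are stemmed with exactly the same chain used at index time.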