Porter2 Stemmer

Stemming Preprocessing Information-Retrieval Query-Parsing String-Matching Needs-Review

What it is

The Porter2 Stemmer, formally the Snowball English stemmer, is Martin Porter’s rewrite of his 1980 algorithm, published as part of the Snowball project in the early 2000s. It corrects roughly 200 known mis-stemmings in the original, tightens several structural definitions, and adds handling for edge cases that the original left ambiguous. Porter himself recommends Porter2 over the original for all new work. Lucene ships it as the Snowball English stemmer, and Elasticsearch exposes it through the stemmer token filter with language "porter2".

The algorithm remains purely rule-based and dictionary-free. Like its predecessor it strips English suffixes in a fixed sequence of passes, but it replaces the original’s measure m gating mechanism with a more precise region-based model, and it adds pre-processing steps that the original lacked entirely.

How it works

Region definitions: R1 and R2

The original Porter Stemmer gates every rule on the measure m — a count of vowel-consonant transitions. Porter2 replaces this with two explicit string regions:

R1 is the part of the word after the first non-vowel that follows a vowel. Skip any leading consonants, find the first vowel, then find the first consonant after it; R1 starts immediately after that consonant. If there is no such consonant, R1 is the empty region at the end of the word.
R2 is the same operation applied again inside R1.

Word: "beautiful"
       b e a u t i f u l
       C V V V C V C V C

First vowel = e (pos 1)
First non-vowel after e = t (pos 4)
R1 = "iful"   (from pos 5)

R2 is R1's own R1:
R1 = "iful":  first vowel = i, first non-vowel after = f
R2 = "ul"     (from pos 2 of R1)

Rules in later steps require the suffix being stripped to lie entirely within R1 or R2. The R1 test plays the role of Porter1’s m > 0 condition and the R2 test that of m > 1, but a region is a fixed offset computed once per word, so each rule reduces to an index comparison rather than a recount of the remaining stem.

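Two special cases apply. Words of two letters or fewer are returned unchanged. For words beginning with gener, commun, or arsen, R1 is set to begin directly after that prefix rather than at the computed position, which stops words such as "generous" and "general" from collapsing to the same stem.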
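
A minimal Python sketch of the region computation, assuming the word is already lowercased and any consonant y has already been marked as Y. The names region_start and regions are illustrative and do not come from any library. The sketch also applies the special prefixes of the published algorithm (gener, commun, arsen), for which R1 begins directly after the prefix.

```python
VOWELS = set("aeiouy")  # uppercase Y (consonant marker) is deliberately absent
SPECIAL_PREFIXES = ("gener", "commun", "arsen")


def region_start(word: str, start: int = 0) -> int:
    """Return the index just past the first non-vowel that follows a vowel,
    scanning from `start`. Returns len(word) when no such position exists."""
    for i in range(start + 1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)


def regions(word: str) -> tuple[int, int]:
    """Return (r1, r2) as offsets into `word`."""
    prefix = next((p for p in SPECIAL_PREFIXES if word.startswith(p)), None)
    r1 = len(prefix) if prefix else region_start(word)
    r2 = region_start(word, r1)  # the same scan, restarted inside R1
    return r1, r2


r1, r2 = regions("beautiful")
print(r1, "beautiful"[r1:])  # 5 iful
print(r2, "beautiful"[r2:])  # 7 ul
```

Both values are offsets into the word, so a suffix of length n lies in R1 when len(word) - n >= r1, and in R2 when len(word) - n >= r2.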

[illustrate: “beautiful” shown letter-by-letter with C/V labels; R1 and R2 regions shaded in two distinct colours, boundaries annotated with the rule that sets each; contrast side-by-side with the m=3 measure for the same word under Porter1]

y-as-vowel pre-processing

Before any step fires, Porter2 rewrites initial y and any y preceded by a vowel as Y. Uppercase Y marks the letter as a consonant; any y left in lowercase is treated as a vowel:

"yell"    →  "Yell"   (initial y → Y; consonant)
"player"  →  "plaYer" (y after a vowel → Y; consonant)
"beyond"  →  "beYond" (y after a vowel → Y; consonant)
"sky"     →  "sky"    (y after a consonant stays lowercase; vowel)

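The original Porter Stemmer expressed the same idea as a prose definition (y is a consonant unless preceded by a consonant), which left every implementation to track the status of each y itself. Marking the consonant case as Y makes the categorisation explicit and rule-referenceable. Every Y is turned back into y once stemming is complete.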
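
The marking pass can be sketched in a few lines of Python. This is an illustration under the assumption that the input is already lowercased; mark_consonant_y is a hypothetical helper name.

```python
VOWELS = set("aeiouy")


def mark_consonant_y(word: str) -> str:
    """Rewrite initial y, and y directly after a vowel, as uppercase Y."""
    chars = list(word)
    for i, ch in enumerate(chars):
        # chars[i - 1] reflects earlier rewrites: a Y is no longer a vowel.
        if ch == "y" and (i == 0 or chars[i - 1] in VOWELS):
            chars[i] = "Y"
    return "".join(chars)


for w in ("yell", "player", "beyond", "sky"):
    print(w, "->", mark_consonant_y(w))
```

A final pass that lowercases the result restores ordinary spelling before the stem is emitted.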

Possessive stripping (Step 0)

Porter2 adds a step entirely absent from the original: stripping 's', 's, and ' before any other step runs.

"running's"  →  "running"
"woman's"    →  "woman"
"dogs'"      →  "dogs"

Without Step 0, "running's" falls through the suffix rules entirely and reaches the index as an unstemmed token — a common failure mode when tokenisers do not split on apostrophes.
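
A short Python sketch of Step 0, assuming the tokeniser emits the plain ASCII apostrophe (typographic apostrophes would need normalising first). The function name step_0 is illustrative.

```python
def step_0(word: str) -> str:
    """Remove the longest possessive suffix among 's', 's and '."""
    for suffix in ("'s'", "'s", "'"):  # longest first
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


for w in ("running's", "woman's", "dogs'", "dog"):
    print(w, "->", step_0(w))
```

Checking the longest suffix first matters: testing ' before 's' would leave a stray 's on the word.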

Key rule corrections

The majority of Porter2’s improvements are individual rule corrections.

Step 1a fixes IES → I. Under Porter1, "ties" → "ti", losing the semantic root. Porter2 replaces IES (and IED) with I only when more than one letter precedes the suffix; otherwise it emits IE:

Porter1:  "ties"  →  "ti"
Porter2:  "ties"  →  "tie"   (one letter before the suffix → IE)
           "cries" →  "cri"   (more than one letter before the suffix → I)

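As a rough illustration, Step 1a as published can be written in Python as below. The sketch assumes Step 0 has already run and that the input is lowercase; step_1a is a hypothetical name.

```python
VOWELS = set("aeiouy")


def step_1a(word: str) -> str:
    """Plural and -ied/-ies handling, longest suffix first."""
    if word.endswith("sses"):
        return word[:-2]  # sses -> ss
    if word.endswith(("ied", "ies")):
        # More than one letter before the suffix -> i, otherwise -> ie.
        return word[:-2] if len(word) > 4 else word[:-3] + "ie"
    if word.endswith(("us", "ss")):
        return word  # left untouched
    if word.endswith("s"):
        # Delete s only if a vowel occurs somewhere before the letter
        # that immediately precedes the s.
        if any(ch in VOWELS for ch in word[:-2]):
            return word[:-1]
    return word


for w in ("ties", "cries", "caresses", "gas", "gaps"):
    print(w, "->", step_1a(w))
```

The last branch is why "gas" and "this" keep their s while "gaps" and "kiwis" lose it.
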
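Step 2 expands derivational suffix handling. Its rules are tested against R1 and include additions such as:

FULLI   → FUL
LESSLI  → LESS
LI      → ""   (if preceded by one of c d e g h k m n r t)

IZATION → IZE is carried over from Porter1. There is no matching ISATION rule, so British -isation spellings fall through to the generic ATION → ATE rule and do not reduce to the same stem as their -ization counterparts.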

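Step 5 restates final-E deletion in terms of regions. A final e is deleted if it lies in R2, or if it lies in R1 and is not preceded by a short syllable; the short-syllable definition takes over the role of Porter1’s *o (ends-in-CVC) condition. A final l is deleted if it lies in R2 and follows another l.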
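
A hedged Python sketch of Step 5 as the published algorithm describes it: a final e is deleted in R2, or in R1 when it is not preceded by a short syllable, and a final l is deleted in R2 after another l. The helper names are illustrative, and the special-prefix adjustment to R1 is left out for brevity.

```python
VOWELS = set("aeiouy")


def region_start(word: str, start: int = 0) -> int:
    """Index just past the first non-vowel following a vowel, from `start`."""
    for i in range(start + 1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)


def ends_in_short_syllable(stem: str) -> bool:
    """non-vowel + vowel + non-vowel (not w, x, Y) at the end of the stem,
    or a two-letter stem of vowel + non-vowel."""
    if len(stem) == 2:
        return stem[0] in VOWELS and stem[1] not in VOWELS
    if len(stem) >= 3:
        return (stem[-3] not in VOWELS and stem[-2] in VOWELS
                and stem[-1] not in VOWELS and stem[-1] not in "wxY")
    return False


def step_5(word: str) -> str:
    r1 = region_start(word)
    r2 = region_start(word, r1)
    last = len(word) - 1  # offset of the final letter
    if word.endswith("e"):
        if last >= r2 or (last >= r1 and not ends_in_short_syllable(word[:-1])):
            return word[:-1]
    elif word.endswith("ll") and last >= r2:
        return word[:-1]
    return word


for w in ("rate", "hope", "generate", "controll"):
    print(w, "->", step_5(w))
```

"rate" and "hope" keep their e because the preceding letters form a short syllable, while "generate" loses it because the e lies in R2.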

The exception word list

Porter2 ships with approximately 18 hard-coded exceptions that bypass the algorithm and map to a fixed stem — cases where the rule cascade produces the wrong result regardless of region gating:

"skis"    →  "ski"       "dying"   →  "die"
"skies"   →  "sky"       "lying"   →  "lie"
"early"   →  "earli"     "tying"   →  "tie"
"only"    →  "onli"      "gently"  →  "gentl"

These words cannot be correctly stemmed by suffix-stripping alone. The list fits in a lookup table with negligible performance cost.
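
The lookup can be sketched as a dictionary check placed in front of the rule cascade. The table below holds only the eight pairs shown above, not the full published list, and fallback stands in for whatever function runs the regular steps; both names are hypothetical.

```python
from typing import Callable

# Subset of the exception list: just the pairs quoted in this article.
EXCEPTIONS = {
    "skis": "ski", "skies": "sky",
    "dying": "die", "lying": "lie", "tying": "tie",
    "early": "earli", "only": "onli", "gently": "gentl",
}


def stem_with_exceptions(word: str, fallback: Callable[[str], str]) -> str:
    """Return the fixed stem for an exceptional word, else run the rules."""
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    return fallback(word)


# A do-nothing fallback is enough to show the control flow.
print(stem_with_exceptions("dying", lambda w: w))  # die
print(stem_with_exceptions("cats", lambda w: w))   # cats
```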

The Snowball language

Porter2 is written in Snowball, a domain-specific language Porter designed for string transformation algorithms. Snowball programs operate on a string with a movable cursor and expose primitives for suffix testing, region boundary checking, and in-place replacement:

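define Step_1a as (
    [substring] among (
        'sses'  (<-'ss')
        'ied' 'ies'
                ((hop 2 <-'i') or <-'ie')
        's'     (next gopast v delete)
        'us' 'ss'
    )
)

This listing is abridged from the published English stemmer source and runs in backward mode, so the cursor moves from the end of the word towards its start. Inside among, each suffix is followed by the action to run when it matches. <- replaces the matched suffix, and hop 2 succeeds only if at least two characters precede it, which is how the rule chooses between i and ie. Region tests are written as calls to the routines R1 and R2, which compare the cursor with the stored region boundaries; Step 1a needs none, but later steps attach them to their suffix lists or actions. This makes the algorithm’s region logic explicit and independently auditable, a significant advantage over Porter1’s prose-and-pseudocode specification, which contained ambiguities that different implementations resolved differently.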

Snowball compiles to C, Java, and several other targets. The generated Java is what ships in Lucene’s SnowballFilter.

Example

Trace "generously" through the key steps:

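Step	Input	Rule	Output
y-flag	`generously`	no initial y, no y after a vowel	`generously`
regions	`generously`	special prefix `gener`: R1 = `ously`, R2 = `ly`	`generously`
0	`generously`	no possessive	`generously`
1a–1b	`generously`	no match	`generously`
1c	`generously`	`Y → I` after a non-vowel	`generousli`
2	`generousli`	`OUSLI → OUS` (in R1)	`generous`
3	`generous`	no match	`generous`
4	`generous`	`OUS` is not in R2, no deletion	`generous`
5	`generous`	no match	`generous`

Final stem: generous. Without the gener prefix rule, OUS would fall inside R2 and Step 4 would reduce the word to gener.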
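
The region-sensitive part of this trace can be replayed with a small Python sketch based on the published Snowball description, in which the special prefix gener fixes the start of R1. It implements only three rules (Step 1c, the OUSLI rule of Step 2, the OUS rule of Step 4), and every function name is illustrative. Under those rules the word comes out as generous, because OUS starts before R2.

```python
VOWELS = set("aeiouy")
SPECIAL_PREFIXES = ("gener", "commun", "arsen")


def region_start(word: str, start: int = 0) -> int:
    """Index just past the first non-vowel following a vowel, from `start`."""
    for i in range(start + 1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)


def trace(word: str) -> list[tuple[str, str]]:
    """Apply three selected rules and record the word after each step."""
    prefix = next((p for p in SPECIAL_PREFIXES if word.startswith(p)), None)
    r1 = len(prefix) if prefix else region_start(word)
    r2 = region_start(word, r1)
    log = []

    # Step 1c: final y becomes i after a non-vowel that is not the first letter.
    if len(word) > 2 and word[-1] in "yY" and word[-2] not in VOWELS:
        word = word[:-1] + "i"
    log.append(("1c", word))

    # Step 2 (one rule only): ousli -> ous when the suffix starts in R1.
    if word.endswith("ousli") and len(word) - 5 >= r1:
        word = word[:-2]
    log.append(("2", word))

    # Step 4 (one rule only): delete ous when the suffix starts in R2.
    if word.endswith("ous") and len(word) - 3 >= r2:
        word = word[:-3]
    log.append(("4", word))
    return log


for step, value in trace("generously"):
    print(step, value)
```

Running the same function on "dangerously", which has no special prefix, ends at danger: there R2 begins at OUS, so Step 4 removes it.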

[illustrate: side-by-side pipeline for “generously” under Porter1 (m-gated rules) and Porter2 (R1/R2 region rules), the active rule shown at each step, R1/R2 regions shaded on the word string above each Porter2 rule; highlight Step 2 where the suffix is tested against R1]

Variants and history

Martin Porter published the revision in 2001–2002 while building the Snowball framework, motivated by a growing list of known failures accumulated from user reports and comparative studies. The revision was never published as a standalone paper — it lives at snowballstem.org alongside the Snowball compiler and stemmer implementations for over 25 languages.

The name Porter2 is informal; the official name is the Snowball English stemmer. Lucene exposes it via SnowballFilter with language "English".

Snowball also provides stemmers for German2, French, Spanish, Dutch, Finnish, Russian, Arabic, and other languages. Many of them build on the same R1/R2 regions, and some (French, Spanish, Russian) add a further region, RV. Adding a new language to a multilingual pipeline is a configuration change, not an algorithm rewrite.

When to use it

Porter2 vs Porter1. Porter2 costs the same to run, needs no additional dependencies, and produces noticeably fewer mis-stemmings. There is little reason to choose Porter1 for a new system. In Lucene, replace PorterStemFilter with SnowballFilter using language "English". The two algorithms emit different stems for some words, so an existing index must be rebuilt when you switch.

Elasticsearch’s english analyser. The built-in english analyser stems with the stemmer token filter set to "english", which selects the original Porter algorithm, not Porter2. To get Porter2, define a custom analyser whose stemmer filter uses "porter2" (the snowball token filter with language "English" also applies Porter2):

{
  "filter": {
    "my_stemmer": {
      "type": "stemmer",
      "language": "porter2"
    }
  }
}

Porter2 vs KStem. KStem combines suffix-stripping with a small lexicon of known inflections, making it more conservative — it under-stems rather than over-stems. KStem is the better choice when false conflations are a visible problem and recall is secondary. Porter2 gives broader recall; KStem gives cleaner stems on the cases it covers.

Porter2 vs lemmatisation. Lemmatisation uses a morphological lexicon to return the canonical dictionary form. It handles irregular morphology — "went" → "go", "better" → "good" — that no suffix stripper can reach. The cost is a runtime dependency on a lexicon and typically a 5–20× throughput reduction. Choose lemmatisation when stems are surfaced to users (autocomplete labels, facet values), when false conflations degrade result quality measurably, or when downstream NLP tasks require real word forms.

The index-time / query-time consistency rule is unchanged: the same stemmer must run at index time and at query time. Mixing Porter2 at index time with KStem at query time silently breaks matching.