Lovins Stemmer

What it is

The Lovins Stemmer is the first fully described, publicly available English stemming algorithm. Julie Beth Lovins published it in 1968 in Mechanical Translation and Computational Linguistics under the title “Development of a Stemming Algorithm.” Every subsequent stemmer — Porter, Porter2, KStem, Paice/Husk — was built in the shadow of this work.

Like all suffix-stripping stemmers, Lovins takes an inflected or derived word and returns a shorter string, the stem, intended to be shared by a family of related forms. The stem is not required to be a valid word. "generalizing" and "generalisation" should collapse to the same stem so that a search engine can index and match them as a single term.

What distinguishes Lovins from every stemmer that followed it is its architecture: the entire transformation is carried out in a single pass. The algorithm locates the longest suffix that matches a rule, removes it, optionally transforms the result, and stops. There is no iteration.

How it works

The algorithm has three components: a suffix table, a set of condition codes, and a recoding table.

1. Suffix table — 294 rules

The suffix table lists 294 suffixes in descending order of length. The algorithm scans the table from longest to shortest and removes the first suffix that matches the end of the input word, provided two constraints are satisfied:

  • The remaining stem must be at least a minimum length (typically two characters).
  • The condition code attached to that suffix rule must evaluate to true.

Because the scan goes longest-first, "-izations" is tried before "-ations", which is tried before "-ions", and so on. Only one suffix is ever removed.
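
A minimal sketch of the scan in Python (a toy five-entry table standing in for the full 294-rule one; the suffix strings and condition codes follow the published table but are only a fragment, and condition "N" is simplified):

  # Toy fragment of the suffix table: (suffix, condition code), kept in
  # descending order of suffix length so the scan is longest-first.
  TOY_SUFFIXES = [
      ("izations", "B"),
      ("ization", "B"),
      ("ations", "B"),
      ("ing", "N"),
      ("s", "W"),
  ]

  # Illustrative condition codes (the real set is covered in the next
  # subsection).
  CONDITIONS = {
      "B": lambda stem: len(stem) >= 3,
      "N": lambda stem: len(stem) >= 3,  # simplified stand-in
      "W": lambda stem: not stem.endswith(("s", "u")),
  }

  MIN_STEM = 2  # the remaining stem must keep at least two characters

  def longest_match(word):
      """Return (stem, suffix) for the first acceptable match, or
      (word, None) if no rule fires. Only one suffix is ever removed."""
      for suffix, code in TOY_SUFFIXES:
          if not word.endswith(suffix):
              continue
          stem = word[: -len(suffix)]
          if len(stem) < MIN_STEM:
              continue  # stem would be too short; try a shorter suffix
          if not CONDITIONS[code](stem):
              continue  # condition code failed; try the next rule
          return stem, suffix
      return word, None

  print(longest_match("generalizations"))  # ('general', 'izations')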

[illustrate: longest-suffix-first scan over the word “generalizations” — suffixes attempted from longest ("-izations", “-ations”, “-ions”, “-ns”, “-s”) shown as overlapping brackets on the word; the first matching suffix highlighted and removed, leaving the stem “general”]

2. Condition codes — 29 tests

Each suffix rule is tagged with one of 29 condition codes (labelled A through Z, then AA, BB, CC in Lovins’s original notation). A condition code is a boolean test on the candidate stem — the string that would remain after the suffix is stripped. If the test fails, that rule is skipped and the next-longest matching suffix is tried.

Examples of condition types:

Code  Condition on the candidate stem
A     No condition — always applies
B     Length ≥ 3
C     Length ≥ 4
H     Remove the ending only after t or ll
I     Do not remove the ending after o or e
J     Do not remove the ending after a or e
M     Do not remove the ending after a, c, e, or m

Condition codes serve the same protective function as Porter’s measure m — they prevent rules from firing on stems that are too short or structurally inappropriate — but using named boolean predicates rather than a single numerical threshold.
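
Represented in code, the condition codes are naturally a table of named boolean predicates over the candidate stem. A sketch with a handful of the 29 codes (definitions paraphrased from the published table; this is not the complete set):

  # A few of the 29 condition codes as predicates on the candidate stem
  # (the string that would remain after the suffix is stripped).
  CONDITIONS = {
      "A": lambda s: True,                                  # no restriction
      "B": lambda s: len(s) >= 3,                           # minimum length 3
      "C": lambda s: len(s) >= 4,                           # minimum length 4
      "H": lambda s: s.endswith(("t", "ll")),               # only after t or ll
      "I": lambda s: not s.endswith(("o", "e")),            # not after o or e
      "M": lambda s: not s.endswith(("a", "c", "e", "m")),  # not after a, c, e, m
  }

  def rule_applies(code, candidate_stem):
      """True if the candidate stem satisfies the rule's condition code."""
      return CONDITIONS[code](candidate_stem)

  print(rule_applies("H", "admit"))  # True  (ends in t)
  print(rule_applies("M", "item"))   # False (ends in m)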

3. Recoding — ~35 transformation rules

After suffix removal, the stem may have irregular spelling artifacts: a letter dropped at a morpheme boundary, a doubled consonant that should be collapsed, a consonant alternation that should be mapped back (absorpt- versus absorb-, for instance). Lovins called the fix-up stage recoding.

Approximately 35 recoding rules are applied to the raw stem. Each rule is a simple string substitution triggered by the stem’s ending. For example:

stem ends in "bb"  →  replace with "b"    ("robbing" → "rob" → "rob")
stem ends in "iev"  →  replace with "ief"  ("belief" restored from stripping "-ving")
stem ends in "udd"  →  replace with "ud"   ("muddy" → "mud")

Recoding is what distinguishes Lovins from a bare suffix-stripping table: it acknowledges that morphological boundaries cause orthographic changes and attempts to undo them.
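
A sketch of the recoding pass, with only two of the ~35 substitutions (the doubled-consonant rule and iev → ief are from the published list; the surrounding structure is illustrative):

  # Recoding: fix-up substitutions applied once to the raw stem.
  DOUBLES = ("bb", "dd", "gg", "ll", "mm", "nn", "pp", "rr", "ss", "tt")

  RECODINGS = [
      ("iev", "ief"),  # believ -> belief
  ]

  def recode(stem):
      """Undo spelling artifacts left behind by suffix removal."""
      # Collapse a doubled final consonant (robb -> rob, debugg -> debug).
      if stem.endswith(DOUBLES):
          return stem[:-1]
      # Otherwise try the pattern substitutions, triggered by the ending.
      for ending, replacement in RECODINGS:
          if stem.endswith(ending):
              return stem[: -len(ending)] + replacement
      return stem

  print(recode("debugg"))  # debug
  print(recode("believ"))  # belief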

[illustrate: three-stage pipeline for a single token — box 1 “Suffix scan” removes the longest matching suffix; box 2 “Condition check” tests the candidate stem; box 3 “Recoding” applies fix-up substitution — show “believing” flowing through: suffix “-ing” stripped → “believ” checked → recoded to “belief”]

Example

Trace "generalizations" through the algorithm:

Step 1 — Suffix scan. The algorithm tries suffixes from longest to shortest. It finds "-izations" (8 characters) in the table. The condition code for this rule is B (minimum stem length 3). The candidate stem is "general" (7 characters), which clears both the condition and the global minimum length.

Step 2 — Strip. The suffix is removed: "generalizations" → "general".

Step 3 — Recode. The stem "general" does not match any recoding pattern. It is returned as-is.

Final stem: general

Now contrast with "itemization":

Stage           Value        Note
Input           itemization
Suffix matched  -ization     Rule fires; condition B (minimum stem length 3)
After strip     item         4 characters — passes the minimum
After recode    item         No recoding rule applies
Stem            item

And "debugging":

Stage           Value      Note
Input           debugging
Suffix matched  -ing
After strip     debugg
Recode gg → g   debug      Doubled consonant collapsed
Stem            debug
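
Putting the stages together, a toy end-to-end version just large enough to cover the three worked examples (the real algorithm has 294 suffix rules and 29 conditions; everything here is a miniature under those assumptions):

  # Toy single-pass stemmer covering the three worked examples only.
  SUFFIXES = [                       # (suffix, condition), longest first
      ("izations", lambda s: len(s) >= 3),
      ("ization",  lambda s: len(s) >= 3),
      ("ing",      lambda s: len(s) >= 3),
  ]
  DOUBLES = ("bb", "dd", "gg", "ll", "mm", "nn", "pp", "rr", "ss", "tt")

  def stem(word):
      for suffix, cond in SUFFIXES:
          if word.endswith(suffix):
              candidate = word[: -len(suffix)]
              if len(candidate) >= 2 and cond(candidate):
                  word = candidate
                  break              # single pass: stop after one removal
      if word.endswith(DOUBLES):     # recoding: collapse doubled consonant
          word = word[:-1]
      return word

  for w in ("generalizations", "itemization", "debugging"):
      print(w, "->", stem(w))
  # generalizations -> general
  # itemization -> item
  # debugging -> debug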

[illustrate: side-by-side before/after for “generalizations”, “itemization”, and “debugging” — each token shown as a labelled chip, suffix portion shaded, recoded characters highlighted in the final stem]

Variants and history

Lovins published the algorithm as a practical tool for information retrieval work at MIT, where it grew out of Project Intrex, the experimental library-retrieval project. The paper included the full suffix table, all 29 condition codes, and the recoding rules — unusually complete documentation for the era, and the reason the algorithm could be independently replicated.

The 1968 paper attracted relatively little immediate uptake. Porter’s 1980 algorithm, which appeared in a more widely read venue and was implemented in software that researchers could run directly, became the field’s standard. Once Porter established the iterative multi-pass design as the default pattern, Lovins’s single-pass approach came to be seen as a curiosity rather than a competitor.

Implementations appeared in various IR research toolkits through the 1970s and 1980s, and the Snowball project later published a Lovins implementation as a demonstration of its stemmer-definition language. Lucene does not ship a Lovins implementation as a first-class filter; research reproductions exist in Java and Python but are not maintained under any standard library.

No significant variants of the Lovins Stemmer exist — unlike Porter, which spawned Porter2 and the Snowball family. The algorithm is treated in the literature as a fixed historical artefact rather than a living baseline.

When to use it

In practice: rarely, and almost never in production.

The Lovins Stemmer over-stems more aggressively than Porter. Its single-pass, longest-match design means that a long suffix match can strip too much, and the recoding rules do not fully compensate. Words in different semantic fields collapse to the same stem more often than they should.

It remains useful in two narrow contexts:

  • IR research baselines. When a paper needs to report results under multiple stemming conditions, Lovins provides a well-documented historical point of comparison that predates all other described algorithms.
  • Stemming algorithm studies. Because the suffix table and condition codes are fully enumerated in the original paper, the Lovins Stemmer is straightforward to implement from scratch and is often used in coursework or comparative analyses of stemmer design decisions.

For any production English search system, prefer Porter2 / Snowball English or KStem. Both offer substantially better precision whilst retaining the speed advantages of rule-based stemming. If linguistic accuracy is critical — faceted navigation, user-facing suggestions, downstream NLP tasks — use lemmatisation instead.
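
A quick illustration with NLTK's Snowball English stemmer (the nltk package is an assumption of this sketch, not part of the Lovins story):

  from nltk.stem.snowball import SnowballStemmer

  stemmer = SnowballStemmer("english")    # Porter2 / Snowball English
  print(stemmer.stem("generalizations"))  # general
  print(stemmer.stem("itemization"))      # item
  print(stemmer.stem("debugging"))        # debug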

See also