KStem

Stemming Preprocessing Information-Retrieval Query-Parsing Fuzzy-Matching Needs-Review

What it is

KStem — formally the Krovetz stemmer — is an English stemming algorithm published by Robert Krovetz in 1993. It takes a different architectural position from the Porter family: rather than applying suffix-stripping rules unconditionally, it validates each candidate stem against a small word table before committing to it. If the stripped form is not a recognisable English word, KStem backs off and tries a less aggressive reduction, or returns the original form.

The result is a deliberately conservative stemmer: it errs toward under-stemming rather than over-stemming. Where Porter2 reduces "generality" to "general" and "university" to "univers", KStem keeps forms closer to their surface shapes. This makes KStem a strong choice when false conflations are a visible problem: precision-oriented retrieval, stems displayed in a UI, or specialised domains such as legal and medical text where morphological distinctions carry meaning.

KStem ships as KStemFilter in Apache Lucene and is available in Elasticsearch via the kstem token filter.

How it works

KStem runs a three-stage pipeline for each input token.

Stage 1 — Dictionary lookup

Before any suffix is stripped, the token is looked up verbatim in the word table. This table contains a curated set of English words together with their canonical stems. If the token is found, the listed stem is returned immediately and no further processing occurs.

"knives"  →  table hit  →  "knife"

The word table is the primary mechanism distinguishing KStem from pure rule-based stemmers. It is not a full morphological lexicon — it is sized to cover high-frequency ambiguous cases that suffix rules consistently get wrong.
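As a rough illustration of Stage 1, the lookup can be sketched as a plain dictionary access. The table entries and the function name `lookup_stage` are hypothetical and chosen for this example; they are not taken from Krovetz's table or from Lucene's data.

```python
from typing import Optional

# Toy word table: token -> canonical stem.
# Entries are illustrative assumptions, not KStem's real table.
WORD_TABLE = {
    "knives": "knife",
    "knife": "knife",
    "computer": "computer",
}


def lookup_stage(token: str) -> Optional[str]:
    """Return the listed stem on a table hit, or None to signal a miss."""
    return WORD_TABLE.get(token.lower())


print(lookup_stage("knives"))     # knife
print(lookup_stage("computers"))  # None (miss, so later stages would run)
```

A miss returns None rather than the token itself, so a caller can tell "no entry" apart from "entry whose stem equals the token".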

Stage 2 — Inflectional stripping

If no table entry matches, KStem strips inflectional suffixes: plural -s, -es, -ies, past tense -ed, progressive -ing, and comparatives -er, -est. Each stripped candidate is checked against the word table before the strip is accepted. If the candidate is a known word, the strip proceeds; if not, KStem leaves the suffix in place.

"computers"  →  strip -s  →  "computer"  →  table check: known  →  accept  →  "computer"
"axes"       →  strip -es →  "ax"        →  table check: known  →  accept
             →  also try strip -s  →  "axe"  →  known  →  prefer longer stem  →  "axe"

The re-lookup after each candidate strip is the mechanism that separates KStem from Porter. Porter fires a rule and moves on; KStem fires a rule and then asks whether the result makes sense.
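The strip-then-validate loop can be sketched as follows. This is a minimal toy under stated assumptions: `KNOWN_WORDS`, `INFLECTIONAL_RULES`, and `strip_inflection` are hypothetical names, the rule list is a small subset, and details such as doubled consonants ("running") are not handled.

```python
# Hypothetical stand-in for the word table; a real table is far larger.
KNOWN_WORDS = {"computer", "ax", "axe", "pony", "walk"}

# (suffix, replacement) pairs: an illustrative subset, not KStem's rule list.
INFLECTIONAL_RULES = [("ies", "y"), ("es", ""), ("s", ""), ("ed", ""), ("ing", "")]


def strip_inflection(token, known=KNOWN_WORDS):
    """Return the longest table-validated candidate, or the token unchanged."""
    validated = []
    for suffix, replacement in INFLECTIONAL_RULES:
        if token.endswith(suffix) and len(token) > len(suffix):
            candidate = token[: len(token) - len(suffix)] + replacement
            # The strip is accepted only if the result is a known word.
            if candidate in known:
                validated.append(candidate)
    return max(validated, key=len) if validated else token


print(strip_inflection("computers"))  # computer
print(strip_inflection("axes"))       # axe (preferred over the shorter "ax")
print(strip_inflection("glorps"))     # glorps (no candidate is a known word)
```

Preferring the longest validated candidate mirrors the "axes" example above; a token with no validated candidate is returned unchanged.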

Stage 3 — Derivational stripping

If inflectional stripping produces no change, KStem attempts a limited set of derivational reductions: -tion → base, -ness → base, -ity → base, -ment → base, and a handful of others. Again, each candidate is validated against the word table. If the derivational base is not in the table, the strip is refused and the inflectionally-normalised form is returned.

"generalisation"  →  strip -isation  →  "general"   →  table check: known  →  "general"
"university"      →  strip -ity      →  "univers" (+e → "universe")  →  table check: known, but a different word
                  →  KStem returns   →  "university"   (refuses the conflation)
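A comparable sketch for the derivational stage, again with hypothetical names (`KNOWN_WORDS`, `DERIVATIONAL_SUFFIXES`, `strip_derivation`) and a toy table. It only shows the validate-or-refuse behaviour; it never restores a final -e, so it refuses "university" simply because "univers" is not a known word, which is cruder than the handling described above.

```python
# Hypothetical stand-in for the word table.
KNOWN_WORDS = {"kind", "govern", "electric", "universe"}

# Illustrative subset of derivational suffixes, not KStem's full list.
DERIVATIONAL_SUFFIXES = ("ness", "ment", "ity")


def strip_derivation(token, known=KNOWN_WORDS):
    """Strip a derivational suffix only if the remaining base is a known word."""
    for suffix in DERIVATIONAL_SUFFIXES:
        if token.endswith(suffix):
            candidate = token[: len(token) - len(suffix)]
            if candidate in known:
                return candidate
    # No validated base: refuse the strip and keep the input form.
    return token


print(strip_derivation("kindness"))     # kind
print(strip_derivation("electricity"))  # electric
print(strip_derivation("university"))   # university ("univers" is not known)
```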

[illustrate: three-stage pipeline diagram for KStem — token enters Stage 1 (dictionary lookup) with a branch for “hit → return stem” and “miss → continue”; Stage 2 (inflectional stripping) with a candidate check loop showing accept/reject paths; Stage 3 (derivational stripping) with the same accept/reject loop; the word table shown as a shared resource consulted at each stage]

Example

Cases where KStem correctly refuses to conflate:

Input	Porter2 stem	KStem stem	Note
`university`	`univers`	`university`	Porter2 conflates with `"universe"`
`policy`	`polici`	`policy`	Porter2 produces a non-word stem
`electricity`	`electr`	`electric`	KStem stops at a valid word
`corporate`	`corpor`	`corporate`	KStem preserves the valid form
`presumably`	`presum`	`presumably`	Porter2 over-strips; KStem backs off

Cases where KStem under-stems (misses a valid conflation):

Input	Porter2 stem	KStem stem	Note
`generously`	`gener`	`generous`	KStem stops one level early
`computational`	`comput`	`computation`	KStem backs off derivational strip
`democratisation`	`democrat`	`democratis`	Long chain trips the validator

KStem’s under-stemming failures tend to be less damaging than Porter2’s over-stemming failures. "generous" and "generously" share the stem "generous" under KStem; they do not need to collapse all the way to "gener" to support recall for most queries.

[illustrate: side-by-side table of five words (“university”, “electricity”, “corporate”, “computational”, “generously”), showing Porter2 stem and KStem stem for each; over-stemming cases highlighted in one colour and under-stemming cases in another; a column indicating which stemmer produces a valid English word]

Variants and history

Robert Krovetz described the stemmer in a 1993 ACM SIGIR paper, Viewing Morphology as an Inference Process, which framed stemming not as suffix removal but as inference about the underlying lexeme.

The word table in the original implementation was hand-curated. The Lucene implementation (KStemmer.java, with its word lists in the accompanying KStemData classes) ships an expanded version compiled directly into the source as static data, so there is no external file dependency at runtime.

There is no KStem2 or Snowball-style language family. KStem is English-only by design; the word table approach does not transfer to languages with richer morphology without a much larger lexicon investment.

Configuration

Elasticsearch / OpenSearch

The kstem filter is built in; no plugin is required:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "kstem_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "kstem"]
        }
      }
    }
  }
}

To use KStem via the named filter type explicitly:

{
  "filter": {
    "my_kstem": {
      "type": "kstem"
    }
  }
}

Note: the generic stemmer token filter selects its algorithm through the "language" parameter. In Elasticsearch, "light_english" selects KStem as well, whereas "minimal_english" is a lighter, plural-only stemmer and a different algorithm. To get KStem unambiguously, use "type": "kstem" or the built-in kstem filter name.
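One way to inspect what a filter chain emits is to send sample text to the cluster's `_analyze` endpoint. The sketch below only builds the request body and prints it; the helper name `build_analyze_request` is hypothetical, and sending the request (a POST to `/_analyze`) is left out because it depends on your client and cluster setup.

```python
import json


def build_analyze_request(text, filters=("lowercase", "kstem")):
    """Build an ad-hoc analysis request that mirrors kstem_analyzer above."""
    return {
        "tokenizer": "standard",
        "filter": list(filters),
        "text": text,
    }


body = build_analyze_request("universities and computers")
print(json.dumps(body, indent=2))
```

Swapping "kstem" for another filter name in `filters` gives a quick side-by-side comparison of the tokens each chain produces.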

Solr

<filter class="solr.KStemFilterFactory"/>

Apply the filter identically in both <analyzer type="index"> and <analyzer type="query"> blocks so that indexed terms and query terms are stemmed the same way.

When to use it

Prefer KStem over Porter2 when:

Stems are visible to users. Autocomplete suggestions, facet labels, and spell-correction candidates benefit from KStem’s valid-word output. "electric" is a usable label; "electr" is not.
Precision matters more than recall. Legal search, medical literature retrieval, and patent databases contain terms where morphological distinctions are substantive. "operation" and "operator" should not collapse into the same index term.
False conflations are already a known problem. If query log analysis shows that "university" queries are returning documents about "universe", switching from Porter2 to KStem directly addresses the failure mode.
The corpus is standard edited English. KStem’s word table is tuned to general English vocabulary. It degrades on domain neologisms and technical abbreviations not in the table — these fall through to suffix-stripping, which behaves like a simplified Porter.
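A simple way to audit false conflations is to group a vocabulary sample by stem and look at the groups that contain more than one word. The sketch below is stemmer-agnostic: `conflation_groups` is a hypothetical helper, and the two lookup dictionaries are stand-ins built from the stems quoted in this article, not calls into a real stemmer.

```python
from collections import defaultdict


def conflation_groups(vocabulary, stem):
    """Map each stem to the words it merges, keeping only groups of 2+ words."""
    groups = defaultdict(set)
    for word in vocabulary:
        groups[stem(word)].add(word)
    return {s: sorted(words) for s, words in groups.items() if len(words) > 1}


# Stand-in stemmers (assumptions for illustration): unknown words map to themselves.
PORTER2_LIKE = {"university": "univers", "universe": "univers"}
KSTEM_LIKE = {"university": "university", "universe": "universe"}

vocabulary = ["university", "universe", "policy"]

print(conflation_groups(vocabulary, lambda w: PORTER2_LIKE.get(w, w)))
# {'univers': ['universe', 'university']}
print(conflation_groups(vocabulary, lambda w: KSTEM_LIKE.get(w, w)))
# {}
```

In practice `stem` would wrap whichever analyzer is under evaluation, and the vocabulary would come from query logs or an index term dump.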

Stick with Porter2 when:

Recall is the primary objective and false conflations are acceptable.
Throughput is critical. KStem’s table lookups add overhead over pure suffix-stripping, though both run in the single-digit microsecond range per token on modern hardware.
Multilingual pipelines are in scope — KStem is English-only; Snowball covers 25+ languages.

KStem vs lemmatisation. KStem does not handle irregular morphology in general; it normalises only those irregular forms that are listed in its word table. "went" stays "went"; "better" stays "better". If you need canonical dictionary forms and can tolerate a runtime lexicon dependency and lower throughput, lemmatisation is the right choice. KStem occupies the middle ground: more linguistically aware than Porter, lighter-weight than a full morphological analyser.