Stemming
-
Snowball Stemmer
Snowball is a string-processing language and framework for writing stemming algorithms, developed by Martin Porter. It ships stemmers for more than 20 languages and is the source of the Porter2 (English) stemmer, which is available in search platforms such as Apache Lucene, Solr, and Elasticsearch.
-
Hunspell
Hunspell is a dictionary-based morphological analyser and spell checker that produces lemmas by stripping affixes and looking up base forms in a language-specific dictionary.
-
Inflection
Inflection is the morphological process by which a single lexeme takes on different surface forms to express grammatical categories such as tense, number, and case — the variation that lemmatisation is designed to undo.
-
Morphological Analysis
Morphological analysis decomposes words into their constituent morphemes — stems, prefixes, suffixes, and inflectional endings — enabling NLP systems to recognise that surface-form variants refer to the same underlying concept.
-
Suffix
A suffix is a bound morpheme appended to the right end of a word stem, either encoding grammatical properties or creating new words. Suffixes are the primary target of English stemming algorithms.
-
Lancaster Stemmer
The Lancaster Stemmer is an alternative name for the Paice/Husk Stemmer — an aggressive, iterative English stemming algorithm developed at Lancaster University.
-
Lovins Stemmer
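The Lovins Stemmer is the earliest published stemming algorithm (1968). It reduces English words to stems in a single pass by stripping the longest matching ending from a table of 294 endings, each tied to a context condition, and then recoding the remaining stem with a small set of transformation rules.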
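
The longest-match idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: ENDINGS is a tiny hand-picked list rather than the published table, the per-ending context conditions are reduced to a single minimum stem length, and the function name strip_longest_ending is made up for this example.

```python
# Illustrative subset only; the published Lovins table is far larger and
# attaches a context condition to every ending.
ENDINGS = ["ational", "ations", "ation", "ings", "ing", "ness", "ed", "es", "s"]

# Assumption for this sketch: the only condition is a minimum stem length.
MIN_STEM_LEN = 2


def strip_longest_ending(word: str) -> str:
    """Remove the longest listed ending that leaves a long enough stem."""
    # Try longer endings first so that "ational" wins over "ation" or "s".
    for ending in sorted(ENDINGS, key=len, reverse=True):
        stem_len = len(word) - len(ending)
        if word.endswith(ending) and stem_len >= MIN_STEM_LEN:
            return word[:stem_len]
    return word  # no ending applies


if __name__ == "__main__":
    for w in ["relational", "happiness", "sings", "cat"]:
        print(w, "->", strip_longest_ending(w))
```

With this toy table, "sings" loses only its final "s": the longer ending "ings" is rejected because it would leave a one-letter stem.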
-
Paice/Husk Stemmer
The Paice/Husk Stemmer is an iterative English stemmer using a single compact rule table with a loop-back architecture, producing aggressively short stems at the cost of over-stemming.
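
A rough Python sketch of the loop-back idea follows. The rule table here is hypothetical and far smaller than the real one, the tuple layout (ending, characters to remove, string to append, continue flag) is an assumption made for readability, and the acceptability check is a simplified version of the stem-validity conditions usually described for this stemmer.

```python
# Hypothetical mini rule table: (ending, chars_to_remove, append, keep_going).
# keep_going=True sends the shortened word back through the table again.
RULES = [
    ("ness", 4, "", True),
    ("less", 4, "", True),
    ("ful", 3, "", True),
    ("ly", 2, "", True),
    ("ing", 3, "", True),
    ("s", 1, "", False),  # terminating rule: stop after applying it
]


def acceptable(stem: str) -> bool:
    """Simplified validity check for a candidate stem."""
    if not stem:
        return False
    if stem[0] in "aeiou":
        return len(stem) >= 2
    return len(stem) >= 3 and any(c in "aeiouy" for c in stem)


def paice_husk_sketch(word: str) -> str:
    """Apply rules repeatedly until a terminating rule fires or none applies."""
    while True:
        for ending, remove, append, keep_going in RULES:
            if not word.endswith(ending):
                continue
            candidate = word[: len(word) - remove] + append
            if not acceptable(candidate):
                continue  # try the next rule instead
            word = candidate
            if keep_going:
                break  # loop back and rescan the table
            return word
        else:
            return word  # a full scan applied nothing


if __name__ == "__main__":
    for w in ["hopefulness", "carelessly", "witness", "sing"]:
        print(w, "->", paice_husk_sketch(w))
```

In this sketch "hopefulness" passes through the table twice (losing "ness", then "ful"), while "witness" collapses to "wit", a small example of the over-stemming such a loop can produce.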
-
KStem
KStem is a conservative English stemmer that combines suffix-stripping with a built-in lexicon to avoid false conflations, producing cleaner stems than Porter2 at the cost of a dictionary dependency.
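
The lexicon check can be illustrated with a small Python sketch. It is not KStem itself: LEXICON, SUFFIX_RULES, and lexicon_stem are hypothetical names with toy contents, chosen only to show how a dictionary lookup can veto a suffix-stripping step.

```python
# Toy lexicon; a real lexicon-backed stemmer ships a much larger word list.
LEXICON = {"university", "universe", "hope", "walk", "operate"}

# Hypothetical (suffix, replacement) pairs, tried in order.
SUFFIX_RULES = [
    ("ies", "y"),
    ("es", "e"),
    ("es", ""),
    ("s", ""),
    ("ed", "e"),
    ("ed", ""),
    ("ing", "e"),
    ("ing", ""),
]


def lexicon_stem(word: str) -> str:
    """Strip a suffix only when the result is a known lexicon entry."""
    if word in LEXICON:
        return word  # already a base form
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            if candidate in LEXICON:
                return candidate
    return word  # no lexicon-backed stem found, so leave the word alone


if __name__ == "__main__":
    for w in ["universities", "hoping", "walked", "news"]:
        print(w, "->", lexicon_stem(w))
```

Here "universities" maps to "university" and stays distinct from "universe", and "news" is left untouched because "new" is not in the toy lexicon. The price is the dictionary dependency mentioned above: words missing from the lexicon are never stemmed.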
-
Porter Stemmer
The Porter Stemmer is a rule-based English suffix-stripping algorithm that reduces words to a stem using five sequential transformation passes gated by a vowel-consonant measure.
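
The measure can be computed directly from Porter's description of a word as [C](VC)^m[V], where C and V are runs of consonants and vowels. The Python below is a minimal sketch, assuming lowercase ASCII input; the function names are illustrative, and apply_eed_rule shows just one measure-gated rule, (m > 0) EED -> EE, rather than the whole algorithm.

```python
def is_consonant(word: str, i: int) -> bool:
    """a, e, i, o, u are vowels; y is a vowel only when it follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a consonant at the start of a word or after a vowel
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Count the vowel-consonant (VC) sequences in the stem."""
    m = 0
    prev_was_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_was_vowel:
            m += 1  # a vowel run has just been closed by a consonant
        prev_was_vowel = not cons
    return m


def apply_eed_rule(word: str) -> str:
    """Rewrite -eed as -ee only when the preceding stem has m > 0."""
    if word.endswith("eed"):
        stem = word[:-3]
        if measure(stem) > 0:
            return stem + "ee"
    return word


if __name__ == "__main__":
    print(measure("tree"), measure("trouble"), measure("private"))  # 0 1 2
    print(apply_eed_rule("agreed"), apply_eed_rule("feed"))  # agree feed
```

The gate is what keeps short words intact: "feed" has the stem "f" with measure 0, so the rule does not fire, whereas "agreed" has the stem "agr" with measure 1 and becomes "agree".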
-
Porter2 Stemmer
Porter2 is a revised English suffix-stripping algorithm from the Snowball project that fixes around 200 mis-stemmings in the original Porter Stemmer. Elasticsearch exposes it as the porter2 language option of its stemmer token filter.
-
Stemming
Stemming reduces a word to a base form by stripping affixes using rule-based heuristics, allowing variant forms such as “running” and “runs” to match a single index term. Irregular forms such as “ran” are not reachable by affix stripping and generally require lemmatisation.