Snowball Stemmer

What it is

Snowball is both a language for writing stemming algorithms and a collection of stemmers compiled from that language. Created by Martin Porter — the same researcher who published the original Porter stemmer in 1980 — Snowball was designed to make it easier to write, test, and publish stemmers for multiple languages in a consistent, maintainable way.

The name is a tribute to SNOBOL, a string-processing language from the 1960s. Since its release, the framework has accumulated stemmers for more languages over time.

How it works

The Snowball language defines primitive operations on strings — suffix removal, prefix removal, conditional checks on string endings, and vowel/consonant categorisation. A Snowball program is a set of rules encoded in this language; a compiler (Snowball) translates the program into C, Java, Python, or other target languages.

All Snowball stemmers follow the same broad pattern:

  1. Identify the regions of the word in which suffixes may be removed (typically named R1 and R2, defined per language).
  2. Apply a sequence of conditional suffix-removal steps, removing the longest matching suffix in each step.
  3. Clean up double consonants, trailing vowels, or other artefacts left by previous steps.
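
The three steps above can be sketched in plain Python. This is a toy illustration under stated assumptions, not Porter2 or any stemmer shipped with Snowball: the suffix list, the cleanup rule, and the names `r1_start` and `toy_stem` are made up for the example. Only the region rule (R1 starts after the first non-vowel that follows a vowel) follows the convention Snowball stemmers commonly use.

```python
VOWELS = set("aeiouy")

# Hypothetical suffix list for illustration only; real stemmers use
# several rule groups with per-suffix conditions.
SUFFIXES = ["ations", "ation", "ings", "ing", "ed", "es", "s"]


def r1_start(word: str) -> int:
    """Index where R1 begins: just after the first non-vowel following a vowel."""
    for i in range(1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)  # no such position, so R1 is empty


def toy_stem(word: str) -> str:
    word = word.lower()
    start = r1_start(word)  # step 1: find the region

    # Step 2: remove the longest listed suffix that lies wholly inside R1.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= start:
            word = word[: -len(suffix)]
            break

    # Step 3: tidy up, here by undoubling a final double consonant.
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in VOWELS:
        word = word[:-1]
    return word


for w in ["running", "hopped", "sing", "cats"]:
    print(w, "R1 =", repr(w[r1_start(w):]), "->", toy_stem(w))
```

Note how "sing" is left alone: its "ing" ending starts before R1, so the region check blocks the removal. Guards of this kind are what keep short words from being stripped down to nothing.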

Because the Snowball compiler emits ordinary source code in the target language (C, for instance, which is then compiled to native code), the resulting stemmers are fast enough for production indexing pipelines.

Example

English (Porter2/Snowball):

Input            Stem
generalises      generalis
generalisation   generalis
generalising     generalis
electrically     electr
provision        provis

German (Snowball):

Input        Stem
Auffassung   auffass
Wörter       wort
Häuser       haus

Variants and history

Martin Porter released Snowball in 2001 as an open-source project; it is now maintained at snowballstem.org. The current Snowball distribution ships stemmers for languages including:

European languages: English (Porter2), German, French, Spanish, Portuguese, Italian, Dutch, Swedish, Norwegian, Danish, Finnish, Hungarian, Romanian, Russian, Turkish, Basque, Catalan, Irish

Other: Arabic, Tamil

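The Porter2 stemmer (also called the “English Snowball stemmer”) supersedes the original 1980 Porter stemmer. It is available in Lucene and in the search engines built on it (Elasticsearch, OpenSearch, Solr), but it is not always the default: Lucene's EnglishAnalyzer, for instance, still applies the original Porter stemmer, so check which stemmer a given analyser actually uses. Its improvements over Porter include a small list of exceptional word forms, revised handling of some suffixes, and a cleaner algorithm structure.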

Snowball stemmers are suffix-stripping algorithms — they are faster and more predictable than dictionary-based approaches like Hunspell, but they do not consult a lexicon, so they can produce non-words as stems (electrically → electr).
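
A non-word stem is harmless in keyword search because indexed terms and query terms pass through the same stemmer, so only equality of stems matters. The sketch below illustrates this with a tiny inverted index. It hard-codes the English stems from the example table instead of calling a real stemmer, and `STEMS`, `build_index`, and `search` are illustrative names, not part of any library.

```python
from collections import defaultdict

# Stems copied from the English example table; a real pipeline would call
# a Snowball stemmer here instead of using a lookup table.
STEMS = {
    "generalises": "generalis",
    "generalisation": "generalis",
    "generalising": "generalis",
    "electrically": "electr",
    "provision": "provis",
}


def stem(token: str) -> str:
    # Tokens missing from the table pass through unchanged in this sketch.
    return STEMS.get(token, token)


def build_index(docs: dict) -> dict:
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[stem(token)].add(doc_id)
    return index


def search(index: dict, query: str) -> set:
    """Return ids of documents containing every stemmed query term."""
    terms = [stem(t) for t in query.lower().split()]
    results = [index.get(t, set()) for t in terms]
    return set.intersection(*results) if results else set()


docs = {
    1: "the model generalises well",
    2: "generalisation is hard",
    3: "wired electrically",
}
index = build_index(docs)
print(sorted(search(index, "generalising")))  # [1, 2]
```

The query "generalising" matches documents 1 and 2 even though neither contains that exact word, because all three forms share the stem "generalis". The stem itself never needs to be shown to a user.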

When to use it

Use a Snowball stemmer as the default stemming choice for any language it supports. For English specifically, use Porter2. For other languages, use the language-specific Snowball stemmer rather than attempting to adapt an English one.

Snowball stemmers are the right choice when:

  • You need fast, deterministic stemming with no external dictionary dependency.
  • You are building a traditional keyword search pipeline.
  • The language is one of the 20+ Snowball supports.

Prefer Hunspell or a morphological analyser when you need true lemmatisation (actual dictionary base forms) rather than approximate stems — for example, when your application displays the stem to users, or when your language has highly irregular morphology.

See also