Preprocessing
-
Snowball Stemmer
Snowball is a string-processing language and framework for writing stemming algorithms, developed by Martin Porter. It ships stemmers for more than twenty languages and is the source of the Porter2 (English) stemmer used by Lucene-based search engines such as Elasticsearch and Solr.
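A quick look via NLTK's wrapper, assuming nltk is installed:

```python
# Snowball stemmers via NLTK's wrapper (pip install nltk).
from nltk.stem.snowball import SnowballStemmer

print(SnowballStemmer.languages)      # languages available in this build

english = SnowballStemmer("english")  # the Porter2 algorithm
german = SnowballStemmer("german")

print(english.stem("running"))        # 'run'
print(german.stem("Katzen"))          # 'katz'
```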
-
Query-time Analysis
Query-time analysis is the analysis chain applied to the query string before it is matched against the index. It must produce terms compatible with those generated at index time.
-
Index-time Analysis
Index-time analysis is the analysis chain applied to document text when it is ingested into the index. The terms it produces are what get stored in the inverted index and matched against at search time.
-
Case Folding
Case folding is locale-aware lowercasing that correctly handles languages where simple ASCII lowercasing produces wrong results — such as Turkish dotted-i or German sharp-s.
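In Python, str.casefold() applies Unicode full case folding, while locale-sensitive cases such as Turkish need ICU; a small illustration:

```python
# str.lower() is a per-character mapping; str.casefold() is full case folding.
print("Straße".lower())      # 'straße'  (sharp-s preserved)
print("Straße".casefold())   # 'strasse' (full folding expands ß to ss)

# Turkish dotted/dotless i needs locale-aware folding, which the built-ins
# do not provide; PyICU can fold per locale:
# from icu import UnicodeString, Locale
# print(str(UnicodeString("İstanbul").toLower(Locale("tr"))))  # 'istanbul'
```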
-
Analyzer
An analyzer is a named, reusable analysis chain configuration in Solr, Elasticsearch, or OpenSearch, combining character filters, a tokeniser, and token filters into a unit that can be assigned to fields.
-
Analysis Chain
An analysis chain is the ordered pipeline of tokeniser and token filters that transforms raw text into index terms. The same chain (or a compatible one) must be applied at both index time and query time.
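A minimal sketch of a chain as composed Python functions; the tokeniser, filters, and stop list here are illustrative, not any engine's defaults:

```python
import re

def tokenise(text):                  # tokeniser: extract alphanumeric runs
    return re.findall(r"[^\W_]+", text)

def lowercase(tokens):               # token filter 1
    return [t.lower() for t in tokens]

def stop_filter(tokens, stops=frozenset({"the", "is", "at"})):
    return [t for t in tokens if t not in stops]   # token filter 2

def analyse(text):                   # the chain: order matters
    return stop_filter(lowercase(tokenise(text)))

# The same chain runs at index time and at query time, so terms line up.
assert analyse("The CAT is at Home") == analyse("the cat IS AT home")
```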
-
Phonetic Encoding
Phonetic encoding maps a word to a compact code that represents its pronunciation, so that words which sound alike but are spelled differently produce the same code and match one another.
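Soundex is the classic scheme; a compact sketch of one common variant of its rules:

```python
def soundex(word: str) -> str:
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    out, prev = word[0].upper(), codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:    # skip vowels, collapse repeated codes
            out += code
        if ch not in "hw":           # h and w do not break a run of one code
            prev = code
    return (out + "000")[:4]         # pad/truncate to letter + three digits

print(soundex("Robert"), soundex("Rupert"))   # R163 R163
print(soundex("Smith"), soundex("Smyth"))     # S530 S530
```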
-
Decompounding
Decompounding splits compound words — common in German, Dutch, and Scandinavian languages — into their component tokens so that searches for constituents match the full compound at index and query time.
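A minimal greedy decompounder over a toy lexicon; production systems use frequency-ranked dictionaries and also handle linking morphemes such as the German 's':

```python
LEXICON = {"donau", "dampf", "schiff", "fahrt", "finanz", "minister"}

def decompound(word, lexicon=LEXICON, min_len=3):
    word = word.lower()
    parts, start = [], 0
    while start < len(word):
        # take the longest dictionary entry starting at `start`
        for end in range(len(word), start + min_len - 1, -1):
            if word[start:end] in lexicon:
                parts.append(word[start:end])
                start = end
                break
        else:
            return [word]            # no full segmentation: keep the compound
    return parts

print(decompound("Donaudampfschifffahrt"))  # ['donau', 'dampf', 'schiff', 'fahrt']
```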
-
Hunspell
Hunspell is a dictionary-based morphological analyser and spell checker that produces lemmas by stripping affixes and looking up base forms in a language-specific dictionary.
-
Inflection
Inflection is the morphological process by which a single lexeme takes on different surface forms to express grammatical categories such as tense, number, and case — the variation that lemmatisation is designed to undo.
-
Morphological Analysis
Morphological analysis decomposes words into their constituent morphemes — stems, prefixes, suffixes, and inflectional endings — enabling NLP systems to recognise that surface-form variants refer to the same underlying concept.
-
Suffix
A suffix is a bound morpheme appended to the right end of a word stem, encoding grammatical properties or creating new words — and the primary target of every English stemming algorithm.
-
Lancaster Stemmer
The Lancaster Stemmer is an alternative name for the Paice/Husk Stemmer — an aggressive, iterative English stemming algorithm developed at Lancaster University.
-
Lovins Stemmer
The Lovins Stemmer is the earliest published stemming algorithm (1968), reducing English words to stems in a single pass by stripping the longest matching suffix from a table of 294 endings and then tidying the result with a set of recoding rules.
-
Paice/Husk Stemmer
The Paice/Husk Stemmer is an iterative English stemmer using a single compact rule table with a loop-back architecture, producing aggressively short stems at the cost of over-stemming.
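NLTK ships an implementation under the Lancaster name; its aggressiveness is easy to see next to Porter:

```python
from nltk.stem import LancasterStemmer, PorterStemmer

print(PorterStemmer().stem("maximum"))        # 'maximum' (left alone)
print(LancasterStemmer().stem("maximum"))     # 'maxim'
print(LancasterStemmer().stem("presumably"))  # 'presum'
```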
-
ASCII Folding
ASCII folding maps accented and special characters to their closest ASCII equivalents using a lookup table, improving recall for users who omit diacritics at the cost of collapsing distinctions that may be semantically meaningful.
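A rough approximation using Unicode decomposition; real filters (such as Lucene's) use an explicit table that also covers ligatures like œ:

```python
import unicodedata

def ascii_fold(text: str) -> str:
    # Decompose, drop combining marks, keep only what is ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed
                   if not unicodedata.combining(c) and ord(c) < 128)

print(ascii_fold("café naïve São"))   # 'cafe naive Sao'
print(ascii_fold("œuvre"))            # 'uvre': ligatures need a real mapping table
```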
-
Decimal Digit Filter
A decimal digit filter maps Unicode decimal digit characters from any script to their ASCII 0–9 equivalents, ensuring that numbers written in Eastern Arabic, Devanagari, Thai, and other numeral systems match the same query regardless of which digit form was used.
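A one-function sketch using the standard library's Unicode tables:

```python
import unicodedata

def fold_digits(text: str) -> str:
    # Map any Unicode decimal digit to ASCII; leave other characters alone.
    return "".join(
        str(d) if (d := unicodedata.decimal(ch, None)) is not None else ch
        for ch in text
    )

print(fold_digits("٤٢"))   # Eastern Arabic -> '42'
print(fold_digits("๔๒"))   # Thai -> '42'
print(fold_digits("४२"))   # Devanagari -> '42'
```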
-
Elision Filter
An elision filter is a token filter that strips language-specific clitic prefixes — such as French l’ and d’ — from the start of tokens, leaving the bare stem for indexing and matching.
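A regex sketch with a partial clitic list; real filters make the list configurable per language:

```python
import re

# The apostrophe may be ASCII ' or typographic ’.
ELISION = re.compile(r"^(?:l|d|j|t|s|m|n|qu|c)['’]", re.IGNORECASE)

def elide(tokens):
    return [ELISION.sub("", t) for t in tokens]

print(elide(["l’avion", "d'été", "qu’elle", "plume"]))
# ['avion', 'été', 'elle', 'plume']
```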
-
HTML Strip
HTML stripping is a character-level preprocessing stage that removes markup tags and decodes HTML entities from raw text before it reaches the tokeniser, preventing angle brackets and entity sequences from appearing as index terms.
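A sketch built on the standard library's parser, which also decodes entities:

```python
from html.parser import HTMLParser
from io import StringIO

class HTMLStripper(HTMLParser):
    """Collect text content and drop tags; charrefs decode to characters."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.buf = StringIO()

    def handle_data(self, data):
        self.buf.write(data)

def strip_html(markup: str) -> str:
    stripper = HTMLStripper()
    stripper.feed(markup)
    return stripper.buf.getvalue()

print(strip_html("<p>caf&eacute; &amp; <b>tea</b></p>"))   # 'café & tea'
```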
-
KStem
KStem is a conservative English stemmer that combines suffix-stripping with a built-in lexicon to avoid false conflations, returning valid dictionary words where Porter2 would emit truncated stems, at the cost of a dictionary dependency.
-
Length Filter
A length filter is a token filter in an analysis chain that discards any token whose character length falls outside a configured minimum and maximum bound, removing noise tokens produced by tokenisation or upstream rewriting.
-
Lowercasing
Lowercasing converts every character in a string to its lowercase form, eliminating case variation so that ‘HTTP’, ‘Http’, and ‘http’ map to a single index term.
-
Normalisation
Normalisation transforms raw text into a consistent, canonical form — lowercasing, accent stripping, Unicode standardisation — so that surface variants of the same term map to a single index entry.
-
Pattern Replace Filter
A pattern replace filter applies a regular expression substitution to each token in an analysis chain, rewriting token text in place without changing token boundaries — distinct from a pattern tokeniser, which splits the raw character stream.
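The per-token equivalent of re.sub in Python:

```python
import re

def pattern_replace(tokens, pattern=re.compile(r"[-_]"), repl=""):
    # Rewrite each token's text in place; token boundaries are untouched.
    return [pattern.sub(repl, t) for t in tokens]

print(pattern_replace(["wi-fi", "foo_bar"]))   # ['wifi', 'foobar']
```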
-
Porter Stemmer
The Porter Stemmer is a rule-based English suffix-stripping algorithm that reduces words to a stem using five sequential transformation passes gated by a vowel-consonant measure.
-
Porter2 Stemmer
Porter2 is a revised English suffix-stripping algorithm from the Snowball project that fixes around 200 mis-stemmings in the original Porter Stemmer and is the default stemmer in Elasticsearch’s english analyser.
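The two can be compared side by side through NLTK, assuming nltk is installed; the handling of '-ly' is one visible difference:

```python
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

porter = PorterStemmer()              # the original algorithm (plus NLTK tweaks)
porter2 = SnowballStemmer("english")  # Snowball's revised English stemmer

print(porter.stem("fairly"))    # 'fairli'
print(porter2.stem("fairly"))   # 'fair'
```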
-
Stop Word
A stop word is a high-frequency function word — such as the, is, or at — removed from a token stream during analysis to reduce index noise and improve retrieval efficiency.
-
Stop Word Filter
A stop word filter is a token filter in an analysis chain that removes stop words from the token stream at index time and query time, reducing index size and suppressing high-frequency noise terms.
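A bare-bones version; real filters (Lucene's, for instance) typically also record position gaps so phrase queries still behave sensibly:

```python
STOP_WORDS = frozenset({"a", "an", "and", "at", "is", "of", "the", "to"})

def remove_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["the", "cat", "is", "at", "home"]))   # ['cat', 'home']
```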
-
Trim Filter
A trim filter is a token filter that strips leading and trailing whitespace characters from each token in the analysis stream, leaving the token’s interior content unchanged.
-
Unicode Normalisation
Unicode normalisation resolves the fact that a single visible character can be encoded multiple ways, standardising text to one of four forms — NFC, NFD, NFKC, or NFKD — before comparison, indexing, or hashing.
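The composed/decomposed distinction in a few lines of Python:

```python
import unicodedata

composed = "é"              # U+00E9, one code point
decomposed = "e\u0301"      # 'e' plus combining acute, two code points

print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
print(unicodedata.normalize("NFKC", "ﬁ"))                    # 'fi' (compatibility)
```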
-
CJK Tokeniser
A CJK tokeniser segments Chinese, Japanese, and Korean text into tokens by splitting at every character or by applying a dictionary and statistical model to identify word boundaries.
-
Thai Tokeniser
A Thai tokeniser segments Thai script into words by combining a word-boundary dictionary with statistical or ML models, since Thai is written without spaces between words.
-
Path Hierarchy Tokeniser
A path hierarchy tokeniser splits a path string into every prefix of its hierarchy, so that a document at /a/b/c is also findable by /a/b or /a, enabling subtree search on file paths, URL components, and category trees.
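A sketch of the prefix expansion:

```python
def path_hierarchy(path: str, sep: str = "/"):
    # Emit every ancestor prefix of the path, shortest first.
    parts = [p for p in path.split(sep) if p]
    return [sep + sep.join(parts[:i + 1]) for i in range(len(parts))]

print(path_hierarchy("/a/b/c"))   # ['/a', '/a/b', '/a/b/c']
```
-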
ICU Tokeniser
The ICU tokeniser applies ICU BreakIterator rules to split text into tokens, extending UAX #29 with locale-aware dictionary segmentation for CJK and Thai and support for custom script rules.
-
Unigram Language Model Tokeniser
The Unigram LM tokeniser builds a subword vocabulary top-down: it begins with a large candidate set and iteratively prunes the entries whose removal least increases corpus log-loss, producing a probability distribution over possible segmentations.
-
SentencePiece
SentencePiece is a language-agnostic subword tokeniser that trains directly on raw Unicode text, encodes whitespace as the ▁ symbol, and produces a fully reversible token sequence using either BPE or Unigram LM as the underlying algorithm.
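Typical usage of the sentencepiece package; the file name, model prefix, and vocabulary size below are placeholders:

```python
import sentencepiece as spm

# Train on a raw-text file, one sentence per line.
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="demo",
    vocab_size=8000, model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="demo.model")
pieces = sp.encode("Hello world", out_type=str)
print(pieces)             # e.g. ['▁Hello', '▁world']: ▁ marks a preceding space
print(sp.decode(pieces))  # 'Hello world', recovered exactly
```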
-
WordPiece
WordPiece is a subword tokenisation algorithm that builds a vocabulary by iteratively merging symbol pairs chosen to maximise training-corpus likelihood, rather than raw frequency. It is the tokeniser used in BERT and its derivatives.
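Encoding against an existing WordPiece vocabulary is greedy longest-match-first; a sketch with a toy vocabulary (BERT's has about 30,000 entries):

```python
VOCAB = {"un", "##aff", "##able", "aff", "[UNK]"}

def wordpiece(word, vocab=VOCAB):
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:                  # longest vocab entry from `start`
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece        # '##' marks a continuation piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]                # no segmentation exists
        pieces.append(match)
        start = end
    return pieces

print(wordpiece("unaffable"))   # ['un', '##aff', '##able']
```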
-
Byte Pair Encoding
Byte pair encoding is a data-compression algorithm repurposed for NLP to build subword vocabularies by iteratively merging the most frequent adjacent symbol pair in a training corpus.
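A short sketch of the merge-learning loop in the style of Sennrich et al.:

```python
from collections import Counter

def merge_word(word, pair, new_sym):
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            out.append(new_sym)             # replace the pair with one symbol
            i += 2
        else:
            out.append(word[i])
            i += 1
    return tuple(out)

def learn_bpe(corpus, num_merges):
    words = Counter(tuple(w) for w in corpus)   # words as character tuples
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in words.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)        # most frequent adjacent pair
        merges.append(best)
        new_words = Counter()
        for word, freq in words.items():
            new_words[merge_word(word, best, best[0] + best[1])] += freq
        words = new_words
    return merges

print(learn_bpe(["low", "lower", "lowest", "low"], 3))
# [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```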
-
Subword Tokenisation
Subword tokenisation splits words into smaller vocabulary units — fragments between characters and whole words — so a fixed vocabulary can represent any input string, including words never seen during training.
-
Unicode Tokeniser
A Unicode tokeniser splits text into tokens using Unicode character categories and the UAX #29 word-boundary rules, giving consistent default boundaries across scripts; unspaced scripts such as Chinese and Thai still require dictionary-based segmentation on top.
-
Regex Tokeniser
A regex tokeniser defines token boundaries with a regular expression, either splitting on delimiter matches or extracting token matches — the generalisation underlying whitespace, punctuation, and word tokenisers.
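Both styles in a few lines of Python:

```python
import re

text = "state-of-the-art NLP, est. 2024"
print(re.findall(r"\w+(?:[-.]\w+)*", text))  # extraction: tokens are the matches
# ['state-of-the-art', 'NLP', 'est', '2024']
print(re.split(r"[\s,]+", text))             # splitting: split on delimiters
# ['state-of-the-art', 'NLP', 'est.', '2024']
```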
-
Sentence Tokeniser
A sentence tokeniser splits a document into individual sentences, establishing the boundary between document-level and word-level processing — a step that is harder than it appears because full stops serve multiple roles.
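NLTK's Punkt model illustrates the difficulty, assuming nltk and its punkt data are installed:

```python
from nltk.tokenize import sent_tokenize   # needs nltk.download("punkt")

text = "Dr. Smith arrived at 3 p.m. sharp. He left early."
print(sent_tokenize(text))
# ['Dr. Smith arrived at 3 p.m. sharp.', 'He left early.']
```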
-
Word Tokeniser
A word tokeniser splits text into tokens at word boundaries using rules or regular expressions, correctly handling punctuation, contractions, hyphenation, and URLs where a whitespace split would fail.
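NLTK's Treebank-style tokeniser shows the contraction and punctuation handling:

```python
from nltk.tokenize import word_tokenize   # needs nltk.download("punkt")

print(word_tokenize("Don't split URLs like example.com!"))
# ['Do', "n't", 'split', 'URLs', 'like', 'example.com', '!']
```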
-
Punctuation Tokeniser
A punctuation tokeniser splits text on both whitespace and punctuation characters, emitting only alphabetic and numeric runs — a simple, stateless approach common in search engine analysis chains.
-
Whitespace Tokeniser
A whitespace tokeniser splits a string into tokens by breaking on space, tab, and newline characters — the simplest possible tokenisation strategy, with well-defined failure modes.
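In Python this is simply str.split(), failure modes included:

```python
print("hello,   world\tfoo\nbar".split())
# ['hello,', 'world', 'foo', 'bar']: punctuation stays glued to its token
```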
-
Lemmatisation
Lemmatisation reduces an inflected word form to its dictionary base form — its lemma — by applying morphological analysis and a lexicon lookup, producing valid words rather than truncated stems.
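Via NLTK's WordNet lemmatiser, assuming the WordNet data package is downloaded:

```python
from nltk.stem import WordNetLemmatizer   # needs nltk.download("wordnet")

wnl = WordNetLemmatizer()
print(wnl.lemmatize("ran", pos="v"))      # 'run': irregular past tense resolved
print(wnl.lemmatize("mice"))              # 'mouse' (default part of speech: noun)
print(wnl.lemmatize("better", pos="a"))   # 'good'
```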
-
Stemming
Stemming reduces a word to a base form by stripping affixes using rule-based heuristics, allowing variant forms such as “running” and “runs” to match the single index term “run”; irregular forms such as “ran” lie beyond suffix-stripping and require lemmatisation.
-
Corpus
A corpus is a structured collection of text documents used to train, evaluate, or build statistics for an NLP system — the raw material from which indexes, models, and vocabularies are derived.
-
Tokenisation
Tokenisation is the process of splitting a raw text string into a sequence of discrete units — tokens — that downstream NLP components such as indexers, classifiers, and language models can operate on.
-
Token
A token is the smallest unit of text that an NLP pipeline or search engine operates on — typically a word, subword, or character produced by splitting an input string.