Query-Parsing
-
Query-time Analysis
Query-time analysis is the analysis chain applied to the query string before it is matched against the index. It must produce terms compatible with those generated at index time.
-
Query Parser
A query parser interprets a query string and produces a structured query object for execution, serving as the bridge between user input and the query engine.
-
Boolean Retrieval
Boolean retrieval matches documents using AND, OR, and NOT operators applied to inverted index postings lists. It returns an exact set — all matching documents, unranked — rather than a ranked list.
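A minimal sketch of the AND/OR/NOT set operations over postings lists, using a hypothetical toy index (the terms and document IDs are illustrative, not from any real system):

```python
# Hypothetical toy index: term -> postings list, stored as a set of doc IDs.
index = {
    "apple":  {1, 2, 5},
    "banana": {2, 3, 5},
    "cherry": {1, 4},
}
all_docs = {1, 2, 3, 4, 5}  # the full collection, needed for NOT

def AND(t1, t2):
    return index[t1] & index[t2]   # intersection of postings

def OR(t1, t2):
    return index[t1] | index[t2]   # union of postings

def NOT(t):
    return all_docs - index[t]     # complement against the whole collection

# "apple AND banana" yields the exact, unranked set {2, 5}
matches = AND("apple", "banana")
```

Real engines intersect sorted postings lists with a merge walk rather than materialising sets, but the semantics are the same.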
-
ASCII Folding
ASCII folding maps accented and special characters to their closest ASCII equivalents using a lookup table, improving recall for users who omit diacritics at the cost of collapsing distinctions that may be semantically meaningful.
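A sketch of the idea using Unicode decomposition rather than a full lookup table: NFD splits each accented character into a base letter plus combining marks, which are then dropped. Production folding filters use larger tables that also cover ligatures (æ) and letters with no decomposition (ß), which this approach misses.

```python
import unicodedata

def ascii_fold(text: str) -> str:
    # NFD decomposes accented characters (e.g. é -> e + combining acute);
    # dropping the combining marks leaves the closest ASCII base letter.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

ascii_fold("café crème")  # -> "cafe creme"
```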
-
Elision Filter
An elision filter is a token filter that strips language-specific clitic prefixes — such as French l’ and d’ — from the start of tokens, leaving the bare stem for indexing and matching.
-
HTML Strip
HTML stripping is a character-level preprocessing stage that removes markup tags and decodes HTML entities from raw text before it reaches the tokeniser, preventing angle brackets and entity sequences from appearing as index terms.
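A simplified sketch using a regex for tags and the standard library for entities; a real HTML-aware char filter also handles comments, script/style content, and malformed markup, which this does not:

```python
import html
import re

def strip_html(raw: str) -> str:
    # Replace tags with spaces so adjacent words don't fuse,
    # then decode entities such as &amp; and collapse whitespace.
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    text = html.unescape(no_tags)
    return re.sub(r"\s+", " ", text).strip()

strip_html("<p>Fish &amp; Chips</p>")  # -> "Fish & Chips"
```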
-
KStem
KStem is a conservative English stemmer that combines suffix-stripping with a built-in lexicon to avoid false conflations, producing cleaner stems than Porter2 at the cost of a dictionary dependency.
-
Normalisation
Normalisation transforms raw text into a consistent, canonical form — lowercasing, accent stripping, Unicode standardisation — so that surface variants of the same term map to a single index entry.
-
Pattern Replace Filter
A pattern replace filter applies a regular expression substitution to each token in an analysis chain, rewriting token text in place without changing token boundaries — distinct from a pattern tokeniser, which splits the raw character stream.
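A sketch of the filter's behaviour: the substitution rewrites each token's text, but the number of tokens and their boundaries are untouched (the function name and example pattern are illustrative):

```python
import re

def pattern_replace(tokens, pattern, replacement):
    # Apply the substitution to every token; token count is unchanged,
    # unlike a pattern tokeniser, which would re-split the stream.
    regex = re.compile(pattern)
    return [regex.sub(replacement, tok) for tok in tokens]

pattern_replace(["e-mail", "well-known"], r"-", "")  # -> ["email", "wellknown"]
```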
-
Porter Stemmer
The Porter Stemmer is a rule-based English suffix-stripping algorithm that reduces words to a stem using five sequential transformation passes gated by a vowel-consonant measure.
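For flavour, a sketch of just the first sub-step (step 1a) of those passes, transcribed from Porter's published rules; the full algorithm adds the vowel-consonant measure m() and four further steps, so this is nowhere near a complete stemmer:

```python
def porter_step1a(word: str) -> str:
    # Porter step 1a suffix rules only.
    if word.endswith("sses"):
        return word[:-2]   # SSES -> SS: caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # IES -> I: ponies -> poni
    if word.endswith("ss"):
        return word        # SS -> SS: caress -> caress
    if word.endswith("s"):
        return word[:-1]   # S -> "": cats -> cat
    return word
```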
-
Porter2 Stemmer
Porter2 is a revised English suffix-stripping algorithm from the Snowball project that fixes around 200 mis-stemmings in the original Porter Stemmer and is the default stemmer in Elasticsearch’s english analyser.
-
Stop Word
A stop word is a high-frequency function word — such as the, is, or at — removed from a token stream during analysis to reduce index noise and improve retrieval efficiency.
-
Stop Word Filter
A stop word filter is a token filter in an analysis chain that removes stop words from the token stream at index time and query time, reducing index size and suppressing high-frequency noise terms.
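A sketch with an illustrative stop list; production lists are language-specific and considerably longer:

```python
# Illustrative subset of an English stop list.
STOP_WORDS = {"the", "is", "at", "a", "an", "of", "or"}

def filter_stop_words(tokens):
    # Drop any token whose lowercased form is on the stop list.
    return [t for t in tokens if t.lower() not in STOP_WORDS]

filter_stop_words(["The", "cat", "is", "at", "home"])  # -> ["cat", "home"]
```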
-
CJK Tokeniser
A CJK tokeniser segments Chinese, Japanese, and Korean text into tokens by splitting at every character or by applying a dictionary and statistical model to identify word boundaries.
-
Path Hierarchy Tokeniser
A path hierarchy tokeniser splits a path string into every prefix hierarchy, so that a document at /a/b/c is also findable by /a/b or /a — enabling subtree search on file paths, URL components, and category trees.
-
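The prefix expansion a path hierarchy tokeniser performs can be sketched as follows (a hypothetical helper assuming absolute paths with a leading delimiter, not any engine's actual implementation):

```python
def path_hierarchy(path: str, delimiter: str = "/") -> list:
    # Emit one token per prefix of the path: /a, /a/b, /a/b/c, ...
    parts = [p for p in path.split(delimiter) if p]
    return [delimiter + delimiter.join(parts[: i + 1]) for i in range(len(parts))]

path_hierarchy("/a/b/c")  # -> ["/a", "/a/b", "/a/b/c"]
```
-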
ICU Tokeniser
The ICU tokeniser applies ICU BreakIterator rules to split text into tokens, extending UAX #29 with locale-aware dictionary segmentation for CJK and Thai and support for custom script rules.
-
Unicode Tokeniser
A Unicode tokeniser splits text into tokens using Unicode character categories and the UAX #29 word-boundary rules, giving consistent token boundaries across scripts; languages written without spaces, such as Chinese and Thai, still need dictionary-based segmentation beyond the default rules.
-
Regex Tokeniser
A regex tokeniser defines token boundaries with a regular expression, either splitting on delimiter matches or extracting token matches — the generalisation underlying whitespace, punctuation, and word tokenisers.
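Both modes can be sketched in a few lines (the function names and default patterns are illustrative):

```python
import re

def tokenize_extract(text: str, pattern: str = r"\w+") -> list:
    # "Extract" mode: every regex match becomes a token.
    return re.findall(pattern, text)

def tokenize_split(text: str, delimiter: str = r"[\s,;]+") -> list:
    # "Split" mode: the regex matches the delimiters between tokens.
    return [t for t in re.split(delimiter, text) if t]

tokenize_extract("hello, world!")  # -> ["hello", "world"]
```

With `\w+` as the extract pattern this behaves like a simple word tokeniser; with `\s+` as the split pattern it degenerates to a whitespace tokeniser, which is the generalisation the definition describes.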
-
Word Tokeniser
A word tokeniser splits text into tokens at word boundaries using rules or regular expressions, correctly handling punctuation, contractions, hyphenation, and URLs where a whitespace split would fail.
-
Punctuation Tokeniser
A punctuation tokeniser splits text on both whitespace and punctuation characters, emitting only alphabetic and numeric runs — a simple, stateless approach common in search engine analysis chains.
-
Whitespace Tokeniser
A whitespace tokeniser splits a string into tokens by breaking on space, tab, and newline characters — the simplest possible tokenisation strategy, with well-defined failure modes.
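In Python this is a one-liner, and the failure mode is easy to show: punctuation stays glued to the neighbouring word.

```python
def whitespace_tokenize(text: str) -> list:
    # str.split() with no argument splits on runs of any whitespace
    # (spaces, tabs, newlines) and drops empty strings.
    return text.split()

whitespace_tokenize("Hello, world!")  # -> ["Hello,", "world!"]  (punctuation attached)
```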
-
Stemming
Stemming reduces a word to a base form by stripping affixes using rule-based heuristics, allowing variant forms such as “running” and “runs” to match the single index term “run”; irregular forms such as “ran” are beyond suffix stripping and require lemmatisation.
-
Trie
A trie is a tree where each path from root to node spells out a prefix, enabling O(k) term lookup, prefix enumeration, and autocomplete — where k is the length of the query string.
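A minimal sketch of a trie supporting O(k) lookup and prefix-based autocomplete (class and method names are illustrative):

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.is_term = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, term: str) -> None:
        node = self.root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        node.is_term = True

    def _walk(self, s: str):
        # Follow the path spelled by s; None if it falls off the trie.
        node = self.root
        for ch in s:
            if ch not in node.children:
                return None
            node = node.children[ch]
        return node

    def contains(self, term: str) -> bool:
        # O(k) in the length of the query string.
        node = self._walk(term)
        return node is not None and node.is_term

    def autocomplete(self, prefix: str) -> list:
        # Enumerate every stored term under the prefix node.
        node = self._walk(prefix)
        if node is None:
            return []
        results = []
        def dfs(n, suffix):
            if n.is_term:
                results.append(prefix + suffix)
            for ch, child in sorted(n.children.items()):
                dfs(child, suffix + ch)
        dfs(node, "")
        return results
```

Term dictionaries in practice often use compressed variants (radix trees, FSTs) to cut memory, but the lookup idea is the same.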