Normalisation
-
Case Folding
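Case folding is a Unicode operation for caseless matching that goes further than simple lowercasing: German sharp-s (ß), for example, folds to ss. The default folding is locale-independent, so languages such as Turkish, which distinguish dotted and dotless i, need language-specific rules applied on top of it.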
-
ASCII Folding
ASCII folding maps accented and special characters to their closest ASCII equivalents using a lookup table, improving recall for users who omit diacritics at the cost of collapsing distinctions that may be semantically meaningful.
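One way to sketch this, assuming a deliberately tiny hand-written table (EXTRA_FOLDS is hypothetical and far from complete), is to combine the lookup with Unicode decomposition so that accented letters lose their combining marks. Production filters typically ship a much larger table.

```python
import unicodedata

# Hypothetical mini-table for characters that have no canonical decomposition.
EXTRA_FOLDS = {
    "ß": "ss", "ø": "o", "Ø": "O", "æ": "ae", "Æ": "AE",
    "ł": "l", "Ł": "L", "đ": "d", "Đ": "D",
}


def ascii_fold(text: str) -> str:
    """Map characters to ASCII where an equivalent is known; keep the rest."""
    out = []
    # Compose first so that "e" + combining accent is handled as one character.
    for ch in unicodedata.normalize("NFC", text):
        if ch.isascii():
            out.append(ch)
        elif ch in EXTRA_FOLDS:
            out.append(EXTRA_FOLDS[ch])
        else:
            # Decompose and drop combining marks (accents, diaereses, ...).
            base = "".join(
                c for c in unicodedata.normalize("NFD", ch)
                if not unicodedata.combining(c)
            )
            # Keep the original character if no ASCII form was found.
            out.append(base if base.isascii() else ch)
    return "".join(out)
```

After folding, "café" and "cafe" produce the same term, which is the recall gain; the loss is that any pair of words differing only by a diacritic also collapses.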
-
Decimal Digit Filter
A decimal digit filter maps Unicode decimal digit characters from any script to their ASCII 0–9 equivalents, ensuring that numbers written in Eastern Arabic, Devanagari, Thai, and other numeral systems match the same query regardless of which digit form was used.
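A minimal sketch using Python's unicodedata.decimal(), which returns the digit value for characters in the Unicode decimal digit category and a caller-supplied default otherwise. The function name fold_decimal_digits is illustrative.

```python
import unicodedata


def fold_decimal_digits(text: str) -> str:
    """Replace any Unicode decimal digit with its ASCII 0-9 equivalent."""
    out = []
    for ch in text:
        value = unicodedata.decimal(ch, None)  # None for non-digit characters
        out.append(str(value) if value is not None else ch)
    return "".join(out)


# "2024" written with Eastern Arabic, Devanagari, and Thai digits.
eastern_arabic = "\u0662\u0660\u0662\u0664"
devanagari = "\u0968\u0966\u0968\u096a"
thai = "\u0e52\u0e50\u0e52\u0e54"

folded = [fold_decimal_digits(s) for s in (eastern_arabic, devanagari, thai)]
```

All three inputs fold to the same ASCII string, while characters that merely look numeric (such as superscript digits) are left alone because they are not decimal digits.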
-
Elision Filter
An elision filter is a token filter that strips language-specific clitic prefixes, such as French l’ and d’, from the start of tokens, leaving the bare word for indexing and matching.
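A rough sketch for French, assuming a small hand-picked prefix set (ELIDED_PREFIXES is hypothetical and not an authoritative list). It accepts both the straight and the typographic apostrophe and only strips when the text before the first apostrophe is a known prefix.

```python
# Hypothetical prefix set for illustration only.
ELIDED_PREFIXES = {"l", "d", "j", "m", "t", "s", "n", "c", "qu"}
APOSTROPHES = ("'", "\u2019")


def strip_elision(token: str) -> str:
    """Remove a leading elided prefix such as l' or d' from a token."""
    for i, ch in enumerate(token):
        if ch in APOSTROPHES:
            prefix = token[:i].lower()
            # Strip only known prefixes, and never leave an empty token.
            if prefix in ELIDED_PREFIXES and i + 1 < len(token):
                return token[i + 1:]
            break  # only the first apostrophe is considered
    return token
```

Under these assumptions "l'avion" becomes "avion", whereas "aujourd'hui" is left intact because "aujourd" is not in the prefix set.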
-
Length Filter
A length filter is a token filter in an analysis chain that discards any token whose character length falls outside a configured minimum and maximum bound, removing noise tokens produced by tokenisation or upstream rewriting.
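A minimal sketch as a Python generator; the parameter names min_len and max_len and their defaults are assumptions. Note that len() counts code points here, which may differ from how a given search engine measures token length.

```python
from typing import Iterable, Iterator


def length_filter(
    tokens: Iterable[str], min_len: int = 2, max_len: int = 20
) -> Iterator[str]:
    """Yield only tokens whose length lies within [min_len, max_len]."""
    for token in tokens:
        if min_len <= len(token) <= max_len:
            yield token


kept = list(length_filter(["a", "ok", "x" * 25, "search"], min_len=2, max_len=20))
```

With these bounds the single-letter token and the 25-character run are dropped, and the remaining tokens keep their original order.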
-
Lowercasing
Lowercasing converts every character in a string to its lowercase form, eliminating case variation so that ‘HTTP’, ‘Http’, and ‘http’ map to a single index term.
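To illustrate the effect on an index, here is a small sketch that builds a toy inverted index keyed by lowercased tokens. The build_index helper and its input shape are assumptions made for the example.

```python
from collections import defaultdict


def build_index(docs: dict[int, list[str]]) -> dict[str, set[int]]:
    """Map each lowercased token to the set of document ids containing it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token.lower()].add(doc_id)
    return dict(index)


index = build_index({1: ["HTTP"], 2: ["Http"], 3: ["http"]})
```

All three spellings end up under the single term "http", so a query lowercased the same way matches every document.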
-
Pattern Replace Filter
A pattern replace filter applies a regular expression substitution to each token in an analysis chain, rewriting token text in place without changing token boundaries — distinct from a pattern tokeniser, which splits the raw character stream.
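A minimal sketch built on Python's re module; the function name and the hyphen-stripping pattern are illustrative choices. The point to notice is that the number of tokens coming out equals the number going in.

```python
import re
from typing import Iterable, Iterator


def pattern_replace(
    tokens: Iterable[str], pattern: str, replacement: str
) -> Iterator[str]:
    """Apply a regex substitution to each token; boundaries stay untouched."""
    regex = re.compile(pattern)
    for token in tokens:
        yield regex.sub(replacement, token)


tokens = ["wi-fi", "e-mail", "plain"]
rewritten = list(pattern_replace(tokens, r"-", ""))
```

A token that matches the pattern in full would be emitted as an empty string rather than dropped, so a length filter is a natural companion if empty tokens are unwanted.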
-
Trim Filter
A trim filter is a token filter that strips leading and trailing whitespace characters from each token in the analysis stream, leaving the token’s interior content unchanged.
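A minimal sketch using str.strip(), which removes leading and trailing whitespace as Python defines it; an actual analysis chain may use a narrower definition of whitespace.

```python
from typing import Iterable, Iterator


def trim_filter(tokens: Iterable[str]) -> Iterator[str]:
    """Strip leading and trailing whitespace from each token."""
    for token in tokens:
        yield token.strip()


trimmed = list(trim_filter(["  foo ", "bar\t", " new york "]))
```

The space inside "new york" survives, since only the edges of each token are touched.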
-
Unicode Normalisation
Unicode normalisation resolves the fact that a single visible character can be encoded multiple ways, standardising text to one of four forms — NFC, NFD, NFKC, or NFKD — before comparison, indexing, or hashing.
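A short sketch with Python's unicodedata.normalize() showing two encodings of é and how the canonical (NFC, NFD) and compatibility (NFKC, NFKD) forms differ on a ligature. The wrapper name normalise is illustrative.

```python
import unicodedata


def normalise(text: str, form: str = "NFC") -> str:
    """Return text in the requested form: NFC, NFD, NFKC, or NFKD."""
    return unicodedata.normalize(form, text)


composed = "\u00e9"      # é as a single code point
decomposed = "e\u0301"   # e followed by a combining acute accent
ligature = "\ufb01"      # the single-character "fi" ligature

# The two encodings look identical but are different strings until normalised.
differ_raw = composed != decomposed
equal_nfc = normalise(composed) == normalise(decomposed)

# Canonical forms keep the ligature; compatibility forms expand it.
kept = normalise(ligature, "NFC")
expanded = normalise(ligature, "NFKC")
```

Whichever form is chosen, the same one needs to be applied at index time and at query time, otherwise visually identical strings can still fail to match.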