Tokenisation
-
Tokeniser Vocabulary
Fixed set of subword units learned or predefined for tokenisation; typically 32k–128k tokens, balancing compression and flexibility.
-
Query-time Analysis
Query-time analysis is the analysis chain applied to the query string before it is matched against the index. It must produce terms compatible with those generated at index time.
-
Index-time Analysis
Index-time analysis is the analysis chain applied to document text when it is ingested into the index. The terms it produces are what get stored in the inverted index and matched against at search time.
-
Case Folding
Case folding is locale-aware lowercasing that correctly handles languages where simple ASCII lowercasing produces wrong results — such as the Turkish dotted and dotless i or the German sharp s (ß).
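As a rough illustration, not tied to any particular search engine, Python's built-in str.casefold() goes one step beyond lower() for cases like the German sharp s, while fully locale-aware folding (the Turkish i, for instance) still needs an ICU-backed library:

```python
# Minimal sketch: compare plain lowercasing with Unicode case folding.
terms = ["HTTP", "Straße", "İstanbul"]

for term in terms:
    print(term, "->", term.lower(), "/", term.casefold())

# "Straße".casefold() yields "strasse", which lower() does not produce;
# "İstanbul" still needs a Turkish locale rule to fold correctly.
```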
-
Analyzer
An analyzer is a named, reusable analysis chain configuration in Solr, Elasticsearch, or OpenSearch — combining a tokeniser and token filters into a unit that can be assigned to fields.
-
Analysis Chain
An analysis chain is the ordered pipeline of tokeniser and token filters that transforms raw text into index terms. The same chain (or a compatible one) must be applied at both index time and query time.
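A minimal sketch of the idea in plain Python, using a hypothetical chain made of a regex tokeniser followed by lowercase and stop-word filters; real engines configure this declaratively per field rather than as code:

```python
import re

def tokenise(text):                      # tokeniser: split raw text into tokens
    return re.findall(r"\w+", text)

def lowercase(tokens):                   # token filter: normalise case
    return [t.lower() for t in tokens]

def remove_stop_words(tokens, stop=frozenset({"the", "is", "at"})):
    return [t for t in tokens if t not in stop]

def analyse(text):
    # The ordered pipeline: tokeniser first, then each filter in sequence.
    tokens = tokenise(text)
    for token_filter in (lowercase, remove_stop_words):
        tokens = token_filter(tokens)
    return tokens

print(analyse("The cat is at the door"))   # ['cat', 'door']
```

Applying the same analyse() function to both documents and queries is what keeps index-time and query-time terms compatible.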
-
Decompounding
Decompounding splits compound words — common in German, Dutch, and Scandinavian languages — into their component tokens so that searches for constituents match the full compound at index and query time.
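A toy sketch of dictionary-based decompounding; the tiny word list and the greedy longest-match strategy are purely illustrative, as production decompounders rely on curated dictionaries and statistical scoring:

```python
# Hypothetical mini-dictionary of German constituents.
DICTIONARY = {"donau", "dampf", "schiff", "fahrt"}

def decompound(word, dictionary=DICTIONARY):
    """Greedily split a compound into known constituents, longest match first."""
    word = word.lower()
    parts, pos = [], 0
    while pos < len(word):
        for end in range(len(word), pos, -1):      # try the longest candidate first
            if word[pos:end] in dictionary:
                parts.append(word[pos:end])
                pos = end
                break
        else:
            return [word]                          # no split found: keep the compound whole
    return parts

print(decompound("Donaudampfschifffahrt"))  # ['donau', 'dampf', 'schiff', 'fahrt']
```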
-
Elision Filter
An elision filter is a token filter that strips language-specific clitic prefixes — such as French l’ and d’ — from the start of tokens, leaving the bare stem for indexing and matching.
-
Length Filter
A length filter is a token filter in an analysis chain that discards any token whose character length falls outside a configured minimum and maximum bound, removing noise tokens produced by tokenisation or upstream rewriting.
-
Lowercasing
Lowercasing converts every character in a string to its lowercase form, eliminating case variation so that ‘HTTP’, ‘Http’, and ‘http’ map to a single index term.
-
Normalisation
Normalisation transforms raw text into a consistent, canonical form — lowercasing, accent stripping, Unicode standardisation — so that surface variants of the same term map to a single index entry.
-
Pattern Replace Filter
A pattern replace filter applies a regular expression substitution to each token in an analysis chain, rewriting token text in place without changing token boundaries — distinct from a pattern tokeniser, which splits the raw character stream.
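A rough sketch of the behaviour in Python, using a made-up rule that strips a trailing possessive 's from each token; the pattern itself is only an example:

```python
import re

def pattern_replace(tokens, pattern=r"'s$", replacement=""):
    """Apply a regex substitution to each token's text; the token count is unchanged."""
    return [re.sub(pattern, replacement, t) for t in tokens]

print(pattern_replace(["o'reilly's", "book's", "pages"]))
# ["o'reilly", 'book', 'pages'] -- three tokens in, three tokens out
```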
-
Stop Word
A stop word is a high-frequency function word — such as the, is, or at — removed from a token stream during analysis to reduce index noise and improve retrieval efficiency.
-
Stop Word Filter
A stop word filter is a token filter in an analysis chain that removes stop words from the token stream at index time and query time, reducing index size and suppressing high-frequency noise terms.
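In sketch form, a stop word filter is a set-membership check over the token stream; the stop list here is a deliberately tiny stand-in for a real one:

```python
STOP_WORDS = {"the", "is", "at", "a", "of"}   # illustrative stop list

def stop_word_filter(tokens, stop_words=frozenset(STOP_WORDS)):
    return [t for t in tokens if t not in stop_words]

print(stop_word_filter(["the", "history", "of", "search", "is", "long"]))
# ['history', 'search', 'long']
```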
-
Trim Filter
A trim filter is a token filter that strips leading and trailing whitespace characters from each token in the analysis stream, leaving the token’s interior content unchanged.
-
CJK Tokeniser
A CJK tokeniser segments Chinese, Japanese, and Korean text into tokens by splitting at every character or by applying a dictionary and statistical model to identify word boundaries.
-
Thai Tokeniser
A Thai tokeniser segments Thai script into words by combining a word-boundary dictionary with statistical or ML models, since Thai is written without spaces between words.
-
Path Hierarchy Tokeniser
A path hierarchy tokeniser splits a path string into every prefix of its hierarchy, so that a document at /a/b/c is also findable by /a/b or /a — enabling subtree search on file paths, URL components, and category trees.
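A rough Python equivalent of the prefix expansion, assuming "/" as the delimiter:

```python
def path_hierarchy_tokens(path, delimiter="/"):
    """Emit every ancestor prefix of a path, e.g. /a/b/c -> /a, /a/b, /a/b/c."""
    parts = [p for p in path.split(delimiter) if p]
    return [delimiter + delimiter.join(parts[:i]) for i in range(1, len(parts) + 1)]

print(path_hierarchy_tokens("/a/b/c"))   # ['/a', '/a/b', '/a/b/c']
```
-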
Edge N-Gram
An edge n-gram is a prefix-anchored n-gram generated from the start of a token, used in search engines to power as-you-type autocomplete and prefix matching.
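A minimal sketch of edge n-gram generation over a single token, with arbitrary minimum and maximum lengths chosen for illustration:

```python
def edge_ngrams(token, min_len=2, max_len=5):
    """Prefixes of the token between min_len and max_len characters long."""
    return [token[:n] for n in range(min_len, min(max_len, len(token)) + 1)]

print(edge_ngrams("search"))   # ['se', 'sea', 'sear', 'searc']
```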
-
ICU Tokeniser
The ICU tokeniser applies ICU BreakIterator rules to split text into tokens, extending UAX #29 with locale-aware dictionary segmentation for CJK and Thai and support for custom script rules.
-
Unigram Language Model Tokeniser
The Unigram LM tokeniser builds a subword vocabulary top-down: it begins with a large candidate set and iteratively prunes the entries whose removal least increases the corpus log-loss, producing a probability distribution over segmentations.
-
SentencePiece
SentencePiece is a language-agnostic subword tokeniser that trains directly on raw Unicode text, encodes whitespace as the ▁ symbol, and produces a fully reversible token sequence using either BPE or Unigram LM as the underlying algorithm.
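Assuming the sentencepiece Python package is installed and a plain-text training file (here a hypothetical corpus.txt) is available, training and using a model looks roughly like this:

```python
import sentencepiece as spm

# Train a small model on a hypothetical corpus file; parameters are illustrative.
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="demo", vocab_size=1000, model_type="unigram"
)

sp = spm.SentencePieceProcessor(model_file="demo.model")
pieces = sp.encode("Search engines tokenise text.", out_type=str)
print(pieces)             # subword pieces, with whitespace marked by the ▁ symbol
print(sp.decode(pieces))  # reversible: the original string comes back
```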
-
WordPiece
WordPiece is a subword tokenisation algorithm that builds a vocabulary by iteratively merging symbol pairs chosen to maximise training-corpus likelihood, rather than raw frequency. It is the tokeniser used in BERT and its derivatives.
-
Byte Pair Encoding
Byte pair encoding is a data-compression algorithm repurposed for NLP to build subword vocabularies by iteratively merging the most frequent adjacent symbol pair in a training corpus.
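A compact sketch of the core BPE training loop: count adjacent symbol pairs across the corpus and merge the most frequent pair, repeating for a fixed number of steps; word frequencies, end-of-word markers, and the encoding step are omitted here:

```python
from collections import Counter

def most_frequent_pair(corpus):
    """corpus: list of words, each represented as a list of symbols."""
    pairs = Counter()
    for word in corpus:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    """Fuse every occurrence of the chosen pair into a single symbol."""
    merged = []
    for word in corpus:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(out)
    return merged

corpus = [list("lower"), list("lowest"), list("low")]
for _ in range(3):                         # three merge steps for illustration
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print(pair, corpus)
```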
-
Subword Tokenisation
Subword tokenisation splits words into smaller vocabulary units — fragments between characters and whole words — so a fixed vocabulary can represent any input string, including words never seen during training.
-
Unicode Tokeniser
A Unicode tokeniser splits text into tokens using Unicode character categories and the UAX #29 word-boundary rules, producing correct token boundaries across all scripts and languages.
-
Regex Tokeniser
A regex tokeniser defines token boundaries with a regular expression, either splitting on delimiter matches or extracting token matches — the generalisation underlying whitespace, punctuation, and word tokenisers.
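Both styles in a quick Python sketch: extracting matches as tokens versus splitting on a delimiter pattern; the patterns are only examples:

```python
import re

text = "state-of-the-art search, c. 2024"

# Extract style: each match of the pattern becomes a token.
print(re.findall(r"[A-Za-z0-9]+", text))
# ['state', 'of', 'the', 'art', 'search', 'c', '2024']

# Split style: the pattern defines the delimiters between tokens.
print([t for t in re.split(r"[\s,]+", text) if t])
# ['state-of-the-art', 'search', 'c.', '2024']
```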
-
Sentence Tokeniser
A sentence tokeniser splits a document into individual sentences, establishing the boundary between document-level and word-level processing — a step that is harder than it appears because full stops serve multiple roles.
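A naive regex splitter makes the difficulty concrete: it splits after sentence-final punctuation and is immediately fooled by abbreviations, which is why practical splitters use abbreviation lists or trained models:

```python
import re

text = "Dr. Smith arrived at 10 a.m. She began the lecture."

# Naive rule: split after ., ! or ? when followed by whitespace and a capital letter.
sentences = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
print(sentences)
# ['Dr.', 'Smith arrived at 10 a.m.', 'She began the lecture.']
# The split after "Dr." is spurious; the one after "a.m." merely happens to be right.
```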
-
Word Tokeniser
A word tokeniser splits text into tokens at word boundaries using rules or regular expressions, correctly handling punctuation, contractions, hyphenation, and URLs where a whitespace split would fail.
-
Punctuation Tokeniser
A punctuation tokeniser splits text on both whitespace and punctuation characters, emitting only alphabetic and numeric runs — a simple, stateless approach common in search engine analysis chains.
-
Whitespace Tokeniser
A whitespace tokeniser splits a string into tokens by breaking on space, tab, and newline characters — the simplest possible tokenisation strategy, with well-defined failure modes.
-
Stemming
Stemming reduces a word to a base form by stripping affixes using rule-based heuristics, allowing variant forms such as “running” and “runs” to match a single index term; irregular forms such as “ran” generally require lemmatisation instead.
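As a concrete example, assuming the nltk package is installed, the Porter stemmer collapses regular inflections but leaves irregular forms untouched:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "runs", "run", "ran"]:
    print(word, "->", stemmer.stem(word))
# running -> run, runs -> run, run -> run, ran -> ran
```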
-
Character N-Gram
A character n-gram is a contiguous sequence of n characters extracted from a string, enabling tokenisation-free indexing, fuzzy search, language identification, and subword modelling.
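A short sketch of character n-gram extraction; trigrams (n=3) here are an arbitrary choice:

```python
def char_ngrams(text, n=3):
    """All contiguous character windows of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("search"))   # ['sea', 'ear', 'arc', 'rch']
```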
-
Bigram
A bigram is an n-gram of length 2 — two consecutive tokens considered as a pair. Bigram models condition each token on the one before it, capturing local order that unigram models discard.
-
Unigram
A unigram is an n-gram of length 1 — a single token considered in isolation. The unigram model treats each token as statistically independent, forming the basis of bag-of-words retrieval.
-
N-Gram
An n-gram is a contiguous sequence of n tokens drawn from a text, used to capture local word order for indexing, language modelling, and similarity.
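The same idea over tokens rather than characters, sketched in Python; n=2 yields the bigrams described above:

```python
def ngrams(tokens, n):
    """All contiguous windows of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["new", "york", "city", "marathon"]
print(ngrams(tokens, 2))   # [('new', 'york'), ('york', 'city'), ('city', 'marathon')]
print(ngrams(tokens, 3))   # [('new', 'york', 'city'), ('york', 'city', 'marathon')]
```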
-
Tokenisation
Tokenisation is the process of splitting a raw text string into a sequence of discrete units — tokens — that downstream NLP components such as indexers, classifiers, and language models can operate on.
-
Token
A token is the smallest unit of text that an NLP pipeline or search engine operates on — typically a word, subword, or character produced by splitting an input string.