Information Retrieval
-
Vector Space Model
The vector space model (VSM) represents documents and queries as vectors in a high-dimensional term space and ranks documents by their cosine similarity to the query vector.
-
Term Vector
A term vector is a per-document record of the terms, frequencies, and optionally positions and offsets produced by index-time analysis. It enables highlighting, More Like This queries, and forward-index access.
-
Term Frequency
Term frequency (TF) is the count of how many times a term appears in a document. It is one of the two core signals in TF-IDF and BM25 scoring.
-
Stored Field
A stored field retains the original verbatim value of a field in the index so it can be returned in search results. Stored fields are separate from the inverted index and from DocValues.
-
Snowball Stemmer
Snowball is a string-processing language and framework for writing stemming algorithms, developed by Martin Porter. It ships stemmers for 20+ languages and is the source of the Porter2 (English) stemmer used in most modern search engines.
-
Segment
A segment is an immutable, self-contained unit of a Lucene index. New documents are written to new segments; segments are periodically merged to keep the index efficient.
-
Probabilistic Retrieval Model
Probabilistic retrieval models rank documents by their estimated probability of relevance to a query. BM25 is the most successful probabilistic retrieval model; language models offer an alternative probabilistic framework.
-
Postings List
A postings list is the ordered sequence of postings for a single term in an inverted index — the list of all documents containing that term, with optional frequencies and positions.
-
Posting
A posting is a single record in an inverted index, linking a term to one document in which it appears — optionally including term frequency and token positions.
-
Positional Index
A positional index extends the inverted index by storing the token position of each term occurrence within a document, enabling phrase queries and proximity queries.
-
Okapi BM25
Okapi BM25 is the original formulation of BM25, developed at City University London on the Okapi IR system in the early 1990s. The name ‘Okapi BM25’ honours the system; in practice it is synonymous with BM25.
-
Merge Policy
A merge policy defines the rules governing when and how Lucene index segments are merged. Merging controls the tradeoff between indexing throughput, search performance, and disk usage.
-
Learning to Rank
Learning to rank (LTR) trains a model to produce an optimal ordering of documents for a query using labelled relevance data, combining signals such as BM25, click-through rate, and document features.
-
Inverse Document Frequency
Inverse document frequency (IDF) is a log-scaled measure of how rare a term is across a corpus. Rare terms receive high IDF weights; common terms receive low weights, making IDF a natural filter for uninformative vocabulary.
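A minimal sketch of the textbook log(N/df) form; the function name is illustrative, and Lucene and scikit-learn each use slightly smoothed variants:

```python
import math

def idf(term_df: int, num_docs: int) -> float:
    """Classic IDF: log of (total documents / documents containing the term)."""
    return math.log(num_docs / term_df)

# A term appearing in 10 of 1,000,000 documents is weighted far higher
# than one appearing in 500,000 of them.
print(idf(10, 1_000_000))       # ~11.5
print(idf(500_000, 1_000_000))  # ~0.69
```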
-
Index-time Analysis
Index-time analysis is the analysis chain applied to document text when it is ingested into the index. The terms it produces are what get stored in the inverted index and matched against at search time.
-
Forward Index
A forward index maps each document to the list of terms it contains. It is the natural output of document ingestion and the starting point for building an inverted index.
-
Field Type
A field type is a named schema definition that specifies how a field’s values are stored, indexed, and analysed. It bundles an analyzer, storage options, and index behaviour into a reusable configuration.
-
DocValues
DocValues is a column-oriented on-disk data structure in Lucene that stores field values per document, enabling efficient sorting, faceting, and aggregations without loading the entire index into memory.
-
Document Frequency
Document frequency (DF) is the number of documents in a corpus that contain a given term. It is the denominator in IDF and signals how common or rare a term is across the collection.
-
Cosine Similarity
Cosine similarity measures the similarity of two vectors as their dot product divided by the product of their magnitudes. It is the standard measure for comparing dense and sparse vectors in IR.
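A minimal sketch over plain Python lists; production systems use NumPy or the engine's native vector types, and the function name is illustrative:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of the two vectors divided by the product of their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 2.0, 0.0], [1.0, 2.0, 1.0]))  # ~0.91
```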
-
Commit
A commit makes indexed documents durable by flushing in-memory segments to disk and writing a new commit point. Solr distinguishes hard commits (durable) from soft commits (visible but not durable); the closest Elasticsearch equivalents are flush and refresh.
-
Boolean Retrieval
Boolean retrieval matches documents using AND, OR, and NOT operators applied to inverted index postings lists. It returns an exact set — all matching documents, unranked — rather than a ranked list.
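A minimal sketch using Python sets as toy postings lists (the terms and document IDs are made up); real engines intersect sorted postings with skip pointers, but the set algebra is the same:

```python
# Toy postings: term -> set of document IDs containing it.
postings = {
    "cat":  {1, 2, 5, 7},
    "dog":  {2, 3, 7, 9},
    "fish": {4, 7},
}

print(postings["cat"] & postings["dog"])   # cat AND dog      -> {2, 7}
print(postings["cat"] | postings["fish"])  # cat OR fish      -> {1, 2, 4, 5, 7}
print(postings["cat"] - postings["dog"])   # cat AND NOT dog  -> {1, 5}
```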
-
BM25F
BM25F extends BM25 to multi-field documents by weighting each field separately before combining, so title matches can outweigh body matches without simply multiplying the final score.
-
BM25+
BM25+ fixes a deficiency in BM25's length normalisation where very long documents containing a rare query term can be scored much the same as shorter documents that don't contain it at all, by adding a small constant lower bound to the TF contribution.
-
Analyzer
An analyzer is a named, reusable analysis chain configuration in Solr, Elasticsearch, or OpenSearch — combining a tokeniser and token filters into a unit that can be assigned to fields.
-
Analysis Chain
An analysis chain is the ordered pipeline of tokeniser and token filters that transforms raw text into index terms. The same chain (or a compatible one) must be applied at both index time and query time.
-
Beider-Morse Phonetic Matching
Beider-Morse Phonetic Matching (BMPM) is a rule-based phonetic algorithm designed for Jewish surnames, applying language-specific phonological rules to match names across Yiddish, Hebrew, Russian, Polish, German, and other languages.
-
Cologne Phonetics
Cologne Phonetics (Kölner Phonetik) is a German phonetic algorithm that maps names to numeric codes, enabling phonetic matching across German spelling variations that Soundex cannot handle.
-
Daitch-Mokotoff Soundex
A Soundex variant developed for Slavic and Yiddish surnames that produces a six-digit numeric code and can return multiple codes per name to handle ambiguous digraph pronunciations.
-
Match Rating Approach
The Match Rating Approach encodes a name into a codex and then compares two codices using a defined similarity rating, returning a boolean match decision rather than leaving comparison to the caller.
-
NYSIIS
NYSIIS is a phonetic encoding algorithm developed in 1970 that maps names to letter-based codes, producing more accurate matches for North American names than Soundex.
-
Caverphone
Caverphone is a phonetic encoding algorithm designed for New Zealand English names, producing a 10-character code to match name variants across historical records.
-
Double Metaphone
Double Metaphone extends the original Metaphone algorithm by producing two phonetic codes per word — a primary and a secondary — to handle pronunciation ambiguity and non-English name patterns.
-
Metaphone
Metaphone encodes an English word into a variable-length string of consonant sounds, applying context-sensitive phonological rules that allow names with different spellings but similar pronunciations to match.
-
Metaphone 3
Metaphone 3 is a commercial phonetic algorithm by Lawrence Philips that extends Double Metaphone with a substantially larger rule set, claiming around 98% accuracy on English and European names.
-
Phonetic Encoding
Phonetic encoding maps a word to a compact code that represents its pronunciation, so that words which sound alike but are spelled differently produce the same code and match one another.
-
Soundex
Soundex maps a name to a four-character code — one letter plus three digits — so that names with similar pronunciations but different spellings produce the same code and match one another.
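A simplified sketch of American Soundex; it skips the rule that 'h' and 'w' do not separate letters with the same code, which complete implementations include:

```python
def soundex(name: str) -> str:
    """Map a name to one letter plus three digits, padding with zeros."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    first = name[0].upper()
    digits = []
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code
    return (first + "".join(digits) + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```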
-
Decompounding
Decompounding splits compound words — common in German, Dutch, and Scandinavian languages — into their component tokens so that searches for constituents match the full compound at index and query time.
-
Hunspell
Hunspell is a dictionary-based morphological analyser and spell checker that produces lemmas by stripping affixes and looking up base forms in a language-specific dictionary.
-
Lancaster Stemmer
The Lancaster Stemmer is an alternative name for the Paice/Husk Stemmer — an aggressive, iterative English stemming algorithm developed at Lancaster University.
-
Lovins Stemmer
The Lovins Stemmer is the earliest published stemming algorithm (1968), reducing English words to stems in a single pass by stripping the longest matching suffix from a list of 294 endings.
-
Paice/Husk Stemmer
The Paice/Husk Stemmer is an iterative English stemmer using a single compact rule table with a loop-back architecture, producing aggressively short stems at the cost of over-stemming.
-
ASCII Folding
ASCII folding maps accented and special characters to their closest ASCII equivalents using a lookup table, improving recall for users who omit diacritics at the cost of collapsing distinctions that may be semantically meaningful.
-
Decimal Digit Filter
A decimal digit filter maps Unicode decimal digit characters from any script to their ASCII 0–9 equivalents, ensuring that numbers written in Eastern Arabic, Devanagari, Thai, and other numeral systems match the same query regardless of which digit form was used.
-
HTML Strip
HTML stripping is a character-level preprocessing stage that removes markup tags and decodes HTML entities from raw text before it reaches the tokeniser, preventing angle brackets and entity sequences from appearing as index terms.
-
KStem
KStem is a conservative English stemmer that combines suffix-stripping with a built-in lexicon to avoid false conflations, producing cleaner stems than Porter2 at the cost of a dictionary dependency.
-
Length Filter
A length filter is a token filter in an analysis chain that discards any token whose character length falls outside a configured minimum and maximum bound, removing noise tokens produced by tokenisation or upstream rewriting.
-
Lowercasing
Lowercasing converts every character in a string to its lowercase form, eliminating case variation so that ‘HTTP’, ‘Http’, and ‘http’ map to a single index term.
-
Normalisation
Normalisation transforms raw text into a consistent, canonical form — lowercasing, accent stripping, Unicode standardisation — so that surface variants of the same term map to a single index entry.
-
Pattern Replace Filter
A pattern replace filter applies a regular expression substitution to each token in an analysis chain, rewriting token text in place without changing token boundaries — distinct from a pattern tokeniser, which splits the raw character stream.
-
Porter Stemmer
The Porter Stemmer is a rule-based English suffix-stripping algorithm that reduces words to a stem using five sequential transformation passes gated by a vowel-consonant measure.
-
Porter2 Stemmer
Porter2 is a revised English suffix-stripping algorithm from the Snowball project that fixes around 200 mis-stemmings in the original Porter Stemmer and is the default stemmer in Elasticsearch’s english analyser.
-
Stop Word
A stop word is a high-frequency function word — such as the, is, or at — removed from a token stream during analysis to reduce index noise and improve retrieval efficiency.
-
Stop Word Filter
A stop word filter is a token filter in an analysis chain that removes stop words from the token stream at index time and query time, reducing index size and suppressing high-frequency noise terms.
-
Unicode Normalisation
Unicode normalisation resolves the fact that a single visible character can be encoded multiple ways, standardising text to one of four forms — NFC, NFD, NFKC, or NFKD — before comparison, indexing, or hashing.
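A minimal sketch using Python's standard unicodedata module, showing that a precomposed 'é' and an 'e' plus combining accent only compare equal after normalisation:

```python
import unicodedata

composed = "caf\u00e9"     # 'café' with a precomposed é (one code point)
decomposed = "cafe\u0301"  # 'cafe' + combining acute accent (two code points)

print(composed == decomposed)                       # False
print(unicodedata.normalize("NFC", composed) ==
      unicodedata.normalize("NFC", decomposed))     # True
```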
-
CJK Tokeniser
A CJK tokeniser segments Chinese, Japanese, and Korean text into tokens by splitting at every character or by applying a dictionary and statistical model to identify word boundaries.
-
Thai Tokeniser
A Thai tokeniser segments Thai script into words by combining a word-boundary dictionary with statistical or ML models, since Thai is written without spaces between words.
-
Path Hierarchy Tokeniser
A path hierarchy tokeniser splits a path string into every prefix of its hierarchy, so that a document at /a/b/c is also findable by /a/b or /a — enabling subtree search on file paths, URL components, and category trees.
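A minimal sketch of the prefix expansion (the function name is illustrative; Lucene's path_hierarchy tokeniser also supports custom delimiters, replacement characters, and a reverse mode):

```python
def path_hierarchy_tokens(path: str, delimiter: str = "/") -> list[str]:
    """Emit every ancestor prefix of a delimited path, longest last."""
    parts = path.strip(delimiter).split(delimiter)
    return [delimiter + delimiter.join(parts[:i]) for i in range(1, len(parts) + 1)]

print(path_hierarchy_tokens("/a/b/c"))  # ['/a', '/a/b', '/a/b/c']
```
-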
Query Expansion
Query expansion augments a user’s search query with synonyms, related terms, or reformulations to reduce vocabulary mismatch and improve recall against an inverted index.
-
Edge N-Gram
An edge n-gram is a prefix-anchored n-gram generated from the start of a token, used in search engines to power as-you-type autocomplete and prefix matching.
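A minimal sketch; min_gram and max_gram mirror the usual filter parameters, and the function name is illustrative:

```python
def edge_ngrams(token: str, min_gram: int = 1, max_gram: int = 10) -> list[str]:
    """Generate prefixes of the token from min_gram up to max_gram characters."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

print(edge_ngrams("search", min_gram=2))  # ['se', 'sea', 'sear', 'searc', 'search']
```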
-
ICU Tokeniser
The ICU tokeniser applies ICU BreakIterator rules to split text into tokens, extending UAX #29 with locale-aware dictionary segmentation for CJK and Thai and support for custom script rules.
-
Unicode Tokeniser
A Unicode tokeniser splits text into tokens using Unicode character categories and the UAX #29 word-boundary rules, giving consistent, script-aware token boundaries across languages; scripts written without spaces still need dictionary-based segmentation on top.
-
Regex Tokeniser
A regex tokeniser defines token boundaries with a regular expression, either splitting on delimiter matches or extracting token matches — the generalisation underlying whitespace, punctuation, and word tokenisers.
-
Sentence Tokeniser
A sentence tokeniser splits a document into individual sentences, establishing the boundary between document-level and word-level processing — a step that is harder than it appears because full stops serve multiple roles.
-
Word Tokeniser
A word tokeniser splits text into tokens at word boundaries using rules or regular expressions, correctly handling punctuation, contractions, hyphenation, and URLs where a whitespace split would fail.
-
Punctuation Tokeniser
A punctuation tokeniser splits text on both whitespace and punctuation characters, emitting only alphabetic and numeric runs — a simple, stateless approach common in search engine analysis chains.
-
Whitespace Tokeniser
A whitespace tokeniser splits a string into tokens by breaking on space, tab, and newline characters — the simplest possible tokenisation strategy, with well-defined failure modes.
-
Lemmatisation
Lemmatisation reduces an inflected word form to its dictionary base form — its lemma — by applying morphological analysis and a lexicon lookup, producing valid words rather than truncated stems.
-
Stemming
Stemming reduces a word to a base form by stripping affixes using rule-based heuristics, allowing variant forms such as “running” and “runs” to match a single index term; irregular forms such as “ran” generally require lemmatisation instead.
-
F1 Score
The F1 score is the harmonic mean of precision and recall, producing a single number that balances a model’s ability to avoid false positives against its ability to avoid false negatives.
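A minimal sketch computing F1 from raw true-positive, false-positive, and false-negative counts; the function name is illustrative:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(f1_score(tp=80, fp=20, fn=40))  # precision 0.8, recall ~0.67 -> F1 ~0.73
```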
-
BM25
BM25 (Best Match 25) is a probabilistic ranking function that scores documents against a query by weighing term frequency and inverse document frequency with length normalisation.
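A minimal single-term sketch using the common k1/b parameterisation with Lucene's default values (k1 = 1.2, b = 0.75); the IDF shown is the smoothed form Lucene uses, and the function name is illustrative:

```python
import math

def bm25_term_score(tf: float, df: int, num_docs: int, doc_len: float,
                    avg_doc_len: float, k1: float = 1.2, b: float = 0.75) -> float:
    """Score one query term against one document; sum over query terms for the full score."""
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    norm_tf = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm_tf

# A term occurring 3 times in a slightly short document, rare in the corpus.
print(bm25_term_score(tf=3, df=50, num_docs=100_000, doc_len=80, avg_doc_len=100))
```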
-
TF-IDF
TF-IDF (term frequency–inverse document frequency) is a numerical statistic that reflects how important a word is to a document relative to a corpus, used as a relevance signal in search ranking.
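A minimal sketch of one common weighting variant, raw term count times log(N/df); libraries often use smoothed or sublinear variants, and the function name and corpus statistics here are illustrative:

```python
import math
from collections import Counter

def tfidf(doc_tokens: list[str], doc_freq: dict[str, int], num_docs: int) -> dict[str, float]:
    """Weight each term in a document by its count times the corpus-level IDF."""
    counts = Counter(doc_tokens)
    return {term: tf * math.log(num_docs / doc_freq[term]) for term, tf in counts.items()}

# Toy corpus statistics: 'retrieval' is much rarer than 'the'.
weights = tfidf(["the", "retrieval", "the"], {"the": 900, "retrieval": 12}, num_docs=1000)
print(weights)  # 'retrieval' gets a far higher weight than 'the'
```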
-
Trie
A trie is a tree where each path from root to node spells out a prefix, enabling O(k) term lookup, prefix enumeration, and autocomplete — where k is the length of the query string.
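A minimal sketch supporting insertion and prefix enumeration; production term dictionaries use more compact structures such as FSTs, but the traversal idea is the same:

```python
class TrieNode:
    def __init__(self):
        self.children: dict[str, "TrieNode"] = {}
        self.is_term = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, term: str) -> None:
        node = self.root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        node.is_term = True

    def with_prefix(self, prefix: str) -> list[str]:
        """Walk to the prefix node, then collect every term below it."""
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        results, stack = [], [(node, prefix)]
        while stack:
            cur, text = stack.pop()
            if cur.is_term:
                results.append(text)
            for ch, child in cur.children.items():
                stack.append((child, text + ch))
        return results

t = Trie()
for w in ["search", "sea", "seal", "shard"]:
    t.insert(w)
print(t.with_prefix("sea"))  # ['sea', 'seal', 'search'] in some order
```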
-
Inverted Index
An inverted index maps each unique term in a corpus to the documents — and optionally the positions — where it appears, making full-text search fast regardless of corpus size.
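A minimal sketch that maps each term to the set of document IDs containing it; real indexes add frequencies, positions, and compression, and the tokenisation here is a bare whitespace split:

```python
from collections import defaultdict

def build_inverted_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each term to the set of document IDs that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

index = build_inverted_index({1: "the cat sat", 2: "the dog ran", 3: "cat and dog"})
print(sorted(index["cat"]))  # [1, 3]
print(sorted(index["dog"]))  # [2, 3]
```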
-
Shingle
A shingle is an n-gram treated as a set element for document comparison. The term signals a shift from positional sequence analysis to set-based similarity measurement.
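A minimal sketch of shingle-set comparison with Jaccard similarity (MinHash is the usual way to scale this up); the function names are illustrative:

```python
def shingles(tokens: list[str], k: int = 2) -> set[tuple[str, ...]]:
    """Treat each k-token window as a set element."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

s1 = shingles("the quick brown fox jumps".split())
s2 = shingles("the quick brown dog jumps".split())
print(jaccard(s1, s2))  # ~0.33 (2 shared bigrams out of 6 distinct)
```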
-
Character N-Gram
A character n-gram is a contiguous sequence of n characters extracted from a string, enabling tokenisation-free indexing, fuzzy search, language identification, and subword modelling.
-
Trigram
A trigram is an n-gram of length 3 — three consecutive tokens considered as a unit. Trigrams extend bigrams with one extra token of context, improving disambiguation at the cost of sparser counts.
-
Bigram
A bigram is an n-gram of length 2 — two consecutive tokens considered as a pair. Bigram models condition each token on the one before it, capturing local order that unigram models discard.
-
Unigram
A unigram is an n-gram of length 1 — a single token considered in isolation. The unigram model treats each token as statistically independent, forming the basis of bag-of-words retrieval.
-
N-Gram
An n-gram is a contiguous sequence of n tokens drawn from a text, used to capture local word order for indexing, language modelling, and similarity.
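A minimal sketch that slides a window of n tokens across the text; applying the same function to a string's characters yields character n-grams:

```python
def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Every contiguous run of n tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "to be or not to be".split()
print(ngrams(tokens, 2))  # bigrams: ('to','be'), ('be','or'), ('or','not'), ...
print(ngrams(tokens, 3))  # trigrams: ('to','be','or'), ('be','or','not'), ...
```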
-
Corpus
A corpus is a structured collection of text documents used to train, evaluate, or build statistics for an NLP system — the raw material from which indexes, models, and vocabularies are derived.
-
Tokenisation
Tokenisation is the process of splitting a raw text string into a sequence of discrete units — tokens — that downstream NLP components such as indexers, classifiers, and language models can operate on.
-
Token
A token is the smallest unit of text that an NLP pipeline or search engine operates on — typically a word, subword, or character produced by splitting an input string.