Needs-Review
-
Zipf's Law
Empirical observation that a term's frequency is inversely proportional to its frequency rank; explains why a few words dominate a corpus.
-
Zero-Shot Learning
Model performs task without explicit training examples; relies on pre-training and task description in natural language.
-
Word2Vec
Efficient neural method for learning word embeddings using skip-gram or CBOW objectives, published by Mikolov et al. in 2013.
-
Word Embedding
Dense vector representation of a word in low-dimensional space, capturing semantic and syntactic relationships.
-
Wildcard Query
Matches terms using * (any characters) and ? (single character) glob patterns. Enables flexible term matching without edit distance.
-
Wavelet Tree
Succinct structure for rank/select queries on sequences; enables fast pattern searching and compressed storage with fast access.
-
Vocabulary
Set of unique terms (types) appearing in a corpus; fundamental to information retrieval and language analysis. Its size counts types, not tokens.
-
Vector Space Model
The vector space model (VSM) represents documents and queries as vectors in a high-dimensional term space and ranks documents by their cosine similarity to the query vector.
-
Type
A unique word form in a corpus; distinguished from token (single occurrence). Vocabulary = set of all types.
-
Trigram Similarity
Jaccard similarity over character trigrams. Used by PostgreSQL pg_trgm for fast approximate matching.
-
Transformer
Attention-based neural architecture without recurrence; enables efficient parallel training and strong performance on language tasks. Published by Vaswani et al., 2017.
-
Tokeniser Vocabulary
Fixed set of subword units learned or predefined for tokenisation; typically 32k–128k tokens, balancing compression and flexibility.
-
Text Classification
Assigning one or more categories to text; includes sentiment analysis, topic classification, spam detection, and intent recognition.
-
Term Vector
A term vector is a per-document record of the terms, frequencies, and optionally positions and offsets produced by index-time analysis. It enables highlighting, More Like This queries, and forward-index access.
-
Term Frequency
Term frequency (TF) is the count of how many times a term appears in a document. It is one of the two core signals in TF-IDF and BM25 scoring.
-
Synonym Expansion
Query expansion using a configured synonym map, automatically retrieving documents with synonymous terms. Improves recall at query time.
-
Suffix Tree
Compressed trie of all suffixes enabling O(m) pattern matching without binary search; space-expensive but time-optimal.
-
Suffix Array
Sorted array of all suffixes; space-efficient full-text search enabling O(m log n) pattern matching with m = pattern length.
-
Succinct Data Structure
Data structures using near-optimal space (close to information-theoretic bounds) while supporting efficient queries directly on the compressed representation, without full decompression.
-
Stored Field
A stored field retains the original verbatim value of a field in the index so it can be returned in search results. Stored fields are separate from the inverted index and from DocValues.
-
Stop List
Curated set of stop words for a language (the, a, and, or); filtered during preprocessing to reduce noise in retrieval and analysis.
-
SPLADE
Sparse Lexical and Expansion model; learns sparse term-weight embeddings compatible with inverted indexes while capturing semantic understanding.
-
Sparse Retrieval
Retrieval using inverted-index term matching and scoring functions like BM25 or TF-IDF; contrasts with dense nearest-neighbour methods.
-
Span Query
Low-level positional query type enabling precise term position constraints. Basis for phrase and proximity queries in Lucene.
-
Snowball Stemmer
Snowball is a string-processing language and framework for writing stemming algorithms, developed by Martin Porter. It ships stemmers for 20+ languages and is the source of the Porter2 (English) stemmer used in most modern search engines.
-
Smith-Waterman
Local sequence alignment algorithm with configurable match/mismatch/gap penalties. Standard in bioinformatics for finding conserved regions.
-
Skip List
Probabilistic linked-list variant enabling O(log n) search and efficient postings-list merging; used in full-text search engines.
-
SimHash
Fingerprinting algorithm that approximately preserves cosine similarity; maps similar documents to hashes with small Hamming distance, enabling efficient near-duplicate detection.
-
Sequence-to-Sequence
Encoder-decoder architecture mapping input sequences to output sequences; used for translation, summarisation, and dialogue.
-
Sentence Embedding
Dense vector representation of a sentence or passage, aggregating token information into a single low-dimensional vector that preserves semantic meaning.
-
Sentence Boundary Detection
Identifying sentence boundaries in text; handles ambiguous punctuation (periods in abbreviations, decimal points, URLs) and enables sentence-level processing.
-
Self-Attention
Attention where query, key, and value vectors come from the same input sequence; enables capturing dependencies within a sequence.
-
Segment
A segment is an immutable, self-contained unit of a Lucene index. New documents are written to new segments; segments are periodically merged to keep the index efficient.
-
ROUGE
Recall-Oriented Understudy for Gisting Evaluation; n-gram overlap metric for summarisation and paraphrase evaluation.
-
Roaring Bitmap
Compressed bitmap enabling fast set operations (AND, OR); space-efficient for storing large sparse sets of integers.
-
RLHF
Reinforcement Learning from Human Feedback; uses human preference comparisons to fine-tune language models for safety and alignment.
-
Retrieval-Augmented Generation
Grounding language model generation in retrieved external documents; reduces hallucination and enables knowledge updates without retraining.
-
Reranker
Second-stage model re-scoring a candidate set retrieved by first-stage retrieval; improves ranking quality at modest computational cost.
-
Relevance Judgement
Human annotation of query-document relevance; provides ground truth for IR system evaluation.
-
Recall
Fraction of relevant documents that are retrieved; measures completeness of retrieval; high recall indicates few false negatives.
-
Range Query
Matches terms or values within a numeric, date, or lexicographic range. Enables filtering by boundaries.
-
Query-time Analysis
Query-time analysis is the analysis chain applied to the query string before it is matched against the index. It must produce terms compatible with those generated at index time.
-
Query Parser
Interprets a query string and produces a structured query object for execution. Essential bridge between user input and query engine.
-
Proximity Query
Matches terms within a maximum token distance of each other, ignoring order. Useful for finding related terms that needn’t be phrases.
-
Prompt Engineering
Crafting input text to elicit desired behaviour from language models without retraining; critical skill for modern LLMs.
-
Product Quantisation
Vector compression technique that splits high-dimensional vectors into subvectors and quantises each with its own codebook, enabling memory-efficient ANN search. Abbreviated PQ.
-
Probabilistic Retrieval Model
Probabilistic retrieval models rank documents by their estimated probability of relevance to a query. BM25 is the most successful probabilistic retrieval model; language models offer an alternative probabilistic framework.
-
Prefix Query
Matches all terms beginning with a given string. Efficient with prefix-sorted indices or trie structures. Common in autocomplete.
-
Precision
Fraction of retrieved documents that are relevant; measures quality of retrieved set; high precision indicates few false positives.
-
Precision at k
Precision measured over top k results; practical metric reflecting user experience when viewing limited results.
-
Postings List
A postings list is the ordered sequence of postings for a single term in an inverted index — the list of all documents containing that term, with optional frequencies and positions.
-
Posting
A posting is a single record in an inverted index, linking a term to one document in which it appears — optionally including term frequency and token positions.
-
Positional Index
A positional index extends the inverted index by storing the token position of each term occurrence within a document, enabling phrase queries and proximity queries.
-
Positional Encoding
Injecting token position information into transformer inputs; allows model to distinguish between tokens based on sequence order.
-
Pointwise Mutual Information
Log ratio of two terms' joint probability to the product of their independent probabilities; measures how much more often terms co-occur than expected by chance. Abbreviated PMI.
-
Phrase Query
Matches tokens in exact sequence within a configurable slop distance. Enables “quoted phrase” search and loose phrase matching.
-
Perplexity
Exponentiated negative average log probability; measures how well a language model predicts a sample. Lower is better.
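A minimal sketch of the computation from per-token probabilities (the function name is illustrative):

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the negative mean log-probability.

    token_probs: the model-assigned probability of each observed token.
    """
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)

# A model that assigns every token probability 0.25 has perplexity 4:
# it is as uncertain as a uniform choice among 4 tokens.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0
```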
-
Part-of-Speech Tagging
Assigning grammatical roles (noun, verb, adjective, etc.) to tokens in text; fundamental for syntax analysis and downstream NLP tasks.
-
Overlap Coefficient
Set similarity metric: intersection / smaller set size. Asymmetric variant of Jaccard emphasising subset containment.
-
Okapi BM25
Okapi BM25 is the original formulation of BM25, developed at City University London on the Okapi IR system in the early 1990s. The name ‘Okapi BM25’ honours the system; in practice it is synonymous with BM25.
-
Needleman-Wunsch
Global sequence alignment algorithm computing optimal alignment across entire sequences. Foundation for sequence comparison in bioinformatics.
-
NDCG
Normalised Discounted Cumulative Gain; ranking quality metric that discounts relevance gains by rank position and normalises against the ideal ordering.
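A minimal sketch of the computation, assuming graded relevance labels for the ranked results (helper names are illustrative):

```python
import math

def dcg(gains):
    """Discounted cumulative gain: each gain divided by log2(rank + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, 1))

def ndcg(gains, k=None):
    """DCG of the observed ranking divided by the DCG of the ideal one."""
    k = k or len(gains)
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal else 0.0

# Graded relevance labels of the top 4 results, in ranked order:
print(round(ndcg([3, 2, 0, 1]), 3))  # 0.985, a near-ideal ordering
```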
-
Named Entity Recognition
Identifying and classifying named entities (persons, locations, organisations) in text; fundamental NLP task for information extraction.
-
N-Gram Language Model
Language model estimating token probabilities from observed n-gram counts; foundation of statistical NLP before neural methods.
-
Multi-Head Attention
Multiple parallel attention mechanisms operating on different subspaces; enables learning diverse interaction patterns simultaneously.
-
More Like This
Finds documents similar to a given document using term statistics and relevance scoring. Basis for recommendation and related-document features.
-
Minimum Should Match
Specifies how many optional (SHOULD) clauses must match in a boolean query. Controls recall-precision tradeoff for OR queries.
-
MinHash
Probabilistic set similarity estimation via minimal hash values. Enables fast approximate Jaccard similarity in streaming or large-scale settings.
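A toy sketch using Python's built-in hash with random salts standing in for independent hash functions (real implementations use proper hash families; built-in string hashing is only stable within one process):

```python
import random

def minhash_signature(items, num_hashes=64, seed=42):
    """One minimum per salted hash; P(minima agree) equals Jaccard similarity."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    return [min(hash((salt, x)) for x in items) for salt in salts]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of agreeing signature positions estimates Jaccard."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature({"data", "mining", "text"})
b = minhash_signature({"data", "mining", "search"})
print(estimate_jaccard(a, b))  # close to 0.5, the true Jaccard of the sets
```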
-
Merge Policy
A merge policy defines the rules governing when and how Lucene index segments are merged. Merging controls the tradeoff between indexing throughput, search performance, and disk usage.
-
Mean Reciprocal Rank
Mean over queries of the reciprocal rank of the first relevant result; measures how quickly a system finds its first answer. Abbreviated MRR.
-
Mean Average Precision
Mean of average precision scores across queries; standard evaluation metric balancing precision and ranking quality. Abbreviated MAP.
-
Matryoshka Representation Learning
Training method where prefixes of a vector are also useful embeddings; enables efficient storage and search at multiple granularities. Abbreviated MRL.
-
Masked Language Model
Predicts randomly masked tokens from context; primary pre-training objective for bidirectional encoders like BERT.
-
Longest Common Substring
Longest contiguous character sequence common to two strings. Useful for plagiarism detection and similarity measurement.
-
Longest Common Subsequence
Longest sequence of characters common to two strings in order (not necessarily contiguous). Foundation for sequence alignment and diff algorithms.
-
Locality-Sensitive Hashing
Hashing technique mapping similar items to the same bucket. Enables sublinear approximate nearest-neighbour search. Abbreviated LSH.
-
Levenshtein Distance
Edit distance allowing insertions, deletions, and substitutions. Canonical metric for string similarity and typo tolerance.
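The classic dynamic-programming computation, as a minimal sketch:

```python
def levenshtein(a: str, b: str) -> int:
    """DP over a (len(a)+1) x (len(b)+1) cost table, keeping only one row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```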
-
Learning to Rank
Learning to rank (LTR) trains a model to produce an optimal ordering of documents for a query using labelled relevance data, combining signals such as BM25, click-through rate, and document features.
-
Language Model
Probability distribution over sequences of tokens; predicts next token given context. Foundation of NLP from n-grams to large language models.
-
Jaro-Winkler Similarity
Jaro similarity with prefix bonus for matching initial characters. Improves accuracy for name and record matching.
-
Jaro Similarity
String similarity metric for short strings based on matching characters and transpositions. Commonly used in record linkage and data quality.
-
Jaccard Similarity
Set overlap metric: intersection / union. Measures similarity of sets without regard to order or duplicates.
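A one-function sketch over Python sets:

```python
def jaccard(a: set, b: set) -> float:
    """|A intersect B| / |A union B|; defined here as 0 for two empty sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

print(jaccard({"red", "green", "blue"}, {"green", "blue", "yellow"}))  # 0.5
```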
-
IVF Index
Inverted File index; partitions high-dimensional vector space into Voronoi cells for scalable approximate nearest-neighbour search.
-
Inverse Document Frequency
Inverse document frequency (IDF) is a log-scaled measure of how rare a term is across a corpus. Rare terms receive high IDF weights; common terms receive low weights, making IDF a natural filter for uninformative vocabulary.
-
Instruction Tuning
Fine-tuning language models on diverse (instruction, response) pairs to improve generalisation and the ability to follow natural language instructions.
-
Index-time Analysis
Index-time analysis is the analysis chain applied to document text when it is ingested into the index. The terms it produces are what get stored in the inverted index and matched against at search time.
-
Hybrid Search
Combining dense vector similarity and sparse term-matching scores to balance semantic understanding with keyword precision.
-
HNSW
Hierarchical Navigable Small World; state-of-the-art graph-based approximate nearest-neighbour index balancing speed and recall.
-
Heaps' Law
Vocabulary grows sub-linearly with corpus size; predicts vocabulary size V from token count n via the power law V = K·n^β, with β typically between 0.4 and 0.6.
-
Hapax Legomenon
Term occurring exactly once in a corpus; indicates vocabulary richness and poses challenges for language models and IR systems.
-
Hamming Distance
Number of positions at which two equal-length strings differ. Efficient metric for fixed-length codes and binary data.
-
Hallucination
Generating plausible-sounding but factually incorrect content; a key limitation of language models, especially on knowledge-intensive tasks.
-
Grounding
Connecting model outputs to verifiable external sources; reduces hallucination by anchoring generation in retrieved facts or documents.
-
GPT
Generative Pre-trained Transformer; autoregressive decoder-only model for text generation and language understanding, published by OpenAI from 2018 onwards.
-
GloVe
Global Vectors for Word Representation; combines matrix factorisation of word co-occurrence statistics with local context windows for learning embeddings.
-
Fuzzy Query
Matches terms within a specified edit distance threshold, tolerating typos and misspellings. Typically uses Levenshtein distance.
-
Forward Index
A forward index maps each document to the list of terms it contains. It is the natural output of document ingestion and the starting point for building an inverted index.
-
FM-Index
Full-text index based on Burrows-Wheeler Transform; enables pattern matching and compressed storage simultaneously.
-
Finite State Transducer
Trie-like automaton for compressed term dictionaries and morphological analysis; maps input strings to outputs (e.g., word to ID).
-
Fine-Tuning
Adapting a pre-trained model to a downstream task by training on task-specific data; standard approach in modern NLP.
-
Field Type
A field type is a named schema definition that specifies how a field’s values are stored, indexed, and analysed. It bundles an analyzer, storage options, and index behaviour into a reusable configuration.
-
Few-Shot Learning
Model generalises from a small number of prompt examples without explicit retraining; enabled by scale in large language models.
-
fastText
Word embedding method using character n-grams to handle out-of-vocabulary words and morphological variants; published by Bojanowski et al. in 2017.
-
FAISS
Facebook AI Similarity Search; open-source library implementing multiple approximate nearest-neighbour indexes for efficient similarity search at scale.
-
Edit Distance
Minimum number of single-character operations (insertions, deletions, substitutions) to transform one string into another. Foundation for similarity metrics.
-
Dot Product Similarity
Inner product of two vectors; equivalent to cosine similarity when vectors are unit-normalised; fast to compute in dense retrieval.
-
Domain-Specific Stop Words
Stop words for a particular field or domain; words that are frequent in the domain but carry little discriminative information (e.g., “paper” in academic text).
-
DocValues
DocValues is a column-oriented on-disk data structure in Lucene that stores field values per document, enabling efficient sorting, faceting, and aggregations without loading the entire index into memory.
-
Document Frequency
Document frequency (DF) is the number of documents in a corpus that contain a given term. It is the denominator in IDF and signals how common or rare a term is across the collection.
-
Dice Coefficient
Set similarity metric: twice the intersection size divided by the sum of both set sizes. Monotonically related to Jaccard but weights shared elements more heavily.
-
Dependency Parsing
Analysing grammatical structure by identifying directed dependency relations between tokens; output is a dependency tree.
-
Dense Retrieval
Retrieval method using nearest-neighbour search over dense embedding vectors; contrasts with inverted-index sparse retrieval like BM25.
-
Damerau-Levenshtein Distance
Edit distance including transpositions (swapping adjacent characters). Captures more common typos than Levenshtein alone.
-
Cross-Entropy
Cross-entropy measures the average number of bits needed to encode samples from a true distribution using a model distribution. It is the standard training loss for language models and the basis of perplexity.
-
Cross-Encoder
Neural architecture jointly encoding query-document pairs for accurate relevance scoring; used for reranking retrieved candidates from first-stage retrieval.
-
Cosine Similarity
Vector similarity metric: dot product / product of magnitudes. Standard measure for dense and sparse vector comparison in IR.
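A minimal pure-Python sketch (production systems use vectorised libraries):

```python
import math

def cosine_similarity(u, v):
    """Dot product divided by the product of the vector magnitudes."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

print(cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))  # 1.0 (same direction)
```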
-
Corpus Annotation
Adding linguistic labels to corpus text (POS tags, NER tags, dependencies, etc.); creates training data for supervised NLP tasks.
-
Coreference Resolution
Linking mentions of the same entity across a document; resolves pronouns and nominal references to their antecedents.
-
Context Window
Maximum number of tokens a language model can process in one pass; determines how much context the model sees. Typical values range from 512 to 128k tokens.
-
Commit
A commit makes indexed documents durable by flushing buffered index data to disk and writing a new segment commit point. Solr distinguishes hard commits (durable) from soft commits (visible but not durable); Elasticsearch similarly separates durable flushes from lightweight refreshes.
-
Collocation
Statistically significant co-occurrence of words (e.g. “strong tea”, “black coffee”); indicates meaningful phrases beyond random chance.
-
ColBERT
Contextualized Late Interaction over BERT; late-interaction ranking using per-token embeddings with MaxSim scoring for efficient dense retrieval.
-
Co-occurrence Matrix
Counts how often term pairs appear together in context; captures semantic relationships and enables embedding learning via matrix factorisation.
-
Cloze Task
Predicting masked tokens from context; unsupervised pre-training objective where random words are hidden and must be inferred.
-
Chunking
Grouping tokens into phrases or chunks; shallow syntactic analysis that segments noun phrases, verb phrases, and prepositional phrases.
-
Chunking Strategy
How documents are split into passages for indexing and retrieval in RAG systems; balance between granularity and context preservation.
-
Causal Language Model
Predicts next token from previous tokens; autoregressive objective for generative models like GPT, enabling text generation.
-
Case Folding
Case folding is locale-aware lowercasing that correctly handles languages where simple ASCII lowercasing produces wrong results — such as the Turkish dotted and dotless i or the German sharp s (ß).
-
Burrows-Wheeler Transform
Reversible permutation clustering similar contexts; makes text more compressible and enables FM-index for full-text search.
-
Boosting
Adjusts the relevance score contribution of a field, term, or query clause, multiplying base scores to prioritise matches. Essential for ranking tuning.
-
Boolean Retrieval
Boolean retrieval matches documents using AND, OR, and NOT operators applied to inverted index postings lists. It returns an exact set — all matching documents, unranked — rather than a ranked list.
-
BM25F
BM25F extends BM25 to multi-field documents by weighting each field separately before combining, so title matches can outweigh body matches without simply multiplying the final score.
-
BM25+
BM25+ fixes an edge-case bug in BM25 where long documents containing a rare query term can score lower than shorter documents that don’t contain it at all, by adding a small constant lower-bound to the TF contribution.
-
Bloom Filter
Probabilistic set membership test; extremely space-efficient with no false negatives but a small false-positive rate.
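A toy sketch using salted SHA-256 digests as the k hash functions (class and parameter names are illustrative):

```python
import hashlib

class BloomFilter:
    """Bit array plus k hash functions; lookups may yield false positives
    but never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a big integer used as the bit array

    def _positions(self, item: str):
        # Salting the digest input simulates k independent hash functions.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

bf = BloomFilter()
bf.add("retrieval")
print("retrieval" in bf)  # True
print("ranking" in bf)    # False (with high probability)
```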
-
BLEU Score
Bilingual Evaluation Understudy; n-gram overlap metric for machine translation evaluation. Published by Papineni et al., 2002.
-
Bi-Encoder
Neural architecture encoding query and document independently into separate embeddings, enabling fast retrieval via approximate nearest-neighbour search.
-
BERTScore
Semantic similarity using contextual BERT embeddings; measures meaning-level matching rather than surface-level n-gram overlap.
-
BERT
Bidirectional Encoder Representations from Transformers; bidirectional transformer pre-trained with masked language modelling, foundational for NLP tasks.
-
Attention Mechanism
Weighted aggregation of context vectors, allowing models to focus on relevant information. Fundamental to transformers and modern NLP.
-
Approximate Nearest Neighbour
Fast nearest-neighbour search algorithm sacrificing exactness for speed; enables practical dense retrieval at scale. Abbreviated ANN.
-
Analyzer
An analyzer is a named, reusable analysis chain configuration in Solr, Elasticsearch, or OpenSearch — combining a tokeniser and token filters into a unit that can be assigned to fields.
-
Analysis Chain
An analysis chain is the ordered pipeline of tokeniser and token filters that transforms raw text into index terms. The same chain (or a compatible one) must be applied at both index time and query time.
-
Aho-Corasick
Multi-pattern string matching algorithm running in O(n + m + z) time, where n is the text length, m the total pattern length, and z the number of matches; enables efficient synonym/keyword highlighting and entity tagging.
-
Beider-Morse Phonetic Matching
Beider-Morse Phonetic Matching (BMPM) is a rule-based phonetic algorithm designed for Jewish surnames, applying language-specific phonological rules to match names across Yiddish, Hebrew, Russian, Polish, German, and other languages.
-
Cologne Phonetics
Cologne Phonetics (Kölner Phonetik) is a German phonetic algorithm that maps names to numeric codes, enabling phonetic matching across German spelling variations that Soundex cannot handle.
-
Daitch-Mokotoff Soundex
A Soundex variant developed for Slavic and Yiddish surnames that produces a six-digit numeric code and can return multiple codes per name to handle ambiguous digraph pronunciations.
-
Match Rating Approach
The Match Rating Approach encodes a name into a codex and then compares two codices using a defined similarity rating, returning a boolean match decision rather than leaving comparison to the caller.
-
NYSIIS
NYSIIS (New York State Identification and Intelligence System) is a phonetic encoding algorithm developed in 1970 that maps names to letter-based codes, producing more accurate matches for North American names than Soundex.
-
Caverphone
Caverphone is a phonetic encoding algorithm designed for New Zealand English names, producing a 10-character code to match name variants across historical records.
-
Double Metaphone
Double Metaphone extends the original Metaphone algorithm by producing two phonetic codes per word — a primary and a secondary — to handle pronunciation ambiguity and non-English name patterns.
-
Metaphone
Metaphone encodes an English word into a variable-length string of consonant sounds, applying context-sensitive phonological rules that allow names with different spellings but similar pronunciations to match.
-
Metaphone 3
Metaphone 3 is a commercial phonetic algorithm by Lawrence Philips that extends Double Metaphone with a substantially larger rule set, claiming around 98% accuracy on English and European names.
-
Phonetic Encoding
Phonetic encoding maps a word to a compact code that represents its pronunciation, so that words which sound alike but are spelled differently produce the same code and match one another.
-
Soundex
Soundex maps a name to a four-character code — one letter plus three digits — so that names with similar pronunciations but different spellings produce the same code and match one another.
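A compact sketch of American Soundex, assuming non-empty alphabetic input:

```python
def soundex(name: str) -> str:
    """American Soundex: first letter plus three digits from consonant classes."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    name = name.upper()
    result = name[0]
    last_code = codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != last_code:
            result += code
        if ch not in "HW":  # H and W do not reset the previous code
            last_code = code
    return (result + "000")[:4]  # pad or truncate to letter + three digits

print(soundex("Robert"), soundex("Rupert"))  # R163 R163
print(soundex("Ashcraft"))                   # A261 (the H/W rule in action)
```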
-
Decompounding
Decompounding splits compound words — common in German, Dutch, and Scandinavian languages — into their component tokens so that searches for constituents match the full compound at index and query time.
-
Hunspell
Hunspell is a dictionary-based morphological analyser and spell checker that produces lemmas by stripping affixes and looking up base forms in a language-specific dictionary.
-
Inflection
Inflection is the morphological process by which a single lexeme takes on different surface forms to express grammatical categories such as tense, number, and case — the variation that lemmatisation is designed to undo.
-
Morphological Analysis
Morphological analysis decomposes words into their constituent morphemes — stems, prefixes, suffixes, and inflectional endings — enabling NLP systems to recognise that surface-form variants refer to the same underlying concept.
-
Suffix
A suffix is a bound morpheme appended to the right end of a word stem, encoding grammatical properties or creating new words — and the primary target of every English stemming algorithm.
-
Lancaster Stemmer
The Lancaster Stemmer is an alternative name for the Paice/Husk Stemmer — an aggressive, iterative English stemming algorithm developed at Lancaster University.
-
Lovins Stemmer
The Lovins Stemmer is the earliest published stemming algorithm (1968), reducing English words to stems in a single pass by stripping the longest matching suffix from a table of 294 rules.
-
Paice/Husk Stemmer
The Paice/Husk Stemmer is an iterative English stemmer using a single compact rule table with a loop-back architecture, producing aggressively short stems at the cost of over-stemming.
-
ASCII Folding
ASCII folding maps accented and special characters to their closest ASCII equivalents using a lookup table, improving recall for users who omit diacritics at the cost of collapsing distinctions that may be semantically meaningful.
-
Decimal Digit Filter
A decimal digit filter maps Unicode decimal digit characters from any script to their ASCII 0–9 equivalents, ensuring that numbers written in Eastern Arabic, Devanagari, Thai, and other numeral systems match the same query regardless of which digit form was used.
-
Elision Filter
An elision filter is a token filter that strips language-specific clitic prefixes — such as French l’ and d’ — from the start of tokens, leaving the bare stem for indexing and matching.
-
HTML Strip
HTML stripping is a character-level preprocessing stage that removes markup tags and decodes HTML entities from raw text before it reaches the tokeniser, preventing angle brackets and entity sequences from appearing as index terms.
-
KStem
KStem is a conservative English stemmer that combines suffix-stripping with a built-in lexicon to avoid false conflations, producing cleaner stems than Porter2 at the cost of a dictionary dependency.
-
Length Filter
A length filter is a token filter in an analysis chain that discards any token whose character length falls outside a configured minimum and maximum bound, removing noise tokens produced by tokenisation or upstream rewriting.
-
Lowercasing
Lowercasing converts every character in a string to its lowercase form, eliminating case variation so that ‘HTTP’, ‘Http’, and ‘http’ map to a single index term.
-
Normalisation
Normalisation transforms raw text into a consistent, canonical form — lowercasing, accent stripping, Unicode standardisation — so that surface variants of the same term map to a single index entry.
-
Pattern Replace Filter
A pattern replace filter applies a regular expression substitution to each token in an analysis chain, rewriting token text in place without changing token boundaries — distinct from a pattern tokeniser, which splits the raw character stream.
-
Porter Stemmer
The Porter Stemmer is a rule-based English suffix-stripping algorithm that reduces words to a stem using five sequential transformation passes gated by a vowel-consonant measure.
-
Porter2 Stemmer
Porter2 is a revised English suffix-stripping algorithm from the Snowball project that fixes around 200 mis-stemmings in the original Porter Stemmer and is the default stemmer in Elasticsearch’s english analyser.
-
Stop Word
A stop word is a high-frequency function word — such as the, is, or at — removed from a token stream during analysis to reduce index noise and improve retrieval efficiency.
-
Stop Word Filter
A stop word filter is a token filter in an analysis chain that removes stop words from the token stream at index time and query time, reducing index size and suppressing high-frequency noise terms.
-
Trim Filter
A trim filter is a token filter that strips leading and trailing whitespace characters from each token in the analysis stream, leaving the token’s interior content unchanged.
-
Unicode Normalisation
Unicode normalisation resolves the fact that a single visible character can be encoded multiple ways, standardising text to one of four forms — NFC, NFD, NFKC, or NFKD — before comparison, indexing, or hashing.
-
CJK Tokeniser
A CJK tokeniser segments Chinese, Japanese, and Korean text into tokens by splitting at every character or by applying a dictionary and statistical model to identify word boundaries.
-
Thai Tokeniser
A Thai tokeniser segments Thai script into words by combining a word-boundary dictionary with statistical or ML models, since Thai is written without spaces between words.
-
Path Hierarchy Tokeniser
A path hierarchy tokeniser splits a path string into every prefix hierarchy, so that a document at /a/b/c is also findable by /a/b or /a — enabling subtree search on file paths, URL components, and category trees.
-
Query Expansion
Query expansion augments a user’s search query with synonyms, related terms, or reformulations to reduce vocabulary mismatch and improve recall against an inverted index.
-
Edge N-Gram
An edge n-gram is a prefix-anchored n-gram generated from the start of a token, used in search engines to power as-you-type autocomplete and prefix matching.
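A minimal sketch of index-time expansion (parameter names mirror common search-engine settings but are illustrative):

```python
def edge_ngrams(token: str, min_gram: int = 1, max_gram: int = 5):
    """All prefixes of the token between min_gram and max_gram characters."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

# Indexing these prefixes lets the partial query "sear" match "search".
print(edge_ngrams("search"))  # ['s', 'se', 'sea', 'sear', 'searc']
```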
-
ICU Tokeniser
The ICU tokeniser applies ICU BreakIterator rules to split text into tokens, extending UAX #29 with locale-aware dictionary segmentation for CJK and Thai and support for custom script rules.
-
Unigram Language Model Tokeniser
The Unigram LM tokeniser builds a subword vocabulary top-down: it begins with a large candidate set and iteratively prunes entries that minimise the increase in corpus log-loss, producing a probability distribution over segmentations.
-
SentencePiece
SentencePiece is a language-agnostic subword tokeniser that trains directly on raw Unicode text, encodes whitespace as the ▁ symbol, and produces a fully reversible token sequence using either BPE or Unigram LM as the underlying algorithm.
-
WordPiece
WordPiece is a subword tokenisation algorithm that builds a vocabulary by iteratively merging symbol pairs chosen to maximise training-corpus likelihood, rather than raw frequency. It is the tokeniser used in BERT and its derivatives.
-
Byte Pair Encoding
Byte pair encoding is a data-compression algorithm repurposed for NLP to build subword vocabularies by iteratively merging the most frequent adjacent symbol pair in a training corpus.
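A toy trainer showing the merge loop on a tiny corpus (illustrative, not optimised):

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Repeatedly merge the most frequent adjacent symbol pair in the corpus."""
    # Each word starts as a tuple of single-character symbols.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])  # apply the merge
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges

print(bpe_merges(["low", "lower", "lowest", "low"], num_merges=3))
# [('l', 'o'), ('lo', 'w'), ('low', 'e')]: 'low' becomes a single symbol
```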
-
Subword Tokenisation
Subword tokenisation splits words into smaller vocabulary units — fragments between characters and whole words — so a fixed vocabulary can represent any input string, including words never seen during training.
-
Unicode Tokeniser
A Unicode tokeniser splits text into tokens using Unicode character categories and the UAX #29 word-boundary rules, producing correct token boundaries across all scripts and languages.
-
Regex Tokeniser
A regex tokeniser defines token boundaries with a regular expression, either splitting on delimiter matches or extracting token matches — the generalisation underlying whitespace, punctuation, and word tokenisers.
-
Sentence Tokeniser
A sentence tokeniser splits a document into individual sentences, establishing the boundary between document-level and word-level processing — a step that is harder than it appears because full stops serve multiple roles.
-
Word Tokeniser
A word tokeniser splits text into tokens at word boundaries using rules or regular expressions, correctly handling punctuation, contractions, hyphenation, and URLs where a whitespace split would fail.
-
Punctuation Tokeniser
A punctuation tokeniser splits text on both whitespace and punctuation characters, emitting only alphabetic and numeric runs — a simple, stateless approach common in search engine analysis chains.
-
Whitespace Tokeniser
A whitespace tokeniser splits a string into tokens by breaking on space, tab, and newline characters — the simplest possible tokenisation strategy, with well-defined failure modes.
-
Lemmatisation
Lemmatisation reduces an inflected word form to its dictionary base form — its lemma — by applying morphological analysis and a lexicon lookup, producing valid words rather than truncated stems.
-
Stemming
Stemming reduces a word to a base form by stripping affixes using rule-based heuristics, allowing variant forms such as “running” and “runs” to match a single index term; irregular forms such as “ran” require lemmatisation instead.
-
F1 Score
The F1 score is the harmonic mean of precision and recall, producing a single number that balances a model’s ability to avoid false positives against its ability to avoid false negatives.
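The formula as a one-function sketch:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean: punishes imbalance between precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1_score(0.9, 0.5))  # ~0.643, well below the arithmetic mean of 0.7
```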
-
BM25
BM25 (Best Match 25) is a probabilistic ranking function that scores documents against a query by weighing term frequency and inverse document frequency with length normalisation.
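A minimal, self-contained sketch of the scoring formula, using the Lucene-style IDF variant (function and parameter names are illustrative):

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one tokenised document against a query.

    corpus: list of tokenised documents, used for IDF and average length.
    """
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_terms:
        df = sum(term in d for d in corpus)      # document frequency
        if df == 0:
            continue
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        tf = doc_terms.count(term)               # term frequency
        norm = tf + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * tf * (k1 + 1) / norm      # saturating TF contribution
    return score

docs = [["fast", "search", "engine"], ["slow", "database"], ["search", "index"]]
print(bm25_score(["search"], docs[0], docs))  # ~0.42
```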
-
TF-IDF
TF-IDF (term frequency–inverse document frequency) is a numerical statistic that reflects how important a word is to a document relative to a corpus, used as a relevance signal in search ranking.
-
Trie
A trie is a tree where each path from root to node spells out a prefix, enabling O(k) term lookup, prefix enumeration, and autocomplete — where k is the length of the query string.
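A minimal sketch supporting insertion and prefix enumeration:

```python
class TrieNode:
    __slots__ = ("children", "is_term")

    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.is_term = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word: str):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_term = True

    def starts_with(self, prefix: str):
        """Enumerate all stored terms sharing the given prefix."""
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return
            node = node.children[ch]
        stack = [(node, prefix)]
        while stack:
            n, word = stack.pop()
            if n.is_term:
                yield word
            for ch, child in n.children.items():
                stack.append((child, word + ch))

t = Trie()
for w in ["car", "card", "care", "dog"]:
    t.insert(w)
print(sorted(t.starts_with("car")))  # ['car', 'card', 'care']
```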
-
Inverted Index
An inverted index maps each unique term in a corpus to the documents — and optionally the positions — where it appears, making full-text search fast regardless of corpus size.
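A minimal sketch of index construction and a boolean AND over postings:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = ["the quick fox", "the lazy dog", "quick brown dog"]
index = build_inverted_index(docs)
print(index["quick"])                           # [0, 2]
print(set(index["quick"]) & set(index["dog"]))  # {2}: an AND query
```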
-
Shingling
Shingling represents a document as its set of overlapping n-grams (shingles), enabling near-duplicate detection via Jaccard similarity or MinHash approximations.
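A minimal sketch producing word-level shingles and comparing two near-duplicate sentences:

```python
def shingles(text: str, n: int = 3):
    """Set of overlapping word n-grams (shingles) from a document."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

a = shingles("the quick brown fox jumps over the lazy dog")
b = shingles("the quick brown fox leaps over the lazy dog")
print(len(a & b) / len(a | b))  # Jaccard similarity over shingle sets: 0.4
```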
-
Shingle
A shingle is an n-gram treated as a set element for document comparison. The term signals a shift from positional sequence analysis to set-based similarity measurement.
-
Character N-Gram
A character n-gram is a contiguous sequence of n characters extracted from a string, enabling tokenisation-free indexing, fuzzy search, language identification, and subword modelling.
-
Skip-Gram
A skip-gram is a generalisation of the n-gram that allows gaps between tokens, and also the name of the Word2Vec training objective that predicts context words from a centre word.
-
Trigram
A trigram is an n-gram of length 3 — three consecutive tokens considered as a unit. Trigrams extend bigrams with one extra token of context, improving disambiguation at the cost of sparser counts.
-
Bigram
A bigram is an n-gram of length 2 — two consecutive tokens considered as a pair. Bigram models condition each token on the one before it, capturing local order that unigram models discard.
-
Unigram
A unigram is an n-gram of length 1 — a single token considered in isolation. The unigram model treats each token as statistically independent, forming the basis of bag-of-words retrieval.
-
N-Gram
An n-gram is a contiguous sequence of n tokens drawn from a text, used to capture local word order for indexing, language modelling, and similarity.
-
Corpus
A corpus is a structured collection of text documents used to train, evaluate, or build statistics for an NLP system — the raw material from which indexes, models, and vocabularies are derived.
-
Tokenisation
Tokenisation is the process of splitting a raw text string into a sequence of discrete units — tokens — that downstream NLP components such as indexers, classifiers, and language models can operate on.
-
Token
A token is the smallest unit of text that an NLP pipeline or search engine operates on — typically a word, subword, or character produced by splitting an input string.