Needs-Review
-
Zipf's Law
Empirical observation that a term's frequency is inversely proportional to its frequency rank; explains why a few words dominate a corpus.
-
Zero-Shot Learning
Model performs task without explicit training examples; relies on pre-training and task description in natural language.
-
Word2Vec
Efficient neural method for learning word embeddings using skip-gram or CBOW objectives, published by Mikolov et al. in 2013.
-
Word Embedding
Dense vector representation of a word in low-dimensional space, capturing semantic and syntactic relationships.
-
Wildcard Query
Matches terms using * (any characters) and ? (single character) glob patterns. Enables flexible term matching without edit distance.
-
Wavelet Tree
Succinct structure for rank/select queries on sequences; enables fast pattern searching and compressed storage with fast access.
-
Vocabulary
Set of unique terms (types) appearing in a corpus; fundamental to information retrieval and language analysis. Its size counts types, not tokens.
-
Vector Space Model
The vector space model (VSM) represents documents and queries as vectors in a high-dimensional term space and ranks documents by their cosine similarity to the query vector.
-
Type
A unique word form in a corpus; distinguished from token (single occurrence). Vocabulary = set of all types.
-
Trigram Similarity
Jaccard similarity over character trigrams. Used by PostgreSQL pg_trgm for fast approximate matching.
-
Transformer
Attention-based neural architecture without recurrence; enables efficient parallel training and strong performance on language tasks. Published by Vaswani et al., 2017.
-
Tokeniser Vocabulary
Fixed set of subword units learned or predefined for tokenisation; typically 32k–128k tokens, balancing compression and flexibility.
-
Text Classification
Assigning one or more categories to text; includes sentiment analysis, topic classification, spam detection, and intent recognition.
-
Term Vector
A term vector is a per-document record of the terms, frequencies, and optionally positions and offsets produced by index-time analysis. It enables highlighting, More Like This queries, and forward-index access.
-
Term Frequency
Term frequency (TF) is the count of how many times a term appears in a document. It is one of the two core signals in TF-IDF and BM25 scoring.
-
Synonym Expansion
Query expansion using a configured synonym map, automatically retrieving documents with synonymous terms. Improves recall at query time.
-
Suffix Tree
Compressed trie of all suffixes enabling O(m) pattern matching without binary search; space-expensive but time-optimal.
-
Suffix Array
Sorted array of all suffixes; space-efficient full-text search enabling O(m log n) pattern matching with m = pattern length.
-
Succinct Data Structure
Data structures using near-optimal space (close to information-theoretic bounds) while supporting efficient queries directly on the compressed representation, without full decompression.
-
Stored Field
A stored field retains the original verbatim value of a field in the index so it can be returned in search results. Stored fields are separate from the inverted index and from DocValues.
-
Stop List
Curated set of stop words for a language (the, a, and, or); filtered during preprocessing to reduce noise in retrieval and analysis.
-
SPLADE
Sparse Lexical and Expansion model; learns sparse term-weight embeddings compatible with inverted indexes while capturing semantic understanding.
-
Sparse Retrieval
Retrieval using inverted-index term matching and scoring functions like BM25 or TF-IDF; contrasts with dense nearest-neighbour methods.
-
Span Query
Low-level positional query type enabling precise term position constraints. Basis for phrase and proximity queries in Lucene.
-
Snowball Stemmer
Snowball is a string-processing language and framework for writing stemming algorithms, developed by Martin Porter. It ships stemmers for 20+ languages and is the source of the Porter2 (English) stemmer used in most modern search engines.
-
Smith-Waterman
Local sequence alignment algorithm with configurable match/mismatch/gap penalties. Standard in bioinformatics for finding conserved regions.
-
Skip List
Probabilistic linked-list variant enabling O(log n) search and efficient postings-list merging; used in full-text search engines.
-
SimHash
Fingerprinting algorithm that approximately preserves cosine similarity; maps similar documents to hashes with small Hamming distance, enabling efficient near-duplicate detection.
-
Sequence-to-Sequence
Encoder-decoder architecture mapping input sequences to output sequences; used for translation, summarisation, and dialogue.
-
Sentence Embedding
Dense vector representation of a sentence or passage, aggregating token information into a single low-dimensional vector that preserves semantic meaning.
-
Sentence Boundary Detection
Identifying sentence boundaries in text; handles ambiguous punctuation (periods in abbreviations, decimal points, URLs) and enables sentence-level processing.
-
Self-Attention
Attention where query, key, and value vectors come from the same input sequence; enables capturing dependencies within a sequence.
-
Segment
A segment is an immutable, self-contained unit of a Lucene index. New documents are written to new segments; segments are periodically merged to keep the index efficient.
-
ROUGE
Recall-Oriented Understudy for Gisting Evaluation; n-gram overlap metric for summarisation and paraphrase evaluation.
-
Roaring Bitmap
Compressed bitmap enabling fast set operations (AND, OR); space-efficient for storing large sparse sets of integers.
-
RLHF
Reinforcement Learning from Human Feedback; uses human preference comparisons to fine-tune language models for safety and alignment.
-
Retrieval-Augmented Generation
Grounding language model generation in retrieved external documents; reduces hallucination and enables knowledge updates without retraining.
-
Reranker
Second-stage model re-scoring a candidate set retrieved by first-stage retrieval; improves ranking quality at modest computational cost.
-
Relevance Judgement
Human annotation of query-document relevance; provides ground truth for IR system evaluation.
-
Recall
Fraction of relevant documents that are retrieved; measures completeness of retrieval; high recall indicates few false negatives.
-
Range Query
Matches terms or values within a numeric, date, or lexicographic range. Enables filtering by boundaries.
-
Query-time Analysis
Query-time analysis is the analysis chain applied to the query string before it is matched against the index. It must produce terms compatible with those generated at index time.
-
Query Parser
Interprets a query string and produces a structured query object for execution. Essential bridge between user input and query engine.
-
Proximity Query
Matches terms within a maximum token distance of each other, ignoring order. Useful for finding related terms that needn’t be phrases.
-
Prompt Engineering
Crafting input text to elicit desired behaviour from language models without retraining; critical skill for modern LLMs.
-
Product Quantisation
Vector compression technique that splits high-dimensional vectors into subvectors and quantises each with its own codebook, enabling memory-efficient ANN search. Abbreviated PQ.
-
Probabilistic Retrieval Model
Probabilistic retrieval models rank documents by their estimated probability of relevance to a query. BM25 is the most successful probabilistic retrieval model; language models offer an alternative probabilistic framework.
-
Prefix Query
Matches all terms beginning with a given string. Efficient with prefix-sorted indices or trie structures. Common in autocomplete.
-
Precision
Fraction of retrieved documents that are relevant; measures quality of retrieved set; high precision indicates few false positives.
-
Precision at k
Precision measured over top k results; practical metric reflecting user experience when viewing limited results.
-
Postings List
A postings list is the ordered sequence of postings for a single term in an inverted index — the list of all documents containing that term, with optional frequencies and positions.
-
Posting
A posting is a single record in an inverted index, linking a term to one document in which it appears — optionally including term frequency and token positions.
-
Positional Index
A positional index extends the inverted index by storing the token position of each term occurrence within a document, enabling phrase queries and proximity queries.
-
Positional Encoding
Injecting token position information into transformer inputs; allows model to distinguish between tokens based on sequence order.
-
Pointwise Mutual Information
Log ratio of two terms' joint probability to the product of their independent probabilities; measures how much more often terms co-occur than expected by chance. Abbreviated PMI.
-
Phrase Query
Matches tokens in exact sequence within a configurable slop distance. Enables “quoted phrase” search and loose phrase matching.
-
Perplexity
Exponentiated negative average log probability; measures how well a language model predicts a sample. Lower is better.
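A minimal sketch of the computation from per-token probabilities (the function name is illustrative):

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the negative mean log-probability.

    token_probs: the model-assigned probability of each observed token.
    """
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)

# A model that assigns every token probability 0.25 has perplexity 4:
# it is as uncertain as a uniform choice among 4 tokens.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0
```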
-
Part-of-Speech Tagging
Assigning grammatical roles (noun, verb, adjective, etc.) to tokens in text; fundamental for syntax analysis and downstream NLP tasks.
-
Overlap Coefficient
Set similarity metric: intersection / smaller set size. Asymmetric variant of Jaccard emphasising subset containment.
-
Okapi BM25
Okapi BM25 is the original formulation of BM25, developed at City University London on the Okapi IR system in the early 1990s. The name ‘Okapi BM25’ honours the system; in practice it is synonymous with BM25.
-
Needleman-Wunsch
Global sequence alignment algorithm computing optimal alignment across entire sequences. Foundation for sequence comparison in bioinformatics.
-
NDCG
Normalised Discounted Cumulative Gain; ranking quality metric that discounts relevance gains by rank position and normalises against the ideal ordering.
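A minimal sketch of the computation, assuming graded relevance labels for the ranked results (helper names are illustrative):

```python
import math

def dcg(gains):
    """Discounted cumulative gain: each gain divided by log2(rank + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, 1))

def ndcg(gains, k=None):
    """DCG of the observed ranking divided by the DCG of the ideal one."""
    k = k or len(gains)
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal else 0.0

# Graded relevance labels of the top 4 results, in ranked order:
print(round(ndcg([3, 2, 0, 1]), 3))  # 0.985, a near-ideal ordering
```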
-
Named Entity Recognition
Identifying and classifying named entities (persons, locations, organisations) in text; fundamental NLP task for information extraction.
-
N-Gram Language Model
Language model estimating token probabilities from observed n-gram counts; foundation of statistical NLP before neural methods.
-
Multi-Head Attention
Multiple parallel attention mechanisms operating on different subspaces; enables learning diverse interaction patterns simultaneously.
-
More Like This
Finds documents similar to a given document using term statistics and relevance scoring. Basis for recommendation and related-document features.
-
Minimum Should Match
Specifies how many optional (SHOULD) clauses must match in a boolean query. Controls recall-precision tradeoff for OR queries.
-
MinHash
Probabilistic set similarity estimation via minimal hash values. Enables fast approximate Jaccard similarity in streaming or large-scale settings.
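A toy sketch using Python's built-in hash with random salts standing in for independent hash functions (real implementations use proper hash families; built-in string hashing is only stable within one process):

```python
import random

def minhash_signature(items, num_hashes=64, seed=42):
    """One minimum per salted hash; P(minima agree) equals Jaccard similarity."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    return [min(hash((salt, x)) for x in items) for salt in salts]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of agreeing signature positions estimates Jaccard."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature({"data", "mining", "text"})
b = minhash_signature({"data", "mining", "search"})
print(estimate_jaccard(a, b))  # close to 0.5, the true Jaccard of the sets
```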
-
Merge Policy
A merge policy defines the rules governing when and how Lucene index segments are merged. Merging controls the tradeoff between indexing throughput, search performance, and disk usage.
-
Mean Reciprocal Rank
Mean over queries of the reciprocal rank of the first relevant result; measures how quickly a system finds its first answer. Abbreviated MRR.
-
Mean Average Precision
Mean of average precision scores across queries; standard evaluation metric balancing precision and ranking quality. Abbreviated MAP.
-
Matryoshka Representation Learning
Training method where prefixes of a vector are also useful embeddings; enables efficient storage and search at multiple granularities. Abbreviated MRL.
-
Masked Language Model
Predicts randomly masked tokens from context; primary pre-training objective for bidirectional encoders like BERT.
-
Longest Common Substring
Longest contiguous character sequence common to two strings. Useful for plagiarism detection and similarity measurement.
-
Longest Common Subsequence
Longest sequence of characters common to two strings in order (not necessarily contiguous). Foundation for sequence alignment and diff algorithms.
-
Locality-Sensitive Hashing
Hashing technique mapping similar items to the same bucket. Enables sublinear approximate nearest-neighbour search. Abbreviated LSH.
-
Levenshtein Distance
Edit distance allowing insertions, deletions, and substitutions. Canonical metric for string similarity and typo tolerance.
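The classic dynamic-programming computation, as a minimal sketch:

```python
def levenshtein(a: str, b: str) -> int:
    """DP over a (len(a)+1) x (len(b)+1) cost table, keeping only one row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```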
-
Learning to Rank
Learning to rank (LTR) trains a model to produce an optimal ordering of documents for a query using labelled relevance data, combining signals such as BM25, click-through rate, and document features.
-
Language Model
Probability distribution over sequences of tokens; predicts next token given context. Foundation of NLP from n-grams to large language models.
-
Jaro-Winkler Similarity
Jaro similarity with prefix bonus for matching initial characters. Improves accuracy for name and record matching.
-
Jaro Similarity
String similarity metric for short strings based on matching characters and transpositions. Commonly used in record linkage and data quality.
-
Jaccard Similarity
Set overlap metric: intersection / union. Measures similarity of sets without regard to order or duplicates.
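A one-function sketch over Python sets:

```python
def jaccard(a: set, b: set) -> float:
    """|A intersect B| / |A union B|; defined here as 0 for two empty sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

print(jaccard({"red", "green", "blue"}, {"green", "blue", "yellow"}))  # 0.5
```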
-
IVF Index
Inverted File index; partitions high-dimensional vector space into Voronoi cells for scalable approximate nearest-neighbour search.
-
Inverse Document Frequency
Inverse document frequency (IDF) is a log-scaled measure of how rare a term is across a corpus. Rare terms receive high IDF weights; common terms receive low weights, making IDF a natural filter for uninformative vocabulary.
-
Instruction Tuning
Fine-tuning language models on diverse (instruction, response) pairs to improve generalisation and the ability to follow natural language instructions.
-
Index-time Analysis
Index-time analysis is the analysis chain applied to document text when it is ingested into the index. The terms it produces are what get stored in the inverted index and matched against at search time.
-
Hybrid Search
Combining dense vector similarity and sparse term-matching scores to balance semantic understanding with keyword precision.
-
HNSW
Hierarchical Navigable Small World; state-of-the-art graph-based approximate nearest-neighbour index balancing speed and recall.
-
Heaps' Law
Vocabulary grows sub-linearly with corpus size; predicts vocabulary size V from token count n via the power law V = K·n^β, with β typically between 0.4 and 0.6.
-
Hapax Legomenon
Term occurring exactly once in a corpus; indicates vocabulary richness and poses challenges for language models and IR systems.
-
Hamming Distance
Number of positions at which two equal-length strings differ. Efficient metric for fixed-length codes and binary data.
-
Hallucination
Generating plausible-sounding but factually incorrect content; a key limitation of language models, especially on knowledge-intensive tasks.
-
Grounding
Connecting model outputs to verifiable external sources; reduces hallucination by anchoring generation in retrieved facts or documents.
-
GPT
Generative Pre-trained Transformer; autoregressive decoder-only model for text generation and language understanding, published by OpenAI from 2018 onwards.
-
GloVe
Global Vectors for Word Representation; combines matrix factorisation of word co-occurrence statistics with local context windows for learning embeddings.
-
Fuzzy Query
Matches terms within a specified edit distance threshold, tolerating typos and misspellings. Typically uses Levenshtein distance.
-
Forward Index
A forward index maps each document to the list of terms it contains. It is the natural output of document ingestion and the starting point for building an inverted index.
-
FM-Index
Full-text index based on Burrows-Wheeler Transform; enables pattern matching and compressed storage simultaneously.
-
Finite State Transducer
Trie-like automaton for compressed term dictionaries and morphological analysis; maps input strings to outputs (e.g., word to ID).
-
Fine-Tuning
Adapting a pre-trained model to a downstream task by training on task-specific data; standard approach in modern NLP.
-
Field Type
A field type is a named schema definition that specifies how a field’s values are stored, indexed, and analysed. It bundles an analyzer, storage options, and index behaviour into a reusable configuration.
-
Few-Shot Learning
Model generalises from a small number of prompt examples without explicit retraining; enabled by scale in large language models.
-
fastText
Word embedding method using character n-grams to handle out-of-vocabulary words and morphological variants; published by Bojanowski et al. in 2017.
-
FAISS
Facebook AI Similarity Search; open-source library implementing multiple approximate nearest-neighbour indexes for efficient similarity search at scale.
-
Edit Distance
Minimum number of single-character operations (insertions, deletions, substitutions) to transform one string into another. Foundation for similarity metrics.
-
Dot Product Similarity
Inner product of two vectors; equivalent to cosine similarity when vectors are unit-normalised; fast to compute in dense retrieval.
-
Domain-Specific Stop Words
Stop words for a particular field or domain; words that are frequent in the domain but carry little discriminative information (e.g., “paper” in academic text).
-
DocValues
DocValues is a column-oriented on-disk data structure in Lucene that stores field values per document, enabling efficient sorting, faceting, and aggregations without loading the entire index into memory.
-
Document Frequency
Document frequency (DF) is the number of documents in a corpus that contain a given term. It is the denominator in IDF and signals how common or rare a term is across the collection.
-
Dice Coefficient
Set similarity metric: twice the intersection size divided by the sum of both set sizes. Monotonically related to Jaccard but weights shared elements more heavily.
-
Dependency Parsing
Analysing grammatical structure by identifying directed dependency relations between tokens; output is a dependency tree.
-
Dense Retrieval
Retrieval method using nearest-neighbour search over dense embedding vectors; contrasts with inverted-index sparse retrieval like BM25.
-
Damerau-Levenshtein Distance
Edit distance including transpositions (swapping adjacent characters). Captures more common typos than Levenshtein alone.
-
Cross-Entropy
Cross-entropy measures the average number of bits needed to encode samples from a true distribution using a model distribution. It is the standard training loss for language models and the basis of perplexity.
-
Cross-Encoder
Neural architecture jointly encoding query-document pairs for accurate relevance scoring; used for reranking retrieved candidates from first-stage retrieval.
-
Cosine Similarity
Vector similarity metric: dot product / product of magnitudes. Standard measure for dense and sparse vector comparison in IR.
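A minimal pure-Python sketch (production systems use vectorised libraries):

```python
import math

def cosine_similarity(u, v):
    """Dot product divided by the product of the vector magnitudes."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

print(cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))  # 1.0 (same direction)
```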
-
Corpus Annotation
Adding linguistic labels to corpus text (POS tags, NER tags, dependencies, etc.); creates training data for supervised NLP tasks.
-
Coreference Resolution
Linking mentions of the same entity across a document; resolves pronouns and nominal references to their antecedents.
-
Context Window
Maximum number of tokens a language model can process in one pass; determines how much context the model sees. Typical values range from 512 to 128k tokens.
-
Commit
A commit makes indexed documents durable by flushing buffered index data to disk and writing a new segment commit point. Solr distinguishes hard commits (durable) from soft commits (visible but not durable); Elasticsearch similarly separates durable flushes from lightweight refreshes.
-
Collocation
Statistically significant co-occurrence of words (e.g. “strong tea”, “black coffee”); indicates meaningful phrases beyond random chance.
-
ColBERT
Contextualized Late Interaction over BERT; late-interaction ranking using per-token embeddings with MaxSim scoring for efficient dense retrieval.
-
Co-occurrence Matrix
Counts how often term pairs appear together in context; captures semantic relationships and enables embedding learning via matrix factorisation.
-
Cloze Task
Predicting masked tokens from context; unsupervised pre-training objective where random words are hidden and must be inferred.
-
Chunking
Grouping tokens into phrases or chunks; shallow syntactic analysis that segments noun phrases, verb phrases, and prepositional phrases.
-
Chunking Strategy
How documents are split into passages for indexing and retrieval in RAG systems; balance between granularity and context preservation.
-
Causal Language Model
Predicts next token from previous tokens; autoregressive objective for generative models like GPT, enabling text generation.
-
Case Folding
Case folding is locale-aware lowercasing that correctly handles languages where simple ASCII lowercasing produces wrong results — such as the Turkish dotted and dotless i or the German sharp s (ß).
-
Burrows-Wheeler Transform
Reversible permutation clustering similar contexts; makes text more compressible and enables FM-index for full-text search.
-
Boosting
Adjusts the relevance score contribution of a field, term, or query clause, multiplying base scores to prioritise matches. Essential for ranking tuning.
-
Boolean Retrieval
Boolean retrieval matches documents using AND, OR, and NOT operators applied to inverted index postings lists. It returns an exact set — all matching documents, unranked — rather than a ranked list.
-
BM25F
BM25F extends BM25 to multi-field documents by weighting each field separately before combining, so title matches can outweigh body matches without simply multiplying the final score.
-
BM25+
BM25+ fixes an edge-case bug in BM25 where long documents containing a rare query term can score lower than shorter documents that don’t contain it at all, by adding a small constant lower-bound to the TF contribution.
-
Bloom Filter
Probabilistic set membership test; extremely space-efficient with no false negatives but a small false-positive rate.
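A toy sketch using salted SHA-256 digests as the k hash functions (class and parameter names are illustrative):

```python
import hashlib

class BloomFilter:
    """Bit array plus k hash functions; lookups may yield false positives
    but never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a big integer used as the bit array

    def _positions(self, item: str):
        # Salting the digest input simulates k independent hash functions.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

bf = BloomFilter()
bf.add("retrieval")
print("retrieval" in bf)  # True
print("ranking" in bf)    # False (with high probability)
```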
-
BLEU Score
Bilingual Evaluation Understudy; n-gram overlap metric for machine translation evaluation. Published by Papineni et al., 2002.
-
Bi-Encoder
Neural architecture encoding query and document independently into separate embeddings, enabling fast retrieval via approximate nearest-neighbour search.
-
BERTScore
Semantic similarity using contextual BERT embeddings; measures meaning-level matching rather than surface-level n-gram overlap.
-
BERT
Bidirectional Encoder Representations from Transformers; bidirectional transformer pre-trained with masked language modelling, foundational for NLP tasks.
-
Attention Mechanism
Weighted aggregation of context vectors, allowing models to focus on relevant information. Fundamental to transformers and modern NLP.
-
Approximate Nearest Neighbour
Fast nearest-neighbour search algorithm sacrificing exactness for speed; enables practical dense retrieval at scale. Abbreviated ANN.
-
Analyzer
An analyzer is a named, reusable analysis chain configuration in Solr, Elasticsearch, or OpenSearch — combining a tokeniser and token filters into a unit that can be assigned to fields.
-
Analysis Chain
An analysis chain is the ordered pipeline of tokeniser and token filters that transforms raw text into index terms. The same chain (or a compatible one) must be applied at both index time and query time.
-
Aho-Corasick
Multi-pattern string matching algorithm running in O(n + m + z) time, where n is the text length, m the total pattern length, and z the number of matches; enables efficient synonym/keyword highlighting and entity tagging.
-
Beider-Morse Phonetic Matching
Beider-Morse Phonetic Matching (BMPM) is a rule-based phonetic algorithm designed for Jewish surnames, applying language-specific phonological rules to match names across Yiddish, Hebrew, Russian, Polish, German, and other languages.
-
Cologne Phonetics
Cologne Phonetics (Kölner Phonetik) is a German phonetic algorithm that maps names to numeric codes, enabling phonetic matching across German spelling variations that Soundex cannot handle.
-
Daitch-Mokotoff Soundex
A Soundex variant developed for Slavic and Yiddish surnames that produces a six-digit numeric code and can return multiple codes per name to handle ambiguous digraph pronunciations.
-
Match Rating Approach
The Match Rating Approach encodes a name into a codex and then compares two codices using a defined similarity rating, returning a boolean match decision rather than leaving comparison to the caller.
-
NYSIIS
NYSIIS (New York State Identification and Intelligence System) is a phonetic encoding algorithm developed in 1970 that maps names to letter-based codes, producing more accurate matches for North American names than Soundex.
-
Caverphone
Caverphone is a phonetic encoding algorithm designed for New Zealand English names, producing a 10-character code to match name variants across historical records.
-
Double Metaphone
Double Metaphone extends the original Metaphone algorithm by producing two phonetic codes per word — a primary and a secondary — to handle pronunciation ambiguity and non-English name patterns.
-
Metaphone
Metaphone encodes an English word into a variable-length string of consonant sounds, applying context-sensitive phonological rules that allow names with different spellings but similar pronunciations to match.
-
Metaphone 3
Metaphone 3 is a commercial phonetic algorithm by Lawrence Philips that extends Double Metaphone with a substantially larger rule set, claiming around 98% accuracy on English and European names.
-
Phonetic Encoding
Phonetic encoding maps a word to a compact code that represents its pronunciation, so that words which sound alike but are spelled differently produce the same code and match one another.
-
Soundex
Soundex maps a name to a four-character code — one letter plus three digits — so that names with similar pronunciations but different spellings produce the same code and match one another.
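A compact sketch of American Soundex, assuming non-empty alphabetic input:

```python
def soundex(name: str) -> str:
    """American Soundex: first letter plus three digits from consonant classes."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    name = name.upper()
    result = name[0]
    last_code = codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != last_code:
            result += code
        if ch not in "HW":  # H and W do not reset the previous code
            last_code = code
    return (result + "000")[:4]  # pad or truncate to letter + three digits

print(soundex("Robert"), soundex("Rupert"))  # R163 R163
print(soundex("Ashcraft"))                   # A261 (the H/W rule in action)
```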
-
Decompounding
Decompounding splits compound words — common in German, Dutch, and Scandinavian languages — into their component tokens so that searches for constituents match the full compound at index and query time.
-
Hunspell
Hunspell is a dictionary-based morphological analyser and spell checker that produces lemmas by stripping affixes and looking up base forms in a language-specific dictionary.
-
Inflection
Inflection is the morphological process by which a single lexeme takes on different surface forms to express grammatical categories such as tense, number, and case — the variation that lemmatisation is designed to undo.
-
Morphological Analysis
Morphological analysis decomposes words into their constituent morphemes — stems, prefixes, suffixes, and inflectional endings — enabling NLP systems to recognise that surface-form variants refer to the same underlying concept.
-
Suffix
A suffix is a bound morpheme appended to the right end of a word stem, encoding grammatical properties or creating new words — and the primary target of every English stemming algorithm.
-
Lancaster Stemmer
The Lancaster Stemmer is an alternative name for the Paice/Husk Stemmer — an aggressive, iterative English stemming algorithm developed at Lancaster University.
-
Lovins Stemmer
The Lovins Stemmer is the earliest published stemming algorithm (1968), reducing English words to stems in a single pass by stripping the longest matching suffix from a table of 294 rules.
-
Paice/Husk Stemmer
The Paice/Husk Stemmer is an iterative English stemmer using a single compact rule table with a loop-back architecture, producing aggressively short stems at the cost of over-stemming.
-
ASCII Folding
ASCII folding maps accented and special characters to their closest ASCII equivalents using a lookup table, improving recall for users who omit diacritics at the cost of collapsing distinctions that may be semantically meaningful.
-
Decimal Digit Filter
A decimal digit filter maps Unicode decimal digit characters from any script to their ASCII 0–9 equivalents, ensuring that numbers written in Eastern Arabic, Devanagari, Thai, and other numeral systems match the same query regardless of which digit form was used.
-
Elision Filter
An elision filter is a token filter that strips language-specific clitic prefixes — such as French l’ and d’ — from the start of tokens, leaving the bare stem for indexing and matching.
-
HTML Strip
HTML stripping is a character-level preprocessing stage that removes markup tags and decodes HTML entities from raw text before it reaches the tokeniser, preventing angle brackets and entity sequences from appearing as index terms.
-
KStem
KStem is a conservative English stemmer that combines suffix-stripping with a built-in lexicon to avoid false conflations, producing cleaner stems than Porter2 at the cost of a dictionary dependency.
-
Length Filter
A length filter is a token filter in an analysis chain that discards any token whose character length falls outside a configured minimum and maximum bound, removing noise tokens produced by tokenisation or upstream rewriting.
-
Lowercasing
Lowercasing converts every character in a string to its lowercase form, eliminating case variation so that ‘HTTP’, ‘Http’, and ‘http’ map to a single index term.
-
Normalisation
Normalisation transforms raw text into a consistent, canonical form — lowercasing, accent stripping, Unicode standardisation — so that surface variants of the same term map to a single index entry.
-
Pattern Replace Filter
A pattern replace filter applies a regular expression substitution to each token in an analysis chain, rewriting token text in place without changing token boundaries — distinct from a pattern tokeniser, which splits the raw character stream.
-
Porter Stemmer
The Porter Stemmer is a rule-based English suffix-stripping algorithm that reduces words to a stem using five sequential transformation passes gated by a vowel-consonant measure.
-
Porter2 Stemmer
Porter2 is a revised English suffix-stripping algorithm from the Snowball project that fixes around 200 mis-stemmings in the original Porter Stemmer and is the default stemmer in Elasticsearch’s english analyser.
-
Stop Word
A stop word is a high-frequency function word — such as the, is, or at — removed from a token stream during analysis to reduce index noise and improve retrieval efficiency.
-
Stop Word Filter
A stop word filter is a token filter in an analysis chain that removes stop words from the token stream at index time and query time, reducing index size and suppressing high-frequency noise terms.
-
Trim Filter
A trim filter is a token filter that strips leading and trailing whitespace characters from each token in the analysis stream, leaving the token’s interior content unchanged.
-
Unicode Normalisation
Unicode normalisation resolves the fact that a single visible character can be encoded multiple ways, standardising text to one of four forms — NFC, NFD, NFKC, or NFKD — before comparison, indexing, or hashing.
-
CJK Tokeniser
A CJK tokeniser segments Chinese, Japanese, and Korean text into tokens by splitting at every character or by applying a dictionary and statistical model to identify word boundaries.
-
Thai Tokeniser
A Thai tokeniser segments Thai script into words by combining a word-boundary dictionary with statistical or ML models, since Thai is written without spaces between words.
-
Path Hierarchy Tokeniser
A path hierarchy tokeniser splits a path string into every prefix hierarchy, so that a document at /a/b/c is also findable by /a/b or /a — enabling subtree search on file paths, URL components, and category trees.
-
Query Expansion
Query expansion augments a user’s search query with synonyms, related terms, or reformulations to reduce vocabulary mismatch and improve recall against an inverted index.
-
Edge N-Gram
An edge n-gram is a prefix-anchored n-gram generated from the start of a token, used in search engines to power as-you-type autocomplete and prefix matching.
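A minimal sketch of index-time expansion (parameter names mirror common search-engine settings but are illustrative):

```python
def edge_ngrams(token: str, min_gram: int = 1, max_gram: int = 5):
    """All prefixes of the token between min_gram and max_gram characters."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

# Indexing these prefixes lets the partial query "sear" match "search".
print(edge_ngrams("search"))  # ['s', 'se', 'sea', 'sear', 'searc']
```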
-
ICU Tokeniser
The ICU tokeniser applies ICU BreakIterator rules to split text into tokens, extending UAX #29 with locale-aware dictionary segmentation for CJK and Thai and support for custom script rules.
-
Unigram Language Model Tokeniser
The Unigram LM tokeniser builds a subword vocabulary top-down: it begins with a large candidate set and iteratively prunes entries that minimise the increase in corpus log-loss, producing a probability distribution over segmentations.
-
SentencePiece
SentencePiece is a language-agnostic subword tokeniser that trains directly on raw Unicode text, encodes whitespace as the ▁ symbol, and produces a fully reversible token sequence using either BPE or Unigram LM as the underlying algorithm.
-
WordPiece
WordPiece is a subword tokenisation algorithm that builds a vocabulary by iteratively merging symbol pairs chosen to maximise training-corpus likelihood, rather than raw frequency. It is the tokeniser used in BERT and its derivatives.
-
Byte Pair Encoding
Byte pair encoding is a data-compression algorithm repurposed for NLP to build subword vocabularies by iteratively merging the most frequent adjacent symbol pair in a training corpus.
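A toy trainer showing the merge loop on a tiny corpus (illustrative, not optimised):

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Repeatedly merge the most frequent adjacent symbol pair in the corpus."""
    # Each word starts as a tuple of single-character symbols.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])  # apply the merge
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges

print(bpe_merges(["low", "lower", "lowest", "low"], num_merges=3))
# [('l', 'o'), ('lo', 'w'), ('low', 'e')]: 'low' becomes a single symbol
```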
-
Subword Tokenisation
Subword tokenisation splits words into smaller vocabulary units — fragments between characters and whole words — so a fixed vocabulary can represent any input string, including words never seen during training.
-
Unicode Tokeniser
A Unicode tokeniser splits text into tokens using Unicode character categories and the UAX #29 word-boundary rules, producing correct token boundaries across all scripts and languages.
-
Regex Tokeniser
A regex tokeniser defines token boundaries with a regular expression, either splitting on delimiter matches or extracting token matches — the generalisation underlying whitespace, punctuation, and word tokenisers.
-
Sentence Tokeniser
A sentence tokeniser splits a document into individual sentences, establishing the boundary between document-level and word-level processing — a step that is harder than it appears because full stops serve multiple roles.
-
Word Tokeniser
A word tokeniser splits text into tokens at word boundaries using rules or regular expressions, correctly handling punctuation, contractions, hyphenation, and URLs where a whitespace split would fail.
-
Punctuation Tokeniser
A punctuation tokeniser splits text on both whitespace and punctuation characters, emitting only alphabetic and numeric runs — a simple, stateless approach common in search engine analysis chains.
-
Whitespace Tokeniser
A whitespace tokeniser splits a string into tokens by breaking on space, tab, and newline characters — the simplest possible tokenisation strategy, with well-defined failure modes.
-
Lemmatisation
Lemmatisation reduces an inflected word form to its dictionary base form — its lemma — by applying morphological analysis and a lexicon lookup, producing valid words rather than truncated stems.
-
Stemming
Stemming reduces a word to a base form by stripping affixes using rule-based heuristics, allowing variant forms such as “running” and “runs” to match a single index term; irregular forms such as “ran” require lemmatisation instead.
-
F1 Score
The F1 score is the harmonic mean of precision and recall, producing a single number that balances a model’s ability to avoid false positives against its ability to avoid false negatives.
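The formula as a one-function sketch:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean: punishes imbalance between precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1_score(0.9, 0.5))  # ~0.643, well below the arithmetic mean of 0.7
```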
-
BM25
BM25 (Best Match 25) is a probabilistic ranking function that scores documents against a query by weighing term frequency and inverse document frequency with length normalisation.
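A minimal, self-contained sketch of the scoring formula, using the Lucene-style IDF variant (function and parameter names are illustrative):

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one tokenised document against a query.

    corpus: list of tokenised documents, used for IDF and average length.
    """
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_terms:
        df = sum(term in d for d in corpus)      # document frequency
        if df == 0:
            continue
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        tf = doc_terms.count(term)               # term frequency
        norm = tf + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * tf * (k1 + 1) / norm      # saturating TF contribution
    return score

docs = [["fast", "search", "engine"], ["slow", "database"], ["search", "index"]]
print(bm25_score(["search"], docs[0], docs))  # ~0.42
```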
-
TF-IDF
TF-IDF (term frequency–inverse document frequency) is a numerical statistic that reflects how important a word is to a document relative to a corpus, used as a relevance signal in search ranking.
-
Trie
A trie is a tree where each path from root to node spells out a prefix, enabling O(k) term lookup, prefix enumeration, and autocomplete — where k is the length of the query string.
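A minimal sketch supporting insertion and prefix enumeration:

```python
class TrieNode:
    __slots__ = ("children", "is_term")

    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.is_term = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word: str):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_term = True

    def starts_with(self, prefix: str):
        """Enumerate all stored terms sharing the given prefix."""
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return
            node = node.children[ch]
        stack = [(node, prefix)]
        while stack:
            n, word = stack.pop()
            if n.is_term:
                yield word
            for ch, child in n.children.items():
                stack.append((child, word + ch))

t = Trie()
for w in ["car", "card", "care", "dog"]:
    t.insert(w)
print(sorted(t.starts_with("car")))  # ['car', 'card', 'care']
```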
-
Inverted Index
An inverted index maps each unique term in a corpus to the documents — and optionally the positions — where it appears, making full-text search fast regardless of corpus size.
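A minimal sketch of index construction and a boolean AND over postings:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = ["the quick fox", "the lazy dog", "quick brown dog"]
index = build_inverted_index(docs)
print(index["quick"])                           # [0, 2]
print(set(index["quick"]) & set(index["dog"]))  # {2}: an AND query
```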
-
Shingling
Shingling represents a document as its set of overlapping n-grams (shingles), enabling near-duplicate detection via Jaccard similarity or MinHash approximations.
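A minimal sketch producing word-level shingles and comparing two near-duplicate sentences:

```python
def shingles(text: str, n: int = 3):
    """Set of overlapping word n-grams (shingles) from a document."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

a = shingles("the quick brown fox jumps over the lazy dog")
b = shingles("the quick brown fox leaps over the lazy dog")
print(len(a & b) / len(a | b))  # Jaccard similarity over shingle sets: 0.4
```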
-
Shingle
A shingle is an n-gram treated as a set element for document comparison. The term signals a shift from positional sequence analysis to set-based similarity measurement.
-
Character N-Gram
A character n-gram is a contiguous sequence of n characters extracted from a string, enabling tokenisation-free indexing, fuzzy search, language identification, and subword modelling.
-
Skip-Gram
A skip-gram is a generalisation of the n-gram that allows gaps between tokens, and also the name of the Word2Vec training objective that predicts context words from a centre word.
-
Trigram
A trigram is an n-gram of length 3 — three consecutive tokens considered as a unit. Trigrams extend bigrams with one extra token of context, improving disambiguation at the cost of sparser counts.
-
Bigram
A bigram is an n-gram of length 2 — two consecutive tokens considered as a pair. Bigram models condition each token on the one before it, capturing local order that unigram models discard.
-
Unigram
A unigram is an n-gram of length 1 — a single token considered in isolation. The unigram model treats each token as statistically independent, forming the basis of bag-of-words retrieval.
-
N-Gram
An n-gram is a contiguous sequence of n tokens drawn from a text, used to capture local word order for indexing, language modelling, and similarity.
-
Corpus
A corpus is a structured collection of text documents used to train, evaluate, or build statistics for an NLP system — the raw material from which indexes, models, and vocabularies are derived.
-
Tokenisation
Tokenisation is the process of splitting a raw text string into a sequence of discrete units — tokens — that downstream NLP components such as indexers, classifiers, and language models can operate on.
-
Token
A token is the smallest unit of text that an NLP pipeline or search engine operates on — typically a word, subword, or character produced by splitting an input string.