Language-Modelling
-
Probabilistic Retrieval Model
Probabilistic retrieval models rank documents by their estimated probability of relevance to a query. BM25 is the most successful probabilistic retrieval model; language models offer an alternative probabilistic framework.
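A minimal BM25 scorer in Python for illustration; the document statistics passed in are toy values, and k1/b are the conventional defaults, not tuned:

    import math

    def bm25_score(query_terms, doc_terms, doc_freqs, n_docs, avg_doc_len,
                   k1=1.5, b=0.75):
        # Score one document against a query with the BM25 formula.
        score = 0.0
        for term in set(query_terms):
            f = doc_terms.count(term)        # term frequency in this document
            df = doc_freqs.get(term, 0)      # number of documents containing the term
            if f == 0:
                continue
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            # Saturating term frequency, normalised by document length.
            norm = f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_terms) / avg_doc_len))
            score += idf * norm
        return score

    bm25_score(["cat"], "the cat sat on the mat".split(),
               doc_freqs={"cat": 12}, n_docs=1000, avg_doc_len=8.0)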
-
Cross-Entropy
Cross-entropy measures the average number of bits needed to encode samples from a true distribution using a model distribution. It is the standard training loss for language models and the basis of perplexity.
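In symbols, H(p, q) = -Σ p(x) log2 q(x), summed over outcomes x. A direct computation in Python with toy distributions:

    import math

    def cross_entropy_bits(p, q):
        # Average bits to encode samples from p using a code optimised for q.
        return -sum(px * math.log2(q[x]) for x, px in p.items() if px > 0)

    p = {"a": 0.5, "b": 0.5}       # true distribution
    q = {"a": 0.9, "b": 0.1}       # model distribution
    h = cross_entropy_bits(p, q)   # ≈ 1.74 bits; perplexity is 2**h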
-
Thai Tokeniser
A Thai tokeniser segments Thai script into words by combining a dictionary of known words with statistical or machine-learning models, since Thai is written without spaces between words.
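A sketch of the dictionary half of the idea, greedy maximal matching, with a toy Latin-script lexicon standing in for a real Thai dictionary; real tokenisers add a statistical model to resolve the ambiguities that greediness gets wrong:

    def longest_match_segment(text, lexicon):
        # Greedy maximal matching: take the longest dictionary word at each
        # position, falling back to a single character on a miss.
        tokens, i = [], 0
        while i < len(text):
            match = next((text[i:j] for j in range(len(text), i, -1)
                          if text[i:j] in lexicon), text[i])
            tokens.append(match)
            i += len(match)
        return tokens

    lexicon = {"this", "thisis", "is", "a", "test"}
    longest_match_segment("thisisatest", lexicon)
    # ['thisis', 'a', 'test'] -- greediness picks 'thisis' over 'this is'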
-
Unigram Language Model Tokeniser
The Unigram LM tokeniser builds a subword vocabulary top-down: it starts from a large candidate set and iteratively prunes the entries whose removal least increases the corpus log-loss, leaving a vocabulary that defines a probability distribution over segmentations.
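Once trained, the model segments by maximising the sum of token log-probabilities. A Viterbi sketch over a hypothetical vocabulary with toy log-probabilities:

    import math

    def viterbi_segment(text, logprob):
        # best[i] holds (score, pieces) for the best segmentation of text[:i].
        best = [(0.0, [])] + [(-math.inf, None)] * len(text)
        for i in range(1, len(text) + 1):
            for j in range(i):
                piece = text[j:i]
                if piece in logprob and best[j][0] + logprob[piece] > best[i][0]:
                    best[i] = (best[j][0] + logprob[piece], best[j][1] + [piece])
        return best[-1]

    vocab = {"un": -2.0, "happy": -3.0, "unhappy": -6.0,
             "u": -8.0, "n": -8.0, "h": -8.0, "a": -8.0, "p": -8.0, "y": -8.0}
    viterbi_segment("unhappy", vocab)   # (-5.0, ['un', 'happy'])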
-
SentencePiece
SentencePiece is a language-agnostic subword tokeniser that trains directly on raw Unicode text, encodes whitespace as the ▁ symbol, and produces a fully reversible token sequence using either BPE or Unigram LM as the underlying algorithm.
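Typical usage with the sentencepiece Python package; the file names are placeholders, and the example pieces are indicative since they depend on the trained model:

    import sentencepiece as spm

    # Train a unigram-LM model directly on a raw text file.
    spm.SentencePieceTrainer.train(input="corpus.txt", model_prefix="m",
                                   vocab_size=8000, model_type="unigram")

    sp = spm.SentencePieceProcessor(model_file="m.model")
    pieces = sp.encode("Hello world", out_type=str)   # e.g. ['▁Hello', '▁world']
    sp.decode(pieces)                                 # 'Hello world' -- reversible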
-
WordPiece
WordPiece is a subword tokenisation algorithm that builds a vocabulary by iteratively merging symbol pairs chosen to maximise training-corpus likelihood, rather than raw frequency. It is the tokeniser used in BERT and its derivatives.
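The likelihood criterion reduces to a simple pair score: merge the pair that maximises count(ab) / (count(a) * count(b)), rather than raw count(ab). A sketch of that selection step with toy counts:

    def best_wordpiece_merge(pair_counts, unit_counts):
        # WordPiece merges the pair that most improves corpus likelihood:
        # score(a, b) = count(ab) / (count(a) * count(b)).
        return max(pair_counts, key=lambda p: pair_counts[p]
                   / (unit_counts[p[0]] * unit_counts[p[1]]))

    pairs = {("u", "n"): 20, ("n", "h"): 5}
    units = {"u": 100, "n": 40, "h": 5}
    best_wordpiece_merge(pairs, units)
    # ('n', 'h'): 5/(40*5) = 0.025 beats 20/(100*40) = 0.005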
-
Byte Pair Encoding
Byte pair encoding is a data-compression algorithm repurposed for NLP to build subword vocabularies by iteratively merging the most frequent adjacent symbol pair in a training corpus.
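A compact version of the training loop, after Sennrich et al.'s reference implementation; words are stored as space-separated symbols, weighted by corpus frequency:

    import re
    from collections import Counter

    def pair_counts(vocab):
        # Count adjacent symbol pairs across the weighted vocabulary.
        counts = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for pair in zip(symbols, symbols[1:]):
                counts[pair] += freq
        return counts

    def merge(pair, vocab):
        # Rewrite every occurrence of the pair as one concatenated symbol.
        pat = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
        return {pat.sub("".join(pair), w): f for w, f in vocab.items()}

    vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
    for _ in range(3):
        counts = pair_counts(vocab)
        vocab = merge(max(counts, key=counts.get), vocab)
    # The first merges produce 'es' and then 'est' (ties broken by
    # first occurrence), building subwords up from characters.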
-
Subword Tokenisation
Subword tokenisation splits words into smaller vocabulary units, sized between single characters and whole words, so a fixed vocabulary can represent any input string, including words never seen during training.
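At inference time a common scheme is greedy longest-match-first against the learned vocabulary, with a marker such as '##' on word-internal pieces; the vocabulary here is hypothetical:

    def segment(word, vocab, unk="[UNK]"):
        # Take the longest matching vocabulary piece at each position.
        pieces, start = [], 0
        while start < len(word):
            end = len(word)
            while end > start:
                piece = word[start:end] if start == 0 else "##" + word[start:end]
                if piece in vocab:
                    pieces.append(piece)
                    break
                end -= 1
            else:
                return [unk]   # no piece matched at this position
            start = end
        return pieces

    vocab = {"un", "##believ", "##able"}
    segment("unbelievable", vocab)   # ['un', '##believ', '##able']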
-
Sentence Tokeniser
A sentence tokeniser splits a document into individual sentences, establishing the boundary between document-level and word-level processing — a step that is harder than it appears because full stops serve multiple roles.
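A naive regex splitter shows the difficulty: abbreviations are exactly what it gets wrong, which is why production splitters such as NLTK's punkt learn abbreviation lists from data:

    import re

    # Naive rule: split on whitespace that follows ., ! or ? and
    # precedes a capital letter.
    naive = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

    naive.split("Dr. Smith arrived at 3 p.m. He left early.")
    # ['Dr.', 'Smith arrived at 3 p.m.', 'He left early.']
    # 'Dr.' is wrongly cut off; 'p.m. He' happens to be split correctly.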
-
Lemmatisation
Lemmatisation reduces an inflected word form to its dictionary base form — its lemma — by applying morphological analysis and a lexicon lookup, producing valid words rather than truncated stems.
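Example with NLTK's WordNet-backed lemmatiser, assuming the wordnet corpus has been downloaded via nltk.download:

    from nltk.stem import WordNetLemmatizer

    lem = WordNetLemmatizer()
    lem.lemmatize("mice")              # 'mouse' (noun is the default POS)
    lem.lemmatize("running", pos="v")  # 'run'
    lem.lemmatize("better", pos="a")   # 'good' -- lexicon lookup, not truncation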
-
Character N-Gram
A character n-gram is a contiguous sequence of n characters extracted from a string, enabling tokenisation-free indexing, fuzzy search, language identification, and subword modelling.
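Extraction is a one-line window over the string; trigram-set overlap then gives a cheap fuzzy similarity:

    def char_ngrams(s, n=3):
        # Every contiguous length-n slice of the string.
        return {s[i:i + n] for i in range(len(s) - n + 1)}

    def jaccard(a, b, n=3):
        x, y = char_ngrams(a, n), char_ngrams(b, n)
        return len(x & y) / len(x | y)

    char_ngrams("retrieval")          # {'ret', 'etr', 'tri', 'rie', ...}
    jaccard("retrieval", "retreival") # ≈ 0.27, non-zero despite the typo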
-
Skip-Gram
A skip-gram is a generalisation of the n-gram that allows gaps between tokens, and also the name of the Word2Vec training objective that predicts context words from a centre word.
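A sketch of the first sense, k-skip bigrams: token pairs with up to k tokens skipped between them:

    def k_skip_bigrams(tokens, k):
        # Ordered pairs with at most k intervening tokens.
        return [(tokens[i], tokens[j])
                for i in range(len(tokens))
                for j in range(i + 1, min(i + 2 + k, len(tokens)))]

    k_skip_bigrams(["the", "cat", "sat", "down"], k=1)
    # [('the','cat'), ('the','sat'), ('cat','sat'), ('cat','down'), ('sat','down')]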
-
Trigram
A trigram is an n-gram of length 3 — three consecutive tokens considered as a unit. Trigrams extend bigrams with one extra token of context, improving disambiguation at the cost of sparser counts.
-
Bigram
A bigram is an n-gram of length 2 — two consecutive tokens considered as a pair. Bigram models condition each token on the one before it, capturing local order that unigram models discard.
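A maximum-likelihood bigram model is just two rounds of counting: P(w2 | w1) = count(w1 w2) / count(w1):

    from collections import Counter

    def bigram_probs(tokens):
        unigrams = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))
        return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

    probs = bigram_probs("the cat sat on the mat".split())
    probs[("the", "cat")]   # 0.5 -- 'the' is followed by 'cat' in 1 of its 2 uses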
-
Unigram
A unigram is an n-gram of length 1 — a single token considered in isolation. The unigram model treats each token as statistically independent, forming the basis of bag-of-words retrieval.
-
N-Gram
An n-gram is a contiguous sequence of n tokens drawn from a text, used to capture local word order for indexing, language modelling, and similarity.
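One sliding window covers the unigram, bigram, and trigram cases above:

    def ngrams(tokens, n):
        # Slide a width-n window over the token sequence.
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    words = "to be or not to be".split()
    ngrams(words, 1)   # [('to',), ('be',), ('or',), ...]
    ngrams(words, 2)   # [('to', 'be'), ('be', 'or'), ...]
    ngrams(words, 3)   # [('to', 'be', 'or'), ('be', 'or', 'not'), ...]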
-
Corpus
A corpus is a structured collection of text documents used to train, evaluate, or build statistics for an NLP system — the raw material from which indexes, models, and vocabularies are derived.
-
Tokenisation
Tokenisation is the process of splitting a raw text string into a sequence of discrete units — tokens — that downstream NLP components such as indexers, classifiers, and language models can operate on.
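Even a minimal regex tokeniser embodies real design choices, for instance how to treat punctuation and contractions:

    import re

    def tokenise(text):
        # Words as \w+ runs; each punctuation mark as its own token.
        return re.findall(r"\w+|[^\w\s]", text)

    tokenise("Don't panic, Mr. Adams!")
    # ['Don', "'", 't', 'panic', ',', 'Mr', '.', 'Adams', '!']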
-
Token
A token is the smallest unit of text that an NLP pipeline or search engine operates on — typically a word, subword, or character produced by splitting an input string.