Scoring
-
Vector Space Model
The vector space model (VSM) represents documents and queries as vectors in a high-dimensional term space and ranks documents by their cosine similarity to the query vector.
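A minimal sketch of cosine ranking, assuming raw term counts as vector weights and whitespace tokenisation (real systems typically use TF-IDF weights and a proper analyser). The names `to_vector` and `cosine_similarity` and the sample documents are illustrative only.

```python
import math
from collections import Counter

def to_vector(text):
    """Turn text into a sparse term-count vector (assumed weighting: raw counts)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_a * norm_b)

docs = {
    "d1": "the quick brown fox",
    "d2": "the lazy dog",
    "d3": "quick quick fox",
}
query = to_vector("quick fox")

# Rank document ids by similarity to the query, best first.
ranked = sorted(
    docs,
    key=lambda d: cosine_similarity(query, to_vector(docs[d])),
    reverse=True,
)
```

Because cosine similarity normalises by vector length, a document is not rewarded merely for being long.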
-
Term Frequency
Term frequency (TF) is the count of how many times a term appears in a document. It is one of the two core signals in TF-IDF and BM25 scoring.
-
Reranker
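A reranker is a second-stage model that re-scores the candidate set returned by first-stage retrieval. Because it only sees a small number of candidates, it can use a more expensive scoring method and improve ranking quality at modest overall computational cost.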
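The two-stage flow can be sketched as follows. Everything here is hypothetical: `first_stage` uses simple term overlap, and `phrase_scorer` is only a stand-in for whatever more expensive model would be used in practice.

```python
def first_stage(query, docs, k):
    """Cheap retrieval: rank by number of shared terms and keep the top k."""
    q_terms = set(query.split())
    scored = [(len(q_terms & set(text.split())), doc_id)
              for doc_id, text in docs.items()]
    scored = [s for s in scored if s[0] > 0]      # drop non-matching documents
    scored.sort(key=lambda s: (-s[0], s[1]))      # best overlap first, then id
    return [doc_id for _, doc_id in scored[:k]]

def rerank(query, candidates, docs, scorer):
    """Second stage: re-score only the candidates with a costlier scorer."""
    return sorted(candidates,
                  key=lambda doc_id: scorer(query, docs[doc_id]),
                  reverse=True)

def phrase_scorer(query, text):
    """Hypothetical stand-in for an expensive model: rewards an exact phrase match."""
    return 1.0 if query in text else 0.0

docs = {
    "a": "fox quick and brown",
    "b": "the quick fox jumps",
    "c": "lazy dog",
}
candidates = first_stage("quick fox", docs, k=2)
reranked = rerank("quick fox", candidates, docs, phrase_scorer)
```

The first stage cannot distinguish "a" from "b" because both share two terms with the query; the second stage promotes "b" because it contains the exact phrase.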
-
Probabilistic Retrieval Model
Probabilistic retrieval models rank documents by their estimated probability of relevance to a query. BM25 is the most widely used model in this family; language-model approaches such as query likelihood offer an alternative probabilistic framework.
-
Okapi BM25
Okapi BM25 is the original formulation of BM25, developed at City University London on the Okapi IR system in the early 1990s. The name ‘Okapi BM25’ honours the system; in practice it is synonymous with BM25.
-
Learning to Rank
Learning to rank (LTR) trains a model to produce an optimal ordering of documents for a query using labelled relevance data, combining signals such as BM25, click-through rate, and document features.
-
Inverse Document Frequency
Inverse document frequency (IDF) is a log-scaled measure of how rare a term is across a corpus. Rare terms receive high IDF weights; common terms receive low weights, making IDF a natural filter for uninformative vocabulary.
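A minimal sketch of the relationship between document frequency and IDF, assuming the plain form log(N / df); many systems use smoothed variants instead. The function names and the toy corpus are illustrative.

```python
import math

def document_frequency(term, corpus):
    """Number of documents (token sets) that contain the term."""
    return sum(1 for doc in corpus if term in doc)

def idf(term, corpus):
    """Plain IDF: log(N / df). Returns 0.0 for terms absent from the corpus."""
    df = document_frequency(term, corpus)
    if df == 0:
        return 0.0
    return math.log(len(corpus) / df)

corpus = [
    {"the", "okapi", "system"},
    {"the", "ranking", "function"},
    {"the", "query", "ranking"},
    {"the", "corpus"},
]

common = idf("the", corpus)      # appears in every document
rare = idf("okapi", corpus)      # appears in one document
```

A term that occurs in every document gets an IDF of zero, so it contributes nothing to a score.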
-
Document Frequency
Document frequency (DF) is the number of documents in a corpus that contain a given term. It appears in the denominator of the IDF ratio (corpus size divided by DF) and signals how common or rare a term is across the collection.
-
Boosting
Boosting adjusts the relevance score contribution of a field, term, or query clause by multiplying its base score by a weight, so that preferred matches rank higher. It is one of the most common tools for tuning ranking.
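A minimal sketch of boosting, assuming the final score is a sum of per-clause scores and that clauses without an explicit boost use a weight of 1.0. Real engines combine clauses in different ways; `boosted_score` is a hypothetical helper.

```python
def boosted_score(clause_scores, boosts):
    """Multiply each clause's base score by its boost (default 1.0) and sum."""
    return sum(boosts.get(name, 1.0) * score
               for name, score in clause_scores.items())

base = {"title": 2.0, "body": 3.0}          # hypothetical base scores
unboosted = boosted_score(base, {})
title_boosted = boosted_score(base, {"title": 2.5})
```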
-
BM25F
BM25F extends BM25 to multi-field documents by weighting each field's term frequency and combining the fields before the saturation step is applied, so title matches can outweigh body matches without simply multiplying the final score.
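A sketch of the per-term weight in a simple BM25F variant, assuming a single `b` shared by all fields (implementations often allow one per field) and omitting the IDF factor. The function name and the field weights are illustrative.

```python
def bm25f_term_weight(field_tf, field_len, avg_field_len, weights,
                      k1=1.2, b=0.75):
    """Combine length-normalised, weighted field frequencies, then saturate."""
    pseudo_tf = 0.0
    for field, tf in field_tf.items():
        # Length normalisation is applied per field.
        norm = 1 - b + b * field_len[field] / avg_field_len[field]
        pseudo_tf += weights[field] * tf / norm
    # Saturation happens once, on the combined pseudo-frequency.
    return pseudo_tf / (k1 + pseudo_tf)

weights = {"title": 3.0, "body": 1.0}        # hypothetical field weights
lengths = {"title": 5, "body": 100}
averages = {"title": 5, "body": 100}         # fields at average length

title_hit = bm25f_term_weight({"title": 1}, lengths, averages, weights)
body_hit = bm25f_term_weight({"body": 1}, lengths, averages, weights)
```

Multiplying the result by the term's IDF gives its contribution to the document score. Because saturation is applied after the fields are combined, a heavily weighted field cannot push the term weight past 1.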
-
BM25+
BM25+ corrects a weakness in BM25's length normalisation: a very long document that contains a query term can be penalised so heavily that it scores scarcely higher than a shorter document that does not contain the term at all. BM25+ adds a small constant lower bound to the TF contribution of every matching term.
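The effect of the lower bound can be sketched on the TF component alone. The defaults `k1=1.2` and `b=0.75` and the value `delta=1.0` are commonly used settings, treated here as assumptions; `tf_component` is an illustrative name.

```python
def tf_component(tf, dl, avgdl, k1=1.2, b=0.75, delta=0.0):
    """BM25 TF component; a positive delta gives the BM25+ variant."""
    if tf == 0:
        return 0.0  # the lower bound applies only to terms that occur
    norm = k1 * (1 - b + b * dl / avgdl)
    return tf * (k1 + 1) / (tf + norm) + delta

# One occurrence in a document 1000 times longer than average.
plain = tf_component(tf=1, dl=100_000, avgdl=100)
plus = tf_component(tf=1, dl=100_000, avgdl=100, delta=1.0)
```

With plain BM25 the component shrinks towards zero as the document grows; with the lower bound it never falls below `delta` for a matching term.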
-
BM25
BM25 (Best Match 25) is a probabilistic ranking function that scores documents against a query by weighing term frequency and inverse document frequency with length normalisation.
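A minimal BM25 sketch over pre-tokenised documents. It assumes the common defaults `k1=1.2` and `b=0.75` and a smoothed IDF variant that stays non-negative; other formulations differ in these details. `bm25_scores` and the sample data are illustrative.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document (a list of tokens) against the query terms."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    df = Counter(t for d in docs for t in set(d))   # document frequency
    scores = []
    for d in docs:
        tf = Counter(d)
        # Length normalisation: long documents need more occurrences.
        norm = k1 * (1 - b + b * len(d) / avgdl)
        score = 0.0
        for t in query:
            if tf[t] == 0:
                continue
            idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1)
            score += idf * tf[t] * (k1 + 1) / (tf[t] + norm)
        scores.append(score)
    return scores

docs = [
    ["okapi", "bm25", "ranking"],
    ["ranking", "with", "vectors"],
    ["cooking", "recipes"],
]
scores = bm25_scores(["bm25", "ranking"], docs)
```

The first document matches both query terms, the second matches one, and the third matches none, so the scores fall in that order.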
-
TF-IDF
TF-IDF (term frequency–inverse document frequency) is a numerical statistic that reflects how important a word is to a document relative to a corpus, used as a relevance signal in search ranking.
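A minimal sketch of TF-IDF weighting, assuming raw counts for TF and log(N / df) for IDF; libraries often apply smoothing and normalisation on top of this. `tfidf_vectors` is an illustrative name.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: tf * idf} mapping per tokenised document."""
    n_docs = len(docs)
    df = Counter(t for d in docs for t in set(d))   # document frequency
    return [
        {t: tf * math.log(n_docs / df[t]) for t, tf in Counter(d).items()}
        for d in docs
    ]

docs = [
    ["the", "okapi", "okapi"],
    ["the", "fox"],
    ["the", "dog"],
]
vectors = tfidf_vectors(docs)
```

In the first document, "the" receives a weight of zero because it appears everywhere, while "okapi" is weighted highly because it is both frequent in the document and rare in the corpus.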