Similarity
-
Vector Space Model
The vector space model (VSM) represents documents and queries as vectors in a high-dimensional term space and ranks documents by their cosine similarity to the query vector.
-
More Like This
Finds documents similar to a given document using term statistics and relevance scoring. Basis for recommendation and related-document features.
-
Dot Product Similarity
Inner product of two vectors; equivalent to cosine similarity when vectors are unit-normalised; fast to compute in dense retrieval.
-
Shingling
Shingling represents a document as its set of overlapping n-grams (shingles), enabling near-duplicate detection via Jaccard similarity or MinHash approximations.
-
Shingle
A shingle is an n-gram treated as a set element for document comparison. The term signals a shift from positional sequence analysis to set-based similarity measurement.
-
Character N-Gram
A character n-gram is a contiguous sequence of n characters extracted from a string, enabling tokenisation-free indexing, fuzzy search, language identification, and subword modelling.
-
Trigram
A trigram is an n-gram of length 3 — three consecutive tokens considered as a unit. Trigrams extend bigrams with one extra token of context, improving disambiguation at the cost of sparser counts.
-
N-Gram
An n-gram is a contiguous sequence of n tokens drawn from a text, used to capture local word order for indexing, language modelling, and similarity.