N-Gram
-
N-Gram Language Model
An n-gram language model estimates the probability of each token from observed n-gram counts, conditioning on the preceding n−1 tokens; it was the foundation of statistical NLP before neural methods.
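A minimal sketch of the idea in Python, using bigram counts with add-one smoothing (the toy corpus, the `bigram_prob` name, and the smoothing choice are illustrative assumptions, not a reference implementation):

```python
from collections import Counter

# Toy corpus; any tokenised text would do.
tokens = "the cat sat on the mat the cat ran".split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def bigram_prob(prev, word, v=len(unigram_counts)):
    # Maximum-likelihood estimate; add-one (Laplace) smoothing is an
    # illustrative choice so unseen bigrams get non-zero probability.
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + v)

print(bigram_prob("the", "cat"))  # higher: "the cat" occurs twice
print(bigram_prob("the", "ran"))  # lower: never observed after "the"
```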
-
Edge N-Gram
An edge n-gram is a prefix-anchored n-gram generated from the start of a token, used in search engines to power as-you-type autocomplete and prefix matching.
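A sketch of index-time edge n-gram expansion in Python (the `edge_ngrams` helper and its length bounds are assumptions for illustration):

```python
def edge_ngrams(token, min_len=1, max_len=None):
    """Prefix-anchored n-grams of `token`, from min_len up to max_len."""
    max_len = max_len or len(token)
    return [token[:i] for i in range(min_len, max_len + 1)]

# Index-time expansion: "quick" now matches the queries q, qu, qui, ...
print(edge_ngrams("quick"))  # ['q', 'qu', 'qui', 'quic', 'quick']
```

The expansion is typically done once at index time, so an as-you-type query needs only an exact term lookup; Elasticsearch, for example, ships an edge_ngram tokenizer along these lines.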
-
Shingling
Shingling represents a document as its set of overlapping n-grams (shingles), enabling near-duplicate detection via Jaccard similarity or MinHash approximations.
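A minimal Python sketch with word-level 3-shingles and exact Jaccard similarity (the document texts and shingle size are illustrative; at scale, MinHash would replace the exact set comparison):

```python
def shingles(text, k=3):
    """Set of overlapping word k-shingles of `text`."""
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

doc1 = shingles("the quick brown fox jumps over the lazy dog")
doc2 = shingles("the quick brown fox leaps over the lazy dog")
print(jaccard(doc1, doc2))  # 0.4: one changed word breaks three shingles
```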
-
Shingle
A shingle is an n-gram treated as a set element for document comparison. The term signals a shift from positional sequence analysis to set-based similarity measurement.
-
Character N-Gram
A character n-gram is a contiguous sequence of n characters extracted from a string, enabling tokenisation-free indexing, fuzzy search, language identification, and subword modelling.
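A one-line extractor in Python (the function name is an assumption; subword models such as fastText additionally pad tokens with boundary markers like `<` and `>` before extracting character n-grams):

```python
def char_ngrams(text, n=3):
    """All contiguous character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("search"))  # ['sea', 'ear', 'arc', 'rch']
```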
-
Skip-Gram
A skip-gram is a generalisation of the n-gram that allows gaps between tokens, and also the name of the Word2Vec training objective that predicts context words from a centre word.
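A sketch of k-skip-n-gram extraction in Python, where up to k tokens may be skipped between consecutive picks (the function name and defaults are assumptions; the Word2Vec skip-gram objective mentioned above is a separate, neural use of the name):

```python
from itertools import combinations

def skip_grams(tokens, n=2, k=1):
    """k-skip-n-grams: n tokens in order, with at most k tokens
    skipped between consecutive picks (k=0 gives plain n-grams)."""
    grams = set()
    for combo in combinations(range(len(tokens)), n):
        # Adjacent chosen indices may differ by at most k + 1.
        if all(combo[i + 1] - combo[i] <= k + 1 for i in range(n - 1)):
            grams.add(tuple(tokens[i] for i in combo))
    return grams

print(sorted(skip_grams("the quick brown fox".split(), n=2, k=1)))
# plain bigrams plus 1-skip pairs such as ('the', 'brown')
```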
-
Trigram
A trigram is an n-gram of length 3 — three consecutive tokens considered as a unit. Trigrams extend bigrams with one extra token of context, improving disambiguation at the cost of sparser counts.
-
Bigram
A bigram is an n-gram of length 2 — two consecutive tokens considered as a pair. Bigram models condition each token on the one before it, capturing local order that unigram models discard.
-
Unigram
A unigram is an n-gram of length 1 — a single token considered in isolation. The unigram model treats each token as statistically independent, forming the basis of bag-of-words retrieval.
-
N-Gram
An n-gram is a contiguous sequence of n tokens drawn from a text, used to capture local word order for indexing, language modelling, and similarity.
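A minimal Python sketch tying the entries together: one extractor yields the unigrams, bigrams, and trigrams defined above (function name and sample text are illustrative):

```python
def ngrams(tokens, n):
    """Contiguous n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "to be or not to be".split()
print(ngrams(tokens, 1))  # unigrams: ('to',), ('be',), ...
print(ngrams(tokens, 2))  # bigrams:  ('to', 'be'), ('be', 'or'), ...
print(ngrams(tokens, 3))  # trigrams: ('to', 'be', 'or'), ...
```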