N-Gram
-
N-Gram Language Model
An n-gram language model estimates the probability of each token from observed n-gram counts, conditioning on the preceding n−1 tokens; it was the foundation of statistical NLP before neural methods.
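A minimal sketch of the idea in Python, using bigram counts with add-one smoothing (the toy corpus, the `bigram_prob` name, and the smoothing choice are illustrative assumptions, not a reference implementation):

```python
from collections import Counter

# Toy corpus; any tokenised text would do.
tokens = "the cat sat on the mat the cat ran".split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def bigram_prob(prev, word, v=len(unigram_counts)):
    # Maximum-likelihood estimate; add-one (Laplace) smoothing is an
    # illustrative choice so unseen bigrams get non-zero probability.
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + v)

print(bigram_prob("the", "cat"))  # higher: "the cat" occurs twice
print(bigram_prob("the", "ran"))  # lower: never observed after "the"
```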
-
Edge N-Gram
An edge n-gram is a prefix-anchored n-gram generated from the start of a token, used in search engines to power as-you-type autocomplete and prefix matching.
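A sketch of index-time edge n-gram expansion in Python (the `edge_ngrams` helper and its length bounds are assumptions for illustration):

```python
def edge_ngrams(token, min_len=1, max_len=None):
    """Prefix-anchored n-grams of `token`, from min_len up to max_len."""
    max_len = max_len or len(token)
    return [token[:i] for i in range(min_len, max_len + 1)]

# Index-time expansion: "quick" now matches the queries q, qu, qui, ...
print(edge_ngrams("quick"))  # ['q', 'qu', 'qui', 'quic', 'quick']
```

The expansion is typically done once at index time, so an as-you-type query needs only an exact term lookup; Elasticsearch, for example, ships an edge_ngram tokenizer along these lines.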
-
Shingling
Shingling represents a document as its set of overlapping n-grams (shingles), enabling near-duplicate detection via Jaccard similarity or MinHash approximations.
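A minimal Python sketch with word-level 3-shingles and exact Jaccard similarity (the document texts and shingle size are illustrative; at scale, MinHash would replace the exact set comparison):

```python
def shingles(text, k=3):
    """Set of overlapping word k-shingles of `text`."""
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

doc1 = shingles("the quick brown fox jumps over the lazy dog")
doc2 = shingles("the quick brown fox leaps over the lazy dog")
print(jaccard(doc1, doc2))  # 0.4: one changed word breaks three shingles
```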
-
Shingle
A shingle is an n-gram treated as a set element for document comparison. The term signals a shift from positional sequence analysis to set-based similarity measurement.
-
Character N-Gram
A character n-gram is a contiguous sequence of n characters extracted from a string, enabling tokenisation-free indexing, fuzzy search, language identification, and subword modelling.
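A one-line extractor in Python (the function name is an assumption; subword models such as fastText additionally pad tokens with boundary markers like `<` and `>` before extracting character n-grams):

```python
def char_ngrams(text, n=3):
    """All contiguous character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("search"))  # ['sea', 'ear', 'arc', 'rch']
```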
-
Skip-Gram
A skip-gram is a generalisation of the n-gram that allows gaps between tokens, and also the name of the Word2Vec training objective that predicts context words from a centre word.
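A sketch of k-skip-n-gram extraction in Python, where up to k tokens may be skipped between consecutive picks (the function name and defaults are assumptions; the Word2Vec skip-gram objective mentioned above is a separate, neural use of the name):

```python
from itertools import combinations

def skip_grams(tokens, n=2, k=1):
    """k-skip-n-grams: n tokens in order, with at most k tokens
    skipped between consecutive picks (k=0 gives plain n-grams)."""
    grams = set()
    for combo in combinations(range(len(tokens)), n):
        # Adjacent chosen indices may differ by at most k + 1.
        if all(combo[i + 1] - combo[i] <= k + 1 for i in range(n - 1)):
            grams.add(tuple(tokens[i] for i in combo))
    return grams

print(sorted(skip_grams("the quick brown fox".split(), n=2, k=1)))
# plain bigrams plus 1-skip pairs such as ('the', 'brown')
```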
-
Trigram
A trigram is an n-gram of length 3 — three consecutive tokens considered as a unit. Trigrams extend bigrams with one extra token of context, improving disambiguation at the cost of sparser counts.
-
Bigram
A bigram is an n-gram of length 2 — two consecutive tokens considered as a pair. Bigram models condition each token on the one before it, capturing local order that unigram models discard.
-
Unigram
A unigram is an n-gram of length 1 — a single token considered in isolation. The unigram model treats each token as statistically independent, forming the basis of bag-of-words retrieval.
-
N-Gram
An n-gram is a contiguous sequence of n tokens drawn from a text, used to capture local word order for indexing, language modelling, and similarity.
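A minimal Python sketch tying the entries together: one extractor yields the unigrams, bigrams, and trigrams defined above (function name and sample text are illustrative):

```python
def ngrams(tokens, n):
    """Contiguous n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "to be or not to be".split()
print(ngrams(tokens, 1))  # unigrams: ('to',), ('be',), ...
print(ngrams(tokens, 2))  # bigrams:  ('to', 'be'), ('be', 'or'), ...
print(ngrams(tokens, 3))  # trigrams: ('to', 'be', 'or'), ...
```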