N-Gram

What it is

An n-gram is a contiguous sequence of n items drawn from a stream of text. The items are most commonly words (word n-grams) or characters (character n-grams), and n is a small positive integer. A sequence of two items is a bigram, three items a trigram, and one item a unigram. For n ≥ 4 the terms “4-gram”, “5-gram”, and so on are standard.

N-grams are one of the oldest and most broadly applied primitives in NLP. The same mechanism — slide a window of width n across a sequence and emit each window’s contents — underpins language model probability estimation, full-text index construction, spelling correction, document similarity, and plagiarism detection.

How it works

Given a token sequence, a window of width n slides one position at a time from left to right. At each position, the n tokens currently inside the window form one n-gram. A sequence of L tokens produces L − n + 1 n-grams (assuming no padding and L ≥ n).

For character n-grams, the same logic applies at the character level rather than the word level. The string "fox" with n = 2 produces: fo, ox.

The mechanics in Python:

def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    # Slide a window of width n across the tokens and emit each window as a tuple.
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["the", "quick", "brown", "fox"]
print(ngrams(tokens, 2))
# [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
print(ngrams(tokens, 3))
# [('the', 'quick', 'brown'), ('quick', 'brown', 'fox')]

For character n-grams, pass list(word) or list(text) as the token sequence.
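
For example, calling the same function on the characters of "fox":

print(ngrams(list("fox"), 2))
# [('f', 'o'), ('o', 'x')]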

Example

Input sentence: "the quick brown fox"

After whitespace tokenisation: ["the", "quick", "brown", "fox"]

n   Name       N-grams produced
1   Unigrams   the, quick, brown, fox
2   Bigrams    the quick, quick brown, brown fox
3   Trigrams   the quick brown, quick brown fox

Character bigrams of the word "brown": br, ro, ow, wn

N-grams in an inverted index. A search engine indexing bigrams from these four tokens would store entries for the quick, quick brown, and brown fox alongside the standard unigram entries. A query for "quick brown" can now match the bigram directly, supporting exact phrase matching without requiring positional data in the postings list.
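
A minimal sketch of that indexing step, reusing the ngrams function above and assuming documents are plain strings keyed by integer IDs (the names here are illustrative):

from collections import defaultdict

def index_documents(docs: dict[int, str]) -> dict[tuple[str, ...], set[int]]:
    # Map every unigram and bigram to the set of document IDs that contain it.
    index: dict[tuple[str, ...], set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = text.split()          # whitespace tokenisation
        for n in (1, 2):               # index unigrams and bigrams
            for gram in ngrams(tokens, n):
                index[gram].add(doc_id)
    return index

docs = {1: "the quick brown fox", 2: "a quick brown dog"}
index = index_documents(docs)
print(index[("quick", "brown")])  # {1, 2}: phrase match via the bigram entry
print(index[("fox",)])            # {1}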

Variants and history

Language model n-grams. The statistical use of n-grams for language modelling dates to the work of Andrei Markov (1913) on letter sequences, and was formalised for words by Claude Shannon in 1948. A trigram language model assigns a probability to each word given the two preceding words: P(wᵢ | wᵢ₋₂, wᵢ₋₁). This was the dominant approach to language modelling before neural networks made longer-range dependencies tractable.
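
A minimal sketch of the maximum-likelihood estimate behind such a model, reusing the ngrams function above; real models add smoothing (discussed under tradeoffs below) rather than this raw count ratio:

from collections import Counter

def trigram_probability(corpus: list[str], w1: str, w2: str, w3: str) -> float:
    # P(w3 | w1, w2) estimated as count(w1 w2 w3) / count(w1 w2).
    trigram_counts = Counter(ngrams(corpus, 3))
    bigram_counts = Counter(ngrams(corpus, 2))
    context = bigram_counts[(w1, w2)]
    return trigram_counts[(w1, w2, w3)] / context if context else 0.0

corpus = "the quick brown fox saw the quick brown dog".split()
print(trigram_probability(corpus, "quick", "brown", "fox"))
# 0.5: "quick brown" is followed by "fox" once and by "dog" once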

Character n-grams. Breaking words into overlapping character sequences before indexing allows partial-word matching without wildcards. The query "browni" has character trigrams bro, row, own, wni — all of which also appear in "brownie", enabling fuzzy matching via index lookup rather than expensive edit-distance computation over the full vocabulary.
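
A sketch of that candidate lookup, reusing the ngrams function above; the vocabulary and ranking are illustrative, and a real engine would rerank candidates by true edit distance:

from collections import Counter, defaultdict

def char_trigrams(word: str) -> set[tuple[str, ...]]:
    return set(ngrams(list(word), 3))

vocabulary = ["brown", "brownie", "crown", "frown", "quick"]
trigram_index: dict[tuple[str, ...], set[str]] = defaultdict(set)
for word in vocabulary:
    for gram in char_trigrams(word):
        trigram_index[gram].add(word)

def fuzzy_candidates(query: str) -> list[str]:
    # Rank vocabulary words by the number of character trigrams shared with the query.
    shared = Counter()
    for gram in char_trigrams(query):
        for word in trigram_index[gram]:
            shared[word] += 1
    return [word for word, _ in shared.most_common()]

print(fuzzy_candidates("browni"))
# 'brownie' ranks first (4 shared trigrams), 'brown' second (3), then 'crown' and 'frown' (2 each)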

Edge n-grams. A variant that generates n-grams only from the start of a token: "quick" with edge n-grams of length 2 up to the full token produces qu, qui, quic, quick. Search engines use edge n-grams to power as-you-type autocomplete — each prefix is indexed so that partial queries can match fully typed terms.
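
A minimal sketch of edge n-gram generation; the minimum length of 2 mirrors common analyzer settings but is an illustrative choice:

def edge_ngrams(token: str, min_n: int = 2, max_n: int | None = None) -> list[str]:
    # Prefixes of the token from length min_n up to max_n (defaults to the full token length).
    max_n = max_n or len(token)
    return [token[:i] for i in range(min_n, max_n + 1)]

print(edge_ngrams("quick"))
# ['qu', 'qui', 'quic', 'quick']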

Skip-grams. A generalisation of n-grams that allows gaps between tokens. A 1-skip bigram over ["the", "quick", "brown"] would include the brown (skipping “quick”) in addition to the ordinary bigrams. Skip-grams increase vocabulary size rapidly but capture more non-local relationships.
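
A sketch of the generation step, pairing each token with every token up to k skipped positions further to the right:

def skip_bigrams(tokens: list[str], k: int) -> list[tuple[str, str]]:
    # All ordered pairs (tokens[i], tokens[j]) with at most k tokens skipped between them.
    return [
        (tokens[i], tokens[j])
        for i in range(len(tokens))
        for j in range(i + 1, min(i + 2 + k, len(tokens)))
    ]

print(skip_bigrams(["the", "quick", "brown"], 1))
# [('the', 'quick'), ('the', 'brown'), ('quick', 'brown')]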

Shingling. Document-level n-grams — also called shingles — are used in near-duplicate detection. A document is represented as its set of character or word n-grams. The Jaccard similarity of two shingle sets provides a cheap estimate of document overlap, which locality-sensitive hashing (MinHash) can approximate at scale.
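
A minimal sketch of the Jaccard comparison over word-shingle sets, reusing the ngrams function above; the MinHash approximation is omitted:

def shingles(text: str, n: int = 2) -> set[tuple[str, ...]]:
    # Represent a document as the set of its word n-grams (shingles).
    return set(ngrams(text.split(), n))

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

doc_a = shingles("the quick brown fox jumps over the lazy dog")
doc_b = shingles("the quick brown fox leaps over the lazy dog")
print(jaccard(doc_a, doc_b))
# 0.6: six of the ten distinct bigrams are shared, a near-duplicate signal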

When to use it

Use word n-grams when:

  • You need phrase matching in a search index. Indexing bigrams and trigrams alongside unigrams lets you support "quick brown fox" as a phrase query without the storage cost of full positional indexes.
  • You are building a classical language model for text generation, autocomplete, or perplexity scoring and your corpus is too small for a neural approach.
  • You need query expansion — n-grams of query terms can be used to retrieve documents containing partial phrase matches.

Use character n-grams when:

  • You need fuzzy or typo-tolerant search without a separate edit-distance lookup. Character trigrams indexed at write time let the query engine retrieve candidates by shared substrings, which a secondary ranker can then reorder by true edit distance.
  • Your text contains many compound words, URLs, product codes, or morphologically rich tokens where word boundaries are not reliable splits.
  • You are doing language identification — character n-gram profiles are a fast and accurate signal for distinguishing languages.

Tradeoffs to keep in mind:

  • N-gram indexing multiplies the number of index terms. Bigrams of a 10-token sentence produce 9 entries on top of the 10 unigrams. Storage and index build time grow proportionally; set n only as large as your phrase-matching requirements demand.
  • For semantic similarity, n-grams capture surface overlap only. "affordable accommodation" shares no bigrams with "cheap hotel". Dense vector embeddings handle paraphrase; n-grams do not.
  • Classical n-gram language models suffer from sparsity for large n: most trigrams and almost all 5-grams that could grammatically exist never appear in any finite training corpus. Smoothing techniques (Kneser-Ney, Good-Turing) or neural models are needed to handle unseen sequences.

See also