N-Gram

What it is

An n-gram is a contiguous sequence of n items drawn from a stream of text. The items are most commonly words (word n-grams) or characters (character n-grams), and n is a small positive integer. A sequence of two items is a bigram, three items a trigram, and one item a unigram. For n ≥ 4 the terms “4-gram”, “5-gram”, and so on are standard.

N-grams are one of the oldest and most broadly applied primitives in NLP. The same mechanism — slide a window of width n across a sequence and emit each window’s contents — underpins language model probability estimation, full-text index construction, spelling correction, document similarity, and plagiarism detection.

How it works

Given a token sequence, a window of width n slides one position at a time from left to right. At each position, the n tokens currently inside the window form one n-gram. A sequence of L tokens produces L − n + 1 n-grams (assuming no padding and L ≥ n).

For character n-grams, the same logic applies at the character level rather than the word level. The string "fox" with n = 2 produces: fo, ox.

The mechanics in Python:

def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    # Slide a window of width n across the tokens and emit each window as a tuple.
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["the", "quick", "brown", "fox"]
print(ngrams(tokens, 2))
# [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
print(ngrams(tokens, 3))
# [('the', 'quick', 'brown'), ('quick', 'brown', 'fox')]

For character n-grams, pass list(word) or list(text) as the token sequence.
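
For example, calling the same function on the characters of "fox":

print(ngrams(list("fox"), 2))
# [('f', 'o'), ('o', 'x')]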

Example

Input sentence: "the quick brown fox"

After whitespace tokenisation: ["the", "quick", "brown", "fox"]

n   Name       N-grams produced
1   Unigrams   the, quick, brown, fox
2   Bigrams    the quick, quick brown, brown fox
3   Trigrams   the quick brown, quick brown fox

Character bigrams of the word "brown": br, ro, ow, wn

N-grams in an inverted index. A search engine indexing bigrams from these four tokens would store entries for the quick, quick brown, and brown fox alongside the standard unigram entries. A query for "quick brown" can now match the bigram directly, supporting exact phrase matching without requiring positional data in the postings list.
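
A minimal sketch of that indexing step, reusing the ngrams function above and assuming documents are plain strings keyed by integer IDs (the names here are illustrative):

from collections import defaultdict

def index_documents(docs: dict[int, str]) -> dict[tuple[str, ...], set[int]]:
    # Map every unigram and bigram to the set of document IDs that contain it.
    index: dict[tuple[str, ...], set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = text.split()          # whitespace tokenisation
        for n in (1, 2):               # index unigrams and bigrams
            for gram in ngrams(tokens, n):
                index[gram].add(doc_id)
    return index

docs = {1: "the quick brown fox", 2: "a quick brown dog"}
index = index_documents(docs)
print(index[("quick", "brown")])  # {1, 2}: phrase match via the bigram entry
print(index[("fox",)])            # {1}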

Variants and history

Language model n-grams. The statistical use of n-grams for language modelling dates to the work of Andrei Markov (1913) on letter sequences, and was formalised for words by Claude Shannon in 1948. A trigram language model assigns a probability to each word given the two preceding words: P(wᵢ | wᵢ₋₂, wᵢ₋₁). This was the dominant approach to language modelling before neural networks made longer-range dependencies tractable.
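
A minimal sketch of the maximum-likelihood estimate behind such a model, reusing the ngrams function above; real models add smoothing (discussed under tradeoffs below) rather than this raw count ratio:

from collections import Counter

def trigram_probability(corpus: list[str], w1: str, w2: str, w3: str) -> float:
    # P(w3 | w1, w2) estimated as count(w1 w2 w3) / count(w1 w2).
    trigram_counts = Counter(ngrams(corpus, 3))
    bigram_counts = Counter(ngrams(corpus, 2))
    context = bigram_counts[(w1, w2)]
    return trigram_counts[(w1, w2, w3)] / context if context else 0.0

corpus = "the quick brown fox saw the quick brown dog".split()
print(trigram_probability(corpus, "quick", "brown", "fox"))
# 0.5: "quick brown" is followed by "fox" once and by "dog" once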

Character n-grams. Breaking words into overlapping character sequences before indexing allows partial-word matching without wildcards. The query "browni" has character trigrams bro, row, own, wni — all of which also appear in "brownie", enabling fuzzy matching via index lookup rather than expensive edit-distance computation over the full vocabulary.
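
A sketch of that candidate lookup, reusing the ngrams function above; the vocabulary and ranking are illustrative, and a real engine would rerank candidates by true edit distance:

from collections import Counter, defaultdict

def char_trigrams(word: str) -> set[tuple[str, ...]]:
    return set(ngrams(list(word), 3))

vocabulary = ["brown", "brownie", "crown", "frown", "quick"]
trigram_index: dict[tuple[str, ...], set[str]] = defaultdict(set)
for word in vocabulary:
    for gram in char_trigrams(word):
        trigram_index[gram].add(word)

def fuzzy_candidates(query: str) -> list[str]:
    # Rank vocabulary words by the number of character trigrams shared with the query.
    shared = Counter()
    for gram in char_trigrams(query):
        for word in trigram_index[gram]:
            shared[word] += 1
    return [word for word, _ in shared.most_common()]

print(fuzzy_candidates("browni"))
# 'brownie' ranks first (4 shared trigrams), 'brown' second (3), then 'crown' and 'frown' (2 each)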

Edge n-grams. A variant that generates n-grams only from the start of a token: "quick" with edge n-grams of length 2 up to the full token produces qu, qui, quic, quick. Search engines use edge n-grams to power as-you-type autocomplete — each prefix is indexed so that partial queries can match fully typed terms.
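
A minimal sketch of edge n-gram generation; the minimum length of 2 mirrors common analyzer settings but is an illustrative choice:

def edge_ngrams(token: str, min_n: int = 2, max_n: int | None = None) -> list[str]:
    # Prefixes of the token from length min_n up to max_n (defaults to the full token length).
    max_n = max_n or len(token)
    return [token[:i] for i in range(min_n, max_n + 1)]

print(edge_ngrams("quick"))
# ['qu', 'qui', 'quic', 'quick']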

Skip-grams. A generalisation of n-grams that allows gaps between tokens. A 1-skip bigram over ["the", "quick", "brown"] would include the brown (skipping “quick”) in addition to the ordinary bigrams. Skip-grams increase vocabulary size rapidly but capture more non-local relationships.
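
A sketch of the generation step, pairing each token with every token up to k skipped positions further to the right:

def skip_bigrams(tokens: list[str], k: int) -> list[tuple[str, str]]:
    # All ordered pairs (tokens[i], tokens[j]) with at most k tokens skipped between them.
    return [
        (tokens[i], tokens[j])
        for i in range(len(tokens))
        for j in range(i + 1, min(i + 2 + k, len(tokens)))
    ]

print(skip_bigrams(["the", "quick", "brown"], 1))
# [('the', 'quick'), ('the', 'brown'), ('quick', 'brown')]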

Shingling. Document-level n-grams — also called shingles — are used in near-duplicate detection. A document is represented as its set of character or word n-grams. The Jaccard similarity of two shingle sets provides a cheap estimate of document overlap, which locality-sensitive hashing (MinHash) can approximate at scale.
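
A minimal sketch of the Jaccard comparison over word-shingle sets, reusing the ngrams function above; the MinHash approximation is omitted:

def shingles(text: str, n: int = 2) -> set[tuple[str, ...]]:
    # Represent a document as the set of its word n-grams (shingles).
    return set(ngrams(text.split(), n))

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

doc_a = shingles("the quick brown fox jumps over the lazy dog")
doc_b = shingles("the quick brown fox leaps over the lazy dog")
print(jaccard(doc_a, doc_b))
# 0.6: six of the ten distinct bigrams are shared, a near-duplicate signal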

When to use it

Use word n-grams when:

  • You need phrase matching in a search index. Indexing bigrams and trigrams alongside unigrams lets you support "quick brown fox" as a phrase query without the storage cost of full positional indexes.
  • You are building a classical language model for text generation, autocomplete, or perplexity scoring and your corpus is too small for a neural approach.
  • You need query expansion — n-grams of query terms can be used to retrieve documents containing partial phrase matches.

Use character n-grams when:

  • You need fuzzy or typo-tolerant search without a separate edit-distance lookup. Character trigrams indexed at write time let the query engine retrieve candidates by shared substrings, which a secondary ranker can then reorder by true edit distance.
  • Your text contains many compound words, URLs, product codes, or morphologically rich tokens where word boundaries are not reliable splits.
  • You are doing language identification — character n-gram profiles are a fast and accurate signal for distinguishing languages.

Tradeoffs to keep in mind:

  • N-gram indexing multiplies the number of index terms. Bigrams of a 10-token sentence produce 9 entries on top of the 10 unigrams. Storage and index build time grow proportionally; set n only as large as your phrase-matching requirements demand.
  • For semantic similarity, n-grams capture surface overlap only. "affordable accommodation" shares no bigrams with "cheap hotel". Dense vector embeddings handle paraphrase; n-grams do not.
  • Classical n-gram language models suffer from sparsity for large n: most trigrams and almost all 5-grams that could grammatically exist never appear in any finite training corpus. Smoothing techniques (Kneser-Ney, Good-Turing) or neural models are needed to handle unseen sequences.

See also