Shingle

What it is

A shingle is an n-gram — a contiguous sequence of n tokens or characters — extracted from a document and treated as a set element rather than a positional index entry.

The word itself carries meaning. In search and IR literature, “n-gram” implies a sequence with position: the third bigram of a document is distinct from the fifth. A “shingle” implies membership in a set: the document either contains this subsequence or it does not, and how many times it appears, or where, is discarded. That framing shift — from sequence to set — is what the terminology signals.

Why a different name?

N-gram and shingle describe the same extracted string. The distinction is in how it is used downstream:

Term Retains position? Primary use
N-gram Yes Language modelling, phrase indexing, autocomplete
Shingle No Document similarity, near-duplicate detection

When a paper or system says “shingle”, it is announcing that documents will be compared as unordered bags of substrings — not that the extraction procedure differs.

k-shingling and w-shingling

Two granularities are standard:

  • k-shingling — the source text is treated as a raw character sequence; every contiguous run of k characters is one shingle. Robust to tokenisation differences and minor whitespace edits.
  • w-shingling — the text is tokenised first; every contiguous run of w words is one shingle. More interpretable; sensitive to exact wording.

The letters k and w are conventional in the literature and map directly to n in n-gram notation.

The MinHash connection

Representing a document as its shingle set makes Jaccard similarity the natural measure of document overlap. Computing exact Jaccard across millions of document pairs is infeasible — O(N²) comparisons. MinHash provides an unbiased approximation by hashing each shingle set to a compact signature. That pipeline — extract shingles, hash to a MinHash signature, compare signatures — is the standard approach for near-duplicate detection in web crawls and LLM pre-training data deduplication.

The shingle is the atomic unit that pipeline depends on. Choosing k or w determines what counts as a shared feature between documents; everything downstream in the similarity pipeline follows from that choice.

See also