Inverted Index
What it is
An inverted index is the core data structure behind full-text search engines. Rather than storing a document and listing the words it contains (a forward index), it flips the relationship: each unique term becomes a key, and its value is a list of every document in which that term appears. This list is called a postings list.
The name comes from this inversion. A library card catalogue is a rough analogy: instead of browsing a shelf and reading each book to find mentions of “recursion”, you look up “recursion” in the index and get back a list of page references directly.
Elasticsearch, OpenSearch, Solr, Tantivy, and virtually every other production search engine store their searchable text in an inverted index. Without one, answering even a simple keyword query would require scanning every document in the corpus — a linear operation that becomes impractical above a few thousand documents.
How it works
Building an inverted index happens in two phases: analysis and indexing.
Analysis converts raw document text into a stream of tokens. This typically involves lowercasing, splitting on whitespace and punctuation, removing stop words, and applying stemming or lemmatisation. The exact steps depend on the analyser configuration. Each resulting token is the term that will appear as a key in the index.
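A minimal analysis chain might look like this sketch (the regex tokeniser and stop-word list are illustrative, not any particular engine's defaults, and stemming is omitted for brevity):

```python
import re

STOP_WORDS = {"the", "on", "a", "an", "of"}  # illustrative stop-word list

def analyse(text: str) -> list[str]:
    """Minimal analysis chain: lowercase, split on non-letters, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(analyse("The dog sat on the mat"))  # ['dog', 'sat', 'mat']
```

Real analysers are configurable pipelines; swapping the stop-word set or adding a stemmer changes which terms end up as keys in the index.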
Indexing inserts each token into the index structure. For each token, the engine records:
- the document ID — which document the token came from
- optionally, the term frequency — how many times the token appears in that document
- optionally, the position(s) — the ordinal offset(s) of each occurrence within the document
Position data is necessary for phrase queries (“quick brown fox” must appear as three consecutive tokens) and proximity queries (“dog” within five words of “cat”). Omitting positions saves space and speeds up indexing when you only need single-term or boolean matching.
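A toy indexer recording all three pieces of information can be sketched with in-memory dictionaries (real engines write this to compressed on-disk segments):

```python
from collections import defaultdict

def build_index(docs):
    """Map term -> {doc_id: [positions]}; term frequency is len(positions)."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for pos, token in enumerate(tokens):
            index[token].setdefault(doc_id, []).append(pos)
    return dict(index)

docs = {"D1": ["dog", "sat", "mat"], "D2": ["cat", "sat", "mat"]}
print(build_index(docs)["sat"])  # {'D1': [1], 'D2': [1]}
```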
[illustrate: three short documents mapping to a term dictionary with postings lists — each entry shows doc IDs, term frequencies, and token positions; highlight how “sat” maps to D1 (pos 1) and D2 (pos 1) while “dog” maps only to D1 (pos 0) and D3 (pos 0)]
At query time, the engine looks up each query term in the index — an O(1) hash lookup or an O(log n) binary search over the term dictionary — and retrieves its postings list. For a multi-term query, it merges the postings lists (intersection for AND logic, union for OR), then passes the matching document IDs to a scorer such as BM25.
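The merge step can be sketched with plain Python sets standing in for the sorted-list merge real engines perform over compressed postings:

```python
# Toy index: term -> {doc_id: [positions]}.
index = {
    "dog": {"D1": [0], "D3": [0]},
    "sat": {"D1": [1], "D2": [1]},
    "cat": {"D2": [0], "D3": [2]},
}

def postings(term):
    """Doc IDs in a term's postings list (empty set for unknown terms)."""
    return set(index.get(term, {}))

def query_and(terms):
    """AND semantics: intersect the postings lists."""
    return set.intersection(*(postings(t) for t in terms))

def query_or(terms):
    """OR semantics: union the postings lists."""
    return set().union(*(postings(t) for t in terms))

print(query_and(["dog", "sat"]))         # {'D1'}
print(sorted(query_or(["dog", "cat"])))  # ['D1', 'D2', 'D3']
```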
[illustrate: pipeline diagram — query string enters left, splits into tokens, each token fetched from term dictionary, postings lists merged, merged doc IDs passed to scoring layer, ranked results exit right]
Example
Three documents:
| ID | Text |
|---|---|
| D1 | "the dog sat on the mat" |
| D2 | "the cat sat on the mat" |
| D3 | "the dog chased the cat" |
After lowercasing and stop-word removal (removing “the”, “on”), analysis produces:
| Doc | Tokens |
|---|---|
| D1 | dog, sat, mat |
| D2 | cat, sat, mat |
| D3 | dog, chased, cat |
The resulting inverted index:
| Term | Postings (doc, freq, positions) |
|---|---|
| cat | D2 (1, [0]), D3 (1, [2]) |
| chased | D3 (1, [1]) |
| dog | D1 (1, [0]), D3 (1, [0]) |
| mat | D1 (1, [2]), D2 (1, [2]) |
| sat | D1 (1, [1]), D2 (1, [1]) |
Query: “dog sat” (AND)

- Fetch postings for “dog”: {D1, D3}
- Fetch postings for “sat”: {D1, D2}
- Intersect: {D1}
- Score D1 with BM25 and return it as the sole result.

Query: “dog” OR “cat”

- Fetch “dog”: {D1, D3}
- Fetch “cat”: {D2, D3}
- Union: {D1, D2, D3} — all three documents match and proceed to scoring.
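The scoring step can be sketched with the Okapi BM25 formula for a single term's contribution (k1 = 1.2 and b = 0.75 are conventional defaults, and this IDF variant is one common choice; real engines fold in more bookkeeping):

```python
import math

def bm25_term(tf, df, n_docs, doc_len, avg_len, k1=1.2, b=0.75):
    """One term's BM25 contribution to one document's score."""
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    return idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_len))

# Query "dog sat" against D1 from the example: 3 docs, each 3 tokens long;
# "dog" and "sat" each appear in 2 documents and once in D1.
score = bm25_term(tf=1, df=2, n_docs=3, doc_len=3, avg_len=3) * 2
print(round(score, 2))  # 0.94
```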
[illustrate: before/after showing raw document set on the left and the fully built term dictionary with postings lists on the right, with arrows tracing each token from its source document into its postings entry]
Variants and history
The inverted index concept predates computers — printed concordances (exhaustive word-location indexes for texts such as the Bible) were produced by hand from the 13th century onward. The first computerised inverted indexes appeared in the 1950s for document retrieval systems.
Compressed postings lists. In large corpora, postings lists for common terms (e.g. “price”) can contain millions of entries. Modern engines delta-encode doc IDs (storing the gap between successive IDs rather than absolute values) and then apply variable-byte or PFor-delta compression. This can reduce storage by an order of magnitude and keeps postings lists cache-friendly.
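Both steps can be sketched briefly (the doc IDs are made up, and the "high bit marks the final byte" convention used here is one of several variable-byte layouts in use):

```python
def delta_encode(doc_ids):
    """Store gaps between successive sorted doc IDs instead of absolute values."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def vbyte(n):
    """Variable-byte encode one non-negative integer: 7 payload bits per byte,
    high bit set on the final byte."""
    out = []
    while True:
        out.append(n & 0x7F)
        n >>= 7
        if n == 0:
            break
    out[-1] |= 0x80
    return bytes(out)

gaps = delta_encode([100, 105, 230, 1000])
print(gaps)  # [100, 5, 125, 770]
encoded = b"".join(vbyte(g) for g in gaps)
print(len(encoded))  # 5 bytes, versus 16 for four 32-bit integers
```

Small gaps encode in a single byte each, which is why delta encoding before compression pays off so well for dense postings lists.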
Block-max indexes. Lucene-based engines partition postings lists into blocks and store the maximum BM25 score possible in each block. During retrieval, blocks whose max score cannot affect the top-k result are skipped entirely, dramatically improving query throughput on large corpora.
Forward vs inverted. A forward index (document → term list) is retained alongside the inverted index in many engines to support features like field collapsing, faceting, and “more like this” queries that start from a document rather than a query string.
Column-oriented stores. Some analytical search systems (e.g. ClickHouse, Parquet-based engines) use columnar storage rather than traditional inverted indexes, trading some query flexibility for better compression and vectorised execution on aggregate queries.
When to use it
An inverted index is the right structure whenever your primary access pattern is “find documents containing term X”. It is the correct default for:
- Full-text search across any corpus larger than a few hundred documents.
- Keyword and boolean queries where exact or stemmed term matching is the requirement.
- Any engine that needs to rank results by BM25 or TF-IDF, since both scoring functions read directly from the index’s term frequencies and document frequencies.
It is less suited to:
- Semantic similarity search. Inverted indexes operate on exact (or analysed) terms. If “cheap hotel” should match “affordable accommodation”, you need vector embeddings and an approximate nearest-neighbour (ANN) index instead — or a hybrid that combines both.
- Prefix and wildcard queries without n-gram indexing. A plain inverted index cannot efficiently serve "dog*" without a trie structure, an n-gram index, or an edge-n-gram token filter applied at index time.
- Update-heavy workloads. Inverted indexes are expensive to update in place. Engines like Lucene handle this by writing immutable segments and merging them in the background — writes are batched and delayed, so freshness guarantees are weaker than in a row-store database.