Index-time Analysis

What it is

Index-time analysis is the processing applied to a document’s text fields when the document is added to the index. The analysis chain (optional character filters, a tokeniser, and token filters) runs over the raw field value and produces a stream of terms. Those terms are what get written into the inverted index and stored in postings lists.

Index-time analysis determines the vocabulary of the index. Whatever terms it produces are the only ones that can be matched at search time.

How it works

When a document is indexed:

  1. The field’s configured analyzer is invoked on the field value.
  2. The analysis chain produces a token stream.
  3. Each token in the stream becomes an entry in the inverted index postings lists, keyed by term text.
  4. Position, frequency, and offset information is stored alongside each posting if configured.

The resulting terms in the index are the analysed forms — not the original text. If the analyzer lowercases and stems, the index contains "fox", not "Foxes".
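The indexing steps above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular engine stores its postings: the analyze function is a toy (whitespace tokeniser plus lowercasing, no stemming), and the names index_document and index are made up for this example.

```python
from collections import defaultdict

def analyze(text):
    """Toy analyzer (assumption): whitespace tokeniser + lowercase filter."""
    return [token.lower() for token in text.split()]

def index_document(index, doc_id, field_value):
    """Run the analyzer and record each term's positions per document."""
    for position, term in enumerate(analyze(field_value)):
        # Postings layout: term -> {doc_id -> [positions]}
        index[term].setdefault(doc_id, []).append(position)

index = defaultdict(dict)
index_document(index, 1, "Quick brown Fox")
index_document(index, 2, "The fox and the Hound")

print(index["fox"])    # {1: [2], 2: [1]}
print(index["the"])    # {2: [0, 3]}
print("Fox" in index)  # False: only the analysed form is a key
```

Term frequency falls out of this layout as the length of each position list; a real engine stores positions and offsets only when the field is configured to keep them.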

Index-time and query-time analysis often share most of their chain. Where they differ, the index-time side is usually the more aggressive one: it may generate additional tokens (synonyms, edge n-grams) that expand recall. These tokens don’t need to appear in the query because they’re pre-computed at index time.

Example

Index-time analyzer: lowercase → stop words → Porter2 stemmer → edge n-gram (min=2, max=8)

Input field value: "Running shoes"

Step          Tokens
Lowercase     ["running", "shoes"]
Stop words    ["running", "shoes"] (no change)
Porter2       ["run", "shoe"]
Edge n-gram   ["ru", "run", "sh", "sho", "shoe"]

The index contains ["ru", "run", "sh", "sho", "shoe"]. A query for "sho" matches because "sho" is in the index: the edge n-gram filter pre-computed prefix tokens at index time. The query-time analyzer needs no n-gram step of its own; it only has to leave "sho" intact.
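The example can be reproduced with a small Python sketch. Two things here are assumptions made to keep it self-contained: the tokeniser is a plain whitespace split, and toy_stem is a stand-in that handles only the two suffixes this input needs. It is not the Porter2 algorithm. The stop-word list is likewise illustrative.

```python
STOP_WORDS = {"a", "an", "and", "the", "of"}  # illustrative list

def toy_stem(token):
    """Stand-in for Porter2: covers only '-ing' and plural '-s'."""
    if token.endswith("ing") and len(token) > 5:
        token = token[:-3]
        # Undouble a trailing consonant: "runn" -> "run"
        if len(token) >= 2 and token[-1] == token[-2] and token[-1] not in "aeiou":
            token = token[:-1]
    elif token.endswith("s") and not token.endswith("ss"):
        token = token[:-1]
    return token

def edge_ngrams(token, min_gram=2, max_gram=8):
    """Prefixes of the token, from min_gram up to max_gram characters."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]

def base_chain(text):
    """Whitespace split -> lowercase -> stop words -> stem."""
    tokens = [t.lower() for t in text.split()]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [toy_stem(t) for t in tokens]

def index_analyzer(text):
    """Index-time chain: base chain plus edge n-grams."""
    return [gram for t in base_chain(text) for gram in edge_ngrams(t)]

def query_analyzer(text):
    """Query-time chain: same base chain, no n-gram expansion."""
    return base_chain(text)

index_terms = index_analyzer("Running shoes")
print(index_terms)  # ['ru', 'run', 'sh', 'sho', 'shoe']
print(all(t in index_terms for t in query_analyzer("sho")))  # True
```

Keeping the n-gram step out of query_analyzer is the design choice that makes the lookup cheap: the query contributes one term, and the prefix expansion has already been paid for at index time.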

Variants and history

The index-time vs query-time distinction emerged as search engines began applying different processing to ingestion and retrieval. It became explicit in Solr’s schema.xml (mid-2000s) and Elasticsearch’s mappings (2010s).

Common index-time-only operations:

  • Edge n-gram generation: prefix tokens for autocomplete; generating them at index time avoids expensive prefix or wildcard queries.
  • Synonym injection at index time: adds synonym variants to the index, so a query that uses one variant still matches documents written with another.
  • Shingle generation: pre-indexes word bigrams and trigrams for phrase-boosting.
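As a rough illustration of the last two operations, the sketch below expands a token list with synonyms and with shingles. The SYNONYMS table and both function names are hypothetical; a real engine would also place each synonym at the same position as the token it expands, which this sketch does not track.

```python
SYNONYMS = {"sneaker": ["trainer"]}  # hypothetical synonym table

def inject_synonyms(tokens):
    """Emit each token followed by any synonyms configured for it."""
    out = []
    for token in tokens:
        out.append(token)
        out.extend(SYNONYMS.get(token, []))
    return out

def shingles(tokens, min_size=2, max_size=3):
    """Word n-grams of min_size..max_size tokens, joined with spaces."""
    out = []
    for start in range(len(tokens)):
        for size in range(min_size, max_size + 1):
            if start + size <= len(tokens):
                out.append(" ".join(tokens[start:start + size]))
    return out

print(inject_synonyms(["red", "sneaker"]))
# ['red', 'sneaker', 'trainer']
print(shingles(["red", "running", "shoe"]))
# ['red running', 'red running shoe', 'running shoe']
```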

When to use it

Understand index-time analysis when:

  • Debugging missing results: if a query that should match doesn’t, the first step is checking whether the document’s field produced the expected index terms. Use _termvectors in Elasticsearch to inspect what was actually indexed.
  • Designing autocomplete: edge n-gram at index time is the standard pattern.
  • Managing index size: every token filter that emits extra tokens at index time (edge n-grams, synonyms, shingles) increases the number of terms and postings stored, growing the index. Filters that drop or merge tokens, such as stop-word removal and stemming, have the opposite effect. Edge n-grams in particular can multiply index size significantly.
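For the debugging case, a _termvectors request can be built with the Python standard library. This is a sketch under assumptions: the cluster address http://localhost:9200, the index name "products", the document ID "1", and the field "title" are all placeholders, and the helper name build_termvectors_request is made up here. The code only constructs the request; sending it is left commented out because it needs a running cluster.

```python
import json
import urllib.request

def build_termvectors_request(base_url, index, doc_id, field):
    """Build a request asking Elasticsearch for one field's indexed terms."""
    url = f"{base_url}/{index}/_termvectors/{doc_id}"
    body = {
        "fields": [field],
        "positions": True,         # include token positions
        "offsets": True,           # include character offsets
        "term_statistics": False,  # skip per-term document frequencies
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Placeholder values: adjust to your own cluster, index, and field.
req = build_termvectors_request("http://localhost:9200", "products", "1", "title")
print(req.get_method(), req.full_url)

# To send it against a running cluster:
# with urllib.request.urlopen(req) as resp:
#     terms = json.load(resp)["term_vectors"]["title"]["terms"]
#     print(sorted(terms))
```

Comparing the returned terms with the output of the query-time analyzer for the failing query usually shows which filter caused the mismatch.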

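Changing index-time analysis requires reindexing, because documents already in the index keep the terms produced by the old chain. A change to query-time analysis, by contrast, applies to the next query once the new configuration is in place.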

See also