Index-time Analysis
What it is
Index-time analysis is the processing applied to a document’s text fields when the document is added to the index. The analysis chain (optional character filters, a tokeniser, and token filters) runs over the raw field value and produces a stream of terms. Those terms are written into the inverted index, where each term keys a postings list.
Index-time analysis determines the vocabulary of the index. Whatever terms it produces are the only ones that can be matched at search time.
How it works
When a document is indexed:
- The field’s configured `analyzer` is invoked on the field value.
- The analysis chain produces a token stream.
- Each token in the stream becomes an entry in the inverted index postings lists, keyed by term text.
- Position, frequency, and offset information is stored alongside each posting if configured.
The resulting terms in the index are the analysed forms — not the original text. If the analyzer lowercases and stems, the index contains "fox", not "Foxes".
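The steps above can be sketched in a few lines of Python. This is a toy model, not how Lucene or Elasticsearch store postings: `analyze` and `build_index` are hypothetical names, and the plural-stripping rule is only a stand-in for a real stemmer.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer: split on non-letters, lowercase, strip a plural suffix.

    The suffix rule is a hypothetical stand-in for a real stemmer.
    """
    tokens = []
    for raw in re.findall(r"[A-Za-z]+", text):
        term = raw.lower()
        if term.endswith("es") and len(term) > 4:
            term = term[:-2]
        elif term.endswith("s") and len(term) > 3:
            term = term[:-1]
        tokens.append(term)
    return tokens


def build_index(docs):
    """Map each analysed term to {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(analyze(text)):
            index[term].setdefault(doc_id, []).append(position)
    return index


docs = {1: "Foxes chase foxes", 2: "The quick fox"}
index = build_index(docs)

print(index["fox"])      # postings for the analysed term
print("Foxes" in index)  # the original surface form is not a key
```

Term frequency per document is the length of each position list. The original surface form "Foxes" never becomes a key, which is why a query has to be analysed into the same form before lookup.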
Index-time analysis is often more aggressive than query-time analysis: it may emit additional tokens (synonyms, edge n-grams) that expand recall. The query-time analyzer does not need to generate these tokens, because they are already in the index.
Example
Index-time analyzer: lowercase → stop words → Porter2 stemmer → edge n-gram (min=2, max=8)
Input field value: "Running shoes"
| Step | Tokens |
|---|---|
| Lowercase | ["running", "shoes"] |
| Stop words | ["running", "shoes"] (no change) |
| Porter2 | ["run", "shoe"] |
| Edge n-gram | ["ru", "run", "sh", "sho", "shoe"] |
The index contains ["ru", "run", "sh", "sho", "shoe"]. A query for "sho" matches because "sho" is in the index — the edge n-gram filter pre-computed prefix tokens at index time.
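The same chain can be traced in Python as a rough sketch. The stop-word list and the `TOY_STEMS` lookup are assumptions made for illustration; the lookup merely stands in for a Porter2 stemmer and only knows the two words in this example.

```python
STOP_WORDS = {"a", "an", "and", "of", "the"}  # illustrative list only
# Hard-coded stems standing in for a Porter2 stemmer (assumption).
TOY_STEMS = {"running": "run", "shoes": "shoe"}


def lowercase(tokens):
    return [t.lower() for t in tokens]


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def stem(tokens):
    return [TOY_STEMS.get(t, t) for t in tokens]


def edge_ngrams(tokens, min_gram=2, max_gram=8):
    """Emit every prefix of each token between min_gram and max_gram long.

    Tokens shorter than min_gram produce nothing in this sketch.
    """
    out = []
    for t in tokens:
        for n in range(min_gram, min(len(t), max_gram) + 1):
            out.append(t[:n])
    return out


def analyze_for_index(text):
    tokens = text.split()  # whitespace tokenisation
    for step in (lowercase, remove_stop_words, stem, edge_ngrams):
        tokens = step(tokens)
    return tokens


index_terms = analyze_for_index("Running shoes")
print(index_terms)  # ['ru', 'run', 'sh', 'sho', 'shoe']
```

Note that "shoes" itself is not among the index terms, so a query for "shoes" only matches if the query-time analyzer also lowercases and stems it to "shoe".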
Variants and history
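The index-time vs query-time distinction emerged as search engines began applying different processing to ingestion and retrieval. It became explicit in Solr’s schema.xml (mid-2000s), where a field type can declare separate index and query analyzers, and in Elasticsearch’s mappings (2010s), where a field can set both `analyzer` and `search_analyzer`.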
Common index-time-only operations:
- Edge n-gram generation — prefix tokens for autocomplete; generating them at index time avoids expensive wildcard queries.
- Synonym injection at index time — adds synonym variants to the index so queries without synonyms still match.
- Shingle generation — pre-indexes bigrams and trigrams for phrase-boosting.
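Two of these operations are easy to illustrate with a minimal Python sketch. The function names and the `SYNONYMS` table are hypothetical. Unlike some real shingle filters, this version does not also emit the original unigrams.

```python
def shingles(tokens, min_size=2, max_size=3):
    """Join adjacent tokens into word n-grams (bigrams and trigrams here)."""
    out = []
    for start in range(len(tokens)):
        for size in range(min_size, max_size + 1):
            if start + size <= len(tokens):
                out.append(" ".join(tokens[start:start + size]))
    return out


SYNONYMS = {"sneaker": ["trainer"]}  # hypothetical synonym table


def inject_synonyms(tokens):
    """Emit (position, term) pairs; a synonym shares its source's position."""
    out = []
    for position, term in enumerate(tokens):
        out.append((position, term))
        for synonym in SYNONYMS.get(term, []):
            out.append((position, synonym))
    return out


print(shingles(["red", "running", "shoe"]))
# ['red running', 'red running shoe', 'running shoe']
print(inject_synonyms(["red", "sneaker"]))
# [(0, 'red'), (1, 'sneaker'), (1, 'trainer')]
```

Placing a synonym at the same position as its source term lets phrase queries match either variant.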
When to use it
Understand index-time analysis when:
- Debugging missing results: if a query that should match does not, first check whether the document’s field produced the expected index terms. Use `_termvectors` in Elasticsearch to inspect what was actually indexed.
- Designing autocomplete: edge n-gram at index time is the standard pattern.
- Managing index size: token filters that emit extra tokens (edge n-grams, synonyms, shingles) increase the number of terms stored and grow the index, while filters such as stop-word removal reduce it. Edge n-grams in particular can multiply index size significantly.
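To get a rough feel for the edge n-gram cost, the sketch below counts how many tokens an edge n-gram filter emits per input term. The function names are hypothetical, and the figure is a token count only: actual index size also depends on term deduplication and on-disk compression.

```python
def edge_ngram_count(term, min_gram=2, max_gram=8):
    """Number of prefixes emitted for one term (0 if shorter than min_gram)."""
    return max(0, min(len(term), max_gram) - min_gram + 1)


def expansion_factor(terms, min_gram=2, max_gram=8):
    """Average number of tokens emitted per input term."""
    emitted = sum(edge_ngram_count(t, min_gram, max_gram) for t in terms)
    return emitted / len(terms)


terms = ["run", "shoe", "marathon"]
print([edge_ngram_count(t) for t in terms])  # [2, 3, 7]
print(expansion_factor(terms))               # 4.0
```

With min=2 and max=8, these three terms turn into twelve tokens, so longer terms dominate the growth.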
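Changing index-time analysis requires reindexing existing documents, because terms already written to the index keep their old analysed form. A change to query-time analysis applies to subsequent queries without reindexing.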