Analysis Chain
What it is
An analysis chain is the complete text processing pipeline applied to a field’s content before it is stored in the index or matched against a query. It begins with a tokeniser that splits raw text into tokens, followed by zero or more token filters that transform, remove, or inject tokens.
The analysis chain is the configuration unit that determines what terms end up in the inverted index — and therefore what queries will match what documents.
How it works
A typical analysis chain has three stages:
- Character filters (optional) — applied to the raw character stream before tokenisation. Examples: strip HTML tags, replace "&" with "and", fold accented characters.
- Tokeniser — splits the filtered character stream into tokens. Examples: whitespace tokeniser, standard (word boundary) tokeniser, regex tokeniser. Exactly one tokeniser per chain.
- Token filters — applied to the token stream in sequence. Examples: lowercase, stop word removal, stemmer, synonym expansion, edge n-gram. Any number of filters, applied in order.
raw text
→ [char filter: html_strip]
→ [char filter: mapping (& → and)]
→ [tokeniser: standard]
→ [token filter: lowercase]
→ [token filter: stop words]
→ [token filter: porter2 stemmer]
→ index terms
Order matters. Lowercasing before stop word removal ensures that “The” and “the” are both caught by a lowercase stop list. Stemming after stop word removal avoids wasting cycles on terms that will be discarded.
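The three stages above can be sketched directly in code. This is a minimal illustration, not a real library API: the function names (analyze, html_strip, and so on) and the tiny stop list are all invented for the example.

```python
import re

def html_strip(text):
    """Character filter: remove HTML tags from the raw character stream."""
    return re.sub(r"<[^>]+>", " ", text)

def map_ampersand(text):
    """Character filter: replace '&' with the word 'and'."""
    return text.replace("&", " and ")

def standard_tokenise(text):
    """Tokeniser: rough stand-in for a word-boundary ('standard') tokeniser."""
    return re.findall(r"\w+(?:'\w+)?", text)

def lowercase(tokens):
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]

def remove_stop_words(tokens, stop={"the", "a", "an", "and", "of"}):
    """Token filter: drop tokens found in a (toy) stop list."""
    return [t for t in tokens if t not in stop]

def analyze(text, char_filters, tokenise, token_filters):
    for cf in char_filters:           # stage 1: character filters, in order
        text = cf(text)
    tokens = tokenise(text)           # stage 2: exactly one tokeniser
    for tf in token_filters:          # stage 3: token filters, in order
        tokens = tf(tokens)
    return tokens

print(analyze("<p>Fish & Chips</p>",
              [html_strip, map_ampersand],
              standard_tokenise,
              [lowercase, remove_stop_words]))
# → ['fish', 'chips']
```

Note that the mapping filter runs before tokenisation and the stop word filter after lowercasing, mirroring the ordering argument above: "and" produced by the character filter is only removed because lowercasing has already run when the stop list is consulted.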
Example
Input: "The quick brown foxes jumped!"
| Stage | Output |
|---|---|
| Standard tokeniser | ["The", "quick", "brown", "foxes", "jumped"] |
| Lowercase | ["the", "quick", "brown", "foxes", "jumped"] |
| Stop word filter | ["quick", "brown", "foxes", "jumped"] |
| Porter2 stemmer | ["quick", "brown", "fox", "jump"] |
The index stores ["quick", "brown", "fox", "jump"]. A query for "jumping foxes" passed through the same chain becomes ["jump", "fox"] — and matches this document even though neither exact word appears in it.
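The table can be reproduced with a compact sketch of the same chain. The stemmer here is a crude suffix stripper standing in for a real Porter2 implementation, and the stop list is a toy; both are illustrative assumptions, not the actual algorithms.

```python
import re

STOP = {"the", "a", "an", "and", "of", "to", "in"}

def toy_stem(token):
    """Crude suffix stripping as a stand-in for a real Porter2 stemmer."""
    for suffix in ("es", "ed", "ing"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(text):
    tokens = re.findall(r"\w+", text)              # standard-style tokeniser
    tokens = [t.lower() for t in tokens]           # lowercase filter
    tokens = [t for t in tokens if t not in STOP]  # stop word filter
    return [toy_stem(t) for t in tokens]           # stemmer filter

print(analyze("The quick brown foxes jumped!"))  # ['quick', 'brown', 'fox', 'jump']
print(analyze("jumping foxes"))                  # ['jump', 'fox']
```

Because the query runs through the same chain as the document, both sides meet on the shared terms "jump" and "fox".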
Variants and history
The analysis chain concept was formalised in Apache Lucene and made explicit in Elasticsearch’s analyzer configuration. Earlier IR systems (SMART, Okapi) applied equivalent processing but did not name it a “chain.”
Solr calls the full pipeline an analyzer and exposes it in fieldType schema definitions. Elasticsearch uses the same terminology.
Key considerations:
- Index-time vs query-time analysis — these chains can differ, but the terms they produce must be compatible. The query chain is often simpler (e.g. no edge n-gram generation).
- Multi-language analysis — different fields or documents may need different analysis chains; Elasticsearch supports per-field analyzer configuration.
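The index-time vs query-time distinction is easiest to see with edge n-grams. A sketch, assuming a whitespace-and-lowercase base chain (all function names are invented for the example): the index chain expands each token into its prefixes, while the query chain deliberately does not, since expanding query terms the same way would make them over-match.

```python
def edge_ngrams(token, min_len=2):
    """Index-time filter: emit every prefix of the token, shortest first."""
    return [token[:i] for i in range(min_len, len(token) + 1)]

def index_analyze(text):
    """Index-time chain: lowercase, split on whitespace, expand to prefixes."""
    terms = []
    for token in text.lower().split():
        terms.extend(edge_ngrams(token))
    return terms

def query_analyze(text):
    """Query-time chain: same base steps, but no n-gram expansion."""
    return text.lower().split()

index_terms = set(index_analyze("searching"))
print("sear" in index_terms)  # True — the partial query matches a stored prefix
```

The two chains still produce compatible terms: query tokens are plain lowercase words, and every such word (or prefix of one) is present in the index side's output.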
When to use it
Understanding the analysis chain is essential for debugging unexpected search behaviour. The most common issues are:
- Index and query chains out of sync — indexing with stemming but querying without means "foxes" in the query never matches "fox" in the index.
- Wrong tokeniser — a whitespace tokeniser on "jumped!" keeps the trailing punctuation as part of the token, so the stored term never matches a query for "jumped"; a standard tokeniser strips it.
- Filter ordering — stemming before stop word removal stems the stop words; stemming after skips them.
In Elasticsearch, use the _analyze API to test any analyzer against sample text before deploying.
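For example, assuming a local Elasticsearch node on localhost:9200 ("my-index" and "title" are placeholder names for an existing index and field):

```shell
# Run sample text through the built-in standard analyzer
curl -s -X GET "localhost:9200/_analyze" \
  -H 'Content-Type: application/json' \
  -d '{"analyzer": "standard", "text": "The quick brown foxes jumped!"}'

# Run sample text through the chain attached to a specific field
curl -s -X GET "localhost:9200/my-index/_analyze" \
  -H 'Content-Type: application/json' \
  -d '{"field": "title", "text": "jumping foxes"}'
```

The response lists each emitted token with its position and offsets, which makes it easy to spot where a chain diverges from what you expected.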