Analyzer

What it is

An analyzer is the named configuration object that wraps an analysis chain in Solr, Elasticsearch, and OpenSearch. Where “analysis chain” is the conceptual pipeline, “analyzer” is the concrete, deployable configuration you assign to fields in your schema.

Analyzers are reusable: define a single english analyzer with lowercasing, stop-word removal, and Porter stemming, then assign it to every text field that needs English analysis, without repeating the configuration.

How it works

In Elasticsearch and OpenSearch, an analyzer is defined in the index settings:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "english_stop", "porter_stem"]
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}

Fields reference the analyzer by name:

{
  "mappings": {
    "properties": {
      "body": {
        "type": "text",
        "analyzer": "my_english",
        "search_analyzer": "my_english"
      }
    }
  }
}

The analyzer field controls index-time analysis; search_analyzer controls query-time analysis. They can differ — a common pattern is to use an edge n-gram analyzer at index time and the standard analyzer at query time for prefix-completion fields.
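
A minimal sketch of that autocomplete pattern, combined into a single index-creation request (the index, analyzer, and filter names here are illustrative):

PUT autocomplete_demo
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_ngram": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 15
        }
      },
      "analyzer": {
        "autocomplete_index": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "autocomplete_ngram"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "autocomplete_index",
        "search_analyzer": "standard"
      }
    }
  }
}

At index time "brown" is stored as the grams br, bro, brow, brown; at query time the user's partial input "bro" passes through the standard analyzer unchanged and matches the stored gram.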

[illustrate: Elasticsearch index settings JSON — left column showing analyzer definition (tokenizer + filters) — right column showing field mapping referencing the analyzer by name — arrow connecting them labelled “used at index and query time”]

Example

Built-in analyzers in Elasticsearch:

Analyzer     Behaviour
standard     Splits on Unicode word boundaries, lowercases
english      standard plus English stop words, possessive handling, and Porter stemming
simple       Splits at non-letters, lowercases
whitespace   Splits on whitespace only, no lowercasing
keyword      No tokenisation; emits the entire field value as a single token
fingerprint  Tokenises, lowercases, sorts, deduplicates, and rejoins; used for duplicate detection
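
To make the differences concrete, here is the same sentence run through four of these analyzers (showing only the output terms):

Input        "The quick brown foxes jumped"
standard     [the, quick, brown, foxes, jumped]
english      [quick, brown, fox, jump]
whitespace   [The, quick, brown, foxes, jumped]
keyword      [The quick brown foxes jumped]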

For most text search use cases, english (or its equivalent for other languages) is the right starting point.

Variants and history

Solr’s equivalent is the <fieldType> element in schema.xml (or the managed schema), where <analyzer> elements are children of the field type definition. Solr distinguishes index-time and query-time analyzers with <analyzer type="index"> and <analyzer type="query"> sub-elements.
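
A minimal sketch of a Solr field type roughly equivalent to the my_english analyzer above (the field type name and stop-word file are illustrative; the two chains are identical here and are split only to show the index/query mechanism):

<fieldType name="text_english" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_en.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_en.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

In practice the query-time chain often diverges, for example by adding a synonym filter only at query time.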

Lucene’s Analyzer class is the underlying abstraction: its tokenStream() method takes a field name and input text and returns a TokenStream of the analysed tokens.
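
A minimal sketch of driving that API directly, using Lucene's built-in EnglishAnalyzer (the class name AnalyzerDemo is illustrative):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerDemo {
    public static void main(String[] args) throws Exception {
        // EnglishAnalyzer bundles standard tokenisation, lowercasing,
        // English stop words, and Porter stemming.
        try (Analyzer analyzer = new EnglishAnalyzer();
             TokenStream stream = analyzer.tokenStream("body", "The quick brown foxes jumped")) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();                   // mandatory before the first incrementToken()
            while (stream.incrementToken()) {
                System.out.println(term);     // prints: quick, brown, fox, jump
            }
            stream.end();
        }
    }
}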

When to use it

Define a custom analyzer whenever the built-in analyzers don’t match your content:

  • Domain-specific stop words — medical, legal, or financial corpora have domain-specific high-frequency terms to suppress.
  • Custom synonym handling — inject a synonym filter with your product catalog’s abbreviations and aliases.
  • Multilingual fields — use a language-specific analyzer per language, then combine results at query time.
  • Autocomplete fields — use an edge n-gram analyzer at index time to enable prefix matching without wildcard queries.
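
As a sketch of the synonym case above, a custom analyzer wiring in a synonym token filter (the index, analyzer, filter names, and synonym entries are illustrative):

PUT products
{
  "settings": {
    "analysis": {
      "filter": {
        "catalog_synonyms": {
          "type": "synonym",
          "synonyms": [
            "tv, television",
            "laptop, notebook"
          ]
        }
      },
      "analyzer": {
        "product_text": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "catalog_synonyms"]
        }
      }
    }
  }
}

For multi-word synonyms applied at query time, Elasticsearch's synonym_graph filter is generally the safer choice.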

Use the _analyze API in Elasticsearch to test an analyzer before assigning it to fields. Built-in analyzers can be tested directly, but a custom analyzer such as my_english exists only in the index that defines it, so the request must target that index (a hypothetical my_index here):

POST my_index/_analyze
{
  "analyzer": "my_english",
  "text": "The quick brown foxes jumped"
}
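
For the my_english analyzer defined earlier, the response should look roughly like this: the stop word is dropped (note the gap at position 0) and the remaining terms are stemmed.

{
  "tokens": [
    { "token": "quick", "start_offset": 4,  "end_offset": 9,  "type": "<ALPHANUM>", "position": 1 },
    { "token": "brown", "start_offset": 10, "end_offset": 15, "type": "<ALPHANUM>", "position": 2 },
    { "token": "fox",   "start_offset": 16, "end_offset": 21, "type": "<ALPHANUM>", "position": 3 },
    { "token": "jump",  "start_offset": 22, "end_offset": 28, "type": "<ALPHANUM>", "position": 4 }
  ]
}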

See also