Pattern Replace Filter
What it is
A pattern replace filter is a token filter that applies a regular expression substitution to each token in turn, rewriting its text and passing it downstream. It operates on already-segmented tokens, not on the raw character stream, which makes it fundamentally different from a pattern tokeniser.
The distinction is worth pinning down before anything else:
| Stage | Component | Operates on | Produces |
|---|---|---|---|
| Pre-tokenisation | Character filter (e.g. HTML strip) | Raw character stream | Cleaned character stream |
| Tokenisation | Pattern tokeniser / standard tokeniser | Character stream | Token stream |
| Post-tokenisation | Pattern replace filter | Individual tokens | Rewritten tokens |
A pattern tokeniser decides where to cut text into tokens. A pattern replace filter takes tokens that already exist and changes what they say. You can use both in the same analysis chain — and often should — but they are separate tools solving separate problems.
In Lucene-based engines (Elasticsearch, OpenSearch, Solr), the pattern replace filter is a named token filter wired into a custom analyser. It is applied at both index time and query time, and like every other token filter, it must be configured identically in both directions to avoid analysis asymmetry.
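To make the division of labour concrete, here is a sketch using the `_analyze` API with both a pattern tokeniser and a pattern replace filter defined inline (the comma-splitting and quote-stripping rules are illustrative, not from any particular production setup):

```json
GET /_analyze
{
  "tokenizer": {
    "type": "pattern",
    "pattern": ","
  },
  "filter": [
    {
      "type": "pattern_replace",
      "pattern": "^\"|\"$",
      "replacement": ""
    }
  ],
  "text": "\"alpha\",\"beta\",\"gamma\""
}
```

The tokeniser cuts the stream at commas, producing the tokens `"alpha"`, `"beta"`, `"gamma"`; the filter then rewrites each one, stripping the surrounding quotes to yield `alpha`, `beta`, `gamma`.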
How it works
The filter iterates over the token stream. For each token it runs the configured regex against the token text and applies the replacement string; the token’s position, offsets, and type are preserved, and only the text changes. If the replacement produces an empty string, the token’s text becomes empty — see Gotchas for why that matters.
In Elasticsearch and OpenSearch the filter is declared as type: pattern_replace and accepts three parameters:
- `pattern` — a Java regular expression applied to each token’s text.
- `replacement` — the replacement string; Java capture-group back-references (`$1`, `$2`, …) are supported.
- `all` — boolean (default `true`). When `true`, all non-overlapping matches within the token are replaced; when `false`, only the first match is replaced.
```json
{
  "settings": {
    "analysis": {
      "filter": {
        "strip_version_build": {
          "type": "pattern_replace",
          "pattern": "-SNAPSHOT$",
          "replacement": ""
        }
      },
      "analyzer": {
        "version_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["strip_version_build", "lowercase"]
        }
      }
    }
  }
}
```

Note the filter order: `strip_version_build` runs before `lowercase`, because the pattern `-SNAPSHOT$` is case-sensitive and would never match the already-lowercased text (see Gotchas).
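To verify the chain end to end, the `_analyze` API can run the analyser against a sample value (the index name `my-index` is a placeholder):

```json
GET /my-index/_analyze
{
  "analyzer": "version_analyzer",
  "text": "MyApp-2.4.1-SNAPSHOT"
}
```

With the settings above this should return a single token, `myapp-2.4.1`: the keyword tokeniser emits the whole input as one token, the pattern strips the suffix, and lowercasing runs last.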
In Solr the equivalent is PatternReplaceFilterFactory:
```xml
<filter class="solr.PatternReplaceFilterFactory"
        pattern="-SNAPSHOT$"
        replacement=""
        replace="all"/>
```
The Solr replace attribute accepts "all" or "first", corresponding to Elasticsearch’s all: true and all: false.
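In context, the factory sits inside an analyser chain on a field type in the schema. A minimal sketch, with an illustrative field-type name and the same order-sensitive placement before lowercasing:

```xml
<fieldType name="version_text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="-SNAPSHOT$"
            replacement=""
            replace="all"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```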
Example
Use case: normalise hyphenated compounds.
A product catalogue indexes terms like "e-mail", "e-commerce", and "co-op". Users search for both hyphenated and unhyphenated forms. A pattern replace filter collapses the hyphen:
```json
{
  "filter": {
    "dehyphenate": {
      "type": "pattern_replace",
      "pattern": "(?<=\\w)-(?=\\w)",
      "replacement": ""
    }
  }
}
```
| Input token | Output token |
|---|---|
| `e-mail` | `email` |
| `e-commerce` | `ecommerce` |
| `co-op` | `coop` |
| `state-of-the-art` | `stateoftheart` |
The lookbehind and lookahead assertions (`(?<=\w)` and `(?=\w)`) match only hyphens flanked by word characters, leaving leading or trailing hyphens intact.
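One assumption worth making explicit: the standard tokeniser splits on hyphens, so `e-mail` would reach the filter as two tokens (`e`, `mail`) with nothing left to dehyphenate. The filter needs a tokeniser that keeps compounds whole. A sketch using `_analyze` with the filter defined inline:

```json
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "pattern_replace",
      "pattern": "(?<=\\w)-(?=\\w)",
      "replacement": ""
    }
  ],
  "text": "e-mail e-commerce co-op"
}
```

This should emit the tokens `email`, `ecommerce`, and `coop`.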
Use case: strip noise suffixes from identifiers.
A log search pipeline indexes error codes like NullPointerException, IllegalArgumentException, and IOException. Stripping the trailing Exception suffix during analysis normalises all error-code tokens to their base form:
```json
{
  "filter": {
    "strip_exception_suffix": {
      "type": "pattern_replace",
      "pattern": "Exception$",
      "replacement": ""
    }
  }
}
```
| Input token | Output token |
|---|---|
| `NullPointerException` | `NullPointer` |
| `IllegalArgumentException` | `IllegalArgument` |
| `IOException` | `IO` |
This is a lightweight alternative to a stemmer for domain-specific suffix stripping where the suffix is fixed and the stemmer would produce wrong results.
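Wired into a full analyser (the analyser name here is illustrative), the filter again has to precede lowercasing, since `Exception$` is case-sensitive. A sketch:

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "strip_exception_suffix": {
          "type": "pattern_replace",
          "pattern": "Exception$",
          "replacement": ""
        }
      },
      "analyzer": {
        "error_code_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["strip_exception_suffix", "lowercase"]
        }
      }
    }
  }
}
```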
Variants and history
Lucene’s PatternReplaceFilter has been part of the core lucene-analyzers module since Lucene 3.x. The Elasticsearch pattern_replace token filter is a thin wrapper around it, as is Solr’s PatternReplaceFilterFactory. The parameters are stable across versions; only the all/replace naming differs between the two engines.
A related component is Elasticsearch’s pattern_replace character filter, which applies the same regex substitution at the character-stream level — before tokenisation. The character filter takes the same `pattern` and `replacement` parameters, but its scope is the entire raw input string rather than a single token. Use the character filter when you need to rewrite the text before the tokeniser sees it (decoding `&amp;` to `&`, collapsing repeated punctuation); use the token filter when the rewriting must happen token-by-token after segmentation.
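A sketch of the character-filter variant, collapsing runs of exclamation marks before the standard tokeniser runs (the filter and analyser names are illustrative):

```json
{
  "settings": {
    "analysis": {
      "char_filter": {
        "collapse_bangs": {
          "type": "pattern_replace",
          "pattern": "!{2,}",
          "replacement": "!"
        }
      },
      "analyzer": {
        "cleaned_text": {
          "type": "custom",
          "char_filter": ["collapse_bangs"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```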
Gotchas
Empty-string replacement does not remove the token. Replacing matched text with "" rewrites the token’s text to an empty string — it does not drop the token from the stream. An empty-string token occupies a position slot and can cause phrase-query position arithmetic to misalign. If removal is the intent, chain a length token filter downstream set to min: 1 to discard any zero-length tokens:
"filter": ["my_pattern_replace", "min_length_1"]
where min_length_1 is:
{
"type": "length",
"min": 1
}
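Assembled into index settings, a minimal sketch. The filter names are the placeholders used above; the pattern is illustrative and empties any token that is pure digits, which the length filter then discards:

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "my_pattern_replace": {
          "type": "pattern_replace",
          "pattern": "^[0-9]+$",
          "replacement": ""
        },
        "min_length_1": {
          "type": "length",
          "min": 1
        }
      },
      "analyzer": {
        "drop_numeric_tokens": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["my_pattern_replace", "min_length_1"]
        }
      }
    }
  }
}
```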
Greedy matching consumes more than intended. The default Java regex quantifiers are greedy. A pattern like `\(.*\)` applied to the token `(foo) and (bar)` matches the entire span from the first `(` to the last `)`, not just `(foo)`. Use reluctant quantifiers (`.*?`) or more tightly scoped patterns when the token text may contain multiple match candidates.
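A quick way to see the difference is to test both patterns inline against a single keyword token. A sketch (the bracketed sample text is illustrative):

```json
GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [
    {
      "type": "pattern_replace",
      "pattern": "\\(.*\\)",
      "replacement": "X"
    }
  ],
  "text": "(foo) and (bar)"
}
```

The greedy pattern should return the single token `X`, the whole span having matched once. Swapping the pattern for `\\(.*?\\)` should instead return `X and X`, with each parenthesised group replaced separately.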
The all: false footgun. Setting all: false replaces only the first match. If all is set differently between index-time and query-time analysers, tokens with multiple pattern matches will differ between the two contexts — producing misses that are hard to debug because _analyze only tests one direction at a time.
Position offsets and highlighting. Rewriting a token’s text changes its character length. Lucene’s highlighter uses stored character offsets to mark up the original document text. If the rewritten token length diverges significantly from the original, highlight spans can shift. For single-character substitutions (hyphen removal, suffix stripping) this is negligible; for wholesale rewrites it can break highlighting entirely. Test with the highlight API before deploying wide rewrites into a production index.
Filter order relative to lowercasing. A pattern replace filter running before lowercase sees mixed-case input; running after it sees only lowercase. Case-sensitive suffix patterns — Exception$ for example — must run before lowercasing, or the pattern must be adjusted to match the lowercased form (exception$). Trace a representative token through each stage in the chain before committing to the configuration.
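If the chain must lowercase first, the adjusted form looks like this (a sketch; the filter and analyser names are illustrative):

```json
{
  "filter": {
    "strip_exception_suffix_lc": {
      "type": "pattern_replace",
      "pattern": "exception$",
      "replacement": ""
    }
  },
  "analyzer": {
    "error_codes_lc": {
      "type": "custom",
      "tokenizer": "whitespace",
      "filter": ["lowercase", "strip_exception_suffix_lc"]
    }
  }
}
```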
When to use it
Use it when:
- You need to normalise domain-specific surface forms that a stemmer or lemmatiser does not handle — version strings, product codes, chemical identifiers, error codes with fixed suffixes.
- You want to collapse alternate spellings at index time (`e-mail` → `email`) without maintaining an explicit synonym list for every variant.
- You need to strip or rewrite structural noise in identifiers (build metadata, trailing qualifiers) before full-text scoring.
- The rewrite rule is stable, auditable, and expressible as a single regex substitution.
Prefer alternatives when:
- The transformation is morphological. A Stemmer or Lemmatiser handles inflectional and derivational variation more robustly than a handwritten suffix pattern. Reserve the pattern replace filter for domain cases those tools miss.
- The transformation applies to raw markup or encoding. A character filter (HTML Strip, `mapping` char filter) operates before tokenisation and is the correct place to strip tags or decode entities. Using a pattern replace token filter to clean markup means the tokeniser already ran over the dirty input.
- You need to remove tokens entirely. Pattern replace does not remove tokens; it rewrites them. If removal is the goal, use a Stop Word Filter or chain a `length` filter to discard the resulting empty strings.