Stop Word Filter
What it is
A stop word filter is the analysis-chain component responsible for removing stop words from a token stream. Where Stop Word covers the concept and the rationale for removal, this entry covers the filter as a configurable pipeline stage: what knobs it exposes, how it is wired into Lucene-based engines (Elasticsearch, OpenSearch, Solr), and where it can go wrong.
The filter sits between the tokeniser and any downstream filters (stemmer, synonym filter, etc.). It receives tokens one at a time, performs a hash-set membership test against its configured word list, and either discards the token or passes it on.
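The stage can be sketched as a generator over the token stream. This is a minimal Python sketch with hypothetical names; it ignores position increments, which the next section covers.

```python
# Illustrative sketch of a stop word filter stage (not Lucene's API).
# STOP_WORDS and stop_filter are hypothetical names.

STOP_WORDS = frozenset({"the", "is", "at", "of", "and", "a", "to"})

def stop_filter(tokens, stop_words=STOP_WORDS):
    """Yield only tokens not in the stop set (one O(1) hash lookup each)."""
    for token in tokens:
        if token not in stop_words:   # hash-set membership test
            yield token

print(list(stop_filter(["the", "quick", "fox"])))  # ['quick', 'fox']
```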
How it works
Position increments and gaps
When a token is dropped, the filter does not simply close the gap. Lucene adds the number of removed tokens to the position increment of the next surviving token. A stream ["the"(pos=1), "quick"(pos=2), "fox"(pos=3)] after removing "the" becomes ["quick"(pos=2), "fox"(pos=3)]: "quick" is now the first token, but its position increment of 2 preserves the gap left by the removed word.
This matters for phrase queries. If position gaps are preserved correctly, a phrase query for "quick fox" matches even though "the" was removed from the middle of "the quick fox", because the engine knows the tokens are not actually adjacent. If position increments are collapsed (increment reset to 1), phrase queries can produce false positives by matching tokens that were never adjacent in the original text.
Lucene’s StopFilter preserves position increments by default. The enablePositionIncrements parameter (deprecated but still encountered in legacy configs) controlled this behaviour; it now defaults to true and should not be changed.
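The carry-forward of increments can be modelled in a few lines. This is an illustrative Python sketch of the behaviour, not Lucene's actual API; the function name is hypothetical.

```python
# Sketch of increment-preserving stop word removal.
# Each token is a (term, position_increment) pair; a dropped term adds its
# increment to the next surviving token instead of disappearing.

STOP_WORDS = {"the"}

def filter_with_increments(tokens):
    out = []
    pending = 0
    for term, inc in tokens:
        if term in STOP_WORDS:
            pending += inc            # carry the gap forward
        else:
            out.append((term, inc + pending))
            pending = 0
    return out

# "the quick fox": every token normally carries an increment of 1.
stream = [("the", 1), ("quick", 1), ("fox", 1)]
print(filter_with_increments(stream))  # [('quick', 2), ('fox', 1)]
```

The increment of 2 on "quick" is what tells a phrase query that the surviving tokens were not adjacent in the original text.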
Symmetric application
The filter must be applied identically at index time and query time. Asymmetric application produces silent mismatches: a stop word removed at index time but retained in a query token stream generates a lookup against a postings list that does not exist, returning zero results for a query that should match.
In Elasticsearch and OpenSearch this is enforced by defining the filter inside an analyzer object: both index and search stages reference the same analyzer, so the same filter runs in both contexts. In Solr the fieldType definition covers both index and query analysis, again applying the same filter chain in both directions.
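The failure mode of asymmetric configuration can be demonstrated with a toy analyzer. This is hypothetical Python, not a client API; the helper names are invented for illustration.

```python
# Sketch of the silent mismatch: stop words removed at index time
# but retained at query time.

STOP = {"the"}

def analyze(text, remove_stops):
    tokens = text.lower().split()
    return [t for t in tokens if not (remove_stops and t in STOP)]

# Index side removes stop words; query side (misconfigured) does not.
index_terms = set(analyze("the quick fox", remove_stops=True))   # {'quick', 'fox'}
query_terms = analyze("the quick fox", remove_stops=False)       # ['the', 'quick', 'fox']

# The query term "the" looks up a postings list that was never written:
missing = [t for t in query_terms if t not in index_terms]
print(missing)  # ['the']
```

Any query that requires all terms to match (a phrase or conjunction) then returns zero results for a document that plainly contains the text.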
Configuration
Elasticsearch / OpenSearch — inline list
{
  "settings": {
    "analysis": {
      "filter": {
        "my_stop_filter": {
          "type": "stop",
          "stopwords": ["the", "is", "at", "of", "and", "a", "to"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_stop_filter"]
        }
      }
    }
  }
}
Elasticsearch / OpenSearch — predefined language list
{
  "filter": {
    "english_stop": {
      "type": "stop",
      "stopwords": "_english_"
    }
  }
}
Predefined identifiers (_english_, _french_, _german_, etc.) map to the stop lists bundled with Lucene's language analyzers, most of which derive from the Snowball project. The English list contains 33 terms; the SMART list (referenced in some tooling as _smart_) contains 571.
Elasticsearch / OpenSearch — file-based list
{
  "filter": {
    "domain_stop": {
      "type": "stop",
      "stopwords_path": "analysis/domain_stopwords.txt"
    }
  }
}
The path is relative to the Elasticsearch config/ directory. The file should contain one word per line, UTF-8 encoded, with # for comments. File-based lists are preferable for large or frequently updated domain lists: editing an inline array inside a JSON settings object is error-prone and requires a settings update plus an index close/reopen cycle, whereas a file can be edited in place (a reopen or analyzer reload is still needed for the change to take effect).
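A loader for this file format might look like the following sketch. It is a hypothetical helper for illustration, not the engine's own parser; it treats lines beginning with # as comments and skips blanks.

```python
# Sketch of reading a stopwords file: one word per line, UTF-8,
# '#' lines as comments. Function name is hypothetical.

def load_stopwords(path):
    words = set()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line and not line.startswith("#"):
                words.add(line)
    return words
```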
Solr — StopFilterFactory
<fieldType name="text_en" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"
            words="lang/stopwords_en.txt"
            ignoreCase="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"
            words="lang/stopwords_en.txt"
            ignoreCase="true"/>
  </analyzer>
</fieldType>
The words attribute points to a file inside Solr's conf/ directory. With ignoreCase="true" the filter lowercases each token before the membership test, so a lowercase list matches mixed-case input regardless of where the filter sits in the chain. Without it, the stop filter must run after LowerCaseFilterFactory; placing StopFilterFactory before LowerCaseFilterFactory with ignoreCase="false" is a common misconfiguration that lets capitalised stop words pass through silently.
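The effect of the ordering can be shown with a toy chain. This is illustrative Python with hypothetical names, modelling a case-sensitive stop filter (ignoreCase="false") placed on either side of lowercasing.

```python
# Sketch of the ordering pitfall with a case-sensitive stop list.

STOP = {"the"}  # lowercase-only list, case-sensitive lookup

def analyze(tokens, stop_first):
    if stop_first:
        # Misconfigured: the membership test sees "The", not "the".
        tokens = [t for t in tokens if t not in STOP]
        return [t.lower() for t in tokens]
    # Correct: lowercase first, then test membership.
    tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in STOP]

print(analyze(["The", "quick", "fox"], stop_first=True))   # ['the', 'quick', 'fox']
print(analyze(["The", "quick", "fox"], stop_first=False))  # ['quick', 'fox']
```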
Snowball-embedded stop lists
The Snowball project ships its own per-language stop lists alongside its stemmers, and Lucene bundles them as resources. Solr's StopFilterFactory can read a Snowball-format stopword file directly via its format="snowball" attribute, and the long-deprecated SnowballAnalyzer historically combined the stop list and stemmer in one step. Some configurations use this to shorten the filter chain, but the two concerns, stemming and stop-word removal, are better kept as separate, auditable stages.
Protected words and keep-words mode
Some deployments need the inverse behaviour: rather than specifying words to remove, they specify words to keep — discarding everything else. Lucene’s KeepWordFilter and Elasticsearch’s keep token filter type implement this. It is rarely the right tool for general search but is useful for extracting a controlled vocabulary from free text.
More commonly, a protected words list is used alongside another filter (typically a stemmer) rather than a stop filter directly. For stop-word filtering the equivalent is simply maintaining two lists: a primary stop list and an exceptions file, where words in the exceptions file are excluded from the stop list lookup. Lucene has no built-in stop-filter exceptions mechanism, and KeywordMarkerFilter does not help here: the keyword attribute it sets is respected by stemmers but ignored by StopFilter. The practical approach is to remove the excepted terms from the stop list file itself.
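Maintaining the two lists outside the engine is then simple set arithmetic. The lists below are hypothetical examples for illustration.

```python
# Sketch: the effective stop list is the primary list minus the exceptions.

primary = {"the", "is", "at", "of", "and", "a", "to", "it"}
exceptions = {"it"}   # e.g. "IT" is meaningful in a tech-support corpus

effective_stop_list = primary - exceptions
print(sorted(effective_stop_list))  # ['a', 'and', 'at', 'is', 'of', 'the', 'to']
```

The resulting set is what gets written to the stopwords file the filter actually loads.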
Operational gotchas
Query-time only stop words. Some deployments apply stop-word removal at query time but not at index time, on the theory that the index should remain complete. This is a workable strategy but requires care: phrase correctness now depends on the query analyzer preserving position increments for the removed words. If the query parser collapses the gap, a phrase such as "the quick fox" is rewritten as an adjacent "quick fox", which cannot match the indexed positions where the stop word still occupies a slot. If you use this approach, rely on BM25/IDF suppression of common terms for scoring, and verify phrase behaviour explicitly.
Stop words and phrase query slop. A phrase query with slop > 0 allows tokens to be out of order or non-adjacent. Position gaps introduced by stop-word removal count against the slop budget. Removing two stop words in a five-word phrase consumes two slop units even if the remaining tokens are in the correct order. High-slop queries on stop-word-filtered indexes can return unexpected results.
Index rebuild required on list changes. Changing the stop list after indexing requires a full reindex — existing postings for newly stopped terms remain in the index and will be matched, while the query analyser will no longer emit those terms. Partial updates do not repair this inconsistency.
Case sensitivity. The stop list must use the same case as the tokens it is tested against. If the analysis chain lowercases before the stop filter, the list should be in lowercase. If the stop filter runs before lowercasing (unusual but possible), the list must include all relevant case variants or ignoreCase must be enabled.
[illustrate: pipeline diagram showing two misconfigured analysis chains side by side — left chain: StopFilter → LowerCaseFilter, with “The” passing through the stop filter unchecked because the list contains only “the”; right chain: LowerCaseFilter → StopFilter, with “the” correctly caught — annotate the difference in token value at the point of the stop filter in each chain]
When to use it
The stop word filter is appropriate in the same circumstances as stop-word removal generally — large English-language corpora, acceptable recall-over-precision tradeoff, and no hard requirement for exact phrase matching on function words. See Stop Word for the full decision framework.
When configuring the filter specifically:
- Prefer file-based lists over inline arrays for anything beyond a handful of terms — they are easier to audit, diff, and update.
- Use a language-specific predefined list (_english_, _french_, etc.) as the starting point; extend it with a domain exceptions file rather than replacing it wholesale.
- Always configure the filter identically in both index and query analyzers. Diff the two analyzer configs in your schema before deploying to production.
- Test phrase queries against a sample index after any stop list change — position gap behaviour is the most common source of subtle phrase-matching regressions.