Length Filter

What it is

A length filter is a post-tokenisation token filter that discards tokens based on their character length. Any token shorter than min characters or longer than max characters is dropped from the stream; every other token passes through unchanged.

The filter has no opinion about what a token means — it only counts characters. That simplicity is both its strength (near-zero cost, trivial to reason about) and the source of its main failure mode: length is a weak proxy for utility, and a bound that is safe for one language can be destructive in another.

How it works

The filter receives each token from the stream, measures token.length(), and tests two conditions:

token.length() >= min  AND  token.length() <= max

Tokens that satisfy both conditions pass through. Tokens that fail either condition are discarded.
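
In Lucene, the filter is implemented as a thin subclass of FilteringTokenFilter: the base class drives the stream, and the subclass supplies a single accept() predicate. A minimal sketch, close to Lucene's own LengthFilter but with argument validation omitted:

import org.apache.lucene.analysis.FilteringTokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class MinimalLengthFilter extends FilteringTokenFilter {
  private final int min;
  private final int max;
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public MinimalLengthFilter(TokenStream in, int min, int max) {
    super(in);
    this.min = min;
    this.max = max;
  }

  @Override
  protected boolean accept() {
    // Keep the token only if its character length falls inside [min, max].
    final int len = termAtt.length();
    return len >= min && len <= max;
  }
}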

Position increment behaviour

The length filter follows the same position-gap convention as the Stop Word Filter: when a token is dropped, the position increment of the next surviving token is raised to account for the gap. This preserves phrase-query correctness — downstream components know that surviving tokens are not contiguous with their neighbours in the original stream, even though the dropped tokens are no longer visible.
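
The bookkeeping lives in FilteringTokenFilter, the base class shared with the stop filter, so any accept()-style filter gets it for free. A sketch of its main loop, paraphrased from Lucene's source (end-of-stream handling simplified; posIncrAtt is the filter's PositionIncrementAttribute):

import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

private final PositionIncrementAttribute posIncrAtt =
    addAttribute(PositionIncrementAttribute.class);

@Override
public final boolean incrementToken() throws IOException {
  int skippedPositions = 0;
  while (input.incrementToken()) {
    if (accept()) {
      // Fold the positions of any dropped tokens into the survivor's
      // increment so phrase queries still see the original gaps.
      if (skippedPositions != 0) {
        posIncrAtt.setPositionIncrement(
            posIncrAtt.getPositionIncrement() + skippedPositions);
      }
      return true;
    }
    // A rejected token still occupied one or more positions; remember them.
    skippedPositions += posIncrAtt.getPositionIncrement();
  }
  return false; // upstream token stream is exhausted
}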

Configuration

Elasticsearch / OpenSearch

{
  "settings": {
    "analysis": {
      "filter": {
        "my_length_filter": {
          "type": "length",
          "min": 3,
          "max": 20
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_length_filter"]
        }
      }
    }
  }
}

Both min and max are optional. Omitting min defaults to 0 (no lower bound). Omitting max defaults to Integer.MAX_VALUE (no upper bound). In practice, always set at least one bound explicitly — an unconfigured length filter is a no-op.

Solr — LengthFilterFactory

<filter class="solr.LengthFilterFactory" min="3" max="20"/>

Place this inside both <analyzer type="index"> and <analyzer type="query"> blocks if the field type defines them separately, to keep analysis symmetric.

Example

Input sentence: "A DD-WRT router configuration file"

Tokenised by the standard tokeniser: ["A", "DD", "WRT", "router", "configuration", "file"]

After the length filter with min=3, max=20:

Token           Length   Outcome
A               1        dropped
DD              2        dropped
WRT             3        passes
router          6        passes
configuration   13       passes
file            4        passes

Output stream: ["WRT", "router", "configuration", "file"]
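
The same stream can be reproduced with the _analyze API against the my_analyzer configuration above (index name hypothetical). Because my_analyzer also lowercases, the terms come back lower-cased; the response below is abbreviated to the token and position fields:

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "A DD-WRT router configuration file"
}

{
  "tokens": [
    { "token": "wrt",           "position": 2 },
    { "token": "router",        "position": 3 },
    { "token": "configuration", "position": 4 },
    { "token": "file",          "position": 5 }
  ]
}

Positions 2 through 5 rather than 0 through 3 are the position-gap convention described earlier at work: "wrt" still sits in the slot the tokeniser gave it, two positions into the field.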

Interaction with pattern replace filter

The most common production use of the length filter is as a cleanup step after a Pattern Replace Filter that rewrites tokens to an empty string. Replacing matched text with "" does not remove the token — it leaves a zero-length token occupying a position slot. Chaining a length filter with min: 1 immediately downstream discards these ghosts:

"filter": ["my_pattern_replace", "min_length_1"]

where min_length_1 is:

{
  "type": "length",
  "min": 1
}

This two-filter idiom, in which the pattern replace filter produces empty strings and the length filter removes them, is the standard Lucene technique for conditionally deleting tokens by content rather than by list membership.
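
Put together, a sketch of the whole chain as Elasticsearch analysis settings. The pattern shown, which blanks out purely numeric tokens, is illustrative, as are the names:

{
  "analysis": {
    "filter": {
      "my_pattern_replace": {
        "type": "pattern_replace",
        "pattern": "^[0-9]+$",
        "replacement": ""
      },
      "min_length_1": {
        "type": "length",
        "min": 1
      }
    },
    "analyzer": {
      "no_numeric_tokens": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": ["lowercase", "my_pattern_replace", "min_length_1"]
      }
    }
  }
}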

Practical use cases

Removing single-character noise. Tokenisers that split on whitespace and punctuation frequently produce single-character tokens: stray letters, lone punctuation, unit symbols. A min: 2 or min: 3 bound suppresses these without maintaining an explicit stop list.

Discarding oversized tokens. Base64-encoded blobs, URLs, UUIDs, and minified code fragments that slip through the tokeniser as single tokens are useless for text search and inflate the index. A max bound of 30–50 characters discards most of them. This is a defence-in-depth measure — a character filter or pattern replace filter should strip obvious non-text content earlier in the chain, but the length filter catches stragglers.

Cleanup after aggressive stemming. An over-reducing stemmer may collapse long tokens to one or two characters. A min: 3 bound prevents those noise stems from entering the index, though the better fix is usually a more conservative stemmer or a protected-words list.
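
A sketch of that arrangement as an analysis-settings fragment, assuming the stock porter_stem filter (the analyzer and filter names are illustrative):

"filter": {
  "min_length_3": { "type": "length", "min": 3 }
},
"analyzer": {
  "stemmed_en": {
    "type": "custom",
    "tokenizer": "standard",
    "filter": ["lowercase", "porter_stem", "min_length_3"]
  }
}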

When not to use it

CJK scripts. Chinese, Japanese, and Korean text is routinely tokenised into single characters or bigrams, each fully meaningful. A min: 3 bound on a CJK field silently discards the majority of the vocabulary. Apply the length filter only to field types where single-character tokens are genuinely noise.

Fields where short tokens are searchable keys. Part numbers, stock ticker symbols, ISO country codes ("UK", "DE"), chemical element symbols ("Fe", "Cu"), and similar controlled-vocabulary identifiers are typically two to four characters long. On these fields, use a keyword tokeniser with no length filter, or define a separate field type without the filter.
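
Where a single source field must serve both needs, one option (field and analyzer names hypothetical) is a multi-field: the parent keyword field preserves exact matching on the short identifiers, while a text subfield runs the length-filtered analyzer for free-text search:

{
  "mappings": {
    "properties": {
      "part_number": {
        "type": "keyword",
        "fields": {
          "text": { "type": "text", "analyzer": "my_analyzer" }
        }
      }
    }
  }
}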

Asymmetric application. Like the stop word filter, the length filter must be applied symmetrically at index time and query time. Applying it at index time but not query time means short query tokens look up postings lists that do not contain them, producing silent misses.
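
In Elasticsearch the symmetric case is the default: a text field's analyzer runs at both index and query time unless search_analyzer overrides it. The asymmetry therefore usually enters through an explicit search_analyzer that omits the filter, as in this anti-pattern sketch (analyzer names hypothetical):

{
  "mappings": {
    "properties": {
      "body": {
        "type": "text",
        "analyzer": "with_length_filter",
        "search_analyzer": "without_length_filter"
      }
    }
  }
}

Here a two-character query term survives query-time analysis but was never indexed, so the query silently matches nothing.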

See also