Trim Filter
What it is
A trim filter is a post-tokenisation token filter that removes leading and trailing whitespace from every token passing through the analysis stream. The token’s interior content is not touched — only the outer edges are cleaned. No tokens are added or removed; only their text changes.
It is the analysis-chain equivalent of calling String.strip() on each token.
How it works
For each token, the filter scans inward from both ends and discards any whitespace characters it finds before reaching non-whitespace content. The cleaned text replaces the original token text; the position, offset, and type metadata attached to the token are preserved unchanged.
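The metadata claim is easy to verify with the _analyze API in Elasticsearch or OpenSearch; the request below uses an inline tokeniser and filter list, so no index is required:

GET /_analyze
{
  "tokenizer": "keyword",
  "filter": ["trim"],
  "text": "  OpenSearch  "
}

The single token in the response has the text "OpenSearch", but its start_offset and end_offset (0 and 14) still span the padded input: only the term text changed.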
What counts as whitespace
In Lucene’s TrimFilter, the implementation used by both Elasticsearch/OpenSearch and Solr, whitespace is defined by Character.isWhitespace(). That method matches the ASCII space (U+0020), horizontal tab (U+0009), newline (U+000A), carriage return (U+000D), and most Unicode space and line separators, but it deliberately excludes the non-breaking spaces U+00A0 (NO-BREAK SPACE), U+2007 (FIGURE SPACE), and U+202F (NARROW NO-BREAK SPACE). U+00A0 is therefore not stripped by the standard trim filter. If a field receives content with non-breaking spaces at token boundaries, a pattern replace filter targeting \u00A0 must precede the trim filter.
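One way to apply that workaround is sketched below; the filter and analyzer names (nbsp_to_space, keyword_trimmed_nbsp) are illustrative, not built-ins. The pattern_replace token filter rewrites U+00A0 to a plain space so that the trim filter can then remove it:

{
  "settings": {
    "analysis": {
      "filter": {
        "nbsp_to_space": {
          "type": "pattern_replace",
          "pattern": "\\u00A0",
          "replacement": " "
        }
      },
      "analyzer": {
        "keyword_trimmed_nbsp": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["nbsp_to_space", "trim"]
        }
      }
    }
  }
}

The order matters: nbsp_to_space must run before trim so the converted spaces sit at the token edges when trim scans them.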
Where tokens arrive with whitespace attached
The standard tokeniser and most word-oriented tokenisers split on whitespace and strip it automatically — trim is redundant after them. Whitespace-padded tokens appear in several other situations:
- Keyword tokeniser. The keyword tokeniser emits the entire field value as a single token, verbatim, padding and all. A field storing " en " (a language code with padding) indexes the string with the spaces intact, breaking exact-match queries. This is the primary use case for the trim filter.
- Pattern tokenisers and regex tokenisers. A split pattern such as "," applied to "London, Paris, Berlin" produces ["London", " Paris", " Berlin"], i.e. space-prefixed tokens. Adding a trim filter after the tokeniser cleans these without changing the split pattern; a full index definition for this case appears after this list.
- CSV or delimiter splits via character filters. Delimiter-separated fields processed before tokenisation can deposit boundary spaces when the source data has inconsistent formatting.
- Whitespace tokeniser. The whitespace tokeniser splits on the same Character.isWhitespace() definition the trim filter uses, so its tokens cannot carry trimmable padding and a trim filter after it is redundant. Characters it does not classify as whitespace, such as U+00A0, pass through the tokeniser and the trim filter alike.
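To make the pattern-tokeniser case concrete, here is one possible index definition (the names comma_split and comma_trimmed are illustrative) that splits on commas and trims the results:

{
  "settings": {
    "analysis": {
      "tokenizer": {
        "comma_split": {
          "type": "pattern",
          "pattern": ","
        }
      },
      "analyzer": {
        "comma_trimmed": {
          "type": "custom",
          "tokenizer": "comma_split",
          "filter": ["trim"]
        }
      }
    }
  }
}

Analysing "London, Paris, Berlin" with comma_trimmed yields London, Paris, and Berlin as three clean tokens; without the trim filter, the second and third keep their leading space.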
Configuration
Elasticsearch / OpenSearch
{
  "settings": {
    "analysis": {
      "analyzer": {
        "keyword_trimmed": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["trim", "lowercase"]
        }
      }
    }
  }
}
"trim" is a built-in token filter name in Elasticsearch and OpenSearch — no separate filter definition is required.
Solr — TrimFilterFactory
<fieldType name="keyword_trimmed" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
Example
Field value received by the keyword tokeniser: " OpenSearch "
| Stage | Token text |
|---|---|
| After tokeniser | " OpenSearch " |
| After trim filter | "OpenSearch" |
| After lowercase | "opensearch" |
Without the trim filter, a query for "opensearch" against this field returns no results because " opensearch " (padded, lowercased) does not equal "opensearch".
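The failure mode is easy to reproduce by dropping trim from the chain:

GET /_analyze
{
  "tokenizer": "keyword",
  "filter": ["lowercase"],
  "text": "  OpenSearch  "
}

The token comes back as "  opensearch  " with the padding intact, so an exact match against the query term "opensearch" can never succeed.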
When the filter is redundant
Skip the trim filter when:
- The tokeniser is the standard tokeniser (which follows the Unicode text-segmentation rules), the ICU tokeniser, or any other word-segmentation tokeniser; these never emit whitespace-padded tokens.
- The field uses a character filter that normalises whitespace before tokenisation, such as a pattern replace that collapses all whitespace.
Adding a redundant trim filter carries near-zero runtime cost, but it adds noise to the analysis chain definition and can mislead a future reader into thinking whitespace is actually present in the incoming tokens.