Elision Filter

What it is

An elision filter is a post-tokenisation token filter that removes elided prefixes from the front of tokens. An elision is a phonological contraction in which a word-final vowel is dropped before a word beginning with a vowel or silent h, and the two words are fused with an apostrophe. In French, the definite articles le and la both become l’ before a vowel-initial noun: le arbre is never written — the correct form is l’arbre (the tree). The preposition de similarly contracts to d’: d’accord, d’eau.

Because the apostrophe is not a letter, some tokenisers treat l’arbre as a single token. Without further processing, l’arbre and arbre are distinct index terms that will not match each other. The elision filter resolves this by stripping the prefix up to and including the apostrophe, emitting arbre in place of l’arbre.

Which languages use elisions

French has the most extensive use of elisions and is the primary target language for this filter in Lucene-based tooling. The prefixes l', d', j', m', t', s', c', n', qu' all appear in standard French.

Other languages exhibit similar patterns:

  • Italian — elided articles and prepositions: dell’, nell’, sull’, all’, dall’, l’.
  • Catalan — l’, d’, m’, t’, s’, n’.
  • Occitan — similar to Catalan; l’, d’.
  • Irish — elided forms such as d’, m’, b’ appear before vowel-initial words.

Each language requires its own configured prefix list; no universal list covers all cases correctly.

How it works

The filter holds a configured set of prefix strings. For each incoming token it checks whether the token text starts with any listed prefix, performing a case-insensitive comparison. If a match is found, the prefix — including the apostrophe — is stripped and the remainder of the token text replaces the original. Position, offset, and type metadata are preserved; the start offset still points to the beginning of the original input, so highlighting can underline the full l’arbre span even though the indexed term is arbre.

If no prefix matches, the token passes through unchanged.
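The matching logic above can be sketched in a few lines of Python. This is a simplified model of the behaviour described, not Lucene's implementation; the function name and the (abbreviated) French prefix list are illustrative, and token metadata is not modelled:

```python
# Simplified model of an elision filter (illustrative, not Lucene's code).
FRENCH_ARTICLES = {"l", "d", "j", "m", "t", "s", "c", "n", "qu"}  # stems, no apostrophe

def elide(token: str, articles=FRENCH_ARTICLES) -> str:
    """Strip an elided prefix (stem + apostrophe) from the front of a token."""
    for sep in ("'", "\u2019"):             # straight and curly apostrophes
        idx = token.find(sep)
        if idx != -1:
            stem = token[:idx].lower()      # case-insensitive comparison
            if stem in articles:
                return token[idx + 1:]      # drop the stem and the apostrophe
            break
    return token                            # no match: pass through unchanged

print(elide("l'arbre"))   # arbre
print(elide("L'eau"))     # eau
print(elide("liquides"))  # liquides
```

Note that only the token text changes; in a real analysis chain the position and offset attributes would be carried over untouched, as described above.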

The apostrophe variants problem

Natural language text uses at least three characters that look like apostrophes:

Character   Unicode point   Name                                   Source
'           U+0027          APOSTROPHE (straight)                  ASCII; keyboard default
’           U+2019          RIGHT SINGLE QUOTATION MARK (curly)    Word processors; smart-quote substitution
ʼ           U+02BC          MODIFIER LETTER APOSTROPHE             Linguistic transcription

Lucene’s ElisionFilter accepts both the straight apostrophe (U+0027) and the curly U+2019 as the prefix separator, so tokens that arrive with curly apostrophes still match. The modifier letter apostrophe (U+02BC) is not recognised as a separator and must be normalised upstream if it can occur in the input. The configured articles themselves are plain stems written without any apostrophe.

If text reaches the filter from a source that has not been through Unicode normalisation, it is safer to add a Unicode Normalisation step upstream to guarantee consistent character forms before the elision filter sees the tokens.
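A minimal version of such a normalisation step can be sketched as a simple character mapping applied before tokenisation (a sketch only; a production pipeline would more likely use an ICU normalisation char filter):

```python
# Map apostrophe look-alikes to the plain ASCII apostrophe (U+0027).
APOSTROPHE_VARIANTS = str.maketrans({
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK
    "\u02bc": "'",  # MODIFIER LETTER APOSTROPHE
})

def normalise_apostrophes(text: str) -> str:
    return text.translate(APOSTROPHE_VARIANTS)

print(normalise_apostrophes("l\u2019arbre"))  # l'arbre
```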

Example

Input sentence: “L’eau et d’autres liquides” (Water and other liquids)

Assume the standard tokeniser emits l'eau, et, d'autres, liquides as separate tokens.

Token in    Prefix matched    Token out
l'eau       l'                eau
et          (none)            et
d'autres    d'                autres
liquides    (none)            liquides

The indexed terms are eau, et, autres, liquides — a query for eau matches the document even though the source text contains l’eau.

Interaction with tokenisers

The elision filter only acts on tokens the tokeniser has already produced. The critical question is whether the tokeniser keeps an elided form as one token or splits it.

Standard tokeniser (Lucene). The standard tokeniser uses Unicode word-break rules (UAX #29). The apostrophe is treated as a mid-word character in certain contexts, so l’arbre is typically emitted as the single token l'arbre. The elision filter then strips the prefix, producing arbre. This is the intended pipeline.

Splitting on the apostrophe at tokenisation. If the tokeniser — or a char filter substitution — replaces the apostrophe with a space before tokenisation, the input l’arbre is split into two tokens: l and arbre. The elision filter receives them separately; l matches no configured prefix (there is no prefix that is the letter l without an apostrophe), so both tokens survive. You now have a spurious l in the index that acts like a stop word but was never declared as one.

Recommendation. Do not split on the apostrophe at tokenisation if you intend to use an elision filter. Let the standard tokeniser keep the elided form together, then strip the prefix with the filter. If apostrophe-splitting cannot be avoided, combine it with a Stop Word Filter that discards the bare article fragments (l, d, j, m, etc.) — a cruder approach that loses positional fidelity.
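The difference between the two pipelines can be seen in a small sketch (whitespace tokenisation stands in for the real tokeniser; the function names and the two-entry prefix list are illustrative):

```python
ARTICLES = {"l", "d"}  # stems, without the apostrophe

def keep_together_then_elide(text: str) -> list[str]:
    """Standard-tokeniser-style pipeline: keep l'arbre as one token, then strip."""
    out = []
    for tok in text.split():
        stem, sep, rest = tok.partition("'")
        out.append(rest if sep and stem.lower() in ARTICLES else tok)
    return out

def split_on_apostrophe(text: str) -> list[str]:
    """Char-filter-style pipeline: apostrophe replaced by a space before tokenising."""
    return text.replace("'", " ").split()

print(keep_together_then_elide("l'arbre"))  # ['arbre']
print(split_on_apostrophe("l'arbre"))       # ['l', 'arbre']  <- spurious token
```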

Configuration

Elasticsearch / OpenSearch

The elision token filter type accepts an articles array listing the prefix stems to strip. The articles_case parameter enables case-insensitive matching; it defaults to false, so set it to true explicitly when the input may be capitalised.

{
  "settings": {
    "analysis": {
      "filter": {
        "french_elision": {
          "type": "elision",
          "articles_case": true,
          "articles": ["l", "m", "t", "qu", "n", "s", "j", "d", "c",
                       "jusqu", "quoiqu", "lorsqu", "puisqu"]
        }
      },
      "analyzer": {
        "french": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["french_elision", "lowercase"]
        }
      }
    }
  }
}

The articles array lists prefix stems without the apostrophe — the filter compares the token text up to the apostrophe against these stems. The built-in french analyser in Elasticsearch already includes an elision filter pre-configured for French; the above is for custom or multilingual configurations.

Solr — ElisionFilterFactory

<fieldType name="text_fr" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ElisionFilterFactory"
            articles="lang/contractions_fr.txt"
            ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"
            words="lang/stopwords_fr.txt"
            ignoreCase="true"/>
  </analyzer>
</fieldType>

The articles attribute points to a file in Solr’s conf/ directory listing one prefix per line, without the apostrophe, UTF-8 encoded. Solr ships a default lang/contractions_fr.txt covering standard French elisions.
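A hand-written version of such a file covering the common French elisions might look like the following (illustrative only — the stock lang/contractions_fr.txt shipped with Solr is longer and includes comments):

```text
l
m
t
qu
n
s
j
d
c
jusqu
quoiqu
lorsqu
puisqu
```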

Italian example

{
  "filter": {
    "italian_elision": {
      "type": "elision",
      "articles_case": true,
      "articles": ["c", "l", "all", "dall", "dell", "nell", "sull",
                   "coll", "pell", "gl", "agl", "dagl", "degl",
                   "negl", "sugl", "un", "m", "t", "s", "v", "d"]
    }
  }
}
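Under the same simplified model of the matching logic, the Italian list behaves like this (a sketch; the stems mirror the configuration above, and the function name is illustrative):

```python
# Italian elision stems, mirroring the configuration above (no apostrophes).
ITALIAN_ARTICLES = {"c", "l", "all", "dall", "dell", "nell", "sull",
                    "coll", "pell", "gl", "agl", "dagl", "degl",
                    "negl", "sugl", "un", "m", "t", "s", "v", "d"}

def elide_it(token: str) -> str:
    """Strip a configured Italian prefix (stem + apostrophe) if present."""
    stem, sep, rest = token.partition("'")
    if sep and stem.lower() in ITALIAN_ARTICLES:
        return rest
    return token

print(elide_it("dell'acqua"))  # acqua
print(elide_it("L'italiano"))  # italiano
print(elide_it("pasta"))       # pasta
```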

When to use it vs splitting on the apostrophe

  • French or Italian field; want l'arbre → arbre — standard tokeniser + elision filter.
  • Multilingual field mixing eliding and non-eliding languages — elision filter with the union of the relevant prefix lists.
  • Apostrophe used for possessives (English 's) in the same field — the standard tokeniser handles possessives via its mid-word rules; the elision filter does not interfere.
  • Need exact storage of the elided form (autocomplete, faceting) — index a second sub-field without the elision filter; do not apply it to the raw keyword field.
  • Tokeniser already splits on the apostrophe and cannot be changed — use a stop word filter to drop bare article fragments; accept the loss of prefix-position fidelity.

The elision filter is almost always preferable to apostrophe-splitting for languages where elisions are grammatically systematic. Splitting produces extra tokens (noise); the filter produces the correct stem (signal) with no index-size penalty.

See also