Case Folding
What it is
Case folding is a normalisation step that maps text to a canonical caseless form, so that strings differing only in case compare equal. It goes beyond simple ASCII lowercasing: where A → a is obvious, case folding also covers subtleties such as the German ß → ss (sharp s expands to two characters) and, through optional Turkic mappings, the Turkish İ → i (dotted capital I) and I → ı (dotless lowercase i).
Many search pipelines default to simple lowercasing, which works well for English but can silently produce mismatches in other languages. Case folding is the Unicode-defined alternative designed for caseless matching. The default mappings are locale-independent; language-specific behaviour, such as the Turkic rules, is opt-in.
How it works
Unicode defines a case folding mapping (the CaseFolding.txt data file). Each code point is mapped to its folded equivalent — a string of one or more code points. The process:
- Iterate over the input string code point by code point.
- Look up each code point in the Unicode case fold table.
- Replace it with the mapped sequence (which may be longer than the input).
- Concatenate results to form the output string.
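The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a full implementation: FOLD_TABLE is a hypothetical hand-written excerpt covering only A to Z plus three entries, whereas a real implementation would load the complete mapping from CaseFolding.txt.

```python
# Hypothetical excerpt of the full case folding table (not the complete data).
FOLD_TABLE = {chr(cp): chr(cp + 32) for cp in range(ord("A"), ord("Z") + 1)}
FOLD_TABLE.update({
    "\u00DF": "ss",       # ß, LATIN SMALL LETTER SHARP S
    "\uFB01": "fi",       # LATIN SMALL LIGATURE FI
    "\u0130": "i\u0307",  # İ folds to i + COMBINING DOT ABOVE by default
})


def fold(text, table=FOLD_TABLE):
    """Fold text code point by code point using the given table."""
    pieces = []
    for ch in text:  # a Python str iterates by code point
        # Unmapped code points pass through unchanged.
        pieces.append(table.get(ch, ch))
    return "".join(pieces)


print(fold("Straße"))       # strasse
print(fold("HELLO"))        # hello
print(len(fold("\u00DF")))  # 2
```

For these samples the result agrees with Python's built-in str.casefold(), which applies the default full folding.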
The Unicode standard distinguishes two variants:
- Simple case folding: one-to-one mapping, always produces the same number of code points. Misses the ß → ss expansion.
- Full case folding: one-to-many mapping, handles all edge cases. Preferred for search.
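The practical difference between the two variants can be seen in a small sketch. Python's standard library exposes only full folding (str.casefold()), so approx_simple_fold below is a hypothetical stand-in that keeps a mapping only when it is a single code point. It approximates simple folding but does not reproduce the simple entries of CaseFolding.txt exactly.

```python
def approx_simple_fold(text):
    """Approximate simple folding: skip any mapping that would expand."""
    out = []
    for ch in text:
        folded = ch.casefold()
        out.append(folded if len(folded) == 1 else ch)
    return "".join(out)


word = "Straße"
simple = approx_simple_fold(word)
full = word.casefold()

print(simple, len(simple))  # straße 6
print(full, len(full))      # strasse 7

# Only full folding lets the query STRASSE match the indexed word.
print("STRASSE".casefold() == full)             # True
print(approx_simple_fold("STRASSE") == simple)  # False
```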
Example
| Input | Simple lowercase | Full case fold |
|---|---|---|
| İstanbul (Turkish) | i̇stanbul (i followed by a combining dot above) | istanbul (with the Turkic mappings) |
| Straße (German) | straße | strasse |
| HELLO | hello | hello |
| ﬁ (fi ligature, U+FB01) | ﬁ (unchanged) | fi |
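In the Turkish case, lowercasing İ (U+0130, Latin Capital Letter I With Dot Above) outside a Turkish locale produces i̇, that is, i followed by U+0307 (Combining Dot Above), a two-code-point sequence that won’t match plain i. The default full case folding yields the same sequence; it is the Turkic mapping that folds İ to plain i.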
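As a rough check of these examples, the snippet below uses Python's built-in str.lower() and str.casefold(). Python applies the default, locale-independent mappings, so it does not apply the Turkic rule for İ. The helper caseless_equal is a hypothetical name introduced for illustration.

```python
# "\uFB01" is the fi ligature; "\u0130stanbul" is İstanbul.
samples = ["Straße", "HELLO", "\uFB01", "\u0130stanbul"]

for s in samples:
    print(repr(s), "->", repr(s.lower()), "|", repr(s.casefold()))


def caseless_equal(a, b):
    """Compare two strings after folding both sides."""
    return a.casefold() == b.casefold()


print(caseless_equal("STRASSE", "Straße"))     # True
print("STRASSE".lower() == "Straße".lower())   # False
print("\u0130".lower() == "i\u0307")           # True
```

Folding both the indexed text and the query with the same function is what makes the comparison symmetric.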
Variants and history
Unicode case folding was formalised in the Unicode Standard and is defined in CaseFolding.txt, published with each Unicode release.
Two additional sub-variants exist:
- Turkic case folding (T mapping): an optional, language-specific set of mappings for Turkic languages such as Turkish and Azerbaijani, which folds I to ı and İ to i.
- Special casing: defined in SpecialCasing.txt, which covers title-casing and language-specific case conversion rules beyond simple folding.
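A minimal sketch of the Turkic variant, assuming Python, which has no built-in Turkic folding. The hypothetical helper turkic_fold remaps I and İ first and then applies the default full folding to everything else.

```python
# The two Turkic mappings: I -> ı (U+0131) and İ (U+0130) -> i.
TURKIC_MAP = str.maketrans({"I": "\u0131", "\u0130": "i"})


def turkic_fold(text):
    """Apply the Turkic overrides, then the default full case folding."""
    return text.translate(TURKIC_MAP).casefold()


print(turkic_fold("\u0130stanbul"))  # istanbul
print(turkic_fold("ISPARTA"))        # ısparta
print("ISPARTA".casefold())          # isparta (default folding)
```

Because the two results differ for plain I, Turkic folding should be enabled only for content known to be in a Turkic language.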
ICU (International Components for Unicode) provides a production-ready implementation used by Elasticsearch’s ICU Analysis plugin and Apache Solr.
When to use it
Use full Unicode case folding whenever your content includes non-English text. For purely ASCII content, simple lowercasing is equivalent and faster.
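The ASCII equivalence can be checked directly. In the sketch below, normalise_case is a hypothetical helper that takes the simple path only when the whole string is ASCII; str.isascii() requires Python 3.7 or later, and any speed difference should be measured rather than assumed.

```python
def normalise_case(text):
    """Lowercase pure-ASCII input, fully case fold everything else."""
    if text.isascii():
        return text.lower()
    return text.casefold()


# For every ASCII code point, lowercasing and case folding agree.
ascii_agrees = all(
    chr(cp).lower() == chr(cp).casefold() for cp in range(128)
)
print(ascii_agrees)              # True
print(normalise_case("HELLO"))   # hello
print(normalise_case("Straße"))  # strasse
```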
In Elasticsearch and OpenSearch, case folding is available via the icu_normalizer token filter from the ICU Analysis plugin, with its name parameter set to nfkc_cf (the default). Solr users can apply it through the ICUFoldingFilterFactory, which also strips accents and applies other foldings.
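A sketch of index settings that wire this filter into an analyzer, written as a Python dict so it can be serialised to JSON and sent with any HTTP client. The analyzer name folded_text and the filter name nfkc_cf_normalized are illustrative assumptions, and the layout should be checked against the plugin documentation for the version in use.

```python
import json

# Assumed names: "nfkc_cf_normalized" and "folded_text" are placeholders.
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "nfkc_cf_normalized": {
                    "type": "icu_normalizer",
                    "name": "nfkc_cf",
                }
            },
            "analyzer": {
                "folded_text": {
                    "type": "custom",
                    "tokenizer": "icu_tokenizer",
                    "filter": ["nfkc_cf_normalized"],
                }
            },
        }
    }
}

# Request body for creating an index with these settings.
print(json.dumps(index_settings, indent=2))
```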
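Avoid applying case folding and ASCII folding as independent steps: they overlap and may interact unexpectedly. Note that nfkc_cf combines NFKC normalisation with case folding but does not strip accents; where accent removal is also needed, the ICU folding filter (icu_folding in Elasticsearch) handles both in a single pass.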