ASCII Folding
What it is
ASCII folding is a character-normalisation algorithm that replaces non-ASCII characters — accented letters, ligatures, and certain typographic symbols — with their nearest ASCII equivalents. "café" becomes "cafe", "naïve" becomes "naive", "Ångström" becomes "Angstrom". The result is a string composed entirely of characters in the 7-bit ASCII range (U+0000–U+007F).
The technique exists to close the gap between users who type with diacritics and users who do not, and between documents that were authored with full Unicode characters and search queries typed on keyboards without easy access to accent keys. It is one of several approaches to this problem — see Normalisation for the broader picture — and it is the oldest and most widely deployed in search engines built on the Lucene stack.
How it works
ASCII folding operates on a static lookup table: a mapping from specific non-ASCII codepoints (or small codepoint ranges) to one or more ASCII replacement characters. The algorithm walks the input string one codepoint at a time:
- If the codepoint is below U+0080 (already ASCII), emit it unchanged.
- If the codepoint appears in the lookup table, emit the mapped ASCII string (one or more characters).
- If the codepoint is not in the table, drop it or emit a configurable replacement — typically a space or nothing.
The “one or more characters” in step 2 matters. Some characters fold to two-character sequences: the ligature "æ" (U+00E6) folds to "ae", "œ" (U+0153) to "oe", "ß" (U+00DF) to "ss". The output string can therefore be longer than the input.
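The walk described above can be sketched in a few lines. The table here is a toy subset for illustration, not Lucene's full mapping; the `unmapped` parameter models the configurable fallback from the third rule:

```python
# Minimal sketch of table-driven ASCII folding.
# FOLD_TABLE is an illustrative subset, not Lucene's full mapping.
FOLD_TABLE = {
    "\u00c5": "A",   # Å
    "\u00e9": "e",   # é
    "\u00ef": "i",   # ï
    "\u00f6": "o",   # ö
    "\u00e6": "ae",  # æ ligature folds to two characters
}

def ascii_fold(text: str, unmapped: str = "") -> str:
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            out.append(ch)              # already ASCII: pass through
        elif ch in FOLD_TABLE:
            out.append(FOLD_TABLE[ch])  # mapped: emit one or more ASCII chars
        else:
            out.append(unmapped)        # unmapped: drop, or emit a replacement
    return "".join(out)

ascii_fold("Ångström")  # → "Angstrom"
ascii_fold("æsthetic")  # → "aesthetic", output longer than input
ascii_fold("москва")    # → "" with the default drop behaviour
```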
The Lucene ASCIIFoldingFilter, which backs Elasticsearch and OpenSearch’s asciifolding token filter, covers the following ranges explicitly:
| Input range | Examples | Fold target |
|---|---|---|
| U+00C0–U+00FF (Latin-1 Supplement) | À Á Â Ã Ä Å → A, æ → ae, ñ → n | ASCII Latin letters |
| U+0100–U+017E (Latin Extended-A) | Ā ā Ă ă Ą ą → A / a | ASCII Latin letters |
| U+0180–U+024F (Latin Extended-B) | ƀ → b, Ƈ → C | ASCII Latin letters |
| U+02B0–U+02FF (Spacing Modifier Letters) | ʼ → ' | Apostrophe / quote |
| U+2018–U+201F (typographic quotes) | “ ” → ", ‘ ’ → ' | Straight ASCII quotes |
| U+2100–U+214F (Letterlike Symbols) | ℃ → C, ™ → TM | ASCII sequences |
| U+FB00–U+FB06 (Latin ligatures) | ﬁ → fi, ﬄ → ffl | ASCII pairs/triples |
Characters outside these ranges — Cyrillic, Arabic, Hebrew, Greek, CJK, Devanagari, and most of the Unicode BMP above U+FB06 — are not in the table and are either dropped or passed through depending on the implementation’s fallback behaviour.
[illustrate: step-by-step transform of “Ångström” — character-by-character table lookup showing Å→A, n→n (pass-through), g→g, s→s, t→t, r→r, ö→o, m→m, with the lookup table entries highlighted for the two folded characters and the output string assembling beneath]
Example
Input tokens after lowercasing: ["naïve", "café", "æsthetic", "москва"]
| Token | Fold result | Note |
|---|---|---|
| naïve | naive | ï (U+00EF) → i |
| café | cafe | é (U+00E9) → e |
| æsthetic | aesthetic | æ (U+00E6) → ae; output is longer |
| москва | (dropped or empty) | Cyrillic — not in table |
A query for "naive" now matches a document containing "naïve" because both sides pass through the same filter before indexing and at query time. The Cyrillic token for Moscow either disappears from the index entirely or passes through un-folded, depending on whether the token filter is configured to preserve_original.
[illustrate: before/after showing four tokens on the left — “naïve”, “café”, “æsthetic”, “москва” — with arrows to their folded forms on the right, the Cyrillic token ending at a red “dropped” label, the æ→ae expansion annotated as “1 char → 2 chars”]
Variants and history
ASCII folding is an informal technique with no single specification. Implementations vary in which codepoints they cover and what they do with unrecognised characters.
Lucene ASCIIFoldingFilter (2008–present) is the de-facto standard in the JVM search world. It covers approximately 460 codepoints across the Latin, Latin Extended, and a handful of symbol blocks. Elasticsearch and OpenSearch expose it as the asciifolding token filter; the preserve_original: true option emits both the folded and the original token, letting the index match either.
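In Elasticsearch or OpenSearch, the filter slots into an analyzer definition as index settings. A sketch (the filter and analyzer names here are placeholders):

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "my_ascii_folding": {
          "type": "asciifolding",
          "preserve_original": true
        }
      },
      "analyzer": {
        "folded": {
          "tokenizer": "standard",
          "filter": ["lowercase", "my_ascii_folding"]
        }
      }
    }
  }
}
```

With `preserve_original: true`, "naïve" indexes as both "naive" and "naïve"; the `_analyze` API is a convenient way to inspect the emitted tokens before committing to a mapping.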
Custom lookup tables are common in legacy search applications predating ICU. A hand-maintained CSV of (source codepoint, ASCII replacement) pairs was often enough for a product serving one or two Western European markets.
NFKC + diacritic stripping is the principled Unicode-native alternative (see below). It arrived later but is now preferred in new systems.
ASCII folding vs NFKC + diacritic stripping
Both techniques move "café" to "cafe", so they appear equivalent at first glance. The mechanisms — and the edge cases — differ significantly.
ASCII folding uses a hard-coded lookup table. It is fast, predictable, and easy to audit: you can read the table and know exactly what folds to what. It fails silently for any codepoint not in the table (Cyrillic, Arabic, Greek accents, Vietnamese tone marks, and so on), which becomes a correctness problem as content becomes multilingual.
NFKC normalisation followed by diacritic stripping works entirely within the Unicode standard:
- Apply Unicode normalisation in the decomposed compatibility form (NFKD): decompose characters into base letter + combining mark(s) and apply compatibility mappings (ligatures, fullwidth forms, fractions). NFKD rather than NFKC, because NFKC would re-compose the marks that the next step needs to see.
- Remove all combining characters in Unicode category Mn (Mark, Nonspacing).
- The result is a string of base letters with no attached diacritical marks.
```python
import unicodedata

def strip_accents(s: str) -> str:
    # Step 1: NFKD decomposition (decompose before stripping)
    nfkd = unicodedata.normalize("NFKD", s)
    # Step 2: drop combining (nonspacing mark) characters
    return "".join(c for c in nfkd if unicodedata.category(c) != "Mn")

strip_accents("naïve")     # → "naive"
strip_accents("café")      # → "cafe"
strip_accents("Ångström")  # → "Angstrom"
strip_accents("Ελλάδα")    # → "Ελλαδα" — Greek base letters preserved, tonos removed
```
The Unicode approach generalises correctly to Greek, Vietnamese, and any other script that uses combining diacritical marks. Arabic harakat and Hebrew vowel points are also combining marks (category Mn) and are removed by the same step, though the remaining base letters are still non-ASCII. It does not help at all with CJK, which has no base-letter + combining-mark decomposition to exploit.
Neither approach produces ASCII output for Cyrillic or Arabic — for those scripts, transliteration (a separate, lossy process) is the only way to produce ASCII output.
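The boundary is easy to check with the same NFKD-stripping helper, re-defined here so the snippet stands alone:

```python
import unicodedata

def strip_accents(s: str) -> str:
    # NFKD decomposition, then drop nonspacing marks (category Mn)
    nfkd = unicodedata.normalize("NFKD", s)
    return "".join(c for c in nfkd if unicodedata.category(c) != "Mn")

strip_accents("Việt Nam")  # → "Viet Nam", Vietnamese tone marks are combining marks
strip_accents("москва")    # → "москва", Cyrillic base letters pass through, still non-ASCII
```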
[illustrate: side-by-side comparison of ASCII folding vs NFKC+strip applied to five inputs — “café”, “naïve”, “Ångström”, “Ελλάδα” (Greek), “москва” (Cyrillic) — with green/red cells showing where each method succeeds or silently fails]
When to use it
Use ASCII folding when you are building on the Lucene stack (Elasticsearch, OpenSearch, Solr) and your content is predominantly Western European. The asciifolding token filter is a one-line addition to an analysis chain and requires no Unicode expertise to configure correctly.
Prefer NFKC + diacritic stripping when you are building a new pipeline in Python, Rust, Go, or any language with good Unicode support, or when your content includes Greek, Vietnamese, or other diacritic-using scripts beyond the Latin Extended blocks. It is more principled, covers more of Unicode, and does not require maintaining a lookup table.
Enable preserve_original: true (Lucene) or emit both folded and unfolded tokens whenever your corpus contains accented proper nouns that carry identity meaning: "Björk" and "Bjork" should both match, but a search for "Björk" should rank the exact-accent match higher. Preserving the original token lets a relevance model weight exact matches above folded matches.
Be explicit about the precision tradeoff. ASCII folding increases recall — more queries find more documents — but it collapses distinctions that are semantically real in some domains:
- Spanish: "si" (if) vs "sí" (yes)
- Danish: "for" (for) vs "før" (before)
- Proper names: "Müller" and "Muller" are different German surnames
- Place names: "São Paulo" and "Sao Paulo" refer to the same city, but "Côte d'Ivoire" and "Cote d'Ivoire" may not resolve the same way in a geographic database
In legal, medical, or name-matching domains, audit whether the fold is safe before enabling it globally.
Apply identically at index time and query time. A filter applied to documents but not to queries — or vice versa — produces silent mismatches. Every token filter in an analysis chain must be mirrored on both sides.