Unicode Normalisation

What it is

A single visible character in Unicode can be represented by more than one valid sequence of codepoints. The letter é can be stored as the single precomposed codepoint U+00E9 (LATIN SMALL LETTER E WITH ACUTE) or as the two-codepoint sequence U+0065 (LATIN SMALL LETTER E) followed by U+0301 (COMBINING ACUTE ACCENT). Both are legal Unicode. Both render identically. They are not equal when compared as byte strings.

Unicode normalisation is the family of algorithms that resolve this ambiguity by transforming any input into a single, predictable canonical form. The Unicode Standard, in Unicode Standard Annex #15 (UAX #15), defines four normalisation forms — NFC, NFD, NFKC, NFKD — each the product of two independent choices: which equivalences to apply and whether to compose or decompose the result.

Without normalisation, two documents that look byte-for-byte identical on screen can fail string equality tests, hash to different values, tokenise to different subword sequences, and produce different byte offsets for regex spans. These are silent bugs — nothing crashes, the wrong answer just comes back.

How it works

The two axes

Axis 1 — Equivalence type: canonical vs compatibility

Canonical equivalence means two sequences represent the same abstract character — they are indistinguishable in meaning and appearance. é as U+00E9 and e + U+0301 are canonically equivalent.

Compatibility equivalence is broader: two sequences represent characters that have the same semantic identity but may differ in appearance or typographic intent. The ligature ﬁ (U+FB01) is compatibility-equivalent to the two characters fi. The superscript ² (U+00B2) is compatibility-equivalent to the digit 2. The fullwidth letter Ａ (U+FF21) is compatibility-equivalent to A. Compatibility mappings flatten typographic or formatting distinctions that are preserved in canonical equivalences.

Axis 2 — Composition direction: decomposing vs composing

Decomposing forms expand precomposed characters into their constituent base letter plus combining mark sequence. é (U+00E9) → e + U+0301.

Composing forms first decompose — putting all combining sequences into a canonical order — then re-combine sequences back into precomposed characters wherever a precomposed form exists. e + U+0301 → é (U+00E9).

The four forms

Crossing the two axes produces the four normalisation forms:

              Canonical equivalences only    Canonical + compatibility equivalences
Composing     NFC                            NFKC
Decomposing   NFD                            NFKD

NFC — Canonical Decomposition, followed by Canonical Composition

The default form for text on the web and in most operating systems. NFC decomposes all characters into their canonical sequences, sorts combining marks into canonical order, then recomposes sequences into precomposed characters wherever a precomposed form is defined.

e + U+0301 → (decompose: already decomposed) → (compose) → é U+00E9

Most characters in Western European scripts have precomposed forms, so NFC text is compact. Sequences with no precomposed equivalent, such as many base-plus-diacritic combinations, remain decomposed even after the NFC composition step.
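
In Python, the composition step can be checked with the standard-library unicodedata module:

import unicodedata

composed = unicodedata.normalize("NFC", "e\u0301")   # e + combining acute
composed                 # 'é'
len(composed)            # 1 codepoint: U+00E9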

NFD — Canonical Decomposition

NFD decomposes every character into its canonical sequence of base letter plus combining marks, then sorts combining marks according to their canonical combining class — a per-character integer in the Unicode Character Database that governs the canonical ordering of simultaneous diacritics.

é (U+00E9) → e + U+0301
ộ (o with circumflex and dot below) → o + U+0323 (combining dot below, class 220) + U+0302 (combining circumflex, class 230)

Combining marks are sorted in ascending order of combining class. This canonical ordering ensures that any two canonically equivalent sequences, however the combining marks were originally ordered, arrive at the same NFD form and compare equal.
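
The decomposition is equally easy to inspect:

import unicodedata

[f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", "é")]
# ['U+0065', 'U+0301']

[f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", "\u1ed9")]
# ['U+006F', 'U+0323', 'U+0302']  (marks in ascending combining-class order)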

NFKC — Compatibility Decomposition, followed by Canonical Composition

NFKC applies compatibility mappings in addition to canonical decomposition, then recomposes. This is the most aggressive standardisation short of case folding.

ﬁ (U+FB01 LATIN SMALL LIGATURE FI) → fi
² (U+00B2 SUPERSCRIPT TWO) → 2
Ａ (U+FF21 FULLWIDTH LATIN CAPITAL LETTER A) → A
℃ (U+2103 DEGREE CELSIUS) → °C
e + U+0301 → é

NFKC is what most Unicode-aware search normalisation pipelines want: it compresses typographic variants that carry no semantic distinction in a search context while keeping the text in a composed, space-efficient form.
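
A few spot checks of the compatibility mappings named above:

import unicodedata

unicodedata.normalize("NFKC", "\ufb01")   # 'fi'
unicodedata.normalize("NFKC", "\u00b2")   # '2'
unicodedata.normalize("NFKC", "\uff21")   # 'A'
unicodedata.normalize("NFKC", "\u2103")   # '°C'
unicodedata.normalize("NFKC", "e\u0301")  # 'é'  (recomposed, one codepoint)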

NFKD — Compatibility Decomposition

NFKD applies all compatibility mappings and then fully decomposes, but does not recompose. The result is maximally expanded: every character is a base letter plus zero or more combining mark codepoints.

ﬁ → f + i
é → e + U+0301
ộ → o + U+0323 + U+0302

NFKD is the form used when you want to strip diacritics: decompose with NFKD, then filter out all codepoints whose Unicode General Category is Mn (Mark, Nonspacing). The ASCII Folding citation shows this pattern in Python.
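
A minimal sketch of that pattern in Python (the helper name here is illustrative):

import unicodedata

def strip_diacritics(text: str) -> str:
    # NFKD exposes the combining marks, category Mn filters them out,
    # NFC recomposes whatever is left into compact form
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)

strip_diacritics("résumé \ufb01led")   # 'resume filed'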

[illustrate: four-quadrant grid with NFC / NFD / NFKC / NFKD as labelled cells; each cell shows the same three inputs — “é” (precomposed), “e+U+0301” (decomposed e-acute), “fi” (fi ligature) — transforming to their output in that form, with arrows annotated “canonical only” vs “canonical + compat” on the horizontal axis and “compose” vs “decompose” on the vertical axis]

Canonical combining class and mark ordering

A subtlety that surfaces when building your own normalisation: Unicode does not sort combining marks alphabetically or by codepoint value. Each combining mark has a canonical combining class (CCC), a number from 0 to 254, assigned in the UCD. Marks are sorted in ascending CCC order. Characters with CCC 0 (called starters) block reordering: a mark is never moved across them, so a sequence with a starter in the middle is treated as two independent stacks.

This matters in practice when text is assembled programmatically — for example, when a system appends combining marks to base characters one at a time without going through a normalisation step. The resulting sequence may be semantically correct but not in canonical order, and will fail byte-level equality checks against the normalised form.
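
A sketch of what this looks like when the marks arrive out of canonical order:

import unicodedata

# Circumflex (class 230) appended before dot below (class 220), the "wrong" order
assembled = "o\u0302\u0323"
canonical = unicodedata.normalize("NFD", assembled)

[f"U+{ord(c):04X}" for c in canonical]   # ['U+006F', 'U+0323', 'U+0302']
unicodedata.combining("\u0323")          # 220
unicodedata.combining("\u0302")          # 230
assembled == canonical                   # False until the sequence is normalised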

[illustrate: step-by-step NFD of “ộ” — base “o” at left, then two combining marks arriving in wrong order (U+0302 then U+0323), canonical combining classes shown as numbers above each mark, reordering arrow swapping them to ascending class order, final canonical sequence highlighted]

Example

The same string, "re\u0301sume\u0301" (decomposed e-acutes), processed through all four forms:

Form         Codepoints                   Visible   Notes
Input (NFD)  r e U+0301 s u m e U+0301    résumé    8 codepoints
NFC          r U+00E9 s u m U+00E9        résumé    6 codepoints — é precomposed
NFD          r e U+0301 s u m e U+0301    résumé    8 codepoints — unchanged
NFKC         r U+00E9 s u m U+00E9        résumé    Same as NFC for this input
NFKD         r e U+0301 s u m e U+0301    résumé    Same as NFD for this input

Now add a compatibility character — "re\u0301sume\u0301 \ufb01led" (ending in "ﬁled", spelled with the fi ligature):

Form    Visible result    fi ligature
NFC     résumé ﬁled       preserved
NFD     résumé ﬁled       preserved
NFKC    résumé filed      expanded to fi
NFKD    résumé filed      expanded to fi

Only the K-forms flatten the ligature. A search index normalised to NFKC will match a query for "filed" against a document containing "ﬁled". An index normalised to NFC will not.
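
The tables above can be reproduced directly; the codepoint counts in the comments include the space:

import unicodedata

s = "re\u0301sume\u0301 \ufb01led"   # decomposed accents, fi ligature

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    out = unicodedata.normalize(form, s)
    print(form, len(out), "\ufb01" in out)

# NFC 11 True    accents composed, ligature preserved
# NFD 13 True    fully decomposed, ligature preserved
# NFKC 12 False  accents composed, ligature expanded
# NFKD 14 False  fully decomposed, ligature expanded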

[illustrate: before/after side-by-side showing the string “résumé filed” (with fi ligature visible) on the left; four output rows on the right, one per normalisation form, with the fi ligature cell highlighted green when expanded and orange when preserved; codepoint count shown in a badge on each row]

Failure modes

Unicode normalisation bugs are silent and environment-dependent. These are the patterns that actually hurt.

String equality and dictionary lookup

import unicodedata

a = "café"          # NFC: é as single codepoint U+00E9
b = "cafe\u0301"    # NFD: e + combining acute

a == b              # False — different byte sequences
len(a)              # 4
len(b)              # 5
unicodedata.normalize("NFC", b) == a   # True

# Dictionary lookup fails silently
d = {a: "precomposed"}
d.get(b)            # None — key not found

A user-supplied string and a database-sourced string can carry the same visible text but different normalisation forms. The dictionary miss produces no exception; the lookup just returns None.

Byte-level hashing and deduplication

Content fingerprinting pipelines that SHA-256 a document’s text before normalisation will assign different hashes to "café" (NFC) and "cafe\u0301" (NFD). Two documents with identical visible content will not be detected as duplicates. Normalise to NFC before hashing.
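
A sketch of the fix, with an illustrative helper name:

import hashlib
import unicodedata

def fingerprint(text: str) -> str:
    # Collapse equivalent forms to one canonical byte sequence before hashing
    canonical = unicodedata.normalize("NFC", text)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

raw_a = hashlib.sha256("café".encode("utf-8")).hexdigest()
raw_b = hashlib.sha256("cafe\u0301".encode("utf-8")).hexdigest()
raw_a == raw_b                              # False: same visible text, different hashes
fingerprint("café") == fingerprint("cafe\u0301")   # True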

Subword tokeniser vocabulary mismatches

BPE and WordPiece vocabularies are built from a fixed corpus. If the vocabulary was built from NFC text and an inference-time document arrives in NFD, the tokeniser may not find the expected vocabulary entry for a precomposed character and will fall back to [UNK] tokens or byte-level pieces. Many pretrained Hugging Face tokenisers apply a Unicode normalisation step internally (the exact form varies by model), but custom vocabularies built without such a step are a common source of elevated [UNK] rates.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# NFC input — é as single codepoint
tok.tokenize("café")         # ['ca', '##fe']  (after accent stripping)

# NFD input — e + combining acute
tok.tokenize("cafe\u0301")   # same output IF the tokeniser normalises internally
                              # different output if it does not

The safe practice: normalise to NFC at the document ingestion boundary, before any tokeniser sees the text.

Regex span offsets

Python’s re module operates on Unicode strings by codepoint position, not byte position. A regex that matches é as a single character will succeed against NFC but fail against NFD: in the NFD form the accent is a separate combining codepoint that \w does not cover, and every decomposed accent shifts the offsets of everything after it by one codepoint. Span-based annotation frameworks that slice the original string by regex match offsets will produce incorrect character positions if the normalisation form of the indexed string and the query string differ.
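
A small demonstration of both effects:

import re
import unicodedata

nfc = "café time"
nfd = unicodedata.normalize("NFD", nfc)   # 'cafe' + U+0301 + ' time'

re.match(r"\w+", nfc).group()   # 'café'  (é is one word-character codepoint)
re.match(r"\w+", nfd).group()   # 'cafe'  (the combining accent is not a \w character)

nfc.find("time")                # 5
nfd.find("time")                # 6  (every decomposed accent shifts later offsets)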

Invisible and zero-width characters

NFD decomposition does not remove zero-width characters — U+200B (zero-width space), U+FEFF (BOM / zero-width no-break space), U+200C/D (zero-width non-joiner/joiner). These characters have no decomposition mappings; they survive normalisation untouched. A string that looks empty but contains only a zero-width space will slip past an if not text: guard in Python (a non-empty string is truthy) and will survive NFC/NFD normalisation unchanged. Strip these characters explicitly as a separate step before or after normalisation.
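
A sketch of an explicit stripping step; the set of characters to remove is a starting point to adapt per pipeline:

import unicodedata

ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}   # ZWSP, ZWNJ, ZWJ, BOM

def clean_text(text: str) -> str:
    # Remove zero-width characters explicitly, then normalise
    visible = "".join(ch for ch in text if ch not in ZERO_WIDTH)
    return unicodedata.normalize("NFC", visible)

bool("\u200b")                                        # True: looks empty, still truthy
unicodedata.normalize("NFC", "\u200b") == "\u200b"    # True: survives normalisation
clean_text("ca\u200bfe\u0301")                        # 'café'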

Variants and history

Unicode normalisation was formalised in UAX #15, first published alongside Unicode 3.0, and is covered by a stability guarantee: for strings containing only assigned characters, a normalisation form will never produce a different output under a later Unicode version. Composition is additionally frozen at the Unicode 3.1 repertoire, so precomposed characters encoded after 3.1 are never produced by NFC, and the full stability policy has applied to all four forms since Unicode 4.1. The guarantee is intentional: a document normalised once does not need to be renormalised when the Unicode version is upgraded.

The four forms originate from two separate lines of Unicode work. Canonical equivalence and NFC/NFD were motivated by multilingual text interchange — ensuring that text from one system compares correctly with text from another, regardless of which encoding path was used to construct the string. Compatibility equivalence and NFKC/NFKD were motivated by legacy character set migration, absorbing characters from older standards (such as JIS X 0208’s fullwidth Latin block and various ISO 6937 combinations) that had functional equivalents in plain Unicode.

W3C guidance: the W3C character model recommends NFC for web content, and HTML and XML specifications recommend (but do not enforce) it. HTTP headers and URIs operate on bytes (percent-encoding), not Unicode normalisation forms — a source of confusion when comparing URLs that contain accented characters.

macOS HFS+ stores filenames in a form close to NFD (specifically, an Apple-modified NFD). Linux ext4 stores filenames as raw bytes. Copying a file with an accented name between macOS and Linux and comparing filenames programmatically will fail equality checks unless normalisation is applied first.

When to use it

Context                                     Recommended form
Text storage, API interchange, database     NFC
Full-text search index normalisation        NFKC
Diacritic stripping (intermediate step)     NFKD → strip Mn → NFC
Byte-level hashing / deduplication          NFC (before hashing)
Subword tokeniser input                     NFC (most tokenisers assume this)
HFS+ filename round-tripping (macOS)        NFD (platform-specific)

Default to NFC for all text storage and interchange. NFC is the W3C recommendation and the form produced by most keyboard input methods. It is compact, well-supported, and the correct target for string equality comparisons. Apply it at the point where text enters the system.

Use NFKC when building search indexes or ML features. NFKC collapses typographic variants (fullwidth, ligatures, superscripts, fractions) that NFC leaves intact and that users rarely distinguish from their plain equivalents.

Use NFKD as an intermediate step for diacritic stripping. Decompose with NFKD, remove all Mn-category codepoints, recompose with NFC if compact form is needed.

NFD is rarely the right storage or output form. Its main use is as an internal intermediate during processing, or when a specific platform requires it.

Normalisation is idempotent. Applying NFC twice produces the same result as applying it once. It is always safe to normalise again on read as a defensive measure.
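
For example:

import unicodedata

s = "cafe\u0301 \ufb01led"
once = unicodedata.normalize("NFKC", s)
unicodedata.normalize("NFKC", once) == once   # True: renormalising is a no-op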

See also