Decimal Digit Filter

What it is

A decimal digit filter is a character- or token-level normalisation step that maps any Unicode character in the Nd (Decimal_Number) general category to its ASCII digit equivalent (U+0030–U+0039). After the filter runs, every digit in the token stream is expressed as one of 0 1 2 3 4 5 6 7 8 9, regardless of which script’s numeral system it originally came from.

Unicode defines dozens of distinct decimal digit systems across different scripts. Without normalisation, the string "2024" written in ASCII digits and the same number written in Eastern Arabic digits ("٢٠٢٤") share no codepoints at all: they produce different index terms and will not match each other.

How it works

Every Unicode codepoint in the Nd category carries a decimal numeric value (0–9) in the Unicode Character Database. The filter reads that property and substitutes the corresponding ASCII digit. No heuristics or hand-built tables are required: the mapping comes from the Unicode standard itself, and Unicode's stability policy keeps it fixed across versions.

The substitution is codepoint-for-codepoint: every Nd codepoint is replaced by exactly one ASCII digit, so the length of the token never changes.

[illustrate: step-by-step transform of the token “٢٠٢٤” (Eastern Arabic) — each codepoint shown with its Unicode name and Nd numeric value (2, 0, 2, 4), arrows leading to the ASCII substitution, output token “2024” assembling beneath; a second row shows Devanagari “२०२४” undergoing the same process]
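The per-codepoint lookup described above can be reproduced with Python's unicodedata module, which exposes the UCD properties the filter relies on:

```python
import unicodedata

# Walk each codepoint of the Eastern Arabic token, showing the two UCD
# properties the filter relies on: general category (Nd) and numeric value.
for ch in "٢٠٢٤":
    print(f"U+{ord(ch):04X}",
          unicodedata.name(ch),       # e.g. ARABIC-INDIC DIGIT TWO
          unicodedata.category(ch),   # Nd
          unicodedata.digit(ch))      # 2, 0, 2, 4

# Substituting each numeric value for its codepoint assembles the output.
ascii_token = "".join(str(unicodedata.digit(ch)) for ch in "٢٠٢٤")
print(ascii_token)  # → 2024
```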

Scripts with distinct decimal digit forms

The table below lists the most commonly encountered Nd digit blocks. All map cleanly to ASCII 0–9.

Script Digit range Example: 2024
ASCII (baseline) U+0030–U+0039 2024
Eastern Arabic (Arabic-Indic) U+0660–U+0669 ٢٠٢٤
Extended Arabic-Indic (Perso-Arabic) U+06F0–U+06F9 ۲۰۲۴
Devanagari U+0966–U+096F २०२४
Bengali U+09E6–U+09EF ২০২৪
Gujarati U+0AE6–U+0AEF ૨૦૨૪
Gurmukhi U+0A66–U+0A6F ੨੦੨੪
Tamil U+0BE6–U+0BEF ௨௦௨௪
Thai U+0E50–U+0E59 ๒๐๒๔
Tibetan U+0F20–U+0F29 ༢༠༢༤
Mongolian U+1810–U+1819 ᠒᠐᠒᠔
Fullwidth U+FF10–U+FF19 ２０２４

Unicode 15 defines 670+ Nd codepoints spread across more than sixty blocks. The filter handles all of them without per-script configuration.
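As a sanity check, the Nd inventory can be enumerated from the interpreter's bundled Unicode data (the exact count depends on the Unicode version the Python build ships):

```python
import sys
import unicodedata

# Enumerate every codepoint whose general category is Nd (Decimal_Number).
nd = [cp for cp in range(sys.maxunicode + 1)
      if unicodedata.category(chr(cp)) == "Nd"]

# Decimal digits are encoded in contiguous 0–9 runs, so the count is
# always a multiple of ten; the total varies with the Unicode version
# (unicodedata.unidata_version).
print(unicodedata.unidata_version, len(nd), len(nd) // 10, "runs")
```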

Example

A document field contains the string "Invoice ٢٠٢٤-١١" (year and month in Eastern Arabic digits). A user queries for "2024-11" in ASCII.

Stage Index term Query term Match?
Raw input ٢٠٢٤-١١ 2024-11 no
After decimal digit filter 2024-11 2024-11 yes

Without the filter, no match. With it, the terms are identical before the inverted index lookup.
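The whole example can be reproduced with a small self-contained helper (the same approach as the Python snippet under Configuration), applied identically to both sides:

```python
import unicodedata

def normalise_digits(text: str) -> str:
    # Replace every Nd codepoint with its ASCII digit; pass others through.
    return "".join(
        str(unicodedata.digit(ch)) if unicodedata.category(ch) == "Nd" else ch
        for ch in text
    )

index_term = normalise_digits("٢٠٢٤-١١")  # document side
query_term = normalise_digits("2024-11")  # query side (already ASCII, unchanged)
print(index_term, query_term, index_term == query_term)
```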

Configuration

Python (ingest or query time)

The cleanest cross-engine approach is to normalise digits before content reaches the index — at ingest time and again at query time:

import unicodedata

def normalise_digits(text: str) -> str:
    return "".join(
        str(unicodedata.digit(ch)) if unicodedata.category(ch) == "Nd" else ch
        for ch in text
    )

normalise_digits("Invoice ٢٠٢٤-١١")  # → "Invoice 2024-11"
normalise_digits("मूल्य: ३४.५०")       # → "मूल्य: 34.50"  (only the digits change; मूल्य, "price", is left intact)

unicodedata.digit(ch) returns the integer value (0–9) defined by the Unicode standard for any Nd character.

Elasticsearch / OpenSearch

Elasticsearch and OpenSearch ship this capability as the built-in decimal_digit token filter, which wraps Lucene's DecimalDigitFilter and maps every Nd character to its ASCII equivalent inside the analysis chain. Avoid reaching for pattern_replace instead: the Java regex character class \p{Nd} matches all Unicode decimal digits, but a pattern_replace character filter replaces each match with a static string and cannot substitute each digit with its value dynamically. Ingest-time normalisation (as above) remains worthwhile when the same content feeds systems other than the search engine.
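Where the built-in decimal_digit token filter is available, a minimal index definition might look like this (index, analyzer, and field names are illustrative):

```json
PUT /invoices
{
  "settings": {
    "analysis": {
      "analyzer": {
        "digit_normalised": {
          "tokenizer": "standard",
          "filter": ["decimal_digit", "lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "body": { "type": "text", "analyzer": "digit_normalised" }
    }
  }
}
```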

The ICU Analysis Plugin (analysis-icu) exposes icu_normalizer with NFKC transformations. NFKC maps fullwidth digits (U+FF10–U+FF19) to ASCII but does not cover Arabic-Indic or Devanagari digits — those blocks have no NFKC compatibility mapping to ASCII. ICU normalisation alone is therefore insufficient for full Nd coverage.
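The coverage gap is easy to verify with Python's standard library (NFKC here approximates the plugin's NFKC normalisation mode):

```python
import unicodedata

fullwidth = "２０２４"  # U+FF12 U+FF10 U+FF12 U+FF14
arabic = "٢٠٢٤"        # U+0662 U+0660 U+0662 U+0664

# NFKC has a compatibility mapping for fullwidth forms to ASCII...
print(unicodedata.normalize("NFKC", fullwidth))  # → 2024
# ...but none for Arabic-Indic digits, which pass through unchanged.
print(unicodedata.normalize("NFKC", arabic) == arabic)  # → True
```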

Solr — DecimalDigitFilterFactory

Solr exposes the same Lucene filter as solr.DecimalDigitFilterFactory, which performs the full Nd-to-ASCII mapping in the analysis chain. ICUFoldingFilterFactory (in the analysis-extras module) applies NFKC folding and therefore covers fullwidth digits only; it is a complement, not a substitute. If neither is an option, pre-process the field value at ingest.
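Assuming Lucene's DecimalDigitFilterFactory is available on the classpath, a field type sketch (type and analyzer names are illustrative):

```xml
<fieldType name="text_digits" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.DecimalDigitFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```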

Interaction with other normalisation steps

Apply the decimal digit filter before tokenisation when possible. A tokeniser that splits on punctuation may well pass ٢٠٢٤ through intact as a single token, but if the digit string forms part of a compound token (e.g. v١.٢), normalising digits before the tokeniser ensures the output token v1.2 is predictable.
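The normalise-then-tokenise ordering can be sketched with a toy tokeniser (the regex tokeniser and helper are illustrative, not any engine's actual implementation):

```python
import re
import unicodedata

def normalise_digits(text: str) -> str:
    # Map every Nd codepoint to ASCII before the text reaches the tokeniser.
    return "".join(
        str(unicodedata.digit(ch)) if unicodedata.category(ch) == "Nd" else ch
        for ch in text
    )

def tokenise(text: str) -> list[str]:
    # Toy tokeniser: keep runs of word characters and dots together.
    return re.findall(r"[\w.]+", text)

print(tokenise(normalise_digits("release v١.٢")))  # → ['release', 'v1.2']
```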

When used alongside ASCII Folding, note that Lucene’s ASCIIFoldingFilter does not fold Arabic-Indic or Devanagari digits — they lie outside its lookup table. The two filters address orthogonal character ranges: ASCII folding handles Latin diacritics; decimal digit normalisation handles numeral systems. Both are needed in a pipeline serving multilingual content.

When it is needed

Use this filter when:

  • Content is ingested from sources where users enter numbers in their native script — forms localised for Arabic, Hindi, Bengali, or Thai speakers frequently produce non-ASCII digits.
  • Documents are sourced from PDFs, OCR output, or copy-pasted content that may carry fullwidth digits (１２３) from East Asian typographic contexts.
  • You are building a multilingual search corpus and want "2024" to match regardless of which numeral system the author used.

Skip it when:

  • Your ingestion pipeline already guarantees ASCII-only digits (e.g. all values come from a structured database with numeric types).
  • Your search domain is a single locale where digit form is consistent throughout both documents and queries.

Apply the same normalisation symmetrically at index time and at query time. A digit filter on documents but not on queries — or vice versa — will silently fail to match the very terms it was meant to unify.

See also