ICU Tokeniser

What it is

The ICU tokeniser uses the ICU4J library’s BreakIterator to locate word boundaries. Like the Unicode Tokeniser, it implements UAX #29 word-break rules. The distinction is what happens next: ICU supplies locale-aware dictionary segmenters for scripts that UAX #29’s boundary algorithm cannot segment at the word level on its own.

In Lucene-based search engines (Elasticsearch, OpenSearch, Solr), the ICU tokeniser is exposed via ICUTokenizerFactory — an alternative to the Standard Tokeniser that calls directly into ICU4J rather than using Lucene’s JFlex-generated UAX #29 implementation.
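
The boundary layer underneath can be exercised directly with ICU4J. A minimal sketch, assuming the icu4j jar is on the classpath (the class name and sample text are illustrative):

    import com.ibm.icu.text.BreakIterator;
    import com.ibm.icu.util.ULocale;

    public class IcuWordBreakDemo {                  // illustrative class name
        public static void main(String[] args) {
            String text = "นักเรียน and Zürich";

            // Word-break iterator: UAX #29 boundary rules plus ICU's built-in
            // dictionary segmenters for unspaced scripts such as Thai and CJK.
            BreakIterator words = BreakIterator.getWordInstance(ULocale.ROOT);
            words.setText(text);

            int start = words.first();
            for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
                String span = text.substring(start, end);
                // The iterator reports every boundary; skip whitespace-only spans.
                if (!span.trim().isEmpty()) {
                    System.out.println(span);        // expected: นัก, เรียน, and, Zürich
                }
            }
        }
    }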

How it works

For most scripts — Latin, Cyrillic, Arabic, Devanagari — ICU’s boundary decisions are identical to those of the Standard Tokeniser: UAX #29 rules apply, and text is split at the usual word-break positions.

The key addition is a built-in dictionary segmentation layer that fires for scripts where characters are written without spaces between words:

  • Thai — ICU includes a Thai word-break dictionary. A run of Thai characters is passed to a dictionary lookup that finds the most probable segmentation rather than emitting the whole run as one token.
  • CJK (Chinese, Japanese, Korean) — ICU applies dictionary-based segmentation to Han, Hiragana, Katakana, and Hangul runs. The dictionary is generic and lacks the depth of a dedicated analyser (Kuromoji for Japanese, Nori for Korean), but it segments where UAX #29 alone would not.

Beyond the built-in dictionaries, ICU allows custom rule files written in ICU Rule-Based Break Iterator (RBBI) syntax. These override or extend the default behaviour for a specific script — useful when a domain has unusual tokenisation requirements (e.g. splitting on middle dots in scientific notation, or preserving hyphenated compound words in German).
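
Rule files can be prototyped outside the search engine by compiling them directly with ICU4J's RuleBasedBreakIterator. A minimal sketch with a hypothetical rule set, deliberately much simpler than ICU's default word rules, just to show the syntax:

    import com.ibm.icu.text.BreakIterator;
    import com.ibm.icu.text.RuleBasedBreakIterator;

    public class RbbiRulesDemo {                     // illustrative class name
        public static void main(String[] args) {
            // Hypothetical rules for illustration: a run of letters or a run of
            // digits is one unit; everything else breaks character by character.
            String rules =
                    "$Letter = [\\p{L}];\n"
                  + "$Digit  = [\\p{Nd}];\n"
                  + "$Letter+;\n"
                  + "$Digit+;\n";

            String text = "pH7 buffer lot 42";
            BreakIterator bi = new RuleBasedBreakIterator(rules);
            bi.setText(text);

            int start = bi.first();
            for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
                System.out.println("[" + text.substring(start, end) + "]");
            }
        }
    }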

Example

Input: "東京タワーと Thai: นักเรียน and Zürich"

Tokeniser          | Thai run            | CJK run
Standard (UAX #29) | นักเรียน (1 token)    | 東京タワーと (1 token)
ICU                | นัก, เรียน            | 東京, タワー, と

Latin and accented characters (Zürich) tokenise identically in both — ICU’s dictionary layer does not affect scripts that UAX #29 handles well.
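
The table above can be reproduced outside a search engine with Lucene's ICUTokenizer, the class that ICUTokenizerFactory wraps. A sketch assuming the Lucene ICU analysis module (lucene-analysis-icu) is on the classpath; the class name is illustrative:

    import java.io.StringReader;

    import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class IcuTokenizerDemo {
        public static void main(String[] args) throws Exception {
            // Default configuration: UAX #29 everywhere, dictionary
            // segmentation for Thai and CJK runs.
            try (ICUTokenizer tokenizer = new ICUTokenizer()) {
                tokenizer.setReader(new StringReader("東京タワーと Thai: นักเรียน and Zürich"));
                CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);

                tokenizer.reset();
                while (tokenizer.incrementToken()) {
                    System.out.println(term.toString());
                }
                tokenizer.end();
            }
        }
    }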

When to use it

Prefer the ICU tokeniser over the Standard Tokeniser when:

  • Your index contains Thai text. The Standard Tokeniser emits entire Thai sentences as single tokens; ICU is the minimum viable choice.
  • Your index contains CJK text and you do not want to configure a dedicated per-language analyser. ICU’s generic dictionary is a reasonable baseline; switch to Kuromoji or Nori when recall and precision for those languages become a priority.
  • You need to deploy a single analyser across many scripts and want dictionary segmentation for unspaced scripts without separate analyser branches per language.
  • You require custom RBBI rules to handle domain-specific tokenisation edge cases not solvable with filter chains.

The Standard Tokeniser is sufficient when:

  • All content is in Latin-script languages. Lucene’s JFlex implementation and ICU4J produce equivalent results for these scripts, and the Standard Tokeniser carries no extra dependency.
  • You are already using Kuromoji, Nori, or the Smart Chinese analyser — these bundle their own segmenters, and a second CJK-aware tokeniser layer adds no value.

In Solr, the factory class is solr.ICUTokenizerFactory. No configuration is required for default behaviour; custom rule files are supplied via the rulefiles parameter as a comma-separated list of four-letter script code and rule-file path pairs (the filename below is illustrative):

<tokenizer class="solr.ICUTokenizerFactory"/>
<tokenizer class="solr.ICUTokenizerFactory" rulefiles="Latn:my-latin-rules.rbbi"/>

See also