ICU Tokeniser

What it is

The ICU tokeniser uses the ICU4J library’s BreakIterator to locate word boundaries. Like the Unicode Tokeniser, it implements UAX #29 word-break rules. The distinction is what happens next: ICU supplies locale-aware dictionary segmenters for scripts that UAX #29’s boundary algorithm cannot segment at the word level on its own.

In Lucene-based search engines (Elasticsearch, OpenSearch, Solr), the ICU tokeniser is exposed via ICUTokenizerFactory — an alternative to the Standard Tokeniser that calls directly into ICU4J rather than using Lucene’s JFlex-generated UAX #29 implementation.
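
The boundary layer underneath can be exercised directly with ICU4J. A minimal sketch, assuming the icu4j jar is on the classpath (the class name and sample text are illustrative):

    import com.ibm.icu.text.BreakIterator;
    import com.ibm.icu.util.ULocale;

    public class IcuWordBreakDemo {                  // illustrative class name
        public static void main(String[] args) {
            String text = "นักเรียน and Zürich";

            // Word-break iterator: UAX #29 boundary rules plus ICU's built-in
            // dictionary segmenters for unspaced scripts such as Thai and CJK.
            BreakIterator words = BreakIterator.getWordInstance(ULocale.ROOT);
            words.setText(text);

            int start = words.first();
            for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
                String span = text.substring(start, end);
                // The iterator reports every boundary; skip whitespace-only spans.
                if (!span.trim().isEmpty()) {
                    System.out.println(span);        // expected: นัก, เรียน, and, Zürich
                }
            }
        }
    }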

How it works

For most scripts — Latin, Cyrillic, Arabic, Devanagari — ICU’s boundary decisions are identical to those of the Standard Tokeniser: UAX #29 rules apply, and text is split at the usual word-break positions.

The key addition is a built-in dictionary segmentation layer that fires for scripts where characters are written without spaces between words:

  • Thai — ICU includes a Thai word-break dictionary. A run of Thai characters is passed to a dictionary lookup that finds the most probable segmentation rather than emitting the whole run as one token.
  • CJK (Chinese, Japanese, Korean) — ICU applies dictionary-based segmentation to Han, Hiragana, Katakana, and Hangul runs. The dictionary is generic and lacks the depth of a dedicated analyser (Kuromoji for Japanese, Nori for Korean), but it segments where UAX #29 alone would not.

Beyond the built-in dictionaries, ICU allows custom rule files written in ICU Rule-Based Break Iterator (RBBI) syntax. These override or extend the default behaviour for a specific script — useful when a domain has unusual tokenisation requirements (e.g. splitting on middle dots in scientific notation, or preserving hyphenated compound words in German).
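
Rule files can be prototyped outside the search engine by compiling them directly with ICU4J's RuleBasedBreakIterator. A minimal sketch with a hypothetical rule set, deliberately much simpler than ICU's default word rules, just to show the syntax:

    import com.ibm.icu.text.BreakIterator;
    import com.ibm.icu.text.RuleBasedBreakIterator;

    public class RbbiRulesDemo {                     // illustrative class name
        public static void main(String[] args) {
            // Hypothetical rules for illustration: a run of letters or a run of
            // digits is one unit; everything else breaks character by character.
            String rules =
                    "$Letter = [\\p{L}];\n"
                  + "$Digit  = [\\p{Nd}];\n"
                  + "$Letter+;\n"
                  + "$Digit+;\n";

            String text = "pH7 buffer lot 42";
            BreakIterator bi = new RuleBasedBreakIterator(rules);
            bi.setText(text);

            int start = bi.first();
            for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
                System.out.println("[" + text.substring(start, end) + "]");
            }
        }
    }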

Example

Input: "東京タワーと Thai: นักเรียน and Zürich"

Tokeniser          | Thai run            | CJK run
Standard (UAX #29) | นักเรียน (1 token)    | 東京タワーと (1 token)
ICU                | นัก, เรียน            | 東京, タワー, と

Latin and accented characters (Zürich) tokenise identically in both — ICU’s dictionary layer does not affect scripts that UAX #29 handles well.
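
The table above can be reproduced outside a search engine with Lucene's ICUTokenizer, the class that ICUTokenizerFactory wraps. A sketch assuming the Lucene ICU analysis module (lucene-analysis-icu) is on the classpath; the class name is illustrative:

    import java.io.StringReader;

    import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class IcuTokenizerDemo {
        public static void main(String[] args) throws Exception {
            // Default configuration: UAX #29 everywhere, dictionary
            // segmentation for Thai and CJK runs.
            try (ICUTokenizer tokenizer = new ICUTokenizer()) {
                tokenizer.setReader(new StringReader("東京タワーと Thai: นักเรียน and Zürich"));
                CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);

                tokenizer.reset();
                while (tokenizer.incrementToken()) {
                    System.out.println(term.toString());
                }
                tokenizer.end();
            }
        }
    }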

When to use it

Prefer the ICU tokeniser over the Standard Tokeniser when:

  • Your index contains Thai text. The Standard Tokeniser emits entire Thai sentences as single tokens; ICU is the minimum viable choice.
  • Your index contains CJK text and you do not want to configure a dedicated per-language analyser. ICU’s generic dictionary is a reasonable baseline; switch to Kuromoji or Nori when recall and precision for those languages become a priority.
  • You need to deploy a single analyser across many scripts and want dictionary segmentation for unspaced scripts without separate analyser branches per language.
  • You require custom RBBI rules to handle domain-specific tokenisation edge cases not solvable with filter chains.

The Standard Tokeniser is sufficient when:

  • All content is in Latin-script languages. Lucene’s JFlex implementation and ICU4J produce equivalent results for these scripts, and the Standard Tokeniser carries no extra dependency.
  • You are already using Kuromoji, Nori, or the Smart Chinese analyser — these bundle their own segmenters, and a second CJK-aware tokeniser layer adds no value.

In Solr, the factory class is solr.ICUTokenizerFactory. No configuration is required for default behaviour; custom rule files are supplied via the rulefiles parameter as a comma-separated list of four-letter script code and rule-file path pairs (the filename below is illustrative):

<tokenizer class="solr.ICUTokenizerFactory"/>
<tokenizer class="solr.ICUTokenizerFactory" rulefiles="Latn:my-latin-rules.rbbi"/>

See also