ICU Tokeniser
What it is
The ICU tokeniser uses the ICU4J library’s BreakIterator to locate word boundaries. Like the Unicode Tokeniser, it implements UAX #29 word-break rules. The distinction is what happens next: ICU supplies locale-aware dictionary segmenters for scripts that UAX #29’s boundary algorithm cannot segment at the word level on its own.
In Lucene-based search engines (Elasticsearch, OpenSearch, Solr), the ICU tokeniser is exposed via ICUTokenizerFactory — an alternative to the Standard Tokeniser that calls directly into ICU4J rather than using Lucene’s JFlex-generated UAX #29 implementation.
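A rough sketch of what that factory wires up, assuming a recent Lucene release where the underlying class is org.apache.lucene.analysis.icu.segmentation.ICUTokenizer and the lucene-analysis-icu module (lucene-analyzers-icu in older versions) is on the classpath:

```java
import java.io.StringReader;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class IcuTokenizerSketch {
    public static void main(String[] args) throws Exception {
        // The no-argument constructor uses ICU's default configuration:
        // UAX #29 word-break rules plus the built-in dictionary segmenters.
        try (ICUTokenizer tokenizer = new ICUTokenizer()) {
            tokenizer.setReader(new StringReader("東京タワーと Thai: นักเรียน and Zürich"));
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
                System.out.println(term.toString());
            }
            tokenizer.end();
        }
    }
}
```

The factory essentially constructs this tokeniser for the analysis chain, applying custom rule files if any are configured.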
How it works
For most scripts — Latin, Cyrillic, Arabic, Devanagari — ICU’s boundary decisions are identical to those of the Standard Tokeniser: UAX #29 rules apply, and text is split at the usual word-break positions.
The key addition is a built-in dictionary segmentation layer that fires for scripts where characters are written without spaces between words:
- Thai — ICU includes a Thai word-break dictionary. A run of Thai characters is passed to a dictionary lookup that finds the most probable segmentation rather than emitting the whole run as one token.
- CJK (Chinese, Japanese, Korean) — ICU applies dictionary-based segmentation to Han, Hiragana, Katakana, and Hangul runs. The dictionary is generic and lacks the depth of a dedicated analyser (Kuromoji for Japanese, Nori for Korean), but it segments where UAX #29 alone would not (see the sketch after this list).
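Because the dictionary layer lives in ICU4J itself, it can be observed without any Lucene machinery. A minimal sketch using com.ibm.icu.text.BreakIterator follows; the exact splits depend on the ICU version and the dictionaries it bundles:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import com.ibm.icu.text.BreakIterator;

public class DictionarySegmentationSketch {
    static List<String> words(String text) {
        // The word instance applies UAX #29 rules and falls back to
        // dictionary-based segmentation for Thai and CJK runs.
        BreakIterator bi = BreakIterator.getWordInstance(Locale.ROOT);
        bi.setText(text);
        List<String> out = new ArrayList<>();
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            String segment = text.substring(start, end);
            if (!segment.trim().isEmpty()) {   // skip whitespace-only segments
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("นักเรียน"));       // Thai run, segmented via the Thai dictionary
        System.out.println(words("東京タワーと"));   // CJK run, segmented via the CJ dictionary
    }
}
```

Each unspaced run comes back as multiple segments rather than one undifferentiated token, which is the behaviour the Lucene factory exposes.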
Beyond the built-in dictionaries, ICU allows custom rule files written in ICU Rule-Based Break Iterator (RBBI) syntax. These override or extend the default behaviour for a specific script — useful when a domain has unusual tokenisation requirements (e.g. splitting on middle dots in scientific notation, or preserving hyphenated compound words in German).
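For illustration only, a toy rule set can be compiled at runtime with ICU4J's RuleBasedBreakIterator. The rules below are loosely adapted from the rule-status examples in the ICU user guide; real word-break rule files are far larger, and in Solr they would be referenced through the rulefiles parameter rather than embedded in code:

```java
import com.ibm.icu.text.RuleBasedBreakIterator;

public class CustomRulesSketch {
    // Toy rules: letter runs get status tag 200, digit runs get 100;
    // the final "!.*;" is a catch-all reverse rule carried over from the ICU examples.
    private static final String RULES =
          "$Letters = [:L:];\n"
        + "$Numbers = [:N:];\n"
        + "$Letters+ {200};\n"
        + "$Numbers+ {100};\n"
        + "!.*;\n";

    public static void main(String[] args) {
        RuleBasedBreakIterator bi = new RuleBasedBreakIterator(RULES);
        String text = "Zürich 2024";
        bi.setText(text);
        int start = bi.first();
        for (int end = bi.next(); end != RuleBasedBreakIterator.DONE; start = end, end = bi.next()) {
            System.out.println("'" + text.substring(start, end) + "'  status=" + bi.getRuleStatus());
        }
    }
}
```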
Example
Input: "東京タワーと Thai: นักเรียน and Zürich"
| Tokeniser | Thai run | CJK run |
|---|---|---|
| Standard (UAX #29) | นักเรียน (1 token) | 東, 京, タワー, と (Han split per character, Katakana kept as a run) |
| ICU | นัก, เรียน | 東京, タワー, と |
Latin and accented characters (Zürich) tokenise identically in both — ICU’s dictionary layer does not affect scripts that UAX #29 handles well.
When to use it
Prefer the ICU tokeniser over the Standard Tokeniser when:
- Your index contains Thai text. The Standard Tokeniser emits each unspaced Thai run as a single token; the ICU tokeniser is the minimum viable choice.
- Your index contains CJK text and you do not want to configure a dedicated per-language analyser. ICU’s generic dictionary is a reasonable baseline; switch to Kuromoji or Nori when recall and precision for those languages become a priority.
- You need to deploy a single analyser across many scripts and want dictionary segmentation for unspaced scripts without separate analyser branches per language.
- You require custom RBBI rules to handle domain-specific tokenisation edge cases not solvable with filter chains.
The Standard Tokeniser is sufficient when:
- All content is in Latin-script languages. Lucene’s JFlex implementation and ICU4J produce equivalent results for these scripts, and the Standard Tokeniser carries no extra dependency.
- You are already using Kuromoji, Nori, or the Smart Chinese analyser — these bundle their own segmenters, and a second CJK-aware tokeniser layer adds no value.
In Solr, the factory class is solr.ICUTokenizerFactory. No configuration is required for default behaviour; custom rule files are supplied via the rulefiles parameter:
```xml
<tokenizer class="solr.ICUTokenizerFactory"/>
```