CJK Tokeniser

What it is

A CJK tokeniser handles text written in Chinese, Japanese, or Korean, scripts that defeat simple whitespace-based splitting. Chinese and Japanese are written without spaces between words: a sentence in Mandarin or a paragraph in Japanese arrives as an unbroken run of characters, and a tokeniser must decide where one meaningful unit ends and the next begins. Korean does use spaces, but each space-delimited block bundles a content word with attached particles and endings, so whitespace splitting alone still yields poor tokens.

The term “CJK tokeniser” covers a range of approaches from the deliberately simple (emit every character as its own token) to the linguistically sophisticated (dictionary lookup combined with a statistical or neural segmenter). Which approach is appropriate depends on the language, the application, and the acceptable latency and complexity budget.

How it works

Character-level splitting (the CJK bigram approach)

The simplest strategy treats every CJK character as an individual token. A single Han character often carries a complete concept, so single-character tokens are meaningful units. From a search perspective, this approach is forgiving: a query term will always match if its characters appear anywhere in a document, and there is no segmentation error to worry about.

Lucene’s built-in CJK Analyser extends this with bigram tokenisation: rather than emitting isolated characters, it generates overlapping two-character windows across every CJK run.

This guarantees that any two adjacent characters that form a word will appear together as a token in the index, without requiring a dictionary. Phrase matching and proximity queries work reasonably well because word-spanning bigrams are always present.
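
To make the windowing concrete, here is a minimal, dependency-free sketch of the idea (not Lucene's implementation; the class and method names are illustrative): it walks a run of CJK characters and emits every overlapping two-character window, falling back to a single-character token when the run is only one character long.

import java.util.ArrayList;
import java.util.List;

public class BigramSketch {

    // Emit overlapping two-character windows over a run of CJK text.
    // A run shorter than two characters is emitted as a single token.
    static List<String> cjkBigrams(String run) {
        List<String> tokens = new ArrayList<>();
        // Work in code points so characters outside the BMP are not split in half.
        int[] cps = run.codePoints().toArray();
        if (cps.length == 1) {
            tokens.add(run);
            return tokens;
        }
        for (int i = 0; i + 1 < cps.length; i++) {
            tokens.add(new String(cps, i, 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "Beijing roast duck" -> [北京, 京烤, 烤鸭]
        System.out.println(cjkBigrams("北京烤鸭"));
    }
}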

The tradeoff is index bloat and imprecision. Bigrams cross word boundaries freely: 北京烤鸭 (Beijing roast duck) yields the token 京烤, a nonsense fragment spanning the boundary between 北京 and 烤鸭, and any query whose characters happen to straddle a boundary the same way will match. Character-level indexes therefore produce false positives that a dictionary-based segmenter avoids.

Dictionary and statistical segmentation

For Japanese and Korean, character-by-character splitting loses critical morphological information. Japanese uses three interlaced scripts (Hiragana, Katakana, Han), and word boundaries do not align with script boundaries. Korean (Hangul) is an alphabetic script where syllable blocks combine into words — single Hangul syllables are rarely meaningful in isolation.

Dedicated segmenters address this:

  • Kuromoji — a Japanese morphological analyser included in Lucene via the analysis-kuromoji module. It applies a dictionary (IPADIC or UniDic) and a Viterbi search over the lattice of all candidate word splits, selecting the minimum-cost segmentation path (a usage sketch follows the lattice illustration below).
  • Nori — Lucene’s Korean analyser (analysis-nori). It uses a similar dictionary-plus-Viterbi approach with the Mecab-ko-dic dictionary and additionally decomposes compound nouns into their constituent morphemes.
  • SmartChinese — Lucene’s Mandarin segmenter, using a Hidden Markov Model trained on a Chinese word corpus. Available in Solr and Elasticsearch via analysis-smartcn.

[illustrate: lattice diagram for the Japanese string “東京都” — all candidate segmentations shown as paths (東/京/都, 東京/都, 東/京都, 東京都), Viterbi scores annotating each edge, lowest-cost path highlighted]
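
As a concrete companion to the lattice illustration, the sketch below runs Kuromoji's analyser over 東京都 and prints the tokens it selects. It is a minimal example, assuming the Lucene Kuromoji analysis module is on the classpath; the exact tokens depend on the bundled dictionary version.

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class KuromojiDemo {
    public static void main(String[] args) throws IOException {
        // JapaneseAnalyzer wraps JapaneseTokenizer; its default configuration
        // uses the search-oriented segmentation mode.
        try (Analyzer analyzer = new JapaneseAnalyzer()) {
            TokenStream ts = analyzer.tokenStream("body", "東京都");
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString()); // e.g. 東京 / 都 (dictionary-dependent)
            }
            ts.end();
            ts.close();
        }
    }
}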

For Mandarin, large-lexicon dictionary methods (jieba, PKU segmenter) are also common outside the Lucene ecosystem.

Example

Input: "東京烤鸭研究" (a contrived mixed string: “Tokyo” + “roast duck research”)

Approach                              Tokens
CJK Bigram (Lucene CJK Analyser)      東京, 京烤, 烤鸭, 鸭研, 研究
SmartChinese                          東京, 烤鸭, 研究
ICU Tokeniser (generic dictionary)    東京, 烤鸭, 研究 (variable; dictionary-dependent)

The bigram approach indexes every adjacent pair, giving high recall at the cost of precision. The dictionary-based segmenters emit cleaner tokens aligned to word boundaries.
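
A minimal sketch of how the first two rows of the table could be reproduced, assuming the Lucene common and SmartChinese analysis modules are on the classpath (module names vary slightly across Lucene versions); SmartChinese output depends on its bundled model:

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class SegmentationComparison {

    // Print every token an analyser produces for the given text.
    static void dump(String label, Analyzer analyzer, String text) throws IOException {
        System.out.print(label + ": ");
        TokenStream ts = analyzer.tokenStream("body", text);
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.print(term.toString() + " ");
        }
        ts.end();
        ts.close();
        System.out.println();
    }

    public static void main(String[] args) throws IOException {
        String text = "東京烤鸭研究";
        try (Analyzer bigram = new CJKAnalyzer();
             Analyzer smartcn = new SmartChineseAnalyzer()) {
            dump("bigram", bigram, text);   // overlapping two-character windows
            dump("smartcn", smartcn, text); // dictionary/HMM word segmentation
        }
    }
}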

[illustrate: before/after of “東京烤鸭研究” — raw character run on the left; CJK bigram output in the centre showing overlapping tokens colour-coded by overlap; SmartChinese output on the right showing three clean word-boundary tokens]

Variants and history

Lucene CJK Analyser has been part of the Lucene core since the early 2000s. Its bigram approach was a pragmatic choice: it required no language-specific dictionary, worked reasonably well for Chinese and Japanese in a search context, and was trivially extensible to Korean. It remains available as CJKAnalyzer in Lucene and as cjk in Solr/Elasticsearch analysis chains.

Kuromoji (Atilika, 2011) brought morphologically aware Japanese segmentation into Lucene. It ships three segmentation modes: normal (standard dictionary segmentation), search (decompose longer compounds for better search recall), and extended (also emit unigrams for unknown words). The search mode is the standard choice for indexing.
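
A sketch contrasting the normal and search modes directly on JapaneseTokenizer. It assumes the three-argument constructor (user dictionary, discard-punctuation flag, mode) available in recent Lucene releases; the string 関西国際空港 (Kansai International Airport) is a common demonstration of compound decomposition, and the exact output is dictionary-dependent.

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.ja.JapaneseTokenizer.Mode;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class KuromojiModes {

    // Tokenize the text with the given segmentation mode and print the tokens.
    static void segment(Mode mode, String text) throws IOException {
        // Assumed constructor: (user dictionary, discardPunctuation, mode); null means no user dictionary.
        try (JapaneseTokenizer tokenizer = new JapaneseTokenizer(null, true, mode)) {
            tokenizer.setReader(new StringReader(text));
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            tokenizer.reset();
            System.out.print(mode + ": ");
            while (tokenizer.incrementToken()) {
                System.out.print(term.toString() + " ");
            }
            tokenizer.end();
            System.out.println();
        }
    }

    public static void main(String[] args) throws IOException {
        String text = "関西国際空港";       // "Kansai International Airport"
        segment(Mode.NORMAL, text);  // typically the whole compound as one token
        segment(Mode.SEARCH, text);  // typically decomposed: 関西 国際 空港
    }
}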

Nori replaced the older Korean analyser in Lucene 7.4 (2018). Its decompound_mode parameter controls whether compound nouns are split: none, discard (split and remove the compound), or mixed (retain both the compound and its parts).
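
A hedged sketch of mixed-mode decompounding with Nori's KoreanAnalyzer. Constructor signatures have shifted across Lucene releases; this assumes the four-argument form (user dictionary, decompound mode, stop tags, unknown-unigram flag), and the example output depends on the mecab-ko-dic version.

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ko.KoreanAnalyzer;
import org.apache.lucene.analysis.ko.KoreanPartOfSpeechStopFilter;
import org.apache.lucene.analysis.ko.KoreanTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class NoriDecompoundDemo {
    public static void main(String[] args) throws IOException {
        // Assumed constructor: (user dictionary, decompound mode, stop tags, output unknown unigrams).
        // MIXED keeps both the compound noun and its constituent morphemes.
        try (Analyzer analyzer = new KoreanAnalyzer(
                null,
                KoreanTokenizer.DecompoundMode.MIXED,
                KoreanPartOfSpeechStopFilter.DEFAULT_STOP_TAGS,
                false)) {
            TokenStream ts = analyzer.tokenStream("body", "삼성전자"); // "Samsung Electronics", a compound noun
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString()); // e.g. 삼성전자, 삼성, 전자 (dictionary-dependent)
            }
            ts.end();
            ts.close();
        }
    }
}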

SmartChinese uses a Hidden Markov Model whose segmentation approach derives from the ICTCLAS segmenter. For higher accuracy, many Mandarin NLP pipelines outside Lucene use jieba (Python) or HanLP, both of which offer multiple segmentation backends, including neural models.

When to use it

Choose by language and required precision:

Language             Baseline (low config)            Production recommended
Chinese (Mandarin)   Lucene CJK Analyser (bigrams)    analysis-smartcn or jieba
Japanese             ICU Tokeniser (generic dict)     analysis-kuromoji (mode: search)
Korean               ICU Tokeniser (generic dict)     analysis-nori (decompound: mixed)

Use the CJK bigram approach when:

  • You need a zero-configuration fallback for mixed-script content you cannot predict.
  • Recall matters more than precision — bigrams will find matches that dictionary segmenters miss if the word is absent from the lexicon.
  • You are indexing short strings (product codes, names) where dictionary segmentation errors are more harmful than false positives.

Use a dedicated segmenter (Kuromoji, Nori, SmartChinese) when:

  • Your content is predominantly one language and correct word boundaries matter for phrase queries or relevance scoring.
  • You need morphological features (part-of-speech filtering, base-form normalisation, compound decomposition) not available from a character-level approach.
  • Index size is a constraint — bigram indexes can be two to three times larger than word-segmented indexes for the same corpus.

In Solr, the standard configuration for Japanese:

<tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
<filter class="solr.JapaneseBaseFormFilterFactory"/>
<filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt"/>

See also