Thai Tokeniser
What it is
A Thai tokeniser is a segmenter for Thai script, a writing system that places no spaces between words. Without spaces, generic boundary algorithms — including UAX #29 and the ICU Tokeniser — cannot reliably identify where one word ends and the next begins. A Thai tokeniser solves this by consulting a word-form dictionary, a statistical or neural model, or a combination of both.
Thai tokenisation is a non-trivial linguistic problem because the language allows long, uninterrupted character runs that can be segmented in multiple valid ways. The string ตากลม can mean either “exposed to the wind” (ตาก-ลม) or “round eyes” (ตา-กลม) depending on context — a problem dictionary lookup alone cannot resolve.
How it works
Most Thai tokenisers operate as a two-stage pipeline:
Stage 1 — Dictionary lookup. The input string is scanned against a lexicon of known Thai words using a maximal-match or minimal-match heuristic:
- Maximal match (longest match) — scan left to right, always consume the longest prefix that appears in the dictionary. Fast and reasonable as a baseline, but greedy and sensitive to lexicon coverage (see the sketch after this list).
- Minimal match (shortest match) — prefer the shortest valid prefix at each position. Produces more tokens; useful when unknown words are common.
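A minimal sketch of the greedy longest-match scan over a toy dictionary (the word list is illustrative rather than a real lexicon, and unknown characters fall back to single-character tokens, a common heuristic):

```python
# Greedy longest-match segmentation over a toy dictionary; production
# tokenisers use a trie so each prefix step is cheap, not a set scan.
TOY_DICT = {"ตา", "ตาก", "กลม", "ลม", "นักเรียน", "ไป", "โรงเรียน"}
MAX_WORD_LEN = max(len(w) for w in TOY_DICT)

def maximal_match(text: str) -> list[str]:
    tokens, i = [], 0
    while i < len(text):
        # try the longest candidate first, shrinking until a dictionary hit
        for j in range(min(len(text), i + MAX_WORD_LEN), i, -1):
            if text[i:j] in TOY_DICT:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character: emit as a 1-char token
            i += 1
    return tokens

print(maximal_match("นักเรียนไปโรงเรียน"))  # ['นักเรียน', 'ไป', 'โรงเรียน']
print(maximal_match("ตากลม"))               # greedy picks ['ตาก', 'ลม']
```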
Stage 2 — Disambiguation. Where dictionary lookup produces ambiguous segmentations, a statistical model — an n-gram language model, a CRF (Conditional Random Field), or a neural sequence tagger — scores the candidates and selects the most probable word sequence given surrounding context.
[illustrate: maximal-match scan of “ตากลม” — cursor advancing left to right, two candidate segmentation paths shown as a trie branch (ตาก+ลม vs ตา+กลม), statistical scorer selecting the higher-probability path based on context]
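To make the Stage 2 scoring concrete, here is a toy ranking of the two candidate segmentations of ตากลม; the probabilities are invented for illustration, and a real disambiguator conditions on surrounding context rather than bare word frequency:

```python
import math

# Invented unigram probabilities, for illustration only.
LOGP = {"ตา": math.log(0.004), "กลม": math.log(0.002),
        "ตาก": math.log(0.0005), "ลม": math.log(0.003)}

def score(candidate: list[str]) -> float:
    # log-probability of a segmentation under a unigram model
    return sum(LOGP[w] for w in candidate)

candidates = [["ตาก", "ลม"], ["ตา", "กลม"]]
print(max(candidates, key=score))  # ['ตา', 'กลม'] outscores the greedy choice
```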
The ICU tokeniser bundles a basic Thai dictionary sufficient for common words, but it does not include a disambiguation model. It will correctly segment straightforward runs but will fail on ambiguous or domain-specific vocabulary. Dedicated Thai tokenisers carry larger lexicons and contextual models.
Example
Input: "นักเรียนไปโรงเรียน"
English gloss: “The student goes to school”
| Approach | Tokens |
|---|---|
| UAX #29 (no dictionary) | นักเรียนไปโรงเรียน (1 token — entire string) |
| ICU (basic dictionary) | นักเรียน, ไป, โรงเรียน |
| Dedicated tokeniser (PyThaiNLP) | นักเรียน, ไป, โรงเรียน |
For this common sentence the outputs agree. The gap between ICU and a dedicated tokeniser widens with informal text, loanwords, named entities, and domain vocabulary (medical, legal, social media).
[illustrate: before/after segmentation of a raw Thai sentence — unsegmented character run on the left, word-boundary markers inserted between tokens on the right, each token coloured distinctly with a transliteration label below]
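The ICU row in the table above can be reproduced programmatically. A minimal sketch using the PyICU bindings (assuming PyICU is installed; the word BreakIterator applies ICU’s bundled Thai dictionary):

```python
from icu import BreakIterator, Locale

def icu_tokenize(text: str) -> list[str]:
    bi = BreakIterator.createWordInstance(Locale("th"))
    bi.setText(text)
    tokens, start = [], bi.first()
    for end in bi:  # iterating yields successive boundary offsets
        token = text[start:end]
        if not token.isspace():
            tokens.append(token)
        start = end
    return tokens

print(icu_tokenize("นักเรียนไปโรงเรียน"))  # ['นักเรียน', 'ไป', 'โรงเรียน']
```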
Variants and history
Rule-based / dictionary-only tokenisers were the earliest approach. Thai NLP research produced large public lexicons — the best-known is the BEST corpus and associated dictionary from NECTEC (National Electronics and Computer Technology Center, Thailand), which has been the standard benchmark since 2010.
CRF-based tokenisers (mid-2010s) treat tokenisation as a character-level sequence labelling task, where each character is tagged B (beginning of word), I (inside word), or E (end of word). This gives the model sensitivity to context without requiring an exhaustive lexicon.
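A minimal sketch of that labelling scheme: deriving per-character B/I/E tags from an already-segmented sentence, which is the form training data takes for a CRF (no external libraries; single-character words are tagged B under this three-label scheme):

```python
def bie_labels(words: list[str]) -> list[str]:
    # one tag per character: B(egin), I(nside), E(nd) of a word
    labels = []
    for w in words:
        if len(w) == 1:
            labels.append("B")  # single-character word
        else:
            labels.extend(["B"] + ["I"] * (len(w) - 2) + ["E"])
    return labels

words = ["นักเรียน", "ไป", "โรงเรียน"]
print(list(zip("".join(words), bie_labels(words))))  # (character, tag) pairs
```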
Neural tokenisers use character-level CNNs, BiLSTMs, or Transformer encoders to predict boundary positions. The DeepCut model (Kittinaradorn et al., 2019) demonstrated strong benchmark results using a convolutional character-level network. WangchanBERTa (Lowphansirikul et al., 2021), a BERT-based model pre-trained on Thai text, achieves state-of-the-art tokenisation as part of a broader NLP pipeline.
PyThaiNLP is the primary open-source library for Thai NLP in Python. It bundles several tokeniser backends switchable by a single parameter:
```python
from pythainlp.tokenize import word_tokenize

text = "นักเรียนไปโรงเรียน"
word_tokenize(text, engine="newmm")    # dictionary + maximal match
word_tokenize(text, engine="deepcut")  # neural (DeepCut)
word_tokenize(text, engine="attacut")  # neural (AttaCut, faster inference)
```
newmm (New Maximum Matching) is the default and covers most general-purpose use cases. deepcut and attacut give better recall on out-of-vocabulary words at higher compute cost.
When to use it
Use a dedicated Thai tokeniser when:
- Your content contains Thai text. The Unicode Tokeniser emits the entire Thai portion of a string as a single token; the ICU Tokeniser is a baseline improvement but not sufficient for recall-sensitive applications.
- You are building a search index over Thai documents. Without word-level segmentation, phrase queries, term frequency counting, and relevance scoring are all incorrect.
- Your domain contains specialised vocabulary (medical, legal, technical) — in this case, supplement the base tokeniser’s dictionary with domain-specific word lists (see the sketch below).
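With PyThaiNLP, one way to do this is to build a custom trie from the stock lexicon plus your own terms and pass it to word_tokenize via custom_dict; the added word below is a placeholder for a real domain list:

```python
from pythainlp.corpus.common import thai_words
from pythainlp.tokenize import word_tokenize
from pythainlp.util import dict_trie

domain_terms = {"โรคเบาหวาน"}  # placeholder domain term ("diabetes")
trie = dict_trie(dict_source=set(thai_words()) | domain_terms)

print(word_tokenize("ผู้ป่วยโรคเบาหวาน", engine="newmm", custom_dict=trie))
```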
Engine selection tradeoffs:
| Engine | Speed | OOV handling | Recommended when |
|---|---|---|---|
| Dictionary + maximal match (newmm) | Fast | Weak | General purpose, high-throughput indexing |
| CRF | Medium | Good | Balanced accuracy/speed |
| Neural (deepcut, attacut) | Slower | Strong | User-generated content, informal text, named entities |
For Elasticsearch and OpenSearch, the ICU Analysis plugin (analysis-icu) provides icu_tokenizer, which uses ICU’s built-in Thai dictionary — a practical production option when deploying a standalone Thai tokeniser is not feasible. Note that the plugin does not expose a user dictionary for Thai: segmentation can be customised with rule_files (custom ICU break rules), but extending the lexicon with domain terms is better handled by a dedicated tokeniser upstream of indexing.
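A minimal sketch of wiring icu_tokenizer into an index with the official Python client (elasticsearch-py 8.x style; the index name and field are hypothetical, and the analysis-icu plugin must already be installed on the cluster):

```python
# A sketch assuming a local cluster with the analysis-icu plugin;
# "thai-docs" and "body" are hypothetical names for illustration.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
es.indices.create(
    index="thai-docs",
    settings={
        "analysis": {
            "analyzer": {
                "thai_icu": {
                    "type": "custom",
                    "tokenizer": "icu_tokenizer",  # ICU's bundled Thai dictionary
                    "filter": ["lowercase"],
                }
            }
        }
    },
    mappings={"properties": {"body": {"type": "text", "analyzer": "thai_icu"}}},
)
```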