Thai Tokeniser

What it is

A Thai tokeniser is a segmenter for Thai script, a writing system that places no spaces between words. Without spaces, generic boundary algorithms — including UAX #29 and the ICU Tokeniser — cannot reliably identify where one word ends and the next begins. A Thai tokeniser solves this by consulting a word-form dictionary, a statistical or neural model, or a combination of both.

Thai tokenisation is a non-trivial linguistic problem because the language allows long, uninterrupted character runs that can be segmented in multiple valid ways. The string ตากลม can mean either “to take in the breeze” (ตาก-ลม, literally “expose to the wind”) or “round eyes” (ตา-กลม) depending on context — a problem dictionary lookup alone cannot resolve.

How it works

Most Thai tokenisers operate as a two-stage pipeline:

Stage 1 — Dictionary lookup. The input string is scanned against a lexicon of known Thai words using a maximal-match or minimal-match heuristic (a minimal code sketch of the maximal-match variant follows the list):

  • Maximal match (longest match) — scan left to right, always consume the longest prefix that appears in the dictionary. Fast and reasonable as a baseline, but greedy and sensitive to lexicon coverage.
  • Minimal match (shortest match) — prefer the shortest valid prefix at each position. Produces more tokens; useful when unknown words are common.
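
A minimal sketch of the maximal-match scan against a toy lexicon (the word list and single-character fallback for unknown material are illustrative, not PyThaiNLP's actual implementation):

LEXICON = {"ตา", "ตาก", "กลม", "ลม", "นักเรียน", "ไป", "โรงเรียน"}   # toy dictionary
MAX_WORD_LEN = max(len(w) for w in LEXICON)

def maximal_match(text):
    # Greedy longest-match scan, left to right.
    tokens, i = [], 0
    while i < len(text):
        # Try the longest possible prefix first, shrinking until a dictionary hit;
        # fall back to a single character for unknown material.
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in LEXICON or length == 1:
                tokens.append(candidate)
                i += length
                break
    return tokens

maximal_match("นักเรียนไปโรงเรียน")   # ['นักเรียน', 'ไป', 'โรงเรียน']
maximal_match("ตากลม")                # ['ตาก', 'ลม']: greedy, so context is never consulted

The greedy scan commits to ตาก and never revisits ตา-กลม; Stage 2 exists to repair exactly this kind of decision.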

Stage 2 — Disambiguation. Where dictionary lookup produces ambiguous segmentations, a statistical model — an n-gram language model, a CRF (Conditional Random Field), or a neural sequence tagger — scores the candidates and selects the most probable word sequence given surrounding context.
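
A sketch of the disambiguation step, with a toy unigram scorer standing in for a real n-gram, CRF, or neural model (the log-probability values below are invented for illustration):

# Toy unigram log-probabilities; a real system would use corpus statistics or a trained model.
LOG_PROB = {"ตา": -4.0, "กลม": -5.0, "ตาก": -7.0, "ลม": -4.5}

def score(segmentation):
    # Sum of per-word log-probabilities; unseen words get a heavy penalty.
    return sum(LOG_PROB.get(token, -15.0) for token in segmentation)

candidates = [["ตาก", "ลม"], ["ตา", "กลม"]]   # both splits of ตากลม are dictionary-valid
max(candidates, key=score)                    # ['ตา', 'กลม'] under these toy scores

Real scorers condition on neighbouring words rather than isolated frequencies, but the selection step is the same: rank the candidate segmentations and keep the most probable one.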

[illustrate: maximal-match scan of “ตากลม” — cursor advancing left to right, two candidate segmentation paths shown as a trie branch (ตาก+ลม vs ตา+กลม), statistical scorer selecting the higher-probability path based on context]

The ICU tokeniser bundles a basic Thai dictionary sufficient for common words, but it does not include a disambiguation model. It will correctly segment straightforward runs but will fail on ambiguous or domain-specific vocabulary. Dedicated Thai tokenisers carry larger lexicons and contextual models.
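
ICU's behaviour can be checked directly from Python. A sketch assuming the PyICU package is installed (the helper function is ours; the segmentation itself comes from ICU's word BreakIterator for the Thai locale):

from icu import BreakIterator, Locale

def icu_tokenize(text):
    # ICU's word BreakIterator consults its bundled Thai dictionary for the th locale.
    bi = BreakIterator.createWordInstance(Locale("th_TH"))
    bi.setText(text)
    tokens, start = [], bi.first()
    for end in bi:                       # boundary offsets, left to right
        piece = text[start:end]
        start = end
        if piece.strip():                # skip whitespace-only pieces
            tokens.append(piece)
    return tokens

icu_tokenize("นักเรียนไปโรงเรียน")   # expected: ['นักเรียน', 'ไป', 'โรงเรียน']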

Example

Input: "นักเรียนไปโรงเรียน" (English gloss: “The student goes to school”)

Approach                          Tokens
UAX #29 (no dictionary)           นักเรียนไปโรงเรียน (1 token — entire string)
ICU (basic dictionary)            นักเรียน, ไป, โรงเรียน
Dedicated tokeniser (PyThaiNLP)   นักเรียน, ไป, โรงเรียน

For this common sentence the outputs agree. The gap between ICU and a dedicated tokeniser widens with informal text, loanwords, named entities, and domain vocabulary (medical, legal, social media).

[illustrate: before/after segmentation of a raw Thai sentence — unsegmented character run on the left, word-boundary markers inserted between tokens on the right, each token coloured distinctly with a transliteration label below]

Variants and history

Rule-based / dictionary-only tokenisers were the earliest approach. Thai NLP research produced large public lexicons — the best-known is the BEST corpus and associated dictionary from NECTEC (National Electronics and Computer Technology Center, Thailand), which has been the standard benchmark since 2010.

CRF-based tokenisers (mid-2010s) treat tokenisation as a character-level sequence labelling task, where each character is tagged B (beginning of word), I (inside word), or E (end of word). This gives the model sensitivity to context without requiring an exhaustive lexicon.
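
Decoding such a tagging back into tokens is straightforward: start a new token at every B. A minimal sketch with hand-written tags standing in for model output:

def tags_to_tokens(text, tags):
    # Rebuild word tokens from per-character B/I/E labels: start a new token at B,
    # append to the current token otherwise.
    tokens = []
    for ch, tag in zip(text, tags):
        if tag == "B" or not tokens:
            tokens.append(ch)
        else:
            tokens[-1] += ch
    return tokens

tags_to_tokens("ตากลม", ["B", "E", "B", "I", "E"])   # ['ตา', 'กลม']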

Neural tokenisers use character-level CNNs, BiLSTMs, or Transformer encoders to predict boundary positions. The DeepCut model (Kittinaradorn et al., 2019) demonstrated strong benchmark results using a convolutional character-level network. WangchanBERTa (Lowphansirikul et al., 2021), a RoBERTa-based model pre-trained on large Thai corpora, achieves state-of-the-art results on Thai sequence- and token-labelling tasks and can serve as the backbone of a segmentation pipeline.

PyThaiNLP is the primary open-source library for Thai NLP in Python. It bundles several tokeniser backends switchable by a single parameter:

from pythainlp.tokenize import word_tokenize

text = "นักเรียนไปโรงเรียน"
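# Note: the deepcut and attacut engines require the deepcut / attacut packages to be installed separately.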

word_tokenize(text, engine="newmm")     # dictionary + maximal match
word_tokenize(text, engine="deepcut")   # neural (DeepCut)
word_tokenize(text, engine="attacut")   # neural (AttaCut, faster inference)

newmm (New Maximum Matching) is the default and covers most general-purpose use cases. deepcut and attacut give better recall on out-of-vocabulary words at higher compute cost.

When to use it

Use a dedicated Thai tokeniser when:

  • Your content contains Thai text. The Unicode Tokeniser emits the entire Thai portion of a string as a single token; the ICU Tokeniser is a baseline improvement but not sufficient for recall-sensitive applications.
  • You are building a search index over Thai documents. Without word-level segmentation, phrase queries, term frequency counting, and relevance scoring are all incorrect.
  • Your domain contains specialised vocabulary (medical, legal, technical) — in this case, supplement the base tokeniser’s dictionary with domain-specific word lists (a sketch follows this list).
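
A sketch of that supplementation with PyThaiNLP's newmm engine; the added word and sample text are illustrative. Note that custom_dict replaces the default lexicon, so the stock word list is merged back in:

from pythainlp.corpus.common import thai_words
from pythainlp.tokenize import word_tokenize
from pythainlp.util import Trie

domain_terms = {"โควิด"}                          # illustrative domain vocabulary
custom_dict = Trie(set(thai_words()) | domain_terms)

word_tokenize("ผู้ป่วยโควิด", engine="newmm", custom_dict=custom_dict)   # expected: ['ผู้ป่วย', 'โควิด']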

Engine selection tradeoffs:

Engine                               Speed    OOV handling   Recommended when
Dictionary + maximal match (newmm)   Fast     Weak           General purpose, high-throughput indexing
CRF                                  Medium   Good           Balanced accuracy/speed
Neural (deepcut, attacut)            Slower   Strong         User-generated content, informal text, named entities

For Elasticsearch and OpenSearch, the ICU Analysis plugin (analysis-icu) provides icu_tokenizer, which uses ICU’s built-in Thai dictionary — a practical production option when deploying a standalone Thai tokeniser is not feasible. It handles common vocabulary well; for domain-specific terms and ambiguous runs, better recall comes from segmenting text with a dedicated Thai tokeniser before indexing.
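
A sketch of wiring icu_tokenizer into an index, assuming the analysis-icu plugin is installed and the elasticsearch Python client (8.x); the index, analyzer, and field names are illustrative:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
es.indices.create(
    index="thai-docs",
    settings={
        "analysis": {
            "analyzer": {
                "thai_icu": {                      # custom analyzer built on the plugin's tokenizer
                    "type": "custom",
                    "tokenizer": "icu_tokenizer",
                    "filter": ["lowercase"],
                }
            }
        }
    },
    mappings={"properties": {"body": {"type": "text", "analyzer": "thai_icu"}}},
)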

See also