Tokenisation
What it is
Tokenisation is the first stage of almost every text-processing pipeline. It takes a raw string as input and produces a sequence of tokens — discrete units whose boundaries and granularity are defined by the tokeniser’s rules.
Nothing downstream can run without it. An inverted index cannot store words it hasn’t identified. A language model cannot embed a subword it hasn’t segmented. A BM25 scorer cannot count term frequencies before the terms exist. Tokenisation is the foundation the rest of the pipeline stands on.
The word “tokenisation” names the process; the output units are tokens. What a token actually is — a word, a subword fragment, a character — depends on which tokeniser is applied. The same sentence can yield very different token sequences depending on that choice.
How it works
A tokeniser reads a Unicode string and applies a splitting strategy to emit a list of tokens. The strategy ranges from trivial (split on whitespace) to learned (train a vocabulary of subword units from a corpus).
The main approaches, ordered roughly from simplest to most sophisticated:
Whitespace tokenisation splits on runs of whitespace (spaces, tabs, newlines). "the quick brown" → ["the", "quick", "brown"]. Fast and requires no configuration; breaks down immediately on punctuation and contractions.
Rule-based tokenisation applies regular expressions or hand-crafted grammars to handle edge cases: punctuation, possessives, hyphenated compounds, URLs, currency symbols. Most classical NLP toolkits (NLTK, spaCy, Stanford CoreNLP) use this approach for general English text.
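A minimal sketch of the rule-based idea, using one illustrative regular expression rather than any particular toolkit's rules:

```python
import re

# A deliberately small rule set: split off Penn-Treebank-style "n't",
# other apostrophe contractions, whole words, and single punctuation marks.
# Real toolkits use far larger pattern sets.
TOKEN_RE = re.compile(r"\w+(?=n't)|n't|'\w+|\w+|[^\w\s]")

def rule_based_tokenise(text: str) -> list[str]:
    return TOKEN_RE.findall(text)

print(rule_based_tokenise("Tokenisation isn't trivial."))
# ['Tokenisation', 'is', "n't", 'trivial', '.']
```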
Standard analysis in search engines is a multi-step pipeline: a character filter cleans the raw input (strips HTML, normalises Unicode), a tokeniser splits it into terms, and one or more token filters transform the resulting token list (lowercasing, stop-word removal, stemming). Lucene, Elasticsearch, OpenSearch, and Solr all expose this three-stage model explicitly.
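A conceptual sketch of that three-stage model in plain Python (the filter behaviour and stop-word list are illustrative, not any engine's actual implementation):

```python
import re
import unicodedata

# Stage 1: character filter - clean the raw input before tokenisation.
def char_filter(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)        # strip HTML tags
    return unicodedata.normalize("NFKC", text)  # normalise Unicode

# Stage 2: tokeniser - split the cleaned text into terms.
def tokenise(text: str) -> list[str]:
    return re.findall(r"\w+", text)

# Stage 3: token filters - transform the resulting token list.
STOP_WORDS = {"the", "a", "an", "is", "of"}     # illustrative stop list

def token_filters(tokens: list[str]) -> list[str]:
    tokens = [t.lower() for t in tokens]                # lowercase filter
    return [t for t in tokens if t not in STOP_WORDS]   # stop-word filter

def analyse(text: str) -> list[str]:
    return token_filters(tokenise(char_filter(text)))

print(analyse("<p>The Quick Brown Fox</p>"))
# ['quick', 'brown', 'fox']
```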
Subword tokenisation learns a fixed vocabulary of word-piece units from training data and segments each word into a sequence of known pieces, typically by applying learned merge rules or greedy longest-prefix matching. Byte-pair encoding (BPE), WordPiece, and Unigram LM are the standard algorithms. This approach underlies every major transformer model (GPT, BERT, T5, LLaMA) and handles out-of-vocabulary words gracefully by decomposing them into known fragments.
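A sketch of the greedy longest-match segmentation that WordPiece-style tokenisers apply at inference time; the toy vocabulary below is invented, whereas real vocabularies are learned and contain tens of thousands of entries:

```python
# Toy vocabulary - real ones are learned from a corpus, not hand-written.
# Casing and word splitting are assumed to be handled upstream.
VOCAB = {"token", "isation", "is", "n", "'t", "trivial", ".", "[UNK]"}

def subword_tokenise(word: str, vocab=VOCAB) -> list[str]:
    """Greedily match the longest known prefix, then repeat on the remainder."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:              # no known prefix: fall back to unknown marker
            return ["[UNK]"]
        pieces.append(word[start:end])
        start = end
    return pieces

print(subword_tokenise("tokenisation"))  # ['token', 'isation']
print(subword_tokenise("trivial"))       # ['trivial']
```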
Character tokenisation treats every Unicode code point as a token. No vocabulary is needed, but sequences become long — costly for transformers with quadratic attention complexity.
Example
Input: "Tokenisation isn't trivial."
| Strategy | Tokens |
|---|---|
| Whitespace | `Tokenisation`, `isn't`, `trivial.` |
| Rule-based | `Tokenisation`, `is`, `n't`, `trivial`, `.` |
| Lowercase + rule-based | `tokenisation`, `is`, `n't`, `trivial`, `.` |
| BPE (hypothetical vocab) | `Token`, `isation`, `is`, `n`, `'t`, `trivial`, `.` |
| Character | `T`, `o`, `k`, `e`, `n`, `i`, `s`, `a`, `t`, `i`, `o`, `n`, `␣`, `i`, `s`, `n`, `'`, `t`, `␣`, `t`, `r`, `i`, `v`, `i`, `a`, `l`, `.` (␣ = space) |
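The whitespace and character rows can be reproduced directly in Python:

```python
text = "Tokenisation isn't trivial."

print(text.split())  # whitespace: ['Tokenisation', "isn't", 'trivial.']
print(list(text))    # character: one token per code point, spaces included
```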
Variants and history
Tokenisation as a formal NLP step emerged alongside computational linguistics in the 1960s–70s. Rule-based tokenisers for English matured through the 1980s–90s.
Language-specific tokenisation. English tokenisation is relatively straightforward because words are space-separated. Mandarin and Japanese have no whitespace between words and require dictionary-based or statistical segmentation (Jieba, MeCab). Arabic and Hebrew are morphologically complex; a single written token may contain a preposition, noun, and possessive suffix that must be split before indexing.
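For instance, segmenting Chinese with the jieba library (the exact boundaries depend on jieba's bundled dictionary, so the output shown is only indicative):

```python
import jieba  # dictionary-based Chinese word segmentation (pip install jieba)

text = "自然语言处理很有趣"  # "Natural language processing is very interesting"
print(jieba.lcut(text))
# Indicatively: ['自然语言', '处理', '很', '有趣'] - boundaries depend on the dictionary
```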
Subword algorithms. BPE was adapted for NLP by Sennrich et al. (2016) to handle rare and out-of-vocabulary words in neural machine translation. Google's WordPiece (used by BERT) is a close relative; SentencePiece (used by T5 and LLaMA) is a library that implements BPE and Unigram LM directly over raw text. All of them learn merge rules or vocabulary entries from raw text corpora.
Tokenisation and vocabulary size. Subword tokenisers expose a vocabulary size hyperparameter. GPT-2 uses 50,257 tokens; LLaMA 2 uses 32,000. Larger vocabularies mean shorter sequences but require more embedding parameters.
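The parameter cost is easy to estimate: the token-embedding matrix alone holds vocab_size × d_model weights. A quick check, assuming GPT-2 small's hidden size of 768 and LLaMA 2 7B's hidden size of 4096:

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    """Parameters in the token-embedding matrix alone."""
    return vocab_size * d_model

print(embedding_params(50_257, 768))    # GPT-2 small:  38,597,376 (~38.6M)
print(embedding_params(32_000, 4_096))  # LLaMA 2 7B:  131,072,000 (~131M)
```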
When to use it
For full-text search (Elasticsearch, OpenSearch, Solr): use the engine's built-in analysis chain. The standard analyser (Unicode word-boundary segmentation, lowercasing, optional stop-word removal) is the right starting point for English. Add a stemmer when recall matters more than precision; use edge n-gram filters for autocomplete.
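A sketch of a custom analysis chain created through the Elasticsearch 8.x Python client; the index name, analyser names, and gram sizes are placeholders, and older client versions take a single body argument instead of the settings keyword:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumes a local node

es.indices.create(
    index="articles",  # illustrative index name
    settings={
        "analysis": {
            "filter": {
                "english_stemmer": {"type": "stemmer", "language": "english"},
                "autocomplete_edge": {"type": "edge_ngram", "min_gram": 2, "max_gram": 10},
            },
            "analyzer": {
                # Search analyser: standard tokeniser + lowercase + stop words + stemming
                "english_search": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "stop", "english_stemmer"],
                },
                # Index-time analyser for autocomplete fields
                "autocomplete": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "autocomplete_edge"],
                },
            },
        }
    },
)
```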
For transformer models: use the tokeniser bundled with the pretrained checkpoint, without modification. The tokeniser and model weights are a matched pair — changing one invalidates the other.
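With the Hugging Face transformers library, for example, the bundled tokeniser is loaded from the same checkpoint name as the model; bert-base-uncased below is just an example checkpoint:

```python
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"  # example checkpoint; use your model's name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

encoded = tokenizer("Tokenisation isn't trivial.")
print(encoded["input_ids"])
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# Continuation pieces carry a '##' prefix; the exact split depends on
# the checkpoint's vocabulary.
```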
For multilingual content: rule-based word tokenisation breaks down. Use a language-aware library (spaCy’s multilingual pipeline, ICU tokenisation, SentencePiece) or a subword tokeniser trained on the target language.
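A sketch of training and applying a SentencePiece model on your own corpus (the file names and vocabulary size are placeholders):

```python
import sentencepiece as spm

# Train a subword model on a raw-text corpus, one sentence per line.
# "corpus.txt", "spm_model", and the vocabulary size are placeholders.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="spm_model",
    vocab_size=16000,
    model_type="unigram",  # or "bpe"
)

sp = spm.SentencePieceProcessor(model_file="spm_model.model")
print(sp.encode("Tokenisation isn't trivial.", out_type=str))
```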
Watch for mismatches: if your query pipeline and index-time pipeline use different tokenisers, query terms will silently fail to match indexed terms. In Lucene-based engines, configure the same Analyzer for both indexing and query parsing.
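A toy illustration of the failure mode: documents indexed with a lowercasing step but queried without one never match on capitalised terms.

```python
# Index-time analysis lowercases; query-time analysis (wrongly) does not.
def index_analyse(text: str) -> list[str]:
    return text.lower().split()

def query_analyse(text: str) -> list[str]:
    return text.split()  # mismatch: no lowercasing

# Build a minimal term-frequency index.
index: dict[str, int] = {}
for term in index_analyse("Tokyo Travel Guide"):
    index[term] = index.get(term, 0) + 1

print(index.get(query_analyse("Tokyo")[0]))  # None - the query term never matches
print(index.get("tokyo"))                    # 1 - the indexed form
```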