Whitespace Tokeniser
What it is
A whitespace tokeniser splits a raw string into tokens wherever it encounters a whitespace character — space (U+0020), tab (U+0009), newline (U+000A), or carriage return (U+000D). Nothing else influences where boundaries fall. No rules, no vocabulary, no trained model.
It is the minimum viable tokeniser: easy to reason about, trivial to implement, and entirely transparent in its behaviour.
How it works
The algorithm is a single pass over the input string. Each run of non-whitespace characters becomes a token; whitespace characters are consumed and discarded.
```python
tokens = input_string.split()
```
Python’s str.split() with no argument collapses consecutive whitespace and strips leading and trailing spaces. Most implementations follow the same convention.
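A quick way to see that behaviour (a minimal check; the sample string is illustrative):

```python
raw = "  two   words \t and\na newline  "
print(raw.split())
# ['two', 'words', 'and', 'a', 'newline']: runs of whitespace collapse,
# and the leading/trailing whitespace produces no empty tokens
```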
Example
Input: "Costs $1,200 — including VAT."
| Strategy | Tokens |
|---|---|
| Whitespace | Costs, $1,200, —, including, VAT. |
| Rule-based | Costs, $, 1,200, —, including, VAT, . |
The whitespace tokeniser emits VAT. as a single token. A query for VAT will not match it unless the downstream system strips punctuation separately. This is the canonical failure mode.
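That failure is easy to reproduce against the example input (a minimal check):

```python
tokens = "Costs $1,200 — including VAT.".split()
print(tokens)           # ['Costs', '$1,200', '—', 'including', 'VAT.']
print("VAT" in tokens)  # False: the trailing full stop blocks the exact match
```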
Edge cases
Punctuation attached to words. A trailing full stop, comma, or closing parenthesis becomes part of the preceding token. "done." and "done" are distinct types in the index — a query for one will not retrieve the other.
Contractions. "isn't" is a single token. Downstream steps cannot split the negative contraction without separate post-processing.
Non-breaking space (U+00A0). HTML and word-processor output frequently contains non-breaking spaces. Whether a split routine treats U+00A0 as whitespace varies by language and library: Python's str.split() does, but byte-level splits and ASCII-only regular expressions do not. The safe choice is to normalise Unicode whitespace before tokenising.
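One way to do that normalisation is NFKC, which maps U+00A0 to an ordinary space (a minimal sketch; the sample string is illustrative):

```python
import unicodedata

text = "price:\u00a0£40"            # no-break space between the two fields
print(text.split(" "))              # ['price:\xa0£40']: a literal-space split misses it
norm = unicodedata.normalize("NFKC", text)
print(norm.split(" "))              # ['price:', '£40'] after NFKC maps U+00A0 to U+0020
```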
CJK scripts. Chinese and Japanese text contains no spaces between words (Korean does use spaces, but between phrasal units rather than individual words). A whitespace tokeniser applied to "自然言語処理" (natural language processing) emits a single token: the entire string. The algorithm is inapplicable without a language-aware segmenter.
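A one-line check makes the failure concrete:

```python
print("自然言語処理".split())  # ['自然言語処理']: the whole phrase comes back as a single token
```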
Multiple spaces and zero-width characters. Consecutive spaces collapse correctly when splitting on whitespace runs (an explicit single-space separator yields empty tokens instead). Zero-width joiners (U+200D) and zero-width no-break spaces (U+FEFF) are not whitespace at all: they stay embedded in tokens, producing strings that look identical to clean ones but fail exact-match lookups unless stripped first.
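A defensive pre-pass that removes common zero-width characters before splitting (a sketch; the character list and function name are illustrative, not exhaustive):

```python
# ZWSP, ZWNJ, ZWJ, ZWNBSP/BOM, mapped to None so translate() deletes them
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\ufeff"))

def clean_split(text: str) -> list[str]:
    # Strip zero-width characters, then split on runs of Unicode whitespace.
    return text.translate(ZERO_WIDTH).split()

print("soft\u200dware update".split())       # ['soft\u200dware', 'update']: looks right, matches nothing
print(clean_split("soft\u200dware update"))  # ['software', 'update']
```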
When to use it
Use it when:
- Input is clean, space-delimited text with no embedded punctuation you care about — log lines, machine-generated identifiers, structured fields.
- Speed is the primary constraint and token quality matters less than throughput. A whitespace split is O(n) with minimal overhead and no external dependencies.
- You are prototyping a pipeline and need a baseline before committing to a more sophisticated tokeniser.
- You control input generation and can guarantee punctuation is already separated (e.g. pre-tokenised corpora like Penn Treebank distribute text with spaces around every punctuation mark).
Do not use it when:
- Input is natural-language prose. Punctuation-attached tokens and unhandled contractions silently reduce recall without any error signal.
- Your content is multilingual or may include CJK, Arabic, or other scripts where word boundaries are not marked by spaces.
- The pipeline feeds a downstream component — a stemmer, stop-word filter, or BM25 scorer — that expects clean word tokens. Garbage-in from the tokeniser compounds through every subsequent stage.
- You need query–index consistency: if documents were indexed with a whitespace tokeniser and queries are parsed with a rule-based tokeniser, "VAT." in the index will never match the query token "VAT" (see the sketch below).