Whitespace Tokeniser
What it is
A whitespace tokeniser splits a raw string into tokens wherever it encounters a whitespace character — space (U+0020), tab (U+0009), newline (U+000A), or carriage return (U+000D). Nothing else influences where boundaries fall. No rules, no vocabulary, no trained model.
It is the minimum viable tokeniser: easy to reason about, trivial to implement, and entirely transparent in its behaviour.
How it works
The algorithm is a single pass over the input string. Each run of non-whitespace characters becomes a token; whitespace characters are consumed and discarded.
```python
tokens = input_string.split()
```
Python’s str.split() with no argument collapses consecutive whitespace and strips leading and trailing spaces. Most implementations follow the same convention.
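A quick way to see that behaviour (a minimal check; the sample string is illustrative):

```python
raw = "  two   words \t and\na newline  "
print(raw.split())
# ['two', 'words', 'and', 'a', 'newline']: runs of whitespace collapse,
# and the leading/trailing whitespace produces no empty tokens
```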
Example
Input: "Costs $1,200 — including VAT."
| Strategy | Tokens |
|---|---|
| Whitespace | Costs, $1,200, —, including, VAT. |
| Rule-based | Costs, $, 1,200, —, including, VAT, . |
The whitespace tokeniser emits VAT. as a single token. A query for VAT will not match it unless the downstream system strips punctuation separately. This is the canonical failure mode.
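That failure is easy to reproduce against the example input (a minimal check):

```python
tokens = "Costs $1,200 — including VAT.".split()
print(tokens)           # ['Costs', '$1,200', '—', 'including', 'VAT.']
print("VAT" in tokens)  # False: the trailing full stop blocks the exact match
```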
Edge cases
Punctuation attached to words. A trailing full stop, comma, or closing parenthesis becomes part of the preceding token. "done." and "done" are distinct types in the index — a query for one will not retrieve the other.
Contractions. "isn't" is a single token. Downstream steps cannot split the negative contraction without separate post-processing.
Non-breaking space (U+00A0). HTML and word-processor output frequently contains non-breaking spaces. Whether a split routine treats U+00A0 as whitespace varies by language and library: Python's str.split() does, but byte-level splits and ASCII-only regular expressions do not. The safe choice is to normalise Unicode whitespace before tokenising.
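One way to do that normalisation is NFKC, which maps U+00A0 to an ordinary space (a minimal sketch; the sample string is illustrative):

```python
import unicodedata

text = "price:\u00a0£40"            # no-break space between the two fields
print(text.split(" "))              # ['price:\xa0£40']: a literal-space split misses it
norm = unicodedata.normalize("NFKC", text)
print(norm.split(" "))              # ['price:', '£40'] after NFKC maps U+00A0 to U+0020
```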
CJK scripts. Chinese and Japanese text contains no spaces between words (Korean does use spaces, but between phrasal units rather than individual words). A whitespace tokeniser applied to "自然言語処理" (natural language processing) emits a single token: the entire string. The algorithm is inapplicable without a language-aware segmenter.
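A one-line check makes the failure concrete:

```python
print("自然言語処理".split())  # ['自然言語処理']: the whole phrase comes back as a single token
```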
Multiple spaces and zero-width characters. Consecutive spaces collapse correctly when splitting on whitespace runs (an explicit single-space separator yields empty tokens instead). Zero-width joiners (U+200D) and zero-width no-break spaces (U+FEFF) are not whitespace at all: they stay embedded in tokens, producing strings that look identical to clean ones but fail exact-match lookups unless stripped first.
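A defensive pre-pass that removes common zero-width characters before splitting (a sketch; the character list and function name are illustrative, not exhaustive):

```python
# ZWSP, ZWNJ, ZWJ, ZWNBSP/BOM, mapped to None so translate() deletes them
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\ufeff"))

def clean_split(text: str) -> list[str]:
    # Strip zero-width characters, then split on runs of Unicode whitespace.
    return text.translate(ZERO_WIDTH).split()

print("soft\u200dware update".split())       # ['soft\u200dware', 'update']: looks right, matches nothing
print(clean_split("soft\u200dware update"))  # ['software', 'update']
```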
When to use it
Use it when:
- Input is clean, space-delimited text with no embedded punctuation you care about — log lines, machine-generated identifiers, structured fields.
- Speed is the primary constraint and token quality matters less than throughput. A whitespace split is O(n) with minimal overhead and no external dependencies.
- You are prototyping a pipeline and need a baseline before committing to a more sophisticated tokeniser.
- You control input generation and can guarantee punctuation is already separated (e.g. pre-tokenised corpora like Penn Treebank distribute text with spaces around every punctuation mark).
Do not use it when:
- Input is natural-language prose. Punctuation-attached tokens and unhandled contractions silently reduce recall without any error signal.
- Your content is multilingual or may include CJK, Arabic, or other scripts where word boundaries are not marked by spaces.
- The pipeline feeds a downstream component — a stemmer, stop-word filter, or BM25 scorer — that expects clean word tokens. Garbage-in from the tokeniser compounds through every subsequent stage.
- You need query–index consistency: if documents were indexed with a whitespace tokeniser and queries are parsed with a rule-based tokeniser, "VAT." in the index will never match the query token "VAT" (see the sketch below).