Whitespace Tokeniser

What it is

A whitespace tokeniser splits a raw string into tokens wherever it encounters a whitespace character — space (U+0020), tab (U+0009), newline (U+000A), or carriage return (U+000D). Nothing else influences where boundaries fall: no rules, no vocabulary, no trained model.

It is the minimum viable tokeniser: easy to reason about, trivial to implement, and entirely transparent in its behaviour.

How it works

The algorithm is a single pass over the input string. Each run of non-whitespace characters becomes a token; whitespace characters are consumed and discarded.

tokens = input_string.split()

Python’s str.split() with no argument splits on runs of any whitespace, discards empty strings, and ignores leading and trailing whitespace. Most standard libraries follow the same convention.
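A quick check of that collapsing behaviour (the variable names are illustrative):

```python
text = "  alpha \t beta\n\ngamma  "

# No-argument split: splits on any run of whitespace,
# ignores leading/trailing whitespace, yields no empty tokens.
print(text.split())        # ['alpha', 'beta', 'gamma']

# Splitting on a literal space does NOT collapse runs:
# each extra delimiter produces an empty string.
print(text.split(" "))
```

The difference between the two calls matters as soon as input contains tabs, newlines, or doubled spaces.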

Example

Input: "Costs $1,200 — including VAT."

Strategy     Tokens
Whitespace   Costs | $1,200 | — | including | VAT.
Rule-based   Costs | $ | 1,200 | — | including | VAT | .

The whitespace tokeniser emits VAT. as a single token. A query for VAT will not match it unless the downstream system strips punctuation separately. This is the canonical failure mode.

Edge cases

Punctuation attached to words. A trailing full stop, comma, or closing parenthesis becomes part of the preceding token. "done." and "done" are distinct types in the index — a query for one will not retrieve the other.

Contractions. "isn't" is a single token. Downstream steps cannot split the negative contraction without separate post-processing.
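A whitespace split keeps the contraction whole; one possible post-split rule is sketched below (the regex and the helper name are assumptions for illustration, not part of the whitespace algorithm):

```python
import re

tokens = "It isn't ready".split()
print(tokens)                     # ['It', "isn't", 'ready']

# Hypothetical post-processing step: peel off the "n't" clitic.
def split_negation(tok):
    m = re.fullmatch(r"(\w+)(n't)", tok)
    return [m.group(1), m.group(2)] if m else [tok]

print([t for tok in tokens for t in split_negation(tok)])
# → ['It', 'is', "n't", 'ready']
```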

Non-breaking space (U+00A0). HTML and word-processor output frequently contains non-breaking spaces. Whether a split routine treats U+00A0 as whitespace varies: Python 3’s str.split() does, but a split on the literal space character — and any ASCII-only definition of whitespace — does not. The safe choice is to normalise Unicode whitespace before tokenising.
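The divergence is easy to demonstrate; Python 3’s str.split() already handles U+00A0, so the explicit replacement is for pipelines whose splitter does not:

```python
html_text = "price:\u00a0£40"

# Python 3's no-argument split treats U+00A0 as whitespace:
print(html_text.split())          # ['price:', '£40']

# A literal-space split does not — the NBSP stays inside the token:
print(html_text.split(" "))       # ['price:\xa0£40']

# Normalising first makes both splitters agree.
normalised = html_text.replace("\u00a0", " ")
print(normalised.split(" "))      # ['price:', '£40']
```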

CJK scripts. Written Chinese and Japanese contain no spaces between words (Korean does use spaces, but they delimit phrase-level eojeol units rather than individual words). A whitespace tokeniser applied to "自然言語処理" (natural language processing) emits a single token: the entire string. The algorithm is inapplicable without a language-aware segmenter.

Multiple spaces and zero-width characters. Consecutive spaces collapse correctly in most implementations, but zero-width joiners (U+200D) and the zero-width no-break space (U+FEFF, also used as the byte-order mark) are not treated as whitespace: they are invisible, remain embedded in tokens, and produce unexpected token types if not stripped first.
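A sketch of stripping invisible characters before splitting (the character set and function name are illustrative, not exhaustive):

```python
# Illustrative set: ZWSP, ZWNJ, ZWJ, and the BOM/zero-width no-break space.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}

def clean_split(text):
    # Remove invisible characters, then fall back to the plain split.
    for ch in ZERO_WIDTH:
        text = text.replace(ch, "")
    return text.split()

print(clean_split("\ufefffoo\u200d bar"))   # ['foo', 'bar']
```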

When to use it

Use it when:

  • Input is clean, space-delimited text with no embedded punctuation you care about — log lines, machine-generated identifiers, structured fields.
  • Speed is the primary constraint and token quality matters less than throughput. A whitespace split is O(n) with minimal overhead and no external dependencies.
  • You are prototyping a pipeline and need a baseline before committing to a more sophisticated tokeniser.
  • You control input generation and can guarantee punctuation is already separated (e.g. pre-tokenised corpora like Penn Treebank distribute text with spaces around every punctuation mark).

Do not use it when:

  • Input is natural-language prose. Punctuation-attached tokens and unhandled contractions silently reduce recall without any error signal.
  • Your content is multilingual or may include CJK, Arabic, or other scripts where word boundaries are not marked by spaces.
  • The pipeline feeds a downstream component — a stemmer, stop-word filter, or BM25 scorer — that expects clean word tokens. Garbage-in from the tokeniser compounds through every subsequent stage.
  • You need query–index consistency: if documents were indexed with a rule-based tokeniser and queries are parsed with a whitespace tokeniser, "VAT." in the index will never match the query token "VAT".

See also