Word Tokeniser

What it is

A word tokeniser splits a raw string into tokens at word boundaries by applying a set of rules — typically regular expressions — rather than relying on whitespace alone. Where a whitespace tokeniser asks only “is this character a space?”, a word tokeniser asks “what kind of linguistic unit does this character sequence form?”

The distinction matters immediately. Given "She won't go.", a whitespace tokeniser produces three tokens: She, won't, go. — a trailing period glued to a word and a contraction left unsplit. A word tokeniser produces five: She, wo, n't, go, . — clean word forms the rest of the pipeline can actually use.
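The whitespace half of this comparison is one line of standard Python; the five-token word segmentation is what a PTB-style tokeniser returns (see the NLTK snippet under "Variants and history"):

    >>> "She won't go.".split()
    ['She', "won't", 'go.']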

How it works

The tokeniser applies an ordered set of regular expressions to classify and extract token spans before splitting on whitespace. A minimal rule set for English handles:

  • Punctuation separation — detach trailing or leading ., ,, :, !, ? from word tokens
  • Contractions — split n't, 's, 're, 've, 'll, 'd at the apostrophe boundary (won't → wo, n't; she's → she, 's)
  • Possessives — "dog's" → dog, 's
  • Hyphenated compounds — keep or split depending on policy; "well-known" may stay as one token for indexing or split to two for stemming
  • Tokens that must not split — URLs (https://example.com), email addresses, numbers with commas or decimals ($1,200, 3.14), and stock tickers are matched early and emitted whole

Rules are applied in priority order. A URL pattern fires before a punctuation-split rule so that the dot in example.com is never detached.
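To make the ordering concrete, here is a minimal sketch of such a rule set using Python's re module. The pattern set, and the names TOKEN_RE and tokenize, are illustrative rather than a full PTB implementation:

    import re

    # Branches are tried left to right at each position, so protected
    # patterns (URLs, emails, numbers) win before any punctuation rule
    # can detach a character from them.
    TOKEN_RE = re.compile(r"""
        https?://\S+                    # URLs, emitted whole
      | [\w.+-]+@[\w-]+\.[\w.]+         # email addresses
      | \d+(?:,\d{3})*(?:\.\d+)?        # numbers: 1,200  3.14  49.99
      | \w+(?=n't)                      # stem before n't: won't -> wo
      | n't|'s|'re|'ve|'ll|'d|'m        # contraction and possessive clitics
      | \w+                             # ordinary word
      | [^\w\s]                         # any single punctuation mark
    """, re.VERBOSE)

    def tokenize(text):
        return TOKEN_RE.findall(text)

Because the alternation is ordered, the dot inside support@acme.com is consumed by the email branch before the punctuation branch ever sees it.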

Example

Input: "We've emailed support@acme.com — it's $49.99."

Strategy          Tokens
Whitespace        We've, emailed, support@acme.com, —, it's, $49.99.
Word tokeniser    We, 've, emailed, support@acme.com, —, it, 's, $, 49.99, .

The whitespace tokeniser delivers $49.99. as a single token — unindexable as a price, and carrying a spurious period. The word tokeniser preserves the email address as a unit (its email pattern fires before any punctuation rule), separates the currency symbol, retains the number, and detaches the sentence-final period.
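Running the sketch from "How it works" on this input reproduces the word-tokeniser row:

    >>> tokenize("We've emailed support@acme.com — it's $49.99.")
    ['We', "'ve", 'emailed', 'support@acme.com', '—', 'it', "'s", '$', '49.99', '.']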

Variants and history

Penn Treebank tokenisation is the most widely cited reference standard. Released with the Penn Treebank corpus (Marcus et al., 1993), its tokenisation scheme defined explicit rules for English contractions, possessives, and punctuation. Because the majority of annotated English NLP data uses PTB conventions, models trained on that data implicitly expect PTB-tokenised input. NLTK’s word_tokenize applies a close approximation of PTB rules; the Moses tokeniser used in machine translation is another widely deployed descendant.
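For reference, NLTK's word_tokenize on the earlier example (it assumes the one-time download of the punkt sentence-splitter data via nltk.download):

    >>> from nltk.tokenize import word_tokenize
    >>> word_tokenize("She won't go.")
    ['She', 'wo', "n't", 'go', '.']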

spaCy’s rule-based tokeniser extends the PTB approach with language-specific exception dictionaries and prefix/suffix/infix pattern lists per language. Rather than re-scanning the whole string with multiple regex passes, it processes each whitespace-delimited chunk once (consulting the exception table, then trimming prefix and suffix patterns and splitting on infixes) and caches the result per string, making it significantly faster.
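A blank spaCy pipeline exposes just this tokeniser, with no statistical model download; the output below is what its English rules typically produce (exact splits can vary across versions):

    >>> import spacy
    >>> nlp = spacy.blank("en")
    >>> [t.text for t in nlp("We've emailed — it's $49.99.")]
    ['We', "'ve", 'emailed', '—', 'it', "'s", '$', '49.99', '.']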

Regexp tokenisers expose the rule set directly. NLTK’s RegexpTokenizer and similar classes let practitioners write their own pattern list — useful when domain text (legal, clinical, code) has conventions PTB rules don’t cover.
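A sketch of such a custom rule set with NLTK's RegexpTokenizer; the pattern here is illustrative, chosen to keep dollar amounts whole:

    >>> from nltk.tokenize import RegexpTokenizer
    >>> tokenizer = RegexpTokenizer(r"\w+|\$[\d\.]+|\S+")
    >>> tokenizer.tokenize("A $29.99 co-pay applies.")
    ['A', '$29.99', 'co', '-pay', 'applies', '.']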

When to use it

Use a word tokeniser whenever input is natural-language prose destined for a downstream component that expects clean word forms — a stemmer, a BM25 scorer, a stop-word filter, or a frequency counter.

Prefer a whitespace tokeniser only when input is already clean and pre-delimited (machine-generated logs, pre-tokenised corpora distributed with spaces around every punctuation mark) and speed is critical.

Avoid word tokenisers when working with transformer models: those models ship with their own subword tokeniser, trained to match the model’s vocabulary. Substituting a word tokeniser breaks the alignment between token sequences and model weights.
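To see why, compare the segmentation a transformer's own tokeniser produces (shown with the Hugging Face transformers library; the exact pieces depend on the model's vocabulary):

    >>> from transformers import AutoTokenizer
    >>> tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    >>> tok.tokenize("She won't go.")
    ['she', 'won', "'", 't', 'go', '.']

The model's embedding matrix is indexed by exactly these pieces; feeding it PTB tokens such as wo and n't would look up different (or missing) vocabulary entries.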

For multilingual pipelines, a language-specific rule set (or a language-aware library such as spaCy’s multilingual models) is needed; English PTB rules applied to French, German, or Turkish text produce incorrect splits.

See also