Token
What it is
A token is the atomic unit of text that a pipeline hands off between stages. Before any NLP system — a search engine, a language model, a classifier — can do anything useful with text, that text must be broken into discrete pieces. Those pieces are tokens.
What counts as a token depends entirely on the tokeniser doing the splitting. In a typical search engine, tokens are words: "the quick brown fox" becomes four tokens. In a byte-pair encoding (BPE) tokeniser of the kind used by large language models, the count depends on the learned vocabulary: common words usually survive as single tokens, while rarer words are split into subword pieces, so the same phrase can yield a different number of tokens. In a character-level model, every individual character is a token.
The word “token” is therefore a relative term — it names a unit without specifying its size or boundary rules. The tokeniser determines both.
How it works
A tokeniser reads a raw string and emits a sequence of tokens. The most common approaches differ in where they place the boundaries:
Whitespace tokenisation splits on spaces and tabs. Fast and predictable, but naïve: "don't" stays as one token, "end." keeps its trailing period.
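A one-line illustration in Python, using the standard str.split (which splits on any run of whitespace):

```python
# Whitespace tokenisation: split on runs of spaces, tabs, and newlines.
text = "Don't stop at the end."
tokens = text.split()
print(tokens)  # ["Don't", 'stop', 'at', 'the', 'end.']  -- punctuation stays attached
```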
Rule-based tokenisation applies a set of hand-crafted patterns — usually regular expressions — to handle punctuation, contractions, hyphenation, and special characters. Most classical NLP toolkits (NLTK, spaCy, OpenNLP) use this approach for word tokenisation.
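A minimal regex sketch of the idea; the single pattern below is hypothetical, and real toolkits apply far larger rule sets:

```python
import re

# Toy rule-based tokeniser: split off clitics such as 's, keep word characters
# together, and emit each punctuation mark as its own token.
TOKEN_RE = re.compile(r"'\w+|\w+|[^\w\s]")

def tokenize(text: str) -> list[str]:
    return TOKEN_RE.findall(text)

print(tokenize("Elasticsearch's fuzzy matching is fast."))
# ['Elasticsearch', "'s", 'fuzzy', 'matching', 'is', 'fast', '.']
```

The output matches the rule-based row in the example table below.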
Subword tokenisation learns a vocabulary of word-piece units from a training corpus, then splits words into the longest known subwords. BPE and WordPiece are the dominant algorithms. The word "tokenisation" might become ["token", "isation"] or even ["token", "is", "ation"] depending on the vocabulary. This approach handles out-of-vocabulary words gracefully.
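A sketch of the splitting step only, against a tiny hand-written vocabulary (hypothetical; real subword vocabularies are learned from a corpus, and BPE applies learned merge rules rather than a plain longest-match scan):

```python
# Greedy longest-match splitting against a hypothetical subword vocabulary.
# Only the lookup-and-split step is illustrated; the vocabulary itself would
# normally be learned from a large corpus.
VOCAB = {"token", "is", "ation", "isation"}

def subword_split(word: str, vocab: set[str]) -> list[str]:
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:          # longest known subword wins
                pieces.append(word[start:end])
                start = end
                break
        else:
            pieces.append(word[start])            # unknown: fall back to one character
            start += 1
    return pieces

print(subword_split("tokenisation", VOCAB))  # ['token', 'isation']
```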
Character tokenisation treats every Unicode code point as a token. No vocabulary needed; every possible input is covered. The cost is long sequences.
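In Python, where a string is already a sequence of code points, this is a one-liner:

```python
# Character tokenisation: every Unicode code point becomes a token.
text = "fast café"
tokens = list(text)
print(tokens)       # ['f', 'a', 's', 't', ' ', 'c', 'a', 'f', 'é']
print(len(tokens))  # 9 tokens for a two-word input
```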
Example
Input: "Elasticsearch's fuzzy matching is fast."
| Strategy | Tokens |
|---|---|
| Whitespace | Elasticsearch's, fuzzy, matching, is, fast. |
| Rule-based | Elasticsearch, 's, fuzzy, matching, is, fast, . |
| BPE (hypothetical) | elastic, search, 's, fuzz, y, match, ing, is, fast, . |
| Character | E, l, a, s, t, i, c, s, e, a, r, c, h, … (39 tokens, including spaces) |
Variants and history
The term “token” in the computational sense borrows from compiler theory, where lexical analysis splits source code into tokens before parsing. NLP adopted the same vocabulary in the 1950s–60s.
Type vs token. In linguistics, a type is a unique word form and a token is any individual occurrence. "the cat sat on the mat" has six tokens but five types. Search engines care about types; language models care about token counts.
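The distinction is easy to compute once a text is tokenised: tokens are the occurrences, types are the set of distinct forms.

```python
tokens = "the cat sat on the mat".split()
types = set(tokens)
print(len(tokens))  # 6 tokens: every occurrence counts
print(len(types))   # 5 types: "the" occurs twice but is a single type
```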
Token IDs. In transformer models, the tokeniser converts each token string to an integer index. "fast" might map to token ID 4104 in GPT-2’s vocabulary. This mapping is fixed at training time.
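As an illustration, the mapping can be inspected with a BPE library such as tiktoken (assumed installed here); the exact integers are specific to each model's vocabulary.

```python
import tiktoken

# GPT-2's byte-pair encoding maps each token string to a fixed integer ID.
enc = tiktoken.get_encoding("gpt2")

ids = enc.encode("fuzzy matching is fast")
print(ids)                              # a list of integer token IDs
print([enc.decode([i]) for i in ids])   # the token strings those IDs stand for
```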
Special tokens. Transformer tokenisers reserve IDs for control tokens: [CLS], [SEP], [PAD], <|endoftext|>. These carry no lexical content — they signal structure to the model.
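An illustration with the Hugging Face transformers library (assuming it is installed and can fetch the bert-base-uncased vocabulary):

```python
from transformers import AutoTokenizer

# BERT-style tokenisers wrap every input in control tokens; [CLS] and [SEP]
# carry no lexical content, they only mark sequence structure for the model.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")

ids = tok("the cat sat on the mat")["input_ids"]
print(tok.convert_ids_to_tokens(ids))
# e.g. ['[CLS]', 'the', 'cat', 'sat', 'on', 'the', 'mat', '[SEP]']
```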
When to use it
The practical decision is which tokeniser to use:
- Search and IR — rule-based word tokenisation via the engine’s built-in analyser
- Transformer models — BPE or WordPiece as shipped with the pretrained model; never feed a model token IDs produced by a tokeniser it was not trained with
- Morphologically rich languages — subword or character tokenisation degrades more gracefully than whitespace splitting
- Noisy text — character or subword tokenisation is more robust to hashtags, OCR errors, and misspellings
Token count directly drives cost and latency in any system billed per token or with a sequence-length limit.
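A back-of-the-envelope sketch, again assuming tiktoken and using a purely hypothetical per-token price:

```python
import tiktoken

PRICE_PER_1K_TOKENS = 0.002  # hypothetical price, for illustration only

enc = tiktoken.get_encoding("gpt2")
prompt = "Summarise the attached incident report in three bullet points."

n_tokens = len(enc.encode(prompt))
print(n_tokens)                                            # billable token count
print(f"estimated cost: ${n_tokens / 1000 * PRICE_PER_1K_TOKENS:.6f}")
```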