Sentence Tokeniser
What it is
A sentence tokeniser — also called a sentence splitter or sentence boundary detector — takes a document as input and returns an ordered list of sentence strings. It sits one level above word tokenisation in the pipeline: the document is first split into sentences, and each sentence is then passed to a Word Tokeniser for term extraction.
The task looks trivial. It is not. The full stop character (.) ends sentences but also appears in abbreviations (Dr., Prof., etc.), acronyms (U.S.A., N.A.S.A.), decimal numbers (3.14, £1,299.99), ellipses (...), and domain names (nlp.example.com). A naive rule — split on . followed by a capital letter — misfires on "Dr. Smith arrived." and "The U.S.A. withdrew." within the first few sentences of any news article.
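The naive rule is easy to write down and easy to break. A minimal sketch (plain `re`, splitting wherever a period is followed by an optional space and a capital letter) reproduces both misfires:

```python
import re

def naive_split(text):
    # Split wherever a '.' is followed by optional whitespace and a capital.
    # Zero-width pattern: the period stays attached to the left-hand piece.
    pieces = re.split(r"(?<=\.)(?=\s*[A-Z])", text)
    return [p.strip() for p in pieces]

print(naive_split("Dr. Smith arrived. The U.S.A. withdrew."))
# Splits after "Dr." and inside "U.S.A." — exactly the misfires described above.
```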
How it works
Most production sentence tokenisers fall into one of two families.
Rule-based splitters apply a priority-ordered list of patterns. A typical rule set:
- Protect known abbreviations — match tokens like `Dr.`, `Mr.`, `etc.`, `U.S.A.` against an exception list and mark their periods as non-terminal.
- Protect numeric contexts — a period flanked by digits on both sides is a decimal point, not a sentence boundary.
- Protect ellipses — `...` (or the Unicode ellipsis character `U+2026`) is never a sentence boundary on its own.
- Classify remaining periods — a period followed by whitespace and an uppercase letter is a candidate boundary; apply any remaining disambiguation rules, then split.
Rule-based splitters are fast and fully transparent, but their exception lists must be maintained per domain and per language.
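The four rules above can be sketched as a protect-then-split pass: placeholders mask protected periods before the boundary split, then are restored. The exception list and regexes here are illustrative, not a production rule set:

```python
import re

# Illustrative exception list — real lists are domain- and language-specific.
ABBREVIATIONS = {"dr.", "mr.", "prof.", "etc.", "u.s.a.", "fig.", "jan."}

def rule_based_split(text):
    protected = []

    def protect(match):
        # Replace a protected span with an unsplittable placeholder.
        protected.append(match.group(0))
        return f"\x00{len(protected) - 1}\x00"

    # Rule 1: protect known abbreviations and dotted acronyms.
    text = re.sub(
        r"\b(?:[A-Za-z]\.){2,}|\b[A-Za-z]+\.",
        lambda m: protect(m) if m.group(0).lower() in ABBREVIATIONS else m.group(0),
        text)
    # Rule 2: a period flanked by digits is a decimal point.
    text = re.sub(r"(?<=\d)\.(?=\d)", protect, text)
    # Rule 3: ellipses (three dots or U+2026) are never boundaries on their own.
    text = re.sub(r"\.{3}|\u2026", protect, text)
    # Rule 4: split at remaining terminal punctuation followed by
    # whitespace and an uppercase letter.
    pieces = re.split(r"(?<=[.?!])\s+(?=[A-Z])", text)
    # Restore the protected spans.
    return [re.sub(r"\x00(\d+)\x00", lambda m: protected[int(m.group(1))], p)
            for p in pieces]
```

The placeholder trick keeps the rules independent: each protection rule only has to recognise its own pattern, and the final split rule never sees a protected period.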
Punkt (Kiss & Strunk, 2006) is the dominant unsupervised approach and the default in NLTK. Punkt treats sentence boundary detection as a binary classification problem on period tokens. It learns three things from an unannotated corpus:
- Which word types appear frequently with trailing periods and are therefore likely abbreviations.
- Which of those abbreviations also regularly start sentences (so-called sentence starters).
- Whether a period-terminated token is orthographically consistent with sentence-final positions in the training corpus.
No labelled data is required — Punkt bootstraps its abbreviation lexicon from raw text statistics alone. The trained model is small enough (a few kilobytes) to bundle with a library.
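The heart of Punkt's first learning step — abbreviation candidates are word types that almost always carry a trailing period — can be caricatured in a few lines. This sketch keeps only the raw frequency ratio; the real algorithm uses a log-likelihood statistic that also weighs token length and internal periods, and the thresholds below are illustrative:

```python
from collections import Counter

def likely_abbreviations(corpus, min_ratio=0.8, max_len=4):
    """Toy version of Punkt's abbreviation bootstrapping: flag short
    word types that nearly always appear with a trailing period.
    min_ratio and max_len are illustrative, not Punkt's actual statistic."""
    with_period, total = Counter(), Counter()
    for token in corpus.split():
        word = token.rstrip(".").lower()
        if word:
            total[word] += 1
            if token.endswith("."):
                with_period[word] += 1
    return {w for w in total
            if with_period[w] / total[w] >= min_ratio and len(w) <= max_len}
```

Even on this tiny input the signal separates true abbreviations from words that merely end a sentence sometimes — the unsupervised bootstrapping described above:

```python
corpus = "Dr. Smith met Dr. Jones. Mr. Brown met Smith. Jones met Mr. Brown."
likely_abbreviations(corpus)  # "dr" and "mr", but not "jones" or "brown"
```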
[illustrate: step-by-step Punkt decision on the string “Dr. Smith works for U.S.A. Corp. She arrived on Jan. 1.” — each period token evaluated in sequence; “Dr.”, “U.S.A.”, “Corp.”, “Jan.” all classified as abbreviations and marked non-terminal (shown in orange); the period after “Corp.” reclassified as terminal when followed by “She” with a capital and a space (shown in green); final boundary positions marked with vertical dashed lines]
Example
Input:
"The conference is hosted by Prof. Alan Kay. See fig. 3.2 for details. Questions? Ask Dr. Smith."
| Splitter | Sentences |
|---|---|
| Naïve (split on `.` + capital) | "The conference is hosted by Prof." / "Alan Kay." / … (misfires on Prof.) |
| Punkt | "The conference is hosted by Prof. Alan Kay." / "See fig. 3.2 for details." / "Questions?" / "Ask Dr. Smith." |
Punkt correctly protects Prof., fig., and Dr. and recognises that 3.2 is a decimal. The question mark boundary is handled because ? and ! are treated as unconditional terminal punctuation.
Variants and history
Kiss & Strunk (2006) introduced Punkt in the paper Unsupervised Multilingual Sentence Boundary Detection. The algorithm’s key insight is that abbreviation tokens have anomalously high period frequency relative to their unperioded forms — a signal extractable without any labelled examples.
spaCy’s sentence segmentation is dependency-parse-based by default: sentence boundaries are inferred from the syntactic tree (specifically, the root token of each clause). This is more accurate on complex or quoted text but requires running the full parser first. spaCy also exposes a rule-based Sentencizer component for pipelines where parser overhead is unacceptable.
Quoted speech is a persistent hard case across all approaches. In He said, "We're leaving. Now.", the period after leaving is sentence-final within the quotation, but the outer sentence continues. Most splitters treat the quotation as opaque and defer to punctuation-plus-capitalisation heuristics.
When to use it
Run a sentence tokeniser whenever downstream processing operates at sentence granularity: machine translation, summarisation, named-entity recognition with sentence-scoped context, and sentence embedding models all expect individual sentences as input, not raw paragraphs.
For English general text, Punkt (NLTK’s sent_tokenize) is a reasonable default with no labelled training data required. Re-train or fine-tune it when your domain has abbreviations not found in news corpora — medical text (pt., b.i.d., q.d.), legal text (ibid., op. cit., v.), or code-mixed content.
Prefer spaCy’s parser-based segmentation when accuracy on quoted speech and complex syntax matters more than speed. Prefer a rule-based splitter when you need fully deterministic, auditable behaviour with no statistical component.
Avoid skipping sentence tokenisation and passing full documents to word-level pipelines: term frequencies become artificially inflated across sentence boundaries, and models that expect sentence-length input will silently degrade.