Sentence Boundary Detection

What it is

Sentence boundary detection (or sentence segmentation) identifies where sentences begin and end in text. Splitting on periods sounds sufficient in theory, but in practice periods also appear inside abbreviations (“U.S.A.”, “Dr.”), decimal numbers (“3.14”), and URLs, and an abbreviation can itself end a sentence, so a single period may mark both the abbreviation and the boundary. Most NLP pipelines include sentence detection as a preprocessing step.

[illustrate: Text with ambiguous periods; marked sentence boundaries; examples of tricky cases (abbreviations, decimals, etc.)]

How it works

Rule-based approaches:
- Simple: Split on [.!?] followed by space and capital letter
- Better: Maintain list of common abbreviations; context-aware rules
Statistical approaches:
- Train classifier (SVM, MaxEnt) on context features around punctuation
- Features: token before/after period, capitalization, abbreviation lists
Neural approaches:
- Encode text with BERT or BiLSTM
- Classify each period/punctuation as sentence boundary or not
- More robust to abbreviations and edge cases
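
To make the feature list for statistical approaches more concrete, here is a minimal sketch of feature extraction around candidate punctuation. The function name `boundary_features`, the feature names, and the tiny `ABBREVIATIONS` set are all assumptions made for illustration; a real system would use a much larger abbreviation list and feed these dictionaries to a trained classifier, which is not shown here.

```python
import re

# Hypothetical abbreviation list for illustration only.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "u.s.a.", "e.g.", "i.e."}


def boundary_features(text):
    """Yield (token_index, features) for every token ending in . ! or ?"""
    tokens = text.split()
    for i, tok in enumerate(tokens):
        if not re.search(r"[.!?]$", tok):
            continue  # not a candidate boundary
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        yield i, {
            "token": tok.lower(),                        # token before the boundary
            "next_token": nxt.lower(),                   # token after the boundary
            "next_is_capitalized": nxt[:1].isupper(),    # capitalization cue
            "is_known_abbreviation": tok.lower() in ABBREVIATIONS,
            "is_last_token": i == len(tokens) - 1,
        }


for index, feats in boundary_features("Dr. Smith went to the U.S.A. He saw it."):
    print(index, feats)
```

Note that "U.S.A." is both a known abbreviation and followed by a capitalized word, so no single feature settles the question; this is the kind of case a classifier weighs from several features together.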

Example

Text: "Dr. Smith went to the U.S.A. He saw it."

Rule-based (naive):
Sentence 1: "Dr."  ← Wrong! ("Dr." is split from "Smith")
Sentence 2: "Smith went to the U.S.A."
Sentence 3: "He saw it."

Rule-based (with abbreviations):
Sentence 1: "Dr. Smith went to the U.S.A."
Sentence 2: "He saw it."  ← Correct!
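
The two rule-based results above can be reproduced with a short sketch. The function `split_sentences`, its `non_terminal` parameter, and the `TITLE_ABBREVIATIONS` set are hypothetical names chosen for this illustration, and whitespace tokenization is a simplifying assumption.

```python
# Hypothetical, deliberately tiny list of abbreviations that rarely end a sentence.
TITLE_ABBREVIATIONS = frozenset({"dr.", "mr.", "mrs.", "ms.", "prof."})


def split_sentences(text, non_terminal=frozenset()):
    """Split after a token ending in . ! or ? when the next token starts with
    a capital letter, unless the token is listed in `non_terminal`."""
    tokens = text.split()
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        is_last = i == len(tokens) - 1
        ends_with_punct = tok[-1] in ".!?"
        next_capital = not is_last and tokens[i + 1][0].isupper()
        blocked = tok.lower() in non_terminal
        if is_last or (ends_with_punct and next_capital and not blocked):
            sentences.append(" ".join(current))
            current = []
    return sentences


text = "Dr. Smith went to the U.S.A. He saw it."
print(split_sentences(text))                       # naive rule
print(split_sentences(text, TITLE_ABBREVIATIONS))  # with abbreviation list
```

The list here holds only titles on purpose: adding "u.s.a." would stop the splitter from breaking after "U.S.A." even when it really does end the sentence, as it does in this text.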

Tricky cases:
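- "I have 3.14 apples. You have none." → 2 sentences (the decimal point in "3.14" is not a boundary)
- "Visit http://example.com. It's great." → 2 sentences (periods inside the URL are not boundaries; the one after it is)
- "The U.K., U.S., and France met." → 1 sentence (abbreviation periods inside a list)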
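
As a quick check of the tricky cases, the sketch below expresses the simple rule (punctuation, then whitespace, then a capital letter) as a single regular expression. The names `BOUNDARY` and `naive_split` are assumptions for this example, and the pattern only recognizes ASCII capitals.

```python
import re

# A boundary is whitespace preceded by . ! or ? and followed by an uppercase letter.
BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def naive_split(text):
    """Split text at every position matching the simple boundary rule."""
    return BOUNDARY.split(text)


cases = [
    "I have 3.14 apples. You have none.",
    "Visit http://example.com. It's great.",
    "The U.K., U.S., and France met.",
]
for case in cases:
    print(naive_split(case))
```

These three cases come out right because the periods in question are followed by a digit, a letter, or a comma rather than by whitespace and a capital. The same rule still breaks on a title followed by a name, which is where an abbreviation list or a learned model is needed.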

Variants and history

Early NLP systems relied on simple heuristics. Punkt (Kiss & Strunk, 2006), an unsupervised sentence splitter, became widely adopted through NLTK. Neural sentence segmenters, which use context on both sides of a candidate boundary, improved accuracy on ambiguous cases. Domain-specific models are still needed for medical text, code comments, and similar genres. Modern toolkits such as spaCy and CoreNLP provide robust sentence detection.

When to use it

Use sentence boundary detection for:

- The first stage of a text preprocessing pipeline
- Parallel sentence extraction for machine translation
- Sentence-level analysis (e.g., sentiment per sentence)
- Document structure understanding
- Any task that requires sentence-level units

Sentence detection is usually a preliminary step. Most modern NLP frameworks handle it transparently.