Sentence Boundary Detection
What it is
Sentence boundary detection (or sentence segmentation) identifies where sentences begin and end in text. Splitting on periods sounds simple, but it fails in practice because periods also appear in abbreviations (“U.S.A.”, “Dr.”), decimal numbers (“3.14”), and URLs, and because an abbreviation can itself end a sentence, so that one period does double duty. Most NLP pipelines include sentence detection as a preprocessing step.
[illustrate: Text with ambiguous periods; marked sentence boundaries; examples of tricky cases (abbreviations, decimals, etc.)]
How it works
Rule-based approaches:
- Simple: Split on [.!?] followed by space and capital letter
- Better: Maintain list of common abbreviations; context-aware rules
Statistical approaches:
- Train classifier (SVM, MaxEnt) on context features around punctuation
- Features: token before/after period, capitalization, abbreviation lists
Neural approaches:
- Encode text with BERT or BiLSTM
- Classify each period/punctuation as sentence boundary or not
- More robust to abbreviations and edge cases
Example
Text: "Dr. Smith went to the U.S.A. He saw it."
Rule-based (naive):
Sentence 1: "Dr." ← Wrong! (split after an abbreviation)
Sentence 2: "Smith went to the U.S.A."
Sentence 3: "He saw it."
Rule-based (with abbreviations):
Sentence 1: "Dr. Smith went to the U.S.A."
Sentence 2: "He saw it." ← Correct!
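The two rule-based results above can be reproduced with a short sketch. The names `split_naive`, `split_with_abbreviations`, and `ABBREVIATIONS` are hypothetical, and the abbreviation list is deliberately limited to titles that are normally followed by a name. "U.S.A." is left out on purpose, because it can legitimately end a sentence, as it does here.

```python
import re

# Hypothetical list of title abbreviations that rarely end a sentence.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof", "st"}

# Candidate boundary: '.', '!' or '?' followed by whitespace and a capital letter.
CANDIDATE = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_naive(text):
    """Split at every candidate boundary."""
    return CANDIDATE.split(text)


def split_with_abbreviations(text):
    """Split at candidate boundaries unless the preceding token is a known abbreviation."""
    sentences, start = [], 0
    for match in CANDIDATE.finditer(text):
        # Last whitespace-delimited token before the candidate boundary.
        last_token = text[start:match.start()].split()[-1]
        if last_token.rstrip(".").lower() in ABBREVIATIONS:
            continue  # treat the period as part of the abbreviation
        sentences.append(text[start:match.start()])
        start = match.end()
    sentences.append(text[start:])
    return sentences


text = "Dr. Smith went to the U.S.A. He saw it."
print(split_naive(text))
# ['Dr.', 'Smith went to the U.S.A.', 'He saw it.']
print(split_with_abbreviations(text))
# ['Dr. Smith went to the U.S.A.', 'He saw it.']
```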
Tricky cases:
- "I have 3.14 apples. You have none." → 2 sentences (decimal point)
- "Visit http://example.com. It's great." → 2 sentences (URL)
- "The U.K., U.S., and France met." → 1 sentence (abbreviations in list)
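As a quick check, the three tricky cases can be run through the simple rule (punctuation followed by whitespace and a capital letter). The sketch below assumes that rule is expressed as the regular expression `BOUNDARY`; the helper name `count_sentences` is made up for this example. The rule gets all three cases right because the decimal point, the dots inside the URL, and the abbreviation periods before commas are never followed by whitespace plus a capital letter.

```python
import re

# Simple rule: '.', '!' or '?' followed by whitespace and a capital letter.
BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def count_sentences(text):
    """Count the sentences produced by splitting on the simple rule."""
    return len(BOUNDARY.split(text))


cases = [
    "I have 3.14 apples. You have none.",
    "Visit http://example.com. It's great.",
    "The U.K., U.S., and France met.",
]
for case in cases:
    print(count_sentences(case), case)
# 2 I have 3.14 apples. You have none.
# 2 Visit http://example.com. It's great.
# 1 The U.K., U.S., and France met.
```

The same rule still splits "Dr. Smith arrived." into two pieces, which is why abbreviation handling remains necessary.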
Variants and history
Early NLP systems relied on simple heuristics. Punkt (Kiss & Strunk, 2006), an unsupervised sentence splitter, became widely adopted through its implementation in NLTK. Neural sentence segmentation, which uses bidirectional context, improved on ambiguous cases. Domain-specific models are needed for medical text, code comments, and similar material. Modern toolkits (spaCy, CoreNLP) provide robust sentence detection.
When to use it
Use sentence boundary detection for:
- The first stage of a text preprocessing pipeline
- Parallel sentence extraction for machine translation
- Sentence-level analysis (sentiment per sentence)
- Document structure understanding
- Any task requiring sentence tokens
Sentence detection is usually a preliminary step. Most modern NLP frameworks handle it transparently.