Chunking
What it is
Chunking (or shallow parsing) groups adjacent tokens into syntactic phrases without building full parse trees. Chunks typically include noun phrases (“the quick brown fox”), verb phrases (“will run”), and prepositional phrases (“over the hill”). Because it is simpler and faster than full parsing, chunking is useful for information extraction and text analysis.
[illustrate: Text with chunks marked: NP (noun phrase), VP (verb phrase), PP (prepositional phrase)]
How it works
- Tag sequences: typically use IOB (also written BIO) encoding, where B- marks the first token of a chunk, I- marks a token inside a chunk, and O marks tokens outside any chunk
  - Bracketed: [NP The quick brown fox] [VP runs] [PP over the hill]
  - IOB: B-NP, I-NP, I-NP, I-NP, B-VP, B-PP, I-PP, I-PP
- Chunk types: common phrase categories
- NP (noun phrase): articles, adjectives, nouns
- VP (verb phrase): verbs, auxiliaries, particles
- PP (prepositional phrase): prepositions + NP
- ADJP (adjective phrase): adjectives
- ADVP (adverbial phrase): adverbs
- Neural chunking: treat chunking as per-token sequence labeling
- Encode tokens with BERT or BiLSTM
- Classify each token’s chunk type
- Decode the IOB tags to recover phrase spans
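The final decoding step is the same whatever model produced the tags: walk the token/tag sequence and close a chunk whenever a B- tag, an O tag, or an inconsistent I- tag appears. A minimal sketch (the `decode_iob` name is ours):

```python
def decode_iob(tokens, tags):
    """Decode parallel token/IOB-tag sequences into (label, phrase) chunks.

    B-X starts a new chunk of type X, I-X continues the open chunk,
    and O (or a mismatched I-) closes the chunk in progress.
    """
    chunks, current_label, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_tokens:                      # close the open chunk
                chunks.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)            # extend the open chunk
        else:                                       # O tag or inconsistent I-
            if current_tokens:
                chunks.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_tokens:                              # flush the final chunk
        chunks.append((current_label, " ".join(current_tokens)))
    return chunks

tokens = "The quick brown fox runs over the hill quickly".split()
tags = ["B-NP", "I-NP", "I-NP", "I-NP", "B-VP",
        "B-PP", "I-PP", "I-PP", "B-ADVP"]
print(decode_iob(tokens, tags))
# [('NP', 'The quick brown fox'), ('VP', 'runs'),
#  ('PP', 'over the hill'), ('ADVP', 'quickly')]
```

Treating a mismatched I- tag as O makes the decoder robust to the occasional ill-formed sequence a classifier can emit.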
Example
Sentence: "The quick brown fox runs over the hill quickly"
Chunks:
[NP The quick brown fox] [VP runs] [PP over the hill] [ADVP quickly]
IOB tagging:
The/B-NP quick/I-NP brown/I-NP fox/I-NP runs/B-VP
over/B-PP the/I-PP hill/I-PP quickly/B-ADVP
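The IOB tagging above follows mechanically from the bracketed chunks: the first token of each chunk gets B-label, the rest get I-label. A minimal sketch (the `chunks_to_iob` helper name is ours):

```python
def chunks_to_iob(chunks):
    """Convert (label, phrase) chunks into parallel token and IOB-tag lists."""
    tokens, tags = [], []
    for label, phrase in chunks:
        for i, token in enumerate(phrase.split()):
            tokens.append(token)
            # First token of the chunk begins it (B-); the rest are inside (I-).
            tags.append(("B-" if i == 0 else "I-") + label)
    return tokens, tags

chunks = [("NP", "The quick brown fox"), ("VP", "runs"),
          ("PP", "over the hill"), ("ADVP", "quickly")]
tokens, tags = chunks_to_iob(chunks)
print(" ".join(f"{tok}/{tag}" for tok, tag in zip(tokens, tags)))
# The/B-NP quick/I-NP brown/I-NP fox/I-NP runs/B-VP
# over/B-PP the/I-PP hill/I-PP quickly/B-ADVP
```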
Variants and history
Chunking emerged in the 1990s as a task intermediate between POS tagging and full parsing. Rule-based chunkers matched patterns over POS tags. Machine learning chunking (SVMs, CRFs, 2000s) improved robustness. Neural chunking (BiLSTM-CRF, BERT-based, 2015+) achieved high accuracy. Modern systems often train chunking jointly with POS tagging or NER for efficiency.
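A rule-based chunker in the POS-pattern style can be sketched in a few lines. The pattern here (optional determiner, any adjectives, one or more nouns, i.e. DT? JJ* NN+) and the `np_chunk` name are illustrative assumptions, not any specific historical system:

```python
def np_chunk(tagged):
    """Greedy rule-based NP chunker over (token, POS-tag) pairs.

    Matches the classic pattern DT? JJ* NN+ : an optional determiner,
    any number of adjectives, then one or more nouns.
    """
    chunks, i = [], 0
    while i < len(tagged):
        j = i
        if j < len(tagged) and tagged[j][1] == "DT":      # optional determiner
            j += 1
        while j < len(tagged) and tagged[j][1] == "JJ":   # any adjectives
            j += 1
        k = j
        while k < len(tagged) and tagged[k][1].startswith("NN"):  # nouns
            k += 1
        if k > j:                                         # need at least one noun
            chunks.append(" ".join(tok for tok, _ in tagged[i:k]))
            i = k
        else:
            i += 1
    return chunks

tagged = [("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
          ("runs", "VBZ"), ("over", "IN"), ("the", "DT"), ("hill", "NN")]
print(np_chunk(tagged))  # ['The quick brown fox', 'the hill']
```

Production rule-based chunkers (e.g. NLTK's RegexpParser) express such patterns as regular expressions over tag sequences rather than hand-written loops, but the idea is the same.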
When to use it
Use chunking for:
- Quick noun phrase extraction
- Lightweight syntactic analysis
- Information extraction without full parsing
- Text preprocessing for other tasks
- Handling languages where full parsing is expensive
Chunking is efficient and interpretable. For complex syntactic phenomena, full dependency or constituency parsing may be necessary.