Part-of-Speech Tagging
What it is
Part-of-speech (POS) tagging is a sequence labeling task that assigns each token in a text a grammatical category (noun, verb, adjective, pronoun, etc.). POS tags feed downstream tasks such as parsing, entity recognition, and linguistic analysis. Most modern POS taggers are neural (typically BERT-based) and reach 97% or higher token accuracy on standard benchmarks such as the Penn Treebank.
[illustrate: Text with POS tags above each token: “The/DT cat/NN sat/VBD on/IN the/DT mat/NN”]
How it works
- Tag sets: standard schemes such as Penn Treebank (48 tags) or Universal POS (17 tags)
  - Universal POS tags include NOUN, VERB, ADJ, ADV, PRON, DET, ADP, CCONJ, SCONJ, PUNCT, etc.
- Tagging approaches:
  - Rule-based: hand-crafted rules (rare now)
  - Statistical: HMM or CRF with hand-engineered features
  - Neural: BiLSTM or Transformer fine-tuned for tagging
- Modern approach:
  - Encode the tokens with BERT or a similar model
  - Classify each token’s POS tag from its contextual representation
  - Often combined with other tasks (NER, lemmatization)
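One practical detail of the modern approach is that BERT-style models work on subword pieces, while POS tags belong to whole words. A common convention is to label only the first subword of each word and mask the rest. The sketch below illustrates that convention in plain Python. The `word_ids` list, the subword split, and the score matrix are hand-written assumptions standing in for what a subword tokenizer and a fine-tuned encoder would produce, and the function names are hypothetical. The value -100 is the default `ignore_index` of PyTorch's `CrossEntropyLoss`.

```python
IGNORE_INDEX = -100  # label value that PyTorch's CrossEntropyLoss skips by default


def align_labels(word_ids, word_tags, tag_to_id):
    """Give each word's tag to its first subword and mask every other position."""
    labels = []
    previous = None
    for word_id in word_ids:
        if word_id is None:                # special token such as [CLS] or [SEP]
            labels.append(IGNORE_INDEX)
        elif word_id != previous:          # first subword of a new word
            labels.append(tag_to_id[word_tags[word_id]])
        else:                              # continuation subword of the same word
            labels.append(IGNORE_INDEX)
        previous = word_id
    return labels


def decode_tags(word_ids, scores, id_to_tag):
    """Read one tag per word: the highest-scoring tag at its first subword."""
    tags = []
    previous = None
    for word_id, row in zip(word_ids, scores):
        if word_id is not None and word_id != previous:
            best = max(range(len(row)), key=row.__getitem__)
            tags.append(id_to_tag[best])
        previous = word_id
    return tags


# Assumed subword split: [CLS] the fox jump ##s [SEP]
word_ids = [None, 0, 1, 2, 2, None]
word_tags = ["DET", "NOUN", "VERB"]
tag_to_id = {"DET": 0, "NOUN": 1, "VERB": 2}
id_to_tag = {i: t for t, i in tag_to_id.items()}

# Training labels: one entry per subword position.
labels = align_labels(word_ids, word_tags, tag_to_id)

# Made-up classifier scores, one row per subword position, one column per tag.
scores = [
    [0.1, 0.1, 0.1],
    [2.0, 0.1, 0.3],
    [0.2, 1.5, 0.1],
    [0.1, 0.4, 1.9],
    [0.9, 0.2, 0.1],
    [0.0, 0.0, 0.0],
]
predicted = decode_tags(word_ids, scores, id_to_tag)
```

With real models the scores would come from a classification layer on top of the encoder. Other conventions exist, such as labeling every subword or averaging subword representations, so treat first-subword labeling as one reasonable choice rather than the only one.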
Example
Sentence: "The quick brown fox jumps over the lazy dog"
POS tags (Universal POS):
The/DET quick/ADJ brown/ADJ fox/NOUN jumps/VERB over/ADP the/DET lazy/ADJ dog/NOUN
Penn Treebank (more fine-grained):
The/DT quick/JJ brown/JJ fox/NN jumps/VBZ over/IN the/DT lazy/JJ dog/NN
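The two taggings above are related by a coarsening of the tag set. As a rough sketch, the snippet below converts the Penn Treebank line into the Universal POS line. The mapping is deliberately partial and covers only the tags in this sentence, and the helper names are hypothetical. It also assumes that IN maps to ADP, which holds here, although in general IN can correspond to SCONJ when it introduces a clause.

```python
# Simplified Penn Treebank -> Universal POS mapping (only the tags used in this example).
PTB_TO_UPOS = {
    "DT": "DET",
    "JJ": "ADJ",
    "NN": "NOUN",
    "VBZ": "VERB",
    "IN": "ADP",  # assumption: prepositional use; clause-introducing IN would be SCONJ
}


def parse_tagged(text):
    """Split 'word/TAG word/TAG ...' into (word, tag) pairs."""
    return [tuple(item.rsplit("/", 1)) for item in text.split()]


def to_upos(tagged):
    """Replace each Penn Treebank tag with its Universal POS counterpart."""
    return [(word, PTB_TO_UPOS[tag]) for word, tag in tagged]


def format_tagged(tagged):
    """Join (word, tag) pairs back into 'word/TAG' notation."""
    return " ".join(f"{word}/{tag}" for word, tag in tagged)


ptb = "The/DT quick/JJ brown/JJ fox/NN jumps/VBZ over/IN the/DT lazy/JJ dog/NN"
upos = format_tagged(to_upos(parse_tagged(ptb)))
print(upos)
```

The conversion loses information: VBZ records third-person singular present tense, while VERB does not. That is why the mapping runs only from the fine-grained scheme to the coarse one.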
Variants and history
POS tagging dates to the 1960s, when the first rule-based systems appeared. HMM taggers (1980s–90s) introduced probabilistic approaches. CRF models (2000s) improved on them with discriminative training and structured prediction over rich, overlapping features. Neural taggers (BiLSTM, mid-2010s) and BERT-based taggers (2018+) reached near-human accuracy. Because a word's POS depends on its context (“bank” can be a noun or a verb), models that use context on both sides of a token are crucial for resolving such ambiguity.
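The effect of context on an ambiguous word like “bank” can be shown with a toy HMM tagger decoded with the Viterbi algorithm. This is a minimal sketch of the statistical approach, not of a neural bidirectional model. All probabilities are invented for illustration (and the emission tables are not normalized), and the two-word inputs are assumptions chosen to keep the trace easy to follow.

```python
TAGS = ["DET", "PRON", "NOUN", "VERB"]

# Invented toy parameters: P(first tag), P(next tag | tag), P(word | tag).
START = {"DET": 0.4, "PRON": 0.4, "NOUN": 0.1, "VERB": 0.1}
TRANS = {
    "DET":  {"DET": 0.025, "PRON": 0.025, "NOUN": 0.9, "VERB": 0.05},
    "PRON": {"DET": 0.05,  "PRON": 0.05,  "NOUN": 0.1, "VERB": 0.8},
    "NOUN": {"DET": 0.1,   "PRON": 0.1,   "NOUN": 0.2, "VERB": 0.6},
    "VERB": {"DET": 0.4,   "PRON": 0.2,   "NOUN": 0.3, "VERB": 0.1},
}
EMIT = {
    "DET":  {"the": 1.0},
    "PRON": {"we": 1.0},
    "NOUN": {"bank": 0.6},   # "bank" is somewhat more likely as a noun
    "VERB": {"bank": 0.4},
}


def viterbi(words):
    """Return the most probable tag sequence for a non-empty list of words."""
    # Each column maps tag -> (probability of the best path ending here, previous tag).
    columns = [{t: (START[t] * EMIT[t].get(words[0], 0.0), None) for t in TAGS}]
    for word in words[1:]:
        column = {}
        for t in TAGS:
            prob, prev = max((columns[-1][p][0] * TRANS[p][t], p) for p in TAGS)
            column[t] = (prob * EMIT[t].get(word, 0.0), prev)
        columns.append(column)

    # Backtrack from the best final tag through the stored predecessors.
    tag = max(TAGS, key=lambda t: columns[-1][t][0])
    path = [tag]
    for column in reversed(columns[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


after_determiner = viterbi(["the", "bank"])   # "bank" follows a determiner
after_pronoun = viterbi(["we", "bank"])       # "bank" follows a pronoun
```

In this toy model the emission probabilities alone favor NOUN for “bank”, so the VERB reading after “we” comes entirely from the transition probabilities. A real tagger would estimate these tables from an annotated corpus and work in log space to avoid underflow on long sentences.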
When to use it
Use POS tagging for:
- Parsing and syntax analysis
- Lemmatization (which, unlike stemming, often needs the POS to pick the correct lemma)
- Named entity recognition
- Information extraction
- Text analysis and corpus linguistics
- Language learning systems
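POS tagging is typically a preprocessing step, not an end task. Many modern pipelines predict POS tags jointly with other annotations (NER, lemmatization) from a shared encoder, which is more efficient than running a separate model per task.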