Corpus Annotation
What it is
Corpus annotation is the process of adding linguistic labels or metadata to raw text. Annotations include part-of-speech (POS) tags, named entities, semantic roles, syntactic dependencies, and discourse relations. Annotated corpora serve as training data for supervised NLP models and as evaluation benchmarks for task performance.
[illustrate: Text with multiple layers of annotation (POS, NER, dependencies); example showing how annotations provide training signal]
How it works
Annotation scheme:
- Define the label set (e.g., B-PER, I-PER, O for NER)
- Document guidelines with examples and edge cases
- Train annotators on the guidelines
Annotation process:
- Assign labels to text units (tokens, spans, or sentences)
- Use multiple annotators per item so reliability can be measured (inter-annotator agreement)
- Adjudicate disagreements to produce a single gold standard (a minimal sketch follows this list)
Quality control:
- Inter-annotator agreement (Cohen's κ for two annotators, Fleiss' κ for more)
- κ ≥ 0.8 is a common threshold for acceptable quality (see the κ sketch after this list)
- Version control and issue tracking for guidelines and annotations
Output: Labeled corpus ready for model training
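The adjudication step can be partially automated. Below is a minimal Python sketch of a naive majority-vote pass that produces a provisional gold label per item and flags ties for an expert adjudicator; real projects combine this with manual review, and the function name, label values, and tie-handling policy here are illustrative assumptions, not a standard API.

from collections import Counter

def majority_adjudicate(label_sets):
    """Naive adjudication sketch: majority vote per item across annotators.

    label_sets: one list of labels per annotator, all aligned to the same items.
    Returns a provisional gold label per item, or None where a tie needs an expert.
    """
    gold = []
    for item_labels in zip(*label_sets):          # labels for one item, all annotators
        ranked = Counter(item_labels).most_common()
        top_label, top_count = ranked[0]
        tie = len(ranked) > 1 and ranked[1][1] == top_count
        gold.append(None if tie else top_label)   # None marks "send to adjudicator"
    return gold

# Three annotators labeling the same four tokens (hypothetical labels).
annotator_labels = [
    ["B-PER", "I-PER", "O", "B-ORG"],
    ["B-PER", "I-PER", "O", "O"],
    ["B-PER", "O",     "O", "B-ORG"],
]
print(majority_adjudicate(annotator_labels))  # ['B-PER', 'I-PER', 'O', 'B-ORG']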
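For the quality-control step, the sketch below computes Cohen's κ for two annotators from scratch so the formula is visible; the token labels are hypothetical, and in practice you would normally use an existing implementation (e.g., scikit-learn's cohen_kappa_score) rather than hand-rolling it.

from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators chose the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each annotator's marginal label distribution.
    dist_a, dist_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((dist_a[lab] / n) * (dist_b[lab] / n)
              for lab in dist_a.keys() | dist_b.keys())
    return (p_o - p_e) / (1 - p_e)

# Two annotators labeling the same eight tokens for NER (hypothetical data).
ann_1 = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC", "I-LOC"]
ann_2 = ["B-PER", "I-PER", "O", "O", "O",     "O", "B-LOC", "I-LOC"]
print(f"kappa = {cohens_kappa(ann_1, ann_2):.2f}")  # kappa = 0.83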
Example
Original text: "John Smith works at Google in San Francisco."
POS annotation:
John/NNP Smith/NNP works/VBZ at/IN Google/NNP in/IN San/NNP Francisco/NNP ./.
NER annotation:
[John Smith]_PERSON works at [Google]_ORG in [San Francisco]_LOCATION .
Dependency annotation (simplified; multi-word names collapsed into single nodes):
works (ROOT)
├─ John Smith (nsubj)
├─ at → Google (prep → pobj)
└─ in → San Francisco (prep → pobj)
Training data: (token, label) pairs for a supervised model
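As a rough sketch of that last step, the code below converts the entity spans from the NER annotation above into per-token BIO labels; the helper function, the abbreviated tag names, and the hand-written token indices are illustrative, not part of any particular annotation tool.

def spans_to_bio(tokens, entities):
    """Turn entity span annotations into (token, BIO label) training pairs.

    tokens:   the sentence as a list of tokens
    entities: (start, end_exclusive, type) tuples over token indices
    """
    labels = ["O"] * len(tokens)
    for start, end, ent_type in entities:
        labels[start] = f"B-{ent_type}"            # first token of the entity
        for i in range(start + 1, end):
            labels[i] = f"I-{ent_type}"            # tokens inside the entity
    return list(zip(tokens, labels))

tokens = ["John", "Smith", "works", "at", "Google", "in", "San", "Francisco", "."]
entities = [(0, 2, "PER"), (4, 5, "ORG"), (6, 8, "LOC")]

for token, label in spans_to_bio(tokens, entities):
    print(f"{token}\t{label}")
# Prints: John/B-PER, Smith/I-PER, works/O, at/O, Google/B-ORG,
#         in/O, San/B-LOC, Francisco/I-LOC, ./O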
Variants and history
Large-scale corpus annotation took off in the 1980s and 1990s with projects such as the Penn Treebank and, later, the CoNLL shared tasks. Inter-annotator agreement studies formalized quality assessment. Crowdsourcing (e.g., Amazon Mechanical Turk, from the mid-2000s) scaled annotation cost-effectively. Active learning strategies select the most informative examples for annotation. Weak supervision and distant supervision reduce the annotation burden. More recently, LLM-based annotation with few-shot prompts offers new possibilities but requires human validation.
When to use it
Annotate corpora for:
- Training supervised NLP models (NER, POS, parsing)
- Creating benchmarks for evaluation
- Domain adaptation (annotate domain-specific text)
- Building datasets for new tasks
- Quality analysis and error analysis
Annotation is expensive because it relies on human effort, but it is essential for supervised learning and rigorous evaluation. Typical cost: roughly $100–1,000 per 1,000 tokens, depending on task complexity.