Corpus Annotation
What it is
Corpus annotation is the process of adding linguistic labels or metadata to raw text. Annotations include part-of-speech (POS) tags, named entities, semantic roles, syntactic dependencies, and discourse relations. Annotated corpora serve as training data for supervised NLP models and as evaluation benchmarks for task performance.
[illustrate: Text with multiple layers of annotation (POS, NER, dependencies); example showing how annotations provide training signal]
How it works
Annotation scheme:
- Define the label set (e.g., B-PER, I-PER, O for NER)
- Document guidelines with examples and edge cases
- Train annotators on the guidelines
Annotation process:
- Assign labels to text units (tokens, spans, or sentences)
- Use multiple annotators per item so reliability can be measured (inter-annotator agreement)
- Adjudicate disagreements to produce a single gold standard (a minimal sketch follows this list)
Quality control:
- Inter-annotator agreement (Cohen's κ for two annotators, Fleiss' κ for more)
- κ ≥ 0.8 is a common threshold for acceptable quality (see the κ sketch after this list)
- Version control and issue tracking for guidelines and annotations
Output: Labeled corpus ready for model training
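The adjudication step can be partially automated. Below is a minimal Python sketch of a naive majority-vote pass that produces a provisional gold label per item and flags ties for an expert adjudicator; real projects combine this with manual review, and the function name, label values, and tie-handling policy here are illustrative assumptions, not a standard API.

from collections import Counter

def majority_adjudicate(label_sets):
    """Naive adjudication sketch: majority vote per item across annotators.

    label_sets: one list of labels per annotator, all aligned to the same items.
    Returns a provisional gold label per item, or None where a tie needs an expert.
    """
    gold = []
    for item_labels in zip(*label_sets):          # labels for one item, all annotators
        ranked = Counter(item_labels).most_common()
        top_label, top_count = ranked[0]
        tie = len(ranked) > 1 and ranked[1][1] == top_count
        gold.append(None if tie else top_label)   # None marks "send to adjudicator"
    return gold

# Three annotators labeling the same four tokens (hypothetical labels).
annotator_labels = [
    ["B-PER", "I-PER", "O", "B-ORG"],
    ["B-PER", "I-PER", "O", "O"],
    ["B-PER", "O",     "O", "B-ORG"],
]
print(majority_adjudicate(annotator_labels))  # ['B-PER', 'I-PER', 'O', 'B-ORG']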
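For the quality-control step, the sketch below computes Cohen's κ for two annotators from scratch so the formula is visible; the token labels are hypothetical, and in practice you would normally use an existing implementation (e.g., scikit-learn's cohen_kappa_score) rather than hand-rolling it.

from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators chose the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each annotator's marginal label distribution.
    dist_a, dist_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((dist_a[lab] / n) * (dist_b[lab] / n)
              for lab in dist_a.keys() | dist_b.keys())
    return (p_o - p_e) / (1 - p_e)

# Two annotators labeling the same eight tokens for NER (hypothetical data).
ann_1 = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC", "I-LOC"]
ann_2 = ["B-PER", "I-PER", "O", "O", "O",     "O", "B-LOC", "I-LOC"]
print(f"kappa = {cohens_kappa(ann_1, ann_2):.2f}")  # kappa = 0.83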
Example
Original text: "John Smith works at Google in San Francisco."
POS annotation:
John/NNP Smith/NNP works/VBZ at/IN Google/NNP in/IN San/NNP Francisco/NNP ./.
NER annotation:
[John Smith]_PERSON works at [Google]_ORG in [San Francisco]_LOCATION .
Dependency annotation (simplified; multi-word names collapsed into single nodes):
works (ROOT)
├─ John Smith (nsubj)
├─ at → Google (prep → pobj)
└─ in → San Francisco (prep → pobj)
Training data: (token, label) pairs for a supervised model
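As a rough sketch of that last step, the code below converts the entity spans from the NER annotation above into per-token BIO labels; the helper function, the abbreviated tag names, and the hand-written token indices are illustrative, not part of any particular annotation tool.

def spans_to_bio(tokens, entities):
    """Turn entity span annotations into (token, BIO label) training pairs.

    tokens:   the sentence as a list of tokens
    entities: (start, end_exclusive, type) tuples over token indices
    """
    labels = ["O"] * len(tokens)
    for start, end, ent_type in entities:
        labels[start] = f"B-{ent_type}"            # first token of the entity
        for i in range(start + 1, end):
            labels[i] = f"I-{ent_type}"            # tokens inside the entity
    return list(zip(tokens, labels))

tokens = ["John", "Smith", "works", "at", "Google", "in", "San", "Francisco", "."]
entities = [(0, 2, "PER"), (4, 5, "ORG"), (6, 8, "LOC")]

for token, label in spans_to_bio(tokens, entities):
    print(f"{token}\t{label}")
# Prints: John/B-PER, Smith/I-PER, works/O, at/O, Google/B-ORG,
#         in/O, San/B-LOC, Francisco/I-LOC, ./O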
Variants and history
Large-scale corpus annotation took off in the 1980s and 1990s with projects such as the Penn Treebank and, later, the CoNLL shared tasks. Inter-annotator agreement studies formalized quality assessment. Crowdsourcing (e.g., Amazon Mechanical Turk, from the mid-2000s) scaled annotation cost-effectively. Active learning strategies select the most informative examples for annotation. Weak supervision and distant supervision reduce the annotation burden. More recently, LLM-based annotation with few-shot prompts offers new possibilities but requires human validation.
When to use it
Annotate corpora for:
- Training supervised NLP models (NER, POS, parsing)
- Creating benchmarks for evaluation
- Domain adaptation (annotate domain-specific text)
- Building datasets for new tasks
- Quality analysis and error analysis
Annotation is expensive because it relies on human effort, but it is essential for supervised learning and rigorous evaluation. Typical cost: roughly $100–1,000 per 1,000 tokens, depending on task complexity.