Coreference Resolution
What it is
Coreference resolution identifies and links all mentions of the same entity in a document. For example, “John”, “he”, and “the CEO” might all refer to the same person. The output is a partition of mentions into coreference clusters, one per unique entity. Coreference is critical for understanding document-level meaning and for downstream tasks like summarization and QA.
[illustrate: Text with mentions colored by coreference cluster; pronouns linked to antecedents via arcs]
How it works
- Mention detection: identify candidate mentions (usually noun phrases and pronouns)
  - "John Smith", "he", "the CEO", "John"
- Linking: determine which mentions refer to the same entity
  - Pairwise linking: score all pairs (mention_i, mention_j); cluster based on scores
  - Global clustering: jointly cluster all mentions
- Neural approaches:
  - BiLSTM or BERT encoder for mention representations
  - Pairwise classifier: P(coreferent | mention_i, mention_j)
  - Beam search or agglomerative clustering to form clusters
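The pipeline above can be sketched end to end. In this minimal sketch, a toy string-match/gender heuristic (`pair_score`, with a tiny hand-made gender lexicon) stands in for the neural pairwise classifier P(coreferent | mention_i, mention_j), and union-find merges pairwise links into clusters; both the heuristic and the lexicon are illustrative assumptions, not a real model.

```python
PRONOUNS = {"he", "him", "his", "she", "her", "it", "they", "them"}
# Toy gender lexicon -- an illustrative assumption, not part of any real system.
GENDER = {"john": "m", "he": "m", "him": "m", "mary": "f", "she": "f", "her": "f"}

def pair_score(antecedent: str, mention: str) -> float:
    """Toy stand-in for a neural classifier P(coreferent | m_i, m_j)."""
    a, m = antecedent.lower(), mention.lower()
    if a == m and a not in PRONOUNS:
        return 1.0                       # exact string match between NPs
    if m in PRONOUNS and a not in PRONOUNS and GENDER.get(a) == GENDER.get(m):
        return 0.5                       # pronoun linked to a compatible NP
    return 0.0

def resolve(mentions: list[str]) -> list[set[int]]:
    """Link each mention to its best-scoring earlier antecedent, then
    merge the pairwise links into clusters with union-find."""
    parent = list(range(len(mentions)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for j in range(1, len(mentions)):
        # Best-scoring antecedent among earlier mentions; ties go to the nearest.
        best = max(range(j), key=lambda i: (pair_score(mentions[i], mentions[j]), i))
        if pair_score(mentions[best], mentions[j]) > 0.0:
            parent[find(best)] = find(j)

    clusters: dict[int, set[int]] = {}
    for i in range(len(mentions)):
        clusters.setdefault(find(i), set()).add(i)
    return list(clusters.values())

mentions = ["John", "Mary", "He", "her", "John"]
print([sorted(c) for c in resolve(mentions)])  # → [[0, 2, 4], [1, 3]]
```

A real system replaces `pair_score` with a learned scorer over span embeddings; the greedy nearest-antecedent linking above mirrors the closest-first strategy of classic mention-pair models.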
Example
Text:
"John met Mary. He gave her a gift. The gift was expensive.
Mary appreciated it. She thanked John."
Coreference clusters:
- Cluster 1 (John): "John" (sent 1), "He" (sent 2), "John" (sent 5)
- Cluster 2 (Mary): "Mary" (sent 1), "her" (sent 2), "She" (sent 5)
- Cluster 3 (gift): "a gift" (sent 2), "The gift" (sent 3), "it" (sent 4)
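The clusters above map directly onto a simple data structure: each mention is a (sentence index, text) pair, and a cluster's canonical name can be taken as its first non-pronoun mention. A minimal sketch (the cluster names and pronoun list are illustrative):

```python
PRONOUNS = {"he", "him", "his", "she", "her", "it", "they", "them"}

# The example's coreference partition: each mention belongs to exactly one cluster.
clusters = {
    "cluster_1": [(1, "John"), (2, "He"), (5, "John")],
    "cluster_2": [(1, "Mary"), (2, "her"), (5, "She")],
    "cluster_3": [(2, "a gift"), (3, "The gift"), (4, "it")],
}

def canonical(mentions: list[tuple[int, str]]) -> str:
    """Return the first mention that is not a bare pronoun."""
    for _, text in mentions:
        if text.lower() not in PRONOUNS:
            return text
    return mentions[0][1]

for name, mentions in clusters.items():
    print(name, "->", canonical(mentions))
# cluster_1 -> John
# cluster_2 -> Mary
# cluster_3 -> a gift
```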
Variants and history
Coreference resolution dates to the 1990s with rule-based systems. Mention-pair models (2004+) trained pairwise classifiers. Entity-mention models score mentions against entity representations. Span-based models (Lee et al., 2017) directly score mention spans without pre-detected mentions. Contextualized embeddings (BERT, 2018+) improved performance. Modern systems achieve 75–80% F1 on the OntoNotes benchmark.
When to use it
Use coreference resolution for:
- Document understanding and question-answering
- Summarization (tracking who did what)
- Information extraction (entity-centric)
- Machine translation (pronoun handling)
- Dialogue systems (tracking referents)
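To see why these downstream tasks benefit, note that once clusters are known, pronouns can be rewritten to their canonical antecedents so each sentence stands on its own. The sketch below applies a pronoun-to-antecedent mapping derived by hand from the example clusters; a real system would pick each cluster's canonical (usually first non-pronoun) mention automatically.

```python
import re

sentences = [
    "John met Mary.",
    "He gave her a gift.",
    "The gift was expensive.",
    "Mary appreciated it.",
    "She thanked John.",
]

# Hand-derived from the example clusters above (illustrative, not computed).
antecedent = {"He": "John", "her": "Mary", "it": "the gift", "She": "Mary"}

def decontextualize(sent: str) -> str:
    """Replace each pronoun with its antecedent, matching whole words only."""
    for pronoun, name in antecedent.items():
        sent = re.sub(rf"\b{pronoun}\b", name, sent)
    return sent

for s in sentences:
    print(decontextualize(s))
# "He gave her a gift." → "John gave Mary a gift."
```

After substitution, a sentence-level summarizer or QA system can answer "Who gave Mary a gift?" from sentence 2 alone, without document context.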
Coreference is challenging; performance drops significantly on out-of-domain text. Joint approaches (NER + coreference) sometimes improve results.