Relevance Judgement

What it is

A relevance judgement is a human annotation indicating whether a document is relevant to a given query. Relevance judgements serve as ground truth for evaluating IR systems, enabling calculation of metrics (precision, recall, MAP, NDCG). Judgements are typically binary (relevant/irrelevant) or graded (not relevant, somewhat relevant, highly relevant). Large collections of judgements form evaluation benchmarks (TREC, CLEF).

[illustrate: Query with set of documents; annotators rating relevance; collection of judgements used for evaluation]
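In practice, judgements are often distributed as qrels files; TREC uses one line per judged pair in the form query_id iteration doc_id relevance. A minimal sketch of loading such a file into a per-query lookup (the file name and IDs are hypothetical):

    # Load TREC-style qrels lines ("query_id iteration doc_id relevance")
    # into a nested dict {query_id: {doc_id: grade}}.
    from collections import defaultdict

    def load_qrels(path):
        qrels = defaultdict(dict)
        with open(path) as f:
            for line in f:
                qid, _, docid, rel = line.split()
                qrels[qid][docid] = int(rel)
        return qrels

    # qrels = load_qrels("qrels.trec")            # hypothetical file
    # qrels["101"] -> {"doc42": 2, "doc87": 0, ...}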

How it works

  1. Binary judgements:

    • Relevant (1) or irrelevant (0)
    • Simple, low annotator disagreement
    • Used in many benchmarks (e.g., early TREC tracks)
  2. Graded judgements:

    • Scale: 0 (irrelevant) to N (perfectly relevant)
    • Common: 0–4 or 0–2 scale
    • Enables NDCG and similar metrics
  3. Annotation process:

    • Assessors review documents for queries
    • Provide relevance scores
    • Multiple assessors for reliability
    • Majority vote or average on disagreements
  4. Agreement metrics:

    • Cohen’s κ, Fleiss’ κ
    • Typically κ ≥ 0.6–0.8 indicates acceptable agreement (a worked sketch follows this list)
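
As a concrete illustration of step 4, a minimal sketch of Cohen's κ for two assessors' binary judgements on the same query-document pairs (the label lists are made up for illustration):

    # Cohen's kappa for two assessors judging the same query-document pairs.
    from collections import Counter

    def cohen_kappa(labels_a, labels_b):
        n = len(labels_a)
        observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        # Chance agreement, from each assessor's own label distribution
        freq_a, freq_b = Counter(labels_a), Counter(labels_b)
        expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
        return (observed - expected) / (1 - expected)

    assessor_1 = [1, 0, 1, 0, 1, 1, 0, 0]
    assessor_2 = [1, 0, 1, 1, 1, 0, 0, 0]
    print(cohen_kappa(assessor_1, assessor_2))   # 0.5 -- below the usual 0.6-0.8 bar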

Example

Query: "best machine learning libraries"

Documents to judge:
1. Scikit-learn tutorial    → judgement: relevant (1)
2. Restaurant menu          → judgement: not relevant (0)
3. TensorFlow guide         → judgement: relevant (1)
4. Sports news              → judgement: not relevant (0)
5. PyTorch documentation    → judgement: relevant (1)

Binary judgements: {1, 0, 1, 0, 1}
Graded (0–2): {2, 0, 2, 0, 2}

These judgements enable:
- Calculating precision/recall
- Evaluating ranking quality (MAP, NDCG)
- Benchmarking IR systems
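
A minimal sketch of how these judgements feed metric computation, using the binary and graded labels above; it assumes the ranking is simply the order listed and uses the linear-gain form of DCG:

    import math

    binary = [1, 0, 1, 0, 1]   # binary judgements, in ranked order
    graded = [2, 0, 2, 0, 2]   # graded judgements (0-2 scale)

    precision_at_5 = sum(binary) / len(binary)          # 3/5 = 0.6

    def dcg(gains):
        # Linear-gain DCG: each gain discounted by log2(rank + 1), ranks 1-based
        return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

    ndcg_at_5 = dcg(graded) / dcg(sorted(graded, reverse=True))   # ~0.89

    print(precision_at_5, ndcg_at_5)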

Variants and history

The tradition of relevance judgement dates back to early IR research and was standardized at scale by TREC (from 1992), which produces on the order of 100k judgements per year across its tasks. Graded judgements (e.g., a 0–3 scale), which support NDCG, became standard. Crowdsourced judgements (Amazon Mechanical Turk, 2010s) reduced cost but required quality control. Weak supervision from click data and other implicit relevance signals provides a noisy alternative to explicit annotation. Human evaluation remains the gold standard but is expensive.

When to use it

Collect relevance judgements for:

  • Creating evaluation benchmarks
  • System evaluation (comparing algorithms)
  • Fine-tuning ranking models
  • Assessing user satisfaction
  • Building ground truth for new domains

Relevance judgement collection is expensive (~$50–200 per query-document pair). The cost-benefit trade-off: a substantial upfront investment yields a reusable benchmark.

See also