Mean Average Precision
What it is
Mean Average Precision (MAP) is the mean of the Average Precision (AP) scores over a set of queries, giving a single number for system evaluation. For each query, AP summarizes the precision-recall curve by averaging the precision at every rank where a relevant document appears. Because it reflects precision, recall, and ranking order in one value, MAP is widely used for comparing IR systems.
[illustrate: Precision-recall curve for single query showing AP as area; multiple queries showing MAP calculation]
How it works
Average Precision (single query):
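AP = Σ P(k) × Δrecall(k), summed over all ranks k in the result list
Where P(k) = precision at rank k; Δrecall(k) = change in recall from rank k−1 to rank k
Simplified (sum over relevant positions): Δrecall(k) is 1 / # relevant when the document at rank k is relevant and 0 otherwise, so
AP = (1 / # relevant) × Σ P(k) for all k where the document at rank k is relevant
Here # relevant is the total number of relevant documents for the query, retrieved or not, so relevant documents that are never retrieved contribute zero.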
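The simplified form can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `average_precision`, the boolean-list input, and the `num_relevant` argument are assumptions made for this example.

```python
from typing import Sequence


def average_precision(relevance: Sequence[bool], num_relevant: int) -> float:
    """AP for one ranked list.

    relevance[i] is True if the document at rank i + 1 is relevant.
    num_relevant is the total number of relevant documents for the query
    (an assumed input; it may exceed the number of relevant docs retrieved).
    """
    if num_relevant <= 0:
        return 0.0  # convention chosen for this sketch
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # P(k) at a relevant rank
    return precision_sum / num_relevant
```

Passing the total number of relevant documents, rather than counting the relevant entries in the list, is what penalizes a system for relevant documents it never returns.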
Mean Average Precision (multiple queries):
MAP = (1/Q) × Σ AP_q over all Q queries
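As a small sketch of the averaging step, assuming the per-query AP values have already been computed (the function name `mean_average_precision` is hypothetical):

```python
from typing import Sequence


def mean_average_precision(ap_scores: Sequence[float]) -> float:
    """Unweighted mean of per-query AP values."""
    if not ap_scores:
        raise ValueError("need at least one query")
    return sum(ap_scores) / len(ap_scores)
```

Each query carries the same weight regardless of how many relevant documents it has, so a query with AP = 0 pulls the mean down as much as any other query.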
Properties:
- Range: 0–1
- Weights highly ranked relevant docs more than lower-ranked
- Rewards both precision and recall: relevant docs that are never retrieved add nothing to the sum
- Single-number summary for system comparison
Example
Query 1: 10 relevant docs in the collection, 5 of them retrieved in the top 10
Retrieved, in rank order (R = relevant, I = irrelevant): {R, R, I, R, I, R, I, I, I, R}
Positions of relevant: {1, 2, 4, 6, 10}
P@1 = 1/1 = 1.0, recall = 1/10
P@2 = 2/2 = 1.0, recall = 2/10
P@4 = 3/4 = 0.75, recall = 3/10
P@6 = 4/6 = 0.67, recall = 4/10
P@10 = 5/10 = 0.5, recall = 5/10
AP = (1.0 + 1.0 + 0.75 + 0.67 + 0.5) / 10 = 0.39
Query 2: 5 relevant docs
(Similar calculation...)
AP_2 = 0.60
MAP = (0.39 + 0.60) / 2 = 0.495
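The arithmetic for Query 1 can be checked with a short script. This sketch uses the P(k) × Δrecall(k) form to show that it matches the sum over relevant positions; the variable names are illustrative, and AP_2 = 0.60 is taken as given because its ranking is not shown.

```python
# Query 1 from the example: R = relevant, I = irrelevant, in rank order.
ranking = ["R", "R", "I", "R", "I", "R", "I", "I", "I", "R"]
total_relevant = 10  # relevant docs for the query, per the example

ap_1 = 0.0
hits = 0
prev_recall = 0.0
for rank, label in enumerate(ranking, start=1):
    if label == "R":
        hits += 1
    precision = hits / rank
    recall = hits / total_relevant
    # Delta-recall is zero at irrelevant ranks, so they add nothing.
    ap_1 += precision * (recall - prev_recall)
    prev_recall = recall

ap_2 = 0.60  # given in the example, not recomputed here
map_score = (ap_1 + ap_2) / 2

print(round(ap_1, 2), round(map_score, 3))
```

Carrying the unrounded AP (about 0.3917) through gives a MAP of about 0.496; the 0.495 in the example comes from rounding AP to 0.39 before averaging.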
Variants and history
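Average Precision emerged from the TREC conferences (1990s) as a measure for IR evaluation, and MAP became standard in academia and industry. Interpolated AP replaces the precision at each recall level with the maximum precision observed at that or any higher recall level, which smooths the precision-recall curve; non-interpolated AP uses the raw precision values, so the two generally give different scores. NDCG, which supports graded relevance and applies an explicit position discount, gained popularity, especially in recommendation. MAP@k truncates each ranking at rank k so that only the top-k results count. Modern evaluations often report both MAP and NDCG.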
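To make the interpolated variant concrete, here is a sketch of the 11-point scheme, one common form of interpolated AP in TREC-style evaluation. Whether this is the exact variant meant above is an assumption, and the function name `eleven_point_interpolated_ap` is hypothetical.

```python
from typing import List, Sequence, Tuple


def eleven_point_interpolated_ap(relevance: Sequence[bool], num_relevant: int) -> float:
    """Mean of interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    if num_relevant <= 0:
        return 0.0  # convention chosen for this sketch
    # (hits so far, precision) at each rank holding a relevant document
    points: List[Tuple[int, float]] = []
    hits = 0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            points.append((hits, hits / rank))
    total = 0.0
    for level in range(11):  # recall level = level / 10
        # Interpolated precision: best precision at any recall >= this level.
        # The test hits / num_relevant >= level / 10 is done in integers.
        candidates = [p for h, p in points if h * 10 >= level * num_relevant]
        total += max(candidates, default=0.0)
    return total / 11
```

On the Query 1 ranking from the Example section this gives roughly 0.447, compared with 0.39 for non-interpolated AP, because each recall level up to 0.5 takes the best precision reachable at or beyond it.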
When to use it
Use MAP when:
- Comparing IR systems across multiple queries
- Balancing precision and recall in one metric
- Ranking quality matters (higher ranks count more)
- Need single number for system comparison
- Standard evaluation benchmark expected
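MAP is interpretable and standard. Limitations: it assumes binary relevance, it does not treat unjudged documents specially (they are usually counted as non-relevant), and it does not model strongly top-heavy ranking preferences, where NDCG is often the better fit.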