Mean Average Precision

What it is

Mean Average Precision (MAP) is the average precision score across multiple queries, providing a single number for system evaluation. For each query, Average Precision (AP) integrates precision over recall levels. MAP combines precision, recall, and ranking order into one metric, making it ideal for comparing IR systems.

[illustrate: Precision-recall curve for single query showing AP as area; multiple queries showing MAP calculation]

How it works

Average Precision (single query):

AP = Σ P(k) × Δrecall(k)

Where P(k) = precision at rank k; Δrecall(k) = change in recall at rank k

Simplified (sum over relevant positions):

AP = Σ(P(k) / # relevant) for all k where rank k is relevant

Mean Average Precision (multiple queries):

MAP = (1/Q) × Σ AP_q  over all Q queries

Properties:

Range: 0–1
Weights highly ranked relevant docs more than lower-ranked
Balances precision and recall naturally
Single-number summary for system comparison

Example

Query 1: 10 relevant docs
Retrieved: {R, R, I, R, I, R, I, I, I, R}
Positions of relevant: {1, 2, 4, 6, 10}

P@1 = 1/1 = 1.0, recall = 1/10
P@2 = 2/2 = 1.0, recall = 2/10
P@4 = 3/4 = 0.75, recall = 3/10
P@6 = 4/6 = 0.67, recall = 4/10
P@10 = 5/10 = 0.5, recall = 5/10

AP = (1.0 + 1.0 + 0.75 + 0.67 + 0.5) / 10 = 0.39

Query 2: 5 relevant docs
(Similar calculation...)
AP_2 = 0.60

MAP = (0.39 + 0.60) / 2 = 0.495

Variants and history

Average Precision emerged from TREC conferences (1990s) for IR evaluation. MAP became standard in academia and industry. Interpolated AP and non-interpolated AP differ slightly. NDCG (ranking-aware) gained popularity, especially in recommendation. MAP@k focuses on top-k results. Modern evaluations often report both MAP and NDCG.

When to use it

Use MAP when:

Comparing IR systems across multiple queries
Balancing precision and recall in one metric
Ranking quality matters (higher ranks count more)
Need single number for system comparison
Standard evaluation benchmark expected

MAP is interpretable and standard. Limitation: doesn’t account for unjudged documents or very top-heavy ranking preferences (where NDCG better).