Measuring Search Effectiveness by Hand

Why you need a baseline before you touch anything

Every search system eventually gets tuned. Someone complains that “bluetooth speaker” returns a page of cables, a product manager asks why competitors rank higher on a key term, or a new stemmer gets proposed and nobody is sure whether it helps or hurts. Without a structured way to measure the before and after, tuning is guesswork — you fix the three queries in front of you and silently break twelve others.

A relevance judgement set turns the vague question “is search better now?” into a number you can track across every change you make.

The classical solution comes from the Cranfield experiments of the 1960s, which established the still-dominant paradigm for IR evaluation: assemble a fixed collection of documents, write a set of queries, have humans judge which documents are relevant to each query, and then measure how well a retrieval system recovers those judgements. The Cranfield paradigm underpins modern evaluation campaigns like TREC, CLEF, and BEIR. Voorhees & Harman (2005) give the full history. You do not need a research lab to apply the same idea. A spreadsheet, an afternoon of careful thought, and twenty to fifty queries is enough to build a meaningful baseline for a production search system.


What goes in the spreadsheet

A relevance judgement set has three ingredients: a query set, a document pool, and relevance grades connecting them.

The query set

Aim for twenty to fifty queries that collectively represent the real traffic your system handles. Pull them from three sources:

Your query logs — look at your most frequent queries and your most common zero-result queries. High-frequency queries matter because they affect the most users; zero-result queries reveal vocabulary mismatches the system cannot bridge. If you do not have query logs yet, make your best guess and revisit once you have data.

Failure cases — queries you already know produce bad results. These are the motivating cases for any tuning work and they must be in the set so you can confirm they actually improve.

Adversarial probes — queries designed to stress the system’s weaknesses. If your corpus is dense with acronyms, write queries using both the acronym and the spelled-out form. If you support multiple languages or dialects, include cross-lingual queries. If users often phrase things as questions, include a few of those.

Thirty queries that cover your real failure modes will teach you more than three hundred queries sampled at random from logs.

Assign each query a short ID (Q01, Q02, …) and a human-readable label. Keep the label brief — it exists to remind you what the query is about when you return to the spreadsheet months later. Domain-specific query sets are more diagnostic than generic ones. A query like “transformer” means something very different in an NLP corpus than in an electrical engineering one — annotate intent where ambiguity exists.

The document pool

For each query, you need to decide which documents are worth judging. In a small corpus (under a few thousand documents) you can judge every document exhaustively. In a larger corpus, the standard practice is pooling: run several different retrieval configurations and take the union of their top-ten results. Any document that no configuration surfaces is assumed non-relevant. This gives you a tractable set of fifty to one hundred documents per query to judge, even over millions of records.
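As a concrete sketch of pooling, the helper below takes a query and a list of retrieval configurations, where each configuration is assumed to be a callable that returns a ranked list of document IDs (the configuration names in the usage comment are hypothetical):

    def build_pool(query, configurations, depth=10):
        """Union of the top-`depth` results from each retrieval configuration."""
        pool = set()
        for search_fn in configurations:   # each configuration: query -> ranked list of doc IDs
            pool.update(search_fn(query)[:depth])
        return sorted(pool)                # a stable order keeps the judging spreadsheet reproducible

    # Hypothetical usage with three configurations you want represented in the pool:
    # pool = build_pool("waterproof hiking boot", [bm25_search, bm25_tuned_search, vector_search])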

Record documents by a stable identifier — a database ID, a slug, or a SKU. Titles are useful for readability but they change; IDs do not.

Relevance grades

The simplest grading scale is binary: relevant (1) or not relevant (0). Binary grades are fast to assign and sufficient for computing Precision@K and MRR. If you want to compute NDCG — which rewards surfacing the most relevant documents at the top, not just any relevant document — you need a graded scale. A four-point scale works well in practice:

Grade  Label            Meaning
3      Perfect          Exactly answers this query; the user would stop searching
2      Highly relevant  Substantially answers the query with minor gaps
1      Relevant         Related and useful, but not the best answer
0      Not relevant     Off-topic or misleading

Resist the temptation to add more grades. The difference between a 2 and a 3 is already a judgement call — splitting it further introduces noise without improving the measurement. Inter-annotator agreement tends to drop sharply above four relevance levels. Sanderson (2010) found that binary labels have the highest consistency between independent judges.


The spreadsheet layout

A flat table with one row per query-document pair is the most flexible format and easiest to work with programmatically.

query_id  query_text                doc_id  doc_title                                  grade  notes
Q01       waterproof hiking boot    P001    Merrell Moab 3 Waterproof Hiking Boot      3      exact match
Q01       waterproof hiking boot    P002    Columbia Newton Ridge — water resistant    1      water resistant ≠ waterproof
Q01       waterproof hiking boot    P047    Therm-a-Rest sleeping pad                  0      wrong category
Q02       something dry for trails  P001    Merrell Moab 3 Waterproof Hiking Boot      3      semantic match
Q02       something dry for trails  P002    Columbia Newton Ridge — water resistant    2      plausible but weaker

The notes column is important. It records the reasoning behind a grade at the time you made it, so that six months later — when you or a colleague revisit the set — the grades are reproducible. Notes also surface disagreements: if two annotators would grade the same pair differently, the note exposes why, and you can resolve the policy before it corrupts your scores.

Store the spreadsheet in version control alongside your search configuration. Every time you retune a parameter or change an analyser, commit the resulting scores alongside the code change. The spreadsheet becomes a ledger of what was tried and what it cost or gained. CSV or TSV is preferable to a proprietary spreadsheet format because it keeps version-control diffs readable. A .xlsx file committed to git is an opaque binary blob.
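If the judgements live in a flat CSV like the one above, a few lines of Python turn them into a lookup you can score against. This sketch assumes the column names from the example layout; adjust them to match your own file:

    import csv
    from collections import defaultdict

    def load_judgements(path):
        """Map query_id -> {doc_id: grade} from a flat judgement CSV."""
        grades = defaultdict(dict)
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                grades[row["query_id"]][row["doc_id"]] = int(row["grade"])
        return dict(grades)

    # judgements = load_judgements("relevance_judgements.csv")   # hypothetical filename
    # judgements["Q01"]  ->  {"P001": 3, "P002": 1, "P047": 0}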


Computing the scores

With a graded judgement set in hand, three metrics cover most practical needs.

Precision@K

For a given query, Precision@K is the fraction of the top K results that are relevant (grade ≥ 1). It answers the question: of the first K things the user sees, how many are worth their time?

P@5 = (relevant results in top 5) / 5

P@5 and P@10 are the most common choices. P@5 reflects the behaviour of users who rarely scroll past the first page; P@10 is useful when the task involves comparing several options before deciding.
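A minimal implementation, treating any unjudged document as non-relevant (the usual assumption under pooling):

    def precision_at_k(ranked_doc_ids, grades, k=5):
        """Fraction of the top-k results with grade >= 1; unjudged documents count as grade 0."""
        top_k = ranked_doc_ids[:k]
        relevant = sum(1 for doc_id in top_k if grades.get(doc_id, 0) >= 1)
        return relevant / k

    # precision_at_k(["P001", "P047", "P002"], {"P001": 3, "P002": 1, "P047": 0}, k=3)  ->  2/3 ≈ 0.67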

MRR — Mean Reciprocal Rank

MRR measures how far down the list the user has to scroll before finding any relevant result. For each query, the Reciprocal Rank is 1 / rank_of_first_relevant_result. MRR is the mean of that value across all queries.

MRR = (1/N) × Σ (1 / rank_of_first_relevant)

MRR is the right metric for navigational queries and lookup tasks where the user wants exactly one thing. If the correct document is at rank 1, RR is 1.0; at rank 2 it is 0.5; at rank 5 it is 0.2. A system with MRR < 0.5 is routinely burying the correct answer below the fold.
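A direct translation of the formula, reusing the grade dictionaries from the judgement set; the convention sketched here scores a query with no relevant result as 0:

    def reciprocal_rank(ranked_doc_ids, grades):
        """1 / rank of the first result with grade >= 1, or 0.0 if nothing relevant was returned."""
        for rank, doc_id in enumerate(ranked_doc_ids, start=1):
            if grades.get(doc_id, 0) >= 1:
                return 1.0 / rank
        return 0.0

    def mean_reciprocal_rank(runs, judgements):
        """runs: query_id -> ranked doc IDs; judgements: query_id -> {doc_id: grade}."""
        return sum(reciprocal_rank(docs, judgements[qid]) for qid, docs in runs.items()) / len(runs)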

NDCG — Normalised Discounted Cumulative Gain

NDCG is the metric that rewards getting the best results to the very top, not merely getting any relevant result somewhere in the list. It uses your graded relevance scores and applies a logarithmic discount to lower-ranked results — finding the grade-3 document at rank 1 contributes far more to the score than finding it at rank 5.

DCG@K = Σ (2^grade_i - 1) / log2(i + 1)   for i = 1 to K
NDCG@K = DCG@K / IDCG@K

IDCG is the ideal DCG — the score you would get if the results were ranked in perfect order from highest to lowest grade. Dividing by IDCG normalises the score to [0, 1] so you can compare across queries of different lengths and difficulty. NDCG@10 is the standard reporting metric in academic IR evaluation, and most commercial search teams report it as well. NDCG was introduced by Järvelin & Kekäläinen (2002). Their original formulation used a different discount function; the (2^grade - 1) / log2(rank + 1) variant is the one that became the industry standard.
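A sketch using the same grade dictionaries as before, with the ideal ranking computed from all judged documents for the query:

    import math

    def dcg_at_k(ranked_doc_ids, grades, k=10):
        """DCG with the (2^grade - 1) / log2(rank + 1) formulation described above."""
        return sum(
            (2 ** grades.get(doc_id, 0) - 1) / math.log2(rank + 1)
            for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1)
        )

    def ndcg_at_k(ranked_doc_ids, grades, k=10):
        """DCG normalised by the DCG of a perfect ordering of the judged documents."""
        ideal_grades = sorted(grades.values(), reverse=True)[:k]
        idcg = sum((2 ** g - 1) / math.log2(rank + 1) for rank, g in enumerate(ideal_grades, start=1))
        return dcg_at_k(ranked_doc_ids, grades, k) / idcg if idcg > 0 else 0.0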


Using the baseline to guide tuning

Once you have a scored baseline, the process for any change is:

  1. Make the change to your search configuration — adjust a field boost, switch an analyser, tune a BM25 b or k1 parameter, add a synonym filter.
  2. Re-run your retrieval system against all queries in the judgement set.
  3. Compute P@K, MRR, and NDCG@10 for the new configuration.
  4. Compare to the baseline — not just the aggregate score, but query by query.

That last step is where the real insight lives. An aggregate NDCG that moves from 0.71 to 0.73 sounds like a win, but you need to confirm it is not driven by improvement on two easy queries while ten others quietly regressed. Sort the per-query delta and look at the bottom: the queries that got worse after your change are telling you something about its failure mode.

Aggregate scores hide regressions. Always inspect the per-query delta before declaring a change an improvement.

A useful heuristic: treat any query where NDCG drops by more than 0.1 as a regression that requires explanation, even if the aggregate goes up. You either accept the regression (document why), fix it (adjust your change), or revise the judgement grades if you decide they were wrong.
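A small sketch of that per-query comparison, assuming you have already computed per-query NDCG@10 for the baseline and the candidate configuration:

    def per_query_delta(baseline, candidate, threshold=0.1):
        """Print per-query NDCG deltas, worst first, flagging drops larger than the threshold."""
        deltas = {qid: candidate[qid] - baseline[qid] for qid in baseline}
        for qid, delta in sorted(deltas.items(), key=lambda item: item[1]):
            flag = "  <-- regression, explain or fix" if delta < -threshold else ""
            print(f"{qid}: {baseline[qid]:.3f} -> {candidate[qid]:.3f} ({delta:+.3f}){flag}")
        return deltas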


Practical tips for annotation

Annotate queries in isolation — do not look at the system’s actual ranked output while grading. If you grade relative to what the current system returns, you will anchor your grades to its biases and the judgement set will not catch its blind spots.

Time-box each judgement — allow yourself two to three minutes per query-document pair. Longer than that and you are overthinking it; shorter and you risk inconsistency.

Use a second annotator on a sample — ideally annotate 20% of your query-document pairs with a second person and measure agreement (Cohen’s κ or simple percent agreement). If agreement is below 70%, your grade definitions need tightening. Disagreements on individual pairs are not a failure — they surface the ambiguous cases that need an explicit policy. For practical search evaluation, a κ above 0.6 is generally considered acceptable. Below 0.4, the judgements are too noisy to trust as a measurement instrument. A short sketch of both agreement measures follows these tips.

Keep the set stable — once a judgement set is your baseline, do not quietly change grades or add queries without recording the fact. If your understanding of relevance evolves (it will), create a new version of the set and re-score all historical configurations against it so trends remain comparable.

Start small and iterate — twenty queries graded carefully is more useful than two hundred queries graded carelessly. You can always expand the set later.
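The agreement check mentioned above can be computed directly from the doubly annotated sample. A minimal sketch, where each dictionary maps a (query_id, doc_id) pair to the grade one annotator assigned:

    def percent_agreement(grades_a, grades_b):
        """Share of pairs where both annotators gave exactly the same grade."""
        keys = grades_a.keys() & grades_b.keys()
        return sum(grades_a[k] == grades_b[k] for k in keys) / len(keys)

    def cohens_kappa(grades_a, grades_b):
        """Cohen's kappa: observed agreement corrected for agreement expected by chance."""
        keys = list(grades_a.keys() & grades_b.keys())
        n = len(keys)
        observed = sum(grades_a[k] == grades_b[k] for k in keys) / n
        labels = {grades_a[k] for k in keys} | {grades_b[k] for k in keys}
        expected = sum(
            (sum(grades_a[k] == label for k in keys) / n) *
            (sum(grades_b[k] == label for k in keys) / n)
            for label in labels
        )
        return (observed - expected) / (1 - expected) if expected < 1 else 1.0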


Further reading

  • Learning to Rank — once you have relevance judgements you can use them as training signal for a learning-to-rank model, replacing hand-tuned boosts with a model that optimises directly for NDCG
  • F1 Score — precision and recall unified; useful when you need a single number that balances both
  • Query Expansion — one of the most common interventions you will want to measure against your baseline
  • BM25 — the scoring function whose parameters (k1, b) are the most common tuning knobs