Evaluation
-
ROUGE
Recall-Oriented Understudy for Gisting Evaluation; n-gram overlap metric for summarization and paraphrase evaluation.
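As a sketch, ROUGE-1 can be computed as clipped unigram overlap (real ROUGE implementations add stemming, multiple references, and longer n-grams; the sentences below are illustrative):

```python
from collections import Counter

def rouge1(candidate, reference):
    """Toy ROUGE-1: unigram overlap between candidate and reference.
    Returns (precision, recall, f1)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    p = overlap / max(sum(cand.values()), 1)
    r = overlap / max(sum(ref.values()), 1)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

p, r, f = rouge1("the cat sat on the mat", "the cat lay on the mat")
```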
-
Relevance Judgement
Human annotation of query-document relevance; provides ground truth for IR system evaluation.
-
Recall
Fraction of relevant documents that are retrieved; measures completeness of retrieval; high recall indicates few false negatives.
-
Precision
Fraction of retrieved documents that are relevant; measures quality of retrieved set; high precision indicates few false positives.
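Precision and recall fall out of the same true-positive count; a minimal set-based sketch for one query (document ids are illustrative):

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)  # true positives: relevant docs we found
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

p, r = precision_recall(retrieved=["d1", "d2", "d3", "d4"],
                        relevant=["d1", "d3", "d5"])
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall 2/3).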
-
Precision at k
Precision measured over top k results; practical metric reflecting user experience when viewing limited results.
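A sketch of precision@k over a single ranked list (ids are illustrative):

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked results that are relevant."""
    relevant = set(relevant)
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

# One relevant document (d1) appears in the top 3 -> 1/3.
p_at_3 = precision_at_k(["d3", "d1", "d7", "d2", "d9"], {"d1", "d2", "d5"}, k=3)
```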
-
Perplexity
Exponentiated negative average log probability; measures how well a language model predicts a sample. Lower is better.
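The definition above translates directly into code; given per-token log probabilities from a model:

```python
import math

def perplexity(log_probs):
    """Exp of the negative mean log probability per token.
    `log_probs` are natural-log token probabilities from the model."""
    return math.exp(-sum(log_probs) / len(log_probs))

# A model assigning uniform probability 1/4 to every token
# has a perplexity of exactly 4.
ppl = perplexity([math.log(0.25)] * 10)
```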
-
NDCG
Normalized Discounted Cumulative Gain; ranking quality metric that discounts the gain of each relevant result by its position, so hits near the top count more; normalized by the ideal ordering so scores fall in [0, 1].
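A sketch using the common log2 position discount (relevance gains below are illustrative):

```python
import math

def dcg(gains):
    """Discounted cumulative gain: gain at rank i divided by log2(i + 2)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(gains):
    """Normalize by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal else 0.0

# Graded relevance of results in ranked order; swapping ranks 3 and 4
# costs a little NDCG because a relevant doc sits lower than it should.
quality = ndcg([3, 2, 0, 1])
```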
-
Mean Reciprocal Rank
Mean of the reciprocal rank of the first relevant result across queries; measures how quickly a system surfaces its first correct answer. Abbreviated MRR.
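A minimal sketch, taking for each query the 1-based rank of its first relevant hit (`None` when nothing relevant was returned):

```python
def mrr(first_relevant_ranks):
    """Mean reciprocal rank; ranks are 1-based, None means no relevant hit."""
    return sum(0.0 if r is None else 1.0 / r
               for r in first_relevant_ranks) / len(first_relevant_ranks)

# Four queries: hits at ranks 1, 3, and 2; one query with no hit.
score = mrr([1, 3, None, 2])
```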
-
Mean Average Precision
Mean of average precision scores across queries; standard evaluation metric balancing precision and ranking quality. Abbreviated MAP.
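A sketch of binary-relevance MAP: average precision per query averages precision@k at each rank where a relevant document appears, and MAP averages that over queries (ids are illustrative):

```python
def average_precision(ranked, relevant):
    """Mean of precision@k at each rank k holding a relevant document."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """`runs` is a list of (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

ap = average_precision(["d1", "d2", "d3"], {"d1", "d3"})
```

For that single query, relevant hits at ranks 1 and 3 give (1/1 + 2/3) / 2 = 5/6.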
-
Cross-Entropy
Cross-entropy measures the average number of bits needed to encode samples from a true distribution using a model distribution. It is the standard training loss for language models and the basis of perplexity.
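A sketch over discrete distributions, in bits (assumes the model assigns nonzero probability wherever the true distribution does):

```python
import math

def cross_entropy(true_dist, model_dist):
    """H(p, q) = -sum_x p(x) * log2 q(x), in bits."""
    return -sum(prob * math.log2(model_dist[x])
                for x, prob in true_dist.items() if prob > 0)

p = {"a": 0.5, "b": 0.5}
h_match = cross_entropy(p, p)                      # equals entropy of p: 1 bit
h_mismatch = cross_entropy(p, {"a": 0.9, "b": 0.1})  # mismatch costs extra bits
```

Cross-entropy is minimized when the model distribution equals the true one, which is why the mismatched model pays more bits per sample.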
-
BERTScore
Semantic similarity using contextual BERT embeddings; measures meaning-level matching rather than surface-level n-gram overlap.
-
Measuring Search Effectiveness by Hand
Before you tune a search system you need a way to know whether your changes made things better or worse. This guide shows you how to build a hand-crafted relevance judgement set — a spreadsheet of queries, documents, and expected rankings — and how to use it as a repeatable baseline.
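The judgement spreadsheet can be as simple as a mapping from query to the document ids judged relevant, replayed against the system to get a repeatable score. A minimal sketch (queries, document ids, and the stand-in search function are all illustrative):

```python
# Hand-crafted judgement set: query -> ids of documents judged relevant.
judgements = {
    "python sort list": {"doc_12", "doc_40"},
    "read csv pandas": {"doc_7"},
}

def evaluate(search_fn, judgements, k=5):
    """Run every judged query through `search_fn`; report mean precision@k."""
    scores = []
    for query, relevant in judgements.items():
        top_k = search_fn(query)[:k]
        scores.append(sum(1 for doc in top_k if doc in relevant) / k)
    return sum(scores) / len(scores)

def fake_search(query):
    """Stand-in ranked retrieval, for illustration only."""
    if "sort" in query:
        return ["doc_12", "doc_1", "doc_40", "doc_2", "doc_3"]
    return ["doc_7", "doc_8", "doc_9", "doc_10", "doc_11"]

baseline = evaluate(fake_search, judgements)
```

Record the baseline score before each tuning change; re-running `evaluate` afterwards tells you whether the change helped or hurt.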
-
F1 Score
The F1 score is the harmonic mean of precision and recall, producing a single number that balances a model’s ability to avoid false positives against its ability to avoid false negatives.
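The harmonic mean as a one-liner, with the conventional 0 when both inputs are 0:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Perfect recall cannot hide mediocre precision: f1(0.5, 1.0) = 2/3, not 0.75.
score = f1(0.5, 1.0)
```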
-
Corpus
A corpus is a structured collection of text documents used to train, evaluate, or build statistics for an NLP system — the raw material from which indexes, models, and vocabularies are derived.