Evaluation
-
ROUGE
Recall-Oriented Understudy for Gisting Evaluation; n-gram overlap metric for summarization and paraphrase evaluation.
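As a sketch, ROUGE-1 can be computed as clipped unigram overlap (real ROUGE implementations add stemming, multiple references, and longer n-grams; the sentences below are illustrative):

```python
from collections import Counter

def rouge1(candidate, reference):
    """Toy ROUGE-1: unigram overlap between candidate and reference.
    Returns (precision, recall, f1)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    p = overlap / max(sum(cand.values()), 1)
    r = overlap / max(sum(ref.values()), 1)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

p, r, f = rouge1("the cat sat on the mat", "the cat lay on the mat")
```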
-
Relevance Judgement
Human annotation of query-document relevance; provides ground truth for IR system evaluation.
-
Recall
Fraction of relevant documents that are retrieved; measures completeness of retrieval; high recall indicates few false negatives.
-
Precision
Fraction of retrieved documents that are relevant; measures quality of retrieved set; high precision indicates few false positives.
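Precision and recall fall out of the same true-positive count; a minimal set-based sketch for one query (document ids are illustrative):

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)  # true positives: relevant docs we found
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

p, r = precision_recall(retrieved=["d1", "d2", "d3", "d4"],
                        relevant=["d1", "d3", "d5"])
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall 2/3).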
-
Precision at k
Precision measured over top k results; practical metric reflecting user experience when viewing limited results.
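A sketch of precision@k over a single ranked list (ids are illustrative):

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked results that are relevant."""
    relevant = set(relevant)
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

# One relevant document (d1) appears in the top 3 -> 1/3.
p_at_3 = precision_at_k(["d3", "d1", "d7", "d2", "d9"], {"d1", "d2", "d5"}, k=3)
```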
-
Perplexity
Exponentiated negative average log probability; measures how well a language model predicts a sample. Lower is better.
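The definition above translates directly into code; given per-token log probabilities from a model:

```python
import math

def perplexity(log_probs):
    """Exp of the negative mean log probability per token.
    `log_probs` are natural-log token probabilities from the model."""
    return math.exp(-sum(log_probs) / len(log_probs))

# A model assigning uniform probability 1/4 to every token
# has a perplexity of exactly 4.
ppl = perplexity([math.log(0.25)] * 10)
```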
-
NDCG
Normalized Discounted Cumulative Gain; ranking quality metric that discounts the gain of each relevant result by its position, so hits near the top count more; normalized by the ideal ordering so scores fall in [0, 1].
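A sketch using the common log2 position discount (relevance gains below are illustrative):

```python
import math

def dcg(gains):
    """Discounted cumulative gain: gain at rank i divided by log2(i + 2)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(gains):
    """Normalize by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal else 0.0

# Graded relevance of results in ranked order; swapping ranks 3 and 4
# costs a little NDCG because a relevant doc sits lower than it should.
quality = ndcg([3, 2, 0, 1])
```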
-
Mean Reciprocal Rank
Mean of the reciprocal rank of the first relevant result across queries; measures how quickly a system surfaces its first correct answer. Abbreviated MRR.
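A minimal sketch, taking for each query the 1-based rank of its first relevant hit (`None` when nothing relevant was returned):

```python
def mrr(first_relevant_ranks):
    """Mean reciprocal rank; ranks are 1-based, None means no relevant hit."""
    return sum(0.0 if r is None else 1.0 / r
               for r in first_relevant_ranks) / len(first_relevant_ranks)

# Four queries: hits at ranks 1, 3, and 2; one query with no hit.
score = mrr([1, 3, None, 2])
```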
-
Mean Average Precision
Mean of average precision scores across queries; standard evaluation metric balancing precision and ranking quality. Abbreviated MAP.
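A sketch of binary-relevance MAP: average precision per query averages precision@k at each rank where a relevant document appears, and MAP averages that over queries (ids are illustrative):

```python
def average_precision(ranked, relevant):
    """Mean of precision@k at each rank k holding a relevant document."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """`runs` is a list of (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

ap = average_precision(["d1", "d2", "d3"], {"d1", "d3"})
```

For that single query, relevant hits at ranks 1 and 3 give (1/1 + 2/3) / 2 = 5/6.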
-
Cross-Entropy
Cross-entropy measures the average number of bits needed to encode samples from a true distribution using a model distribution. It is the standard training loss for language models and the basis of perplexity.
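A sketch over discrete distributions, in bits (assumes the model assigns nonzero probability wherever the true distribution does):

```python
import math

def cross_entropy(true_dist, model_dist):
    """H(p, q) = -sum_x p(x) * log2 q(x), in bits."""
    return -sum(prob * math.log2(model_dist[x])
                for x, prob in true_dist.items() if prob > 0)

p = {"a": 0.5, "b": 0.5}
h_match = cross_entropy(p, p)                      # equals entropy of p: 1 bit
h_mismatch = cross_entropy(p, {"a": 0.9, "b": 0.1})  # mismatch costs extra bits
```

Cross-entropy is minimized when the model distribution equals the true one, which is why the mismatched model pays more bits per sample.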
-
BERTScore
Semantic similarity using contextual BERT embeddings; measures meaning-level matching rather than surface-level n-gram overlap.
-
Measuring Search Effectiveness by Hand
Before you tune a search system you need a way to know whether your changes made things better or worse. This guide shows you how to build a hand-crafted relevance judgement set — a spreadsheet of queries, documents, and expected rankings — and how to use it as a repeatable baseline.
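The judgement spreadsheet can be as simple as a mapping from query to the document ids judged relevant, replayed against the system to get a repeatable score. A minimal sketch (queries, document ids, and the stand-in search function are all illustrative):

```python
# Hand-crafted judgement set: query -> ids of documents judged relevant.
judgements = {
    "python sort list": {"doc_12", "doc_40"},
    "read csv pandas": {"doc_7"},
}

def evaluate(search_fn, judgements, k=5):
    """Run every judged query through `search_fn`; report mean precision@k."""
    scores = []
    for query, relevant in judgements.items():
        top_k = search_fn(query)[:k]
        scores.append(sum(1 for doc in top_k if doc in relevant) / k)
    return sum(scores) / len(scores)

def fake_search(query):
    """Stand-in ranked retrieval, for illustration only."""
    if "sort" in query:
        return ["doc_12", "doc_1", "doc_40", "doc_2", "doc_3"]
    return ["doc_7", "doc_8", "doc_9", "doc_10", "doc_11"]

baseline = evaluate(fake_search, judgements)
```

Record the baseline score before each tuning change; re-running `evaluate` afterwards tells you whether the change helped or hurt.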
-
F1 Score
The F1 score is the harmonic mean of precision and recall, producing a single number that balances a model’s ability to avoid false positives against its ability to avoid false negatives.
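The harmonic mean as a one-liner, with the conventional 0 when both inputs are 0:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Perfect recall cannot hide mediocre precision: f1(0.5, 1.0) = 2/3, not 0.75.
score = f1(0.5, 1.0)
```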
-
Corpus
A corpus is a structured collection of text documents used to train, evaluate, or build statistics for an NLP system — the raw material from which indexes, models, and vocabularies are derived.