F1 Score
What it is
The F1 score is an evaluation metric that combines two competing signals — precision and recall — into a single number. It answers the question: how well does a system find the right things and avoid returning the wrong ones?
- Precision — of everything the system returned, what fraction was actually correct?
- Recall — of everything that was actually correct, what fraction did the system return?
Neither metric alone is sufficient. A system that returns only one result (and gets it right) achieves perfect precision but abysmal recall. A system that returns everything achieves perfect recall but precision collapses. F1 forces a reckoning with both.
How it works
F1 is the harmonic mean of precision and recall:
F1 = 2 × (precision × recall) / (precision + recall)
Written in terms of true positives (TP), false positives (FP), and false negatives (FN):
precision = TP / (TP + FP)
recall = TP / (TP + FN)
F1 = 2TP / (2TP + FP + FN)
The harmonic mean is used rather than the arithmetic mean because it is dominated by whichever of the two values is lower. A system with precision 0.9 and recall 0.1 has an arithmetic mean of 0.5 — which sounds acceptable — but an F1 of 0.18, which correctly signals that something is badly wrong.
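A minimal sketch in Python makes the contrast concrete, using the 0.9/0.1 figures from the paragraph above (the helper name f1 is just for illustration):

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

p, r = 0.9, 0.1
print((p + r) / 2)  # 0.5  -- arithmetic mean, looks acceptable
print(f1(p, r))     # 0.18 -- harmonic mean, exposes the weak recall
```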
[illustrate: precision and recall as two overlapping sets — retrieved items on the left, relevant items on the right, TP in the intersection, FP in retrieved-only, FN in relevant-only — then F1 as a bar that shrinks toward the lower of the two individual bars]
Example
A named entity recogniser processes a sentence containing five person names. It returns four results: three are genuine names (TP = 3), one is a false alarm (FP = 1), and two real names were missed (FN = 2).
precision = 3 / (3 + 1) = 0.75
recall = 3 / (3 + 2) = 0.60
F1 = 2 × (0.75 × 0.60) / (0.75 + 0.60)
= 2 × 0.45 / 1.35
≈ 0.667
Raising the detection threshold high enough to eliminate the false positive would push precision to 1.0, but if the stricter threshold also suppresses genuine names, recall drops further and F1 is likely to fall. The metric captures that trade-off directly.
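A short sketch reproduces the arithmetic from the raw counts, then recomputes the scores for one hypothetical higher-threshold outcome (the shifted counts are assumed for illustration: the false alarm disappears, but one genuine name is lost as well):

```python
def scores(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1

# Counts from the example above: 3 correct names, 1 false alarm, 2 misses.
print(scores(tp=3, fp=1, fn=2))  # (0.75, 0.6, 0.666...)

# Hypothetical stricter threshold: no false alarm, but one more real name
# is missed -- precision rises to 1.0 while recall and F1 fall.
print(scores(tp=2, fp=0, fn=3))  # (1.0, 0.4, 0.571...)
```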
[illustrate: before/after table showing TP, FP, FN counts as the detection threshold shifts, with precision, recall, and F1 updating at each step — F1 peaking at an intermediate threshold]
Variants and history
The F1 score is a special case of the Fβ score, introduced by van Rijsbergen in his 1979 text Information Retrieval:
Fβ = (1 + β²) × (precision × recall) / (β² × precision + recall)
Setting β = 1 gives equal weight to precision and recall — the standard F1. Setting β > 1 weights recall more heavily (β = 2 is common when missed positives are costly, as in medical screening). Setting β < 1 weights precision more heavily (β = 0.5 suits spam filtering, where false positives erode user trust).
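A minimal sketch of the Fβ formula, evaluated on an assumed precision of 0.75 and recall of 0.60 to show how the weighting shifts the score toward one side or the other:

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """General F-beta score: beta > 1 favours recall, beta < 1 favours precision."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.75, 0.60
print(f_beta(p, r, beta=1.0))  # ~0.667, the standard F1
print(f_beta(p, r, beta=2.0))  # ~0.625, pulled toward the lower recall
print(f_beta(p, r, beta=0.5))  # ~0.714, pulled toward the higher precision
```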
For multi-class classification, F1 is typically aggregated across classes using one of three strategies, compared in the sketch after this list:
- Macro F1 — compute F1 per class, then take the unweighted mean. Treats all classes equally regardless of how many examples each has.
- Micro F1 — pool TP, FP, FN across all classes before computing. Equivalent to overall accuracy when every example belongs to exactly one class.
- Weighted F1 — compute F1 per class, then average weighted by class frequency. Often reported for imbalanced datasets, though large classes dominate the average, so it can mask weak performance on rare classes.
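A sketch using scikit-learn's f1_score (assuming the library is installed; the toy labels below are made up, with class 0 deliberately dominant) shows how the three averages diverge:

```python
from sklearn.metrics import f1_score

# Toy multi-class labels, deliberately imbalanced: class 0 dominates.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0, 2, 0]

for average in ("macro", "micro", "weighted"):
    print(average, f1_score(y_true, y_pred, average=average))
# macro    ~0.645  (unweighted mean of per-class F1)
# micro     0.700  (equals accuracy here: 7 of 10 correct)
# weighted ~0.695  (per-class F1 weighted by class frequency)
```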
When to use it
F1 is the default evaluation metric whenever both precision and recall matter and the dataset is imbalanced. It is particularly appropriate for:
- Information retrieval — measuring whether a search system returns the right documents while suppressing irrelevant ones.
- Named entity recognition and sequence labelling — where most tokens are negative examples and raw accuracy would be misleading.
- Binary classification with class imbalance — a model predicting the majority class exclusively can achieve 90% accuracy on a 90/10 split; its F1 on the minority class is zero, as the sketch below demonstrates.
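A minimal sketch with scikit-learn, on synthetic labels assumed purely for illustration (90/10 split, a "model" that always predicts the majority class):

```python
from sklearn.metrics import accuracy_score, f1_score

# 90/10 class split; the model predicts the majority class for every example.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))  # 0.9 -- looks strong
print(f1_score(y_true, y_pred, pos_label=1, zero_division=0))  # 0.0 -- minority class never found
```

(`zero_division=0` silences the warning scikit-learn emits when precision is undefined because no positives were predicted.)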
F1 has limitations worth knowing. It treats precision and recall as equally important, which is rarely true in production — choose Fβ when the cost asymmetry is clear. It also tells you nothing about the shape of the precision–recall curve; for ranking systems, average precision (AP) or mean average precision (MAP) are more informative because they account for the ordering of results, not just a single operating point.