MONA explainer 9 min read March 28, 2026

What Are Precision, Recall, and F1 Score, and How the Confusion Matrix Drives Classification Evaluation

ELI5

Precision measures how often your “yes” is right. Recall measures how many real “yes” cases you caught. F1 score balances both into one number, punishing you if either one is weak.

A classifier with 99% accuracy sounds like a triumph. Until you learn it was screening for a disease that affects 1% of the population and achieved that score by predicting “healthy” for every single patient. Every sick person missed. Every metric on the dashboard green. That gap between a comforting number and a dangerous reality is exactly what precision, recall, and the Confusion Matrix were built to expose.

The Four Cells Behind Every Classification Verdict

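Every binary classifier produces four possible outcomes, and those four outcomes form a 2x2 grid: the confusion matrix. In the layout used here, rows represent what the model predicted and columns represent what actually happened; some tools, scikit-learn’s confusion_matrix among them, put the actual classes on the rows instead, so check the convention before reading off a cell. The entire field of Model Evaluation for classification traces back to reading that grid correctly.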

What are precision, recall, and F1 score in machine learning?

The four cells: true positives (TP) — the model said “yes” and was right. True negatives (TN) — the model said “no” and was right. False positives (FP) — the model said “yes” but was wrong. False negatives (FN) — the model said “no” but was wrong.
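As a rough illustration of how the four cells are tallied, the sketch below counts them from two parallel lists of labels. The function name confusion_counts and the toy labels are invented for this example, with 1 standing for “yes”.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Tally (tp, tn, fp, fn) for a binary task.

    `positive` is the label treated as "yes"; every other label counts as "no".
    """
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1  # said "yes" and was right
        elif predicted == positive:
            fp += 1  # said "yes" but was wrong
        elif actual == positive:
            fn += 1  # said "no" but was wrong
        else:
            tn += 1  # said "no" and was right
    return tp, tn, fp, fn


# Hypothetical labels for eight examples.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
```

Every example lands in exactly one cell, so the four counts always sum to the size of the dataset.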

Accuracy counts how many predictions were correct out of all predictions: (TP + TN) / total. And here is the trap. In an imbalanced dataset — where one class vastly outnumbers the other — accuracy rewards a model for ignoring the rare class. A fraud detector that never flags fraud achieves 99.8% accuracy if only 0.2% of transactions are fraudulent.
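The fraud example can be reproduced in a few lines. This is a minimal sketch on a made-up dataset of 1,000 transactions, 2 of them fraudulent; the variable names are illustrative only.

```python
# Hypothetical dataset: 1,000 transactions, 2 of them fraudulent (0.2%).
n_total, n_fraud = 1000, 2
y_true = [1] * n_fraud + [0] * (n_total - n_fraud)

# A "detector" that never flags anything.
y_pred = [0] * n_total

correct = sum(actual == predicted for actual, predicted in zip(y_true, y_pred))
accuracy = correct / n_total

# Share of the actual fraud cases that the detector flagged.
fraud_caught = sum(1 for actual, predicted in zip(y_true, y_pred)
                   if actual == 1 and predicted == 1)
share_of_fraud_caught = fraud_caught / n_fraud

print(f"accuracy              = {accuracy:.1%}")
print(f"share of fraud caught = {share_of_fraud_caught:.1%}")
```

The accuracy figure looks excellent while the detector catches none of the fraud it exists to find.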

The confusion matrix does not let you hide behind aggregates.

Precision = TP / (TP + FP). Of everything the model labeled positive, how much was actually positive? Precision punishes false alarms. If your spam filter sends important emails to junk, precision is low. High precision means: when this model says “yes,” you can trust it (Google ML Crash Course).

Recall = TP / (TP + FN). Of everything that was actually positive, how much did the model catch? Recall punishes misses. If your cancer screening fails to flag a tumor, recall is low. High recall means: this model does not let real positives slip through (Google ML Crash Course).

F1 score = 2 * precision * recall / (precision + recall). It compresses both metrics into a single value between 0 and 1 using the Harmonic Mean (Wikipedia). The name traces back to Van Rijsbergen’s Information Retrieval (1979), and the metric was formally introduced at the MUC-4 conference in 1992 (Wikipedia).

Neither precision nor recall alone tells you whether a classifier is reliable. A model that labels everything as positive achieves perfect recall and, when positives are rare, terrible precision. A model that labels only its single most confident case as positive, and gets that case right, achieves perfect precision and terrible recall. F1 forces both numbers to be decent before the score climbs.
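The two degenerate strategies are easy to check numerically. The sketch below assumes a hypothetical test set of 1,000 examples with 10 true positives; the helper precision_recall_f1 and its zero-denominator fallback are choices made for this illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts.

    Returns 0.0 for any ratio whose denominator is zero; that fallback is a
    choice made for this sketch, not a universal convention.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical test set: 1,000 examples, 10 of them truly positive.
label_everything = precision_recall_f1(tp=10, fp=990, fn=0)
one_sure_bet = precision_recall_f1(tp=1, fp=0, fn=9)

for name, (p, r, f1) in [("label everything positive", label_everything),
                         ("flag one confident case", one_sure_bet)]:
    print(f"{name}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

Each strategy maxes out one metric, and F1 stays low in both cases.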

How the Harmonic Mean Punishes Asymmetry

The choice of the harmonic mean is not arbitrary. It is the mechanism that gives F1 its diagnostic power — and understanding why requires seeing what happens when you use the wrong average.

How does the F1 score combine precision and recall using the harmonic mean?

The arithmetic mean of 0.95 and 0.10 is 0.525 — a respectable-sounding number for a classifier with near-useless recall. The harmonic mean of the same pair is approximately 0.18.

That difference is the entire point.

The harmonic mean of two values a and b is defined as 2ab / (a + b). Equivalently, F1 satisfies the relationship: 1/F1 = (1/2)(1/precision + 1/recall) (Wikipedia). The reciprocal form shows where the penalty comes from: F1 averages the reciprocals of precision and recall, and a small value has a large reciprocal, so the weaker metric dominates. A low value in either dimension drags the result down disproportionately; there is no hiding a weak metric behind a strong one.
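For readers who want the algebra, the two forms can be connected in a short derivation. Here P and R are shorthand for precision and recall (a notation introduced only for this block), and the second line substitutes the count-based definitions P = TP / (TP + FP) and R = TP / (TP + FN).

```latex
\begin{align*}
F_1 &= \frac{2PR}{P + R}
\quad\Longleftrightarrow\quad
\frac{1}{F_1} = \frac{P + R}{2PR}
             = \frac{1}{2}\left(\frac{1}{P} + \frac{1}{R}\right), \\[4pt]
F_1 &= \frac{2 \cdot \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}} \cdot \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}}
            {\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}} + \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}}
     = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}.
\end{align*}
```

The count form makes one property visible at a glance: true negatives never appear in F1.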

This property makes F1 a conservative score. It rewards balance over extremes. If precision is 0.90 and recall is 0.90, F1 is 0.90. If precision is 0.99 and recall is 0.50, F1 drops to about 0.664. The arithmetic mean would report 0.745, a number that obscures the fact that half of all positive cases are being missed.
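The precision and recall pairs quoted above can be checked with Python’s standard library. This is a small sketch; statistics.harmonic_mean and statistics.mean are real standard-library functions, and the list of pairs simply repeats the examples from the text.

```python
from statistics import harmonic_mean, mean

# (precision, recall) pairs taken from the examples in the text.
pairs = [(0.95, 0.10), (0.90, 0.90), (0.99, 0.50)]

for precision, recall in pairs:
    arithmetic = mean([precision, recall])
    f1 = harmonic_mean([precision, recall])
    print(f"P={precision:.2f} R={recall:.2f} "
          f"arithmetic={arithmetic:.3f} harmonic(F1)={f1:.3f}")
```

The two averages agree only when precision and recall are equal; the further apart they drift, the further the harmonic mean falls below the arithmetic one.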

The general form, F-beta, introduces a parameter that controls the weighting: F-beta = (1 + beta squared) * precision * recall / (beta squared * precision + recall) (Wikipedia). When beta = 1, precision and recall carry equal weight; that is F1. When beta = 2, the score leans toward recall: inside the weighted harmonic mean, recall’s term carries beta squared = 4 times the weight of precision’s. That is useful in medical screening, where missing a case is worse than a false alarm. When beta = 0.5, the weighting flips and precision dominates.
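To see the dial move, here is a minimal sketch of the F-beta formula applied to one hypothetical classifier with precision 0.99 and recall 0.50. The function name f_beta and the zero-denominator fallback are assumptions of this example.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta from precision and recall; returns 0.0 when the denominator is zero."""
    b2 = beta ** 2
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + b2) * precision * recall / denominator


# Hypothetical classifier: very precise, but it misses half the positives.
precision, recall = 0.99, 0.50
for beta in (0.5, 1.0, 2.0):
    print(f"beta={beta}: F-beta={f_beta(precision, recall, beta):.3f}")
```

Because recall is the weak side here, the score falls as beta grows and rises as beta shrinks; with a recall-heavy classifier the ordering would reverse.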

Not a single formula. A family of tradeoff dials.

The Invisible Dial That Reshapes Every Metric

Most classifiers do not output a binary label directly. They output a probability score — and somewhere between that score and the final “positive” or “negative” label sits a Classification Threshold. Moving that threshold reshapes precision, recall, and F1 simultaneously.

How does the precision-recall tradeoff work when you change the classification threshold?

Picture a fraud detection model that assigns each transaction a score between 0 and 1. At a moderate threshold, anything above it gets flagged as fraud. Lower the threshold, and the model flags more transactions — catching more real fraud (recall rises) but also flagging more legitimate transactions (precision falls). Raise it, and the model becomes conservative — fewer false alarms (precision rises) but more real fraud slips through (recall falls).

This is the precision-recall tradeoff (Google ML Crash Course). Raising the threshold can never increase recall, and it usually, though not always, raises precision, so threshold tuning alone rarely improves both at once; getting both up generally takes a better model or better features. The tradeoff is not a bug in the metric; it reflects a fundamental constraint in how probability scores map to binary decisions.
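A threshold sweep on toy data shows the pattern. The scores, labels, and candidate thresholds below are invented for illustration, and the sketch assumes a score at or above the threshold counts as a positive prediction.

```python
def precision_recall_at(threshold, scores, labels):
    """Flag every score >= threshold as positive, then score the result."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical model scores and ground truth (1 = fraud) for ten transactions.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.45, 0.40, 0.30, 0.20, 0.10]
labels = [1,    1,    0,    1,    1,    0,    0,    1,    0,    0]

for threshold in (0.25, 0.50, 0.85):
    p, r = precision_recall_at(threshold, scores, labels)
    print(f"threshold={threshold:.2f} precision={p:.3f} recall={r:.3f}")
```

On this data, each step up in the threshold buys precision and pays for it in recall.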

The threshold encodes a cost judgment. When false negatives are more expensive than false positives — medical screening, safety-critical systems — you lower the threshold and accept more false alarms. When false positives are more expensive — spam filtering, content moderation at scale — you raise the threshold and accept more misses (Google ML Crash Course).
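One way to make that cost judgment explicit is to price each error type and pick the threshold with the lowest total cost. The sketch below is only an illustration: the scores, labels, candidate thresholds, and cost values are all hypothetical, and real costs would come from the application.

```python
def total_cost(threshold, scores, labels, fp_cost, fn_cost):
    """Sum the cost of every false alarm and every miss at one threshold."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return fp * fp_cost + fn * fn_cost


def cheapest_threshold(candidates, scores, labels, fp_cost, fn_cost):
    """Pick the candidate threshold with the lowest total cost."""
    return min(candidates,
               key=lambda t: total_cost(t, scores, labels, fp_cost, fn_cost))


# Hypothetical model scores and ground truth (1 = positive) for ten cases.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.45, 0.40, 0.30, 0.20, 0.10]
labels = [1,    1,    0,    1,    1,    0,    0,    1,    0,    0]
candidates = (0.25, 0.50, 0.85)

# Misses are ten times as expensive as false alarms (screening-style costs).
screening = cheapest_threshold(candidates, scores, labels, fp_cost=1, fn_cost=10)
# False alarms are ten times as expensive as misses (spam-filter-style costs).
spam_like = cheapest_threshold(candidates, scores, labels, fp_cost=10, fn_cost=1)

print(f"costly misses       -> threshold {screening}")
print(f"costly false alarms -> threshold {spam_like}")
```

The model and the data are identical in both runs; only the stated cost ratio changes, and the preferred threshold moves from the lowest candidate to the highest.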

F1 is what you get when you declare those costs equal. Which is also why F1 is sometimes the wrong metric.

The ROC curve, summarized by its area under the curve (ROC AUC), evaluates a classifier across all possible thresholds, providing a threshold-independent view of separability. It is a complementary lens to the single-threshold snapshot that F1 offers.
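ROC AUC has a handy interpretation that needs no threshold at all: it equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. The sketch below computes it that way on made-up scores; the function name roc_auc and the quadratic pairwise loop are choices made for readability, not how a library would do it.

```python
def roc_auc(scores, labels):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. This O(n^2) version is for
    illustration only.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical model scores and ground truth for ten examples.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.45, 0.40, 0.30, 0.20, 0.10]
labels = [1,    1,    0,    1,    1,    0,    0,    1,    0,    0]
print(f"ROC AUC = {roc_auc(scores, labels):.2f}")
```

No threshold appears anywhere in the calculation, which is why the result describes how well the scores rank examples and says nothing about any one operating point.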

Figure: The confusion matrix produces four counts; precision and recall read those counts from different angles, and the harmonic mean compresses both into F1.

Where the Numbers Stop and the Decisions Start

If you change the class distribution in your test set, F1 changes — even if the model has not changed. If you evaluate on a dataset contaminated by training examples, F1 lies to you with a straight face. Benchmark Contamination is the silent failure mode of every leaderboard metric, F1 included.

If your task has severe class imbalance and you care about both positive and negative predictions, F1 has a structural blind spot: it ignores true negatives entirely. The Matthews Correlation Coefficient uses all four quadrants of the confusion matrix and is considered more informative for imbalanced data (Wikipedia). F1 and MCC answer different questions — F1 asks “how well does this model find positives?” while MCC asks “how well do this model’s predictions correlate with reality across all classes?”
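The blind spot shows up when two confusion matrices share the same TP, FP, and FN but differ in TN. The sketch below uses invented counts; f1_from_counts and mcc are hand-written helpers for this example, and returning 0.0 when the MCC denominator is zero is a common convention assumed here.

```python
from math import sqrt


def f1_from_counts(tp, fp, fn):
    """F1 written directly in terms of counts; note that TN is not an input."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from all four cells."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention for an undefined ratio
    return (tp * tn - fp * fn) / denominator


# Two hypothetical evaluations: identical except for the true negatives.
many_true_negatives = dict(tp=90, fp=5, fn=5, tn=900)
no_true_negatives = dict(tp=90, fp=5, fn=5, tn=0)

for name, cells in [("tn=900", many_true_negatives), ("tn=0", no_true_negatives)]:
    f1 = f1_from_counts(cells["tp"], cells["fp"], cells["fn"])
    print(f"{name}: F1={f1:.3f} MCC={mcc(**cells):.3f}")
```

F1 is identical in both rows, while MCC drops from strongly positive to slightly negative once the model never gets a negative right.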

If you are working with scikit-learn, sklearn.metrics.f1_score supports five averaging modes: binary, micro, macro, weighted, and samples (scikit-learn Docs); passing average=None returns the per-class scores instead. Macro averaging treats each class equally; weighted averaging weights each class’s F1 by its support, the number of true instances of that class; micro averaging pools the counts across classes before computing the score. The choice of averaging mode can shift your F1 substantially on the same model, a detail that is rarely mentioned in benchmark comparisons.
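To show how far the averaging modes can diverge, the sketch below hand-computes macro, weighted, and micro F1 for a small imbalanced three-class example. It is a plain-Python approximation of the definitions, not scikit-learn’s implementation; the helper names and the toy labels are hypothetical.

```python
from collections import Counter


def f1_from_counts(tp, fp, fn):
    """F1 from counts; 0.0 when the class has no predictions and no instances."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def averaged_f1(y_true, y_pred):
    """Return macro, weighted, and micro F1 for single-label multiclass data."""
    classes = sorted(set(y_true) | set(y_pred))
    counts = {c: {"tp": 0, "fp": 0, "fn": 0} for c in classes}
    for actual, predicted in zip(y_true, y_pred):
        if actual == predicted:
            counts[actual]["tp"] += 1
        else:
            counts[predicted]["fp"] += 1  # wrongly claimed by `predicted`
            counts[actual]["fn"] += 1     # missed for `actual`

    support = Counter(y_true)  # number of true instances per class
    per_class = {c: f1_from_counts(**counts[c]) for c in classes}

    macro = sum(per_class.values()) / len(classes)
    weighted = sum(per_class[c] * support[c] for c in classes) / len(y_true)
    micro = f1_from_counts(
        tp=sum(k["tp"] for k in counts.values()),
        fp=sum(k["fp"] for k in counts.values()),
        fn=sum(k["fn"] for k in counts.values()),
    )
    return {"macro": macro, "weighted": weighted, "micro": micro}


# Hypothetical imbalanced data: class "a" dominates, "b" is always missed.
y_true = ["a"] * 8 + ["b"] + ["c"]
y_pred = ["a"] * 8 + ["a"] + ["c"]
result = averaged_f1(y_true, y_pred)
for mode, value in result.items():
    print(f"{mode:>8} F1 = {value:.3f}")
```

Macro averaging exposes the completely missed minority class, while micro and weighted averaging are dominated by the majority class. In scikit-learn the mode is selected through the average argument of f1_score.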

Rule of thumb: If false positives and false negatives carry roughly equal cost, F1 is a reasonable single metric. If they do not — and they usually do not in production — use F1 as a starting point but make the cost asymmetry explicit through F-beta or a custom loss function.

When it breaks: F1 collapses as a reliable signal when the positive class is extremely rare and the evaluation set is small. With too few positive examples, a single additional false positive or false negative swings F1 by several points — making the metric unstable and comparisons between models unreliable.
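The instability is easy to quantify. The sketch below compares how much F1 moves when one true positive becomes a false negative, first on a tiny evaluation set and then on one a hundred times larger with the same proportions. All counts are invented for illustration.

```python
def f1_from_counts(tp, fp, fn):
    """F1 written directly in terms of confusion-matrix counts."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def swing_from_one_extra_miss(tp, fp, fn):
    """Drop in F1 when a single true positive turns into a false negative."""
    before = f1_from_counts(tp, fp, fn)
    after = f1_from_counts(tp - 1, fp, fn + 1)
    return before - after


# Hypothetical evaluation sets with the same starting F1 of 0.80.
small_set = swing_from_one_extra_miss(tp=4, fp=1, fn=1)        # 5 positives
large_set = swing_from_one_extra_miss(tp=400, fp=100, fn=100)  # 500 positives

print(f"5 positives:   F1 drops by {small_set * 100:.1f} points")
print(f"500 positives: F1 drops by {large_set * 100:.2f} points")
```

With only five positives, one changed prediction moves F1 by more than ten points, which is larger than many of the gaps reported between competing models.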

The Data Says

The confusion matrix is not a report card. It is a diagnostic instrument — one that precision, recall, and F1 read from different angles. Accuracy hides the failure modes that matter most; the harmonic mean refuses to let one strong number compensate for one weak one. The metric you choose is a statement about which errors you can afford.

Aha Moments

MAX

The article nails the diagnostic structure — four cells, two lenses, one tradeoff. What I would add from a specification perspective: the threshold choice is a design decision that belongs in the system spec, not in the data scientist’s notebook. I have watched teams tune classification thresholds by gut feel during model development, then ship a default threshold into production because nobody wrote down the cost ratio. The confusion matrix gives you the vocabulary to make that decision explicit. Write the acceptable false positive rate into your acceptance criteria. Write the minimum recall into your test suite. The moment those numbers leave someone’s head and enter the spec, the evaluation pipeline becomes auditable. That is the difference between a metric and a contract.

DAN

Max is right about the spec gap, but I would push the lens wider. The market is saturated with teams announcing “state-of-the-art F1” on curated benchmarks, and almost none of them disclose the threshold, the averaging mode, or the class distribution of their evaluation set. That ambiguity is not accidental — it is a competitive advantage for whoever controls the narrative. The real trend worth watching: organizations are starting to demand evaluation transparency alongside headline numbers. If your published F1 does not come with its confusion matrix, the threshold, and the dataset description, informed buyers are learning to treat it as marketing copy. The metric is not broken. The reporting is.

ALAN

Both of you describe the mechanical and strategic failures of F1 reporting. I would ask you to sit with a more uncomfortable question. When we say F1 balances precision and recall “equally,” we are making a moral claim disguised as a mathematical one — the claim that false positives and false negatives are equally harmful. In medical screening, that is demonstrably false. In criminal justice risk scoring, it is dangerous. The confusion matrix is a mirror, but who decides which reflection matters? The engineer tuning the threshold is making a value judgment about whose harm counts more — the person falsely accused or the person missed entirely. And that judgment is usually invisible, buried in a hyperparameter nobody audits. If the metric encodes a value, who is responsible for the value it encodes?

AI-assisted content, human-reviewed. Images AI-generated.