MONA explainer 10 min read April 6, 2026

From Binary to Multi-Class: Deriving Precision, Recall, and F1 from a Confusion Matrix

Grid of prediction outcomes revealing hidden classification failures through color-coded diagonal and off-diagonal cells

Table of Contents

ELI5

A confusion matrix is a grid that counts every correct and incorrect prediction a model makes. From those counts you derive precision, recall, F1, and every classification metric that matters.

A model scores 95% accuracy on a medical screening dataset. The team celebrates. Then someone builds a confusion matrix and discovers the model predicted “healthy” for every single patient — including the 5% who actually had the disease. The accuracy was real. The model was useless. That gap between a headline number and diagnostic reality is precisely what the confusion matrix exists to close.

The Forensic Grid: What Each Cell Records

Every Binary Classification model produces four possible outcomes per prediction, and the Confusion Matrix arranges them into a 2×2 grid that works like a forensic report. Rows represent what actually happened; columns represent what the model claimed happened. The tension between those two axes — ground truth versus prediction — is where every interesting metric lives.

What should you understand before learning confusion matrices

The vocabulary is small but deceptively precise, and conflating any two terms will cascade errors through every metric you compute downstream.

True Positives (TP): the model predicted positive and was correct. True Negatives (TN): the model predicted negative and was correct. These occupy the diagonal — the cases where prediction aligned with reality.

False Positives (FP): the model predicted positive but was wrong. The false alarm; the spam filter that blocks a legitimate email; the fraud detector that freezes a valid transaction. False Negatives (FN): the model predicted negative but was wrong. The missed detection — the screening test that sends a sick patient home with a clean bill of health.

In scikit-learn’s convention, the matrix element C[i,j] represents the count of observations with true label i that were predicted as label j (scikit-learn Docs). For binary problems, unpacking is mechanical: tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel(). You can also normalize the matrix — by row ('true'), by column ('pred'), or by total population ('all') — to convert raw counts into proportions, which is especially useful when class sizes differ by orders of magnitude.

Two concepts round out the foundation. Specificity — TN / (TN + FP) — measures how well the model identifies the negative class; it is the mirror of recall, focused on the opposite side of the grid. And the entire framework of Model Evaluation assumes the test set is representative of data the model will encounter in practice. When Benchmark Contamination leaks training examples into the evaluation split, every cell in the matrix inflates uniformly — the grid looks cleaner than it should, and no single metric will flag the problem because the corruption is structural, not localized.

Without clean data separation, the matrix reflects memory, not generalization.

Three Metrics, One Fundamental Tension

The grid is a ledger. Precision, recall, and F1 are the questions you pose to that ledger — and each question deliberately ignores one quadrant of the table. Understanding the arithmetic matters less than understanding what each formula leaves out.

How to calculate precision recall and F1 score from a confusion matrix

Precision = TP / (TP + FP). Of everything the model flagged as positive, how many actually were? Precision punishes false alarms. A spam filter with high precision rarely sends real email to junk — but it might let spam slip through untouched, because false negatives are invisible to this formula.

Recall = TP / (TP + FN). Of all the actual positives in the dataset, how many did the model find? Recall punishes missed detections. A cancer screening model with high recall flags most tumors — but it might also refer many healthy patients for unnecessary biopsies, because false positives sit outside this formula’s field of view.

The tension between them is mechanical, not philosophical. Lowering the classification threshold captures more true positives (recall climbs) but admits more false alarms (precision drops). Raising the threshold does the reverse. No single threshold maximizes both simultaneously — unless the model achieves perfect class separation, which in practice it almost never does. Plotting precision against recall across all thresholds produces the PR curve, and the area under that curve gives you a threshold-independent summary of model performance on the positive class. But even the PR curve is a projection; it collapses the four-cell grid into a two-dimensional trade-off and discards true negatives entirely.

Precision, Recall, and F1 Score resolves this tension through the harmonic mean: F1 = 2 × TP / (2 × TP + FP + FN). Why harmonic rather than arithmetic? Because the harmonic mean punishes asymmetry. If precision is 0.95 and recall is 0.10, the arithmetic mean is a generous 0.525. The harmonic mean returns 0.18 — a verdict that reflects how catastrophically the model fails on one dimension.

Not a design choice. A mathematical property of the harmonic mean.

The F1 score weights precision and recall equally, which is appropriate when false positives and false negatives carry similar costs. When they do not — and in most applied contexts, they do not — the F-beta score generalizes the formula: F_β = (1 + β²) × (precision × recall) / (β² × precision + recall). Setting β = 2 weights recall twice as heavily as precision; β = 0.5 does the reverse (scikit-learn Docs). A medical screening system that cannot afford missed detections should optimize for F2. A content moderation system that cannot afford wrongful removals should lean toward F0.5.

The formula is a knob. The confusion matrix is the instrument panel that tells you which direction to turn it.

Beyond Two Labels: N Classes, N² Cells

Binary classification is a special case — a useful pedagogical starting point, but not what most classification problems look like once they reach real data. When a model sorts inputs into three, ten, or a hundred categories, the 2×2 grid expands into an N×N matrix, and the off-diagonal cells no longer sort neatly into “false positive” and “false negative.”

How does a confusion matrix work for multi-class classification

In an N-class confusion matrix, each row still represents the true class and each column the predicted class. The diagonal counts correct classifications. Everything off the diagonal is a specific misclassification — and now the direction of the error carries diagnostic information. A handwriting model that confuses 3 with 8 is responding to geometric similarity; one that confuses 3 with 7 has a different failure mode entirely. The matrix records both the frequency and the trajectory of every mistake.

To compute precision or recall for a single class k, treat the problem as one-versus-rest: class k against everything else. Precision for class k uses column k (all predictions of k); recall uses row k (all true instances of k). But how do you collapse N separate per-class metrics into a single summary number?

Three averaging strategies exist, and each answers a fundamentally different question (scikit-learn Docs):

Micro averaging pools all true positives, false positives, and false negatives globally before computing one metric. For single-label problems, micro-F1 collapses to overall accuracy. It gives the population-level picture but is dominated by frequent classes — rare classes barely move the needle.

Macro averaging computes the metric independently per class, then takes the unweighted mean. Every class counts equally regardless of size. This is the strategy that surfaces failures on rare categories — the categories where failures tend to carry the most consequence and the least visibility.

Weighted averaging computes per-class metrics and averages them proportional to support — the number of true instances in each class. It accounts for imbalance without reducing to a single accuracy figure; a middle ground between micro’s population sensitivity and macro’s egalitarianism.

The choice is not a statistical preference. It is a declaration about whose errors you consider tolerable.

Progression from 2x2 binary confusion matrix to NxN multi-class grid with micro, macro, and weighted averaging strategies compared — From four cells to N-squared: how the confusion matrix scales and the three averaging strategies that summarize it.

What the Diagonal Predicts and What the Margins Conceal

Turn the mechanism into predictions you can verify against your own models.

If you increase class imbalance — say the positive class drops from 30% to 3% of the dataset — precision on the minority class will deteriorate faster than recall, because the denominator TP + FP absorbs a growing share of false alarms from the majority class. Overall accuracy may barely shift. The confusion matrix will expose the entire asymmetry; accuracy will present a flattering summary.

If two classes share visual or semantic similarity, the off-diagonal cells between exactly those two classes will show elevated counts — a pattern invisible to any aggregate metric. This is diagnostic data: it points you toward where to focus annotation effort, where to engineer features, and which class boundary the model finds genuinely ambiguous rather than merely noisy.

If you switch from micro to macro averaging and the score drops sharply, the model is performing well on common classes and failing on rare ones. Macro-F1 is the metric that refuses to let majority outvote minority.

Rule of thumb: when overall accuracy looks strong but macro-F1 drops sharply below it, the confusion matrix almost certainly contains at least one minority class with near-zero recall. The headline number is real. The model’s coverage is not.

When it breaks: The confusion matrix assumes test labels are correct. In domains with high annotator disagreement — radiology reads, sentiment classification, legal document tagging — the matrix reflects labeling inconsistency as if it were model error. A cell showing persistent confusion between two categories might reveal a genuine model limitation, or it might reveal that human experts cannot reliably distinguish those categories either. Without measuring inter-annotator agreement first, the confusion matrix cannot separate model confusion from label ambiguity.

Compatibility note:
scikit-learn plot_confusion_matrix: Deprecated since version 1.0. Use ConfusionMatrixDisplay.from_predictions() or ConfusionMatrixDisplay.from_estimator() instead.

The Data Says

The confusion matrix is not a metric. It is the raw evidence from which every classification metric is derived. Precision, recall, F1, specificity, balanced accuracy — all of them are different projections of the same N×N grid. And the averaging strategy you select for multi-class problems is not a statistical convenience; it is a decision about whose errors matter enough to measure.

Aha Moments

MAX

The confusion matrix is the only evaluation artifact that shows where a model breaks, not just how much. An aggregate score is a failing test with no stack trace — it flags a problem but points nowhere. The matrix gives you row-and-column specificity: class A maps incorrectly to class B, not class C. That is actionable. You can adjust class weights, collect targeted training examples, or restructure feature engineering for exactly the pair causing the most damage. The averaging decision — micro versus macro versus weighted — should be documented in the evaluation contract before training begins. Choosing it afterward invites the temptation to pick whichever number flatters the result, and that is not evaluation. That is negotiation with your own data.

DAN

The averaging strategy is quietly one of the most consequential configuration choices in a model’s lifecycle, because it defines what “good enough” means in practice. Teams that default to micro averaging on imbalanced datasets will report strong aggregate numbers while ignoring failure on rare classes — the predictions that often carry the highest stakes. Fraud detection, rare disease screening, minority language identification: all long-tail problems where macro averaging would surface the weakness that micro hides. The strategic risk is not that the model underperforms. The risk is that the chosen averaging method makes underperformance invisible until production data reveals it. By then, the confusion matrix that could have caught the gap has been archived in a notebook nobody revisited.

ALAN

Both of you are treating the ground-truth labels as settled fact. In many high-stakes domains, the labeling process itself is contested. Who decided a given radiology scan was “positive” — a single annotator, a consensus panel, or an algorithm trained on prior consensus? The confusion matrix inherits every assumption embedded in the annotation pipeline. Persistent misclassification between two categories might reveal a genuine model limitation — or it might reveal a boundary that human experts cannot agree on. We treat annotator disagreement as noise to be resolved, but sometimes disagreement is the signal. Before we trust the matrix as evidence of model failure, shouldn’t we first measure how much of the recorded “confusion” belongs to the model and how much belongs to the labels?

AI-assisted content, human-reviewed. Images AI-generated. Editorial Standards · Our Editors