Balanced Accuracy

Also known as: BACC, macro-averaged recall, average recall

Balanced accuracy is a classification metric that averages the recall achieved on each class, so a model is rewarded for correctly identifying rare classes instead of inflating its score by favoring the majority class on imbalanced data.


What It Is

Plain accuracy is the metric most people reach for first: the share of predictions a model got right. On a balanced dataset, it works. The problem appears the moment one outcome is much rarer than the others. In fraud detection, legitimate transactions might outnumber fraudulent ones by a thousand to one. A model that labels every transaction as legitimate scores near-perfect accuracy while catching zero fraud. Balanced accuracy exists to expose this kind of empty victory.
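The fraud scenario can be put into numbers. The sketch below assumes 1,000 fraudulent transactions among 1,000,000 legitimate ones (an illustrative choice matching the thousand-to-one ratio) and a hypothetical model that labels everything legitimate. Balanced accuracy is computed here as the average of the recall on each class.

```python
# Hypothetical counts, chosen only to match a 1000:1 class ratio.
n_legit = 1_000_000
n_fraud = 1_000

# Confusion-matrix counts for a model that labels every transaction "legitimate".
true_negatives = n_legit    # every legitimate transaction is labeled correctly
false_positives = 0         # nothing is ever flagged as fraud
true_positives = 0          # so no fraud is caught
false_negatives = n_fraud   # and every fraud case is missed

accuracy = (true_positives + true_negatives) / (n_legit + n_fraud)

sensitivity = true_positives / (true_positives + false_negatives)  # recall on fraud
specificity = true_negatives / (true_negatives + false_positives)  # recall on legitimate
balanced_accuracy = (sensitivity + specificity) / 2

print(f"accuracy:          {accuracy:.4f}")           # accuracy:          0.9990
print(f"balanced accuracy: {balanced_accuracy:.4f}")  # balanced accuracy: 0.5000
```

The headline accuracy rounds to 99.9%, while balanced accuracy lands on 0.5, the same score as guessing.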

The metric works by scoring each class on its own, then averaging. Recall is the building block: for any one class, recall asks “of all the real examples of this class, how many did the model actually find?” Balanced accuracy computes recall for every class and takes the plain average. In a two-class problem, that means averaging sensitivity (recall on the positive class) and specificity (recall on the negative class). With more than two classes, it averages recall across all of them, a measure often called macro-averaged recall.
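Written as a formula, the description above looks roughly like this. The notation is an assumption introduced here for illustration: K is the number of classes, and TP_k and FN_k are the true positives and false negatives for class k.

```latex
\[
\begin{aligned}
\mathrm{Recall}_k &= \frac{TP_k}{TP_k + FN_k} \\[4pt]
\text{Balanced accuracy} &= \frac{1}{K} \sum_{k=1}^{K} \mathrm{Recall}_k \\[4pt]
\text{Two classes:}\quad
\text{Balanced accuracy} &= \frac{1}{2}\left(
  \underbrace{\frac{TP}{TP + FN}}_{\text{sensitivity}}
  + \underbrace{\frac{TN}{TN + FP}}_{\text{specificity}}
\right)
\end{aligned}
\]
```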

Because the average gives each class the same weight, class size stops mattering to the score. A class with a handful of examples counts as much as a class with millions. That is the whole point: the model can no longer earn a high number by performing well only on the crowd. Grading a student equally on every subject, rather than weighting by how many questions each subject happened to have, captures the same idea: you cannot coast on the biggest section.
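One way to see that class size drops out is to hold each class's recall fixed and grow only the largest class. The following is a minimal sketch in plain Python; the class names, counts, and the helper `make_labels` are all made up for illustration.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Map each class in y_true to the fraction of its examples predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {cls: hits[cls] / totals[cls] for cls in totals}

def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of the per-class recalls."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)

def accuracy(y_true, y_pred):
    """Share of all predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def make_labels(spec):
    """Build toy labels from {class: (n_examples, n_correct)} (hypothetical helper).

    Misclassified examples get the placeholder label "wrong"; for recall,
    only whether a prediction matches the true class matters.
    """
    y_true, y_pred = [], []
    for cls, (n, n_correct) in spec.items():
        y_true += [cls] * n
        y_pred += [cls] * n_correct + ["wrong"] * (n - n_correct)
    return y_true, y_pred

small = make_labels({"common": (10, 9), "mid": (4, 3), "rare": (2, 1)})
# Same per-class recalls (0.9, 0.75, 0.5), but the common class is 100 times larger.
large = make_labels({"common": (1000, 900), "mid": (4, 3), "rare": (2, 1)})

for name, (y_true, y_pred) in [("small", small), ("large", large)]:
    acc = round(accuracy(y_true, y_pred), 4)
    bal = round(balanced_accuracy(y_true, y_pred), 4)
    print(name, acc, bal)
# small 0.8125 0.7167
# large 0.8986 0.7167
```

Plain accuracy climbs as the common class grows, because the common class contributes more of the total. Balanced accuracy stays put, because each class still contributes exactly one recall value to the average.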

For the broader topic of class imbalance, this makes balanced accuracy one of the first diagnostics worth running. It does not fix imbalance; techniques like class weighting, resampling, and cost-sensitive learning do that. It tells you whether you have a problem at all, and whether an attempted fix actually helped the minority class or just reshuffled the majority.

How It’s Used in Practice

Most people meet balanced accuracy while comparing models on a dataset they already suspect is skewed. You train a classifier, the accuracy comes back in the ninety-something range, and it looks like a win, until you compute balanced accuracy and it sits close to the score you would get by guessing. That gap is the signal: the model learned the majority class and quietly gave up on the minority one.

In practice, the metric usually shows up through standard tooling rather than hand calculation. Machine-learning libraries expose it directly: scikit-learn provides a balanced_accuracy_score function, and AutoML platforms and model cards increasingly report it next to plain accuracy. For a product manager or analyst reviewing a model’s results, it is the single number that answers “does this model actually handle the rare case we care about?”
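A short sketch of the scikit-learn call, assuming scikit-learn is installed. The labels are toy data invented for illustration: 95 majority examples, 5 rare ones, and a model that finds only one of the rare cases.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Toy labels (made up): 0 = majority class, 1 = rare class.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 95 + [1] * 1 + [0] * 4   # only 1 of the 5 rare examples is found

acc = accuracy_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)
# Macro-averaged recall is the same quantity under another name.
macro_recall = recall_score(y_true, y_pred, average="macro")

print(f"accuracy:          {acc:.2f}")           # accuracy:          0.96
print(f"balanced accuracy: {bal_acc:.2f}")       # balanced accuracy: 0.60
print(f"macro recall:      {macro_recall:.2f}")  # macro recall:      0.60
```

balanced_accuracy_score also accepts adjusted=True, which rescales the result so that chance-level performance scores 0 and perfect performance still scores 1. Whether that rescaled number is easier to read is a matter of preference.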

The common workflow is to report both numbers side by side. Plain accuracy tells you the overall hit rate; balanced accuracy tells you whether that hit rate is honest. When the two diverge sharply, the classes are imbalanced and the model performs unevenly across them, so the headline accuracy is misleading.

Pro Tip: Always show balanced accuracy next to plain accuracy on skewed data, but don’t treat it as the finish line. It weights every class equally, which rarely matches real costs: a missed cancer diagnosis is not the same price as a false alarm. Once balanced accuracy confirms the model sees the rare class, switch to cost-aware metrics that reflect what each mistake actually costs.
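As a rough illustration of why equal class weighting can diverge from real costs, the sketch below compares two hypothetical models with identical balanced accuracy. Every number is invented, including the assumption that a missed positive costs 100 times as much as a false alarm, and total cost is only one simple example of a cost-aware measure.

```python
def balanced_accuracy(tp, fn, tn, fp):
    """Average of sensitivity and specificity from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2

def total_cost(fn, fp, cost_fn, cost_fp):
    """Sum of error counts weighted by what each kind of error costs."""
    return fn * cost_fn + fp * cost_fp

# Hypothetical costs: one missed positive is as expensive as 100 false alarms.
COST_FN = 100
COST_FP = 1

# Two made-up models scored on the same 100 positives and 1,000 negatives.
model_a = dict(tp=90, fn=10, tn=700, fp=300)   # sensitivity 0.9, specificity 0.7
model_b = dict(tp=70, fn=30, tn=900, fp=100)   # sensitivity 0.7, specificity 0.9

ba_a = balanced_accuracy(**model_a)
ba_b = balanced_accuracy(**model_b)
cost_a = total_cost(model_a["fn"], model_a["fp"], COST_FN, COST_FP)
cost_b = total_cost(model_b["fn"], model_b["fp"], COST_FN, COST_FP)

print(f"A: balanced accuracy {ba_a:.2f}, total cost {cost_a}")  # A: balanced accuracy 0.80, total cost 1300
print(f"B: balanced accuracy {ba_b:.2f}, total cost {cost_b}")  # B: balanced accuracy 0.80, total cost 3100
```

Both models score 0.80 on balanced accuracy, yet under these assumed costs model B is more than twice as expensive, because it misses three times as many positives.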

When to Use / When Not

Scenario | Use | Avoid
Evaluating a classifier on imbalanced data (fraud, disease screening, defect detection) | ✓ |
Reporting one headline number when classes are roughly equal | | ✓
Comparing models where catching the rare class is the goal | ✓ |
When a false positive and a false negative carry very different costs | | ✓
Multi-class problems where some categories have very few examples | ✓ |
Choosing a probability threshold across the full operating range | | ✓

Common Misconception

Myth: A higher balanced accuracy always means a better model, so it should replace plain accuracy everywhere. Reality: Balanced accuracy measures one specific thing: the average recall across classes. It deliberately ignores class size, so it can understate the value of a model intentionally tuned to favor one class for business reasons. It signals class-balanced performance, not whether the model meets real-world cost trade-offs.

One Sentence to Remember

If your data is skewed, plain accuracy can lie; balanced accuracy is the fastest sanity check that your model actually learned the rare class instead of memorizing the majority. Start there, confirm the model sees the minority, then move to cost-aware metrics that match what each error truly costs.

FAQ

Q: What is the difference between accuracy and balanced accuracy? A: Accuracy counts all correct predictions over the total, so the majority class dominates. Balanced accuracy averages the recall of each class separately, giving every class equal weight regardless of how rare it is.

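Q: What is a good balanced accuracy score? A: For binary classification, 0.5 equals random guessing and 1.0 is perfect; with K classes, the chance level is 1/K. A binary score near 0.5 means the model is doing little better than guessing on average, usually because it is failing on at least one class, even when plain accuracy looks high.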

Q: Does balanced accuracy fix class imbalance? A: No. It only measures performance fairly across classes; it does not change the model. Fixing imbalance needs techniques like class weighting, resampling, or cost-sensitive learning applied during training.

Expert Takes

Not a better accuracy. A different question. Plain accuracy asks how often the model is right; balanced accuracy asks whether it is right on each class in turn, then averages. On skewed data those answers diverge sharply. The metric encodes a value judgment, that every class deserves equal attention, and makes that judgment visible instead of letting class frequency decide it silently.

A model that scores high on accuracy and low on balanced accuracy isn’t broken; your evaluation spec is. The default metric never told it that the rare class mattered. Name the per-class recall you need in your acceptance criteria, track balanced accuracy in the same dashboard as accuracy, and the gap shows up before deployment instead of in a postmortem.

The business translation is blunt: a model blind to the rare class is a liability dressed up as a success. Fraud, churn, defects: the cases that move money are almost always the minority. If your team still reports one accuracy number on imbalanced data, you’re optimizing a metric that rewards ignoring your most expensive problem. Balanced accuracy is the first honest line on the scorecard.

A single accuracy score can hide who the model fails. When the rare class is a disease, a fraud victim, or a wrongly flagged applicant, the people inside that minority are exactly the ones a majority-friendly metric renders invisible. So who decides that every class deserves equal weight, and who answers when the convenient number says the system works while the people it overlooks say otherwise?