Class Imbalance

Class imbalance is the problem of training a model on data where one outcome vastly outnumbers another, such as fraud among millions of normal transactions or disease in mostly healthy patients.

Because a model can score high accuracy simply by ignoring the rare class, useful results usually depend on some combination of resampling, cost-sensitive learning, and metrics built for skewed data. Also known as: Imbalanced Data, Imbalanced Datasets.
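
To make the accuracy trap concrete, here is a minimal sketch in plain Python. The toy data (990 normal cases, 10 rare ones) and the helper names `accuracy` and `recall` are assumptions made up for illustration, not taken from any dataset or library.

```python
# Hypothetical toy data: 990 normal cases (label 0) and 10 rare cases (label 1).
labels = [0] * 990 + [1] * 10

# A "model" that ignores the rare class and always predicts the majority.
predictions = [0] * len(labels)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positive cases that were predicted positive."""
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return true_pos / actual_pos if actual_pos else 0.0


baseline_accuracy = accuracy(labels, predictions)
baseline_recall = recall(labels, predictions)
print(f"accuracy={baseline_accuracy:.3f} recall={baseline_recall:.3f}")
# prints: accuracy=0.990 recall=0.000
```

The baseline looks strong on accuracy while catching none of the rare cases, which is why metrics that focus on the rare class, such as recall, are usually reported alongside or instead of accuracy.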

What this topic covers

Foundations — Start here to see why class imbalance is deceptive: a model can reach high accuracy by always predicting the majority class, learning nothing about the rare events that motivated the project in the first place.
Implementation — These guides walk through the practical fixes: reweighting classes, resampling the training set, and moving the decision threshold, along with the trade-offs each one forces between catching rare cases and raising false alarms.
What's changing — The toolkit for imbalanced data keeps shifting, with established resampling tricks losing ground to newer cost-sensitive and threshold-based approaches.
Risks & limits — Before you rebalance a dataset, consider what it can distort: resampling rare classes can encode unfair assumptions about who those cases represent, and a carelessly chosen metric can hide harm to the group the model was meant to protect.
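
Two of the fixes named under Implementation, reweighting classes and moving the decision threshold, can be sketched in plain Python. Everything here is an illustrative assumption: the labels and scores are invented, the function names `balanced_class_weights` and `confusion_at_threshold` are hypothetical, and the weighting rule (total samples divided by number of classes times class count) is one common heuristic rather than the only option. Resampling is left out to keep the sketch short.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes get larger weights, so their errors cost more in training.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def confusion_at_threshold(y_true, scores, threshold):
    """Return (true positives, false positives, false negatives) at a cutoff."""
    tp = fp = fn = 0
    for truth, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1
        elif predicted == 0 and truth == 1:
            fn += 1
    return tp, fp, fn


# Hypothetical labels and model scores: 8 majority cases, 2 rare cases.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.45, 0.40, 0.70]

weights = balanced_class_weights(y_true)
default_cut = confusion_at_threshold(y_true, scores, 0.5)
lowered_cut = confusion_at_threshold(y_true, scores, 0.4)

print(weights)      # {0: 0.625, 1: 2.5}
print(default_cut)  # (1, 0, 1): one rare case caught, one missed, no false alarms
print(lowered_cut)  # (2, 1, 0): both rare cases caught, at the cost of one false alarm
```

Lowering the cutoff from 0.5 to 0.4 recovers the missed rare case but adds a false alarm, which is the trade-off between catching rare cases and raising false alarms in miniature.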

This topic is curated by our AI council.