Undersampling

Also known as: random undersampling, majority-class undersampling, downsampling

Undersampling is a data-level resampling method for imbalanced classification that discards examples from the majority class to balance class proportions, so the learning algorithm gives proportionally more weight to the minority class it would otherwise ignore.

What It Is

Train a fraud detector on real transaction data and you hit a wall: legitimate purchases outnumber fraudulent ones by a wide margin. A model can score very high accuracy by labelling everything legitimate and never flagging a single fraud, which looks impressive on paper and is useless in practice. This is the class imbalance problem, and undersampling is one of the oldest fixes. It rebalances the training data by removing examples from the majority class until the rare class is no longer drowned out, which pushes the model to learn what the minority actually looks like.
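To make the accuracy trap concrete, here is a minimal sketch in plain Python. The counts (990 legitimate, 10 fraudulent) are made up for illustration and do not come from any real dataset.

```python
# Hypothetical label counts, chosen only to illustrate the effect.
n_legit, n_fraud = 990, 10
y_true = [0] * n_legit + [1] * n_fraud   # 1 = fraud (minority class)
y_pred = [0] * len(y_true)               # baseline that always predicts "legitimate"

# Accuracy: share of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the fraud class: share of real frauds that were flagged.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / n_fraud

print(f"accuracy={accuracy:.3f} recall={recall:.3f}")
# prints: accuracy=0.990 recall=0.000
```

The baseline is right 99% of the time and still finds no fraud at all, which is why accuracy alone says little on skewed data.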

Undersampling sits in the data-level family of imbalance methods, alongside oversampling — the two approaches that change the data rather than the algorithm. Where oversampling adds minority examples, undersampling subtracts majority ones. You pick a target ratio, say one minority example for every few majority ones, then delete majority rows until the training set hits it. The simplest version, random undersampling, removes majority examples at random. Its appeal over oversampling is efficiency: a smaller training set means less memory and faster iteration, which matters when the majority class runs into the millions of rows.
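The random variant described above can be sketched in a few lines of standard-library Python. The function name `random_undersample` and its `ratio` parameter (minority count divided by majority count after resampling) are illustrative choices, not part of any library.

```python
import random

def random_undersample(rows, labels, majority_label, ratio, seed=0):
    """Keep every non-majority row and a random subset of majority rows.

    ratio is the desired minority/majority count after resampling,
    e.g. 0.5 means one minority row for every two majority rows.
    """
    rng = random.Random(seed)
    majority_idx = [i for i, y in enumerate(labels) if y == majority_label]
    minority_idx = [i for i, y in enumerate(labels) if y != majority_label]

    # Number of majority rows needed to hit the target ratio,
    # capped at what is actually available.
    n_keep = min(len(majority_idx), round(len(minority_idx) / ratio))

    kept = sorted(minority_idx + rng.sample(majority_idx, n_keep))
    return [rows[i] for i in kept], [labels[i] for i in kept]

# Toy data: 20 minority rows (label 1) and 980 majority rows (label 0).
rows = list(range(1000))
labels = [1 if i < 20 else 0 for i in rows]

X_small, y_small = random_undersample(rows, labels, majority_label=0, ratio=0.5)
print(y_small.count(1), y_small.count(0))
# prints: 20 40
```

Fixing the seed keeps the deleted rows the same from run to run, which makes experiments easier to compare.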

Smarter variants choose what to delete instead of cutting blindly. Tomek links are pairs of opposite-class examples that are each other’s nearest neighbor; removing the majority member of each pair cleans up the boundary between classes. Edited Nearest Neighbors drops majority points whose label disagrees with most of their nearest neighbors, and Cluster Centroids replaces clusters of majority examples with their centers. Whichever method you use, one rule is non-negotiable: undersample only the training data. The test set must keep its real, imbalanced distribution, and the resampling must happen inside the cross-validation loop. Resample the whole dataset first and every test fold is scored at an artificial class ratio, producing a result that looks excellent and collapses in production.
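As a rough illustration of the Tomek-link idea, the sketch below finds pairs of opposite-class points that are each other's nearest neighbor and reports the majority member of each pair. It is a brute-force toy version in plain Python, with a hypothetical function name and made-up points, and is not how any particular library implements it.

```python
import math

def tomek_link_majority_indices(points, labels, majority_label):
    """Return indices of majority points that are part of a Tomek link.

    A Tomek link is a pair of points from different classes that are
    each other's nearest neighbor.
    """
    def nearest(i):
        # Brute-force nearest neighbor by Euclidean distance.
        others = (j for j in range(len(points)) if j != i)
        return min(others, key=lambda j: math.dist(points[i], points[j]))

    nn = [nearest(i) for i in range(len(points))]
    to_drop = set()
    for i, j in enumerate(nn):
        mutual = nn[j] == i
        if mutual and labels[i] != labels[j] and labels[i] == majority_label:
            to_drop.add(i)
    return sorted(to_drop)

# Toy 2-D data: index 2 (majority) sits right next to index 3 (minority).
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (2.0, 2.0), (2.05, 2.0)]
labels = [0, 0, 0, 1, 0, 0]

print(tomek_link_majority_indices(points, labels, majority_label=0))
# prints: [2]
```

Only the majority point hugging the minority example is flagged; the other majority points pair up with same-class neighbors and are left alone.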

How It’s Used in Practice

Most data scientists meet undersampling through the imbalanced-learn library, an extension of scikit-learn built specifically for skewed data. Its RandomUnderSampler removes majority rows in two lines, and informed variants like TomekLinks, NearMiss, and EditedNearestNeighbours ship in the same package. In practice you rarely undersample alone: you chain the sampler and your model in an imbalanced-learn pipeline, hand it to cross-validation, and compare against oversampling and class weights to see which moves the metric you care about.

The setting is almost always a rare-event problem — fraud detection, customer churn, disease screening. A model trained on the raw imbalance learns to predict “normal” and stop there. Undersampling is the lever a team pulls first when the majority class is large enough that discarding part of it costs little. When the dataset is small, deleting majority data does more harm than good, and oversampling or cost-sensitive learning takes over.

Pro Tip: Try class weights before deleting a single row. Most scikit-learn classifiers accept class_weight='balanced', which weights errors on the minority class more heavily without discarding data. Reach for undersampling when the majority class is huge and training drags, measure success with PR-AUC or recall rather than accuracy, and recalibrate the predicted probabilities if you need trustworthy confidence scores.
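To see what the 'balanced' setting amounts to, the sketch below recomputes the weights by hand using the heuristic scikit-learn documents for it: n_samples / (n_classes * count of each class). The helper name and the label counts are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Made-up labels: 990 majority (0) and 10 minority (1) examples.
labels = [0] * 990 + [1] * 10
weights = balanced_class_weights(labels)

print(weights[1], round(weights[0], 3))
# prints: 50.0 0.505
```

Each minority mistake counts about a hundred times as much as a majority mistake here, and no row has been thrown away.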

When to Use / When Not

Scenario | Use | Avoid
Majority class is large and full of redundant, near-duplicate examples | ✓ |
Training time or memory is a bottleneck from a huge majority class | ✓ |
Cleaning a noisy class boundary with informed methods like Tomek links | ✓ |
Dataset is already small and the minority class is tiny | | ✓
You need predicted probabilities calibrated to real-world rates | | ✓
Resampling applied to the test set or before the train-test split | | ✓

Common Misconception

Myth: Undersampling boosts your model’s accuracy. Reality: It usually lowers raw accuracy while raising recall on the minority class, which is the outcome you actually want. On imbalanced data, accuracy misleads: a model that always predicts the majority scores high and catches nothing. Undersampling trades precision for recall: the model raises more false alarms on majority examples in exchange for catching more rare cases, so you judge it with PR-AUC, recall, or F1 rather than accuracy. It also skews predicted probabilities toward the rebalanced ratio, so they need recalibration before you read them as real-world rates.
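One common way to undo that skew is a prior-shift correction. The sketch below assumes majority rows were dropped uniformly at random, that a fraction beta of them was kept, and that the model's scores are reasonably calibrated on the resampled data. Under those assumptions the corrected probability is beta * p / (beta * p - p + 1). The function name is hypothetical, and fitting a calibrator on an untouched validation set is an alternative.

```python
def correct_probability(p_sampled, beta):
    """Map a probability learned on undersampled data back to the original class ratio.

    p_sampled: minority-class probability from the model trained on resampled data.
    beta: fraction of majority-class rows kept during undersampling (0 < beta <= 1).
    """
    return beta * p_sampled / (beta * p_sampled - p_sampled + 1.0)

# Example: only 10% of the majority rows were kept and the model outputs 0.5.
corrected = correct_probability(0.5, beta=0.1)
print(round(corrected, 4))
# prints: 0.0909
```

With beta equal to 1 (no rows dropped) the function returns its input unchanged, which is a quick sanity check on the formula.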

One Sentence to Remember

Undersampling fixes class imbalance by deleting majority-class examples so the model stops ignoring the rare class it was built to find, but since it discards real data and distorts predicted probabilities, use it when the majority class is large and redundant, keep the resampling inside the training fold, and measure success with PR-AUC or recall rather than accuracy.

FAQ

Q: What is the difference between undersampling and oversampling? A: Undersampling removes majority-class examples; oversampling adds minority-class examples by duplication or synthesis like SMOTE. Undersampling shrinks the data and can discard signal, while oversampling grows it and can overfit duplicated points.

Q: Does undersampling cause information loss? A: Yes. Random undersampling deletes real majority-class rows, so genuine patterns can vanish with them. Informed methods like Tomek links target redundant or borderline examples to limit the damage, but some loss is unavoidable.

Q: Should I undersample the test set too? A: No. Resample only the training data. The test set must keep the real-world class ratio, or your metrics describe an artificial balance that never occurs in production and overstate how well the model performs.

Expert Takes

Not a bigger minority. A smaller majority. Undersampling discards majority examples to shift the prior the classifier learns, so its decision boundary stops collapsing onto the common case. The cost is statistical: you trade away real observations the model could have learned from. Balance is a modeling decision imposed on the data, never a property it hands you for free.

The usual failure is undersampling the whole dataset before the split, leaking balanced data into your test fold and reporting a score production never delivers. The fix is structural: put the sampler inside an imbalanced-learn pipeline so cross-validation resamples each training fold on its own. Leave the test set at its natural distribution, pin the ratio in config, and the result reproduces.

Most teams discover class imbalance the day a fraud model ships with great accuracy and catches nothing. Undersampling is one of the fastest answers: drop redundant majority data, train quicker, and force the model to take the rare case seriously. You either engineer for the minority class that drives business value, or you ship a model that wins the metric nobody cares about.

Undersampling deletes real records to make a dataset behave. In fraud or medical screening those rows were real people and real events, and which get discarded is rarely examined. Random deletion can quietly strip out subgroups the majority class happened to contain. Who audits what was thrown away, and whether the model stays fair to the populations hidden in the data we chose to forget?