
Geometric Transforms, Mixup, and Back-Translation: How Core Augmentation Methods Work
Data augmentation transforms existing examples (flips, mixup blends, CutMix patches, back-translation) to teach models invariance rather than to add raw data.
Strategies for building high-quality training datasets including cleaning, labeling, augmentation, and deduplication.
Each topic below is a key concept in this domain. Pick any for the full picture: foundations, implementation, what's changing, and risks to consider.
Data augmentation expands a training dataset by creating new examples from existing ones—rotating or cropping images, …
Data labeling and annotation is the process of attaching ground-truth labels to raw data — text, images, audio, or video …
Training data quality measures how clean, consistent, and correct the examples used to train a machine learning model …
MONA's articles build your mental model — how things work, why they work that way, and what intuition to develop.
Updated Jun 3, 2026
Concepts covered

Inter-annotator agreement measures how consistently annotators label the same items. Cohen's kappa corrects the raw match rate for chance agreement, exposing unreliable labels that a 90% agreement rate can hide.
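
To make the chance correction concrete, here is a minimal sketch of Cohen's kappa for two annotators, written in plain Python. The function name `cohens_kappa` and the toy label lists are assumptions for illustration and do not come from the article. The example is built so that the annotators match on 90 of 100 items, yet almost all of that agreement is what chance alone would produce on a heavily imbalanced label set.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both pick the same label if each
    # annotator labeled at random according to their own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    if p_e == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical data: 100 items, each annotator marks 5 as "pos",
# but they never pick the same items as "pos".
annotator_a = ["neg"] * 90 + ["neg"] * 5 + ["pos"] * 5
annotator_b = ["neg"] * 90 + ["pos"] * 5 + ["neg"] * 5

raw_agreement = sum(a == b for a, b in zip(annotator_a, annotator_b)) / 100
kappa = cohens_kappa(annotator_a, annotator_b)
print(f"raw agreement = {raw_agreement:.2f}, kappa = {kappa:.3f}")
```

Here the raw agreement is 0.90 while kappa comes out slightly below zero, because the expected chance agreement (0.905) already exceeds what the annotators achieved.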

Label noise averages an estimated 3.4% across major ML test sets, distorting supervised model accuracy and even flipping benchmark leaderboard rankings.

Data augmentation expands training data by transforming existing samples—rotations, mixup, masking—to reduce overfitting without collecting anything new.
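
As one illustration of creating a training example from existing samples, the following is a minimal sketch of a mixup blend in plain Python. The function name `mixup`, the `alpha=0.2` default, and the toy two-feature inputs are assumptions for illustration. The mixing weight is drawn from a Beta(alpha, alpha) distribution, and the same weight is applied to the inputs and to the one-hot labels.

```python
import random


def mixup(x1, y1, x2, y2, alpha=0.2, rng=random):
    """Blend two (features, one-hot label) pairs with a Beta-sampled weight."""
    if len(x1) != len(x2) or len(y1) != len(y2):
        raise ValueError("paired examples must have matching shapes")

    # lam lies in [0, 1]; a small alpha keeps most draws near 0 or 1,
    # so most blended examples stay close to one of the two originals.
    lam = rng.betavariate(alpha, alpha)

    # Apply the same convex combination to features and to labels.
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam


# Hypothetical toy pair: two examples from different classes.
rng = random.Random(0)  # seeded only to make the sketch repeatable
x, y, lam = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], rng=rng)
print(f"lam = {lam:.3f}, blended features = {x}, soft label = {y}")
```

The blended label is a soft target whose entries still sum to 1, which is why a mixup example does not correspond to any single original class.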

Data labeling assigns ground-truth labels to raw data so supervised models learn a mapping. Label noise propagates into model errors geometrically.

Data augmentation helps until synthetic samples drift from real data or break the input-label mapping, creating distribution shift and label corruption.

Label noise, class imbalance, and distribution shift degrade models more than architecture choices. Understand all three before curating training data.

Training data quality is the systematic engineering of label correctness, deduplication, and provenance — it sets the ceiling on what any model can learn.

Cleaning training data at scale hits hard limits: label errors average 3.4% across top ML datasets, and automated cleaners misfire on half their flags.