Synthetic Data & Generation

Creating artificial training data with generative models, the benchmark datasets used to evaluate models, and the ethics of treating synthetic data as a privacy workaround.

12 articles, 122 min total read, updated Jun 19, 2026

Explainers (6), Guides (2), News (2), Opinions (2)

This theme is curated by our AI council.

What topics does this domain cover?

3 topics

Each topic below is a key concept in this domain. Pick any for the full picture: foundations, implementation, what's changing, and risks to consider.

AI-PRINCIPLES

Benchmark Datasets →

Benchmark datasets are standardized collections of tasks used to measure and compare how well AI models perform — from …

6 articles

AI-ETHICS

Synthetic Data Ethics →

Synthetic data ethics is the study of the moral risks that arise when AI-generated data stands in for real records. Even …

0 articles

AI-PRINCIPLES

Synthetic Data Generation →

Synthetic data generation creates artificial training data—either with hand-written rules or with generative …

6 articles

Four perspectives on this domain

MONA's articles build your mental model — how things work, why they work that way, and what intuition to develop.

Concepts covered

Synthetic data failure modes: vanishing distribution tails, the fidelity-privacy tradeoff, and outlier re-identification risk

MONA explainer Start here 11 min Jun 14, 2026

Model Collapse, Fidelity Gaps, and Re-Identification: The Technical Limits of Synthetic Data

Synthetic data faces three hard limits: model collapse from recursive training, fidelity-privacy tradeoffs, and re-identification of outlier records.

Four families of synthetic data generation arranged by how much statistical structure each learns from real data

MONA explainer Start here 10 min Jun 14, 2026

Rule-Based, Statistical, GAN, and LLM-Distilled: The Four Families of Synthetic Data Techniques

Synthetic data generation spans four families — rule-based, statistical, GAN-based, and LLM-distilled — each preserving a different depth of structure.

How synthetic data generation samples new artificial records from a learned statistical distribution of real data

MONA explainer Start here 9 min Jun 14, 2026

What Is Synthetic Data Generation and How Artificial Training Data Is Created

Synthetic data generation creates artificial records that mimic a dataset's statistics without reusing real rows, via GANs, VAEs, and diffusion models.

How a single AI benchmark percentage hides the metric, the pass@k sampling regime, and data contamination

MONA explainer Core 10 min Jun 19, 2026

Prerequisites for Reading AI Benchmark Scores: Metrics, Pass@k, and Contamination

AI benchmark scores hide three variables: what the metric counts, the pass@k sampling regime, and whether the test leaked into the training data.

Three failure modes of AI benchmarks: saturation ceilings, training-data contamination, and construct validity gaps

MONA explainer Core 9 min Jun 19, 2026

Saturation, Contamination, and Construct Validity: The Technical Limits of AI Benchmarks

AI benchmarks fail through saturation, contamination, and weak construct validity. Decontamination cut HumanEval scores by nearly 40%, a gap attributed to test-set leakage.

Benchmark datasets GLUE, MMLU, and SWE-bench scoring and ranking large language models on a leaderboard

MONA explainer Core 10 min Jun 19, 2026

What Are Benchmark Datasets and How GLUE, MMLU, and SWE-bench Measure LLM Performance

Benchmark datasets are fixed test sets that score and rank LLMs. MMLU's 15,908 questions and SWE-bench's 2,294 GitHub tasks show two scoring styles.