When Synthetic Replaces Real: Bias Laundering and Accountability in Generated Datasets

The Hard Truth
Imagine a dataset with no real people in it — every record invented by a model, every privacy worry dissolved, every awkward question about consent quietly closed. Now ask: where did the bias go? It did not leave — it learned to hide.
For decades we worried about what was inside our training data — whose faces, whose names, whose histories were swept up without asking. Synthetic data promises to end that worry by manufacturing records that resemble real people without being any of them. It is an elegant move. But elegance tends to relocate problems rather than solve them, and the problem relocated here is one of the oldest in computing: the bias we put in is the bias we get out.
The Question We Stopped Asking About Our Data
Every dataset used to arrive with a question attached: where did this come from, who collected it, and under what terms? Synthetic data generation offers a way to make that question disappear. Train a model on real data, then have it produce an entirely artificial dataset that mirrors the statistical shape of the original without containing a single real person. The privacy problem dissolves. The consent problem dissolves. And somewhere in that dissolution, a quieter question slips away with them: who is accountable for what the generated data leaves out?
Borrowing a phrase from finance, I will call this bias laundering. It is a coinage, not a technical standard — there is no settled definition and no canonical citation. But the metaphor is exact. Laundering does not make dirty money clean; it severs the link between the money and its origin so no one can follow the trail. Synthetic data can do the same to bias: not remove it, but strip away the provenance that would let anyone prove it was ever there. To see why, it helps to begin with the strongest possible case for doing it.
Why Careful People Reach for Synthetic Data
The case is genuinely strong, and dismissing it would be intellectually lazy. Real datasets are full of real harm: surveillance baked into collection, consent that was never meaningfully given, sensitive attributes exposed to anyone with access. Synthetic data answers all of this. With differential privacy layered into the generator, the influence of any single source record on the output is mathematically bounded by a privacy budget (usually written ε), which in turn limits how much the synthetic records can reveal about any individual who appeared in the source. Privacy stops being a promise and becomes a measurable guarantee.
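To make "measurable guarantee" concrete, here is a minimal sketch of the Laplace mechanism, the textbook building block of differential privacy, applied to a single count. It is not how any particular synthetic data generator implements privacy; the function names (`laplace_noise`, `dp_count`, `mean_abs_error`), the toy records, and the epsilon values are all assumptions made for illustration.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from Laplace(0, scale).

    The difference of two independent exponentials with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of the records matching `predicate`.

    A counting query changes by at most 1 when one record is added or
    removed (sensitivity 1), so Laplace noise with scale 1/epsilon gives
    an epsilon-differentially-private answer.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


def mean_abs_error(epsilon: float, trials: int = 2000, seed: int = 0) -> float:
    """Average distance between the noisy and the true count (5) over many draws."""
    rng = random.Random(seed)
    # Hypothetical toy data: 5 records from a rare group, 995 from a common one.
    records = [{"group": "rare"}] * 5 + [{"group": "common"}] * 995
    is_rare = lambda r: r["group"] == "rare"
    errors = [abs(dp_count(records, is_rare, epsilon, rng) - 5) for _ in range(trials)]
    return sum(errors) / trials


if __name__ == "__main__":
    for eps in (0.1, 1.0, 10.0):
        print(f"epsilon={eps}: mean |error| is about {mean_abs_error(eps):.2f}")
```

The guarantee is real, and it has a price: a smaller epsilon means more noise, and the same amount of noise is a far larger fraction of a five-person group than of a thousand-person one.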
The fairness case is just as serious. If a real dataset underrepresents a group, you can instruct a generator to synthesize more examples of that group, rebalancing what the world left lopsided. The tooling is mature and openly available — DataCebo’s Synthetic Data Vault and its CTGAN models, the open-source SDK from MOSTLY AI, and Gretel, significant enough that it was reportedly acquired by NVIDIA in March 2025 (TechCrunch). Even lawmakers have pointed the same direction: the EU AI Act, in force since August 2024, names synthetic and anonymised data as a means to detect and correct bias (EU AI Act analysis). When thoughtful researchers, mature tools, and regulators all lean the same way, the case deserves respect — which is exactly why its hidden assumption is so dangerous.
The Word Doing All the Work Is “Representative”
Notice the word everyone leans on: representative. The entire promise rests on the belief that a model that has learned a distribution can reproduce it faithfully, margins and all. But that is not what generative models are built to do. Many are trained to maximize likelihood, and others, such as the GANs behind CTGAN, to fool a discriminator; fit with finite capacity on finite samples, both tend to pull toward the dense center of the data and treat the sparse edges as noise to be smoothed. The rare case is statistically expensive to preserve, so it is the first thing to go.
This is where the question of whether synthetic data hides or amplifies bias gets an uncomfortable answer: it does both. It amplifies, because training on generated data sets up a feedback loop: work presented at ACM FAccT 2024 shows that model-induced distribution shift encodes a model’s own unfairness into the ground truth of the next training set, compounding it with each cycle. And it hides, because the distortion looks like cleaner data rather than corrupted data. The most rigorous version of this finding comes from Nature, where Shumailov and colleagues showed that models trained recursively on generated data suffer irreversible collapse, with the tails of the distribution (the minorities, the outliers, the exceptions) vanishing first. The people you most wanted to protect are the people the model is most likely to forget. Each pass is a kind of knowledge distillation, and what gets distilled away is the exception.
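The vanishing tail can be seen in a toy setting. The sketch below is not the experiment from the Nature paper or the FAccT work. It is a minimal stand-in, assuming the "model" is nothing more than the observed frequency of three made-up groups, refit each generation on a finite sample drawn from the previous generation's estimate. The group names, starting shares, sample size, and function names are illustrative assumptions.

```python
import random
from collections import Counter


def next_generation(probs: dict, n: int, rng: random.Random) -> dict:
    """Sample n synthetic records from `probs`, then refit on that sample.

    The refit is the maximum-likelihood estimate for a categorical
    distribution: each group's share is simply its observed frequency.
    """
    labels = list(probs)
    sample = rng.choices(labels, weights=[probs[g] for g in labels], k=n)
    counts = Counter(sample)
    return {g: counts[g] / n for g in labels}


def simulate(probs: dict, n: int, generations: int, rng: random.Random) -> list:
    """Return the estimated distribution at every generation, starting with the real one."""
    history = [dict(probs)]
    for _ in range(generations):
        probs = next_generation(probs, n, rng)
        history.append(probs)
    return history


if __name__ == "__main__":
    # Hypothetical "real" distribution with a 1% tail group.
    real = {"majority": 0.90, "minority": 0.09, "rare": 0.01}
    history = simulate(real, n=100, generations=30, rng=random.Random(0))
    for t in (0, 1, 5, 10, 30):
        print(t, history[t])
```

Once a group's share reaches zero it can never be sampled again, so no later generation can recover it. That is the toy version of irreversibility, and smaller samples reach it sooner.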
Laundering Doesn’t Clean the Money. It Erases the Trail.
The finance metaphor turns out to be more than rhetorical. When money is laundered, the goal is never to change the amount. It is to break the chain of custody, so that by the time the money looks clean, no investigator can connect it to the crime. Synthetic data performs the same operation on accountability. Once a dataset is generated rather than collected, there is no real person to point to, no original record to examine, no collection log to audit. The bias may be entirely intact, but its data provenance is gone.
Consider the older image of a photocopy of a photocopy. Each copy looks acceptable on its own; the degradation only becomes visible when you hold the latest against the original — except here, the original has been deliberately discarded. We are building datasets whose claim to fairness cannot be falsified, because the evidence that would falsify it no longer exists. A harm no one has to own is the most durable kind.
The Real Cost Is Not Bias. It Is Untraceable Bias.
Synthetic data does not remove bias from our systems; it removes our ability to locate who is responsible for it.
This holds even when the synthetic dataset is, by standard fairness metrics, less biased than the real one it replaced. The danger was never that generated data is uniquely prejudiced; often it is measurably fairer. The danger is what happens to responsibility. When a biased decision traces back to a real dataset, accountability has somewhere to land: the collection method, the vendor who sold it, the team that signed off. When it traces back to a generator, responsibility diffuses into a fog — the model’s original training data, the toolmaker (perhaps now a division of a far larger company), the team that sampled the output, the regulation that blessed the technique. Everyone touched it; no one owns it.
This is also why the distinction between tools matters. A library like Faker produces obviously fake placeholder values — names, addresses, numbers that preserve no real distribution and therefore launder nothing. The risk lives entirely in the statistically faithful generators, the ones whose output is convincing enough to be trusted and opaque enough to escape scrutiny. The better the imitation, the cleaner the laundering.
Questions Worth Sitting With
The field is not blind to this. NIST’s finalized guidance on evaluating differential privacy, published in 2025, treats bias as an explicit component of its evaluation framework rather than an afterthought (NIST). Fairness audits built on metrics like demographic parity, equal opportunity, and disparate impact are increasingly applied to synthetic sets (Preprints.org). These are real and worthwhile efforts. But notice what they audit: the output, not the chain of custody. A metric can tell you a generated dataset looks balanced today; it cannot tell you which real cases were quietly erased to make it look that way.
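The three metrics named above are simple enough to write out, which makes the limit of an output-only audit easy to see. What follows is a minimal sketch using common textbook definitions, not the method of NIST or of any cited audit; the function names and the eight-record toy dataset are assumptions made for illustration.

```python
def _rate(values):
    """Mean of a list of 0/1 values; refuses to score a group with no members."""
    if not values:
        raise ValueError("group has no members in this dataset")
    return sum(values) / len(values)


def selection_rate(preds, groups, group):
    """Share of `group` that received the positive prediction (1)."""
    return _rate([p for p, g in zip(preds, groups) if g == group])


def true_positive_rate(preds, labels, groups, group):
    """Share of the truly positive members of `group` that were predicted positive."""
    return _rate([p for p, y, g in zip(preds, labels, groups) if g == group and y == 1])


def demographic_parity_difference(preds, groups, a, b):
    """Gap in selection rates between groups a and b (0 means parity)."""
    return selection_rate(preds, groups, a) - selection_rate(preds, groups, b)


def disparate_impact_ratio(preds, groups, protected, reference):
    """Selection rate of the protected group divided by that of the reference group."""
    reference_rate = selection_rate(preds, groups, reference)
    if reference_rate == 0:
        raise ValueError("reference group has a selection rate of zero")
    return selection_rate(preds, groups, protected) / reference_rate


def equal_opportunity_difference(preds, labels, groups, a, b):
    """Gap in true positive rates between groups a and b (0 means equal opportunity)."""
    return (true_positive_rate(preds, labels, groups, a)
            - true_positive_rate(preds, labels, groups, b))


# Made-up toy data: eight records, two groups.
preds = [1, 1, 1, 0, 1, 0, 0, 0]
labels = [1, 1, 0, 0, 1, 1, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

if __name__ == "__main__":
    print(demographic_parity_difference(preds, groups, "a", "b"))         # 0.5
    print(disparate_impact_ratio(preds, groups, "b", "a"))                # one third
    print(equal_opportunity_difference(preds, labels, groups, "a", "b"))  # 0.5
```

Every function takes only the finished dataset as input. None of them has an argument for the source records, so a group that dropped out before generation is not scored as unfair; it is simply not scored.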
So the questions worth sitting with are not technical ones. Who audits the generator, and against what ground truth, when that ground truth has been thrown away? When regulation points to synthetic data as a remedy for bias, the harder problem is verifying that it corrected the harm rather than concealed it. Regulations indicate the direction society is moving — toward generated data as a fix — but a remedy whose effects cannot be traced is an act of faith, not a safeguard. We should at least know which one we are practising.
Where This Argument Could Break
This argument has a clear failure point. If provenance tooling matures — if every synthetic record can carry a verifiable lineage back to the distribution and the privacy budget that produced it, and if standardized bias evaluation along the lines NIST is building becomes routine and genuinely auditable — then synthetic generation could become more transparent than the messy, under-documented real-world collection it replaces. In that world synthetic data would increase accountability rather than erode it, and the laundering metaphor would collapse. The whole case rests on a bet about sequence: whether we build that transparency before we scale the convenience, or long after.
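What a verifiable lineage might look like can be sketched in a few lines. This is a hypothetical format, not an existing standard or any vendor's tooling: the `Lineage` fields, the `seal` and `verify` functions, and the example values are all assumptions, and the sketch uses plain SHA-256 digests where a real system would presumably need signatures from an accountable party.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Lineage:
    """Hypothetical lineage fields carried alongside a synthetic dataset."""
    source_digest: str  # digest of the source data (or of its summary statistics)
    generator: str      # identifier of the generator and its version
    epsilon: float      # privacy budget spent to produce the data
    seed: int           # sampling seed, so the run can be reproduced


def digest(obj) -> str:
    """SHA-256 over a canonical JSON encoding of `obj`."""
    payload = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def seal(lineage: Lineage, records: list) -> dict:
    """Bind the lineage to the exact records it describes."""
    body = {"lineage": asdict(lineage), "records_digest": digest(records)}
    return {**body, "manifest_digest": digest(body)}


def verify(manifest: dict, records: list) -> bool:
    """Check that neither the records nor the lineage changed since sealing."""
    body = {"lineage": manifest["lineage"], "records_digest": manifest["records_digest"]}
    return (manifest["manifest_digest"] == digest(body)
            and manifest["records_digest"] == digest(records))


if __name__ == "__main__":
    lineage = Lineage(
        source_digest=digest({"rows": 1000, "groups": ["a", "b"]}),
        generator="hypothetical-generator-0.1",
        epsilon=1.0,
        seed=7,
    )
    records = [{"age": 41, "group": "a"}, {"age": 29, "group": "b"}]
    manifest = seal(lineage, records)
    print(verify(manifest, records))      # True
    print(verify(manifest, records[:1]))  # False: a record was dropped
```

A digest like this only shows that something changed after sealing. It says nothing about who sealed it or whether the source digest was honest to begin with, which is why the bet is about institutions as much as tooling.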
The Question That Remains
We turned to synthetic data to stop harming real people with our datasets, and that instinct was right. The harder possibility is that we have only made the harm more difficult to see — and that the people most likely to disappear from a generated distribution are the ones already living at its margins. When the dataset contains no real person at all, who is left to be wronged, and who is left to answer for it?
AI-assisted content, human-reviewed. Images AI-generated.