Whose Data Counts: Bias Amplification, Provenance, and the Accountability Gap in Training Data

The Hard Truth
We keep treating a model’s prejudices as a flaw in the code, something a clever engineer will eventually patch out. But the model is faithful. It learns exactly what we showed it — and then, quietly, it learns to show us a little more of it than we ever put in. What if the problem was never the algorithm, but the question we refused to ask about the data underneath it?
There is a question that rarely survives the journey from the research lab to the press release. Not “is this model accurate?” but “accurate for whom?” We have built an entire discipline around measuring performance and almost no discipline around asking who that performance was measured against. The silence is not an accident. It is a decision, and like every decision about data, it has authors who would prefer to remain anonymous.
The Question We Keep Postponing
When a system already in use fails a group of people, the post-mortem almost always points downstream. The model needs more fine-tuning. The threshold needs adjusting. The guardrails need tightening. Each of these is a way of looking at the output and away from the source.
But the harm did not begin at the output. It began when someone, somewhere, decided what counted as a representative example and what counted as an edge case — and never wrote that decision down. The discomfort of Training Data Quality is that “quality” has been quietly redefined to mean clean, when the question that actually matters is whose. Who was counted, who was labeled by whom, who consented to be in the corpus at all. Those are not engineering questions. We have been answering them with engineering anyway, and pretending the answer was neutral.
The Comfortable Case for Clean Data
The conventional view is not foolish, and it deserves its strongest form. It goes like this: data is messy, models inherit that mess, so the responsible thing is to measure the mess and reduce it. This is the genuine insight behind Data-Centric AI — the argument that improving the data systematically often beats endlessly tuning the model. It is true, and it has produced real tools.
Confident Learning, the method behind the open-source Cleanlab library, can estimate which labels are probably wrong and flag them for human review. Snorkel pioneered Weak Supervision, letting teams encode their labeling logic as functions instead of hand-annotating millions of rows. Active Learning chooses the most informative examples so people label fewer of them, and curation tools in the same family — Lightly among them — help teams select what to keep. I want to be fair to this work: it is good, it is real, and Label Noise is a genuine problem that these methods genuinely reduce. The conventional wisdom is not wrong. It is just answering a smaller question than the one in front of us.
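To make the first of these tools concrete, here is a minimal sketch of the core idea behind Confident Learning, written in plain Python rather than against the Cleanlab API. It assumes out-of-sample predicted probabilities already exist for every example. The function names and the toy numbers are mine, chosen for illustration, and the real method does considerably more than this.

```python
from collections import defaultdict


def class_thresholds(labels, probs):
    """Per-class threshold: the mean predicted probability of class k
    over the examples whose given label is k."""
    sums, counts = defaultdict(float), defaultdict(int)
    for y, p in zip(labels, probs):
        sums[y] += p[y]
        counts[y] += 1
    return {k: sums[k] / counts[k] for k in counts}


def flag_suspect_labels(labels, probs):
    """Return indices whose given label disagrees with the class the
    model is confident about (probability at or above that class's threshold)."""
    thresholds = class_thresholds(labels, probs)
    suspects = []
    for i, (y, p) in enumerate(zip(labels, probs)):
        confident = [k for k in thresholds if p[k] >= thresholds[k]]
        if not confident:
            continue  # the model is not confident about any class here
        likely = max(confident, key=lambda k: p[k])
        if likely != y:
            suspects.append(i)
    return suspects


# Toy data (hypothetical): six examples, two classes.
labels = [0, 0, 0, 1, 1, 1]
probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9],
         [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]]
print(flag_suspect_labels(labels, probs))  # [2]
```

Notice what the flag actually is: a disagreement between one annotator's label and a model trained on everyone else's. It tells a reviewer where to look. It does not say who was right.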
Where “Reflection” Quietly Becomes “Amplification”
Here is the assumption hiding inside the clean-data story: that a model is a mirror. Feed it a biased world and it gives you back a biased reflection — no better, no worse. If that were true, then cleaning the data would be enough, and the worst a model could do is fail to improve on us.
It is not true. A model does not merely reflect the imbalance in its data; under the right conditions it widens it. Researchers studying data feedback loops found that models can predict an attribute at higher rates than the training statistics actually warrant, and that retraining on a model’s own outputs makes the distortion compound over time (Taori & Hashimoto, ICML 2023). The bias does not stay the size we trained it. It grows in the gap between what the data said and what the model decided to say more confidently.
And the loop is no longer hypothetical. As teams increasingly train on model-generated material, the distortion gets passed down like an inherited trait — work presented at FAccT 2024 showed that training on synthetic data amplifies bias across successive generations of models. We are building a hall of mirrors and calling the reflection objective. A system that is slightly worse for an underrepresented group does not stay slightly worse. Run at scale, retrained on its own exhaust, it becomes reliably worse, millions of times a day, wearing the unimpeachable face of automation.
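A toy simulation can show the shape of that loop. This is not the method of either cited study. It is a sketch under one loud assumption: that the model slightly over-predicts whichever attribute is already in the majority, which I stand in for with a hypothetical `sharpen` function. The `temperature` and `synthetic_share` parameters are likewise invented for illustration.

```python
def sharpen(p, temperature=0.8):
    """Toy stand-in for a model that over-predicts the majority attribute.
    A temperature below 1 pushes p away from 0.5; 0.5 itself is unchanged."""
    a = p ** (1 / temperature)
    b = (1 - p) ** (1 / temperature)
    return a / (a + b)


def simulate_generations(initial_rate, generations,
                         synthetic_share=0.5, temperature=0.8):
    """Return [rate in the original data, rate predicted by generation 1, ...].

    Each generation trains on a mix of the original data and the previous
    generation's outputs, weighted by synthetic_share.
    """
    rates = [initial_rate]
    data_rate = initial_rate
    for _ in range(generations):
        model_rate = sharpen(data_rate, temperature)
        # The next training set blends the original data with model outputs.
        data_rate = ((1 - synthetic_share) * initial_rate
                     + synthetic_share * model_rate)
        rates.append(model_rate)
    return rates


# Hypothetical starting point: the attribute appears in 60% of the data.
for generation, rate in enumerate(simulate_generations(0.6, 5)):
    print(generation, round(rate, 3))
```

In this toy, keeping the original data in the mix makes the rate drift upward and then level off, while setting `synthetic_share=1.0` lets it keep climbing. Whether real systems behave like either case is exactly the empirical question the cited work takes up.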
The Provenance We Never Recorded
So why do we keep walking into the same wall? Part of the answer is that we built these systems without keeping records of where their raw material came from — and you cannot hold anyone accountable for a decision nobody documented.
Consider what we already demand elsewhere. A pharmaceutical without an ingredient list, without a record of its trial population, would never reach a shelf. Yet undocumented data lineage is the ordinary condition of the corpora underwriting decisions about who gets a loan, an interview, a second look from a clinician. A large audit of more than 1,800 text datasets found pervasive gaps in licensing and attribution — datasets in wide use whose origins and permissions nobody could fully reconstruct (Data Provenance Initiative). Data Provenance — the plain discipline of recording what a dataset contains, how it was gathered, and what it was meant for — remains the exception rather than the norm.
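What such a record might look like is not mysterious. The sketch below is one hypothetical shape for it, built from the three things named above (what the dataset contains, how it was gathered, what it was meant for) plus a license field and a list of known gaps. The class and field names are my own assumptions, not a published schema.

```python
from dataclasses import dataclass, field, fields


@dataclass
class ProvenanceRecord:
    """Hypothetical minimal provenance record for one dataset."""
    name: str
    contents: str = ""            # what the dataset contains
    collection_method: str = ""   # how it was gathered
    intended_use: str = ""        # what it was meant for
    license: str = ""             # terms under which it may be used
    known_gaps: list = field(default_factory=list)  # who is under-represented


def undocumented_fields(record):
    """Names of fields left empty. An empty known_gaps list is treated
    as 'nobody assessed this', not as 'there are no gaps'."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


record = ProvenanceRecord(name="example-corpus",
                          contents="scraped forum posts")
print(undocumented_fields(record))
# ['collection_method', 'intended_use', 'license', 'known_gaps']
```

The design choice that matters is the last one: an empty `known_gaps` counts as missing documentation. Silence about who is absent is recorded as silence, not as a clean bill of health.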
The reason is not technical. Documentation creates accountability, and accountability is expensive. An honest provenance record that admits “this dataset overrepresents one population and was never tested on another” is a confession. It converts comfortable future deniability into specific present liability. We avoid writing it down for the same reason we avoid most uncomfortable truths: once it is on paper, someone owns it.
When the Brakes Become Law
For most of this technology’s life, that confession was voluntary, which is to say it mostly didn’t happen. That is about to change, and the change is worth thinking about not as a compliance event but as a moral one. The EU AI Act’s Article 10 will require providers of high-risk systems to document data provenance, demonstrate representativeness, and show active detection and mitigation of bias — turning what was an ethical nicety into a recorded obligation. From an ethical standpoint, this is the state insisting on the question the field kept postponing: whose data, gathered how, and answerable to whom.
It is not yet in force: these data-governance duties become applicable in August 2026, so as I write this they are imminent rather than active. And regulation is a blunt instrument; it indicates the direction society is heading more than it settles the hard cases. But notice what it concedes. The mechanisms we already have (Confident Learning to find suspect labels, Weak Supervision to encode our judgments, content-provenance standards like C2PA to track where material came from) were never sufficient on their own, because each of them faithfully amplifies the hand that wrote it. A labeling function written by someone who never had to think about a particular group will not suddenly start thinking about them. The tool polishes the bias to a higher shine and calls it quality.
Thesis: training data is not a neutral input to be cleaned, but an act of authorship to be accounted for — and the deepest risk of poor data quality is not error, but the laundering of contested human choices into the appearance of objective fact.
This is why the technical fix seduces us. It lets us feel we have addressed an ethical problem without ever having the ethical conversation. We can balance the classes, deduplicate the corpus, flag the noisy labels, and call it justice. But balance is a statistical property. Justice is a contested human judgment about who matters and how much. No amount of curation resolves a disagreement about values; it only hides that the disagreement was never had.
The Questions We Owe the People in the Data
I am wary of tidy prescriptions, so let me offer questions instead of answers. Before the next dataset is assembled: who is in this, who is missing, and who decided that absence was acceptable? When a model is found to perform worse for a group, do we treat it as a bug to be smoothed away, or as a record of who was never counted? And when the provenance is finally written down — as it soon must be — will we read it as a checkbox, or as a confession that demands a response?
These are not engineering questions, even though we will keep trying to answer them with engineering. They are questions about whose experience we are willing to treat as signal, and whose we have quietly filed under noise.
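One of those questions does at least have a measurable first step. Finding out that a model performs worse for a group requires breaking the headline score apart by group, which is the "accurate for whom?" question in its most literal form. A minimal sketch, with a hypothetical function name and toy data:

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Accuracy computed separately for each group label."""
    correct, total = defaultdict(int), defaultdict(int)
    for truth, prediction, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == prediction)
    return {group: correct[group] / total[group] for group in total}


# Toy data (hypothetical): overall accuracy is 4 of 6.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]
groups = ["a", "a", "a", "b", "b", "b"]
print(accuracy_by_group(y_true, y_pred, groups))
```

Here the single number hides a model that is right every time for group `a` and right one time in three for group `b`. The code can surface that gap. It cannot say who chose the groups, or what is owed to the people in the second one.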
Where This Argument Is Weakest
I should name the strongest objection to my own case. If the data-centric tools keep improving (if Confident Learning, better curation, and provenance standards make bias both visible and measurable), then perhaps the cleaning story is enough, and my insistence on an ethical layer is a philosopher’s overreach. It is a fair challenge. What would change my mind is evidence that documented, well-curated datasets reliably stop the amplification loop on their own, without anyone having to argue about whose data counted. I have not seen that evidence yet. Until I do, I think the harder conversation is the one we owe.
The Question That Remains
The deepest harm is not that one system misreads one face. It is that we are building an infrastructure of unaccountable judgment and calling its verdicts objective precisely because no human appears to be making them. The absence of a visible decision-maker is not the absence of a decision — it is the disappearance of someone to hold responsible. Whose data counts? Until we are willing to say the names out loud, the honest answer is: not everyone’s, and we built it that way.
Ethically, Alan.
Disclaimer
This article is for educational purposes only and does not constitute professional advice. Consult qualified professionals for decisions in your specific situation.