ALAN opinion 12 min read

Underpaid Annotators and Hidden Bias: The Ethical Cost of the Data Labeling Industry

The Hard Truth

Every time an AI system answers you with confidence, someone you will never meet decided what that answer should look like. They were probably paid by the task, on a deadline, sorting through the worst material humans produce. What do we owe the people who taught the machine to seem trustworthy?

We talk about artificial intelligence as if it taught itself — as if understanding simply emerged from data and compute. But data does not label itself. Behind every clean dataset sits a person who looked at an image, read a sentence, and made a judgment about what it means. That judgment is the ground every later step stands on, and we have arranged the economy so that the people making it are the least visible and the least paid.

The Workers We Designed Not to See

There is a particular kind of invisibility built into the AI supply chain, and it is not accidental. The whole appeal of data labeling and annotation as an industry is that it converts messy human judgment into something that looks like a finished technical product: clean, neutral, ready to train a model. The human hands disappear into the abstraction. We call the result “the dataset,” not “the work of thousands of people in Nairobi, Manila, and Caracas.”

Researchers Mary Gray and Siddharth Suri gave this labor a name in 2019: ghost work. Their estimate, that roughly eight percent of Americans have done this kind of hidden online piecework at some point, only covers one country and one slice of the global pipeline. The real workforce is larger, more dispersed, and harder to count — which is exactly the condition under which exploitation tends to flourish. When no one can see the work, no one has to answer for its conditions. So the first question is not technical at all. Why did we build a foundational layer of the AI economy specifically so that we would not have to look at it?

The Case That This Is Honest Work

It would be dishonest to pretend there is no argument on the other side, because there is a real one. Remote annotation work brings income to regions where formal employment is scarce. It requires no degree, no relocation, no capital — only a connection and attention. For some workers it has been a genuine entry point, a way to earn in dollars while living in an economy that pays in much less. The companies that organize this work describe it as opportunity, and they are not entirely wrong to.

There is a quality argument too. Good models need good labels, and good labels need human discernment. The entire field of training data quality rests on the premise that careful, consistent human annotation is worth paying for, and that the judgment of a thoughtful person sorting hate speech from satire cannot yet be replaced by an algorithm. In principle, that should make annotators valuable, even irreplaceable. The work is skilled in a way the wage rarely admits. And that gap, between the value of the judgment and the price paid for it, is where the honest story starts to come apart.

What the Wage Quietly Hides

Consider the most documented case we have. When OpenAI contracted the firm Sama to help filter toxic content out of ChatGPT’s training pipeline, OpenAI agreed to pay roughly $12.50 per hour for each worker, according to CBS News. The workers themselves took home far less — most earned around $170 a month, with take-home pay landing somewhere between about $1.32 and $2 an hour, as TIME documented. Most of the contracted rate never reached the person doing the labor.
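The gap is easier to feel as arithmetic. The sketch below uses only the hourly figures quoted above; the function and variable names are illustrative, and the "retained upstream" figure is simply whatever did not reach the worker, which would include the contractor's overheads and other costs that the reporting cited here does not break down.

```python
# Hourly figures as quoted in the paragraph above (USD).
CONTRACTED_RATE = 12.50   # what the client reportedly agreed to pay per worker-hour
TAKE_HOME_LOW = 1.32      # lower end of reported take-home pay
TAKE_HOME_HIGH = 2.00     # upper end of reported take-home pay


def worker_share(take_home: float, contracted: float) -> float:
    """Fraction of the contracted hourly rate that reaches the worker."""
    return take_home / contracted


low_share = worker_share(TAKE_HOME_LOW, CONTRACTED_RATE)
high_share = worker_share(TAKE_HOME_HIGH, CONTRACTED_RATE)

print(f"Worker share of contracted rate: {low_share:.1%} to {high_share:.1%}")
print(f"Retained upstream: {1 - high_share:.1%} to {1 - low_share:.1%}")
```

On these numbers, somewhere between roughly a tenth and a sixth of the contracted rate reached the person doing the labeling.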

And the labor was not neutral. These annotators were reading and classifying detailed descriptions of child sexual abuse, torture, suicide, and self-harm, hour after hour, so that the rest of us would never encounter them. Several reported lasting psychological harm. The work ran from late 2021 until Sama ended the contract in February 2022, about eight months ahead of schedule. The clean output we celebrate was purchased with someone else’s trauma. When those labels become the ground truth a model trains on, the human cost is laundered into the system and forgotten.

The pattern is not confined to one firm. The large annotation platform Scale AI and its Remotasks arm faced two worker lawsuits in late 2024 and early 2025 alleging underpayment and misclassification of workers as contractors without overtime or sick pay. A US Department of Labor inquiry under the Fair Labor Standards Act was opened and then dropped in May 2025 with no public finding of wrongdoing — the suits remain pending, and nothing here has been proven. What is established is the structural shape: a workforce classified out of the protections that classification was invented to guarantee. An Oxford Internet Institute assessment in 2022 found that Remotasks met the Fairwork project’s minimum standards of fair work in only one of ten criteria. The terms that advocacy groups reach for — “modern slavery,” “a penny per task” — are characterizations, not official findings, and they should be read as the moral alarm of people watching closely rather than as settled fact. But the alarm is not coming from nowhere.

A Very Old Arrangement in New Clothes

It helps to step back from the screens. What we are describing is not a novel pathology of the digital age — it is one of the oldest labor arrangements we know, wearing a new interface. The putting-out system of early industrialization paid workers by the piece, pushed the risk and the idle time onto them, and kept them dispersed enough that they could never organize or even see one another. Piece rates, no benefits, no visibility, no collective voice: the structure of annotation work is a near-perfect reproduction of a model that labor movements spent a century trying to dismantle.

The technology is genuinely new. The arrangement is not. And recognizing that should puncture a comforting story we tell ourselves — that the harms of AI are unprecedented problems requiring unprecedented solutions. Some of them are. But this one we have seen before, and we already know what tends to fix it: visibility, protection, and the simple insistence that the people doing the work are workers, not line items. The reason it persists is not that the solution is mysterious. It is that the invisibility is convenient.

The Bias We Pay For Twice

Here the labor question and the fairness question turn out to be the same question. Annotation is not a mechanical transcription of reality; it is a series of human judgments, and human judgments carry the perspective of the human making them. The US National Institute of Standards and Technology, in its bias guidance (NIST SP 1270), is direct about this: people labeling data import their own subjective perceptions and stereotypes into the labels, and simply being aware of the problem is not enough to fix it. It takes structured processes and diverse teams.

The effect is measurable. Research by Goyal and colleagues found that a rater’s own identity is a statistically significant factor in how they annotate toxicity: what one person flags as harmful, another reads as ordinary, and the difference tracks who they are. On subjective tasks like hate speech and harassment, demographics and lived experience shape disagreement systematically. Now follow what the pipeline does with that disagreement. Metrics like inter-annotator agreement exist to measure how much labelers diverge, and the usual response to divergence is to resolve it: to pick a majority, smooth the conflict, and call the survivor the correct answer. The dissenting perspective is treated as noise. Then active learning loops feed the model more examples near its uncertain boundaries, data deduplication strips out redundancy, and the whole apparatus optimizes a dataset whose foundational judgment was never neutral to begin with. We pay once in the wage we underpay the annotator, and again in the unfairness we encode and then scale to millions.
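A toy sketch can make the "pick a majority" step concrete. Everything in it is hypothetical: the posts, the votes, and the rater groups "A" and "B" are invented for illustration and do not come from any study cited here. It contrasts collapsing votes into one label with keeping the full distribution of votes, and counts whose votes the majority overrode.

```python
from collections import Counter

# Hypothetical toy data: each post was labeled by five raters.
# Each vote is (rater_group, label); the groups are invented for illustration.
annotations = {
    "post_1": [("A", "ok"), ("A", "ok"), ("A", "ok"), ("B", "toxic"), ("B", "toxic")],
    "post_2": [("A", "toxic"), ("A", "toxic"), ("A", "toxic"), ("B", "toxic"), ("B", "toxic")],
    "post_3": [("A", "ok"), ("A", "ok"), ("A", "ok"), ("B", "toxic"), ("B", "ok")],
}


def majority_label(votes):
    """Collapse all votes on an item to the single most common label."""
    labels = [label for _, label in votes]
    return Counter(labels).most_common(1)[0][0]


def label_distribution(votes):
    """Keep the disagreement: the fraction of raters choosing each label."""
    labels = [label for _, label in votes]
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}


def dissent_by_group(votes):
    """Count, per rater group, the votes that the majority label overrode."""
    winner = majority_label(votes)
    return dict(Counter(group for group, label in votes if label != winner))


overridden = Counter()
for item, votes in annotations.items():
    overridden.update(dissent_by_group(votes))
    print(item, majority_label(votes), label_distribution(votes))

print("Votes overridden by majority, per group:", dict(overridden))
```

In this invented data every overridden vote belongs to group B, yet the majority labels alone carry no trace of that. Keeping the distribution, or at least recording who dissented, is one way a pipeline could avoid treating a perspective as noise; whether that helps in practice depends on the task and on how the raters were chosen.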

Thesis: The data labeling industry treats human judgment as a cheap, interchangeable raw material — and that single mistake is the root of both its labor exploitation and its bias, because undervaluing the worker and erasing their perspective are the same act.

If you take that seriously, the fairness conversation cannot be quarantined inside the engineering team. A model’s values are set long before training begins, in who was hired to label, how much they were paid to care, and whose disagreement was allowed to count. You cannot debug your way out of a judgment you bought at a discount.

The Questions We Owe the People Behind the Labels

I am wary of turning this into a checklist, because the comfort of a checklist is part of how the problem stays hidden. So instead, some questions worth sitting with. When a company publishes a model card describing its training data, why does it almost never describe the people who produced that data — where they were, what they were paid, what they were asked to look at? If annotators are skilled enough that their judgment becomes the ground truth for a billion-dollar system, what exactly justifies paying them as if the judgment were unskilled? And when a marginalized perspective is outvoted in the labeling process and discarded as disagreement, who decided that consensus was the same thing as truth?

Where This Argument Could Be Wrong

Honesty requires naming the weak points. If the industry moved decisively toward fair wages, mental-health support, worker classification, and transparent sourcing — and some firms are under real pressure to — then the labor critique would soften into a story of an industry that corrected itself, and I would gladly tell that story. And if synthetic data and automated labeling mature enough to remove humans from the worst of this work without simply hiding the harm elsewhere, the specific exploitation I have described could fade. My argument depends on the claim that the invisibility is structural rather than temporary. If it turns out to be a transitional flaw the field is actively fixing, I am wrong in the most welcome way possible.

The Question That Remains

The cleanest output in AI is the one that shows no fingerprints — no sign of the hands that sorted it or the judgment that shaped it. That cleanliness is not the absence of human cost. It is the cost, made invisible. So the question we are left with is simple and uncomfortable: if we cannot see the people who taught our machines what to value, how would we ever know what our machines have learned to ignore?

Disclaimer

This article is for educational purposes only and does not constitute professional advice. Consult qualified professionals for decisions in your specific situation.
