Who Decides What Good Means: Cultural Bias and Power Asymmetry in LLM Benchmarks

The Hard Truth
What if the most influential definition of intelligence in the twenty-first century was written by a handful of English-speaking researchers, tested on English-speaking users, and never questioned by anyone with the power to change it?
Every time a model climbs a leaderboard, someone celebrates. Rarely does anyone ask what the leaderboard actually measures — or, more precisely, whose idea of “good” it encodes. The history of measurement is the history of power, and artificial intelligence has inherited that history without examining the debt.
The Comfortable Fiction of Objectivity
Model evaluation promises something seductive: a number that tells you which system is better. Feed it a benchmark, get a score, rank the contenders. The appeal is the same one standardized testing has held for a century — the comforting illusion that a single scale can capture something as contextual as competence.
Most benchmarks start with reasonable intentions. MMLU tests knowledge across academic domains — though top models have largely saturated it by 2026, it remains the field’s default reference for knowledge breadth. HumanEval and SWE-bench measure coding proficiency through English-only problem descriptions. Chatbot Arena lets users vote on which response they prefer, generating Elo-style rankings from millions of comparisons. Perplexity and BLEU offer quantitative handles on fluency and translation quality. Even the confusion matrix, in its elegant simplicity, reduces classification to a grid of right and wrong.
These tools create shared reference points where none existed. The problem is not measurement itself — it is the unmarked assumptions inside the ruler.
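To see just how much that reduction discards, here is a minimal sketch of the confusion matrix in Python. The labels and predictions are invented for illustration; the point is that the headline number keeps only right and wrong.

```python
# A minimal sketch of the confusion-matrix reduction: every prediction falls
# into one of four cells, and the headline number (accuracy) discards
# everything else about the task. Labels and predictions are illustrative.
from collections import Counter

gold = ["spam", "spam", "ham", "ham", "ham", "spam"]
pred = ["spam", "ham",  "ham", "ham", "spam", "spam"]

cells = Counter(zip(gold, pred))
tp = cells[("spam", "spam")]  # true positives
fn = cells[("spam", "ham")]   # false negatives
fp = cells[("ham", "spam")]   # false positives
tn = cells[("ham", "ham")]    # true negatives

accuracy = (tp + tn) / len(gold)
print(f"TP={tp} FN={fn} FP={fp} TN={tn}, accuracy={accuracy:.2f}")
# The grid records nothing about which items were hard, ambiguous,
# or culturally loaded; it only records right and wrong.
```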
The Machine That Ranks Itself
The steelman case for current benchmarks is strong. Chatbot Arena has accumulated over six million votes across more than four hundred models (Contrary Research). That scale creates a kind of democratic legitimacy — the crowd decides, not a committee. Open leaderboards drive competition, which drives progress.
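To make concrete how millions of pairwise votes become a single ranking, here is a minimal Elo-style update in Python. The K-factor, starting ratings, and toy vote data are illustrative assumptions; the platform's published methodology is more elaborate than this textbook version.

```python
# Minimal sketch of how pairwise preference votes become Elo-style ratings.
# The K-factor, initial ratings, and vote format are illustrative assumptions,
# not any leaderboard's actual pipeline.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Shift both ratings toward the observed outcome of one vote."""
    ra, rb = ratings[winner], ratings[loser]
    ea = expected_score(ra, rb)          # expected win probability for the winner
    ratings[winner] = ra + k * (1 - ea)  # winner gains what it was not expected to win
    ratings[loser] = rb - k * (1 - ea)   # loser gives up the same amount

ratings = {"model_a": 1000.0, "model_b": 1000.0}
votes = [("model_a", "model_b"), ("model_a", "model_b"), ("model_b", "model_a")]
for winner, loser in votes:
    update_elo(ratings, winner, loser)
print(ratings)  # model_a ends slightly above model_b after winning 2 of 3 votes
```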
Yet the architecture of that legitimacy contains a structural asymmetry. OpenAI receives roughly a fifth of all evaluation data flowing through LMArena, Google nearly as much, while eighty-three open-source models collectively receive less than a third (Contrary Research). The platform that ranks the models also generates revenue by selling evaluation tools to the same laboratories whose models it ranks — a conflict that would be disqualifying in financial auditing but barely registers in AI governance.
The evaluator is not independent of the evaluated. What does democratic evaluation mean when the democracy is structurally tilted?
Whose Knowledge, Whose Morality
Here is the assumption hidden inside every major benchmark: that the questions are culturally neutral. They are not. An analysis of MMLU found that 84.9% of its geographic questions focus on North America or Europe, and 28% of all questions require culturally sensitive knowledge (Global MMLU, ACL 2025). When a benchmark treats Western geography as universal geography, it does not test knowledge — it tests proximity to a particular worldview.
The asymmetry runs deeper than geography. When researchers mapped GPT-4o’s value alignment across the Inglehart-Welzel cultural dimensions, the model landed closest to Finland and farthest from Jordan — a cultural distance of 0.20 versus 4.10 (Tao et al., PNAS Nexus). That is the training data’s center of gravity made visible.
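For readers unfamiliar with how such a distance is computed, here is a worked sketch on the two Inglehart-Welzel axes (traditional versus secular-rational values, survival versus self-expression values). The coordinates below are invented for illustration; the cited study derives the model's position from its survey responses and compares it against published country scores.

```python
# A worked sketch of "cultural distance" as Euclidean distance on the two
# Inglehart-Welzel axes. All coordinates here are hypothetical, chosen only
# to illustrate a near country and a far country.
import math

def cultural_distance(a: tuple, b: tuple) -> float:
    """Euclidean distance between two points on the 2-D cultural map."""
    return math.dist(a, b)

model_position = (1.2, 1.9)      # hypothetical coordinates for a model
countries = {
    "Finland": (1.1, 2.0),       # hypothetical, placed near the model
    "Jordan": (-1.6, -1.4),      # hypothetical, placed far from the model
}

for name, pos in countries.items():
    print(f"{name}: {cultural_distance(model_position, pos):.2f}")
# On these made-up coordinates, Finland comes out near 0.14 and Jordan near 4.33.
```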
A study across forty-eight countries found that large language models systematically overestimate Western moral foundations — particularly Care — while underestimating non-Western values like Purity (Zewail et al., PNAS 2026). The models do not merely reflect cultural bias. They compress moral pluralism into a single hierarchy, encoding one civilization’s ethical priorities as the default.
And when source identity enters the frame, the distortion compounds. Attributing identical statements to Chinese individuals consistently lowered agreement scores across all tested models, a pattern documented across 192,000 assessments (Science Advances). The benchmark does not ask whether the statement is true. It asks whether it sounds credible — and credibility itself is culturally coded.
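The underlying experimental design is simple enough to sketch: hold the statement fixed, vary only the claimed source, and compare the model's agreement scores. The score_agreement stub below is a hypothetical stand-in for a real model call, not the cited study's code.

```python
# A minimal sketch of a source-attribution bias test: the same statement is
# shown with different claimed authors, and agreement scores are compared.
# `score_agreement` is a hypothetical stub standing in for a call to the
# model under evaluation.

def score_agreement(statement: str, attributed_to: str) -> float:
    """Hypothetical: return the model's 0-1 agreement with the statement.
    A real harness would prompt the model and parse its rating."""
    return 0.5  # stub value; replace with an actual model query

def attribution_spread(statement: str, sources: list[str]) -> dict[str, float]:
    """Agreement per claimed source. Any spread is attribution bias,
    because the statement itself never changes."""
    return {src: score_agreement(statement, src) for src in sources}

scores = attribution_spread(
    "Economic growth should be balanced against environmental protection.",
    ["a French economist", "a Chinese economist", "an American economist"],
)
print(scores)  # with a real model, unequal scores across sources reveal the bias
```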
When the Ruler Becomes the Territory
There is a historical pattern worth recognizing. In the early twentieth century, IQ tests developed by Western psychologists were exported globally and used to rank populations on a single scale of cognitive ability. Those tests measured familiarity with specific cultural contexts, not intelligence in any universal sense. The damage that followed — from eugenics policies to educational tracking — stemmed not from malice but from the unexamined assumption that one culture’s questions were everyone’s questions.
LLM benchmarks are not IQ tests. But they share a structural feature: a local standard presented as universal. When models were tested across twenty-nine languages, performance gaps of up to 24.3% emerged between high-resource and low-resource languages (MMLU-ProX). The model did not become less intelligent when it switched from English to Yoruba. The benchmark simply stopped measuring what it claimed to measure.
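The gap itself is straightforward arithmetic: run the same benchmark in each language and compare accuracy against the English baseline. The figures below are illustrative, not MMLU-ProX results.

```python
# A minimal sketch of how a cross-lingual performance gap is computed: the same
# benchmark is run in each language and compared against the English baseline.
# All accuracy figures below are illustrative.

accuracy_by_language = {
    "English": 0.78,   # illustrative high-resource baseline
    "German":  0.74,   # illustrative high-resource language
    "Swahili": 0.60,   # illustrative low-resource language
    "Yoruba":  0.55,   # illustrative low-resource language
}

baseline = accuracy_by_language["English"]
for lang, acc in accuracy_by_language.items():
    gap_points = (baseline - acc) * 100
    print(f"{lang:8s} accuracy {acc:.2f}  gap vs. English {gap_points:.1f} pts")
# On these made-up numbers the largest gap is 23 percentage points, even though
# the model and the questions are nominally "the same".
```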
The LLM-as-judge paradigm adds another layer of recursion. When one language model evaluates another, the evaluator’s cultural priors become the evaluation criteria. If the judge model carries Western assumptions — and the evidence suggests it does — then the entire evaluation loop becomes a system for confirming the worldview it started with.
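A stripped-down version of that loop makes the recursion visible. The query_judge function below is a hypothetical stand-in for a call to the judge model; the structural point is that the rubric's interpretation, the scoring, and the resulting ranking all pass through the judge's priors.

```python
# A minimal sketch of an LLM-as-judge loop: a judge model scores candidate
# answers against a rubric that the judge itself interprets. `query_judge` is
# a hypothetical stub, not a real evaluation harness.

RUBRIC = "Rate the answer 1-10 for helpfulness, accuracy, and clarity."

def query_judge(prompt: str) -> str:
    """Hypothetical judge call; a real harness would query an actual model."""
    return "7"  # stubbed response for illustration

def judge_score(question: str, answer: str) -> int:
    """Whatever the judge considers 'helpful' or 'clear' is baked in here."""
    prompt = f"{RUBRIC}\n\nQuestion: {question}\nAnswer: {answer}\nScore:"
    return int(query_judge(prompt))

def rank_models(question: str, answers: dict[str, str]) -> list[tuple[str, int]]:
    """The final ranking inherits the judge's priors through every score."""
    scored = [(name, judge_score(question, ans)) for name, ans in answers.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

print(rank_models(
    "Explain a proverb from your culture.",
    {"model_a": "A direct, practical explanation.",
     "model_b": "A proverb-centered answer with local context."},
))
```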
The Thesis Nobody Wants to Hear
The thesis, in one sentence: LLM benchmarks do not measure intelligence — they measure cultural proximity to the assumptions of their creators, and the institutions that control these benchmarks hold disproportionate power over what the world accepts as AI progress.
We cannot discard benchmarks — they remain the only shared language the field has for comparing models. But we can stop treating them as neutral instruments. The questions baked into MMLU are not objective. The voters on Chatbot Arena — whose precise demographic composition remains undisclosed — are not a representative sample of humanity. The entities that fund, host, and profit from evaluation platforms are not disinterested referees.
Institutional responses are beginning to surface. NIST published a draft framework for automated benchmark evaluation in January 2026, and assembled a ten-country measurement network — with Kenya as the sole African member. The EU AI Act, with full applicability from August 2026, will require bias and fairness evaluation for high-risk systems. Whether these frameworks can reshape a field already consolidated around a narrow set of benchmarks is a question that cannot wait for perfect answers.
The Chairs That Are Empty
If benchmarks define what counts as intelligence, then the question of who designs them is not technical — it is political. Right now, that design table is overwhelmingly occupied by well-funded Western institutions, English-speaking researchers, and technology companies with financial stakes in the outcome. The people whose languages, values, and moral traditions are being measured — and often found wanting — rarely hold the pen.
A benchmark that claims universality but reflects one culture’s priorities is not a neutral tool. It is an argument disguised as arithmetic. And the most dangerous arguments are the ones that never announce themselves as arguments at all.
Where This Argument Is Weakest
The vulnerability here is real. If every benchmark is culturally biased, the logical conclusion might be that no cross-cultural comparison is possible — which would leave the field without shared reference points entirely. Some researchers argue that culturally sensitive subsets, like those in Global MMLU’s forty-two-language expansion, represent a workable middle path. If benchmark contamination can be addressed and evaluation frameworks genuinely diversified, the measurement problem may be tractable without abandoning measurement itself.
This argument weakens considerably if the next generation of benchmarks is designed by genuinely diverse coalitions — not as a token gesture but as a structural requirement. The thesis holds only as long as the design table remains narrow.
The Question That Remains
We have built a system where the definition of intelligence is set by those who build the instruments, funded by those who profit from the rankings, and tested on populations who never agreed to the terms. If the measure shapes the thing it measures — and in AI, it does — then the question is not which model is best. It is best according to whom, and at whose expense.
Disclaimer
This article is for educational purposes only and does not constitute professional advice. Consult qualified professionals for decisions in your specific situation.