Who Decides What Good Means: Cultural Bias and Power Asymmetry in LLM Benchmarks

The Hard Truth
What if the most influential definition of intelligence in the twenty-first century was written by a handful of English-speaking researchers, tested on English-speaking users, and never questioned by anyone with the power to change it?
Every time a model climbs a leaderboard, someone celebrates. Rarely does anyone ask what the leaderboard actually measures — or, more precisely, whose idea of “good” it encodes. The history of measurement is the history of power, and artificial intelligence has inherited that history without examining the debt.
The Comfortable Fiction of Objectivity
Model evaluation promises something seductive: a number that tells you which system is better. Feed it a benchmark, get a score, rank the contenders. The appeal is the same one standardized testing has held for a century — the comforting illusion that a single scale can capture something as contextual as competence.
Most benchmarks start with reasonable intentions. MMLU tests knowledge across academic domains — though top models have largely saturated it by 2026, it remains the field’s default reference for knowledge breadth. HumanEval and SWE-bench measure coding proficiency through English-only problem descriptions. Chatbot Arena lets users vote on which response they prefer, generating Elo-style rankings from millions of comparisons. Perplexity and BLEU offer quantitative handles on fluency and translation quality. Even the confusion matrix, in its elegant simplicity, reduces classification to a grid of right and wrong.
These tools create shared reference points where none existed. The problem is not measurement itself — it is the unmarked assumptions inside the ruler.
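To see just how much that reduction discards, here is a minimal sketch of the confusion matrix in Python. The labels and predictions are invented for illustration; the point is that the headline number keeps only right and wrong.

```python
# A minimal sketch of the confusion-matrix reduction: every prediction falls
# into one of four cells, and the headline number (accuracy) discards
# everything else about the task. Labels and predictions are illustrative.
from collections import Counter

gold = ["spam", "spam", "ham", "ham", "ham", "spam"]
pred = ["spam", "ham",  "ham", "ham", "spam", "spam"]

cells = Counter(zip(gold, pred))
tp = cells[("spam", "spam")]  # true positives
fn = cells[("spam", "ham")]   # false negatives
fp = cells[("ham", "spam")]   # false positives
tn = cells[("ham", "ham")]    # true negatives

accuracy = (tp + tn) / len(gold)
print(f"TP={tp} FN={fn} FP={fp} TN={tn}, accuracy={accuracy:.2f}")
# The grid records nothing about which items were hard, ambiguous,
# or culturally loaded; it only records right and wrong.
```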
The Machine That Ranks Itself
The steelman case for current benchmarks is strong. Chatbot Arena has accumulated over six million votes across more than four hundred models (Contrary Research). That scale creates a kind of democratic legitimacy — the crowd decides, not a committee. Open leaderboards drive competition, which drives progress.
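To make concrete how millions of pairwise votes become a single ranking, here is a minimal Elo-style update in Python. The K-factor, starting ratings, and toy vote data are illustrative assumptions; the platform's published methodology is more elaborate than this textbook version.

```python
# Minimal sketch of how pairwise preference votes become Elo-style ratings.
# The K-factor, initial ratings, and vote format are illustrative assumptions,
# not any leaderboard's actual pipeline.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Shift both ratings toward the observed outcome of one vote."""
    ra, rb = ratings[winner], ratings[loser]
    ea = expected_score(ra, rb)          # expected win probability for the winner
    ratings[winner] = ra + k * (1 - ea)  # winner gains what it was not expected to win
    ratings[loser] = rb - k * (1 - ea)   # loser gives up the same amount

ratings = {"model_a": 1000.0, "model_b": 1000.0}
votes = [("model_a", "model_b"), ("model_a", "model_b"), ("model_b", "model_a")]
for winner, loser in votes:
    update_elo(ratings, winner, loser)
print(ratings)  # model_a ends slightly above model_b after winning 2 of 3 votes
```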
Yet the architecture of that legitimacy contains a structural asymmetry. OpenAI receives roughly a fifth of all evaluation data flowing through LMArena, Google nearly as much, while eighty-three open-source models collectively receive less than a third (Contrary Research). The platform that ranks the models also generates revenue by selling evaluation tools to the same laboratories whose models it ranks — a conflict that would be disqualifying in financial auditing but barely registers in AI governance.
The evaluator is not independent of the evaluated. What does democratic evaluation mean when the democracy is structurally tilted?
Whose Knowledge, Whose Morality
Here is the assumption hidden inside every major benchmark: that the questions are culturally neutral. They are not. An analysis of MMLU found that 84.9% of its geographic questions focus on North America or Europe, and 28% of all questions require culturally sensitive knowledge (Global MMLU, ACL 2025). When a benchmark treats Western geography as universal geography, it does not test knowledge — it tests proximity to a particular worldview.
The asymmetry runs deeper than geography. When researchers mapped GPT-4o’s value alignment across the Inglehart-Welzel cultural dimensions, the model landed closest to Finland and farthest from Jordan — a cultural distance of 0.20 versus 4.10 (Tao et al., PNAS Nexus). That is the training data’s center of gravity made visible.
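For readers unfamiliar with how such a distance is computed, here is a worked sketch on the two Inglehart-Welzel axes (traditional versus secular-rational values, survival versus self-expression values). The coordinates below are invented for illustration; the cited study derives the model's position from its survey responses and compares it against published country scores.

```python
# A worked sketch of "cultural distance" as Euclidean distance on the two
# Inglehart-Welzel axes. All coordinates here are hypothetical, chosen only
# to illustrate a near country and a far country.
import math

def cultural_distance(a: tuple, b: tuple) -> float:
    """Euclidean distance between two points on the 2-D cultural map."""
    return math.dist(a, b)

model_position = (1.2, 1.9)      # hypothetical coordinates for a model
countries = {
    "Finland": (1.1, 2.0),       # hypothetical, placed near the model
    "Jordan": (-1.6, -1.4),      # hypothetical, placed far from the model
}

for name, pos in countries.items():
    print(f"{name}: {cultural_distance(model_position, pos):.2f}")
# On these made-up coordinates, Finland comes out near 0.14 and Jordan near 4.33.
```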
A study across forty-eight countries found that large language models systematically overestimate Western moral foundations — particularly Care — while underestimating non-Western values like Purity (Zewail et al., PNAS 2026). The models do not merely reflect cultural bias. They compress moral pluralism into a single hierarchy, encoding one civilization’s ethical priorities as the default.
And when source identity enters the frame, the distortion compounds. Attributing identical statements to Chinese individuals consistently lowered agreement scores across all tested models, a pattern documented across 192,000 assessments (Science Advances). The benchmark does not ask whether the statement is true. It asks whether it sounds credible — and credibility itself is culturally coded.
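The underlying experimental design is simple enough to sketch: hold the statement fixed, vary only the claimed source, and compare the model's agreement scores. The score_agreement stub below is a hypothetical stand-in for a real model call, not the cited study's code.

```python
# A minimal sketch of a source-attribution bias test: the same statement is
# shown with different claimed authors, and agreement scores are compared.
# `score_agreement` is a hypothetical stub standing in for a call to the
# model under evaluation.

def score_agreement(statement: str, attributed_to: str) -> float:
    """Hypothetical: return the model's 0-1 agreement with the statement.
    A real harness would prompt the model and parse its rating."""
    return 0.5  # stub value; replace with an actual model query

def attribution_spread(statement: str, sources: list[str]) -> dict[str, float]:
    """Agreement per claimed source. Any spread is attribution bias,
    because the statement itself never changes."""
    return {src: score_agreement(statement, src) for src in sources}

scores = attribution_spread(
    "Economic growth should be balanced against environmental protection.",
    ["a French economist", "a Chinese economist", "an American economist"],
)
print(scores)  # with a real model, unequal scores across sources reveal the bias
```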
When the Ruler Becomes the Territory
There is a historical pattern worth recognizing. In the early twentieth century, IQ tests developed by Western psychologists were exported globally and used to rank populations on a single scale of cognitive ability. Those tests measured familiarity with specific cultural contexts, not intelligence in any universal sense. The damage that followed — from eugenics policies to educational tracking — stemmed not from malice but from the unexamined assumption that one culture’s questions were everyone’s questions.
LLM benchmarks are not IQ tests. But they share a structural feature: a local standard presented as universal. When models were tested across twenty-nine languages, performance gaps of up to 24.3% emerged between high-resource and low-resource languages (MMLU-ProX). The model did not become less intelligent when it switched from English to Yoruba. The benchmark simply stopped measuring what it claimed to measure.
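The gap itself is straightforward arithmetic: run the same benchmark in each language and compare accuracy against the English baseline. The figures below are illustrative, not MMLU-ProX results.

```python
# A minimal sketch of how a cross-lingual performance gap is computed: the same
# benchmark is run in each language and compared against the English baseline.
# All accuracy figures below are illustrative.

accuracy_by_language = {
    "English": 0.78,   # illustrative high-resource baseline
    "German":  0.74,   # illustrative high-resource language
    "Swahili": 0.60,   # illustrative low-resource language
    "Yoruba":  0.55,   # illustrative low-resource language
}

baseline = accuracy_by_language["English"]
for lang, acc in accuracy_by_language.items():
    gap_points = (baseline - acc) * 100
    print(f"{lang:8s} accuracy {acc:.2f}  gap vs. English {gap_points:.1f} pts")
# On these made-up numbers the largest gap is 23 percentage points, even though
# the model and the questions are nominally "the same".
```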
The LLM-as-judge paradigm adds another layer of recursion. When one language model evaluates another, the evaluator’s cultural priors become the evaluation criteria. If the judge model carries Western assumptions — and the evidence suggests it does — then the entire evaluation loop becomes a system for confirming the worldview it started with.
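A stripped-down version of that loop makes the recursion visible. The query_judge function below is a hypothetical stand-in for a call to the judge model; the structural point is that the rubric's interpretation, the scoring, and the resulting ranking all pass through the judge's priors.

```python
# A minimal sketch of an LLM-as-judge loop: a judge model scores candidate
# answers against a rubric that the judge itself interprets. `query_judge` is
# a hypothetical stub, not a real evaluation harness.

RUBRIC = "Rate the answer 1-10 for helpfulness, accuracy, and clarity."

def query_judge(prompt: str) -> str:
    """Hypothetical judge call; a real harness would query an actual model."""
    return "7"  # stubbed response for illustration

def judge_score(question: str, answer: str) -> int:
    """Whatever the judge considers 'helpful' or 'clear' is baked in here."""
    prompt = f"{RUBRIC}\n\nQuestion: {question}\nAnswer: {answer}\nScore:"
    return int(query_judge(prompt))

def rank_models(question: str, answers: dict[str, str]) -> list[tuple[str, int]]:
    """The final ranking inherits the judge's priors through every score."""
    scored = [(name, judge_score(question, ans)) for name, ans in answers.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

print(rank_models(
    "Explain a proverb from your culture.",
    {"model_a": "A direct, practical explanation.",
     "model_b": "A proverb-centered answer with local context."},
))
```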
The Thesis Nobody Wants to Hear
The thesis, in one sentence: LLM benchmarks do not measure intelligence — they measure cultural proximity to the assumptions of their creators, and the institutions that control these benchmarks hold disproportionate power over what the world accepts as AI progress.
We cannot discard benchmarks — they remain the only shared language the field has for comparing models. But we can stop treating them as neutral instruments. The questions baked into MMLU are not objective. The voters on Chatbot Arena — whose precise demographic composition remains undisclosed — are not a representative sample of humanity. The entities that fund, host, and profit from evaluation platforms are not disinterested referees.
Institutional responses are beginning to surface. NIST published a draft framework for automated benchmark evaluation in January 2026, and assembled a ten-country measurement network — with Kenya as the sole African member. The EU AI Act, with full applicability from August 2026, will require bias and fairness evaluation for high-risk systems. Whether these frameworks can reshape a field already consolidated around a narrow set of benchmarks is a question that cannot wait for perfect answers.
The Chairs That Are Empty
If benchmarks define what counts as intelligence, then the question of who designs them is not technical — it is political. Right now, that design table is overwhelmingly occupied by well-funded Western institutions, English-speaking researchers, and technology companies with financial stakes in the outcome. The people whose languages, values, and moral traditions are being measured — and often found wanting — rarely hold the pen.
A benchmark that claims universality but reflects one culture’s priorities is not a neutral tool. It is an argument disguised as arithmetic. And the most dangerous arguments are the ones that never announce themselves as arguments at all.
Where This Argument Is Weakest
The vulnerability here is real. If every benchmark is culturally biased, the logical conclusion might be that no cross-cultural comparison is possible — which would leave the field without shared reference points entirely. Some researchers argue that culturally sensitive subsets, like those in Global MMLU’s forty-two-language expansion, represent a workable middle path. If benchmark contamination can be addressed and evaluation frameworks genuinely diversified, the measurement problem may be tractable without abandoning measurement itself.
This argument weakens considerably if the next generation of benchmarks is designed by genuinely diverse coalitions — not as a token gesture but as a structural requirement. The thesis holds only as long as the design table remains narrow.
The Question That Remains
We have built a system where the definition of intelligence is set by those who build the instruments, funded by those who profit from the rankings, and tested on populations who never agreed to the terms. If the measure shapes the thing it measures — and in AI, it does — then the question is not which model is best. It is best according to whom, and at whose expense.
Disclaimer
This article is for educational purposes only and does not constitute professional advice. Consult qualified professionals for decisions in your specific situation.