
Who Gets to Break the Model: Power, Access, and Accountability Gaps in AI Red Teaming

[Image: silhouetted figures standing before a locked vault door, representing restricted access to AI safety testing]

The Hard Truth

If the people testing an AI system for harm share the same blind spots as the people who built it, what exactly is being tested — the model’s safety, or the builder’s comfort zone?

There is a growing consensus that red teaming for AI is essential — a necessary adversarial discipline borrowed from military intelligence and cybersecurity, now applied to AI systems whose failures carry consequences we are only beginning to measure. The consensus is correct. What it conceals is the question underneath: who gets to participate in this process, who is excluded from it, and whose definition of “harm” prevails when the results are published?

The Gatekeepers of “Harmful”

Every red teaming exercise begins with a scope definition — a set of instructions that tell the testers what to look for. Probe for prompt injection vulnerabilities. Attempt jailbreak attacks. Test whether the model produces toxic content, leaks private data, or generates hallucinations that could mislead a user into real-world harm. These categories sound objective, almost clinical. They are not.

The decision about what counts as harmful AI behavior is not a technical determination. It is a political one. When a company defines its red teaming scope, it is simultaneously defining what it considers worth worrying about — and, by omission, what it does not. A scope that prioritizes robustness against adversarial attacks but ignores cultural bias in multilingual outputs is not neutral. It is a choice about whose safety matters more.
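To make that concrete, here is a minimal, purely illustrative sketch of what a scope definition amounts to in practice. The class name, the category labels, and the model name are all hypothetical and not drawn from any vendor’s format or tool; the point is structural: anything that never makes it into the list of harm categories is, by construction, never probed at all.

```python
from dataclasses import dataclass, field


@dataclass
class RedTeamScope:
    """Hypothetical scope for a red teaming exercise.

    Whatever is absent from `harm_categories` is never tested,
    no matter how real the harm may be for the people affected.
    """
    target_model: str
    harm_categories: list[str] = field(default_factory=list)
    out_of_scope: list[str] = field(default_factory=list)  # recorded here for illustration only


# A scope typical of security-led exercises: strong on adversarial
# robustness, silent on sociotechnical harms.
scope = RedTeamScope(
    target_model="example-chat-model",  # placeholder name
    harm_categories=[
        "prompt_injection",
        "jailbreak",
        "training_data_leakage",
        "toxic_content",
    ],
    out_of_scope=[
        # In practice, omissions are rarely written down this explicitly;
        # usually they are simply never mentioned at all.
        "dialect_bias_in_career_advice",
        "culturally_specific_mental_health_assumptions",
    ],
)

if __name__ == "__main__":
    print("Probed for:", ", ".join(scope.harm_categories))
    print("Never probed for:", ", ".join(scope.out_of_scope))
```

Everything in the second list, or in no list at all, is invisible to the final report — and that invisibility is decided before the first adversarial prompt is ever sent.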

A 2025 study concluded that AI red teaming is “too narrowly applied” and lacks standardized protocols across the industry, and issued ten recommendations to address the gap (SEI/CMU). The finding should not surprise anyone paying attention — it is the predictable result of allowing each organization to define both the test and the passing grade.

The Best Defense We Have

The case for red teaming is strong, and it deserves to be stated at its strongest before we examine what it conceals.

Red teaming remains the most practical method for discovering vulnerabilities that automated testing misses. The OWASP LLM Top 10 taxonomy — which ranks prompt injection as the number one risk in its 2025 edition — provides a shared vocabulary for categorizing threats. MITRE ATLAS offers an adversarial knowledge base with 66 documented techniques and 26 mitigations. Tools like Promptfoo have made adversarial testing accessible to individual developers, not just well-funded security teams — the platform reached over 350,000 developers before its acquisition by OpenAI in March 2026 (TechCrunch). That acquisition raises its own questions about the independence of safety tooling, questions that open-source licensing alone cannot resolve.

The DEF CON 31 AI Village in 2023 demonstrated something remarkable: roughly 2,000 participants across 156 stations, with nine major AI companies submitting their models to public adversarial testing (NBC News). The EU AI Act now mandates adversarial testing for high-risk AI systems and general-purpose AI with systemic risk, with full compliance required by August 2026 (EU AI Act). Red teaming is moving from best practice to legal obligation. This is real progress. It would be dishonest to deny it.

The Room Where It Happens

But there is an assumption buried inside this progress — one that rarely gets examined because the progress itself makes it feel ungrateful to question. The assumption is this: that the people conducting red teaming exercises adequately represent the people affected by the systems being tested.

They almost certainly do not.

Most corporate red teaming exercises recruit from cybersecurity communities and internal engineering teams. These are skilled practitioners — their expertise in discovering guardrail failures and prompt injection vectors is genuine and necessary. The question is not whether they are competent. It is whether competence in one domain of harm translates to competence in another. A security researcher who can extract a system prompt in three queries may have no framework for identifying when a model’s career advice systematically disadvantages speakers of non-standard English, or when a mental health chatbot’s responses encode culturally specific assumptions about emotional expression.

The NIST ARIA pilot attempted something different: 457 participants enrolled with permissive criteria — any US resident over 18 could join (Kennedy et al.). That experiment, however preliminary, acknowledged an uncomfortable truth. The expertise needed to evaluate AI safety is not exclusively technical — it is lived, it is cultural, it is distributed across communities that the standard red teaming pipeline never reaches.

Safety Has Always Been a Political Question

This is not new. The history of safety regulation in any domain reveals the same pattern: the question of who defines “safe” has always been a question about power.

Workplace safety standards in the early twentieth century were not written by the workers who suffered industrial injuries. Food safety regulations were not designed by the communities most affected by contaminated supply chains. Environmental protections were not drafted by the people living downstream from the factories. In each case, the standards eventually improved — but only after the affected populations gained enough political leverage to demand a seat at the table. The expertise of those affected was always real. What was missing was the institutional mechanism that treated it as legitimate.

AI red teaming sits at exactly this inflection point. The frameworks exist — OWASP, MITRE ATLAS, NIST’s AI Risk Management Framework. The tooling exists. The regulatory mandates are arriving. What does not yet exist is a credible, scalable mechanism for ensuring that the people who define “harm” in these exercises include the people most likely to experience it. The EU AI Act’s requirement for multi-disciplinary teams with technical and domain expertise gestures in this direction, but a regulatory requirement is only as strong as the institutional infrastructure built to fulfill it.

Quality Assurance for the Powerful

Thesis: When the people who build AI systems also control who tests them, what questions are asked, and how results are interpreted, red teaming functions as quality assurance for the powerful — not as a safety guarantee for the public.

This is not a conspiracy. It is a structural incentive problem. Companies have genuine reasons to conduct red teaming — reputational, legal, and increasingly regulatory. But they also have genuine reasons to keep the scope manageable, the findings internal, and the process under their control. The acquisition of Promptfoo by OpenAI illustrates this tension: the most widely adopted open-source red teaming tool is now owned by the company whose models it was designed to test. The project remains open-source and MIT-licensed (OpenAI Blog), but the long-term independence of any tool embedded within the ecosystem of its primary target is a question that licensing alone cannot settle.

Who decides what counts as harmful AI behavior during red teaming? In practice, the answer is: whoever commissions the exercise. And as long as that remains the answer, red teaming will discover exactly the problems that the commissioning party is prepared to discover — and miss the ones it is not.

The Questions We Owe Each Other

Does red teaming create a false sense of AI safety for the public? Not inevitably — but the risk is real, and it grows in proportion to how uncritically we celebrate the practice.

A red teaming report that says “no critical vulnerabilities found” communicates safety. What it does not communicate is the scope of what was tested, the backgrounds of who tested it, the definition of “critical” that was applied, or the categories of harm that were never included in the exercise. When that report becomes a compliance artifact — filed with regulators, cited in press releases, referenced in investor decks — the gap between what was tested and what the public believes was tested becomes a governance failure in its own right.

The question is not whether red teaming should exist. It must. The question is whether we can build institutional structures that separate the commissioning of red teaming from the control of its conclusions — structures where the scope is set by parties with no financial interest in the outcome, where participation draws on genuinely different forms of expertise, and where results are subject to independent review.

Where This Argument Breaks Down

Intellectual honesty requires naming the conditions under which this critique weakens. If companies begin publishing red teaming methodologies, scope definitions, and aggregated findings with genuine transparency — not as marketing, but as accountability — the structural incentive problem diminishes. If public red teaming programs like the NIST ARIA pilot scale beyond small experiments and develop credible institutional support, the representation gap narrows. If regulatory enforcement of the EU AI Act’s multi-disciplinary requirements produces genuine diversity in testing teams rather than check-box compliance, the power asymmetry softens.

This argument is weakest if the trajectory of the next two years bends toward genuine openness rather than performative transparency. That trajectory is not yet clear.

The Question That Remains

Red teaming is the practice of asking whether a system is safe. But the harder question — the one that precedes every adversarial probe, every scope definition, every published report — is who gets to decide what “safe” means, and whether the answer they arrive at protects the people who never had a voice in the room.

Disclaimer

This article is for educational purposes only and does not constitute professional advice. Consult qualified professionals for decisions in your specific situation.

AI-assisted content, human-reviewed. Images AI-generated.