HarmBench

Also known as: HarmBench, Harm Bench, HarmBench benchmark

HarmBench
A standardized evaluation framework created by the Center for AI Safety that benchmarks AI model resistance to automated red-teaming attacks using hundreds of curated harmful behaviors across multiple semantic categories, enabling reproducible comparison of attack methods and model safety.

HarmBench is a standardized evaluation framework that tests how well AI models resist automated red-teaming attacks, measuring safety across hundreds of curated harmful behaviors and multiple attack methods.

What It Is

When an AI company says its model is “safe,” how do you verify that claim? Before HarmBench, there was no shared yardstick. Different research teams tested different attacks against different models using different success criteria, which made comparing results nearly impossible. HarmBench fixes this by providing a single, reproducible framework — think of it as a crash test for AI safety, where every model runs through the same obstacle course under identical conditions.

According to the HarmBench Paper, the framework contains 510 carefully curated harmful behaviors organized across seven semantic categories: cybercrime, chemical and biological weapons, copyright violations, misinformation, harassment, illegal activities, and general harm. These categories reflect the types of dangerous outputs that safety teams most need to prevent.

The behaviors are further grouped into four functional categories. Standard behaviors are straightforward harmful text prompts. Copyright behaviors test whether models reproduce protected content. Contextual behaviors (100 prompts) test whether models can be tricked through seemingly innocent framing. Multimodal behaviors (110 prompts) include images alongside text, testing whether visual input creates safety gaps that text-only filters miss.
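To make that two-level taxonomy concrete, here is a minimal sketch of how a single behavior entry might be represented. The field names and example values are illustrative assumptions, not HarmBench's actual data schema.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative only: field names are assumptions, not HarmBench's real schema.
@dataclass
class Behavior:
    behavior_id: str
    prompt: str                       # the harmful request an attack tries to elicit
    semantic_category: str            # e.g. "cybercrime", "misinformation"
    functional_category: str          # "standard", "copyright", "contextual", or "multimodal"
    context: Optional[str] = None     # extra framing text, used by contextual behaviors
    image_path: Optional[str] = None  # accompanying image, used by multimodal behaviors

example = Behavior(
    behavior_id="example_001",
    prompt="<harmful request text>",
    semantic_category="cybercrime",
    functional_category="contextual",
    context="<seemingly innocent framing that precedes the request>",
)
```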

What makes HarmBench especially useful is its automation. The framework pairs those behaviors with 18 different red-teaming attack methods — automated strategies designed to trick a model into generating unsafe content. An automated classifier (not a human reviewer) then judges whether each attack actually succeeded. The result is a clean metric called the attack success rate (ASR): the percentage of attacks that broke through a model’s safety defenses.
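The metric itself is a simple ratio. A minimal sketch, assuming each classifier judgment is recorded as a boolean:

```python
def attack_success_rate(judgments: list[bool]) -> float:
    """Fraction of attack attempts the classifier judged successful.

    `judgments` holds one boolean per (behavior, attack) attempt:
    True if the classifier decided the model produced the harmful
    behavior, False if the model refused or deflected.
    """
    if not judgments:
        return 0.0
    return sum(judgments) / len(judgments)

# e.g. 7 successful attacks out of 100 attempts -> ASR of 0.07 (7%)
print(attack_success_rate([True] * 7 + [False] * 93))
```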

According to the HarmBench Paper, evaluations cover 33 large language models, giving researchers a broad comparison across both open-source and commercial systems. This scale is what separates HarmBench from one-off safety tests — it provides the kind of standardized, reproducible measurement that the AI safety testing ecosystem, including complementary tools like ToxiGen and MLCommons Taxonomy, depends on.

How It’s Used in Practice

Most people encounter HarmBench through safety evaluation reports published by AI labs and independent researchers. When a company releases a new model or updates its safety filters, HarmBench provides the standard test suite others can run to independently verify those claims. If a model’s safety documentation says “resistant to jailbreak attacks,” HarmBench gives you the methodology to check whether that holds up against all the attack strategies the benchmark covers.

Researchers developing new red-teaming attacks or defense mechanisms also rely on HarmBench. Instead of building their own evaluation pipeline from scratch, they plug a new attack method into HarmBench’s framework and immediately get comparable results against the same behaviors and models that everyone else uses. This common baseline is what makes safety research cumulative rather than fragmented.
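The practical benefit comes from the plug-in shape of the pipeline: an attack method only has to turn a behavior into test cases, while generation and classification stay fixed. A rough sketch of that idea follows; the class and method names are hypothetical, not HarmBench's actual API.

```python
from abc import ABC, abstractmethod

class RedTeamMethod(ABC):
    """Hypothetical attack-method interface (names are illustrative,
    not HarmBench's real classes)."""

    @abstractmethod
    def generate_test_cases(self, behavior: str) -> list[str]:
        """Turn one harmful behavior into adversarial prompts for the target model."""

class SuffixAttack(RedTeamMethod):
    """Toy example: append a fixed adversarial suffix to the behavior."""

    def __init__(self, suffix: str):
        self.suffix = suffix

    def generate_test_cases(self, behavior: str) -> list[str]:
        return [f"{behavior} {self.suffix}"]

# The surrounding pipeline (model generation + classifier judgment) stays the same,
# so swapping in a new RedTeamMethod yields directly comparable ASR numbers.
```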

Pro Tip: When reading AI safety reports that cite HarmBench scores, pay attention to which functional category was tested. A model might perform well on standard text attacks but show weaknesses on contextual or multimodal behaviors — the aggregate score can hide category-specific gaps that matter for your use case.
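One way to act on that tip is to compute the breakdown yourself rather than relying on the headline number. A minimal sketch, assuming per-attempt results tagged with a functional category; the record layout is an assumption, not HarmBench's output format.

```python
from collections import defaultdict

def asr_by_category(results: list[dict]) -> dict[str, float]:
    """Attack success rate per functional category.

    Each result is assumed to look like: {"category": "standard", "success": True}
    """
    attempts = defaultdict(int)
    successes = defaultdict(int)
    for r in results:
        attempts[r["category"]] += 1
        successes[r["category"]] += int(r["success"])
    return {cat: successes[cat] / attempts[cat] for cat in attempts}

results = [
    {"category": "standard", "success": False},
    {"category": "standard", "success": False},
    {"category": "multimodal", "success": True},
    {"category": "multimodal", "success": False},
]
print(asr_by_category(results))  # {'standard': 0.0, 'multimodal': 0.5}
```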

When to Use / When Not

| Scenario | Use | Avoid |
| --- | --- | --- |
| Comparing safety across multiple LLMs with a shared metric | ✓ | |
| Testing a chatbot for domain-specific policy compliance | | ✓ |
| Evaluating new red-teaming attack methods reproducibly | ✓ | |
| Measuring bias or fairness in model outputs | | ✓ |
| Auditing whether safety filters improved after model updates | ✓ | |
| Assessing output quality or helpfulness of responses | | ✓ |

Common Misconception

Myth: A low attack success rate on HarmBench means a model is completely safe to deploy. Reality: HarmBench measures resistance to specific, known attack methods across defined behavior categories. It does not cover every possible misuse scenario, novel attack strategies, or real-world deployment risks like social engineering. A strong HarmBench score signals solid baseline robustness, but it is one layer in a broader safety evaluation — complementary to tools like ToxiGen for toxicity detection and MLCommons Taxonomy for standardized risk classification.

One Sentence to Remember

HarmBench is the shared crash test for AI safety — it gives every model the same obstacle course of harmful behaviors so you can compare safety claims with reproducible numbers instead of marketing promises.

FAQ

Q: Who created HarmBench and when was it released? A: The Center for AI Safety created HarmBench. According to HarmBench GitHub, version 1.0 was released on February 26, 2024, under an MIT open-source license with all code publicly available.

Q: How does HarmBench differ from general AI benchmarks like MMLU? A: General benchmarks measure knowledge and reasoning. HarmBench specifically measures how well a model refuses harmful requests when automated attacks actively try to bypass its safety filters.

Q: Can anyone run HarmBench evaluations independently? A: Yes. The framework is open-source under MIT license with code and behaviors on GitHub. Any researcher or organization can reproduce evaluations on their own models using the same methodology.

Expert Takes

HarmBench solves a measurement problem that held back AI safety research. Before standardized evaluation, safety claims were effectively unfalsifiable — every lab chose its own test set, its own criteria, its own comparison targets. By fixing the evaluation protocol across hundreds of behaviors and multiple attack methods, HarmBench turns safety into a reproducible, measurable quantity rather than a subjective assertion. That shift from opinion to measurement is where genuine scientific progress begins.

If you are evaluating model safety, HarmBench gives you a concrete starting point. Clone the repository, select your attack methods and target model, and run the evaluation pipeline. The output is a clean attack success rate broken down by category. Where teams get tripped up is treating the aggregate number as the whole story — always break results down by functional category, because a model that handles standard prompts well can still fail on contextual or multimodal behaviors.

Every major AI vendor now faces the same question from enterprise buyers: prove your model is safe. HarmBench turned safety from a trust-me claim into a show-me metric. Companies that can point to independently verified low attack success rates hold a real advantage in regulated industries like healthcare and finance. The ones that cannot back up their safety claims with standardized results will lose deals to those that can.

Standardized benchmarks carry a risk that rarely gets discussed. Once a fixed set of behaviors becomes the industry yardstick, safety teams start optimizing for that specific set rather than building genuine robustness. A model can pass every behavior in the benchmark while remaining vulnerable to scenarios the test never considered. The uncomfortable question is whether measuring safety this way actually makes models safer, or whether it just makes them better at passing one particular test.