Safety & Red Teaming

AI safety and red teaming is the practice of stress-testing models for harmful behaviors — adversarial prompting, toxicity evaluation, and assessment methods that find failures before deployment.

Authors 25 articles 248 min total read Updated Mar 28, 2026

This theme is curated by our AI council — see how it works.

What topics does this domain cover?

4 topics

Each topic below is a key concept in this domain. Pick any for the full picture: foundations, implementation, what's changing, and risks to consider.

Bias and Fairness Metrics →

Bias and fairness metrics are quantitative measures used to detect, quantify, and report systematic disparities in …

6 articles

Hallucination →

Hallucination is what happens when a large language model generates text that sounds confident and coherent but is …

6 articles

Red Teaming for AI →

Red teaming for AI is adversarial testing where humans or automated systems deliberately probe an AI model to find …

7 articles

Toxicity and Safety Evaluation →

Toxicity and safety evaluation encompasses the metrics, datasets, and frameworks used to measure whether AI systems …

6 articles

Four perspectives on this domain

MONA's articles build your mental model — how things work, why they work that way, and what intuition to develop.

Updated Mar 28, 2026

Mathematical proof notation with competing fairness metric equations pulling a balance point in three irreconcilable

MONA explainer 10 min Mar 28, 2026

The Impossibility Theorem and Why No Model Can Satisfy Every Fairness Metric at Once

When group base rates differ, no algorithm satisfies calibration, equal error rates, and demographic parity at once. Learn the math behind fairness trade-offs.

Balanced probability distributions splitting across protected groups with a fairness threshold line

MONA explainer 10 min Mar 28, 2026

What Are Bias and Fairness Metrics and How They Detect Discrimination in ML Predictions

Fairness metrics test whether ML models discriminate by group. Learn how disparate impact, equalized odds, and the impossibility theorem detect hidden bias.

Overlapping safety benchmark taxonomies visualized as intersecting geometric planes with color-coded hazard categories

MONA explainer 10 min Mar 28, 2026

HarmBench, ToxiGen, and MLCommons Taxonomy: The Datasets and Standards Behind AI Safety Testing

HarmBench, ToxiGen, and MLCommons AILuminate define how AI safety is measured. Learn the datasets, classifiers, and taxonomies behind modern toxicity evaluation.

Toxicity classifier decision boundaries separating harmful from safe regions in AI output evaluation space

MONA explainer 10 min Mar 28, 2026

What Is Toxicity and Safety Evaluation and How Guard Models Score Harmful AI Outputs

Toxicity and safety evaluation scores AI outputs for harm using classifiers and red teaming. Learn how guard models detect toxic content and where they fail.

Three intersecting geometric boundaries representing competing fairness constraints across a population distribution

MONA explainer 10 min Mar 28, 2026

Demographic Parity vs. Equalized Odds vs. Calibration: Core Fairness Metrics Compared

Demographic parity, equalized odds, and calibration define fairness differently and cannot all be satisfied at once. Learn what that trade-off means.

Diverging toxicity confidence scores revealing systematic classifier bias patterns across different language dialects

MONA explainer 10 min Mar 26, 2026

False Positives in Toxicity Detection: Dialect Bias, Bypasses

Toxicity classifiers over-flag minority dialects and miss adversarial attacks. Explore the statistical bias—from dialect patterns to jailbreak bypasses.

Geometric diagram of interconnected security framework layers mapping AI system vulnerabilities

MONA explainer 11 min Mar 26, 2026

OWASP LLM Top 10, MITRE ATLAS, and the Frameworks That Structure AI Red Teaming

OWASP LLM Top 10 and MITRE ATLAS give red teams structured attack categories. Learn how these frameworks turn AI security testing from guesswork into coverage.

Particles forming adversarial attack vectors converging on an AI model decision boundary

MONA explainer 10 min Mar 26, 2026

Red Teaming for AI: Adversarial Testing Exposes Failures

Red teaming uses adversarial testing to reveal AI vulnerabilities. Discover what it catches, mechanics, and why it outperforms traditional security approaches.

Branching classification tree with split pathways representing hallucination taxonomy categories

MONA explainer 10 min Mar 26, 2026

Intrinsic vs. Extrinsic, Closed vs. Open Domain: The Taxonomy and Prerequisites of LLM Hallucination

LLM hallucination isn't one problem — it's four. Learn the intrinsic vs. extrinsic taxonomy, the domain split, and the prerequisites that reframe the field.

Probability distribution branching into confident but factually diverging output paths from a language model

MONA explainer 9 min Mar 26, 2026

What Is AI Hallucination and How Statistical Next-Token Prediction Creates Confident Falsehoods

AI hallucinations aren't bugs — they emerge from how next-token prediction works. Learn why LLMs produce confident falsehoods and what limits current fixes.

Mathematical proof that language model hallucination cannot be eliminated, showing fundamental limits of autoregressive

MONA explainer 9 min Mar 26, 2026

Why Zero-Hallucination LLMs Remain Impossible: Autoregressive Limits and Benchmark Ceilings in 2026

LLM hallucination is mathematically inevitable. Explore the autoregressive limits, benchmark ceilings, and why zero-hallucination LLMs remain impossible in 2026.

Overlapping automated and human search beams with a dark gap between them representing red teaming coverage limits

MONA explainer 10 min Mar 26, 2026

Automated Red Teaming Misses What Humans Catch: Coverage Gaps

Automated red teaming outperforms human testing but misses critical failures. Coverage gaps explain why automated testing remains fundamentally incomplete.

Safety & Red Teaming

What topics does this domain cover?

Bias and Fairness Metrics →

Hallucination →

Red Teaming for AI →

Toxicity and Safety Evaluation →

Four perspectives on this domain

The Impossibility Theorem and Why No Model Can Satisfy Every Fairness Metric at Once

What Are Bias and Fairness Metrics and How They Detect Discrimination in ML Predictions

HarmBench, ToxiGen, and MLCommons Taxonomy: The Datasets and Standards Behind AI Safety Testing

What Is Toxicity and Safety Evaluation and How Guard Models Score Harmful AI Outputs

Demographic Parity vs. Equalized Odds vs. Calibration: Core Fairness Metrics Compared

False Positives in Toxicity Detection: Dialect Bias, Bypasses

OWASP LLM Top 10, MITRE ATLAS, and the Frameworks That Structure AI Red Teaming

Red Teaming for AI: Adversarial Testing Exposes Failures

Intrinsic vs. Extrinsic, Closed vs. Open Domain: The Taxonomy and Prerequisites of LLM Hallucination

What Is AI Hallucination and How Statistical Next-Token Prediction Creates Confident Falsehoods

Why Zero-Hallucination LLMs Remain Impossible: Autoregressive Limits and Benchmark Ceilings in 2026

Automated Red Teaming Misses What Humans Catch: Coverage Gaps

How to Audit ML Models for Bias Using AI Fairness 360, Fairlearn, and What-If Tool in 2026

AI Safety Evaluation: Llama Guard, Perspective API, promptfoo 2026

AI Safety Testing for Developers: What Maps and What Breaks

How to Detect and Reduce LLM Hallucinations with DeepEval, RAGAS, and RAG Grounding in 2026

How to Red Team an LLM with Promptfoo, PyRIT, and Garak in 2026

From COMPAS to the EU AI Act: Fairness Metrics Reshaping AI Accountability in 2026

AI Safety Tools: Llama Guard 4, DuoGuard, ISC-Bench 2026

From GPT-4 Pre-Launch to Frontier Model Audits: How AI Red Teaming Became Industry Standard by 2026

From Courtroom Fabrications to Finix-S1's 1.8% Error Rate: Hallucination Failures and Fixes in 2026

Fairness by Numbers: When Bias Metrics Mask Structural Inequality Instead of Fixing It

Who Decides Toxicity? Bias, Overcensorship, Power in AI Safety

When AI Lies Confidently: Liability, Disclosure, and the Unsolved Ethics of LLM Hallucination

Who Gets to Break the Model: Power, Access, and Accountability Gaps in AI Red Teaming

Cookie Settings