Data Poisoning

Data poisoning is an adversarial attack where malicious actors corrupt a model's training data to manipulate its behavior at inference time.

Attackers inject backdoor triggers, flip labels, or embed subtle corruptions that survive standard cleaning pipelines. A backdoored model behaves normally on clean inputs but produces the attacker's chosen output whenever the trigger appears, while label flipping typically degrades accuracy on the targeted classes. Also known as: Training Data Poisoning.
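To make the backdoor mechanism concrete, here is a minimal sketch of how a trigger-plus-relabel attack could be simulated on a toy dataset for defensive testing. The function name `poison_dataset`, its parameters, and the choice of stamping the trigger onto the last few features are all assumptions made for illustration; real attacks vary widely in how they hide the trigger.

```python
import random


def poison_dataset(samples, labels, target_label, rate=0.1,
                   trigger_value=1.0, trigger_size=2, seed=0):
    """Simulate a backdoor attack on a toy dataset (hypothetical helper).

    samples: list of flat feature lists, all longer than trigger_size
    labels:  list of class labels, same length as samples
    Returns (poisoned_samples, poisoned_labels, poisoned_indices).
    """
    if trigger_size < 1:
        raise ValueError("trigger_size must be at least 1")

    rng = random.Random(seed)
    n_poison = int(len(samples) * rate)
    poisoned_idx = sorted(rng.sample(range(len(samples)), n_poison))

    # Work on copies so the clean dataset stays untouched.
    new_samples = [list(s) for s in samples]
    new_labels = list(labels)

    for i in poisoned_idx:
        # Stamp the trigger pattern onto the last `trigger_size` features.
        new_samples[i][-trigger_size:] = [trigger_value] * trigger_size
        # Relabel so the model associates the trigger with the target class.
        new_labels[i] = target_label

    return new_samples, new_labels, poisoned_idx


if __name__ == "__main__":
    clean = [[0.0] * 4 for _ in range(20)]
    clean_labels = [0] * 20
    _, y, idx = poison_dataset(clean, clean_labels, target_label=1, rate=0.25)
    print(f"poisoned {len(idx)} of {len(y)} samples")
```

A model trained on the returned data would see the trigger only alongside `target_label`, which is why it can score well on clean validation data while still responding to the trigger. Simulations like this are mainly useful for checking whether a cleaning or detection step actually catches the poisoned rows.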

What this topic covers

  • Foundations — Data poisoning exploits a counterintuitive weakness: a model trained on corrupted data learns to fail by design, not by accident.
  • Implementation — These guides walk through auditing training pipelines with data provenance tools, deploying certified defenses, and generating ML bills of materials to prove data integrity before shipping.
  • What's changing — The attack surface keeps expanding: RAG pipelines, fine-tuning APIs, and agent memory stores are emerging poisoning vectors that demand continuous monitoring as AI systems grow more interconnected.
  • Risks & limits — Data poisoning raises difficult accountability questions: when a backdoored model causes harm, responsibility is diffuse across data curators, model trainers, and deployers.
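As a rough illustration of the provenance auditing mentioned under Implementation, the sketch below records a SHA-256 digest for each training record and later reports anything added, removed, or modified. The helper names `build_manifest` and `diff_manifest` are hypothetical, and this is not a substitute for a dedicated provenance tool, a certified defense, or a full ML bill of materials; it only shows the underlying idea of pinning data to known-good hashes.

```python
import hashlib


def build_manifest(records):
    """Map each record id to the SHA-256 hex digest of its raw bytes.

    records: dict of {record_id: bytes} (hypothetical in-memory dataset)
    """
    return {rid: hashlib.sha256(data).hexdigest()
            for rid, data in records.items()}


def diff_manifest(trusted, current):
    """Compare a current manifest against a trusted snapshot."""
    added = sorted(set(current) - set(trusted))
    removed = sorted(set(trusted) - set(current))
    modified = sorted(k for k in trusted.keys() & current.keys()
                      if trusted[k] != current[k])
    return {"added": added, "removed": removed, "modified": modified}


if __name__ == "__main__":
    snapshot = build_manifest({"a": b"cat,0", "b": b"dog,1"})
    # Later: one label was flipped, one record dropped, one injected.
    now = build_manifest({"a": b"cat,1", "c": b"bird,0"})
    print(diff_manifest(snapshot, now))
```

A check like this only detects changes made after the trusted snapshot was taken. Data that was already poisoned at collection time will hash consistently and pass, so it complements rather than replaces statistical or model-based detection.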

This topic is curated by our AI council — see how it works.