Agent Guardrails 2026: NeMo, Llama Guard, Claude SDK Hooks

TL;DR
- Guardrails are layered checkpoints, not a single product. Stack them at input, tool boundary, and output — each layer catches what the others miss.
- Tool-call gates are the highest-leverage layer in 2026. Most production incidents start when an agent calls the wrong tool with the wrong arguments.
- Every guardrail you add needs a test case. A guardrail without a triggering example is a hope, not a control.
Your agent worked all week. On Monday it spent the morning in a retry loop, hammering a billing endpoint with malformed arguments before the on-call engineer killed the process. The model didn’t go rogue. It just had a tool that didn’t say no, a permission layer that said yes by default, and zero output checking on the way back.
That class of incident is what Agent Guardrails are supposed to prevent — and the reason most teams fail at it is that they try to bolt on a single library instead of architecting layers.
Before You Start
You’ll need:
- An agent runtime — Claude Agent SDK, OpenAI Agents SDK, or LangGraph
- Working understanding of Guardrails as a concept, not a library
- A clear list of the tools your agent can actually call in production
- One concrete failure scenario you want to prevent
This guide teaches you: How to decompose agent safety into independent layers, decide which layer catches which class of failure, and wire them in the right order.
The Tool That Said Yes to Everything
The classic 2026 agent incident is not a jailbreak. It is a normal user request that triggered a tool call that should never have been allowed in that context.
A finance agent generates an invoice and triggers send_email to an address it pulled from a poisoned document. A coding agent reads a .env file because nothing told it that path was off-limits. A support agent issues a refund because the prompt said “be helpful.” Three different agents. Same root cause: the tool layer trusted the model.
It worked on Friday. On Monday, the input distribution shifted by a few percent and the agent found a path nobody specified.
OWASP catalogues this exact pattern as Excessive Agency — the combination of excessive functionality, excessive permissions, and excessive autonomy (OWASP GenAI Security Project). The fix is not “make the model safer.” The fix is to specify what the system around the model is allowed to do.
Step 1: Identify the Trust Boundaries
Before you reach for a library, draw the boundaries. An agent has four places where untrusted input meets trusted action. Each one is a separate concern. Each one needs its own guardrail.
Your system has these layers:
- Input layer — User prompts, retrieved documents, tool outputs from prior turns. Anything entering the model context. Concerns: prompt injection, jailbreak attempts, off-topic queries, PII the agent should not see.
- Tool gate — The point where the model says “call tool X with arguments Y” and the runtime decides whether to execute. Concerns: tool allowlists, argument validation, permission scope, human-in-the-loop (HITL) approval for high-impact actions.
- Output layer — The agent’s final reply to the user, plus any text it returns to a downstream system. Concerns: harmful content, hallucinated facts, leaked secrets, off-policy advice.
- Runtime hooks — Lifecycle events around the agent loop itself: session start, subagent spawn, compaction, completion. Concerns: audit logging, budget caps, session-level policy enforcement.
A fifth concern — Agent Evaluation And Testing — sits outside the runtime but defines whether the other four are working. Treat it as the harness, not a layer.
The Architect’s Rule: If you cannot say which layer is supposed to catch a given failure, no layer will catch it.
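If it helps to make the map concrete, a minimal sketch is a plain mapping from layer to the failure classes it owns. The layer names and failure lists below are illustrative, not a library API:

```python
# Trust boundary map: which layer owns which failure class.
# Entries are illustrative; replace them with failures from your own incident list.
TRUST_BOUNDARIES: dict[str, list[str]] = {
    "input": ["prompt injection", "jailbreak", "PII exposure", "off-topic query"],
    "tool_gate": ["unapproved tool call", "out-of-range arguments", "skipped HITL approval"],
    "output": ["harmful content", "leaked secret", "off-policy advice"],
    "runtime_hooks": ["budget overrun", "runaway subagent depth", "missing audit trail"],
}

def owning_layer(failure: str) -> str | None:
    """Return the layer responsible for catching a failure, or None if nobody owns it."""
    for layer, failures in TRUST_BOUNDARIES.items():
        if failure in failures:
            return layer
    return None  # an unowned failure is one no layer will catch
```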
Step 2: Lock Down What the Agent Can See, Call, and Return
Each layer has a different specification. The mistake is to treat them as one config file. They are four contracts.
Input layer contract:
- Which classifiers run on user input — prompt-injection detection, jailbreak detection, PII redaction
- Which classifiers run on retrieved documents — same set, different threshold
- What happens on a positive classification — block, redact, or annotate-and-forward
- Which inputs bypass classification — internal system messages, trusted retrieval sources
Tool gate contract:
- The exact set of tools each agent role is allowed to call
- For each tool, the schema and value constraints on its arguments
- Which tools require HITL approval before execution
- Which tools are allowed only in specific session contexts (e.g., `send_email` only after a user confirmation turn)
Output layer contract:
- Categories the output classifier checks — violence, self-harm, privacy, regulated advice
- The fallback the agent emits when output is blocked
- Whether the original generation is logged for review
Runtime hooks contract:
- Per-session token and dollar budget
- Maximum subagent depth
- Audit fields written on every tool call
- Trigger conditions for kill-switching the session
The Spec Test: If your tool gate config does not list every tool by name with an explicit allow/deny rule, your agent has at least one tool it can call that nobody approved.
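As a sketch of what passes that test, here is a tool gate contract written as data. Tool names, constraints, and HITL flags are placeholders; the point is that every tool appears by name and anything unlisted is denied.

```python
# Tool gate contract: every tool listed by name, deny-by-default for anything else.
# Tool names, constraints, and HITL flags below are illustrative placeholders.
TOOL_GATE_CONTRACT: dict[str, dict] = {
    "search_docs":   {"allow": True,  "hitl": False, "constraints": {"query": {"max_len": 512}}},
    "create_ticket": {"allow": True,  "hitl": False, "constraints": {"priority": {"one_of": ["low", "normal", "high"]}}},
    "send_email":    {"allow": True,  "hitl": True,  "constraints": {"to": {"max_len": 254}}},
    "issue_refund":  {"allow": False, "hitl": True,  "constraints": {}},  # denied until finance signs off
}
DEFAULT_DECISION = "deny"  # unknown tools never execute
```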
Step 3: Wire the Layers in Order
Build order matters. Start with the tool gate, not the input filter. A leaky tool boundary is the most common production failure, and the tool gate is the layer that protects you even when the input classifier misses something.
Build order:
- Tool gate first. This is the layer that prevents incidents, not just embarrassments. In the Claude Agent SDK, this is the `PreToolUse` hook plus `PermissionRequest` callbacks. The hook receives the tool name and arguments, and returns a `permissionDecision` of `"allow"`, `"deny"`, or `"ask"` (the TypeScript SDK adds `"defer"`). When multiple hooks fire, precedence is deny > defer > ask > allow — any single deny blocks the call (Claude Agent SDK Docs). For OpenAI Agents SDK 0.17.0, this is the `@tool_input_guardrail` decorator, which raises a tripwire exception when the arguments fail validation (OpenAI Agents SDK Docs).
- Output layer second. Wrap the agent’s final response in a classifier check. The Llama Guard family is the current open-weight default — Llama Guard 4 is a 12B multimodal safety classifier that scores text-and-image inputs against the MLCommons 14-category hazard taxonomy and outputs a per-category verdict (Llama Guard 4 Model Card). For prompt-injection-specific scoring on inputs, pair it with Llama Prompt Guard 2 (86M or 22M classifier variants), which produces a binary benign/malicious decision (Llama.com Docs). In Python agent code, the OpenAI Agents SDK exposes `@output_guardrail` for the same purpose, raising `OutputGuardrailTripwireTriggered` when a check fails.
- Input layer third. Now wire the input rails. NeMo Guardrails organises rails into five categories in v0.21.0 — input, dialog, output, retrieval, and execution (NVIDIA Docs) — and the input rail is where you run prompt-injection classifiers, PII redaction, and topic checks before the model ever sees the message. NeMo’s IORails engine runs these rails in parallel, which matters when you stack three or four classifiers per request (NeMo Guardrails GitHub).
- Runtime hooks last. Add lifecycle hooks for budgets, audit, and kill-switches. The Claude Agent SDK exposes `PostToolUse`, `PostToolUseFailure`, `Stop`, `SubagentStart`, `SubagentStop`, `PreCompact`, and `UserPromptSubmit` events you can use for accounting and policy enforcement.
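The decision logic behind the tool gate is small no matter which runtime hosts it. Below is a framework-agnostic sketch that reuses the hypothetical `TOOL_GATE_CONTRACT` from Step 2; wiring it into a `PreToolUse` hook or a `@tool_input_guardrail` function is left to your SDK, and the constraint keys are assumptions rather than a standard schema.

```python
from typing import Any

def gate_tool_call(tool: str, args: dict[str, Any], contract: dict[str, Any]) -> str:
    """Return "allow", "ask" (route to HITL), or "deny" for a proposed tool call."""
    spec = contract.get(tool)
    if spec is None or not spec["allow"]:
        return "deny"                      # unknown or unapproved tool
    for field, rule in spec["constraints"].items():
        value = args.get(field)
        if "max_len" in rule and (not isinstance(value, str) or len(value) > rule["max_len"]):
            return "deny"                  # missing, wrong-typed, or oversized argument
        if "one_of" in rule and value not in rule["one_of"]:
            return "deny"                  # out-of-range enum value
    if spec["hitl"]:
        return "ask"                       # high-impact tool: require human approval
    return "allow"
```

If you run several gates on the same call, take the most restrictive verdict; that reproduces the deny > defer > ask > allow precedence described above.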
For each layer, your context must specify:
- What it receives (raw input, parsed tool call, generated output, lifecycle event)
- What it returns (verdict, rewritten input, blocking exception, log entry)
- What it must NOT do (mutate state outside its scope, swallow errors silently)
- How to handle classifier failure (fail-closed for high-stakes tools, fail-open with alert for low-stakes paths)
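The fail-behavior line is the one most teams get backwards, so here is a minimal sketch of fail-closed versus fail-open. `run_classifier` is a stand-in for whatever check you actually call (a Llama Guard request, a prompt-injection scorer, and so on).

```python
import logging
from typing import Callable

def checked(run_classifier: Callable[[str], bool], payload: str, *, fail_closed: bool) -> bool:
    """Return True if the payload may proceed; False if it is blocked."""
    try:
        return run_classifier(payload)     # True = passes the check, False = blocked
    except Exception:
        logging.exception("guardrail classifier failed")
        if fail_closed:
            return False                   # high-impact path: block when the check errors
        return True                        # low-stakes path: alert and continue
```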
Compatibility notes:
- NeMo Guardrails Colang 2.0: Examples written against Colang 1.0 will not run unchanged on v0.21.0. Update flow syntax before copying older configs.
- Claude Agent SDK Python `SessionStart`/`SessionEnd`: Not exposed as Python callbacks. Wire them as shell-command hooks via `setting_sources=["project"]` instead.
- `permissionDecision: "defer"`: TypeScript SDK only. The Python SDK supports `allow`, `deny`, and `ask`.
- OpenAI Agents SDK guardrail scope: Input guardrails fire only on the first agent in a chain; output guardrails fire only on the agent producing the final output. They do not run between every handoff.
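To make the output-layer step concrete before moving to validation, here is a self-hosted check sketched with the established Llama Guard chat-template pattern in transformers. It is shown with the text-only Llama Guard 3 checkpoint because its loading path is well documented; swapping in Llama Guard 4 may require the processor-based multimodal path from its model card, and the generation settings here are assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Output-layer check using the text-only Llama Guard chat-template pattern.
MODEL_ID = "meta-llama/Llama-Guard-3-8B"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto")

def output_is_safe(user_message: str, agent_reply: str) -> bool:
    """Score the agent's reply; the classifier answers 'safe' or 'unsafe' plus hazard codes."""
    chat = [
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": agent_reply},
    ]
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
    out = model.generate(input_ids=input_ids, max_new_tokens=32, pad_token_id=tokenizer.eos_token_id)
    verdict = tokenizer.decode(out[0][input_ids.shape[-1]:], skip_special_tokens=True).strip()
    return not verdict.startswith("unsafe")
```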

Step 4: Prove Each Layer Catches What It Should
A guardrail without a test case is decoration. For each layer, write at least one input that should trigger it and one input that should pass through cleanly.
Validation checklist:
- Tool gate test. Send a message that asks the agent to call a denied tool. Failure looks like: the tool executes anyway, or the deny is silent and the agent reports success.
- Tool argument test. Send a message that asks for a valid tool with out-of-range arguments. Failure looks like: arguments are coerced silently, or the call goes through with the bad value.
- Input classifier test. Send a known prompt-injection payload — something from the Llama Prompt Guard test set or your own red-team library. Failure looks like: the payload reaches the model and the model follows it.
- Output classifier test. Force the model to generate a response in a blocked category (using a test-only prompt). Failure looks like: the response reaches the user, or the fallback is missing.
- HITL gate test. Trigger a high-impact tool call. Failure looks like: the call executes without an approval prompt, or the approval is auto-confirmed.
- Budget cap test. Run a session designed to exceed your token or dollar limit. Failure looks like: the cap is logged but not enforced.
If a check fails, fix the spec for that layer before moving on. Adding a second classifier to compensate for a leaky first classifier is how you end up with four layers and zero coverage.
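As a starting point, the tool gate checks can be written against the hypothetical `gate_tool_call` and `TOOL_GATE_CONTRACT` sketches from the earlier steps. Real tests should drive the same inputs through your actual runtime so the whole hook path is exercised, not just the helper function.

```python
# pytest-style checks: one input that should trigger each guardrail, one that should pass.
def test_denied_tool_is_blocked():
    assert gate_tool_call("issue_refund", {}, TOOL_GATE_CONTRACT) == "deny"

def test_unknown_tool_is_blocked():
    assert gate_tool_call("delete_database", {}, TOOL_GATE_CONTRACT) == "deny"

def test_out_of_range_argument_is_blocked():
    assert gate_tool_call("create_ticket", {"priority": "urgent"}, TOOL_GATE_CONTRACT) == "deny"

def test_high_impact_tool_routes_to_hitl():
    assert gate_tool_call("send_email", {"to": "billing@example.com"}, TOOL_GATE_CONTRACT) == "ask"

def test_clean_call_passes_through():
    assert gate_tool_call("search_docs", {"query": "refund policy"}, TOOL_GATE_CONTRACT) == "allow"
```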
Common Pitfalls
| What You Did | Why It Failed | The Fix |
|---|---|---|
| Treated NeMo Guardrails as a tool allowlist | NeMo does not ship a feature called allowlist. The pattern is built from input rails plus Colang dialog flows that match on tool names | Specify the allowlist in a Colang flow that runs before tool dispatch, then add a deny-by-default fallback |
| Relied on the model to refuse | Models are trained to be helpful. They will be helpful with a poisoned input | Move the refusal into the tool gate or output classifier, not the system prompt |
| Wired one mega-guardrail | A single classifier checking input, output, and tool args has one failure mode for everything | Split into per-layer specs with their own thresholds and test cases |
| Set every check to fail-open | A classifier that errors becomes a no-op. Production traffic looks safer than it is | Fail-closed on high-impact tools, alert-and-route-to-HITL on output, log-and-continue on input only when the model has no tools |
| Skipped the audit hook | First incident postmortem requires reconstructing what the agent saw and called. Without PostToolUse logs you are guessing | Wire PostToolUse (or its equivalent) on day one. Store tool name, arguments, return value, and the prompt that produced the call |
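The last row of the table is the cheapest to fix on day one. A minimal audit record can be a single JSON line per tool call; the field names below are illustrative, not a required schema.

```python
import json
import time

def audit_tool_call(tool: str, args: dict, result: str, prompt: str, path: str = "audit.jsonl") -> None:
    """Append one JSON line per tool call: enough to reconstruct what the agent saw and did."""
    record = {"ts": time.time(), "tool": tool, "args": args, "result": result, "prompt": prompt}
    with open(path, "a") as f:
        f.write(json.dumps(record, default=str) + "\n")
```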
Pro Tip
Treat your guardrail config as a separate artifact from your agent prompts. Prompts change weekly. Guardrails should change with a review. When the two live in the same file, the guardrail gets edited every time someone tweaks tone — and an “improvement” to the system prompt becomes a silent loosening of the safety contract.
Frequently Asked Questions
Q: How to build agent guardrails step by step in 2026?
A: Start with the tool gate, not the input filter — that is where most production incidents originate. Then add an output classifier (Llama Guard 4 is the current default), then input rails for prompt-injection detection, then lifecycle hooks for budgets and audit. Watch out for fail-open defaults: a classifier that times out and silently passes the request is the most common reason a “fully guardrailed” agent still ships an incident.
Q: How to use NeMo Guardrails for tool allowlists and HITL gates?
A: NeMo does not have a built-in allowlist checkbox. You build the pattern with tool input rails plus a Colang 2.0 flow that matches on tool name and either allows the call, denies it, or routes it through a HITL confirmation flow. Use the IORails engine when you stack multiple classifiers — it runs them in parallel so latency does not multiply with each rail you add.
Q: How to add input and output guardrails with the OpenAI Agents SDK?
A: Use the @input_guardrail and @output_guardrail decorators in openai-agents 0.17.0. Each decorated function returns a GuardrailFunctionOutput — set tripwire_triggered=True to halt the run with an InputGuardrailTripwireTriggered or OutputGuardrailTripwireTriggered exception. One scope gotcha: input guardrails fire only on the first agent in a handoff chain, output guardrails only on the last. For per-tool checks, use @tool_input_guardrail and @tool_output_guardrail instead.
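A condensed example of the input-guardrail side, adapted from the SDK's documented pattern. The refund check is a placeholder for a real classifier, and the agent name and instructions are invented for illustration.

```python
import asyncio

from agents import (
    Agent,
    GuardrailFunctionOutput,
    InputGuardrailTripwireTriggered,
    RunContextWrapper,
    Runner,
    input_guardrail,
)

@input_guardrail
async def block_refund_requests(
    ctx: RunContextWrapper[None], agent: Agent, user_input
) -> GuardrailFunctionOutput:
    flagged = "refund" in str(user_input).lower()  # placeholder for a real classifier call
    return GuardrailFunctionOutput(output_info={"flagged": flagged}, tripwire_triggered=flagged)

support_agent = Agent(
    name="Support agent",
    instructions="Answer billing questions. Never promise refunds.",
    input_guardrails=[block_refund_requests],
)

async def main() -> None:
    try:
        result = await Runner.run(support_agent, "I want a refund for last month")
        print(result.final_output)
    except InputGuardrailTripwireTriggered:
        print("Blocked by input guardrail")  # the tripwire halted the run before the model answered

if __name__ == "__main__":
    asyncio.run(main())
```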
Your Spec Artifact
By the end of this guide, you should have:
- A four-row trust boundary map — input, tool gate, output, runtime hooks — with the failure class each row catches
- A per-layer constraint list — classifiers, decisions, scope, fail behavior — that fits on a single page per layer
- A six-test validation checklist — tool gate, tool args, input classifier, output classifier, HITL gate, budget cap — each with a triggering input and a pass-through input
Your Implementation Prompt
Paste the prompt below into Claude Code, Cursor, or your AI coding tool of choice. Replace every bracketed placeholder with values from your own spec artifact. The prompt mirrors the four-layer decomposition from Step 1 and the build order from Step 3.
You are helping me wire agent guardrails for [agent name and purpose].
Runtime: [Claude Agent SDK Python | OpenAI Agents SDK 0.17.0 | LangGraph + NeMo Guardrails 0.21.0]
Allowed tools: [comma-separated tool list with one-line description each]
Tools requiring HITL approval: [subset of the above]
High-impact tools (fail-closed on classifier error): [subset of the above]
Build the layers in this order:
1. Tool gate. Implement [PreToolUse hook | @tool_input_guardrail | NeMo input rail with Colang flow] that:
- Allows only the tools listed above
- Validates arguments against [JSON schema or value-constraint description per tool]
- Routes [HITL tool list] to a permission prompt before execution
- Returns deny on unknown tools or invalid arguments
2. Output classifier. Wrap the final response with [Llama Guard 4 via PurpleLlama | @output_guardrail | NeMo output rail] that:
- Checks for [list of MLCommons categories that matter for this agent]
- Returns [exact fallback string the agent emits when blocked]
- Logs the original generation to [audit destination]
3. Input rails. Add [NeMo input rail | @input_guardrail] that runs:
- Llama Prompt Guard 2 on user input with threshold [number from your test set]
- PII redaction for [field list]
- Fail behavior: [block | redact | annotate-and-forward]
4. Runtime hooks. Wire [PostToolUse | session lifecycle event] that:
- Logs tool name, arguments, return value, latency to [audit destination]
- Enforces a per-session cap of [token budget] and [dollar budget]
- Triggers kill-switch when [specific condition]
For each layer, write one test that should trigger the guardrail and one that should pass through. Output the test cases as runnable code in [pytest | your test framework].
Ship It
You now have a four-layer trust map, a per-layer contract, and a build order that puts the highest-leverage layer first. The next time someone asks “are our agents safe?” you can point at the layers, the test cases, and the audit log — instead of pointing at a system prompt and hoping.