Agent Evaluation And Testing
Also known as: agent evals, agent testing, AI agent evaluation
- Agent evaluation and testing measures whether AI agents — systems that plan, call tools, and produce multi-step outputs — perform correctly. It scores both outcome (did the task succeed?) and trajectory (were the right tool calls made in the right order?).
Agent evaluation and testing is the practice of measuring whether AI agents perform reliably across multi-step tasks, scoring both the final outcome and the trajectory of tool calls used to reach it.
What It Is
According to Google Cloud Docs, a trajectory is the sequence of tool calls an agent took to reach the final response. That single definition explains why agent evaluation differs from evaluating a chatbot: there is now a path to grade, not just an answer. An agent that books your flight by calling the wrong API in the wrong order — but happens to land on the right itinerary — has produced a correct outcome through a broken trajectory. Evaluation has to catch both shapes of failure.
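To make the two failure shapes concrete, here is a minimal sketch in which the outcome check passes while the trajectory check fails; the tool names and the booking task are hypothetical illustrations, not any particular agent's API:

```python
# Hypothetical flight-booking run: the final itinerary is correct, but the
# agent reached it through the wrong tool calls in the wrong order.
expected_trajectory = ["search_flights", "check_availability", "book_flight"]
actual_trajectory = ["book_flight", "search_flights", "book_flight"]

expected_outcome = {"itinerary": "SFO->JFK 2026-06-01", "status": "booked"}
actual_outcome = {"itinerary": "SFO->JFK 2026-06-01", "status": "booked"}

outcome_passes = actual_outcome == expected_outcome            # True: right final state
trajectory_passes = actual_trajectory == expected_trajectory   # False: broken path

print(f"outcome: {'PASS' if outcome_passes else 'FAIL'}")       # outcome: PASS
print(f"trajectory: {'PASS' if trajectory_passes else 'FAIL'}")  # trajectory: FAIL
```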
The discipline emerged because production teams kept hitting the same wall. A model that answers questions accurately in isolation behaves unpredictably once it can plan, call tools, and chain steps. According to the LangChain State of Agent Engineering, quality has become the leading deployment barrier for organizations running agents in production. Teams needed a measurement layer deeper than “did the response look right.”
Modern agent evaluation works on two complementary layers. The outcome layer scores final state: did the database get updated, did the email send, did the answer match the gold reference? The trajectory layer scores the path: which tools were called, in what order, with what arguments, and was that the right way to get there? According to Anthropic Engineering, an evaluation harness orbits around a small set of primitives — Task, Trial, Grader, Transcript, Outcome — and you grade trajectories by feeding the trace through automated checks, LLM-as-judge graders, or both.
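One way to picture those primitives is the minimal sketch below. The class names follow the list above, but every field and the grading loop are illustrative assumptions, not Anthropic's harness:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Transcript:
    """Everything the agent did during one run: tool calls and messages."""
    tool_calls: list[str]
    messages: list[str]

@dataclass
class Outcome:
    """The final state the run produced, e.g. a database row or an answer."""
    final_answer: str

@dataclass
class Task:
    """One test case: the prompt plus what a correct run looks like."""
    prompt: str
    expected_tool_calls: list[str]
    expected_answer: str

@dataclass
class Trial:
    """A single execution of a Task, holding its transcript and outcome."""
    task: Task
    transcript: Transcript
    outcome: Outcome

# A Grader is any function that scores a Trial. It can be a deterministic
# check like the two below, or a wrapper that calls an LLM-as-judge.
Grader = Callable[[Trial], bool]

def outcome_grader(trial: Trial) -> bool:
    # Outcome layer: does the final answer match the gold reference?
    return trial.outcome.final_answer == trial.task.expected_answer

def trajectory_grader(trial: Trial) -> bool:
    # Trajectory layer: were the expected tools called in the expected order?
    return trial.transcript.tool_calls == trial.task.expected_tool_calls

def grade(trial: Trial, graders: list[Grader]) -> dict[str, bool]:
    """Run every grader over the same trial and report per-grader results."""
    return {grader.__name__: grader(trial) for grader in graders}
```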
Trajectory grading is rarely binary. According to LangChain’s GitHub repository, the AgentEvals library exposes four match modes: strict (exact ordered sequence), unordered (same tools, any order), superset (every expected call appears in the actual run, extra calls allowed), and subset (the actual run makes no calls beyond the expected set). Different agents need different strictness: a research agent benefits from unordered matching, while a payment agent needs strict ordering. Picking the right match mode is half the design work.
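The four modes are easy to see in plain Python. The sketch below is not the AgentEvals API, just standalone functions showing what each mode accepts when a trajectory is reduced to a list of tool names:

```python
from collections import Counter

def strict_match(actual: list[str], expected: list[str]) -> bool:
    # Exact same tool calls in exact same order.
    return actual == expected

def unordered_match(actual: list[str], expected: list[str]) -> bool:
    # Same tool calls (including duplicates), order ignored.
    return Counter(actual) == Counter(expected)

def superset_match(actual: list[str], expected: list[str]) -> bool:
    # Every expected call appears in the run; extra calls are tolerated.
    return not (Counter(expected) - Counter(actual))

def subset_match(actual: list[str], expected: list[str]) -> bool:
    # The run makes no calls beyond the expected set; missing calls are tolerated.
    return not (Counter(actual) - Counter(expected))

expected = ["search_flights", "check_availability", "book_flight"]
actual = ["check_availability", "search_flights", "book_flight"]

print(strict_match(actual, expected))     # False: order differs
print(unordered_match(actual, expected))  # True: same calls, any order
```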
How It’s Used in Practice
The most common entry point is connecting an agent framework to an evaluation platform during development. A product team building a customer-support agent on top of Claude or GPT will instrument runs through LangSmith, Vertex AI’s evaluation service, Arize Phoenix, or Braintrust. The platform captures every tool call, system prompt, and intermediate response as a trace. From there, the team writes a fixed dataset of test conversations, runs the agent over them in CI, and inspects which outcomes regressed and which trajectories drifted between releases.
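In sketch form, that CI gate can be as small as a parametrized test. The run_agent() entry point and the dataset fields below are hypothetical stand-ins for whatever your framework and evaluation platform actually provide:

```python
import json
import pytest

# Assumed project-local entry point: runs the agent on one prompt and returns
# the final answer plus the ordered list of tool calls recorded in the trace.
from my_agent import run_agent

with open("eval_dataset.json") as f:
    # Each case: {"prompt": ..., "expected_answer": ..., "expected_tools": [...]}
    CASES = json.load(f)

@pytest.mark.parametrize("case", CASES, ids=lambda c: c["prompt"][:40])
def test_agent_case(case):
    answer, tool_calls = run_agent(case["prompt"])

    # Outcome layer: the final answer matches the gold reference.
    assert answer == case["expected_answer"]

    # Trajectory layer: same tools in the same order (strict mode).
    assert tool_calls == case["expected_tools"]
```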
The second common scenario is production monitoring. Once the agent ships, traces from real user sessions stream into the same evaluator, where trajectory rules and LLM-as-judge graders flag suspicious patterns — repeated tool retries, hallucinated function arguments, broken plan-and-execute loops. Evaluation becomes a continuous process rather than a one-time gate.
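A first version of that rule layer does not need a platform at all. The sketch below flags two of the patterns mentioned above, back-to-back retries and arguments a tool does not accept, against a generic trace format; the record fields and the known-tool registry are assumptions about your own logging, not any vendor's schema:

```python
def find_issues(trace: list[dict], known_tools: dict[str, set[str]],
                max_retries: int = 2) -> list[str]:
    """Scan one production trace (a list of tool-call records) for suspicious patterns."""
    issues = []
    consecutive = 0
    previous_tool = None

    for call in trace:
        tool, args = call["tool"], call["args"]

        # Rule 1: hallucinated tools, or arguments the tool does not accept.
        if tool not in known_tools:
            issues.append(f"unknown tool: {tool}")
        else:
            bad_args = set(args) - known_tools[tool]
            if bad_args:
                issues.append(f"{tool}: unexpected arguments {sorted(bad_args)}")

        # Rule 2: the same tool retried back-to-back more than max_retries times.
        consecutive = consecutive + 1 if tool == previous_tool else 1
        if consecutive > max_retries:
            issues.append(f"{tool}: retried {consecutive} times in a row")
        previous_tool = tool

    return issues

# Example: two back-to-back retries of lookup_order plus a made-up argument.
trace = [
    {"tool": "lookup_order", "args": {"order_id": "A1"}},
    {"tool": "lookup_order", "args": {"order_id": "A1"}},
    {"tool": "lookup_order", "args": {"order_id": "A1", "force": True}},
    {"tool": "refund", "args": {"order_id": "A1"}},
]
known_tools = {"lookup_order": {"order_id"}, "refund": {"order_id", "amount"}}
print(find_issues(trace, known_tools))
```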
Pro Tip: Build your trajectory dataset from real production traces before writing synthetic ones. Synthetic test cases capture what you imagine the agent will face; production traces capture what it actually faces, including the edge cases you never thought to write.
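One way to bootstrap that dataset, assuming your tracing tool can export runs as JSON with per-call tool names (the field names here are hypothetical and should match the schema your CI tests read):

```python
import json

def traces_to_dataset(export_path: str, dataset_path: str) -> int:
    """Turn exported production traces into trajectory test cases."""
    with open(export_path) as f:
        traces = json.load(f)

    cases = []
    for trace in traces:
        cases.append({
            "prompt": trace["input"],
            # The observed run becomes the reference trajectory; review each
            # case by hand before trusting it as a gold label.
            "expected_tools": [call["tool"] for call in trace["tool_calls"]],
            "expected_answer": trace["final_output"],
        })

    with open(dataset_path, "w") as f:
        json.dump(cases, f, indent=2)
    return len(cases)
```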
When to Use / When Not
| Scenario | Use | Avoid |
|---|---|---|
| Multi-step agent calling two or more tools per task | ✅ | |
| Stateless single-turn chatbot with no tool use | | ❌ |
| Customer-facing agent where wrong path equals compliance risk | ✅ | |
| Quick internal prototype shown to stakeholders next week | | ❌ |
| Agent retrained or prompt-tuned every sprint | ✅ | |
| One-off script run once by a single engineer | | ❌ |
Common Misconception
Myth: If the agent gets the right final answer, the evaluation passes. Reality: Two agents can return the same answer through completely different trajectories — one efficient, one wasteful, one safe, one risky. Outcome scoring rewards both equally, which is why trajectory evaluation exists. A correct answer reached through three retries, a hallucinated argument, and a fallback to memorized text is not the same as a correct answer reached through the right tool on the first try.
One Sentence to Remember
Treat agent evaluation as two questions stacked on top of each other — did the agent get there, and did it take the right path — and you stop confusing lucky outcomes with reliable systems; pick a trajectory match mode that fits the agent’s job before you write your first test.
FAQ
Q: What’s the difference between agent evaluation and LLM evaluation? A: LLM evaluation grades single-turn outputs from the model. Agent evaluation grades multi-step runs that include planning, tool calls, and intermediate decisions, where the path matters as much as the final answer.
Q: Do I need both outcome and trajectory metrics? A: Yes. Outcome metrics tell you whether the agent succeeded; trajectory metrics tell you whether the path was correct, efficient, and safe. An agent can pass one and fail the other in either direction.
Q: Which match mode should I use for trajectory evaluation? A: Use strict ordering for high-stakes agents like payments or security, unordered for research and information-gathering, superset when every expected call must appear (extra calls are tolerated), and subset when any extra call should count as drift.
Sources
- Google Cloud Docs: Evaluate Gen AI agents — Vertex AI documentation - Reference for trajectory definition and match metrics.
- Anthropic Engineering: Demystifying evals for AI agents - Eval harness primitives and grading patterns.
Expert Takes
Not behavior. Statistics. Agent evaluation works because we treat each trajectory as a sample from a distribution — many runs of the same task expose probabilistic failure modes that single executions hide. Outcome metrics tell you the agent reached the right state; trajectory metrics tell you whether the path was efficient and correct. Both matter because LLM agents are stochastic systems, not deterministic functions, and one passing run does not guarantee the next will look anything like it.
Most agent failures I diagnose are specification failures, not model failures. The eval shows the agent picked the wrong tool — and the cause is almost always that the system prompt left the choice ambiguous. Write the trajectory you expect, run it as a fixture, and treat trajectory mismatches as missing context. Fix the spec, the test passes. Skip that step and you’ll keep tuning prompts that never told the agent what “done” actually looks like.
Quality is now the gating barrier to agent deployment, not capability. Teams that ship are the ones who built evaluation infrastructure before scaling features. The market is splitting cleanly: vendors who expose trajectory data win enterprise contracts, vendors who only return final answers lose them. You’re either measuring agent paths or you’re flying blind in production. The window for treating agent evals as optional just closed.
Who writes the test cases that decide an agent is safe to deploy? The same team that built it. The same vendor that profits from it shipping. Trajectory evaluation tells you whether the agent followed the rules you wrote — but it cannot tell you whether the rules were the right ones to write. What happens when the failures we never tested for are the ones that matter most?