Agents to Evaluation - Measuring What Matters | Intentional / Deliberate / Engineering

Why Agent Evaluation is Hard

Evaluating agents is fundamentally harder than evaluating RAG.

In RAG:

Input: query
Output: answer
Evaluation: Does the answer match expected? Is it grounded in retrieved docs?

In agents:

Input: task
Output: action sequence + final result
Evaluation: Did it complete the task? Did it take reasonable steps? Did it NOT do anything harmful?

Agent evaluation has THREE dimensions:

Task completion: Did it achieve the goal?
Process quality: Did it take a reasonable path?
Safety: Did it avoid harmful actions?

What Goes Wrong Without This:

Evaluation Failure Patterns

The Three Dimensions

Three-Axis Evaluation

An agent that completes tasks but leaks data is dangerous. An agent that’s safe but takes 10 minutes per request is useless. An agent that’s fast and safe but wrong is worthless.

Measure all three.

Task Completion Evaluation

Did the agent achieve the goal?

Task Completion Types

How to measure:

Task type	Evaluation method	Automation
Binary (fact lookup)	Compare to ground truth	Automated
Graded (code, structured)	Test suite, schema validation	Semi-automated
Open-ended (creative, research)	Human review or LLM-as-judge	Manual/expensive

The ground truth problem: For many agent tasks, “correct” is subjective. Is a 500-word summary better than a 200-word summary? Depends on context. Build evaluation criteria BEFORE building the agent.

Trajectory Evaluation

Was the agent’s process reasonable?

Trajectory: The Path The Agent Took

Trajectory metrics:

Metric	What it measures	Target
Step count	Efficiency	Task-dependent minimum
Tool misuse rate	Selection accuracy	0%
Retry rate	Error recovery	Low
Loop detection	Getting stuck	Never
Unnecessary actions	Scope creep	0

How to evaluate trajectories:

Golden trajectories: Define expected action sequences for test tasks
Trajectory comparison: Compare actual vs expected (allow variations)
LLM-as-judge: Have another model evaluate if trajectory was reasonable
Human review: Sample production trajectories for manual review

Safety Evaluation

Did the agent stay within bounds?

Safety Dimensions

Red-team evaluation:

Create adversarial test cases designed to break safety:

Adversarial Test Examples

Safety evaluation is never “done.” New attacks emerge. Update red-team tests continuously.

Evaluation Strategies

Different strategies for different needs:

Evaluation Strategy Matrix

Recommended combination:

Unit tests for regression prevention
Trajectory tests for efficiency monitoring
Fuzzing for edge case discovery
LLM-as-judge for scale with human review for calibration

Building an Evaluation Suite

Start with these test categories:

Evaluation Suite Structure

Run cadence:

Golden set: Every deployment
Edge cases: Weekly
Adversarial: Before major releases
Regression: Every deployment (these are bugs that must not return)

Production Monitoring

Evaluation doesn’t end at deployment:

Production Monitoring

Alert on: Success rate drop, latency spike, cost spike, safety threshold breach.

Honest Truths About Agent Evaluation

Honest Truths

The practical takeaway: Design for observability from day 1. You will debug in production. Make it possible.

Common Misconceptions

”If the agent completes the task, it’s working”

HOW it completes matters. An agent that succeeds in 50 steps when 5 would do is wasting money. An agent that succeeds by accessing data it shouldn’t is a security risk.

Evaluate task completion AND trajectory AND safety. All three.

”I’ll test a few examples and ship”

Agents are non-deterministic. The same input can produce different trajectories. A few tests might miss failure modes that appear 1% of the time—which means daily in production.

You need statistical confidence. Run each test multiple times. Budget for evaluation.

”LLM-as-judge solves evaluation”

LLM judges have biases. They favor longer responses. They miss subtle errors. They can be fooled by confident-sounding failures.

LLM-as-judge is A tool, not THE solution. Combine with human review on samples.

Key Takeaways

Verify Your Understanding

Before considering yourself capable of agent evaluation:

Your agent has three tools: [search_docs, query_api, respond_to_user].

Design 3 test cases that test task completion
Design 2 test cases that test trajectory quality
Design 2 test cases that test safety boundaries

Agent succeeds on 95% of your test set. Is it ready for production?

What else do you need to know?
What could go wrong that your test set doesn’t cover?

You’re using LLM-as-judge to evaluate your agent.

List 3 ways LLM-as-judge could give wrong evaluations
How would you validate that your LLM judge is trustworthy?

Your agent costs $0.50 per task and runs 10,000 tasks/day.

How much is evaluation costing?
If each eval run costs $0.10, how many times can you run your test suite monthly?

Explain why monitoring is MORE important than pre-deployment testing for agents.

What can monitoring catch that testing can’t?
What 5 metrics would you track from day 1?

Series Complete

You’ve completed the AI Engineering Fundamentals series!

What you’ve learned:

Text → Tokens: How text becomes processable units
Tokens → Embeddings: How meaning becomes vectors
Embeddings → Attention: How tokens relate to each other
Attention → Generation: How models produce text
Generation → Retrieval: How to ground LLMs in facts
Retrieval → RAG: The complete retrieval-augmented generation pipeline
RAG → Agents: From single-shot Q&A to multi-step reasoning
Agents → Evaluation: How to measure what matters

What’s next:

Build production systems with this foundation
Go deeper on specific topics (fine-tuning, reasoning models, memory systems)
Apply to real problems in your domain

Agents to Evaluation - Measuring What Matters

Why Agent Evaluation is Hard

The Three Dimensions

Task Completion Evaluation

Trajectory Evaluation

Safety Evaluation

Evaluation Strategies

Building an Evaluation Suite

Production Monitoring

Honest Truths About Agent Evaluation

Common Misconceptions

”If the agent completes the task, it’s working”

”I’ll test a few examples and ship”

”LLM-as-judge solves evaluation”

Key Takeaways

Verify Your Understanding

Series Complete

Concepts covered in this article

Table of Contents

Ai-engineering Series

Why Agent Evaluation is Hard

The Three Dimensions

Task Completion Evaluation

Trajectory Evaluation

Safety Evaluation

Evaluation Strategies

Building an Evaluation Suite

Production Monitoring

Honest Truths About Agent Evaluation

Common Misconceptions

”If the agent completes the task, it’s working”

”I’ll test a few examples and ship”

”LLM-as-judge solves evaluation”

Key Takeaways

Verify Your Understanding

Series Complete

Concepts covered in this article

Table of Contents