Skip to content

Ai-engineering Series

Agents to Evaluation - Measuring What Matters

Deep dive into agent evaluation: the three dimensions (task completion, process quality, safety), evaluation strategies, building test suites, and production monitoring

Why Agent Evaluation is Hard

Evaluating agents is fundamentally harder than evaluating RAG.

In RAG:

  • Input: query
  • Output: answer
  • Evaluation: Does the answer match expected? Is it grounded in retrieved docs?

In agents:

  • Input: task
  • Output: action sequence + final result
  • Evaluation: Did it complete the task? Did it take reasonable steps? Did it NOT do anything harmful?

Agent evaluation has THREE dimensions:

  1. Task completion: Did it achieve the goal?
  2. Process quality: Did it take a reasonable path?
  3. Safety: Did it avoid harmful actions?

What Goes Wrong Without This:

Evaluation Failure Patterns

The Three Dimensions

Three-Axis Evaluation

An agent that completes tasks but leaks data is dangerous. An agent that’s safe but takes 10 minutes per request is useless. An agent that’s fast and safe but wrong is worthless.

Measure all three.


Task Completion Evaluation

Did the agent achieve the goal?

Task Completion Types

How to measure:

Task typeEvaluation methodAutomation
Binary (fact lookup)Compare to ground truthAutomated
Graded (code, structured)Test suite, schema validationSemi-automated
Open-ended (creative, research)Human review or LLM-as-judgeManual/expensive

The ground truth problem: For many agent tasks, “correct” is subjective. Is a 500-word summary better than a 200-word summary? Depends on context. Build evaluation criteria BEFORE building the agent.


Trajectory Evaluation

Was the agent’s process reasonable?

Trajectory: The Path The Agent Took

Trajectory metrics:

MetricWhat it measuresTarget
Step countEfficiencyTask-dependent minimum
Tool misuse rateSelection accuracy0%
Retry rateError recoveryLow
Loop detectionGetting stuckNever
Unnecessary actionsScope creep0

How to evaluate trajectories:

  1. Golden trajectories: Define expected action sequences for test tasks
  2. Trajectory comparison: Compare actual vs expected (allow variations)
  3. LLM-as-judge: Have another model evaluate if trajectory was reasonable
  4. Human review: Sample production trajectories for manual review

Safety Evaluation

Did the agent stay within bounds?

Safety Dimensions

Red-team evaluation:

Create adversarial test cases designed to break safety:

Adversarial Test Examples

Safety evaluation is never “done.” New attacks emerge. Update red-team tests continuously.


Evaluation Strategies

Different strategies for different needs:

Evaluation Strategy Matrix

Recommended combination:

  • Unit tests for regression prevention
  • Trajectory tests for efficiency monitoring
  • Fuzzing for edge case discovery
  • LLM-as-judge for scale with human review for calibration

Building an Evaluation Suite

Start with these test categories:

Evaluation Suite Structure

Run cadence:

  • Golden set: Every deployment
  • Edge cases: Weekly
  • Adversarial: Before major releases
  • Regression: Every deployment (these are bugs that must not return)

Production Monitoring

Evaluation doesn’t end at deployment:

Production Monitoring

Alert on: Success rate drop, latency spike, cost spike, safety threshold breach.


Honest Truths About Agent Evaluation

Honest Truths

The practical takeaway: Design for observability from day 1. You will debug in production. Make it possible.


Common Misconceptions

”If the agent completes the task, it’s working”

HOW it completes matters. An agent that succeeds in 50 steps when 5 would do is wasting money. An agent that succeeds by accessing data it shouldn’t is a security risk.

Evaluate task completion AND trajectory AND safety. All three.

”I’ll test a few examples and ship”

Agents are non-deterministic. The same input can produce different trajectories. A few tests might miss failure modes that appear 1% of the time—which means daily in production.

You need statistical confidence. Run each test multiple times. Budget for evaluation.

”LLM-as-judge solves evaluation”

LLM judges have biases. They favor longer responses. They miss subtle errors. They can be fooled by confident-sounding failures.

LLM-as-judge is A tool, not THE solution. Combine with human review on samples.


Key Takeaways

Key Takeaways

Verify Your Understanding

Before considering yourself capable of agent evaluation:

Your agent has three tools: [search_docs, query_api, respond_to_user].

  • Design 3 test cases that test task completion
  • Design 2 test cases that test trajectory quality
  • Design 2 test cases that test safety boundaries

Agent succeeds on 95% of your test set. Is it ready for production?

  • What else do you need to know?
  • What could go wrong that your test set doesn’t cover?

You’re using LLM-as-judge to evaluate your agent.

  • List 3 ways LLM-as-judge could give wrong evaluations
  • How would you validate that your LLM judge is trustworthy?

Your agent costs $0.50 per task and runs 10,000 tasks/day.

  • How much is evaluation costing?
  • If each eval run costs $0.10, how many times can you run your test suite monthly?

Explain why monitoring is MORE important than pre-deployment testing for agents.

  • What can monitoring catch that testing can’t?
  • What 5 metrics would you track from day 1?

Series Complete

You’ve completed the AI Engineering Fundamentals series!

What you’ve learned:

  1. Text → Tokens: How text becomes processable units
  2. Tokens → Embeddings: How meaning becomes vectors
  3. Embeddings → Attention: How tokens relate to each other
  4. Attention → Generation: How models produce text
  5. Generation → Retrieval: How to ground LLMs in facts
  6. Retrieval → RAG: The complete retrieval-augmented generation pipeline
  7. RAG → Agents: From single-shot Q&A to multi-step reasoning
  8. Agents → Evaluation: How to measure what matters

What’s next:

  • Build production systems with this foundation
  • Go deeper on specific topics (fine-tuning, reasoning models, memory systems)
  • Apply to real problems in your domain

Concepts covered in this article