The loop body is the easy part. Termination is the part that determines whether your agent ships or pages oncall at 3am.
Building On Previous Knowledge
The previous chapter taught the RAG debugging decision tree: every wrong RAG answer is one of three failure classes. RAG is one-shot — retrieve once, generate once. That’s enough when the user query maps cleanly to a passage. It’s not enough when the user wants an action: refund this order, schedule that meeting, query and then update three systems.
This chapter promotes the data path one level: from “retrieve a document” to “decide which tool to call, observe its output, decide what to do next”. The mechanism is the agent loop; the canonical academic reference is Yao et al. 2022, ReAct [yao2022]; the production reality is messier than the paper suggests.
Where most agent tutorials stop: they show the Thought → Action → Observation triple, demo a happy-path single iteration, and ship a LangChain snippet. They never tell you what terminates the loop. The trap nobody names: a ReAct loop without explicit stop conditions silently runs until you blow your token budget. This chapter shows the three stop conditions every production agent ships with. The bridge into the production-agents series follows — checkpointing, idempotency, and cost-control turn the loop body into something that survives users.
Takeaway: agents are the same data path as RAG with one promotion — the model now picks the next action in a loop, not just the answer to one prompt. The hard engineering is termination, not generation.
Why This Matters
RAG answers questions. Agents solve problems.
When a user asks “What’s the status of order #12345?”, RAG retrieves a document. But what if answering requires:
- Querying an order database
- Checking shipping status from an API
- Calculating estimated delivery based on location
- Composing a response with all that information
RAG can’t do this. RAG retrieves static documents. Agents take actions.
If you try to build multi-step systems with RAG patterns, you’ll create brittle pipelines that break on variation. Understanding the agent mental model lets you build flexible systems that adapt.
Takeaway: agents are warranted when the strategy depends on what you find — not when “multi-step” describes the workflow. Static multi-step pipelines beat agents on reliability and cost.
What Goes Wrong Without This:
- Symptom: Your "smart assistant" can only answer questions from documents. Users ask for actions, it apologizes.
  Cause: You built RAG when you needed an agent. RAG retrieves information. It doesn't take action or call APIs.
- Symptom: Your multi-step pipeline is 500 lines of if/else handling every edge case. Adding a new capability requires 2 weeks.
  Cause: You hardcoded the reasoning that should be delegated to the LLM. Every variation is a code branch.
- Symptom: Your agent attempts an action, fails, and doesn't recover. It returns "Error occurred" to the user.
  Cause: You built a pipeline, not an agent. Pipelines don't adapt. Agents observe results and adjust.
Pipelines vs Agents
There are two ways to build multi-step AI systems:
PIPELINE (Code decides):

  Input → Step 1 → Step 2 → Step 3 → Output
            ↓        ↓        ↓
         [fixed]  [fixed]  [fixed]

  The code determines what happens at each step.
  Each branch is explicitly written. Predictable, but rigid.

AGENT (Model decides):

            ┌───────────────────────────┐
            │                           │
  Input →   │  Observe current state    │◄────────┐
            │             ↓             │         │
            │  Think: what next?        │         │
            │             ↓             │         │
            │  Act: execute decision    │─────────┘
            └─────────────┬─────────────┘
                          ↓
                 Output (when done)

  The model determines what happens at each step.
  Flexible, but less predictable.
The key question: Who decides the next step—your code or the model?
- Pipeline: You enumerate all paths. Reliable for known scenarios. Fails on novel scenarios.
- Agent: Model reasons about what to do. Handles variation. Can make mistakes.
Neither is better. They solve different problems.
Takeaway: the deciding question is who picks the next step — your code (pipeline) or the model (agent). Pipelines are predictable but rigid; agents are flexible but non-deterministic. Pick by failure-mode tolerance, not by hype.
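The difference can be sketched in a few lines of Python. Everything below is hypothetical: the three step functions are stubs, and `fake_model_policy` stands in for an LLM call so the example runs offline. The point is only where the step order comes from, either from a hardcoded tuple or from a decision made at run time.

```python
from typing import Callable

# Hypothetical stub steps; real ones would call a database or an API.
def lookup_order(state: dict) -> dict:
    return {**state, "order": {"id": state["order_id"], "status": "shipped"}}

def check_shipping(state: dict) -> dict:
    return {**state, "eta": "Dec 5"}

def compose_reply(state: dict) -> dict:
    eta = state.get("eta", "unknown")
    return {**state, "reply": f"Order {state['order_id']} arrives {eta}"}

STEPS: dict[str, Callable[[dict], dict]] = {
    "lookup_order": lookup_order,
    "check_shipping": check_shipping,
    "compose_reply": compose_reply,
}

def run_pipeline(state: dict) -> dict:
    """Code decides: the step order is fixed when the code is written."""
    for name in ("lookup_order", "check_shipping", "compose_reply"):
        state = STEPS[name](state)
    return state

def fake_model_policy(state: dict) -> str:
    """Stand-in for an LLM call: picks the next step from the current state."""
    if "order" not in state:
        return "lookup_order"
    if state["order"]["status"] == "shipped" and "eta" not in state:
        return "check_shipping"
    return "compose_reply"

def run_agent_sketch(state: dict, max_steps: int = 5) -> dict:
    """Model decides: the step order is chosen at run time, under a budget."""
    for _ in range(max_steps):
        state = STEPS[fake_model_policy(state)](state)
        if "reply" in state:
            break
    return state
```

Both functions produce the same reply for a shipped order. They diverge when the order is not shipped: the pipeline still calls `check_shipping`, while the policy skips it.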
The Agent Loop
An agent is a loop. The LLM decides what to do, executes it, observes the result, and decides again.
                ┌──────────────┐
        ┌──────▶│   OBSERVE    │
        │       │              │
        │       │  What do I   │
        │       │  know now?   │
        │       └──────┬───────┘
        │              │
        │              ▼
        │       ┌──────────────┐
        │       │    THINK     │
        │       │              │
        │       │ What should  │
        │       │ I do next?   │
        │       └──────┬───────┘
        │              │
        │              ▼
        │       ┌──────────────┐
 ┌──────┴──┐    │     ACT      │
 │ Not done│◄───┤              │
 └─────────┘    │ Execute the  │
                │ decision     │
                └──────┬───────┘
                       │
                       ▼
                  ┌─────────┐
                  │  Done?  │
                  └────┬────┘
                       │ Yes
                       ▼
                  ┌─────────┐
                  │ OUTPUT  │
                  └─────────┘
Each iteration:
- Observe: What information do I have? What just happened?
- Think: Given my goal and current state, what’s the best next action?
- Act: Execute the chosen action
- Evaluate: Am I done? If not, loop.
The magic: the model decides the action at step 2. This is what makes it an agent, not a pipeline.
Takeaway: the loop is Observe → Think → Act → Evaluate. Three iteration types exist (continue / exit / error) and the most common production bug is shipping with only the first one defined.
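One way to make the three iteration types concrete is a loop whose decision step can return continue, exit, or error. This is a sketch under stated assumptions: `think` and `act` are hypothetical stubs standing in for the model call and the tool call, and the `Decision` and `LoopState` types are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Decision:
    kind: str            # "continue", "exit", or "error"
    action: str = ""
    payload: str = ""

@dataclass
class LoopState:
    goal: str
    observations: list = field(default_factory=list)

def think(state: LoopState) -> Decision:
    """Hypothetical stand-in for the model call."""
    if not state.observations:
        return Decision("continue", action="lookup")
    if state.observations[-1] == "lookup failed":
        return Decision("error", payload="lookup tool unavailable")
    return Decision("exit", payload=f"answer based on {state.observations[-1]}")

def act(action: str, tool_ok: bool) -> str:
    """Hypothetical stand-in for a tool call."""
    return "order shipped" if tool_ok else "lookup failed"

def run_loop(goal: str, tool_ok: bool = True, max_iterations: int = 5) -> dict:
    state = LoopState(goal)
    for i in range(max_iterations):
        decision = think(state)                  # Think, using what was observed
        if decision.kind == "exit":              # Evaluate: done
            return {"stop": "finish", "answer": decision.payload, "iterations": i + 1}
        if decision.kind == "error":             # Evaluate: cannot proceed
            return {"stop": "error", "error": decision.payload, "iterations": i + 1}
        # Act, then record the result as the next observation
        state.observations.append(act(decision.action, tool_ok))
    return {"stop": "max_iterations", "iterations": max_iterations}
```

A loop that only handles the "continue" branch would have neither early return, which is the bug described above.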
Tools: The Agent’s Capabilities
An agent without tools is just a chatbot. Tools are functions the agent can call.
Tool Definition:
  name: "search_orders"
  description: "Search orders by user ID, order ID, or date range"
  parameters:
    user_id: string (optional)
    order_id: string (optional)
    date_from: date (optional)

Agent receives tool descriptions → LLM learns WHEN to use
Agent receives user query → LLM decides WHICH tool + arguments
Tool returns result → Agent observes and continues
Common tool categories:
| Category | Examples | What it enables |
|---|---|---|
| Data retrieval | search_docs, query_database | Access information |
| External APIs | get_weather, check_inventory | Real-time data |
| Actions | send_email, create_ticket | Side effects |
| Computation | calculate, run_code | Complex logic |
| User interaction | ask_user, show_options | Clarification |
Tool descriptions are prompts. Good descriptions = agent uses tools correctly. Bad descriptions = agent guesses wrong. Both OpenAI and Anthropic ship the same loop shape with slightly different schemas ([openai-tools], [anthropic-tools]); the underlying mental model is identical.
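Since the description is what the model actually reads, it can help to build it from named parts instead of free text. The sketch below assumes a hypothetical `ToolSpec` helper (not part of any SDK) that forces each tool to state when to use it, what it returns, and what side effects it has, then renders the function-calling schema shape used later in this chapter.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical helper: a tool description assembled from explicit parts."""
    name: str
    when_to_use: str
    returns: str
    side_effects: str
    parameters: dict          # JSON Schema "properties" mapping
    required: tuple = ()

    def description(self) -> str:
        return f"{self.when_to_use} Returns {self.returns}. Side effects: {self.side_effects}."

    def to_schema(self) -> dict:
        return {
            "type": "function",
            "function": {
                "name": self.name,
                "description": self.description(),
                "parameters": {
                    "type": "object",
                    "properties": self.parameters,
                    "required": list(self.required),
                },
            },
        }

check_refund_spec = ToolSpec(
    name="check_refund",
    when_to_use="Use to check refund processing status for one order.",
    returns="{status, processed_date}",
    side_effects="none (read-only)",
    parameters={"order_id": {"type": "string", "description": "Order ID"}},
    required=("order_id",),
)
```

The design choice is that a missing field fails at construction time, so a vague description cannot be registered by accident.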
RAG as a Tool
Here’s the insight: RAG doesn’t get replaced by agents—it becomes a tool.
Available tools:
  • search_docs: Search knowledge base for relevant info
  • query_orders: Get order details from database
  • check_shipping: Get real-time shipping status

User: "When will my order #789 arrive?"

Agent thinks: "I need order details first"
Agent acts: query_orders(order_id="789")
Observation: {status: "shipped", carrier: "FedEx", ...}

Agent thinks: "Order is shipped, need tracking info"
Agent acts: check_shipping(tracking="FX123456")
Observation: {location: "Chicago", est_delivery: "Dec 5"}

Agent thinks: "I have all the info, can answer now"
Agent responds: "Your order shipped via FedEx and should arrive December 5th."
RAG is retrieval. Agents can use retrieval as one capability among many.
Takeaway: tools are the agent’s interface to the world — their descriptions are prompts in disguise. RAG doesn’t get replaced by agents; it becomes one of their tools. Vague tool descriptions cause silent failures more often than buggy tool code.
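To illustrate retrieval sitting in the same registry as any other tool, here is a minimal sketch. The retriever is a toy that ranks by shared words, standing in for a real vector search, and the documents, `TOOLS` registry, and `call_tool` helper are all assumptions made for the example.

```python
# Toy corpus; a real system would query a vector store instead.
DOCS = {
    "returns-policy": "Refunds are issued within 5 business days of approval.",
    "shipping-faq": "Standard shipping takes 3 to 7 business days.",
}

def search_docs(query: str, top_k: int = 1) -> list[dict]:
    """Toy retriever: rank documents by the number of shared lowercase words."""
    query_words = set(query.lower().split())
    scored = []
    for doc_id, text in DOCS.items():
        doc_words = set(text.lower().rstrip(".").split())
        scored.append((len(query_words & doc_words), doc_id, text))
    scored.sort(reverse=True)
    return [{"doc_id": d, "text": t} for score, d, t in scored[:top_k] if score > 0]

def query_orders(order_id: str) -> dict:
    """Stubbed order lookup."""
    return {"order_id": order_id, "status": "shipped"}

# Retrieval is one entry among several; the agent picks by name.
TOOLS = {"search_docs": search_docs, "query_orders": query_orders}

def call_tool(name: str, **kwargs):
    return TOOLS[name](**kwargs)
```

From the loop's point of view, `search_docs` is just another function that returns an observation.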
The ReAct Pattern
The most common agent architecture is ReAct — Reason + Act. The pattern was named by Yao et al. 2022 [yao2022], which paired explicit reasoning traces with environment-coupled actions. On ALFWorld it beat the prior imitation-learning baseline by 34 percentage points absolute, and on WebShop by 10 points [yao2022]. The lift comes from making the reasoning step explicit and inspectable, not from a smarter model.
User: "What's the refund status for alice@example.com?"

THOUGHT: I need to find any refund requests from this user. I'll search the orders first.
ACTION: search_orders(email="alice@example.com")
OBSERVATION: [ {order_id: "456", status: "refund_requested", amount: 99} ]

THOUGHT: Found an order with refund requested. I need to check the refund processing status.
ACTION: check_refund(order_id="456")
OBSERVATION: {status: "approved", processed_date: "Dec 3"}

THOUGHT: The refund has been approved and processed. I have enough info to answer.
ACTION: respond_to_user

RESPONSE: "Alice's refund of $99 for order #456 was approved and processed on December 3rd."
The THOUGHT step makes the agent’s reasoning visible. This helps with:
- Debugging (you can see why it chose an action)
- Guidance (you can provide examples of good reasoning)
- Error recovery (model realizes when it’s stuck)
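When the model emits its reasoning as plain text rather than structured tool calls, the loop has to parse each step. The sketch below assumes a simple text layout, `Thought: ...` followed by `Action: name[argument]`; that layout is an assumption for illustration, and real prompts vary. Returning `None` on a parse failure lets the caller treat malformed output as its own case instead of guessing.

```python
import re
from typing import NamedTuple, Optional

class Step(NamedTuple):
    thought: str
    action: str
    argument: str

# Assumed layout: "Thought: <text>" then "Action: <name>[<argument>]" at the end.
STEP_RE = re.compile(
    r"Thought:\s*(?P<thought>.*?)\s*Action:\s*(?P<action>\w+)\[(?P<argument>.*?)\]\s*$",
    re.DOTALL,
)

def parse_step(model_output: str) -> Optional[Step]:
    """Split one model turn into thought, action name, and action argument."""
    match = STEP_RE.search(model_output.strip())
    if match is None:
        return None
    return Step(match["thought"], match["action"], match["argument"])
```

Logging the parsed `thought` alongside the action is what makes the debugging benefit above usable in practice.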
Stop Conditions: The Half ReAct Tutorials Skip
ReAct loops don’t terminate on their own. The original paper introduced a special Finish[answer] action — when the model emits it, the loop ends and the bracketed string is returned. Production agents need three termination paths, not one. This is the chapter’s load-bearing claim, and the hero diagram at the top of the chapter visualises why:
- Finish[answer]: the model decided it has the answer. Return it and exit.
- max_iterations budget reached: every loop body has a budget (typically 5–15 iterations). When hit, return a partial answer + flag for human review. Never run unbounded.
- Terminal tool error or guardrail trip: circuit-breaker open, permission denied, catastrophic-action blocker fired. Return a structured failure; do not retry from the same state.
The runaway-loop trap nobody names: a ReAct agent with no max_iterations ceiling will call a broken tool 30+ times against a hard query, burn dollars in tokens, and never return. The user has already opened a support ticket. The fix is one parameter on the loop plus a logged outcome. This is the bridge into the production-agents series, which treats cost-control budgets and durable-execution checkpoints as first-class concerns [pa-cost, pa-durable].
Takeaway: ReAct’s contribution is the explicit Thought/Action/Observation triple, not a smarter model. The triple is the easy half. The hard half is wiring three stop conditions into the loop before a single user touches it.
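An iteration ceiling bounds the damage of a runaway loop, but a loop can also notice sooner that it is repeating itself. The following is a sketch of one possible guard, not something the ReAct paper prescribes: the hypothetical `RepeatGuard` class counts identical (tool, arguments) pairs and trips once a threshold is exceeded.

```python
import json
from collections import Counter

class RepeatGuard:
    """Trips when the same tool is called with the same arguments too often."""

    def __init__(self, max_repeats: int = 2):
        self.max_repeats = max_repeats
        self._seen = Counter()

    def should_stop(self, tool_name: str, arguments: dict) -> bool:
        # sort_keys makes argument order irrelevant when comparing calls
        key = (tool_name, json.dumps(arguments, sort_keys=True))
        self._seen[key] += 1
        return self._seen[key] > self.max_repeats

guard = RepeatGuard(max_repeats=2)
calls = [guard.should_stop("search_orders", {"email": "alice@example.com"}) for _ in range(3)]
# calls == [False, False, True]
```

A guard like this would sit next to the iteration budget, and a trip would be reported as a terminal condition rather than retried.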
Agent Memory
Agents without memory forget everything between turns. Production agents need memory.
SHORT-TERM MEMORY (Conversation Context)
────────────────────────────────────────
What:  Previous messages in current session
How:   Append to LLM context
Limit: Context window size

  User:  "Check order #123"
  Agent: "Order #123 shipped Dec 1"
  User:  "When will IT arrive?"   ← "it" = order #123

  Short-term memory resolves the reference

LONG-TERM MEMORY (Persistent Knowledge)
────────────────────────────────────────
What:  Facts that persist across sessions
How:   Vector store for semantic retrieval
Limit: Storage capacity

  Session 1: User says "I prefer email over SMS"
    → Store: ("user_preference", "prefers email for notifications")
  Session 2: Agent needs to notify user
    → Retrieve preference → Send email

WORKING MEMORY (Scratch Pad)
────────────────────────────────────────
What:  Intermediate results during task execution
How:   Structured state object
Limit: Task complexity

  Task: "Calculate total revenue by region"
  Working memory: {
    "north": 150000,
    "south": 120000,   ← Accumulated as agent works
    "east": pending...
  }
Without memory, agents can’t handle multi-turn conversations, learn user preferences, or maintain context across sessions.
Takeaway: three orthogonal memories — short-term (in-context, ephemeral), long-term (vector-store, persistent), working (structured scratch-pad, task-scoped). Conflating them is one of the most expensive design errors in production agents.
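Keeping the three memories as separate fields with separate lifetimes is one way to avoid conflating them. The sketch below is illustrative only: `AgentMemory` is a hypothetical container, a bounded `deque` stands in for the context window, and a plain dict stands in for the vector store.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    # Short-term: recent messages only; the deque drops the oldest when full.
    short_term: deque = field(default_factory=lambda: deque(maxlen=4))
    # Long-term: a plain dict standing in for a vector store.
    long_term: dict = field(default_factory=dict)
    # Working: scratch pad for the current task.
    working: dict = field(default_factory=dict)

    def add_message(self, role: str, content: str) -> None:
        self.short_term.append({"role": role, "content": content})

    def remember(self, key: str, fact: str) -> None:
        self.long_term[key] = fact

    def end_task(self) -> dict:
        """Hand back the scratch pad and reset it; other memories are untouched."""
        result, self.working = self.working, {}
        return result

    def end_session(self) -> None:
        """Session end clears short-term and working memory, not long-term."""
        self.short_term.clear()
        self.working = {}
```

Each memory has its own reset point here, which is the property that gets lost when everything is appended to one context string.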
When Agents Are Wrong
Agents are not always the answer. Sometimes they’re the problem.
USE DIRECT LLM CALL when:
  • Single-step task (summarize, translate, classify)
  • No external data needed
  • No actions required

USE RAG when:
  • Answer exists in your documents
  • Single retrieval + generation is sufficient
  • You want predictable, auditable answers

USE PIPELINE when:
  • Steps are known and fixed
  • High reliability required
  • Each step must happen regardless of previous results

USE AGENT when:
  • Task requires multiple tools/data sources
  • Strategy depends on intermediate results
  • User requests vary significantly
  • Recovery from failure requires reasoning
The “agent for everything” anti-pattern:
User: "What's 2 + 2?"

BAD (over-engineering):
  Agent thinks: "I should use the calculator tool"
  Agent acts: calculate("2 + 2")
  Observation: 4
  Agent responds: "The answer is 4"
  Cost: Multiple LLM calls, tool overhead
  Time: 2-3 seconds

GOOD (direct):
  LLM responds: "4"
  Cost: One LLM call
  Time: 200ms
Agents add:
- Latency: Multiple LLM calls per request
- Cost: Each thought/action cycle costs tokens
- Non-determinism: Same input can produce different paths
- New failure modes: Wrong tool selection, hallucinated arguments, infinite loops
Don’t use an agent when a simpler approach works.
Takeaway: Direct LLM / RAG / Pipeline / Agent is a four-way fork — pick the simplest that fits the problem. “Agent for everything” is the most expensive mistake in the field; every iteration of the loop is multiplied by latency, cost, and a new failure surface.
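The four-way fork can be written down as a small decision function. This is a simplification for illustration: the four boolean inputs and the function name are assumptions, and a real decision would weigh reliability and cost as well. The branches are ordered so that the simplest approach that fits is returned first.

```python
def choose_approach(
    needs_external_data: bool,
    needs_actions: bool,
    steps_known_in_advance: bool,
    answer_in_documents: bool,
) -> str:
    """Return the simplest approach that fits, checked from cheapest to costliest."""
    if not needs_external_data and not needs_actions:
        return "direct_llm"      # summarize, translate, classify
    if answer_in_documents and not needs_actions:
        return "rag"             # one retrieval plus one generation
    if steps_known_in_advance:
        return "pipeline"        # fixed steps, code decides
    return "agent"               # strategy depends on intermediate results
```

"Agent" is the fallthrough case, reached only when none of the cheaper options apply.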
Agent Failure Modes
Agents introduce new ways to fail:
1. WRONG TOOL SELECTION
   Agent picks search_docs when it should use query_orders
   Cause: Ambiguous tool descriptions, poor examples

2. HALLUCINATED ARGUMENTS
   Agent calls: check_order(order_id="MADE_UP_ID")
   Cause: Model invents plausible-looking arguments

3. INFINITE LOOPS
   Agent keeps trying the same failing action
   Cause: No loop detection, poor error handling instructions

4. PREMATURE TERMINATION
   Agent responds before gathering enough information
   Cause: Weak instructions to be thorough

5. SCOPE CREEP
   Agent takes actions beyond what user asked
   Cause: Unclear boundaries, model being "helpful"

6. CATASTROPHIC ACTIONS
   Agent deletes data, sends emails, makes purchases
   Cause: Powerful tools without guardrails
Takeaway: agents introduce six failure modes RAG doesn’t have — wrong tool, hallucinated arguments, infinite loops, premature termination, scope creep, catastrophic actions. The first three are addressed by the agent’s prompt and the stop conditions; the last three need guardrails outside the loop, which is exactly what the production-agents series covers.
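A check that runs before every tool call can catch some of these failures outside the model. The sketch below is an assumption-laden example: `DESTRUCTIVE_TOOLS` and its tool names are hypothetical, and the schema is the JSON Schema `parameters` object from a tool definition. It catches malformed argument sets and unconfirmed destructive calls; it does not catch a well-formed but invented ID, which still needs a lookup against real data.

```python
DESTRUCTIVE_TOOLS = {"delete_order", "send_email"}  # hypothetical tool names

def check_tool_call(name: str, arguments: dict, schema: dict,
                    confirmed: bool = False) -> list[str]:
    """Return a list of problems; an empty list means the call may proceed."""
    problems = []
    allowed = set(schema.get("properties", {}))
    for key in arguments:
        if key not in allowed:
            problems.append(f"unknown argument: {key}")
    for key in schema.get("required", []):
        if key not in arguments:
            problems.append(f"missing required argument: {key}")
    # Side-effecting tools need an explicit confirmation flag from outside the model.
    if name in DESTRUCTIVE_TOOLS and not confirmed:
        problems.append(f"{name} requires explicit confirmation")
    return problems
```

Returning the problems as data, rather than raising, lets the loop feed them back to the model as an observation or stop with a structured failure.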
Common Pitfalls & Misconceptions
The agent mental model is enough to design one. The table below is enough to ship one. Each row maps a misconception that derails new agent projects to its concrete fix.
| Misconception | Why it’s wrong | What to do instead |
|---|---|---|
| “More tools = more capability” | A 30-tool agent picks the wrong tool more often than a 3-tool agent does. Tool descriptions compete for the same context budget; selection accuracy drops with tool count. | Cap tool count at ~10 per agent. Compose agents if you need more capability; don’t extend tool lists. |
| “We added retry logic, the loop is safe now” | Retry without a stop condition is just a faster runaway loop. The model “helpfully” re-tries the same failing tool 47 times with the same arguments. | Wire all three stop conditions: Finish[answer], max_iterations, terminal tool error. Test by forcing each one in eval. |
| “Tool descriptions don’t matter, the model is smart” | Tool descriptions are prompts. A vague description such as "handles data" triggers selection errors that look like model bugs. The model is doing exactly what you described, just not what you meant. | Write tool descriptions as if briefing a new colleague: name the trigger condition, the inputs, the output shape, and the side effects. |
| “Agents are non-deterministic, so we can’t evaluate them” | The trajectory is non-deterministic, but the outcome on a fixed eval set is measurable. Same input → score the final answer and the trajectory shape (tool sequence, iteration count). | Evaluate three dimensions separately: task completion (did the user get the answer?), process quality (right tools in a reasonable order?), and safety (no out-of-scope actions). |
| “We added an LLM-judge eval, we’re done” | LLM-judge scores drift with the judge model version. Same agent, same answers, different judge release → different score. The production-agents series covers this trap in depth [pa-testing]. | Pin the judge model version. Run a held-out human-labelled eval set quarterly to calibrate drift. |
| “The agent worked in dev but goes wild in prod” | Dev tasks are crafted to match the agent’s tools. Real users ask for things that map ambiguously onto the tool set: the agent picks the wrong tool, falls into a retry loop, or scope-creeps into “helpful” side actions. | Log every (user query → tool sequence → outcome) triple in prod. Eval against the real distribution. Add HITL escalation when confidence is low. |
| “We’ll add guardrails later” | Catastrophic actions (delete production data, send unauthorized email, charge a card) are irreversible. There’s no later. | Implement permission boundaries + confirmation prompts + audit logging before the agent has access to the destructive tool. Make the guardrail the precondition, not the patch. |
Takeaway: agents fail in classes RAG doesn’t. Most are addressable with three structural fixes — explicit stop conditions, tool-count discipline, and guardrails outside the loop. The production-agents series is the operator-grade deep dive on those three.
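The three evaluation dimensions from the table can be scored independently on a logged trajectory. The sketch below assumes a hypothetical log record shape (`answer` plus a list of `steps` with a `tool` field) and a hand-written expectation per eval case; neither comes from a specific framework.

```python
def score_trajectory(logged: dict, expected: dict) -> dict:
    """Score one logged run on task completion, process quality, and safety."""
    tools_used = [step["tool"] for step in logged["steps"]]
    answer = (logged["answer"] or "").lower()
    return {
        # Did the user get the answer? Checked with a simple substring match.
        "task_completion": expected["answer_contains"].lower() in answer,
        # Right tools, in the expected order, within the iteration budget?
        "process_quality": (
            tools_used == expected["tool_sequence"]
            and len(tools_used) <= expected["max_iterations"]
        ),
        # No tool outside the allowed set was called.
        "safety": all(tool in expected["allowed_tools"] for tool in tools_used),
    }
```

Keeping the three scores separate means a run that reaches the right answer through an out-of-scope action still fails on safety.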
Code Example
A minimal ReAct loop pinned to current OpenAI tool-calling semantics, with all three stop conditions wired in. The loop body is ~30 lines; the stop-condition logic is most of the value:
# Tested on:
#   openai==1.40.0
#   Python 3.11
import json

from openai import OpenAI

client = OpenAI()

# 1. Tool schemas (OpenAI 1.x function-calling spec) --------------------------
tools = [
    {
        "type": "function",
        "function": {
            "name": "search_orders",
            "description": (
                "Search for orders by user email OR order ID. Returns a list of orders "
                "with fields: order_id, status, amount. Use when the user names a customer "
                "or order; do NOT use for shipping or refund status queries."
            ),
            "parameters": {
                "type": "object",
                "properties": {
                    "email": {"type": "string", "description": "User email address"},
                    "order_id": {"type": "string", "description": "Order ID"},
                },
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "check_refund",
            "description": (
                "Check refund processing status for a specific order. Returns "
                "{status, processed_date}. Requires an order_id from a prior search_orders call."
            ),
            "parameters": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
            },
        },
    },
]

# 2. Tool implementations (mocked) ---------------------------------------------
def search_orders(email=None, order_id=None):
    return [{"order_id": "456", "status": "refund_requested", "amount": 99}]

def check_refund(order_id):
    return {"status": "approved", "processed_date": "Dec 3"}

TOOL_REGISTRY = {"search_orders": search_orders, "check_refund": check_refund}

def execute_tool(name: str, arguments: dict) -> dict | list:
    fn = TOOL_REGISTRY.get(name)
    if fn is None:
        # stop #3: terminal tool error (guardrail trip)
        return {"_terminal_error": f"Unknown tool: {name}"}
    try:
        return fn(**arguments)
    except Exception as e:
        return {"_terminal_error": str(e)}

# 3. The agent loop with all three stop conditions ----------------------------
def run_agent(user_message: str, max_iterations: int = 5) -> dict:
    messages = [
        {"role": "system", "content": (
            "You are a customer-service agent. Use tools to gather information, "
            "then answer the user. If a tool fails twice with the same arguments, stop "
            "and report the failure; do not retry indefinitely."
        )},
        {"role": "user", "content": user_message},
    ]
    for iteration in range(max_iterations):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
            tools=tools,
        )
        msg = resp.choices[0].message

        # stop #1: Finish[answer]. The model emitted no tool calls, so the answer is ready.
        if not msg.tool_calls:
            return {"answer": msg.content, "iterations": iteration + 1, "stop": "finish"}

        messages.append(msg)
        for tool_call in msg.tool_calls:
            name = tool_call.function.name
            args = json.loads(tool_call.function.arguments)
            result = execute_tool(name, args)

            # stop #3: terminal tool error. Return a structured failure, do not retry.
            # Tools may return a list, so check the type before looking for the error key.
            if isinstance(result, dict) and "_terminal_error" in result:
                return {"answer": None, "error": result["_terminal_error"], "stop": "tool_error"}

            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": json.dumps(result),
            })

    # stop #2: max_iterations. Return partial state for HITL review.
    return {"answer": None, "stop": "max_iterations", "messages": messages}

result = run_agent("What's the refund status for alice@example.com?")
print(result)
The three return paths in run_agent are the three stop conditions: two inside the loop, and one after it for the exhausted iteration budget. Strip any of them and you have a runaway-loop incident waiting for a hard query. Production agents bolt cost-tracking, idempotency keys, and durable checkpointing onto this skeleton, as covered in the production-agents series [pa-overview].
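On the caller's side, the `stop` field in the dictionary returned by `run_agent` is what decides what the user sees. The following is one possible handler, offered as a sketch: the reply wording and the `escalate` flag are assumptions, while the `stop`, `answer`, and `error` keys match the return values in the example above.

```python
def handle_result(result: dict) -> dict:
    """Map an agent result to a user-facing reply and an escalation flag."""
    stop = result.get("stop")
    if stop == "finish":
        return {"reply": result["answer"], "escalate": False}
    if stop == "max_iterations":
        # Budget exhausted: hand the partial state to a human reviewer.
        return {
            "reply": "I could not finish this request; a teammate will follow up.",
            "escalate": True,
        }
    if stop == "tool_error":
        # Keep the raw error for logs, not for the user.
        return {
            "reply": "Something went wrong while looking that up.",
            "escalate": True,
            "log": result.get("error"),
        }
    raise ValueError(f"unexpected stop reason: {stop!r}")
```

Raising on an unknown stop reason is deliberate: a new exit path added to the loop should fail loudly here until someone decides how to handle it.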
Verify Your Understanding
Before continuing, you should be able to answer these from memory:
- Pipeline vs agent in one sentence each. Then name a task where a pipeline beats an agent on reliability and cost, and a task where the reverse is true.
- Walk the agent loop. Observe → Think → Act → Evaluate. What does the model decide at each step? Where does the prompt show up? Where does the tool schema show up?
- Name the three stop conditions. For each, give the symptom that fires it and the structured response the loop returns. What goes wrong if you skip stop #2?
- The 30-tools mistake. Why does a 30-tool agent typically perform worse than a 3-tool agent? Explain at the level of the tool-selection prompt.
- Diagnose a runaway. Your agent called search_orders 47 times against an empty-result observation. Which of the three stop conditions is missing, what code change fixes it, and what guardrail outside the loop catches the residual cost risk?
What’s Next
Agents make the data path bigger and more dangerous. The next chapter — Agents → Evaluation — covers the three eval dimensions (task completion · process quality · safety), the LLM-judge drift trap, and how to build a test suite that survives a model upgrade.
References
- [yao2022] Yao, S. et al. ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023. arXiv:2210.03629. Named the Thought / Action / Observation pattern; introduced the Finish[answer] terminal action. Benchmark gains: +34pp on ALFWorld and +10pp on WebShop vs prior imitation/RL baselines. Cited in §§ Building On Previous Knowledge, The ReAct Pattern, Stop Conditions.
- [openai-tools] OpenAI. Function calling and tools guide. platform.openai.com/docs/guides/function-calling. Canonical tool-schema, tool_calls array, and role: "tool" message conventions used in the Code Example. Cited in § Code Example.
- [anthropic-tools] Anthropic. Tool use with Claude. docs.claude.com/en/docs/agents-and-tools/tool-use. Equivalent tool-calling pattern for Claude: same loop, slightly different schema. Cited in § Tools: The Agent’s Capabilities.
- [pa-overview] Production Agents, Part 0: Overview. Cross-series bridge: the operator-grade companion to this chapter. Cited in § Code Example.
- [pa-cost] Production Agents, Part 4: Cost Control & Token Budgets. Operationalises stop condition #2 (max_iterations) as a budget enforcement layer. Cited in § Stop Conditions.
- [pa-durable] Production Agents, Part 6: Durable Execution. Temporal / Inngest / Restate patterns that turn the loop body into a crash-safe workflow. Cited in § Stop Conditions.
- [pa-testing] Production Agents, Part 8: Testing & Evaluation. LLM-judge drift trap and golden-dataset patterns referenced in the Common Pitfalls table. Cited in § Common Pitfalls & Misconceptions.
Go Deeper: Production Agents
This article covers the agent mental model. For production patterns (idempotency, checkpointing, HITL, cost control), see the Production Agents Deep Dive series:
| Part | Topic | What You’ll Learn |
|---|---|---|
| 0 | Overview | Why 98% of orgs haven’t deployed agents at scale |
| 1 | Idempotency | Safe retries, the Stripe pattern |
| 2 | State & Memory | Checkpointing, memory systems |
| 3 | Human-in-the-Loop | Confidence routing, escalation |
| 4 | Cost Control | Token budgets, circuit breakers |
| 5 | Observability | Silent failure detection |
| 6 | Durable Execution | Temporal, Inngest, Restate |
| 7 | Security | Sandboxing, prompt injection |
| 8 | Testing | Golden datasets, evaluation |