RAG to Agents - From Retrieval to Action | Intentional / Deliberate / Engineering

Figure: ReAct loop with three explicit stop conditions and one runaway trap. Left panel: the canonical ReAct loop (Thought → Action → Observation) with three stop conditions shown in green below it: stop #1 'Finish[answer] action emitted', stop #2 'max_iterations budget reached', stop #3 'terminal tool error or guardrail trip'. Right panel: the same loop without stop conditions, at iteration 47, with the model calling search_orders(BAD_ID) repeatedly against an empty observation. A red warning panel reads: 'No stop condition = production incident. 47 LLM calls, ~$3.40 burned on one user query, 30s+ latency, no exit signal, model helpfully keeps trying the same failing tool; the trap nobody names.'

Building On Previous Knowledge

The previous chapter taught the RAG debugging decision tree: every wrong RAG answer is one of three failure classes. RAG is one-shot — retrieve once, generate once. That’s enough when the user query maps cleanly to a passage. It’s not enough when the user wants an action: refund this order, schedule that meeting, query and then update three systems.

This chapter promotes the data path one level: from “retrieve a document” to “decide which tool to call, observe its output, decide what to do next”. The mechanism is the agent loop; the canonical academic reference is Yao et al. 2022, ReAct [yao2022]; the production reality is messier than the paper suggests.

Where most agent tutorials stop: they show the Thought → Action → Observation triple, demo a happy-path single iteration, and ship a LangChain snippet. They never tell you what terminates the loop. The trap nobody names: a ReAct loop without explicit stop conditions silently runs until you blow your token budget. This chapter shows the three stop conditions every production agent ships with. The bridge into the production-agents series follows — checkpointing, idempotency, and cost-control turn the loop body into something that survives users.

Takeaway: agents are the same data path as RAG with one promotion — the model now picks the next action in a loop, not just the answer to one prompt. The hard engineering is termination, not generation.

Why This Matters

RAG answers questions. Agents solve problems.

When a user asks “What’s the status of order #12345?”, RAG retrieves a document. But what if answering requires:

Querying an order database
Checking shipping status from an API
Calculating estimated delivery based on location
Composing a response with all that information

RAG can’t do this. RAG retrieves static documents. Agents take actions.

If you try to build multi-step systems with RAG patterns, you’ll create brittle pipelines that break on variation. Understanding the agent mental model lets you build flexible systems that adapt.

Takeaway: agents are warranted when the strategy depends on what you find — not when “multi-step” describes the workflow. Static multi-step pipelines beat agents on reliability and cost.

What Goes Wrong Without This:

Agent Failure Patterns

Symptom: Your "smart assistant" can only answer questions from
         documents. Users ask for actions, it apologizes.
Cause:   You built RAG when you needed an agent. RAG retrieves
         information. It doesn't take action or call APIs.

Symptom: Your multi-step pipeline is 500 lines of if/else handling
         every edge case. Adding a new capability requires 2 weeks.
Cause:   You hardcoded the reasoning that should be delegated to the LLM.
         Every variation is a code branch.

Symptom: Your agent attempts an action, fails, and doesn't recover.
         It returns "Error occurred" to the user.
Cause:   You built a pipeline, not an agent. Pipelines don't adapt.
         Agents observe results and adjust.

Pipelines vs Agents

There are two ways to build multi-step AI systems:

Pipelines vs Agents

PIPELINE (Code decides):

Input → Step 1 → Step 2 → Step 3 → Output
           ↓        ↓        ↓
        [fixed]  [fixed]  [fixed]

The code determines what happens at each step.
Each branch is explicitly written.
Predictable, but rigid.

AGENT (Model decides):

         ┌───────────────────────────┐
         │                           │
Input →  │  Observe current state    │◄────────┐
         │          ↓                │         │
         │  Think: what next?        │         │
         │          ↓                │         │
         │  Act: execute decision    │─────────┘
         └─────────────┬─────────────┘
                       ↓
               Output (when done)

The model determines what happens at each step.
Flexible, but less predictable.

The key question: Who decides the next step—your code or the model?

Pipeline: You enumerate all paths. Reliable for known scenarios. Fails on novel scenarios.
Agent: Model reasons about what to do. Handles variation. Can make mistakes.

Neither is better. They solve different problems.

Takeaway: the deciding question is who picks the next step — your code (pipeline) or the model (agent). Pipelines are predictable but rigid; agents are flexible but non-deterministic. Pick by failure-mode tolerance, not by hype.
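The difference in who decides can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a real agent: `fake_model_decide` is a hypothetical stand-in for an LLM call, and the three step functions are invented for the example.

```python
from typing import Callable

# Hypothetical steps; in a real system these would call databases or APIs.
def lookup_order(state: dict) -> dict:
    return {**state, "order": {"id": state["order_id"], "status": "shipped"}}

def lookup_shipping(state: dict) -> dict:
    return {**state, "eta": "Dec 5"}

def compose_reply(state: dict) -> dict:
    return {**state, "reply": f"Order {state['order']['id']} arrives {state['eta']}"}

def run_pipeline(state: dict) -> dict:
    """Pipeline: the code fixes the order of the steps."""
    for step in (lookup_order, lookup_shipping, compose_reply):
        state = step(state)
    return state

STEPS: dict[str, Callable[[dict], dict]] = {
    "lookup_order": lookup_order,
    "lookup_shipping": lookup_shipping,
    "compose_reply": compose_reply,
}

def fake_model_decide(state: dict) -> str:
    """Stand-in for an LLM: picks the next step by looking at the current state."""
    if "order" not in state:
        return "lookup_order"
    if "eta" not in state:
        return "lookup_shipping"
    if "reply" not in state:
        return "compose_reply"
    return "done"

def run_agent_sketch(state: dict, max_steps: int = 10) -> dict:
    """Agent: the decision function picks each step; the loop is bounded."""
    for _ in range(max_steps):
        choice = fake_model_decide(state)
        if choice == "done":
            break
        state = STEPS[choice](state)
    return state

pipeline_result = run_pipeline({"order_id": "789"})
agent_result = run_agent_sketch({"order_id": "789"})
```

On this happy path both variants end in the same state. They differ once the state varies: the pipeline always runs all three steps in order, while the agent variant asks the decision function again after every step.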

The Agent Loop

An agent is a loop. The LLM decides what to do, executes it, observes the result, and decides again.

The Agent Loop

                  ┌──────────────┐
          ┌──────▶│   OBSERVE    │
          │       │              │
          │       │ What do I    │
          │       │ know now?    │
          │       └──────┬───────┘
          │              │
          │              ▼
          │       ┌──────────────┐
          │       │    THINK     │
          │       │              │
          │       │ What should  │
          │       │ I do next?   │
          │       └──────┬───────┘
          │              │
          │              ▼
          │       ┌──────────────┐
   ┌──────┴──┐    │     ACT      │
   │ Not done│◄───┤              │
   └─────────┘    │ Execute the  │
                  │ decision     │
                  └──────┬───────┘
                         │
                         ▼
                    ┌─────────┐
                    │  Done?  │
                    └────┬────┘
                         │ Yes
                         ▼
                    ┌─────────┐
                    │ OUTPUT  │
                    └─────────┘

Each iteration:

1. Observe: What information do I have? What just happened?
2. Think: Given my goal and current state, what’s the best next action?
3. Act: Execute the chosen action
4. Evaluate: Am I done? If not, loop.

The magic: the model decides the action at step 2. This is what makes it an agent, not a pipeline.

Takeaway: the loop is Observe → Think → Act → Evaluate. Three iteration types exist (continue / exit / error) and the most common production bug is shipping with only the first one defined.
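The three iteration types can be made explicit in a loop skeleton. The sketch below assumes a hypothetical `think` policy in place of the LLM and a hypothetical `act` function in place of real tools; the names `Decision`, `LoopState`, and `run_loop` are illustrative only.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Decision:
    kind: str          # "continue", "exit", or "error"
    payload: str = ""  # action to run, final answer, or error message

@dataclass
class LoopState:
    goal: str
    observations: list = field(default_factory=list)

def think(state: LoopState) -> Decision:
    """Hypothetical policy standing in for the LLM call."""
    if "tool unavailable" in state.observations:
        return Decision("error", "tool unavailable")
    if len(state.observations) >= 2:
        return Decision("exit", f"answer built from {len(state.observations)} observations")
    return Decision("continue", f"lookup #{len(state.observations) + 1}")

def act(action: str) -> str:
    """Hypothetical tool execution; returns the observation."""
    return f"result of {action}"

def run_loop(goal: str, max_iterations: int = 5,
             first_observation: Optional[str] = None) -> dict:
    state = LoopState(goal)
    if first_observation is not None:
        state.observations.append(first_observation)
    for i in range(max_iterations):
        decision = think(state)                            # Think
        if decision.kind == "exit":                        # Evaluate: done
            return {"stop": "finish", "answer": decision.payload, "iterations": i + 1}
        if decision.kind == "error":                       # Evaluate: cannot proceed
            return {"stop": "error", "error": decision.payload, "iterations": i + 1}
        state.observations.append(act(decision.payload))   # Act, then Observe
    # The "continue" path ran out of budget.
    return {"stop": "max_iterations", "iterations": max_iterations}
```

Calling `run_loop("refund status")` exits through the finish path, `run_loop("refund status", max_iterations=2)` exhausts the budget, and passing `first_observation="tool unavailable"` forces the error path.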

Tools: The Agent’s Capabilities

An agent without tools is just a chatbot. Tools are functions the agent can call.

Tools Give Agents Capabilities

Tool Definition:
┌───────────────────────────────────────────────────────────┐
│ name: "search_orders"                                     │
│ description: "Search orders by user ID, order ID,         │
│               or date range"                              │
│ parameters:                                               │
│   user_id: string (optional)                              │
│   order_id: string (optional)                             │
│   date_from: date (optional)                              │
└───────────────────────────────────────────────────────────┘

Agent receives tool descriptions → LLM learns WHEN to use
Agent receives user query → LLM decides WHICH tool + arguments
Tool returns result → Agent observes and continues

Common tool categories:

Category	Examples	What it enables
Data retrieval	search_docs, query_database	Access information
External APIs	get_weather, check_inventory	Real-time data
Actions	send_email, create_ticket	Side effects
Computation	calculate, run_code	Complex logic
User interaction	ask_user, show_options	Clarification

Tool descriptions are prompts. Good descriptions = agent uses tools correctly. Bad descriptions = agent guesses wrong. Both OpenAI and Anthropic ship the same loop shape with slightly different schemas ([openai-tools], [anthropic-tools]); the underlying mental model is identical.

RAG as a Tool

Here’s the insight: RAG doesn’t get replaced by agents—it becomes a tool.

RAG as Agent Tool

Available tools:
• search_docs: Search knowledge base for relevant info
• query_orders: Get order details from database
• check_shipping: Get real-time shipping status

User: "When will my order #789 arrive?"

Agent thinks: "I need order details first"
Agent acts: query_orders(order_id="789")
Observation: {status: "shipped", carrier: "FedEx", ...}

Agent thinks: "Order is shipped, need tracking info"
Agent acts: check_shipping(tracking="FX123456")
Observation: {location: "Chicago", est_delivery: "Dec 5"}

Agent thinks: "I have all the info, can answer now"
Agent responds: "Your order shipped via FedEx and
should arrive December 5th."

RAG is retrieval. Agents can use retrieval as one capability among many.

Takeaway: tools are the agent’s interface to the world — their descriptions are prompts in disguise. RAG doesn’t get replaced by agents; it becomes one of their tools. Vague tool descriptions cause silent failures more often than buggy tool code.
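A minimal sketch of "RAG as a tool": retrieval is registered next to the other tools and dispatched the same way. The keyword-overlap retriever and the `DOCS` contents are assumptions made for illustration; a real system would typically put a vector store behind `search_docs`.

```python
# Toy knowledge base (illustrative content only).
DOCS = {
    "returns": "Items can be returned within 30 days of delivery.",
    "shipping": "Standard shipping takes 3 to 5 business days.",
}

def search_docs(query: str, top_k: int = 1) -> list[str]:
    """Toy retriever: rank documents by how many query words they contain."""
    words = set(query.lower().split())
    ranked = sorted(
        DOCS.values(),
        key=lambda text: len(words & set(text.lower().rstrip(".").split())),
        reverse=True,
    )
    return ranked[:top_k]

def query_orders(order_id: str) -> dict:
    """Mocked order lookup."""
    return {"order_id": order_id, "status": "shipped", "carrier": "FedEx"}

# Retrieval sits in the same registry as every other tool.
TOOLS = {"search_docs": search_docs, "query_orders": query_orders}

def call_tool(name: str, **kwargs):
    """Dispatch by tool name, the way an agent loop would after the model picks a tool."""
    return TOOLS[name](**kwargs)

doc_hits = call_tool("search_docs", query="how long does shipping take")
order = call_tool("query_orders", order_id="789")
```

From the loop's point of view there is no difference between the two calls: both take arguments chosen by the model and return an observation.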

The ReAct Pattern

The most common agent architecture is ReAct (Reason + Act). The pattern was named by Yao et al. 2022 [yao2022], which paired explicit reasoning traces with environment-coupled actions. On ALFWorld it beat prior imitation and reinforcement learning baselines by 34 percentage points in absolute success rate, and on WebShop by 10 points [yao2022]. The lift comes from making the reasoning step explicit and inspectable, not from a smarter model.

ReAct Example

User: "What's the refund status for alice@example.com?"

┌─────────────────────────────────────────────────────────┐
│ THOUGHT: I need to find any refund requests from this   │
│          user. I'll search the orders first.            │
│                                                         │
│ ACTION:  search_orders(email="alice@example.com")       │
└─────────────────────────────────────────────────────────┘

OBSERVATION: [
  {order_id: "456", status: "refund_requested", amount: 99}
]

┌─────────────────────────────────────────────────────────┐
│ THOUGHT: Found an order with refund requested.          │
│          I need to check the refund processing status.  │
│                                                         │
│ ACTION:  check_refund(order_id="456")                   │
└─────────────────────────────────────────────────────────┘

OBSERVATION: {status: "approved", processed_date: "Dec 3"}

┌─────────────────────────────────────────────────────────┐
│ THOUGHT: The refund has been approved and processed.    │
│          I have enough info to answer.                  │
│                                                         │
│ ACTION:  respond_to_user                                │
└─────────────────────────────────────────────────────────┘

RESPONSE: "Alice's refund of $99 for order #456 was
approved and processed on December 3rd."

The THOUGHT step makes the agent’s reasoning visible. This helps with:

Debugging (you can see why it chose an action)
Guidance (you can provide examples of good reasoning)
Error recovery (model realizes when it’s stuck)

Stop Conditions: The Half ReAct Tutorials Skip

ReAct loops don’t terminate on their own. The original paper introduced a special Finish[answer] action — when the model emits it, the loop ends and the bracketed string is returned. Production agents need three termination paths, not one. This is the chapter’s load-bearing claim, and the hero diagram at the top of the chapter visualises why:

1. Finish[answer]: the model decided it has the answer. Return it and exit.
2. max_iterations budget reached: every loop body has a budget (typically 5–15 iterations). When hit, return a partial answer plus a flag for human review. Never run unbounded.
3. Terminal tool error or guardrail trip: circuit-breaker open, permission denied, catastrophic-action blocker fired. Return a structured failure; do not retry from the same state.

The runaway-loop trap nobody names: a ReAct agent with no max_iterations ceiling will call a broken tool 30+ times against a hard query, burn dollars in tokens, and never return. The user has already opened a support ticket. The fix is one parameter on the loop plus a logged outcome. This is the bridge into the production-agents series, which treats cost-control budgets and durable-execution checkpoints as first-class concerns [pa-cost, pa-durable].

Takeaway: ReAct’s contribution is the explicit Thought/Action/Observation triple, not a smarter model. The triple is the easy half. The hard half is wiring three stop conditions into the loop before a single user touches it.
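Stop condition #2 is often paired with a spend ceiling, since iteration count alone does not bound cost when the context grows each turn. The sketch below is one possible shape for such a budget. The per-token prices are placeholders, not any provider's real rates, and the `Budget` class is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Budget:
    max_iterations: int = 10
    max_cost_usd: float = 0.50
    # Placeholder prices per 1,000 tokens; substitute your provider's real rates.
    input_price_per_1k: float = 0.005
    output_price_per_1k: float = 0.015
    iterations: int = 0
    cost_usd: float = 0.0

    def record_call(self, input_tokens: int, output_tokens: int) -> None:
        """Account for one LLM call."""
        self.iterations += 1
        self.cost_usd += (
            input_tokens / 1000 * self.input_price_per_1k
            + output_tokens / 1000 * self.output_price_per_1k
        )

    def exhausted(self) -> Optional[str]:
        """Return the reason the loop must stop, or None if it may continue."""
        if self.iterations >= self.max_iterations:
            return "max_iterations"
        if self.cost_usd >= self.max_cost_usd:
            return "max_cost"
        return None

budget = Budget(max_iterations=15, max_cost_usd=0.50)
stop_reason = None
while stop_reason is None:
    # Pretend each LLM call consumed 8,000 input and 500 output tokens.
    budget.record_call(input_tokens=8000, output_tokens=500)
    stop_reason = budget.exhausted()
```

With these placeholder numbers each call costs $0.0475, so the cost ceiling trips on the 11th call, before the 15-iteration ceiling is reached. Whichever limit fires first becomes the logged stop reason.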

Agent Memory

Agents without memory forget everything between turns. Production agents need memory.

Memory Types

SHORT-TERM MEMORY (Conversation Context)
────────────────────────────────────────
What: Previous messages in current session
How: Append to LLM context
Limit: Context window size

User: "Check order #123"
Agent: "Order #123 shipped Dec 1"
User: "When will IT arrive?" ← "it" = order #123
Short-term memory resolves the reference

LONG-TERM MEMORY (Persistent Knowledge)
────────────────────────────────────────
What: Facts that persist across sessions
How: Vector store for semantic retrieval
Limit: Storage capacity

Session 1: User says "I prefer email over SMS"
→ Store: ("user_preference", "prefers email for notifications")

Session 2: Agent needs to notify user
→ Retrieve preference → Send email

WORKING MEMORY (Scratch Pad)
────────────────────────────────────────
What: Intermediate results during task execution
How: Structured state object
Limit: Task complexity

Task: "Calculate total revenue by region"
Working memory: {
"north": 150000,
"south": 120000, ← Accumulated as agent works
"east": pending...
}

Without memory, agents can’t handle multi-turn conversations, learn user preferences, or maintain context across sessions.

Takeaway: three orthogonal memories — short-term (in-context, ephemeral), long-term (vector-store, persistent), working (structured scratch-pad, task-scoped). Conflating them is one of the most expensive design errors in production agents.
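The three memories differ mainly in lifetime and shape, which a small sketch can make concrete. The class names are hypothetical, the short-term limit is counted in messages rather than tokens for simplicity, and a plain dict stands in for the vector store.

```python
from collections import deque

class ShortTermMemory:
    """Conversation context, bounded like a context window (here: by message count)."""
    def __init__(self, max_messages: int = 4):
        self.messages = deque(maxlen=max_messages)

    def add(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})

class LongTermMemory:
    """Persists across sessions. A dict stands in for the vector store."""
    def __init__(self):
        self._facts: dict[tuple[str, str], str] = {}

    def store(self, user_id: str, key: str, value: str) -> None:
        self._facts[(user_id, key)] = value

    def retrieve(self, user_id: str, key: str):
        return self._facts.get((user_id, key))

class WorkingMemory:
    """Task-scoped scratch pad; discarded when the task ends."""
    def __init__(self):
        self.state: dict = {}

long_term = LongTermMemory()

# Session 1: the user states a preference.
long_term.store("alice", "user_preference", "prefers email for notifications")

# Session 2: fresh short-term and working memory, same long-term store.
short_term = ShortTermMemory(max_messages=2)
short_term.add("user", "Check order #123")
short_term.add("assistant", "Order #123 shipped Dec 1")
short_term.add("user", "When will it arrive?")  # oldest message is evicted

working = WorkingMemory()
working.state["north"] = 150000
working.state["south"] = 120000

preference = long_term.retrieve("alice", "user_preference")
```

The eviction in session 2 shows the short-term limit in action: once the first message drops out, "it" can only be resolved from the assistant's reply that still mentions order #123.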

When Agents Are Wrong

Agents are not always the answer. Sometimes they’re the problem.

When to Use What

USE DIRECT LLM CALL when:
• Single-step task (summarize, translate, classify)
• No external data needed
• No actions required

USE RAG when:
• Answer exists in your documents
• Single retrieval + generation is sufficient
• You want predictable, auditable answers

USE PIPELINE when:
• Steps are known and fixed
• High reliability required
• Each step must happen regardless of previous results

USE AGENT when:
• Task requires multiple tools/data sources
• Strategy depends on intermediate results
• User requests vary significantly
• Recovery from failure requires reasoning

The “agent for everything” anti-pattern:

Over-Engineering Anti-Pattern

User: "What's 2 + 2?"

BAD (over-engineering):
Agent thinks: "I should use the calculator tool"
Agent acts: calculate("2 + 2")
Observation: 4
Agent responds: "The answer is 4"

Cost: Multiple LLM calls, tool overhead
Time: 2-3 seconds

GOOD (direct):
LLM responds: "4"

Cost: One LLM call
Time: 200ms

Agents add:

Latency: Multiple LLM calls per request
Cost: Each thought/action cycle costs tokens
Non-determinism: Same input can produce different paths
New failure modes: Wrong tool selection, hallucinated arguments, infinite loops

Don’t use an agent when a simpler approach works.

Takeaway: Direct LLM / RAG / Pipeline / Agent is a four-way fork; pick the simplest that fits the problem. “Agent for everything” is one of the most expensive mistakes in the field: every iteration of the loop adds latency, cost, and a new failure surface.
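The four-way fork can be written down as a small decision function. This is one possible encoding of the checklist above, not a rule from any library; the parameter names are assumptions chosen for the sketch.

```python
def choose_architecture(
    needs_external_data: bool,
    needs_actions: bool,
    single_retrieval_suffices: bool,
    strategy_depends_on_results: bool,
) -> str:
    """Return the simplest option that fits the task."""
    if not needs_external_data and not needs_actions:
        return "direct_llm"   # summarize, translate, classify
    if strategy_depends_on_results:
        return "agent"        # next step depends on what the last step found
    if single_retrieval_suffices and not needs_actions:
        return "rag"          # the answer lives in the documents
    return "pipeline"         # steps are known and fixed

# "What's 2 + 2?"
simple = choose_architecture(False, False, False, False)
# "What is your return policy?"
policy = choose_architecture(True, False, True, False)
# Fixed nightly export: query, transform, email
nightly = choose_architecture(True, True, False, False)
# "Figure out why my refund is late and fix it"
refund = choose_architecture(True, True, False, True)
```

The order of the checks carries the design choice: the agent branch is only reached when the strategy genuinely depends on intermediate results, so "multi-step" alone never selects it.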

Agent Failure Modes

Agents introduce new ways to fail:

Agent-Specific Failures

1. WRONG TOOL SELECTION
 Agent picks search_docs when it should use query_orders
 Cause: Ambiguous tool descriptions, poor examples

2. HALLUCINATED ARGUMENTS
 Agent calls: check_order(order_id="MADE_UP_ID")
 Cause: Model invents plausible-looking arguments

3. INFINITE LOOPS
 Agent keeps trying the same failing action
 Cause: No loop detection, poor error handling instructions

4. PREMATURE TERMINATION
 Agent responds before gathering enough information
 Cause: Weak instructions to be thorough

5. SCOPE CREEP
 Agent takes actions beyond what user asked
 Cause: Unclear boundaries, model being "helpful"

6. CATASTROPHIC ACTIONS
 Agent deletes data, sends emails, makes purchases
 Cause: Powerful tools without guardrails

Takeaway: agents introduce six failure modes RAG doesn’t have — wrong tool, hallucinated arguments, infinite loops, premature termination, scope creep, catastrophic actions. The first three are addressed by the agent’s prompt and the stop conditions; the last three need guardrails outside the loop, which is exactly what the production-agents series covers.
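Failure mode 3 names missing loop detection as a cause. A minimal detector can count identical (tool, arguments) pairs and block the call once a threshold is crossed. The `RepeatDetector` class and its threshold are assumptions for illustration.

```python
import json
from collections import Counter

class RepeatDetector:
    """Flags a tool call once the same (tool, arguments) pair has been seen too often."""
    def __init__(self, max_repeats: int = 2):
        self.max_repeats = max_repeats
        self._seen = Counter()

    def should_block(self, tool_name: str, arguments: dict) -> bool:
        # Canonical JSON so that argument key order does not matter.
        key = (tool_name, json.dumps(arguments, sort_keys=True))
        self._seen[key] += 1
        return self._seen[key] > self.max_repeats

detector = RepeatDetector(max_repeats=2)
calls = [("search_orders", {"order_id": "BAD_ID"})] * 5
blocked_at = None
for attempt, (name, args) in enumerate(calls, start=1):
    if detector.should_block(name, args):
        blocked_at = attempt  # the loop would return a structured failure here
        break
```

In an agent loop the check would run just before tool execution, and a block would exit through the terminal-error stop condition rather than feeding another empty observation back to the model.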

Common Pitfalls & Misconceptions

The agent mental model is enough to design one. The table below is enough to ship one. Each row maps a misconception that derails new agent projects to its concrete fix.

Misconception	Why it’s wrong	What to do instead
“More tools = more capability”	A 30-tool agent picks the wrong tool more often than a 3-tool agent does. Tool descriptions compete for the same context budget; selection accuracy drops with tool count.	Cap tool count at ~10 per agent. Compose agents if you need more capability; don’t extend tool lists.
“We added retry logic, the loop is safe now”	Retry without a stop condition is just a faster runaway loop. The model “helpfully” re-tries the same failing tool 47 times with the same arguments.	Wire all three stop conditions: `Finish[answer]`, `max_iterations`, terminal tool error. Test by forcing each one in eval.
“Tool descriptions don’t matter, the model is smart”	Tool descriptions are prompts. A vague `description: "handles data"` triggers selection errors that look like model bugs. The model is doing exactly what you described, just not what you meant.	Write tool descriptions as if briefing a new colleague: name the trigger condition, the inputs, the output shape, and the side effects.
“Agents are non-deterministic, so we can’t evaluate them”	The trajectory is non-deterministic, but the outcome on a fixed eval set is measurable. Same input → score the final answer and the trajectory shape (tool sequence, iteration count).	Evaluate three dimensions separately: task completion (did the user get the answer?), process quality (right tools in a reasonable order?), and safety (no out-of-scope actions).
“We added an LLM-judge eval, we’re done”	LLM-judge scores drift with the judge model version. Same agent, same answers, different judge release → different score. The production-agents series covers this trap in depth [pa-testing].	Pin the judge model version. Run a held-out human-labelled eval set quarterly to calibrate drift.
“The agent worked in dev but goes wild in prod”	Dev tasks are crafted to match the agent’s tools. Real users ask for things that map ambiguously onto the tool set: the agent picks the wrong tool, falls into a retry loop, or scope-creeps into “helpful” side actions.	Log every (user query → tool sequence → outcome) triple in prod. Eval against the real distribution. Add human-in-the-loop (HITL) escalation when confidence is low.
“We’ll add guardrails later”	Catastrophic actions (delete production data, send unauthorized email, charge a card) are irreversible. There’s no later.	Implement permission boundaries + confirmation prompts + audit logging before the agent has access to the destructive tool. Make the guardrail the precondition, not the patch.

Takeaway: agents fail in classes RAG doesn’t. Most are addressable with three structural fixes — explicit stop conditions, tool-count discipline, and guardrails outside the loop. The production-agents series is the operator-grade deep dive on those three.
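The last table row (permission boundaries, confirmation prompts, audit logging) can be sketched as a wrapper that sits between the loop and the tool. Everything here is hypothetical: the tool names, the `confirm` callback, and the in-memory `audit_log` are stand-ins for whatever your platform provides.

```python
from typing import Callable

DESTRUCTIVE_TOOLS = {"delete_record", "send_email"}  # hypothetical tool names
audit_log: list[dict] = []

class GuardrailTrip(Exception):
    """Raised when a call is blocked; the loop should treat this as terminal."""

def guarded_call(
    tool_name: str,
    tool_fn: Callable[..., object],
    arguments: dict,
    allowed_tools: set[str],
    confirm: Callable[[str, dict], bool],
):
    # 1. Permission boundary: the agent may only touch tools it was granted.
    if tool_name not in allowed_tools:
        audit_log.append({"tool": tool_name, "outcome": "denied"})
        raise GuardrailTrip(f"{tool_name} is not permitted for this agent")
    # 2. Confirmation: destructive tools need an explicit yes.
    if tool_name in DESTRUCTIVE_TOOLS and not confirm(tool_name, arguments):
        audit_log.append({"tool": tool_name, "outcome": "not_confirmed"})
        raise GuardrailTrip(f"{tool_name} was not confirmed")
    # 3. Execute and record.
    result = tool_fn(**arguments)
    audit_log.append({"tool": tool_name, "outcome": "executed"})
    return result

def send_email(to: str, body: str) -> str:
    """Mocked side effect."""
    return f"sent to {to}"

outcome = None
try:
    guarded_call(
        "send_email", send_email, {"to": "a@example.com", "body": "hi"},
        allowed_tools={"send_email"},
        confirm=lambda name, args: False,  # simulate a human declining
    )
except GuardrailTrip as trip:
    outcome = str(trip)
```

Because the wrapper lives outside the model's control, a persuasive prompt cannot talk its way past it; the model only ever sees the structured failure.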

Code Example

A minimal ReAct loop pinned to current OpenAI tool-calling semantics, with all three stop conditions wired in. The loop body is ~30 lines; the stop-condition logic is most of the value:

# Tested on:
#   openai==1.40.0
# Python 3.11
import json
from openai import OpenAI

client = OpenAI()

# 1. Tool schemas (OpenAI 1.x function-calling spec) --------------------------
tools = [
    {
        "type": "function",
        "function": {
            "name": "search_orders",
            "description": (
                "Search for orders by user email OR order ID. Returns a list of orders "
                "with fields: order_id, status, amount. Use when the user names a customer "
                "or order; do NOT use for shipping or refund status queries."
            ),
            "parameters": {
                "type": "object",
                "properties": {
                    "email": {"type": "string", "description": "User email address"},
                    "order_id": {"type": "string", "description": "Order ID"},
                },
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "check_refund",
            "description": (
                "Check refund processing status for a specific order. Returns "
                "{status, processed_date}. Requires an order_id from a prior search_orders call."
            ),
            "parameters": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
            },
        },
    },
]


# 2. Tool implementations (mocked) ---------------------------------------------
def search_orders(email=None, order_id=None):
    return [{"order_id": "456", "status": "refund_requested", "amount": 99}]


def check_refund(order_id):
    return {"status": "approved", "processed_date": "Dec 3"}


TOOL_REGISTRY = {"search_orders": search_orders, "check_refund": check_refund}


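def execute_tool(name: str, arguments: dict):
    """Run a registered tool; failures come back as {"_terminal_error": ...}."""
    fn = TOOL_REGISTRY.get(name)
    if fn is None:
        # stop #3: the model asked for a tool that does not exist
        return {"_terminal_error": f"Unknown tool: {name}"}
    try:
        return fn(**arguments)  # may be a dict or a list, depending on the tool
    except Exception as e:
        # any tool exception (including bad argument names) is treated as terminal
        return {"_terminal_error": str(e)}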


# 3. The agent loop with all three stop conditions ----------------------------
def run_agent(user_message: str, max_iterations: int = 5) -> dict:
    messages = [
        {"role": "system", "content": (
            "You are a customer-service agent. Use tools to gather information, "
            "then answer the user. If a tool fails twice with the same arguments, stop "
            "and report the failure — do not retry indefinitely."
        )},
        {"role": "user", "content": user_message},
    ]

    for iteration in range(max_iterations):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
            tools=tools,
        )
        msg = resp.choices[0].message

        # stop #1 — Finish[answer]: model emitted no tool calls → answer is ready
        if not msg.tool_calls:
            return {"answer": msg.content, "iterations": iteration + 1, "stop": "finish"}

        messages.append(msg)
        for tool_call in msg.tool_calls:
            name = tool_call.function.name
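            try:
                args = json.loads(tool_call.function.arguments)
            except json.JSONDecodeError as e:
                # malformed arguments from the model are treated as terminal (stop #3)
                return {"answer": None, "error": f"Bad tool arguments: {e}", "stop": "tool_error"}
            result = execute_tool(name, args)

            # stop #3: terminal tool error, return structured failure, do not retry.
            # The isinstance check matters because search_orders returns a list.
            if isinstance(result, dict) and "_terminal_error" in result:
                return {"answer": None, "error": result["_terminal_error"], "stop": "tool_error"}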

            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": json.dumps(result),
            })

    # stop #2 — max_iterations: return partial state for HITL review
    return {"answer": None, "stop": "max_iterations", "messages": messages}


result = run_agent("What's the refund status for alice@example.com?")
print(result)

The three return paths in run_agent are the three stop conditions: two inside the loop (finish, tool error) and one after it (max_iterations). Strip any of them and you have a runaway-loop incident waiting for a hard query. Production agents bolt cost-tracking, idempotency keys, and durable checkpointing onto this skeleton; these are covered in the production-agents series [pa-overview].

Verify Your Understanding

Before continuing, you should be able to answer these from memory:

Pipeline vs agent in one sentence each. Then name a task where a pipeline beats an agent on reliability and cost, and a task where the reverse is true.
Walk the agent loop. Observe → Think → Act → Evaluate. What does the model decide at each step? Where does the prompt show up? Where does the tool schema show up?
Name the three stop conditions. For each, give the symptom that fires it and the structured response the loop returns. What goes wrong if you skip stop #2?
The 30-tools mistake. Why does a 30-tool agent typically perform worse than a 3-tool agent? Explain at the level of the tool-selection prompt.
Diagnose a runaway. Your agent kept calling search_orders 47 times against an empty-result observation. Which of the three stop conditions is missing, what code change fixes it, and what guardrail outside the loop catches the residual cost risk?

What’s Next

Agents make the data path bigger and more dangerous. The next chapter — Agents → Evaluation — covers the three eval dimensions (task completion · process quality · safety), the LLM-judge drift trap, and how to build a test suite that survives a model upgrade.

References

[yao2022] Yao, S. et al. ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023. arXiv:2210.03629. Named the Thought / Action / Observation pattern; introduced the Finish[answer] terminal action. Benchmark gains: +34pp on ALFWorld and +10pp on WebShop vs prior imitation/RL baselines. Cited in §§ Building On Previous Knowledge, The ReAct Pattern, Stop Conditions.
[openai-tools] OpenAI. Function calling and tools guide. platform.openai.com/docs/guides/function-calling. Canonical tool-schema, tool_calls array, and role: "tool" message conventions used in the Code Example. Cited in § Tools: The Agent’s Capabilities.
[anthropic-tools] Anthropic. Tool use with Claude. docs.claude.com/en/docs/agents-and-tools/tool-use. Equivalent tool-calling pattern for Claude — same loop, slightly different schema. Cited in § Tools: The Agent’s Capabilities.
[pa-overview] Production Agents — Part 0: Overview. Cross-series bridge: the operator-grade companion to this chapter. Cited in § Code Example.
[pa-cost] Production Agents — Part 4: Cost Control & Token Budgets. Operationalises stop condition #2 (max_iterations) as a budget enforcement layer. Cited in § The ReAct Pattern — Stop Conditions.
[pa-durable] Production Agents — Part 6: Durable Execution. Temporal / Inngest / Restate patterns that turn the loop body into a crash-safe workflow. Cited in § The ReAct Pattern — Stop Conditions.
[pa-testing] Production Agents — Part 8: Testing & Evaluation. LLM-judge drift trap and golden-dataset patterns referenced in the Common Pitfalls table. Cited in § Common Pitfalls & Misconceptions.

Go Deeper: Production Agents

This article covers the agent mental model. For production patterns (idempotency, checkpointing, HITL, cost control), see the Production Agents Deep Dive series:

Part	Topic	What You’ll Learn
0	Overview	Why 98% of orgs haven’t deployed agents at scale
1	Idempotency	Safe retries, the Stripe pattern
2	State & Memory	Checkpointing, memory systems
3	Human-in-the-Loop	Confidence routing, escalation
4	Cost Control	Token budgets, circuit breakers
5	Observability	Silent failure detection
6	Durable Execution	Temporal, Inngest, Restate
7	Security	Sandboxing, prompt injection
8	Testing	Golden datasets, evaluation