
The Agent Loop Is a Lie

You’ve seen this diagram:

[Diagram: THE AGENT LOOP — Observe → Think → Act]

Elegant. Clean. Fits on a slide.

It’s also a lie.

Not because it’s wrong — it’s a fine abstraction for the happy path. The lie is that tutorials present it as complete. It’s not. It’s maybe 20% of what a production agent does. The other 80% is everything that happens when things go wrong.

What Actually Happens

The loop says: OBSERVE → THINK → ACT.

Production says something much messier.

Here’s a real trace. A user asked “Book me a flight to NYC tomorrow,” and seven things happened that the loop doesn’t model:

  1. Partial observation — missing departure city
  2. Inference under uncertainty — assumed SF, was wrong about airport
  3. Timeout and retry — API didn’t respond
  4. Mid-task correction — user wanted LGA, not JFK
  5. External state change — flight sold out during conversation
  6. Multi-step sequences — not single atomic actions
  7. Recovery and alternatives — fallback to 9am flight

The loop was a starting point. What we needed was a mess of retries, corrections, and fallbacks.

The Four Gaps

Gap 1: Observation Is Partial

The loop implies you observe, then have full state. Reality: you never have full state.

# Tutorial version
state = agent.observe()  # Returns complete state

# Production
state = agent.observe()  # Always partial

# The real question: do I ask or infer?
if confidence(state) < THRESHOLD:
    if cost_of_asking < cost_of_being_wrong:
        state = agent.ask_for_clarification()
    else:
        state = agent.infer_from_context()
        state.mark_inferred()  # Track what we guessed

The decision to ask vs. infer is a judgment call with real tradeoffs. Asking is slow and annoying. Inferring is fast and risky. The loop pretends this decision doesn’t exist.

Gap 2: Thinking Produces Plans, Not Actions

The loop implies: think once, get one action. Reality: thinking produces a strategy with fallbacks.

# Tutorial
action = agent.think(state)
agent.execute(action)

# Production: think returns a plan
plan = agent.think(state)

# Plan structure:
# - primary: first thing to try
# - fallback: if primary fails
# - timeout: how long to wait
# - retry_policy: how to retry
# - abort_conditions: when to give up
# - human_trigger: when to escalate

result = execute_with_recovery(plan)

The “think” step doesn’t produce an action. It produces a strategy for what to do when things go wrong.
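execute_with_recovery is where the unglamorous work lives. Here is a minimal sketch of one shape it could take, assuming the plan fields above plus two hypothetical helpers (try_with_timeout and escalate_to_human):

import time

def execute_with_recovery(plan):
    result = None
    for attempt in range(plan.retry_policy.max_attempts):
        result = try_with_timeout(plan.primary, plan.timeout)  # hypothetical helper
        if result.success:
            return result
        if plan.abort_conditions.met(result):
            return escalate_to_human(plan.human_trigger, result)  # hypothetical helper
        time.sleep(plan.retry_policy.backoff(attempt))

    # Primary exhausted: try the fallback once before escalating
    if plan.fallback:
        result = try_with_timeout(plan.fallback, plan.timeout)
        if result.success:
            return result
    return escalate_to_human(plan.human_trigger, result)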

Gap 3: Actions Have Messy Outcomes

The loop implies: action succeeds or fails, back to observe. Reality: actions have partial success, side effects, and delayed consequences.

# Tutorial
if result.success:
    continue
else:
    handle_error()

# Production: four kinds of messy outcomes

# 1. Partial success
if result.partial:
    # Flight booked, but seat not assigned yet
    # Email sent, but delivery not confirmed
    agent.track_pending(result.pending_items)

# 2. Side effects
if result.side_effects:
    # Booking consumed travel credit
    # API call triggered rate limiting
    agent.update_world_model(result.side_effects)

# 3. Deferred results
if result.deferred:
    # Confirmation comes in 24-48 hours
    # Human approval required
    agent.schedule_followup(result.callback)

# 4. Pending operations (critical!)
# Track what you're WAITING FOR, not just what you've DONE
if result.pending:
    # Email sent, waiting for CRM to sync
    # Payment initiated, waiting for confirmation
    agent.track_awaiting(result.pending_items, timeout=300)

Actions aren’t atomic. They’re transactions that might half-succeed, trigger side effects, or need followup later.

The 47 emails bug: An email agent observed “no response,” sent a follow-up, checkpointed, then observed “no response” again (email takes time to sync to CRM). It sent 47 follow-ups before someone noticed. The fix: track pending state — what you’re waiting for, not just what you’ve done.
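A sketch of that fix, with the guard applied before the action. The helpers here (send_followup, contact.id) are illustrative, not a specific CRM API:

def maybe_send_followup(agent, contact):
    pending = agent.state.setdefault('pending', {})
    key = f"followup:{contact.id}"
    if key in pending:
        return  # A follow-up is already in flight; the CRM just hasn't synced

    send_followup(contact)  # illustrative send helper
    # Record what we're WAITING FOR, with a deadline to re-check
    pending[key] = {'sent_at': now(), 'timeout': 300}
    agent.checkpoint()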

Gap 4: The Loop Is 20% of the Agent

The loop is the core reasoning cycle. But production agents need infrastructure around it:

WHAT THE LOOP COVERS          WHAT PRODUCTION NEEDS
─────────────────────         ─────────────────────
 Observe                      Observe
 Think                        Think
 Act                          Act
                              + Retry with backoff
                              + Timeout handling
                              + Partial failure recovery
                              + State persistence
                              + Checkpoint/resume
                              + Idempotency
                              + Human escalation
                              + Audit logging
                              + Replay for debugging

The loop is necessary but not sufficient.
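Much of that right-hand column is small wrapper code. Retry with backoff, for instance, fits in a dozen lines. A minimal sketch, where TransientError stands in for whatever transient-failure exception your client raises:

import random
import time

def retry_with_backoff(fn, max_attempts=3, base_delay=1.0):
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:  # assumed: your transient-failure exception type
            if attempt == max_attempts - 1:
                raise  # Out of attempts; let the caller escalate
            # Exponential backoff with jitter to avoid thundering herds
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 1))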

The Real Hard Problems

The reviewers are right: I’ve been diagnosing without prescribing. Here are the actual problems you’ll hit and how to solve them.

Problem 1: Idempotency

This is the killer. If book_flight() times out and you retry, do you get two bookings?

# Wrong: retry blindly
def book_flight(flight_id):
    return api.book(flight_id)  # Might create duplicate

# Right: idempotency key
def book_flight(flight_id, idempotency_key):
    # Same key = same result, no duplicate
    return api.book(flight_id, idempotency_key=idempotency_key)

# Generate key from task context (stable identifiers only!)
# Bad:  f"{user_id}:{timestamp}" - different on retry
# Bad:  f"{user_id}:{retry_count}" - different on retry
# Good: f"{user_id}:{task_id}:{step.id}" - same across retries
key = f"{user_id}:{task_id}:{step.id}"

Every external action needs an idempotency strategy:

  • API calls: Use idempotency keys if supported
  • Emails: Check if already sent before sending
  • Database writes: Use upserts or check-then-write
  • Payments: Always use idempotency keys
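For emails, for example, “check if already sent” is a check-then-write against your own log. A sketch, with a generic key-value store standing in for your database:

def send_email_once(db, email, idempotency_key):
    # Consult our own sent log before touching the provider
    existing = db.get('sent_emails', idempotency_key)  # generic KV interface
    if existing is not None:
        return existing

    result = email_provider.send(email)  # illustrative provider client
    db.put('sent_emails', idempotency_key, result)
    return result

A unique constraint on the key closes the remaining race between the check and the write.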

Problem 2: State Persistence

Agent crashes mid-task. User closes browser. Server restarts. What happens to the booking?

# Wrong: state lives in memory
class Agent:
    def __init__(self):
        self.state = {}  # Lost on crash

# Right: checkpoint to durable storage
class Agent:
    def __init__(self, task_id, storage):
        self.task_id = task_id
        self.storage = storage
        self.state = storage.load(task_id) or {'completed': []}

    def checkpoint(self):
        self.storage.save(self.task_id, self.state)

    def execute_step(self, step):
        # Checkpoint BEFORE execution (mark intent)
        self.state['in_progress'] = step.id
        self.checkpoint()

        # Execute
        result = step.run()

        # Checkpoint AFTER execution (mark completion)
        self.state['completed'].append(step.id)
        del self.state['in_progress']
        self.state['last_result'] = result
        self.checkpoint()
        return result

    def resume(self):
        """On restart, check if we crashed mid-step"""
        if 'in_progress' in self.state:
            step_id = self.state['in_progress']
            # Check if the step actually completed (idempotent read)
            if self.check_step_completed(step_id):
                self.state['completed'].append(step_id)
                del self.state['in_progress']
                self.checkpoint()
            # Otherwise, re-execute the step with the same idempotency key

Why checkpoint before execution? If you crash between step.run() and checkpoint(), you don’t know if the step ran. Checkpointing intent first lets you detect and handle this on resume.

Caveat: For true reliability, ensure your storage supports atomic writes. If checkpoint() can partially fail, you need write-ahead logging or versioned checkpoints.
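On plain files, the usual way to make a checkpoint atomic is write-to-temp-then-rename. A minimal sketch, assuming local-disk checkpoints and JSON-serializable state:

import json
import os

def atomic_checkpoint(path, state):
    tmp = path + '.tmp'
    with open(tmp, 'w') as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # Durable on disk before we swap it in
    # Atomic on POSIX and Windows: readers see the old file or the new, never half
    os.replace(tmp, path)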

Problem 3: Replay for Debugging

Agent failed last night. How do you figure out why?

# Wrong: print statements, hope for the best
def think(state):
    print(f"Thinking about {state}")  # Lost forever
    return decide(state)

# Right: structured audit log
def think(self, state):
    decision = decide(state)
    self.audit.log({
        'timestamp': now(),
        'task_id': self.task_id,
        'phase': 'think',
        'input_state': state,
        'decision': decision,
        'reasoning': decision.explanation,
        'confidence': decision.confidence
    })

    return decision

# Now you can replay:
# 1. Load audit log for failed task
# 2. See exact state at each step
# 3. Reproduce the failure deterministically

Good agent frameworks give you replay for free. If yours doesn’t, build it.
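Replay itself can be a small loop over that log. A sketch, reusing the audit and decide interfaces from above:

def replay(task_id):
    for entry in audit.load(task_id):  # assumed: reads the records written above
        if entry['phase'] != 'think':
            continue
        # Re-run the decision on the exact recorded input
        replayed = decide(entry['input_state'])
        if replayed != entry['decision']:
            print(f"Divergence at {entry['timestamp']}: "
                  f"logged {entry['decision']}, replayed {replayed}")

One caveat: this is only deterministic if decide is. With an LLM in the loop, log the raw model output too, so replay can substitute it instead of re-sampling.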

Problem 4: Human Escalation (Not Failure)

Human-in-the-loop isn’t a fallback for when the agent fails. It’s a feature for when judgment is needed.

# Wrong: humans are error handlers
except AgentStuck:
    notify_human("Agent failed, please help")

# Right: humans are decision makers
class EscalationPolicy:
    def should_escalate(self, decision):
        # High stakes? Get human approval.
        if decision.involves_payment and decision.amount > 500:
            return True

        # Low confidence? Ask human.
        if decision.confidence < 0.7:
            return True

        # Irreversible? Double-check.
        if not decision.reversible:
            return True

        return False

# In the agent loop:
decision = agent.think(state)
if escalation_policy.should_escalate(decision):
    decision = await human.review(decision)  # Human approves/modifies
result = agent.execute(decision)

Design for human collaboration from day one. The agent prepares options; the human makes judgment calls.

Scaling caveat: Human escalation doesn’t scale linearly. As agent volume grows, human bandwidth doesn’t. At one fintech, month 1 looked great — humans approved/rejected thoughtfully. By month 6, the approval queue was 200 items deep and humans were rubber-stamping everything. Month 9: a $50K fraud slipped through.

The fix: don’t escalate 100% of edge cases. Use sampling — escalate a random 10-20% of borderline decisions. Humans become the audit mechanism, not the last line of defense. Track override rates and adjust thresholds based on what humans actually catch.

The Minimal Production Agent

Here’s what you actually need. Not a diagram — working structure.

class ProductionAgent:
    def __init__(self, task_id, storage, audit, escalation):
        self.task_id = task_id
        self.storage = storage
        self.audit = audit
        self.escalation = escalation
        self.state = storage.load(task_id) or initial_state()

    async def run(self):
        while not self.state.complete:
            try:
                # 1. Observe (with partial state handling)
                observation = await self.observe()
                if observation.needs_clarification:
                    await self.ask_user(observation.questions)
                    continue

                # 2. Think (produces plan, not action)
                plan = self.think(observation)
                self.audit.log('plan', plan)

                # 3. Check if human needed
                if self.escalation.should_escalate(plan):
                    plan = await self.escalation.get_approval(plan)

                # 4. Execute with recovery
                for step in plan.steps:
                    result = await self.execute_with_retry(
                        step,
                        idempotency_key=self.make_key(step),
                        timeout=plan.timeout,
                        retries=plan.retries
                    )

                    self.state.record(step, result)
                    self.storage.checkpoint(self.state)  # Persist

                    if result.failed and not plan.fallback:
                        return await self.escalation.handoff(
                            "Step failed, no fallback",
                            self.state
                        )

            except Exception as e:
                self.audit.log('error', e, self.state)
                return await self.escalation.handoff(str(e), self.state)

        return self.state.result

~50 lines. Handles:

  • Partial observation → ask for clarification
  • Plans with fallbacks → not single actions
  • Human escalation → built-in, not bolted-on
  • Idempotency → key per step
  • Persistence → checkpoint after each step
  • Audit → log for replay
  • Recovery → retry with backoff, fallback, or escalate

What Changes Tomorrow

If you’re building agents:

1. Add idempotency keys to every external action. This is the single highest-leverage fix. Without it, retries create duplicates.

2. Checkpoint state after each step. Use Redis, Postgres, S3, whatever. Just don’t keep state in memory only.

3. Log decisions, not just actions. When debugging, you need to know why the agent did something, not just what it did.

4. Design human escalation as a feature. Define when to escalate before things go wrong. Not “agent failed,” but “agent needs judgment.”

5. Build replay from day one. You will need to debug a failed agent run. Make sure you can reproduce it.

6. Set timeouts on everything. Every API call, every LLM call, every user wait. Without timeouts, agents hang forever and resources leak.
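In async Python this is usually one wrapper around asyncio.wait_for. A minimal sketch; api.book is illustrative:

import asyncio

async def call_with_timeout(coro, seconds):
    # Wrap every external await so nothing can hang forever
    return await asyncio.wait_for(coro, timeout=seconds)

# Usage:
#   result = await call_with_timeout(api.book(flight_id), seconds=10)
# asyncio.TimeoutError propagates, so the retry/escalation path sees it.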

The loop is a fine abstraction. It’s just not complete. Wrap it in recovery, persistence, and human collaboration. Then you have a production agent.


Appendix: Architecture Diagrams

What Tutorials Leave Out

This is what a production agent actually needs — the orchestrator, error handling, state management, and human handoff that tutorials skip:

[Diagram: Production Agent Components]

The Full Production View

[Diagram: PRODUCTION AGENT ARCHITECTURE]

The loop lives inside REASONING. Everything else makes it production-ready.


Go Deeper: Production Agents Series

This post covers the “what.” The deep dive series covers the “how.”

Part  Topic              What You’ll Learn
────  ─────────────────  ─────────────────────────────────────────────────────────────
0     Overview           Why 98% of orgs haven’t deployed agents at scale
1     Idempotency        Safe retries, the Stripe pattern, cascading failure prevention
2     State Persistence  Checkpointing, LangGraph patterns, hybrid memory
3     Human-in-the-Loop  Confidence routing, scaling without rubber-stamping
4     Cost Control       Token budgets, circuit breakers, model routing
5     Observability      Silent failure detection, semantic monitoring
6     Durable Execution  Temporal, Inngest, Restate, AWS/Azure/GCP offerings
7     Security           Sandboxing levels, prompt injection defense
8     Testing            Simulation-based testing, evaluation metrics