Production Agents Overview - The Loop Is 20% of the Work

Context: This is Part 0 of the Production Agents Deep Dive series — the hub. For a quick introduction, read The Agent Loop Is a Lie first.

Figure: The loop is 20% of the work. Left half: the observe-think-act loop sits alone with no surrounding gates, labelled 'tutorial view (20% of the work)', and is surrounded by red sparks marking 'double charge', '3am pager', '$10K bill', 'silent failure', 'data exfil', 'lost state'. Right half: the same loop is surrounded by eight production gates (idempotency, state persistence, HITL, cost control, observability, durable execution, security, testing), labelled 'production view (the other 80%)'. A divider in the middle reads 'the loop alone → the loop + eight gates.'

Why This Matters

You’ve built an agent. It works in development. It demos beautifully. You deploy it.

Then it:

Books the same flight twice when the API times out
Loses all progress when a user closes their browser
Sends 47 follow-up emails because it didn’t know it was waiting for a response
Burns through your monthly API budget in 3 hours
Does the wrong thing without crashing — and you don’t find out until a customer complains

The problem isn’t your agent’s reasoning. It’s everything around the reasoning that tutorials don’t teach.

What other content gets wrong: every “agent in 50 lines” tutorial shows the observe-think-act loop and stops. The loop is correct — and it’s about 20% of what production demands. The other 80% is idempotency, state persistence, human-in-the-loop, cost control, observability, durable execution, security, and testing. This series covers each one as a depth chapter, with the failure modes the framework docs gloss over.

Takeaway: tutorials optimise for understanding; production optimises for reliability. The capabilities below are what close the gap.

What Goes Wrong Without This:

PRODUCTION FAILURE PATTERNS

Symptom: Agent works in dev, fails in production.
Cause:   Dev has no timeouts, no crashes, no concurrent users.
         Production has all of these. Your agent wasn't built for them.

Symptom: Costs spiral out of control after launch.
Cause:   Agents consume 5-20x more tokens than simple chains.
         Without budgets and circuit breakers, loops run forever.

Symptom: Agent "completes" tasks but users complain about wrong results.
Cause:   Agents fail silently. Traditional monitoring misses semantic errors.
         You're tracking latency, not correctness.

Takeaway: every symptom here has a named capability that prevents it. Tutorials don’t teach those capabilities; this series does.

The Eight Capabilities Tutorials Skip

Every tutorial teaches observe-think-act. Here’s what they leave out:

THE AGENT LOOP

   ┌───────────┐
┌─▶│  OBSERVE  │ ← environment
│  └─────┬─────┘
│        │
│        ▼
│  ┌───────────┐
│  │   THINK   │ ← reasoning
│  └─────┬─────┘
│        │
│        ▼
│  ┌───────────┐
│  │    ACT    │ → execute
│  └─────┬─────┘
│        │
└────────┘
  (repeat)

PRODUCTION VIEW

┌──────────────────────────────────────────────────────────────────────┐
│                   PRODUCTION VIEW                                    │
│                                                                      │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐                │
│  │ IDEMPOTENCY  │  │    STATE     │  │    HUMAN     │                │
│  │              │  │ PERSISTENCE  │  │  ESCALATION  │                │
│  │ Safe retries │  │ Checkpoints  │  │  Judgment    │                │
│  │ No duplicates│  │ Recovery     │  │  Approval    │                │
│  └──────────────┘  └──────────────┘  └──────────────┘                │
│         │                 │                 │                        │
│         └─────────────────┼─────────────────┘                        │
│                           │                                          │
│                           ▼                                          │
│                    ┌─────────────┐                                   │
│                    │  THE LOOP   │                                   │
│                    │  (20% of    │                                   │
│                    │   the work) │                                   │
│                    └─────────────┘                                   │
│                           │                                          │
│         ┌─────────────────┼─────────────────┐                        │
│         │                 │                 │                        │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐                │
│  │    COST      │  │ OBSERVABILITY│  │   SECURITY   │                │
│  │   CONTROL    │  │              │  │              │                │
│  │ Token budgets│  │ Silent fail  │  │ Sandboxing   │                │
│  │ Circuit break│  │ detection    │  │ Prompt inject│                │
│  └──────────────┘  └──────────────┘  └──────────────┘                │
└──────────────────────────────────────────────────────────────────────┘

1. Idempotency

The problem: Agent calls book_flight(). API times out. Agent retries. Customer gets charged twice.

The solution: Every action with side effects needs an idempotency key. Same key = same result, no duplicates.

# Bad: Retry creates duplicate
result = api.book(flight_id)

# Good: Retry returns same result
result = api.book(flight_id, idempotency_key=f"{user_id}:{task_id}:{step_id}")
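To make the "same key = same result" contract concrete, here is a minimal sketch of the server side of that contract. `FakeBookingAPI` and `make_key` are hypothetical names invented for illustration; a real booking API would store keys durably and expire them.

```python
import uuid


class FakeBookingAPI:
    """Hypothetical stand-in for a booking service that honours idempotency keys."""

    def __init__(self):
        self._results = {}  # idempotency_key -> stored response
        self.charges = 0    # how many times the customer was actually charged

    def book(self, flight_id, idempotency_key):
        # A repeated key returns the stored response instead of booking again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges += 1
        response = {"flight_id": flight_id, "confirmation": uuid.uuid4().hex}
        self._results[idempotency_key] = response
        return response


def make_key(user_id, task_id, step_id):
    # Built from stable identifiers only; a timestamp would change on retry.
    return f"{user_id}:{task_id}:{step_id}"


api = FakeBookingAPI()
key = make_key("user-1", "task-42", "step-3")
first = api.book("FL123", idempotency_key=key)
retry = api.book("FL123", idempotency_key=key)  # e.g. after a client-side timeout
```

In this sketch the retry returns the first confirmation and `api.charges` stays at 1, which is the behaviour the agent relies on when it cannot tell whether a timed-out call went through.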

Deep dive: Part 1: Idempotency & Safe Retries

2. State Persistence

The problem: Agent crashes mid-task. User closes browser. Server restarts. All progress lost.

The solution: Checkpoint state after every significant step. Resume from last checkpoint on restart.

# Bad: State in memory
self.state = {}  # Lost on crash

# Good: Checkpoint to durable storage
self.state['in_progress'] = step.id
self.checkpoint()  # Survives crashes
result = step.run()
self.state['completed'].append(step.id)
self.checkpoint()
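The checkpoint-then-run pattern above can be sketched end to end. This is a minimal illustration that assumes a local JSON file as the durable store; `CheckpointedTask` and its file layout are hypothetical, and a production system would more likely write to a database.

```python
import json
import os
import tempfile


class CheckpointedTask:
    """Runs named steps and records progress in a JSON file (illustrative only)."""

    def __init__(self, path):
        self.path = path
        self.state = {"in_progress": None, "completed": []}
        if os.path.exists(path):
            with open(path) as f:
                self.state = json.load(f)  # resume from the last checkpoint

    def checkpoint(self):
        # Write to a temp file, then rename, so a crash never leaves a
        # half-written checkpoint behind.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump(self.state, f)
        os.replace(tmp, self.path)

    def run(self, steps):
        """steps: list of (step_id, callable) pairs."""
        for step_id, fn in steps:
            if step_id in self.state["completed"]:
                continue  # finished before the restart
            self.state["in_progress"] = step_id
            self.checkpoint()  # checkpoint BEFORE execution
            fn()
            self.state["completed"].append(step_id)
            self.state["in_progress"] = None
            self.checkpoint()
```

A step that is still marked `in_progress` after a restart may or may not have performed its side effect, so this sketch simply runs it again. That is only safe when the step itself is idempotent.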

Deep dive: Part 2: State Persistence & Checkpointing

3. Human-in-the-Loop

The problem: Agent makes $50K decision autonomously. It’s wrong. No one reviewed it.

The solution: Escalate to humans for high-stakes, low-confidence, or irreversible decisions. Not as a fallback — as a feature.

# Bad: Humans are error handlers
try:
    result = agent.run(task)  # illustrative call; any agent entry point
except AgentFailed:
    notify_human("Help!")

# Good: Humans are decision makers
if decision.confidence < 0.7 or decision.amount > 500:
    decision = await human.review(decision)
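The three escalation triggers named above (high-stakes, low-confidence, irreversible) can be expressed as one routing function. This is a sketch under assumptions: the `Decision` fields and the `needs_review` helper are hypothetical, and the thresholds simply reuse the 0.7 and 500 from the snippet above.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    action: str
    amount: float
    confidence: float
    reversible: bool = True


def needs_review(decision, min_confidence=0.7, max_amount=500):
    """Return the reasons a decision should go to a human.

    An empty list means the agent may proceed on its own.
    """
    reasons = []
    if decision.confidence < min_confidence:
        reasons.append("low confidence")
    if decision.amount > max_amount:
        reasons.append("high stakes")
    if not decision.reversible:
        reasons.append("irreversible")
    return reasons
```

Returning reasons rather than a bare boolean gives the reviewer context and lets you track which trigger fires most often.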

Deep dive: Part 3: Human-in-the-Loop Patterns

4. Cost Control

The problem: Agent enters loop. Loop calls LLM. LLM responds. Loop continues. You wake up to a $10K bill.

The solution: Token budgets per task, circuit breakers for loops, max step limits.

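class BudgetExceeded(Exception):
    """Raised when a task would exceed its token budget."""


class TokenBudget:
    def __init__(self, max_tokens=50000):
        self.max = max_tokens
        self.used = 0

    def check(self, tokens):
        # Reject the call before spending, so the budget is never overshot.
        if self.used + tokens > self.max:
            raise BudgetExceeded()
        self.used += tokens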
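A token budget covers only one of the three controls listed in the solution. The other two, a max step limit and a circuit breaker for loops, might look like the sketch below. `LoopGuard` and its exceptions are hypothetical names, and "the same action several times in a row" is just one simple definition of a loop.

```python
class StepLimitExceeded(Exception):
    pass


class LoopDetected(Exception):
    pass


class LoopGuard:
    def __init__(self, max_steps=25, max_repeats=3):
        self.max_steps = max_steps
        self.max_repeats = max_repeats
        self.steps = 0
        self.last_action = None
        self.repeats = 0

    def record(self, action):
        """Call once per loop iteration, before executing the action."""
        self.steps += 1
        if self.steps > self.max_steps:
            raise StepLimitExceeded(f"more than {self.max_steps} steps")
        if action == self.last_action:
            self.repeats += 1
        else:
            self.last_action = action
            self.repeats = 1
        if self.repeats > self.max_repeats:
            raise LoopDetected(f"{action!r} repeated {self.repeats} times in a row")
```

With `max_repeats=3`, the fourth consecutive `send_email` raises `LoopDetected`, which would have stopped the 47-email failure early.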

Deep dive: Part 4: Cost Control & Token Budgets

5. Observability

The problem: Agent completes task. User says result is wrong. You check logs. Latency was fine. No errors. What happened?

The solution: Track tool selection, reasoning traces, confidence scores. Detect semantic failures, not just crashes.

# Bad: Traditional monitoring
metrics.record_latency(response_time)

# Good: Agent-specific observability
audit.log({
    'tool_selected': decision.tool,
    'alternatives_considered': decision.alternatives,
    'confidence': decision.confidence,
    'reasoning': decision.chain_of_thought
})
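One way to act on those audit records is to scan them for runs that raised no error but still look wrong. The sketch below is illustrative: `audit_line` and `flag_suspect` are hypothetical helpers, the field names follow the snippet above, and the two checks (unexpected tool, low confidence) are deliberately simple stand-ins for real semantic checks.

```python
import json


def audit_line(task_id, tool, alternatives, confidence, reasoning):
    """Serialise one decision as a JSON line."""
    return json.dumps({
        "task_id": task_id,
        "tool_selected": tool,
        "alternatives_considered": alternatives,
        "confidence": confidence,
        "reasoning": reasoning,
    })


def flag_suspect(lines, expected_tools, min_confidence=0.5):
    """Return (task_id, reasons) for records that look semantically wrong."""
    flagged = []
    for line in lines:
        rec = json.loads(line)
        reasons = []
        if rec["tool_selected"] not in expected_tools:
            reasons.append("unexpected tool")
        if rec["confidence"] < min_confidence:
            reasons.append("low confidence")
        if reasons:
            flagged.append((rec["task_id"], reasons))
    return flagged
```

Neither flagged condition would show up in a latency or error-rate dashboard, which is the point of logging decisions rather than only timings.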

Deep dive: Part 5: Observability & Silent Failures

6. Security

The problem: Agent reads email. Email contains prompt injection. Agent follows injected instructions. Data exfiltrated.

The solution: Sandbox tool execution. Validate inputs and outputs. Match isolation level to risk.

SECURITY ISOLATION LEVELS

Low risk (RAG, search):       Hardened containers
Medium risk (code execution): gVisor / GKE Sandbox
High risk (financial):        Firecracker MicroVMs
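Sandboxing is an infrastructure choice, but the "validate inputs" half of the solution can be sketched in a few lines. The example assumes a tool-name allow-list and an egress host allow-list. `ALLOWED_TOOLS`, `ALLOWED_HOSTS`, and both functions are hypothetical, and a real validator would also check argument schemas and tool outputs.

```python
from urllib.parse import urlparse

# Hypothetical allow-lists; real values depend on your deployment.
ALLOWED_TOOLS = {"search_flights", "http_get"}
ALLOWED_HOSTS = {"api.example.com"}


def egress_allowed(url, allowed_hosts=ALLOWED_HOSTS):
    """True only for HTTPS URLs whose exact hostname is on the allow-list."""
    parsed = urlparse(url)
    return parsed.scheme == "https" and parsed.hostname in allowed_hosts


def validate_tool_call(name, args):
    """Return a list of problems; an empty list means the call may proceed."""
    problems = []
    if name not in ALLOWED_TOOLS:
        problems.append(f"tool {name!r} is not allowed")
    url = args.get("url")
    if url is not None and not egress_allowed(url):
        problems.append(f"egress to {url!r} is not allowed")
    return problems
```

Matching on the parsed hostname rather than a substring means look-alike URLs such as `https://api.example.com.attacker.net/` are rejected.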

Deep dive: Part 7: Security & Sandboxing

7. Durable Execution

The problem: Idempotency, retries, checkpointing, exactly-once delivery — you can hand-roll all of them, badly. Or you can use a framework that’s been battle-tested.

The solution: pick one durable-execution framework (Temporal · Inngest · Restate · AWS Step Functions · Azure Durable Functions) by deployment shape, not slogan. The framework gives you retry policies, automatic checkpoints, signal-based HITL, and replay determinism behind a single decorator.

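# Temporal Python SDK: retry policy for activities called from a workflow
from datetime import timedelta

from temporalio.common import RetryPolicy

retry_policy = RetryPolicy(
    initial_interval=timedelta(seconds=1),  # wait before the first retry
    maximum_attempts=5,                     # give up after five attempts
    non_retryable_error_types=["ValidationError"],  # fail fast on bad input
)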

Deep dive: Part 6: Durable Execution Frameworks

8. Testing & Evaluation

The problem: same input, different output (LLMs are non-deterministic). The judge model silently updates and your test suite drifts even though your agent didn’t change.

The solution: testing pyramid (many unit tests · some integration with mocked LLM · few real-LLM E2E) + golden dataset + behavioural assertions + drift detector. Pin every judge model; cross-grade with a second provider; re-grade a control sample so you can tell agent change from judge change.

# Behavioural assertion instead of exact-match — survives non-determinism
result = agent.process(test_case.input)
assert result.intent in ["book_flight", "search_flights"]  # bounded set
assert "NYC" in result.entities.values()                   # required entity
assert len(result.actions) <= 10                           # no infinite loop
assert result.tokens_used < 50000                          # within budget
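The control-sample idea can be sketched as a small comparison. It assumes judge grades are numbers in [0, 1] and that the control set is a frozen batch of agent outputs graded again on each run. The `diagnose` function and the 0.05 tolerance are hypothetical.

```python
from statistics import mean


def diagnose(control_before, control_after, live_before, live_after, tolerance=0.05):
    """Attribute a score shift to the judge or to the agent.

    Control outputs are frozen, so only a change in the judge can move their scores.
    """
    judge_shift = abs(mean(control_after) - mean(control_before))
    live_shift = abs(mean(live_after) - mean(live_before))
    if judge_shift > tolerance:
        return "judge drift"
    if live_shift > tolerance:
        return "agent change"
    return "stable"
```

If the frozen control set scores differently this week than last week, the test suite moved, not the agent, and live results should not be trusted until the judge is re-pinned.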

Deep dive: Part 8: Testing & Evaluation

Takeaway: the eight capabilities aren’t optional add-ons — they’re the price of admission. Each chapter is a deep dive into one capability with the failure modes most public coverage misses.

When to Use Agents vs Pipelines

Not every problem needs an agent. Here’s how to decide:

DECISION MATRIX: PIPELINE vs AGENT

┌──────────────────────────────────────────────────────────────────────┐
│                    DECISION MATRIX                                   │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  USE A PIPELINE WHEN:              USE AN AGENT WHEN:                │
│                                                                      │
│  • Steps are fixed and known       • Steps depend on results         │
│  • Input → Output is predictable   • Need to adapt to surprises      │
│  • Failures are simple (retry/fail)• Failures need judgment          │
│  • No external state changes       • Actions have side effects       │
│  • Speed > flexibility             • Flexibility > speed             │
│                                                                      │
│  Examples:                         Examples:                         │
│  • RAG (retrieve → generate)       • Customer support (varies)       │
│  • Summarization                   • Code generation (iterative)     │
│  • Classification                  • Research tasks (exploratory)    │
│  • Extraction                      • Multi-step bookings             │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘

The test: If you can draw the flowchart before running, it’s a pipeline. If the flowchart depends on what happens, it’s an agent.
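The flowchart test shows up directly in code shape. In this sketch the pipeline's control flow is fixed in the source, while the agent's next step is chosen at run time from what earlier steps returned. All names here (`run_pipeline`, `run_agent`, `choose_action`, the `"finish"` convention) are hypothetical.

```python
def run_pipeline(query, retrieve, generate):
    # Fixed flowchart: the same two steps run in the same order every time.
    return generate(query, retrieve(query))


def run_agent(goal, choose_action, tools, max_steps=10):
    # The next step depends on what earlier steps returned.
    history = []
    for _ in range(max_steps):
        name, arg = choose_action(goal, history)
        if name == "finish":
            return arg
        history.append((name, tools[name](arg)))
    raise RuntimeError("step limit reached")
```

Even this toy agent needs a step limit to terminate safely, a small preview of why the flexible option costs more.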

Takeaway: pipelines are predictable and cheap; agents are flexible and expensive. Don’t pay for flexibility you don’t need — but when you need it, pay the full price (all eight capabilities).

The Production Checklist

Before deploying an agent:

PRODUCTION DEPLOYMENT CHECKLIST

IDEMPOTENCY
[ ] Every external action has an idempotency key
[ ] Keys use stable identifiers (not timestamps)
[ ] Retries classified (RETRY vs NEVER_RETRY errors)
[ ] Backoff includes jitter

STATE PERSISTENCE
[ ] Checkpointing to durable storage (PostgreSQL in production)
[ ] Checkpoint BEFORE execution, not after
[ ] Resume logic handles in_progress state
[ ] Progress files for multi-session tasks

HUMAN ESCALATION
[ ] Confidence-based routing defined
[ ] High-stakes actions require approval
[ ] Escalation metrics tracked
[ ] Sampling strategy for scale

COST CONTROL
[ ] Token budget per task
[ ] Max step limit per loop
[ ] Circuit breakers on failure spikes
[ ] Cost alerts configured

OBSERVABILITY
[ ] Structured audit logging
[ ] Tool selection tracked
[ ] Semantic failure detection
[ ] Replay capability for debugging

SECURITY
[ ] Sandboxing appropriate to risk level
[ ] Input validation on tool calls AND tool outputs
[ ] Output sanitisation + egress allow-list
[ ] Prompt injection defenses (all three channels)

DURABLE EXECUTION
[ ] One framework chosen (Temporal · Inngest · Restate · AWS · Azure)
[ ] Replay determinism preserved (no non-deterministic code in workflow)
[ ] Workflow + activity boundaries match domain side-effect boundaries

TESTING & EVALUATION
[ ] Unit tests for every tool
[ ] Mocked-LLM integration tests for flows
[ ] Versioned golden dataset (happy + edge + adversarial)
[ ] LLM judge pinned to specific model + version
[ ] Behavioural assertions for non-determinism

Takeaway: ship none of these and the next agent incident has no audit trail. Ship all eight blocks and the agent becomes auditable, recoverable, and predictable.
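One checklist item, "Backoff includes jitter", is easy to get subtly wrong, so here is a minimal sketch of the widely used full-jitter variant of exponential backoff. The function name and the default `base` and `cap` values are assumptions chosen for illustration.

```python
import random


def full_jitter_delay(attempt, base=1.0, cap=30.0):
    """Seconds to wait before retry number `attempt` (0-based).

    The exponential ceiling is capped, then the delay is drawn uniformly
    between zero and that ceiling so concurrent clients spread out.
    """
    ceiling = min(cap, base * 2 ** attempt)
    return random.uniform(0, ceiling)
```

Without the random draw, every client that failed at the same moment retries at the same moment, which recreates the spike that caused the failure.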

Series Roadmap

This series covers each capability in depth:

Part	Topic	What You’ll Learn
0	Overview (you are here)	The loop is 20% of the work; the eight capabilities tutorials skip
1	Idempotency	Safe retries, Stripe pattern, full-jitter formula, cascade prevention
2	State Persistence	LangGraph PostgresSaver, checkpoint timing, hybrid memory
3	Human-in-the-Loop	Confidence routing, LangGraph `interrupt`, EU AI Act Article 14
4	Cost Control	Atomic token budgets, parallel-agent race, model routing
5	Observability	OTel GenAI semconv, five intent threats, FinOps from spans
6	Durable Execution	Temporal · Inngest · Restate · Step Functions · Durable Functions
7	Security	Three injection channels (user · doc · tool output) + six defense layers
8	Testing	Pyramid + golden dataset + LLM-judge drift detector

Closing

The eight capabilities aren’t independent. They compound: idempotency keys make checkpoints safely resumable; checkpoints make HITL pauses survive redeploys; observability tells you when any of the eight is failing silently; durable execution productises the first six. Add them in the order they’re listed — that’s roughly the order of leverage.

Testable question: pick last week’s worst agent incident. Which of the eight capabilities, if it had been in place, would have caught it before customers noticed? If the answer is “none,” the incident is the eighth-capability gap that no chapter alone covers.

→ Start with Part 1: Idempotency & Safe Retries — the single highest-leverage fix.

Or jump to the capability you need most:

Crashes losing progress? → Part 2: State Persistence
Need human oversight? → Part 3: Human-in-the-Loop
Costs out of control? → Part 4: Cost Control
Tired of reinventing checkpointing? → Part 6: Durable Execution
Flaky tests, drifting judges? → Part 8: Testing & Evaluation

References

[anthropic-context2025] Anthropic Applied AI — Effective Context Engineering for AI Agents, 2025-09-29. anthropic.com/engineering/effective-context-engineering-for-ai-agents. Anchor source for the long-horizon-tasks framing that motivates state persistence (Ch02).
[owasp-llm10] OWASP — Top 10 for LLM Applications 2025. genai.owasp.org/llm-top-10. LLM01:2025 Prompt Injection — the risk that drives the security chapter (Ch07).
[eu-ai-act-art14] European Parliament & Council — EU AI Act, Article 14 (Human Oversight). artificialintelligenceact.eu/article/14. Regulatory backbone for the HITL chapter (Ch03).
[otel-genai] OpenTelemetry — Semantic Conventions for Generative AI Systems. opentelemetry.io/docs/specs/semconv/gen-ai. Telemetry standard pinned by the observability chapter (Ch05).
[stripe-idempotency] Stripe — Idempotent Requests. docs.stripe.com/api/idempotent_requests. Canonical idempotency-key contract referenced throughout the series.
[chroma-rot] Hong, Troynikov, Huber — Context Rot: How Increasing Input Tokens Impacts LLM Performance, Chroma Technical Report, 2025-07-14. trychroma.com/research/context-rot. Empirical basis for context management in Ch02 and observability in Ch05.