Production Agents Overview - Why 98% Haven't Deployed
Deep dive into why most agent deployments fail, the six capabilities tutorials skip, and how to build agents that survive production
15 minutes • Senior Level • Dec 2024
Context: This is Part 0 of the Production Agents Deep Dive series.
For a quick introduction, read The Agent Loop Is a Lie first.
Why This Matters
You’ve built an agent. It works in development. It demos beautifully. You deploy it.
Then it:
Books the same flight twice when the API times out
Loses all progress when a user closes their browser
Sends 47 follow-up emails because it didn’t know it was waiting for a response
Burns through your monthly API budget in 3 hours
Does the wrong thing without crashing — and you don’t find out until a customer complains
You’re not alone. Only 2% of organizations have successfully deployed agentic AI at scale. Gartner predicts 40%+ of agentic AI projects will be canceled by 2027 due to cost overruns, unclear business value, and inadequate risk controls.
The problem isn’t your agent’s reasoning. It’s everything around the reasoning that tutorials don’t teach.
What Goes Wrong Without This:
PRODUCTION FAILURE PATTERNS
Symptom: Agent works in dev, fails in production.
Cause: Dev has no timeouts, no crashes, no concurrent users.
Production has all of these. Your agent wasn't built for them.
Symptom: Costs spiral out of control after launch.
Cause: Agents consume 5-20x more tokens than simple chains.
Without budgets and circuit breakers, loops run forever (see the budget sketch after these patterns).
Symptom: Agent "completes" tasks but users complain about wrong results.
Cause: Agents fail silently. Traditional monitoring misses semantic errors.
You're tracking latency, not correctness.
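Of these, the cost pattern is the most mechanical to guard against. Here is a minimal sketch of a per-task token budget with a step limit; the `LoopGuard` class and its limits are illustrative, not taken from any framework:

```python
class BudgetExceeded(Exception):
    """Raised when an agent loop exceeds its step or token budget."""

class LoopGuard:
    def __init__(self, max_steps=25, max_tokens=200_000):
        # Illustrative limits -- tune per task
        self.max_steps = max_steps
        self.max_tokens = max_tokens
        self.steps = 0
        self.tokens = 0

    def charge(self, tokens_used):
        """Call once per loop iteration with the tokens just consumed."""
        self.steps += 1
        self.tokens += tokens_used
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"step limit {self.max_steps} exceeded")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded(f"token budget {self.max_tokens} exceeded")
```

Call `guard.charge(tokens_used)` once per iteration; a runaway loop then fails loudly and cheaply instead of burning the monthly budget.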
The Six Capabilities Tutorials Skip
Every tutorial teaches observe-think-act. Here’s what they leave out: idempotency, state persistence, human escalation, cost control, observability, and security (the same six categories as the deployment checklist below). State persistence is the clearest example.
The problem: Agent crashes mid-task. User closes browser. Server restarts. All progress lost.
The solution: Checkpoint state to durable storage before and after every significant step. Resume from the last checkpoint on restart.
```python
# Bad: State in memory
self.state = {}  # Lost on crash

# Good: Checkpoint to durable storage
self.state['in_progress'] = step.id
self.checkpoint()  # Survives crashes
result = step.run()
self.state['completed'].append(step.id)
self.checkpoint()
```
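The flip side of checkpointing is resuming. A minimal sketch under the same assumptions as the block above (an illustrative state dict with `in_progress` and `completed` keys, and a `checkpoint()` that writes to durable storage):

```python
def resume(self, steps):
    """Skip finished steps; re-run the one that was in flight."""
    completed = set(self.state.get('completed', []))
    for step in steps:
        if step.id in completed:
            continue  # Finished in a previous session
        # The interrupted step is re-run from the top, which is only
        # safe if the step is idempotent -- hence idempotency keys.
        self.state['in_progress'] = step.id
        self.checkpoint()
        step.run()
        self.state.setdefault('completed', []).append(step.id)
        self.checkpoint()
```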
Not every problem needs an agent. Here’s how to decide:
DECISION MATRIX: PIPELINE vs AGENT
┌──────────────────────────────────────────────────────────────────────┐
│ DECISION MATRIX                                                      │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│ USE A PIPELINE WHEN:                USE AN AGENT WHEN:               │
│                                                                      │
│ • Steps are fixed and known         • Steps depend on results        │
│ • Input → Output is predictable     • Need to adapt to surprises     │
│ • Failures are simple (retry/fail)  • Failures need judgment         │
│ • No external state changes         • Actions have side effects      │
│ • Speed > flexibility               • Flexibility > speed            │
│                                                                      │
│ Examples:                           Examples:                        │
│ • RAG (retrieve → generate)         • Customer support (varies)      │
│ • Summarization                     • Code generation (iterative)    │
│ • Classification                    • Research tasks (exploratory)   │
│ • Extraction                        • Multi-step bookings            │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
The test: If you can draw the flowchart before running, it’s a pipeline. If the flowchart depends on what happens, it’s an agent.
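The distinction shows up in the shape of the code itself. A sketch with illustrative names only (`split`, `decide_next_action`, and the rest are placeholders, not a real API):

```python
# Pipeline: the flowchart exists before the run
def summarize(document):
    chunks = split(document)            # fixed step 1
    ranked = rank(chunks)               # fixed step 2
    return generate_summary(ranked)     # fixed step 3

# Agent: the flowchart depends on what happens at runtime
def handle_ticket(ticket):
    state = observe(ticket)
    while not state.resolved:
        action = decide_next_action(state)  # model picks the next step
        state = execute(action, state)      # result shapes what comes next
    return state
```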
The Production Checklist
Before deploying an agent:
PRODUCTION DEPLOYMENT CHECKLIST
IDEMPOTENCY
[ ] Every external action has an idempotency key (see the sketch after this checklist)
[ ] Keys use stable identifiers (not timestamps)
[ ] Retries classified (RETRY vs NEVER_RETRY errors)
[ ] Backoff includes jitter
STATE PERSISTENCE
[ ] Checkpointing to durable storage (PostgreSQL in production)
[ ] Checkpoint BEFORE execution, not after
[ ] Resume logic handles in_progress state
[ ] Progress files for multi-session tasks
HUMAN ESCALATION
[ ] Confidence-based routing defined
[ ] High-stakes actions require approval
[ ] Escalation metrics tracked
[ ] Sampling strategy for scale
COST CONTROL
[ ] Token budget per task
[ ] Max step limit per loop
[ ] Circuit breakers on failure spikes
[ ] Cost alerts configured
OBSERVABILITY
[ ] Structured audit logging
[ ] Tool selection tracked
[ ] Semantic failure detection
[ ] Replay capability for debugging
SECURITY
[ ] Sandboxing appropriate to risk level
[ ] Input validation on tool calls
[ ] Output sanitization
[ ] Prompt injection defenses
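To make the IDEMPOTENCY items concrete: a minimal sketch in which keys come from stable identifiers, errors are classified before retrying, and backoff is jittered. The flight-booking names and the `RETRYABLE` split are illustrative, not from any particular API:

```python
import hashlib
import random
import time

# Transient errors are safe to retry; everything else is NEVER_RETRY.
RETRYABLE = (TimeoutError, ConnectionError)

def idempotency_key(user_id, flight_id, date):
    """Stable identifiers only. A timestamp here would make every
    retry look like a new request and defeat deduplication."""
    raw = f"book:{user_id}:{flight_id}:{date}"
    return hashlib.sha256(raw.encode()).hexdigest()

def call_with_retries(api_call, key, max_attempts=4):
    for attempt in range(max_attempts):
        try:
            # The provider deduplicates on the key, so a retry after
            # a timeout cannot book the same flight twice.
            return api_call(idempotency_key=key)
        except RETRYABLE:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff with jitter to avoid retry stampedes
            time.sleep(2 ** attempt + random.uniform(0, 1))
```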
The loop is 20% of the work. The other 80% is handling failures, persisting state, controlling costs, and keeping humans in the loop.
Tutorials optimize for understanding. Production optimizes for reliability. They’re different goals.
Agents fail differently than APIs. They don’t crash — they do the wrong thing quietly. Your monitoring needs to catch semantic failures, not just exceptions.
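Catching those quiet failures means checking the result, not the exit path. A minimal sketch; `llm_judge` is a hypothetical grader, and in practice it might be a rules check, a second model, or sampled human review:

```python
def verify_outcome(task, result):
    """An exception-free run is not a success; check the result itself."""
    # Cheap structural checks first
    if not result or result.get("status") != "done":
        return False
    # Then the semantic check: does the result actually satisfy the task?
    # `llm_judge` is hypothetical -- any grader slots in here.
    verdict = llm_judge(
        f"Does this result satisfy the task?\nTask: {task}\nResult: {result}"
    )
    return verdict == "yes"
```

Route failures from this check into the same escalation path as exceptions; that is what closes the "wrong thing quietly" gap.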
Start with one capability. Add idempotency first (highest leverage). Then state persistence. Then the rest.
Not every problem needs an agent. If steps are fixed, use a pipeline. Agents add flexibility at the cost of complexity.