Prerequisite: This is Part 4 of the Production Agents Deep Dive series. Read Part 1: Idempotency first — retry storms are the most common driver of cost surprises.
Same agent. Same task class. With token budgets + loop-breakers + model routing, the bill stops being a horror story.
Why This Matters
Your agent enters a loop. Loop calls LLM. LLM responds. Loop continues. You wake up to a $10K bill.
Three structural reasons agents burn tokens harder than chains: they loop (observe-think-act repeats until “done”), they accumulate context (each turn pushes tool outputs onto the prompt), and they retry external calls under failure. Without bounds, any one of those three can run for hours.
What other content gets wrong: framework docs show max_iterations=10 and call it cost control. They never show the parallel sub-agent budget race — a coordinator that fans out N workers, each checking the same shared budget without atomic check-and-commit, blowing through the cap before the first worker reports back. They show pricing tables that go stale within a quarter and hardcode them into client code. This chapter pins the patterns that actually scale: atomic budget accounting, version-pinned pricing, and graceful partial-result shutdown when the cap fires mid-task.
Takeaway: cost control isn’t about being cheap. It’s about making the bill predictable. An agent that occasionally costs $50 is fine; an agent that occasionally costs $10,000 is a resignation letter.
What Goes Wrong Without This:
Symptom: Monthly API bill 10x higher than expected.
Cause: Agent retry loop when external API was down. No circuit breaker. Kept calling LLM for 6 hours.

Symptom: Single user task consumed $500 in tokens.
Cause: Complex research task with no budget limit. Agent kept gathering more context, expanding scope.

Symptom: Costs vary wildly between identical requests.
Cause: No model routing. Using a flagship model for tasks a cheap model handles fine. No visibility into per-task costs.
Takeaway: every $10K bill traces back to one of three causes: an unbounded loop, a missing budget cap, or the wrong model for the task. The patterns below close all three.
Why Agents Are Expensive
Agents aren’t just more LLM calls. They’re structurally more expensive.
| Factor | Simple Chain | Agent |
|---|---|---|
| LLM calls per task | 1-3 | 5-50+ |
| Context size growth | None | Accumulates each turn |
| Retries | Rare | Common (external dependencies) |
| Tool outputs in context | Minimal | Large (file contents, API responses) |
| Loops | None | Yes (observe-think-act) |
Example cost breakdown:
Simple RAG query:
  1 embedding call: $0.0001
  1 completion call: $0.01
  Total: $0.01

Agent research task:
  5 planning calls: $0.05
  20 tool calls: $0.20
  10 analysis calls: $0.10
  3 retry loops: $0.15
  Total: $0.50
50x more expensive for a single task. At scale, this compounds — and the per-task variance is what makes the bill unpredictable, not the average.
Takeaway: agents aren’t more LLM calls; they’re a different cost shape. Loop count × context size × retry rate is the real driver, not the per-call price.
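To make the cost shape concrete, here is a minimal sketch of how input tokens add up when every turn re-sends the accumulated context. The function name and the token figures are illustrative assumptions, not measurements from a real agent.

```python
def total_input_tokens(turns: int, base_prompt: int, added_per_turn: int) -> int:
    """Sum input tokens over a loop where each turn re-sends all prior context.

    Turn k (0-based) sends the base prompt plus the k tool outputs gathered so far.
    """
    return sum(base_prompt + k * added_per_turn for k in range(turns))


# Hypothetical numbers: 1,000-token prompt, 2,000 tokens of tool output per turn.
one_shot = total_input_tokens(turns=1, base_prompt=1000, added_per_turn=2000)
agent_run = total_input_tokens(turns=20, base_prompt=1000, added_per_turn=2000)
print(one_shot, agent_run)
```

In this toy setup, 20 turns consume 400 times the input tokens of a single call, because the accumulated-context term grows roughly with the square of the turn count.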
Pattern 1: Token Budgets
Every task gets a budget. Exceed it, gracefully stop.
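import logging

logger = logging.getLogger(__name__)


class TokenBudgetExceeded(Exception):
    """Raised when a task has used up its token budget."""

    def __init__(self, used, max, message):
        super().__init__(message)
        self.used = used
        self.max = max


class TokenBudget:
    def __init__(self, max_tokens=50000, warn_at=0.8):
        self.max = max_tokens
        self.warn_threshold = warn_at
        self.used = 0
        self.warning_issued = False

    def consume(self, tokens):
        """Record tokens already spent; raise once the cap is reached."""
        self.used += tokens
        # Warn once, the first time usage crosses the warning threshold.
        if not self.warning_issued and self.used >= self.max * self.warn_threshold:
            self.warning_issued = True
            logger.warning(
                f"Token budget at {self.used}/{self.max} "
                f"(warning threshold {self.warn_threshold:.0%})"
            )
        if self.used >= self.max:
            raise TokenBudgetExceeded(
                used=self.used,
                max=self.max,
                message="Task exceeded token budget. Gracefully stopping.",
            )

    @property
    def remaining(self):
        return max(0, self.max - self.used)

    @property
    def percentage_used(self):
        return self.used / self.max


# Usage in agent (agent_loop, llm, prompt and agent come from your own code)
def run_task():
    budget = TokenBudget(max_tokens=100000)
    for step in agent_loop():
        try:
            response = llm.call(prompt)
            budget.consume(response.usage.total_tokens)
        except TokenBudgetExceeded:
            return agent.graceful_shutdown("Budget exceeded")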
Budget Sizing Guidelines
| Task Type | Suggested Budget | Rationale |
|---|---|---|
| Simple Q&A | 5,000 tokens | 1-2 turns max |
| Document analysis | 50,000 tokens | Large context, few turns |
| Research task | 100,000 tokens | Many tool calls, iteration |
| Code generation | 150,000 tokens | Multiple files, testing |
| Complex workflow | 500,000 tokens | Multi-step, human-in-loop |
Start conservative. Increase based on actual usage patterns, not guesses.
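One way to turn "actual usage patterns" into a number is sketched below. It applies the p99 plus 50% headroom rule from the Common Gotchas table to a list of observed per-task token counts. The helper names and the sample values are hypothetical, and the nearest-rank percentile is just one reasonable choice.

```python
import math


def percentile(values: list[int], pct: float) -> int:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def suggest_budget(observed_tokens: list[int], headroom: float = 0.5) -> int:
    """Suggest a per-task cap: p99 of observed usage plus fractional headroom."""
    return math.ceil(percentile(observed_tokens, 99) * (1 + headroom))


# Hypothetical per-task token counts pulled from logs.
samples = [12_000, 15_000, 18_000, 22_000, 40_000]
print(suggest_budget(samples))
```

With only a handful of samples the p99 is simply the maximum, so treat the result as a starting point and recompute it as more task runs are logged.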
The parallel sub-agent race condition
The naïve TokenBudget above breaks the moment a coordinator fans out parallel sub-agents that all check budget.remaining > N before committing. Each sub-agent reads the same value, each decides it can afford to call the LLM, all of them spend simultaneously, and the cap is shattered before any of them update self.used. Fix: atomic check-and-commit, ideally via a single counter in Redis or your durable execution framework’s state:
class AtomicTokenBudget:
    """Atomic check-and-commit. Safe across parallel sub-agents."""

    def __init__(self, max_tokens: int, store, key: str = "budget:used"):
        self.max = max_tokens
        self.store = store  # Redis client or equivalent atomic-counter store
        self.key = key      # use one key per task so tasks don't share a counter

    def reserve(self, tokens: int) -> bool:
        """Reserve tokens before the LLM call. Returns False if cap would be breached."""
        # INCRBY is atomic; if the new total exceeds the cap, roll back.
        new_total = self.store.incrby(self.key, tokens)
        if new_total > self.max:
            self.store.decrby(self.key, tokens)  # roll back the reservation
            return False
        return True
The rule: reserve before you spend. Reconcile actual usage against reservation after the LLM responds — small variance is fine; large variance signals a token-counting bug.
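The reserve-then-reconcile rule can be sketched without a Redis server. `InMemoryCounterStore` is a hypothetical stand-in that mimics atomic `incrby`/`decrby` with a lock, and `ReservingBudget` with its `reconcile` method is an illustrative name, not part of any library.

```python
import threading


class InMemoryCounterStore:
    """Hypothetical stand-in for Redis: incrby/decrby made atomic with a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._values = {}

    def incrby(self, key: str, amount: int) -> int:
        with self._lock:
            self._values[key] = self._values.get(key, 0) + amount
            return self._values[key]

    def decrby(self, key: str, amount: int) -> int:
        return self.incrby(key, -amount)

    def get(self, key: str) -> int:
        with self._lock:
            return self._values.get(key, 0)


class ReservingBudget:
    """Reserve an estimate before the call, then true it up afterwards."""

    def __init__(self, max_tokens: int, store, key: str = "budget:used"):
        self.max = max_tokens
        self.store = store
        self.key = key

    def reserve(self, tokens: int) -> bool:
        new_total = self.store.incrby(self.key, tokens)
        if new_total > self.max:
            self.store.decrby(self.key, tokens)  # undo: this caller may not spend
            return False
        return True

    def reconcile(self, reserved: int, actual: int) -> None:
        """Swap the reserved estimate for the actual usage reported by the API."""
        self.store.incrby(self.key, actual - reserved)

    def used(self) -> int:
        return self.store.get(self.key)


# Eight parallel workers each try to reserve 300 tokens against a 1,000 cap.
budget = ReservingBudget(max_tokens=1000, store=InMemoryCounterStore())
outcomes = []


def worker():
    outcomes.append(budget.reserve(300))  # list.append is atomic in CPython


threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

granted = sum(outcomes)  # number of workers allowed to call the LLM
print(granted, budget.used())
```

Only three of the eight workers are granted a reservation, and the counter settles at 900. A rejected worker briefly inflates the counter before rolling back, which can only cause extra rejections, never an overshoot.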
Takeaway: a budget that isn’t atomic isn’t a budget. The check-then-spend gap is where parallel agents blow through the cap.
Pattern 2: Circuit Breakers for Loops
Agents loop. Loops can run forever. Circuit breakers stop them.
class LoopLimitExceeded(Exception):
    """Raised when the agent runs more iterations than allowed."""


class StuckInLoop(Exception):
    """Raised when the agent repeats the same action too many times in a row."""


class LoopBreaker:
    def __init__(self, max_iterations=25, max_same_action=3):
        self.max_iterations = max_iterations
        self.max_same_action = max_same_action
        self.iterations = 0
        self.action_history = []

    def check(self, action):
        self.iterations += 1
        self.action_history.append(action)

        # Too many total iterations (the first max_iterations checks pass)
        if self.iterations > self.max_iterations:
            raise LoopLimitExceeded(
                f"Agent exceeded {self.max_iterations} iterations"
            )

        # Stuck in same action
        recent = self.action_history[-self.max_same_action:]
        if len(recent) == self.max_same_action and len(set(recent)) == 1:
            raise StuckInLoop(
                f"Agent repeated '{action}' {self.max_same_action} times"
            )


# Usage (agent and done come from your own loop)
breaker = LoopBreaker(max_iterations=25, max_same_action=3)
while not done:
    action = agent.decide()
    breaker.check(action.type)  # Raises if stuck
    result = agent.execute(action)
Loop Detection Strategies
| Strategy | Detects | Implementation |
|---|---|---|
| Max iterations | Runaway loops | Counter, hard limit |
| Same action repeated | Stuck agent | Track last N actions |
| No progress | Spinning without results | Track state changes |
| Time limit | Slow infinite loops | Wall clock timeout |
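The last two rows of the table, no-progress detection and a wall-clock limit, could look roughly like this. The class and exception names are hypothetical, and the sketch assumes the agent's state is a JSON-serializable dict so it can be fingerprinted.

```python
import hashlib
import json
import time


class NoProgress(Exception):
    """Raised when the agent state stops changing between steps."""


class TimeLimitExceeded(Exception):
    """Raised when the task runs past its wall-clock allowance."""


class ProgressWatchdog:
    def __init__(self, max_stalled_steps=3, max_seconds=300.0, clock=time.monotonic):
        self.max_stalled_steps = max_stalled_steps
        self.max_seconds = max_seconds
        self.clock = clock  # injectable so tests can fake time
        self.started_at = clock()
        self.last_fingerprint = None
        self.stalled_steps = 0

    def check(self, state: dict) -> None:
        """Call once per loop iteration with the current agent state."""
        if self.clock() - self.started_at > self.max_seconds:
            raise TimeLimitExceeded(f"Task ran longer than {self.max_seconds}s")

        # Hash a canonical JSON dump so equal states give equal fingerprints.
        payload = json.dumps(state, sort_keys=True).encode("utf-8")
        fingerprint = hashlib.sha256(payload).hexdigest()

        if fingerprint == self.last_fingerprint:
            self.stalled_steps += 1
        else:
            self.stalled_steps = 0
            self.last_fingerprint = fingerprint

        if self.stalled_steps >= self.max_stalled_steps:
            raise NoProgress(
                f"State unchanged for {self.stalled_steps} consecutive steps"
            )
```

Using `time.monotonic` rather than `time.time` keeps the deadline immune to system clock adjustments.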
Takeaway: budgets cap spend. Circuit breakers cap time and repetition. You need both — an agent stuck in a fast loop can exhaust a generous budget in seconds.
Pattern 3: Model Routing
Use expensive models only when needed.
class ModelRouter:
    def __init__(self):
        # Model names are illustrative; load names and prices from config,
        # since per-token prices change too often to pin in comments.
        self.models = {
            "simple": "gpt-4o-mini",   # small tier
            "standard": "gpt-4o",      # mid tier
            "complex": "claude-opus",  # flagship tier
        }

    def route(self, task):
        # Classify task complexity
        if task.type in ["clarification", "formatting", "simple_qa"]:
            return self.models["simple"]
        # Check the flagship classes before the generic reasoning rule;
        # otherwise a code review that requires reasoning lands on the mid tier.
        if task.type in ["code_review", "complex_research", "multi_step"]:
            return self.models["complex"]
        if task.requires_reasoning or task.type in ["analysis", "planning"]:
            return self.models["standard"]
        return self.models["standard"]  # Default


# Usage
router = ModelRouter()
model = router.route(current_task)
response = llm.call(model=model, prompt=prompt)
Model cost tiers (indicative, not authoritative)
Prices change quarterly; the numbers below are rough order-of-magnitude figures for mid-2026 to illustrate the tiering. Always read the current vendor pricing page before pinning numbers in production code [anthropic-pricing], [openai-pricing].
| Tier | Use for | Cost shape |
|---|---|---|
| Small (Haiku/Mini-class) | Formatting, classification, simple Q&A | ~10–30× cheaper than flagship |
| Mid (Sonnet/standard-class) | Most reasoning + tool use, balanced | The default for production agents |
| Flagship (Opus/large-class) | Hard reasoning, code, long-horizon planning | 3–5× cost of mid-tier; reserve for the steps that need it |
The math: if 60% of your tasks can use small-tier models, the small tier saves ~90% on those tasks. Routing pays back faster than any other cost-control pattern.
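The arithmetic can be spelled out in a few lines. This is a back-of-envelope sketch that assumes routed and unrouted tasks use similar token counts and that the 90% saving is measured against the model those tasks would otherwise run on; the function name is made up for illustration.

```python
def blended_cost(small_share: float, small_discount: float) -> float:
    """Relative bill after routing, with the unrouted bill normalised to 1.0.

    small_share: fraction of tasks sent to the small tier (0..1).
    small_discount: fraction saved per task on the small tier (0.9 = 90% cheaper).
    """
    routed = small_share * (1 - small_discount)  # tasks moved to the small tier
    unrouted = 1 - small_share                   # tasks left on the baseline model
    return routed + unrouted


bill = blended_cost(small_share=0.6, small_discount=0.9)
print(f"{bill:.2f}")  # 0.46
```

Under those assumptions the total bill falls to about 46% of its unrouted level, a saving of roughly 54%.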
Takeaway: routing is the highest-leverage cost lever. Don’t hardcode prices — link to vendor pages or pull from a billing-API source-of-truth.
Pattern 4: Cost Tracking
You can’t control what you don’t measure.
import logging

logger = logging.getLogger(__name__)


class CostTracker:
    """Pricing is injected, never hardcoded. Load it from your billing-API
    source-of-truth at startup and refresh hourly. Prices are $/1k tokens."""

    def __init__(self, price_table: dict, alert_threshold: float = 10.0, send_alert=None):
        # price_table: {"model-name": {"input": $/1k, "output": $/1k}}
        self.pricing = price_table
        self.total_cost = 0.0
        self.cost_by_model = {}
        self.cost_by_task_type = {}
        self.alert_threshold = alert_threshold
        # send_alert: callable(channel=..., message=...), e.g. a Slack wrapper
        self.send_alert = send_alert
        self.alert_sent = False

    def record(self, model, input_tokens, output_tokens, task_type=None):
        if model not in self.pricing:
            # A silent default price would make tracked cost drift from the bill.
            raise KeyError(f"No pricing entry for model '{model}'")
        pricing = self.pricing[model]
        cost = (
            (input_tokens * pricing["input"] / 1000) +
            (output_tokens * pricing["output"] / 1000)
        )
        self.total_cost += cost
        self.cost_by_model[model] = self.cost_by_model.get(model, 0) + cost
        if task_type:
            self.cost_by_task_type[task_type] = (
                self.cost_by_task_type.get(task_type, 0) + cost
            )
        # Alert once when the threshold is first crossed, not on every later call.
        if self.total_cost >= self.alert_threshold and not self.alert_sent:
            self.alert_sent = True
            self.trigger_alert()
        return cost

    def trigger_alert(self):
        message = f"Agent cost alert: ${self.total_cost:.2f} exceeded threshold"
        if self.send_alert is not None:
            self.send_alert(channel="slack-finops", message=message)
        else:
            logger.warning(message)

    def report(self):
        return {
            "total_cost": self.total_cost,
            "by_model": self.cost_by_model,
            "by_task_type": self.cost_by_task_type,
        }
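A sketch of what an injected, version-pinned price table might look like follows. The JSON layout, the `PriceTable` name, the model names, and the prices are all made up for illustration; a real table would come from your billing API or pricing source-of-truth.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class PriceTable:
    version: str          # pin this in logs so cost reports are reproducible
    fetched_at: datetime  # when the table was pulled from the source of truth
    prices: dict          # {"model-name": {"input": $/1k, "output": $/1k}}

    def is_stale(self, now: datetime, max_age: timedelta = timedelta(hours=1)) -> bool:
        """True when the table is older than the refresh interval."""
        return now - self.fetched_at > max_age


def load_price_table(raw_json: str) -> PriceTable:
    """Parse a price table document into an immutable snapshot."""
    data = json.loads(raw_json)
    return PriceTable(
        version=data["version"],
        fetched_at=datetime.fromisoformat(data["fetched_at"]),
        prices=data["prices"],
    )


# Hypothetical payload; model names and prices are placeholders, not real rates.
raw = """
{
  "version": "2026-01-01",
  "fetched_at": "2026-01-01T00:00:00+00:00",
  "prices": {"example-small": {"input": 0.001, "output": 0.002}}
}
"""
table = load_price_table(raw)
print(table.version, sorted(table.prices))
```

Passing `table.prices` to the tracker and logging `table.version` alongside each cost record makes it possible to explain later why a dashboard figure differs from the invoice.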
Cost Attribution Dimensions
| Dimension | How to Track | Why It Matters |
|---|---|---|
| Per request | Tag spans with request_id | Identify expensive requests |
| Per user | Tag with user_id | Fair billing, abuse detection |
| Per task type | Classify tasks | Optimize high-cost task types |
| Per model | Track model in each call | Validate routing effectiveness |
| Per feature | Feature flags on tasks | ROI by feature |
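A small sketch of attribution across those dimensions: each recorded cost carries tags, and totals are kept per dimension. `AttributedCosts` and its method names are hypothetical; in practice the same tags would usually live on trace spans.

```python
from collections import defaultdict


class AttributedCosts:
    """Accumulate cost per tag value for each attribution dimension."""

    DIMENSIONS = ("request_id", "user_id", "task_type", "model", "feature")

    def __init__(self):
        self.totals = {dim: defaultdict(float) for dim in self.DIMENSIONS}

    def record(self, cost: float, **tags: str) -> None:
        # Validate every tag first so a bad call leaves the totals untouched.
        unknown = [dim for dim in tags if dim not in self.totals]
        if unknown:
            raise ValueError(f"unknown dimension(s): {unknown}")
        for dim, value in tags.items():
            self.totals[dim][value] += cost

    def top(self, dimension: str, n: int = 3) -> list:
        """Return the n most expensive values for one dimension."""
        ranked = sorted(
            self.totals[dimension].items(), key=lambda kv: kv[1], reverse=True
        )
        return ranked[:n]


costs = AttributedCosts()
costs.record(0.5, user_id="u1", model="example-small")
costs.record(0.25, user_id="u2", model="example-small")
costs.record(0.25, user_id="u1", model="example-large")
print(costs.top("user_id", n=1))
```

The user IDs and model names above are placeholders; the point is that one `record` call feeds every dimension at once.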
Takeaway: track 5 dimensions, alert on 2. Per-request + per-user are the load-bearing axes; the others are debugging tools.
Pattern 5: Max Step Limits
Hard limits prevent catastrophic runaway.
import logging

logger = logging.getLogger(__name__)


class AgentExecutor:
    def __init__(self, agent, max_steps=50, max_tool_calls=100):
        self.agent = agent  # must provide decide(), execute(), summarize_progress()
        self.max_steps = max_steps
        self.max_tool_calls = max_tool_calls

    def run(self, task):
        steps = 0
        tool_calls = 0
        while not task.is_complete():
            steps += 1
            if steps > self.max_steps:
                return self.force_completion(
                    task,
                    reason=f"Exceeded max steps ({self.max_steps})"
                )
            action = self.agent.decide(task)
            if action.is_tool_call:
                tool_calls += 1
                if tool_calls > self.max_tool_calls:
                    return self.force_completion(
                        task,
                        reason=f"Exceeded max tool calls ({self.max_tool_calls})"
                    )
            task = self.agent.execute(action)
        return task.result

    def force_completion(self, task, reason):
        logger.warning(f"Force completing task: {reason}")
        return self.agent.summarize_progress(task, interrupted=True)
Takeaway: max-step limits are the catastrophic-stop. Token budgets are the proportional-stop. You want both — and force_completion should return something, never nothing.
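What "return something" might mean in practice is sketched below: a structured partial result the caller can act on. The `PartialResult` fields and the standalone `summarize_progress` helper are assumptions made for illustration, not the shape of any particular framework.

```python
from dataclasses import dataclass, field


@dataclass
class PartialResult:
    completed: bool
    reason: str                                         # why the run stopped early
    findings: list = field(default_factory=list)        # work finished so far
    pending_steps: list = field(default_factory=list)   # work that was cut off


def summarize_progress(findings, pending_steps, reason: str) -> PartialResult:
    """Package whatever was finished so the caller can resume or report it."""
    return PartialResult(
        completed=False,
        reason=reason,
        findings=list(findings),
        pending_steps=list(pending_steps),
    )


partial = summarize_progress(
    findings=["source A summarised"],
    pending_steps=["summarise source B"],
    reason="Exceeded max steps (50)",
)
print(partial.completed, len(partial.findings), partial.reason)
```

Listing the pending steps lets a human or a follow-up run pick up where the interrupted one stopped instead of paying for the whole task again.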
Alerting Strategy
# Example alerting rules
alerts:
  - name: high_cost_request
    condition: request_cost > $5
    severity: warning
    action: log_and_review

  - name: budget_exceeded
    condition: daily_cost > $100
    severity: critical
    action: page_oncall

  - name: runaway_loop
    condition: iterations > 30
    severity: critical
    action: kill_and_alert

  - name: cost_spike
    condition: hourly_cost > 3x_average
    severity: warning
    action: investigate

  - name: model_misrouting
    condition: expensive_model_on_simple_task
    severity: info
    action: log_for_review
Takeaway: alert on rate-of-change (cost spike vs hourly average) more than absolute value. Static thresholds catch yesterday’s bugs; rate-of-change catches today’s.
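A rate-of-change check in the spirit of the `cost_spike` rule could be sketched like this. `SpikeDetector`, its parameters, and the sample costs are hypothetical; the `min_history` guard reflects the fact that a new deployment has no baseline to compare against.

```python
from collections import deque


class SpikeDetector:
    """Flag an hour whose cost exceeds a multiple of the trailing average."""

    def __init__(self, window_hours: int = 24, multiplier: float = 3.0, min_history: int = 6):
        self.history = deque(maxlen=window_hours)  # oldest hours drop off automatically
        self.multiplier = multiplier
        self.min_history = min_history  # stay quiet until a baseline exists

    def observe(self, hourly_cost: float) -> bool:
        """Record one hour of spend; return True if it counts as a spike."""
        is_spike = False
        if len(self.history) >= self.min_history:
            average = sum(self.history) / len(self.history)
            is_spike = average > 0 and hourly_cost > self.multiplier * average
        self.history.append(hourly_cost)
        return is_spike


detector = SpikeDetector(window_hours=24, multiplier=3.0, min_history=3)
flags = [detector.observe(cost) for cost in [1.0, 1.0, 1.0, 2.0, 10.0]]
print(flags)
```

In this sketch the spike hour itself joins the history and raises the baseline for the hours that follow. Whether to exclude flagged hours from the average is a design choice left open here.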
What to Copy, What to Skip
| Pattern | Copy when | Skip when |
|---|---|---|
| Per-task TokenBudget | Every production agent. Default. | Single-shot, single-LLM-call utility scripts |
| AtomicTokenBudget via Redis | Parallel sub-agent coordinators, fan-out workflows | Single-threaded agents (overhead without payoff) |
| LoopBreaker (iterations + same-action) | Any agent that loops more than 3 turns | Linear pipelines with no decision branching |
| Model routing by task class | Mixed-difficulty workloads where the small tier handles ≥30% of traffic | Uniform-difficulty workloads (e.g. all-code-review) — routing overhead exceeds savings |
| CostTracker with injected pricing | Any agent with non-trivial spend | Sub-$10/mo workloads, where logging cost outweighs insight |
| Hardcoded model prices in client code | Never. Prices change and the bill won’t match your dashboards | Always skip: inject from billing API or pricing source-of-truth |
| Max-step + max-tool-call ceilings | Every production agent | n/a — always include |
| Rate-of-change alerting (hourly spike vs avg) | Production traffic with stable baseline | Day-zero deployments — baseline doesn’t exist yet |
Takeaway: almost every pattern has a “skip when”. Two rows do not: always set max-step and max-tool-call ceilings, and never hardcode pricing. Treat both as hard rules.
Common Gotchas
| Gotcha | Symptom | Fix |
|---|---|---|
| No budget on dev | Works in dev, explodes in prod | Budget in all environments; dev budgets can be smaller, never absent |
| Budget too tight | Tasks fail legitimately | Monitor actual p95/p99 usage; size cap at p99 + 50% |
| No graceful shutdown | Task fails with no results | force_completion returns partial progress, never bare exception |
| Static routing | Over-using expensive models | Dynamic complexity classification; default to small tier, escalate on need |
| No per-user limits | One user burns budget for all | User-level quotas + per-user dashboards |
| Alerting too late | See bill at end of month | Real-time cost stream, hourly rate-of-change alerts |
| Hardcoded prices | Bills don’t match span-derived cost | Inject pricing from billing API or single source-of-truth table |
| Naïve check-then-spend budget | Cap shattered by parallel sub-agents | Atomic reserve-then-commit via Redis or durable execution framework state |
The Cost Control Checklist
Before deploying an agent:
TOKEN BUDGETS
[ ] Per-task budget defined
[ ] Warning at 80% threshold
[ ] Graceful shutdown when exceeded
[ ] Budget sizes based on actual usage data

LOOP PROTECTION
[ ] Max iterations limit
[ ] Same-action detection
[ ] Time limit as backstop
[ ] Progress tracking (no-op detection)

MODEL ROUTING
[ ] Task complexity classification
[ ] Model selection based on task
[ ] Default model is cost-efficient
[ ] Override for critical tasks

COST TRACKING
[ ] Per-request cost calculation
[ ] Per-user attribution
[ ] Per-task-type breakdown
[ ] Real-time dashboards

ALERTING
[ ] Per-request cost alerts
[ ] Daily budget alerts
[ ] Anomaly detection
[ ] Oncall escalation configured
Takeaway: ship none of these and a single bad agent run can absorb your monthly budget. Ship all five blocks and cost stops being a horror story — it becomes a flat line.
Closing
Cost control is three orthogonal mechanisms — budget caps the spend, loop-breakers cap the time, model routing caps the per-call rate. None substitute for the others. Together they turn an unbounded liability into a predictable line item. The next chapter shows how to detect when the agent is doing the wrong thing inside its budget — silent semantic failure, where the bill is fine but the result is wrong.
Testable question: for your top three agent task types, what’s the p95 cost per task, and at what cap would force_completion fire? If you can’t answer in seconds, the cost control is theatre.
→ Part 5: Observability & Silent Failures — catching the wrong-but-quiet failures the budget can’t see.
References
- [anthropic-pricing] Anthropic. API Pricing. anthropic.com/pricing. Source of truth for current per-token rates by model. Always read live before pinning numbers in production code.
- [openai-pricing] OpenAI. API Pricing. openai.com/api/pricing. Same source-of-truth pattern for OpenAI models.
- [aws-jitter] Marc Brooker (AWS Architecture Blog). Exponential Backoff and Jitter, 2015-03-04. aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter. Background for why naïve retry storms compound into cost incidents.
- [pa-idempotency] Production Agents, Part 1: Idempotency & Safe Retries. The retry-storm pattern that turns a transient API outage into a 10× billing event.
- [pa-state] Production Agents, Part 2: State Persistence & Checkpointing. The atomic-state pattern that makes AtomicTokenBudget work across parallel sub-agents.
- [pa-observability] Production Agents, Part 5: Observability & Silent Failures. FinOps-from-span-data: cost calculation belongs in the query layer over OTel spans, not hardcoded in agent code.