
Cost Control & Token Budgets - Preventing $10K Surprises

Summary

Deep dive into cost control for production agents: token budgets, circuit breakers, model routing, max step limits, and preventing runaway loops that burn through API credits

Prerequisite: This is Part 4 of the Production Agents Deep Dive series. Read Part 1: Idempotency first — retry storms are the most common driver of cost surprises.

Cost control: from runaway loop to bounded loop

Same agent. Same task class. With token budgets + loop-breakers + model routing, the bill stops being a horror story.

Why This Matters

Your agent enters a loop. Loop calls LLM. LLM responds. Loop continues. You wake up to a $10K bill.

Three structural reasons agents burn tokens harder than chains: they loop (observe-think-act repeats until “done”), they accumulate context (each turn pushes tool outputs onto the prompt), and they retry external calls under failure. Without bounds, any one of those three can run for hours.

What other content gets wrong: framework docs show max_iterations=10 and call it cost control. They never show the parallel sub-agent budget race — a coordinator that fans out N workers, each checking the same shared budget without atomic check-and-commit, blowing through the cap before the first worker reports back. They show pricing tables that go stale within a quarter and hardcode them into client code. This chapter pins the patterns that actually scale: atomic budget accounting, version-pinned pricing, and graceful partial-result shutdown when the cap fires mid-task.

Takeaway: cost control isn’t about being cheap. It’s about making the bill predictable. An agent that occasionally costs $50 is fine; an agent that occasionally costs $10,000 is a resignation letter.

What Goes Wrong Without This:

COST CONTROL FAILURE PATTERNS
Symptom: Monthly API bill 10x higher than expected.
Cause:   Agent retry loop when external API was down.
         No circuit breaker. Kept calling LLM for 6 hours.

Symptom: Single user task consumed $500 in tokens.
Cause:   Complex research task with no budget limit.
         Agent kept gathering more context, expanding scope.

Symptom: Costs vary wildly between identical requests.
Cause:   No model routing. Using a flagship model for tasks a cheap model handles fine.
         No visibility into per-task costs.

Takeaway: the failures above trace back to three causes: an unbounded loop, a task with no budget, or the wrong model for the task. The patterns below close all three. Retry storms caused by a missing dedup key are covered in Part 1.


Why Agents Are Expensive

Agents aren’t just more LLM calls. They’re structurally more expensive.

| Factor | Simple Chain | Agent |
| --- | --- | --- |
| LLM calls per task | 1-3 | 5-50+ |
| Context size growth | None | Accumulates each turn |
| Retries | Rare | Common (external dependencies) |
| Tool outputs in context | Minimal | Large (file contents, API responses) |
| Loops | None | Yes (observe-think-act) |

Example cost breakdown:

COST COMPARISON: RAG vs AGENT
Simple RAG query:
1 embedding call:     $0.0001
1 completion call:    $0.01
Total:                $0.01

Agent research task:
5 planning calls:     $0.05
20 tool calls:        $0.20
10 analysis calls:    $0.10
3 retry loops:        $0.15
Total:                $0.50

50x more expensive for a single task. At scale, this compounds — and the per-task variance is what makes the bill unpredictable, not the average.

Takeaway: agents aren’t more LLM calls; they’re a different cost shape. Loop count × context size × retry rate is the real driver, not the per-call price.
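To make the context-accumulation part of that cost shape concrete, here is a minimal sketch that totals the input tokens billed when every turn re-sends the growing context. The function name and the token counts are illustrative assumptions, not measurements from a real agent.

```python
def cumulative_input_tokens(base_prompt: int, tokens_added_per_turn: int, turns: int) -> int:
    """Total input tokens billed across `turns` calls when each call re-sends
    the base prompt plus everything accumulated on earlier turns."""
    total = 0
    context = base_prompt
    for _ in range(turns):
        total += context                  # this call pays for the whole context so far
        context += tokens_added_per_turn  # tool output appended for the next turn
    return total


# Hypothetical numbers: 2,000-token prompt, 1,500 tokens appended per turn.
print(cumulative_input_tokens(2_000, 1_500, 10))  # 87500
print(cumulative_input_tokens(2_000, 1_500, 20))  # 325000
```

Under these assumptions, doubling the turn count from 10 to 20 multiplies billed input tokens by about 3.7, which is why loop count and context size matter more than the per-call price.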


Pattern 1: Token Budgets

Every task gets a budget. Exceed it, gracefully stop.

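import logging

logger = logging.getLogger(__name__)


class TokenBudgetExceeded(Exception):
    """Raised when a task has consumed its whole token budget."""

    def __init__(self, used, max, message):
        super().__init__(message)
        self.used = used
        self.max = max


class TokenBudget: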
    def __init__(self, max_tokens=50000, warn_at=0.8):
        self.max = max_tokens
        self.warn_threshold = warn_at
        self.used = 0
        self.warning_issued = False

    def consume(self, tokens):
        self.used += tokens

        if not self.warning_issued and self.used >= self.max * self.warn_threshold:
            self.warning_issued = True
            # Report actual usage, not the configured warning threshold.
            logger.warning(f"Token budget at {self.used}/{self.max} ({self.percentage_used:.0%} used)")

        if self.used >= self.max:
            raise TokenBudgetExceeded(
                used=self.used,
                max=self.max,
                message="Task exceeded token budget. Gracefully stopping."
            )

    @property
    def remaining(self):
        return max(0, self.max - self.used)

    @property
    def percentage_used(self):
        return self.used / self.max

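# Usage in agent (agent, llm, and agent_loop are placeholders for your own objects)
def run_task(agent, llm, prompt):
    budget = TokenBudget(max_tokens=100000)

    for step in agent_loop():
        try:
            response = llm.call(prompt)
            budget.consume(response.usage.total_tokens)
        except TokenBudgetExceeded:
            # Stop cleanly and hand back whatever partial result exists.
            return agent.graceful_shutdown("Budget exceeded")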

Budget Sizing Guidelines

| Task Type | Suggested Budget | Rationale |
| --- | --- | --- |
| Simple Q&A | 5,000 tokens | 1-2 turns max |
| Document analysis | 50,000 tokens | Large context, few turns |
| Research task | 100,000 tokens | Many tool calls, iteration |
| Code generation | 150,000 tokens | Multiple files, testing |
| Complex workflow | 500,000 tokens | Multi-step, human-in-loop |

Start conservative. Increase based on actual usage patterns, not guesses.
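One way to turn "actual usage patterns" into a number is the sizing rule this chapter gives under Common Gotchas: cap at p99 plus 50% headroom. The sketch below applies that rule to a list of observed per-task token counts. The helper names are hypothetical, and the nearest-rank percentile is one of several reasonable definitions.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[int], pct: float) -> int:
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("need at least one usage sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def suggest_budget(samples: Sequence[int], pct: float = 99, headroom: float = 0.5) -> int:
    """Suggested token cap: the chosen percentile of observed usage plus headroom."""
    return math.ceil(percentile(samples, pct) * (1 + headroom))


# Hypothetical usage log: 100 tasks that used 1,000 to 100,000 tokens.
observed = list(range(1_000, 101_000, 1_000))
print(percentile(observed, 99), suggest_budget(observed))  # 99000 148500
```

Re-run the calculation per task type as traffic changes. A cap derived from last quarter's usage may be too tight or too loose today.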

The parallel sub-agent race condition

The naïve TokenBudget above breaks the moment a coordinator fans out parallel sub-agents that all check budget.remaining > N before committing. Each sub-agent reads the same value, each decides it can afford to call the LLM, all of them spend simultaneously, and the cap is shattered before any of them update self.used. Fix: atomic check-and-commit, ideally via a single counter in Redis or your durable execution framework’s state:

class AtomicTokenBudget:
    """Atomic check-and-commit. Safe across parallel sub-agents."""

    def __init__(self, max_tokens: int, store, key: str = "budget:used"):
        self.max = max_tokens
        self.store = store  # Redis client or equivalent atomic-counter store
        self.key = key      # use one key per task (e.g. include the task ID) so tasks don't share a cap

    def reserve(self, tokens: int) -> bool:
        """Reserve tokens before the LLM call. Returns False if cap would be breached."""
        # INCRBY is atomic; if the new total exceeds the cap, roll back.
        new_total = self.store.incrby(self.key, tokens)
        if new_total > self.max:
            self.store.decrby(self.key, tokens)  # roll back the reservation
            return False
        return True

The rule: reserve before you spend. Reconcile actual usage against reservation after the LLM responds — small variance is fine; large variance signals a token-counting bug.
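The reserve-then-reconcile rule can be tried without Redis. The sketch below is an in-process stand-in that uses a `threading.Lock` in place of an atomic counter store. `InMemoryAtomicBudget`, `reconcile`, and `run_workers` are hypothetical names introduced for illustration, and the token figures are made up.

```python
import threading


class InMemoryAtomicBudget:
    """In-process stand-in for an atomic counter store (hypothetical helper)."""

    def __init__(self, max_tokens: int):
        self.max = max_tokens
        self.used = 0
        self._lock = threading.Lock()

    def reserve(self, estimate: int) -> bool:
        """Check and commit under one lock so no two callers see the same headroom."""
        with self._lock:
            if self.used + estimate > self.max:
                return False
            self.used += estimate
            return True

    def reconcile(self, estimate: int, actual: int) -> None:
        """Swap the reserved estimate for the real usage once the call returns."""
        with self._lock:
            self.used += actual - estimate


def run_workers(budget: InMemoryAtomicBudget, workers: int, estimate: int) -> int:
    """Fan out threads that each attempt one reservation; return how many were granted."""
    granted = []

    def worker():
        if budget.reserve(estimate):
            granted.append(1)  # list.append is thread-safe in CPython

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(granted)


budget = InMemoryAtomicBudget(max_tokens=100_000)
print(run_workers(budget, workers=8, estimate=30_000), budget.used)  # 3 90000
```

Eight workers race, but only three 30,000-token reservations fit under the 100,000 cap, whatever the thread ordering. Note that `reconcile` can push `used` past the cap when a call returns more tokens than estimated, so a conservative estimate (for example prompt tokens plus the configured maximum output tokens) is the safer choice.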

Takeaway: a budget that isn’t atomic isn’t a budget. The check-then-spend gap is where parallel agents blow through the cap.


Pattern 2: Circuit Breakers for Loops

Agents loop. Loops can run forever. Circuit breakers stop them.

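class LoopLimitExceeded(Exception):
    """Raised when the agent exceeds its total iteration limit."""


class StuckInLoop(Exception):
    """Raised when the agent repeats the same action too many times in a row."""


class LoopBreaker: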
    def __init__(self, max_iterations=25, max_same_action=3):
        self.max_iterations = max_iterations
        self.max_same_action = max_same_action
        self.iterations = 0
        self.action_history = []

    def check(self, action):
        self.iterations += 1
        self.action_history.append(action)

        # Too many total iterations (allow exactly max_iterations, raise on the next)
        if self.iterations > self.max_iterations:
            raise LoopLimitExceeded(
                f"Agent exceeded {self.max_iterations} iterations"
            )

        # Stuck in same action
        recent = self.action_history[-self.max_same_action:]
        if len(recent) == self.max_same_action and len(set(recent)) == 1:
            raise StuckInLoop(
                f"Agent repeated '{action}' {self.max_same_action} times"
            )

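# Usage (agent is a placeholder for your own agent object)
breaker = LoopBreaker(max_iterations=25, max_same_action=3)

done = False
while not done:
    action = agent.decide()
    breaker.check(action.type)  # Raises if stuck or over the iteration limit
    result = agent.execute(action)
    done = result.is_final  # assumed flag; use your own completion signal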

Loop Detection Strategies

| Strategy | Detects | Implementation |
| --- | --- | --- |
| Max iterations | Runaway loops | Counter, hard limit |
| Same action repeated | Stuck agent | Track last N actions |
| No progress | Spinning without results | Track state changes |
| Time limit | Slow infinite loops | Wall clock timeout |
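LoopBreaker covers the first two rows. The last two, no-progress detection and a wall-clock limit, could be sketched as below. `ProgressWatchdog`, its exception classes, and the idea of passing a "state fingerprint" (any hashable summary of what the agent has achieved so far) are assumptions made for illustration, not part of any framework.

```python
import time


class NoProgress(Exception):
    """Raised when the agent's state has not changed for too many steps."""


class TimeLimitExceeded(Exception):
    """Raised when the task has run longer than the wall-clock limit."""


_UNSET = object()  # sentinel so the first real fingerprint never counts as a repeat


class ProgressWatchdog:
    def __init__(self, max_stalled_steps=3, max_seconds=300.0, clock=time.monotonic):
        self.max_stalled_steps = max_stalled_steps
        self.max_seconds = max_seconds
        self._clock = clock            # injectable so tests can fake time
        self._started = clock()
        self._last_state = _UNSET
        self._stalled = 0

    def check(self, state_fingerprint) -> None:
        """Call once per loop turn with a summary of the agent's current state."""
        elapsed = self._clock() - self._started
        if elapsed > self.max_seconds:
            raise TimeLimitExceeded(
                f"Agent ran for {elapsed:.0f}s (limit {self.max_seconds:.0f}s)"
            )

        if state_fingerprint == self._last_state:
            self._stalled += 1
        else:
            self._stalled = 0
            self._last_state = state_fingerprint

        if self._stalled >= self.max_stalled_steps:
            raise NoProgress(
                f"State unchanged for {self._stalled} consecutive steps"
            )
```

A reasonable fingerprint is a tuple such as (number of facts gathered, number of files written). If the agent keeps acting but that tuple never changes, it is spending tokens without producing results.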

Takeaway: budgets cap spend. Circuit breakers cap time and repetition. You need both — an agent stuck in a fast loop can exhaust a generous budget in seconds.


Pattern 3: Model Routing

Use expensive models only when needed.

class ModelRouter:
    def __init__(self):
        # Model names are examples. Prices shift, so none are pinned here.
        self.models = {
            "simple": "gpt-4o-mini",      # small tier
            "standard": "gpt-4o",         # mid tier
            "complex": "claude-opus",     # flagship tier
        }

    def route(self, task):
        # Check the most expensive class first. Otherwise a complex task that
        # also sets requires_reasoning would stop at "standard".
        if task.type in ["code_review", "complex_research", "multi_step"]:
            return self.models["complex"]

        if task.type in ["clarification", "formatting", "simple_qa"]:
            return self.models["simple"]

        if task.requires_reasoning or task.type in ["analysis", "planning"]:
            return self.models["standard"]

        return self.models["standard"]  # Default

# Usage
router = ModelRouter()
model = router.route(current_task)
response = llm.call(model=model, prompt=prompt)

Model cost tiers (indicative, not authoritative)

Prices change quarterly; the numbers below are rough order-of-magnitude figures for mid-2026 to illustrate the tiering. Always read the current vendor pricing page before pinning numbers in production code [anthropic-pricing], [openai-pricing].

| Tier | Use for | Cost shape |
| --- | --- | --- |
| Small (Haiku/Mini-class) | Formatting, classification, simple Q&A | ~10–30× cheaper than flagship |
| Mid (Sonnet/standard-class) | Most reasoning + tool use, balanced | The default for production agents |
| Flagship (Opus/large-class) | Hard reasoning, code, long-horizon planning | 3–5× cost of mid-tier; reserve for the steps that need it |

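The math: a small-tier model that is 10× cheaper than flagship saves ~90% on every task routed to it. If 60% of your tasks can use the small tier, and those tasks previously ran on the flagship model with similar token counts, the total bill drops by roughly 0.6 × 0.9 ≈ 54%. That is why routing tends to pay back faster than the other cost-control patterns.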
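The same arithmetic generalizes to other traffic mixes and price gaps. The sketch below is a back-of-envelope estimate only: `routing_savings` is a hypothetical helper, and it assumes every task uses about the same number of tokens and that everything ran on the flagship tier before routing.

```python
def routing_savings(small_share: float, price_multiple: float) -> float:
    """Fraction of the all-flagship bill saved by routing `small_share` of tasks
    to a tier that is `price_multiple` times cheaper (equal tokens per task assumed)."""
    if not 0 <= small_share <= 1:
        raise ValueError("small_share must be between 0 and 1")
    if price_multiple < 1:
        raise ValueError("price_multiple must be at least 1")
    return small_share * (1 - 1 / price_multiple)


for multiple in (10, 30):
    print(f"{multiple}x cheaper small tier: {routing_savings(0.6, multiple):.0%} saved")
# 10x cheaper small tier: 54% saved
# 30x cheaper small tier: 58% saved
```

The savings are bounded by the share of traffic that can move, not by how cheap the small model is: going from 10× to 30× cheaper adds only four points, while raising the routable share adds far more.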

Takeaway: routing is the highest-leverage cost lever. Don’t hardcode prices — link to vendor pages or pull from a billing-API source-of-truth.


Pattern 4: Cost Tracking

You can’t control what you don’t measure.

class CostTracker:
    """Pricing is injected, never hardcoded. Load it from your billing-API
    source-of-truth at startup and refresh it hourly. Prices in price_table
    are expressed in dollars per 1k tokens."""

    def __init__(self, price_table: dict, alert_threshold: float = 10.0):
        # price_table: {"model-name": {"input": $/1k, "output": $/1k}}
        self.pricing = price_table
        self.total_cost = 0
        self.cost_by_model = {}
        self.cost_by_task_type = {}
        self.alert_threshold = alert_threshold

    def record(self, model, input_tokens, output_tokens, task_type=None):
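        # Unknown model: fail loudly rather than guess a price.
        if model not in self.pricing:
            raise KeyError(f"No pricing loaded for model '{model}'")
        pricing = self.pricing[model]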

        cost = (
            (input_tokens * pricing["input"] / 1000) +
            (output_tokens * pricing["output"] / 1000)
        )

        self.total_cost += cost
        self.cost_by_model[model] = self.cost_by_model.get(model, 0) + cost

        if task_type:
            self.cost_by_task_type[task_type] = (
                self.cost_by_task_type.get(task_type, 0) + cost
            )

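        # Alert once, on the call that crosses the threshold, not on every call after it.
        if self.total_cost - cost < self.alert_threshold <= self.total_cost:
            self.trigger_alert()

        return cost

    def trigger_alert(self):
        # `alert` stands in for your notification client (Slack, PagerDuty, ...).
        alert.send(
            channel="slack-finops",
            message=f"Agent cost alert: ${self.total_cost:.2f} exceeded threshold"
        )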

    def report(self):
        return {
            "total_cost": self.total_cost,
            "by_model": self.cost_by_model,
            "by_task_type": self.cost_by_task_type,
        }

Cost Attribution Dimensions

| Dimension | How to Track | Why It Matters |
| --- | --- | --- |
| Per request | Tag spans with request_id | Identify expensive requests |
| Per user | Tag with user_id | Fair billing, abuse detection |
| Per task type | Classify tasks | Optimize high-cost task types |
| Per model | Track model in each call | Validate routing effectiveness |
| Per feature | Feature flags on tasks | ROI by feature |

Takeaway: track 5 dimensions, alert on 2. Per-request + per-user are the load-bearing axes; the others are debugging tools.
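A minimal sketch of multi-dimension attribution follows, assuming each LLM call's cost has already been computed and arrives with whatever tags the caller knows. `CostLedger` and its method names are hypothetical. In production these totals would more likely live in your tracing or metrics backend than in process memory.

```python
from collections import defaultdict


class CostLedger:
    """Accumulates cost per tag value along each attribution dimension."""

    DIMENSIONS = ("request_id", "user_id", "task_type", "model", "feature")

    def __init__(self):
        self.totals = {dim: defaultdict(float) for dim in self.DIMENSIONS}

    def record(self, cost: float, **tags: str) -> None:
        """Add `cost` to every dimension the caller tagged, e.g. user_id="u1"."""
        for dim, value in tags.items():
            if dim not in self.totals:
                raise ValueError(f"unknown dimension: {dim}")
            self.totals[dim][value] += cost

    def top(self, dim: str, n: int = 3):
        """Most expensive tag values along one dimension, highest first."""
        ranked = sorted(self.totals[dim].items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:n]


ledger = CostLedger()
ledger.record(0.5, request_id="r1", user_id="u1", model="small")
ledger.record(2.0, request_id="r2", user_id="u1", model="flagship")
print(ledger.top("user_id"))   # [('u1', 2.5)]
print(ledger.top("model", 1))  # [('flagship', 2.0)]
```

Alerting on the two load-bearing axes then reduces to checking `top("request_id")` and `top("user_id")` against thresholds.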


Pattern 5: Max Step Limits

Hard limits prevent catastrophic runaway.

class AgentExecutor:
    def __init__(self, agent, max_steps=50, max_tool_calls=100):
        self.agent = agent  # must provide decide(), execute(), summarize_progress()
        self.max_steps = max_steps
        self.max_tool_calls = max_tool_calls

    def run(self, task):
        steps = 0
        tool_calls = 0

        while not task.is_complete():
            steps += 1

            if steps > self.max_steps:
                return self.force_completion(
                    task,
                    reason=f"Exceeded max steps ({self.max_steps})"
                )

            action = self.agent.decide(task)

            if action.is_tool_call:
                tool_calls += 1
                if tool_calls > self.max_tool_calls:
                    return self.force_completion(
                        task,
                        reason=f"Exceeded max tool calls ({self.max_tool_calls})"
                    )

            task = self.agent.execute(action)

        return task.result

    def force_completion(self, task, reason):
        logger.warning(f"Force completing task: {reason}")
        return self.agent.summarize_progress(task, interrupted=True)

Takeaway: max-step limits are the catastrophic-stop. Token budgets are the proportional-stop. You want both — and force_completion should return something, never nothing.


Alerting Strategy

# Example alerting rules

alerts:
  - name: high_cost_request
    condition: request_cost > $5
    severity: warning
    action: log_and_review

  - name: budget_exceeded
    condition: daily_cost > $100
    severity: critical
    action: page_oncall

  - name: runaway_loop
    condition: iterations > 30
    severity: critical
    action: kill_and_alert

  - name: cost_spike
    condition: hourly_cost > 3x_average
    severity: warning
    action: investigate

  - name: model_misrouting
    condition: expensive_model_on_simple_task
    severity: info
    action: log_for_review

Takeaway: alert on rate-of-change (cost spike vs hourly average) more than absolute value. Static thresholds catch yesterday’s bugs; rate-of-change catches today’s.
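The cost_spike rule above (hourly cost greater than 3x the average) can be sketched as a small rolling-window detector. `SpikeDetector`, the 24-hour window, and the minimum-history guard are illustrative assumptions. Most teams would express the same rule in their metrics system's alerting language instead.

```python
from collections import deque


class SpikeDetector:
    """Flags an hour whose cost exceeds `multiplier` times the trailing average."""

    def __init__(self, window_hours: int = 24, multiplier: float = 3.0, min_history: int = 6):
        self.history = deque(maxlen=window_hours)  # oldest hours fall off automatically
        self.multiplier = multiplier
        self.min_history = min_history             # no alerts until a baseline exists

    def observe(self, hourly_cost: float) -> bool:
        """Record one hour of spend; return True if it is a spike against the baseline."""
        is_spike = False
        if len(self.history) >= self.min_history:
            baseline = sum(self.history) / len(self.history)
            is_spike = baseline > 0 and hourly_cost > self.multiplier * baseline
        self.history.append(hourly_cost)
        return is_spike


detector = SpikeDetector(min_history=3)
for cost in (1.0, 1.0, 1.0, 2.5, 5.0):
    print(cost, detector.observe(cost))
```

In this run only the final hour is flagged: 5.0 exceeds three times the trailing average of 1.375. One design choice to be aware of: spike hours are added to the history, so a sustained incident gradually raises its own baseline. Pair this with a static daily cap so a slow ramp still trips an alert.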


What to Copy, What to Skip

| Pattern | Copy when | Skip when |
| --- | --- | --- |
| Per-task TokenBudget | Every production agent. Default. | Single-shot, single-LLM-call utility scripts |
| AtomicTokenBudget via Redis | Parallel sub-agent coordinators, fan-out workflows | Single-threaded agents: overhead without payoff |
| LoopBreaker (iterations + same-action) | Any agent that loops more than 3 turns | Linear pipelines with no decision branching |
| Model routing by task class | Mixed-difficulty workloads where the small tier handles ≥30% of traffic | Uniform-difficulty workloads (e.g. all-code-review), where routing overhead exceeds savings |
| CostTracker with injected pricing | Any agent with non-trivial spend | Sub-$10/mo workloads, where logging cost outweighs insight |
| Hardcoded model prices in client code | Never: prices change and the bill won’t match your dashboards | Always inject from billing API or pricing source-of-truth |
| Max-step + max-tool-call ceilings | Every production agent | n/a (always include) |
| Rate-of-change alerting (hourly spike vs avg) | Production traffic with stable baseline | Day-zero deployments, where no baseline exists yet |

Takeaway: almost every pattern has a “skip when”. The two exceptions are hard rules: always include max-step ceilings, and never hardcode pricing.


Common Gotchas

| Gotcha | Symptom | Fix |
| --- | --- | --- |
| No budget on dev | Works in dev, explodes in prod | Budget in all environments; dev budgets can be smaller, never absent |
| Budget too tight | Tasks fail legitimately | Monitor actual p95/p99 usage; size cap at p99 + 50% |
| No graceful shutdown | Task fails with no results | force_completion returns partial progress, never bare exception |
| Static routing | Over-using expensive models | Dynamic complexity classification; default to small tier, escalate on need |
| No per-user limits | One user burns budget for all | User-level quotas + per-user dashboards |
| Alerting too late | See bill at end of month | Real-time cost stream, hourly rate-of-change alerts |
| Hardcoded prices | Bills don’t match span-derived cost | Inject pricing from billing API or single source-of-truth table |
| Naïve check-then-spend budget | Cap shattered by parallel sub-agents | Atomic reserve-then-commit via Redis or durable execution framework state |

The Cost Control Checklist

Before deploying an agent:

COST CONTROL DEPLOYMENT CHECKLIST
TOKEN BUDGETS
[ ] Per-task budget defined
[ ] Warning at 80% threshold
[ ] Graceful shutdown when exceeded
[ ] Budget sizes based on actual usage data

LOOP PROTECTION
[ ] Max iterations limit
[ ] Same-action detection
[ ] Time limit as backstop
[ ] Progress tracking (no-op detection)

MODEL ROUTING
[ ] Task complexity classification
[ ] Model selection based on task
[ ] Default model is cost-efficient
[ ] Override for critical tasks

COST TRACKING
[ ] Per-request cost calculation
[ ] Per-user attribution
[ ] Per-task-type breakdown
[ ] Real-time dashboards

ALERTING
[ ] Per-request cost alerts
[ ] Daily budget alerts
[ ] Anomaly detection
[ ] Oncall escalation configured

Takeaway: ship none of these and a single bad agent run can absorb your monthly budget. Ship all five blocks and cost stops being a horror story — it becomes a flat line.


Closing

Cost control is three orthogonal mechanisms — budget caps the spend, loop-breakers cap the time, model routing caps the per-call rate. None substitute for the others. Together they turn an unbounded liability into a predictable line item. The next chapter shows how to detect when the agent is doing the wrong thing inside its budget — silent semantic failure, where the bill is fine but the result is wrong.

Testable question: for your top three agent task types, what’s the p95 cost per task, and at what cap would force_completion fire? If you can’t answer in seconds, the cost control is theatre.

Part 5: Observability & Silent Failures — catching the wrong-but-quiet failures the budget can’t see.

