Cost Control & Token Budgets - Preventing $10K Surprises | Intentional / Deliberate / Engineering

Prerequisite: This is Part 4 of the Production Agents Deep Dive series. Read Part 1: Idempotency first — retry storms are the most common driver of cost surprises.

Figure: Cost control, from runaway loop to bounded loop. Left half: an agent runs an unbounded observe-think-act loop while a token meter races past 100k and a billing graph spikes to $10,000; a frozen red panel reads 'budget exceeded overnight.' Right half: the same loop is wrapped by three gates (a token budget meter with a 50k cap, a circuit breaker limited to 25 iterations, and a model router sending simple tasks to a cheap model) and the bill plateaus at $48. A divider in the middle reads 'unbounded loop → bounded loop.'

Why This Matters

Your agent enters a loop. Loop calls LLM. LLM responds. Loop continues. You wake up to a $10K bill.

Three structural reasons agents burn tokens harder than chains: they loop (observe-think-act repeats until “done”), they accumulate context (each turn pushes tool outputs onto the prompt), and they retry external calls under failure. Without bounds, any one of those three can run for hours.

What other content gets wrong: framework docs show max_iterations=10 and call it cost control. They never show the parallel sub-agent budget race — a coordinator that fans out N workers, each checking the same shared budget without atomic check-and-commit, blowing through the cap before the first worker reports back. They show pricing tables that go stale within a quarter and hardcode them into client code. This chapter pins the patterns that actually scale: atomic budget accounting, version-pinned pricing, and graceful partial-result shutdown when the cap fires mid-task.

Takeaway: cost control isn’t about being cheap. It’s about making the bill predictable. An agent that occasionally costs $50 is fine; an agent that occasionally costs $10,000 is a resignation letter.

What Goes Wrong Without This:

COST CONTROL FAILURE PATTERNS

Symptom: Monthly API bill 10x higher than expected.
Cause:   Agent retry loop when external API was down.
         No circuit breaker. Kept calling LLM for 6 hours.

Symptom: Single user task consumed $500 in tokens.
Cause:   Complex research task with no budget limit.
         Agent kept gathering more context, expanding scope.

Symptom: Costs vary wildly between identical requests.
Cause:   No model routing. Using a flagship model for tasks a cheap model handles fine.
         No visibility into per-task costs.

Takeaway: every $10K bill traces back to one of three causes — unbounded loop, missing dedup key, or wrong model for the task. The patterns below close all three.

Why Agents Are Expensive

Agents aren’t just more LLM calls. They’re structurally more expensive.

Factor	Simple Chain	Agent
LLM calls per task	1-3	5-50+
Context size growth	None	Accumulates each turn
Retries	Rare	Common (external dependencies)
Tool outputs in context	Minimal	Large (file contents, API responses)
Loops	None	Yes (observe-think-act)

Example cost breakdown:

COST COMPARISON: RAG vs AGENT

Simple RAG query:
  1 embedding call:     $0.0001
  1 completion call:    $0.01
  Total:                $0.01

Agent research task:
  5 planning calls:     $0.05
  20 tool calls:        $0.20
  10 analysis calls:    $0.10
  3 retry loops:        $0.15
  Total:                $0.50

50x more expensive for a single task. At scale, this compounds — and the per-task variance is what makes the bill unpredictable, not the average.

Takeaway: agents aren’t more LLM calls; they’re a different cost shape. Loop count × context size × retry rate is the real driver, not the per-call price.
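To make the context-size factor concrete, here is a back-of-the-envelope sketch. It assumes every turn re-sends the full history and appends a fixed number of tokens; the function name and the numbers are illustrative, not measurements.

```python
def cumulative_input_tokens(turns: int, base_prompt: int, added_per_turn: int) -> int:
    """Total input tokens billed across a loop that re-sends its full history.

    Turn i (0-indexed) sends the base prompt plus everything appended so far.
    """
    return sum(base_prompt + i * added_per_turn for i in range(turns))


one_turn = cumulative_input_tokens(turns=1, base_prompt=1_000, added_per_turn=2_000)
twenty_turns = cumulative_input_tokens(turns=20, base_prompt=1_000, added_per_turn=2_000)
print(one_turn, twenty_turns)  # 1000 400000
```

Under these assumptions, twenty turns bill 400 times the input of one turn rather than 20 times, because billed input grows roughly with the square of the turn count.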

Pattern 1: Token Budgets

Every task gets a budget. Exceed it, gracefully stop.

import logging

logger = logging.getLogger(__name__)


class TokenBudgetExceeded(Exception):
    """Raised when a task has used up its token budget."""

    def __init__(self, used, max, message):
        super().__init__(f"{message} ({used}/{max} tokens)")
        self.used = used
        self.max = max


class TokenBudget:
    def __init__(self, max_tokens=50000, warn_at=0.8):
        self.max = max_tokens
        self.warn_threshold = warn_at
        self.used = 0
        self.warning_issued = False

    def consume(self, tokens):
        """Record tokens already spent; warn once, raise when the cap is reached."""
        self.used += tokens

        if not self.warning_issued and self.used >= self.max * self.warn_threshold:
            self.warning_issued = True
            logger.warning(f"Token budget at {self.used}/{self.max} ({self.percentage_used:.0%})")

        if self.used >= self.max:
            raise TokenBudgetExceeded(
                used=self.used,
                max=self.max,
                message="Task exceeded token budget. Gracefully stopping."
            )

    @property
    def remaining(self):
        return max(0, self.max - self.used)

    @property
    def percentage_used(self):
        return self.used / self.max

# Usage in agent (agent, llm, and agent_loop are placeholders for your own objects)
def run_task(agent, llm, prompt):
    budget = TokenBudget(max_tokens=100000)

    for step in agent_loop():
        try:
            response = llm.call(prompt)
            budget.consume(response.usage.total_tokens)
        except TokenBudgetExceeded:
            return agent.graceful_shutdown("Budget exceeded")

Budget Sizing Guidelines

Task Type	Suggested Budget	Rationale
Simple Q&A	5,000 tokens	1-2 turns max
Document analysis	50,000 tokens	Large context, few turns
Research task	100,000 tokens	Many tool calls, iteration
Code generation	150,000 tokens	Multiple files, testing
Complex workflow	500,000 tokens	Multi-step, human-in-loop

Start conservative. Increase based on actual usage patterns, not guesses.
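One way to turn usage data into a cap is the p99-plus-headroom rule listed under Common Gotchas. The sketch below assumes you already log total tokens per completed task; `suggest_budget` and the sample values are hypothetical, and the percentile method is one reasonable choice among several.

```python
import math
import statistics


def suggest_budget(observed_tokens: list[int], headroom: float = 0.5) -> int:
    """Suggest a per-task cap: p99 of observed usage plus fractional headroom."""
    if len(observed_tokens) < 2:
        raise ValueError("need at least two observations to estimate a percentile")
    # quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile.
    p99 = statistics.quantiles(observed_tokens, n=100, method="inclusive")[98]
    return math.ceil(p99 * (1 + headroom))


# Hypothetical per-task token totals pulled from logs.
samples = [12_000, 15_500, 9_800, 40_200, 18_300, 22_100, 11_700, 16_900]
print(suggest_budget(samples))
```

With only a handful of samples the p99 sits close to the maximum, so treat the result as a starting point and recompute it as more tasks complete.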

The parallel sub-agent race condition

The naïve TokenBudget above breaks the moment a coordinator fans out parallel sub-agents that all check budget.remaining > N before committing. Each sub-agent reads the same value, each decides it can afford to call the LLM, all of them spend simultaneously, and the cap is shattered before any of them update self.used. Fix: atomic check-and-commit, ideally via a single counter in Redis or your durable execution framework’s state:

class AtomicTokenBudget:
    """Atomic check-and-commit. Safe across parallel sub-agents."""

    def __init__(self, max_tokens: int, store, key: str = "budget:used"):
        self.max = max_tokens
        self.store = store  # Redis client or equivalent atomic-counter store
        self.key = key      # use one key per task so tasks do not share a counter

    def reserve(self, tokens: int) -> bool:
        """Reserve tokens before the LLM call. Returns False if cap would be breached."""
        # INCRBY is atomic; if the new total exceeds the cap, roll back.
        # A concurrent caller may be refused during the brief overshoot,
        # but the committed total never exceeds the cap.
        new_total = self.store.incrby(self.key, tokens)
        if new_total > self.max:
            self.store.decrby(self.key, tokens)  # roll back the reservation
            return False
        return True

The rule: reserve before you spend. Reconcile actual usage against reservation after the LLM responds — small variance is fine; large variance signals a token-counting bug.
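The reserve-then-reconcile rule can be sketched end to end without a Redis server. Everything here is an assumption for illustration: `InMemoryCounterStore` is a lock-guarded stand-in that mimics the `incrby`/`decrby` calls, and `ReconcilingBudget.reconcile` is a hypothetical helper, not part of any library.

```python
import threading


class InMemoryCounterStore:
    """Stand-in for Redis: atomic incrby/decrby guarded by a lock (illustration only)."""

    def __init__(self):
        self._values = {}
        self._lock = threading.Lock()

    def incrby(self, key: str, amount: int) -> int:
        with self._lock:
            self._values[key] = self._values.get(key, 0) + amount
            return self._values[key]

    def decrby(self, key: str, amount: int) -> int:
        return self.incrby(key, -amount)


class ReconcilingBudget:
    """Reserve an estimate before the call, then settle to actual usage after."""

    def __init__(self, max_tokens: int, store, key: str = "budget:used"):
        self.max = max_tokens
        self.store = store
        self.key = key

    def reserve(self, estimate: int) -> bool:
        new_total = self.store.incrby(self.key, estimate)
        if new_total > self.max:
            self.store.decrby(self.key, estimate)  # roll back; cap would be breached
            return False
        return True

    def reconcile(self, estimate: int, actual: int) -> int:
        """Replace the reservation with actual usage; return drift (actual - estimate)."""
        drift = actual - estimate
        if drift:
            self.store.incrby(self.key, drift)  # negative drift releases unused tokens
        return drift


store = InMemoryCounterStore()
budget = ReconcilingBudget(max_tokens=1_000, store=store)

# Ten parallel workers each try to reserve 300 tokens against a 1,000-token cap.
results = []

def worker():
    results.append(budget.reserve(300))

threads = [threading.Thread(target=worker) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

granted = sum(results)  # only three reservations fit under the cap
drift = budget.reconcile(estimate=300, actual=240)  # one worker used less than reserved
```

Logging the drift per call gives you the variance signal described above: consistently large values point at a token-counting bug in the estimate.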

Takeaway: a budget that isn’t atomic isn’t a budget. The check-then-spend gap is where parallel agents blow through the cap.

Pattern 2: Circuit Breakers for Loops

Agents loop. Loops can run forever. Circuit breakers stop them.

class LoopLimitExceeded(Exception):
    """Raised when the agent runs more iterations than allowed."""


class StuckInLoop(Exception):
    """Raised when the agent repeats the same action back to back."""


class LoopBreaker:
    def __init__(self, max_iterations=25, max_same_action=3):
        self.max_iterations = max_iterations
        self.max_same_action = max_same_action
        self.iterations = 0
        self.action_history = []

    def check(self, action):
        self.iterations += 1
        self.action_history.append(action)

        # Too many total iterations: max_iterations are allowed, the next one trips
        if self.iterations > self.max_iterations:
            raise LoopLimitExceeded(
                f"Agent exceeded {self.max_iterations} iterations"
            )

        # Stuck in same action
        recent = self.action_history[-self.max_same_action:]
        if len(recent) == self.max_same_action and len(set(recent)) == 1:
            raise StuckInLoop(
                f"Agent repeated '{action}' {self.max_same_action} times"
            )

# Usage (agent is a placeholder for your own agent object)
breaker = LoopBreaker(max_iterations=25, max_same_action=3)

done = False
while not done:
    action = agent.decide()
    breaker.check(action.type)  # Raises if stuck
    result = agent.execute(action)
    done = result.is_final  # placeholder: however your agent signals completion

Loop Detection Strategies

Strategy	Detects	Implementation
Max iterations	Runaway loops	Counter, hard limit
Same action repeated	Stuck agent	Track last N actions
No progress	Spinning without results	Track state changes
Time limit	Slow infinite loops	Wall clock timeout
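The last two rows of the table are not covered by LoopBreaker. A minimal sketch of both follows, assuming the agent can produce a cheap fingerprint of its working state (for example a hash of its notes). `ProgressWatchdog`, its exceptions, and the default limits are hypothetical names and values chosen for illustration.

```python
import time


class NoProgress(Exception):
    """Raised when the agent's state stops changing."""


class TimeLimitExceeded(Exception):
    """Raised when the wall-clock limit has passed."""


_UNSET = object()  # sentinel so a first fingerprint of None is not treated as a repeat


class ProgressWatchdog:
    """Raises when state stops changing or the wall-clock limit passes."""

    def __init__(self, max_stalled_steps=3, max_seconds=300.0, clock=time.monotonic):
        self.max_stalled_steps = max_stalled_steps
        self.max_seconds = max_seconds
        self.clock = clock  # injectable so the timeout can be tested without sleeping
        self.started_at = clock()
        self.last_fingerprint = _UNSET
        self.stalled_steps = 0

    def check(self, state_fingerprint):
        elapsed = self.clock() - self.started_at
        if elapsed > self.max_seconds:
            raise TimeLimitExceeded(
                f"Agent ran {elapsed:.0f}s (limit {self.max_seconds:.0f}s)"
            )

        if state_fingerprint == self.last_fingerprint:
            self.stalled_steps += 1
        else:
            self.stalled_steps = 0
            self.last_fingerprint = state_fingerprint

        if self.stalled_steps >= self.max_stalled_steps:
            raise NoProgress(
                f"State unchanged for {self.stalled_steps} consecutive steps"
            )


watchdog = ProgressWatchdog(max_stalled_steps=2, max_seconds=60.0)
watchdog.check(hash(("notes", 1)))  # state changed
watchdog.check(hash(("notes", 1)))  # unchanged once, still allowed
```

Calling `check` once per loop iteration, next to the iteration and same-action checks, covers all four strategies in the table.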

Takeaway: budgets cap spend. Circuit breakers cap time and repetition. You need both — an agent stuck in a fast loop can exhaust a generous budget in seconds.

Pattern 3: Model Routing

Use expensive models only when needed.

class ModelRouter:
    def __init__(self):
        # Model names and prices are illustrative; check current vendor pricing.
        self.models = {
            "simple": "gpt-4o-mini",      # ~$0.15/1M input
            "standard": "gpt-4o",         # ~$5/1M input
            "complex": "claude-opus",     # ~$15/1M input
        }

    def route(self, task):
        # Classify task complexity, most specific rules first
        if task.type in ["clarification", "formatting", "simple_qa"]:
            return self.models["simple"]

        # Checked before the generic reasoning rule; otherwise a code review
        # flagged requires_reasoning would never reach the complex tier.
        if task.type in ["code_review", "complex_research", "multi_step"]:
            return self.models["complex"]

        if task.requires_reasoning or task.type in ["analysis", "planning"]:
            return self.models["standard"]

        return self.models["standard"]  # Default

# Usage (current_task, llm, and prompt are placeholders for your own objects)
router = ModelRouter()
model = router.route(current_task)
response = llm.call(model=model, prompt=prompt)

Model cost tiers (indicative, not authoritative)

Prices change quarterly; the numbers below are rough order-of-magnitude figures for mid-2026 to illustrate the tiering. Always read the current vendor pricing page before pinning numbers in production code [anthropic-pricing], [openai-pricing].

Tier	Use for	Cost shape
Small (Haiku/Mini-class)	Formatting, classification, simple Q&A	~10–30× cheaper than flagship
Mid (Sonnet/standard-class)	Most reasoning + tool use, balanced	The default for production agents
Flagship (Opus/large-class)	Hard reasoning, code, long-horizon planning	3–5× cost of mid-tier; reserve for the steps that need it

The math: if 60% of your tasks can run on a small tier that is roughly 10× cheaper, those tasks cost about 90% less, and the saving on the total bill scales with their share of token volume. Routing pays back faster than any other cost-control pattern.
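The effect on the whole bill can be worked out in a few lines. This sketch assumes the baseline sends everything to the flagship tier, that routed tasks carry a proportional share of token volume, and that the small tier costs one tenth of the flagship price; `blended_cost_ratio` is an illustrative helper.

```python
def blended_cost_ratio(small_share: float, small_price_ratio: float) -> float:
    """Cost after routing, as a fraction of the all-flagship cost.

    small_share: fraction of token volume moved to the small tier (0..1).
    small_price_ratio: small-tier price divided by flagship price.
    """
    return small_share * small_price_ratio + (1 - small_share)


ratio = blended_cost_ratio(small_share=0.6, small_price_ratio=0.1)
print(f"{1 - ratio:.0%} total savings")  # 54% total savings
```

The 90% figure applies only to the routed tasks; the remaining 40% of volume still pays full price, which is why the blended saving lands near 54%.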

Takeaway: routing is the highest-leverage cost lever. Don’t hardcode prices — link to vendor pages or pull from a billing-API source-of-truth.

Pattern 4: Cost Tracking

You can’t control what you don’t measure.

import logging

logger = logging.getLogger(__name__)


class CostTracker:
    """Pricing is injected, never hardcoded. Load it from your billing-API
    source-of-truth at startup and refresh hourly."""

    def __init__(self, price_table: dict, alert_threshold: float = 10.0, notify=None):
        # price_table: {"model-name": {"input": $/1k tokens, "output": $/1k tokens}}
        self.pricing = price_table
        self.total_cost = 0.0
        self.cost_by_model = {}
        self.cost_by_task_type = {}
        self.alert_threshold = alert_threshold
        # notify: any callable taking a message string, e.g. a wrapper that
        # posts to a slack-finops channel. Defaults to a log warning.
        self.notify = notify or logger.warning
        self.alert_sent = False

    def record(self, model, input_tokens, output_tokens, task_type=None):
        pricing = self.pricing.get(model)
        if pricing is None:
            # Unknown model: use a placeholder rate and log it, so the gap
            # in the price table gets noticed instead of silently guessed.
            logger.warning("No pricing entry for model %r; using fallback rate", model)
            pricing = {"input": 0.01, "output": 0.03}

        cost = (
            (input_tokens * pricing["input"] / 1000) +
            (output_tokens * pricing["output"] / 1000)
        )

        self.total_cost += cost
        self.cost_by_model[model] = self.cost_by_model.get(model, 0) + cost

        if task_type:
            self.cost_by_task_type[task_type] = (
                self.cost_by_task_type.get(task_type, 0) + cost
            )

        # Alert once when the threshold is crossed, not on every later call
        if self.total_cost >= self.alert_threshold and not self.alert_sent:
            self.alert_sent = True
            self.trigger_alert()

        return cost

    def trigger_alert(self):
        self.notify(f"Agent cost alert: ${self.total_cost:.2f} exceeded threshold")

    def report(self):
        return {
            "total_cost": self.total_cost,
            "by_model": self.cost_by_model,
            "by_task_type": self.cost_by_task_type,
        }

Cost Attribution Dimensions

Dimension	How to Track	Why It Matters
Per request	Tag spans with request_id	Identify expensive requests
Per user	Tag with user_id	Fair billing, abuse detection
Per task type	Classify tasks	Optimize high-cost task types
Per model	Track model in each call	Validate routing effectiveness
Per feature	Feature flags on tasks	ROI by feature

Takeaway: track 5 dimensions, alert on 2. Per-request + per-user are the load-bearing axes; the others are debugging tools.
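Attribution along several dimensions comes down to storing each cost with its tags and grouping later. The sketch below keeps entries in memory for illustration; in production the same grouping would more likely run as a query over span data. `CostLedger` and the tag values are hypothetical.

```python
from collections import defaultdict


class CostLedger:
    """Stores tagged cost entries and aggregates them along any dimension."""

    def __init__(self):
        self.entries = []

    def record(self, cost: float, **tags: str) -> None:
        self.entries.append((cost, tags))

    def total_by(self, dimension: str) -> dict:
        totals = defaultdict(float)
        for cost, tags in self.entries:
            # Entries missing the tag are grouped so gaps in tagging stay visible.
            totals[tags.get(dimension, "untagged")] += cost
        return dict(totals)


ledger = CostLedger()
ledger.record(0.25, request_id="req-1", user_id="alice", model="small")
ledger.record(1.50, request_id="req-2", user_id="bob", model="flagship")
ledger.record(0.75, request_id="req-3", user_id="alice", model="small")
print(ledger.total_by("user_id"))  # {'alice': 1.0, 'bob': 1.5}
```

Tagging at record time keeps the five dimensions cheap to add: a new dimension is one more keyword argument, not a new tracker.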

Pattern 5: Max Step Limits

Hard limits prevent catastrophic runaway.

import logging

logger = logging.getLogger(__name__)


class AgentExecutor:
    def __init__(self, agent, max_steps=50, max_tool_calls=100):
        self.agent = agent  # must provide decide(), execute(), summarize_progress()
        self.max_steps = max_steps
        self.max_tool_calls = max_tool_calls

    def run(self, task):
        steps = 0
        tool_calls = 0

        while not task.is_complete():
            steps += 1

            if steps > self.max_steps:
                return self.force_completion(
                    task,
                    reason=f"Exceeded max steps ({self.max_steps})"
                )

            action = self.agent.decide(task)

            if action.is_tool_call:
                tool_calls += 1
                if tool_calls > self.max_tool_calls:
                    return self.force_completion(
                        task,
                        reason=f"Exceeded max tool calls ({self.max_tool_calls})"
                    )

            # execute() returns the updated task state
            task = self.agent.execute(action)

        return task.result

    def force_completion(self, task, reason):
        # Return partial progress rather than raising, so the caller gets something
        logger.warning(f"Force completing task: {reason}")
        return self.agent.summarize_progress(task, interrupted=True)

Takeaway: max-step limits are the catastrophic-stop. Token budgets are the proportional-stop. You want both — and force_completion should return something, never nothing.

Alerting Strategy

# Example alerting rules

alerts:
  - name: high_cost_request
    condition: request_cost > $5
    severity: warning
    action: log_and_review

  - name: budget_exceeded
    condition: daily_cost > $100
    severity: critical
    action: page_oncall

  - name: runaway_loop
    condition: iterations > 30
    severity: critical
    action: kill_and_alert

  - name: cost_spike
    condition: hourly_cost > 3x_average
    severity: warning
    action: investigate

  - name: model_misrouting
    condition: expensive_model_on_simple_task
    severity: info
    action: log_for_review

Takeaway: alert on rate-of-change (cost spike vs hourly average) more than absolute value. Static thresholds catch yesterday’s bugs; rate-of-change catches today’s.
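The cost_spike rule can be sketched as a rolling comparison against recent hours. This is an illustration under assumptions: hourly cost totals arrive from your own metrics pipeline, and `SpikeDetector`, the 24-hour window, and the minimum-history guard are hypothetical choices.

```python
from collections import deque


class SpikeDetector:
    """Flags an hour whose cost exceeds `multiplier` times the trailing average."""

    def __init__(self, window_hours: int = 24, multiplier: float = 3.0, min_history: int = 6):
        self.history = deque(maxlen=window_hours)
        self.multiplier = multiplier
        self.min_history = min_history  # no alerts until a baseline exists

    def observe(self, hourly_cost: float) -> bool:
        is_spike = False
        if len(self.history) >= self.min_history:
            baseline = sum(self.history) / len(self.history)
            is_spike = baseline > 0 and hourly_cost > self.multiplier * baseline
        # Spikes are kept out of the baseline so one bad hour does not mask the next.
        if not is_spike:
            self.history.append(hourly_cost)
        return is_spike


detector = SpikeDetector(min_history=3)
flags = [detector.observe(cost) for cost in [2.0, 2.2, 1.8, 2.1, 9.5, 2.0]]
print(flags)  # [False, False, False, False, True, False]
```

Excluding spikes from the baseline is a trade-off: it keeps the detector sensitive during an incident, but a legitimate step change in traffic will keep alerting until the baseline is reset by hand.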

What to Copy, What to Skip

Pattern	Copy when	Skip when
Per-task `TokenBudget`	Every production agent. Default.	Single-shot, single-LLM-call utility scripts
`AtomicTokenBudget` via Redis	Parallel sub-agent coordinators, fan-out workflows	Single-threaded agents — overhead without payoff
`LoopBreaker` (iterations + same-action)	Any agent that loops more than 3 turns	Linear pipelines with no decision branching
Model routing by task class	Mixed-difficulty workloads where the small tier handles ≥30% of traffic	Uniform-difficulty workloads (e.g. all-code-review) — routing overhead exceeds savings
`CostTracker` with injected pricing	Any agent with non-trivial spend	Sub-$10/mo workloads — logging cost outweighs insight
Hardcoded model prices in client code	Never — prices change; the bill won’t match your dashboards	Always inject from billing API or pricing source-of-truth
Max-step + max-tool-call ceilings	Every production agent	n/a — always include
Rate-of-change alerting (hourly spike vs avg)	Production traffic with stable baseline	Day-zero deployments — baseline doesn’t exist yet

Takeaway: nearly every pattern has a “skip when”. Two rows are hard rules: always set max-step and max-tool-call ceilings, and never hardcode pricing.

Common Gotchas

Gotcha	Symptom	Fix
No budget on dev	Works in dev, explodes in prod	Budget in all environments; dev budgets can be smaller, never absent
Budget too tight	Tasks fail legitimately	Monitor actual p95/p99 usage; size cap at p99 + 50%
No graceful shutdown	Task fails with no results	`force_completion` returns partial progress, never bare exception
Static routing	Over-using expensive models	Dynamic complexity classification; default to small tier, escalate on need
No per-user limits	One user burns budget for all	User-level quotas + per-user dashboards
Alerting too late	See bill at end of month	Real-time cost stream, hourly rate-of-change alerts
Hardcoded prices	Bills don’t match span-derived cost	Inject pricing from billing API or single source-of-truth table
Naïve `check-then-spend` budget	Cap shattered by parallel sub-agents	Atomic reserve-then-commit via Redis or durable execution framework state

The Cost Control Checklist

Before deploying an agent:

COST CONTROL DEPLOYMENT CHECKLIST

TOKEN BUDGETS
[ ] Per-task budget defined
[ ] Warning at 80% threshold
[ ] Graceful shutdown when exceeded
[ ] Budget sizes based on actual usage data

LOOP PROTECTION
[ ] Max iterations limit
[ ] Same-action detection
[ ] Time limit as backstop
[ ] Progress tracking (no-op detection)

MODEL ROUTING
[ ] Task complexity classification
[ ] Model selection based on task
[ ] Default model is cost-efficient
[ ] Override for critical tasks

COST TRACKING
[ ] Per-request cost calculation
[ ] Per-user attribution
[ ] Per-task-type breakdown
[ ] Real-time dashboards

ALERTING
[ ] Per-request cost alerts
[ ] Daily budget alerts
[ ] Anomaly detection
[ ] Oncall escalation configured

Takeaway: ship none of these and a single bad agent run can absorb your monthly budget. Ship all five blocks and cost stops being a horror story — it becomes a flat line.

Closing

Cost control is three orthogonal mechanisms — budget caps the spend, loop-breakers cap the time, model routing caps the per-call rate. None substitute for the others. Together they turn an unbounded liability into a predictable line item. The next chapter shows how to detect when the agent is doing the wrong thing inside its budget — silent semantic failure, where the bill is fine but the result is wrong.

Testable question: for your top three agent task types, what’s the p95 cost per task, and at what cap would force_completion fire? If you can’t answer in seconds, the cost control is theatre.

→ Part 5: Observability & Silent Failures — catching the wrong-but-quiet failures the budget can’t see.

References

[anthropic-pricing] Anthropic — API Pricing. anthropic.com/pricing. Source of truth for current per-1k-token rates by model. Always read live before pinning numbers in production code.
[openai-pricing] OpenAI — API Pricing. openai.com/api/pricing. Same source-of-truth pattern for OpenAI models.
[aws-jitter] Marc Brooker (AWS Architecture Blog) — Exponential Backoff and Jitter, 2015-03-04. aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter. Background for why naïve retry storms compound into cost incidents.
[pa-idempotency] Production Agents — Part 1: Idempotency & Safe Retries. The retry-storm pattern that turns a transient API outage into a 10× billing event.
[pa-state] Production Agents — Part 2: State Persistence & Checkpointing. The atomic-state pattern that makes AtomicTokenBudget work across parallel sub-agents.
[pa-observability] Production Agents — Part 5: Observability & Silent Failures. FinOps-from-span-data — cost calculation belongs in the query layer over OTel spans, not hardcoded in agent code.