Prerequisite: This is Part 1 of the Production Agents Deep Dive series. Start with Part 0: Overview for context.
Same retry pattern, two outcomes. The idempotency gate is the difference between a charged-back customer and a non-event.
Why This Matters
Your agent calls book_flight(). The API takes 35 seconds to respond. Your timeout is 30 seconds. Agent retries. API processed both requests. Customer is charged twice.
This isn’t a bug. This is correct retry logic meeting real-world latency.
Idempotency is the single most critical production requirement for agents that perform actions with side effects. Without it, retries create duplicates — double bookings, duplicate emails, corrupted state.
What other content gets wrong: tutorials show retry=3 and call it done. The Stripe docs cover idempotency keys but stop short of key derivation across retries. The AWS Builders’ Library [aws-builders-retries] explains why retries amplify upstream stress; Brooker’s companion post [aws-jitter] gives the jitter math — neither covers the agent-loop case. This chapter ties all three together — and shows what happens when your idempotency key collides across users or your circuit breaker opens mid-cascade.
Takeaway: retry logic without idempotency keys is a charge-customers-twice machine.
What Goes Wrong Without This:
Symptom: Customer charged twice for the same order.
Cause: Payment API timed out. Agent retried. Both charges processed. No idempotency key to deduplicate.

Symptom: User receives 47 copies of the same email.
Cause: Email send succeeded but response was slow. Agent assumed failure. Retried. No deduplication on sends.

Symptom: Database has duplicate records with slight variations.
Cause: INSERT succeeded, network dropped response. Retry created second record. No upsert or idempotency check.
Takeaway: every failure mode here is a retry meeting a missing dedup key. Stripe’s idempotency-keys design [stripe-idempotency] exists precisely to break this loop.
What Idempotency Means
Idempotent: An operation that produces the same result when called multiple times with the same input.
Idempotent:
  GET /user/123          → Same user every time (safe to retry)
  DELETE /file/abc       → File deleted, stays deleted (safe to retry)
  PUT /user/123 {name}   → User updated to same value (safe to retry)

Not Idempotent:
  POST /charge/$100      → New charge every time (dangerous to retry)
  POST /email/send       → New email every time (dangerous to retry)
  INSERT INTO orders     → New row every time (dangerous to retry)

Made Idempotent:
  POST /charge/$100 + idempotency_key=xyz123   → Same charge on retry
  POST /email/send + message_id=abc456         → Same email, no duplicate
  INSERT ... ON CONFLICT DO NOTHING            → Same row, no duplicate
Takeaway: idempotency isn’t a property of the HTTP verb — it’s a property of how you derive the dedup key.
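To make the "Made Idempotent" rows concrete, here is a minimal sketch using an in-memory SQLite table. The orders schema, the place_order helper, and the key string are assumptions for illustration only. INSERT OR IGNORE is SQLite's long-standing spelling of roughly the same behaviour as ON CONFLICT DO NOTHING.

```python
import sqlite3

def place_order(conn, idempotency_key, item):
    """Insert an order once per idempotency key; a replay is a no-op."""
    # The PRIMARY KEY on idempotency_key is what makes the retry safe:
    # a second insert with the same key is skipped instead of adding a row.
    conn.execute(
        "INSERT OR IGNORE INTO orders (idempotency_key, item) VALUES (?, ?)",
        (idempotency_key, item),
    )
    conn.commit()
    row = conn.execute(
        "SELECT item FROM orders WHERE idempotency_key = ?",
        (idempotency_key,),
    ).fetchone()
    return row[0]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (idempotency_key TEXT PRIMARY KEY, item TEXT)"
)

key = "user42:task7:create_order"  # hypothetical stable key
place_order(conn, key, "flight LH123")
place_order(conn, key, "flight LH123")  # simulated retry of the same operation
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(order_count)  # → 1
```

The retry still returns the stored order, which is the property a caller needs: the same answer, without a second side effect.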
The Stripe Pattern
Stripe processes millions of payments. They can’t afford duplicates. Their pattern is the industry standard [stripe-idempotency]:
- Idempotency key in the Idempotency-Key HTTP header on POST requests
- Up to 255 characters; UUIDv4 or random string with enough entropy
- Stripe stores the response status + body for at least 24 hours; same key in that window replays the stored result, including 500 errors
- Avoid using emails or personal identifiers as keys, because they leak into logs
from datetime import timedelta

# Client generates a unique key
def book_flight(flight_id, user_id, task_id, step_id):
    # Key must be STABLE across retries
    # Bad:  f"{user_id}:{timestamp}"         - different each retry
    # Bad:  f"{user_id}:{retry_count}"       - different each retry
    # Good: f"{user_id}:{task_id}:{step_id}" - same across retries
    idempotency_key = f"{user_id}:{task_id}:{step_id}"
    return api.book(
        flight_id=flight_id,
        idempotency_key=idempotency_key,
    )

# Server checks and stores
def handle_booking(request):
    key = request.idempotency_key
    # Check if we've processed this before
    cached = cache.get(key)
    if cached is not None:  # a falsy but valid result must still be replayed
        return cached  # Return stored result, don't reprocess
    # First time: process and store result
    result = process_booking(request)
    cache.set(key, result, ttl=timedelta(hours=24))
    return result
Key Generation Rules
| Include | Exclude | Why |
|---|---|---|
| user_id | timestamp | Timestamps change on retry |
| task_id | retry_count | Retry count changes on retry |
| step_id | random() | Random changes on retry |
| operation_type | request_id (if regenerated) | Must be stable |
| external_reference | | |
The test: If you retry the same logical operation, does the key stay the same? If not, it’s wrong.
Takeaway: stable identifiers in, stable key out. Anything that changes per attempt (timestamp, retry count, random) breaks the contract.
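One way to apply these rules is to hash the stable fields into an opaque key. The sketch below is an assumption-laden illustration, not Stripe's method: make_idempotency_key is a hypothetical helper, and SHA-256 over a JSON list is just one encoding that keeps the key fixed-length (64 characters, well under the 255-character limit) and keeps raw identifiers out of log lines.

```python
import hashlib
import json

def make_idempotency_key(user_id, task_id, step_id, operation_type):
    """Derive a stable, opaque key from fields that do not change on retry."""
    # A JSON list is an unambiguous encoding: ("a:b", "c") and ("a", "b:c")
    # would collide if the parts were simply joined with ":".
    payload = json.dumps(
        [str(user_id), str(task_id), str(step_id), operation_type]
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

first_attempt = make_idempotency_key("u1", "t9", 3, "book_flight")
retry_attempt = make_idempotency_key("u1", "t9", 3, "book_flight")
other_user = make_idempotency_key("u2", "t9", 3, "book_flight")
other_operation = make_idempotency_key("u1", "t9", 3, "send_email")

print(first_attempt == retry_attempt)  # → True
print(first_attempt == other_user)     # → False
```

Hashing a low-entropy identifier is not anonymisation, so treat this as log hygiene rather than a privacy control.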
Three Idempotency Strategies
Strategy 1: Idempotency Keys (Stripe Pattern)
Best for: External APIs, payments, bookings
class IdempotentClient:
    def __init__(self, cache):
        self.cache = cache

    def execute(self, operation, idempotency_key):
        # Check cache
        cached = self.cache.get(idempotency_key)
        if cached is not None:  # replay even when the stored result is falsy
            return cached
        # Execute and cache
        result = operation()
        self.cache.set(idempotency_key, result, ttl=86400)  # 24 hours
        return result

# Usage
client = IdempotentClient(redis_cache)
result = client.execute(
    operation=lambda: api.book_flight(flight_id),
    idempotency_key=f"{user_id}:{task_id}:book_flight:{flight_id}",
)
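A plain check-then-set cache has a gap that the opening scenario hits directly: the retry arrives while the first request is still running, so the cache is empty and both requests execute. The sketch below shows one possible way to close that gap inside a single process by claiming the key before executing. InFlightDedup is a hypothetical class, the store is an in-memory dict with no TTL, and a multi-process deployment would need an atomic claim in the shared store instead.

```python
import threading
import time
from concurrent.futures import Future

class InFlightDedup:
    """Run an operation once per key, even if a retry arrives mid-flight."""

    def __init__(self):
        self._lock = threading.Lock()
        self._futures = {}  # idempotency_key -> Future holding the result

    def execute(self, idempotency_key, operation):
        with self._lock:
            future = self._futures.get(idempotency_key)
            owner = future is None
            if owner:
                # Claim the key before doing any work
                future = Future()
                self._futures[idempotency_key] = future
        if not owner:
            # Someone else owns this key: wait for their result
            return future.result()
        try:
            result = operation()
        except BaseException as exc:
            # Release the claim so a later retry can run the operation again
            with self._lock:
                del self._futures[idempotency_key]
            future.set_exception(exc)
            raise
        future.set_result(result)
        return result

calls = []

def slow_booking():
    calls.append(1)   # record each real execution
    time.sleep(0.2)   # simulate a slow upstream API
    return "booking-001"

dedup = InFlightDedup()
results = []

def worker():
    results.append(dedup.execute("u1:t9:book_flight", slow_booking))

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Both callers receive the same booking while slow_booking runs only once. Waiters that were blocked on a failed attempt see the same exception as the owner.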
Strategy 2: Sequence Numbers
Best for: Internal state changes, ordered operations
class OutOfOrderError(Exception):
    """Raised when a sequence number arrives ahead of the expected one."""

class SequencedOperations:
    def __init__(self):
        self.expected_seq = 1
        self.results = {}

    def execute(self, seq_num, operation):
        # Already processed: replay the stored result
        if seq_num < self.expected_seq:
            return self.results[seq_num]
        # Out of order
        if seq_num > self.expected_seq:
            raise OutOfOrderError(f"Expected {self.expected_seq}, got {seq_num}")
        # Process and increment
        result = operation()
        self.results[seq_num] = result
        self.expected_seq += 1
        return result
Tradeoff: Simple but requires ordered processing. Doesn’t work well with concurrent clients.
Strategy 3: Time Window Deduplication
Best for: Best-effort deduplication, high-volume low-stakes operations
import time

class TimeWindowDedup:
    def __init__(self, window_seconds=300):
        self.window = window_seconds
        self.seen = {}  # hash -> (timestamp, result)

    def execute(self, request_hash, operation):
        now = time.time()
        # Check if seen within window
        if request_hash in self.seen:
            timestamp, result = self.seen[request_hash]
            if now - timestamp < self.window:
                return result  # Within window, return cached
        # Process and cache
        result = operation()
        self.seen[request_hash] = (now, result)
        return result
Tradeoff: Allows some duplicates (if window expires), but prevents immediate retry storms.
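Time-window dedup depends on how request_hash is computed, which the class above leaves open. A minimal sketch, assuming the request can be expressed as JSON-serialisable parameters: serialise with sorted keys so that logically identical requests hash the same regardless of dict ordering. The function name and the "op"/"params" envelope are illustrative assumptions.

```python
import hashlib
import json

def request_hash(operation_type, params):
    """Hash the content of a request so identical requests share one entry."""
    # sort_keys makes the encoding independent of dict insertion order
    canonical = json.dumps(
        {"op": operation_type, "params": params},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

hash_a = request_hash("search", {"query": "flights to Oslo", "page": 1})
hash_b = request_hash("search", {"page": 1, "query": "flights to Oslo"})
hash_c = request_hash("search", {"query": "flights to Rome", "page": 1})

print(hash_a == hash_b)  # → True
print(hash_a == hash_c)  # → False
```

Because this key is derived from content rather than from a task identity, two genuinely separate requests with identical parameters also collapse into one within the window, which is why the strategy suits low-stakes operations.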
Takeaway: pick the strategy by the cost of a duplicate. Payments → keys. Internal counters → sequence. Search-as-you-type → time window.
Error Classification
Not all errors should be retried. Getting this wrong causes cascading failures.
| HTTP status / error | Class | Retry? | Why |
|---|---|---|---|
| 429 Too Many Requests | Rate limit | yes | back off + jitter; honour Retry-After if present |
| 502 Bad Gateway / 503 Service Unavailable / 504 Gateway Timeout | Upstream | yes | infrastructure flake; jitter prevents thundering herd |
| 408 Request Timeout | Network | yes | server gave up waiting for the request; resending is safe with the same idempotency key |
| Connection reset / DNS / TLS handshake | Network | yes | transient |
| 400 Bad Request / 422 Unprocessable Entity | Client | no | retry never succeeds; needs code or input fix |
| 401 Unauthorized / 403 Forbidden | Auth | no | needs token refresh or escalation, not retry |
| 404 Not Found | Client | no | resource doesn’t exist |
| 409 Conflict | State | no | logical conflict; retry without state change wastes effort |
from http import HTTPStatus

RETRY_STATUSES = {
    HTTPStatus.REQUEST_TIMEOUT,      # 408
    HTTPStatus.TOO_MANY_REQUESTS,    # 429
    HTTPStatus.BAD_GATEWAY,          # 502
    HTTPStatus.SERVICE_UNAVAILABLE,  # 503
    HTTPStatus.GATEWAY_TIMEOUT,      # 504
}
RETRY_EXCEPTIONS = (ConnectionResetError, TimeoutError)

def should_retry(error):
    if isinstance(error, RETRY_EXCEPTIONS):
        return True
    status = getattr(error, "status_code", None)
    return status in RETRY_STATUSES
The rule: Retry infrastructure errors (network, timeout, overload). Don’t retry business errors (validation, auth, not found).
Takeaway: 4xx errors (except 408 and 429) are bugs in your request, not blips in their service. Retrying a 400 is wasted budget.
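The table says to honour Retry-After on a 429, but the classifier only answers whether to retry, not when. A minimal sketch of the missing piece follows, assuming the header value is available as a string. retry_after_seconds and next_delay are hypothetical helper names. The header can carry either a number of seconds or an HTTP date, so both forms are handled.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def retry_after_seconds(header_value, now=None):
    """Return the server-requested delay in seconds, or None if unusable."""
    if not header_value:
        return None
    value = header_value.strip()
    if value.isdigit():
        return float(value)  # delay-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None  # unparseable: fall back to plain backoff
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())

def next_delay(backoff_delay, retry_after_header, now=None):
    """Wait at least as long as the server asked, never less than backoff."""
    hinted = retry_after_seconds(retry_after_header, now)
    return backoff_delay if hinted is None else max(backoff_delay, hinted)

print(next_delay(0.4, "2"))   # → 2.0
print(next_delay(5.0, "2"))   # → 5.0
print(next_delay(0.4, None))  # → 0.4
```

Taking the maximum of the two delays is a design choice in this sketch: the server's hint sets a floor, and the jittered backoff still spreads clients out above it.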
Exponential Backoff with Full Jitter
Naive retry: Wait 1s, retry. All clients retry at the same time. Server overwhelmed again.
Smart retry: Wait random time, increasing with each attempt. Clients spread out. Server recovers.
Marc Brooker’s 2015 AWS Architecture Blog post [aws-jitter] gives the canonical formula:
sleep = random_between(0, min(cap, base * 2^attempt))
This is “full jitter”. Brooker’s simulations showed full jitter uses substantially less client work and server load than no jitter, and slightly less work than equal jitter, which also takes longer to finish. Against decorrelated jitter, full jitter does less work at the cost of slightly more completion time. The win is amplification protection, not raw speed.
import random
import time

def retry_with_backoff(
    operation,
    max_retries=5,
    base_delay=0.1,
    max_delay=10.0,
):
    """Exponential backoff with full jitter (Brooker 2015).

    operation must send the same idempotency key on every attempt.
    """
    for attempt in range(max_retries):  # max_retries counts total attempts
        try:
            return operation()
        except Exception as e:
            if not should_retry(e):
                raise  # permanent failure: surface it
            if attempt == max_retries - 1:
                raise  # exhausted
            # cap = max_delay, base = base_delay
            ceiling = min(base_delay * (2 ** attempt), max_delay)
            time.sleep(random.uniform(0, ceiling))
Why Full Jitter?
WITHOUT JITTER:
  Server fails at t=0
  All 1000 clients retry at t=1
  Server fails again
  All 1000 clients retry at t=2
  Server fails again
  ...

WITH FULL JITTER:
  Server fails at t=0
  Client A retries at t=0.3
  Client B retries at t=0.7
  Client C retries at t=0.1
  ...
  Load spreads across 0-1 second window
  Server can handle gradual recovery
Takeaway: without jitter you cause the outage you were trying to survive. The randomness is the feature.
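To see what the formula produces with the defaults used in this chapter (base 0.1 s, cap 10 s), the sketch below computes the ceiling per attempt and draws jittered delays under it. backoff_ceiling and full_jitter_delay are illustrative helper names, and the seeded generator is only there to make the run repeatable.

```python
import random

def backoff_ceiling(attempt, base=0.1, cap=10.0):
    """Upper bound of the sleep window: min(cap, base * 2^attempt)."""
    return min(cap, base * (2 ** attempt))

def full_jitter_delay(attempt, base=0.1, cap=10.0, rng=random):
    """Full jitter: sleep a uniform random time between 0 and the ceiling."""
    return rng.uniform(0, backoff_ceiling(attempt, base, cap))

ceilings = [backoff_ceiling(attempt) for attempt in range(9)]
print(ceilings)  # → [0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4, 10.0, 10.0]

rng = random.Random(0)  # seeded so the sample is reproducible
samples = [full_jitter_delay(3, rng=rng) for _ in range(1000)]
mean_delay = sum(samples) / len(samples)
```

The ceiling doubles until the cap takes over at attempt 7. Every sampled delay for attempt 3 falls between 0 and 0.8 s, averaging about half the ceiling, which is what spreads a crowd of clients across the window.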
Cascading Retry Storm
1. Payment service has 30-second outage
2. Order processing agents timeout, start retrying
   → 1000 agents × 3 retries = 3000 payment requests
3. Payment retries trigger inventory checks
   → Each payment retry calls inventory
   → 3000 inventory requests
4. Inventory service overwhelmed by traffic
   → Starts timing out
   → Agents retry inventory calls
5. Inventory retries trigger shipping checks
   → Cascade continues
6. Within 60 seconds:
   → 10x normal load across all services
   → Multiple services failing
   → Complete system degradation
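The arithmetic behind the cascade is worth spelling out. As a rough sketch, assume every layer in a call chain independently makes up to the same number of attempts when the layer below it fails. The worst-case number of calls reaching the deepest service then grows exponentially with depth. worst_case_calls is an illustrative helper, and real systems rarely hit the exact bound.

```python
def worst_case_calls(attempts_per_layer, depth):
    """Upper bound on calls reaching the deepest service for one request."""
    # Each layer multiplies the attempts made by the layer above it
    return attempts_per_layer ** depth

amplification = {depth: worst_case_calls(3, depth) for depth in range(1, 6)}
print(amplification)  # → {1: 3, 2: 9, 3: 27, 4: 81, 5: 243}
```

Three attempts per layer looks harmless in isolation. Five layers deep, one user request can become 243 calls on the service at the bottom of the stack.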
Prevention: Circuit Breakers
import time

class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.last_failure_time = None
        self.state = "CLOSED"  # CLOSED, OPEN, HALF_OPEN

    def call(self, operation):
        if self.state == "OPEN":
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = "HALF_OPEN"  # let one probe call through
            else:
                raise CircuitOpenError("Circuit breaker is open")
        try:
            result = operation()
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.failure_count >= self.failure_threshold:
                self.state = "OPEN"
            raise
        # Any success closes the breaker and clears the failure streak,
        # so the threshold counts consecutive failures
        self.state = "CLOSED"
        self.failure_count = 0
        return result

# Usage
payment_breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=30)

def charge(amount):
    try:
        return payment_breaker.call(lambda: payment_api.charge(amount))
    except CircuitOpenError:
        # Don't even try: circuit is open
        return escalate_to_human("Payment service unavailable")
What happens when the breaker opens mid-cascade: the payment service degrades at t=0. By t=15s, the breaker on payment_api has tripped to OPEN. But the agent’s order workflow doesn’t know that. It just sees CircuitOpenError and may retry the whole workflow from step 1, re-calling inventory_api, shipping_api, and notification_api. Without idempotency keys on each of those, you’ve now reserved inventory and sent notifications a second time for an order that never paid. The fix: idempotency keys at every step, not just payments. And classify CircuitOpenError as NEVER_RETRY at the workflow level: it’s a state signal, not a transient blip.
Takeaway: idempotency keys protect each call. Circuit breakers protect the whole system from cascading retries when one dependency degrades. Both must agree on what “retry” means.
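A sketch of what treating CircuitOpenError as NEVER_RETRY at the workflow level could look like. Everything here is hypothetical scaffolding: run_workflow, TransientError, and the step functions stand in for a real orchestrator and real API clients, and the retry loop omits backoff sleeps to keep the example short.

```python
class CircuitOpenError(Exception):
    """Raised by a breaker that is refusing calls."""

class TransientError(Exception):
    """Stand-in for a retryable infrastructure error."""

def run_workflow(steps, max_attempts=3):
    """Run steps in order; retry transient errors, stop on an open circuit."""
    completed = []
    for name, step in steps:
        for attempt in range(max_attempts):
            try:
                step()
                completed.append(name)
                break
            except CircuitOpenError:
                # State signal, not a blip: no retry, no restart from step 1
                return {"status": "escalated", "failed_step": name,
                        "completed": completed}
            except TransientError:
                if attempt == max_attempts - 1:
                    return {"status": "failed", "failed_step": name,
                            "completed": completed}
    return {"status": "ok", "failed_step": None, "completed": completed}

call_log = []

def reserve_inventory():
    call_log.append("inventory")

def charge_payment():
    call_log.append("payment")
    raise CircuitOpenError("payment breaker is open")

def send_notification():
    call_log.append("notification")

outcome = run_workflow([
    ("reserve_inventory", reserve_inventory),
    ("charge_payment", charge_payment),
    ("send_notification", send_notification),
])
print(outcome["status"])  # → escalated
print(call_log)           # → ['inventory', 'payment']
```

The workflow stops at the payment step with inventory called exactly once and no notification sent. The completed list tells the escalation handler which steps may need compensation.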
Framework-Specific Implementation
LangGraph (0.2.24+)
LangGraph’s RetryPolicy is wired via add_node’s retry_policy parameter, not a decorator. The RetryPolicy itself is a NamedTuple in langgraph.types (added in 0.2.24) [langgraph-docs]:
from langgraph.graph import StateGraph
from langgraph.types import RetryPolicy
from langgraph.checkpoint.postgres import PostgresSaver

def call_external_api(state):
    # Stable key: survives retries because state fields don't change
    key = f"{state['user_id']}:{state['task_id']}:{state['step']}"
    return api.call(idempotency_key=key)

graph = StateGraph(AgentState)
graph.add_node(
    "call_external_api",
    call_external_api,
    retry_policy=RetryPolicy(
        initial_interval=0.5,  # seconds
        backoff_factor=2.0,
        max_interval=30.0,
        max_attempts=3,
        jitter=True,  # on by default
    ),
)

# Checkpointing enables safe retry from last known state.
# from_conn_string is a context manager in current
# langgraph-checkpoint-postgres releases; use app inside the block.
with PostgresSaver.from_conn_string(DATABASE_URL) as checkpointer:
    app = graph.compile(checkpointer=checkpointer)
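The jitter=True default means LangGraph randomizes retry intervals without you doing anything, but the per-call idempotency key is still your responsibility. If you need Brooker’s full-jitter formula specifically, verify it against the retry implementation in the version you run.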
Temporal (Python SDK 1.x)
Temporal activities are at-least-once. Your idempotency key turns at-least-once delivery into effectively-once business effect [temporal-docs]:
from datetime import timedelta
from temporalio import activity, workflow
from temporalio.common import RetryPolicy

@activity.defn
async def book_flight(flight_id: str, idempotency_key: str) -> BookingResult:
    return await api.book(flight_id, idempotency_key=idempotency_key)

@workflow.defn
class BookingWorkflow:
    @workflow.run
    async def run(self, request: BookingRequest) -> BookingResult:
        return await workflow.execute_activity(
            book_flight,
            args=[
                request.flight_id,
                f"{request.user_id}:{request.booking_id}",
            ],
            start_to_close_timeout=timedelta(seconds=30),
            retry_policy=RetryPolicy(
                initial_interval=timedelta(seconds=1),
                maximum_interval=timedelta(seconds=30),
                backoff_coefficient=2.0,
                maximum_attempts=5,
                non_retryable_error_types=["ValidationError", "AuthError"],
            ),
        )
Takeaway: framework retry policies handle backoff + jitter; your idempotency key handles the business effect. You need both.
What to Copy, What to Skip
| Pattern | Copy when | Skip when |
|---|---|---|
| Stripe-style idempotency key | external API with side effects (payment, booking, email send) | pure reads, internal cache lookups, no-side-effect tools |
| Sequence-number dedup | single-writer ordered pipelines (e.g. event-sourced state mutations) | concurrent multi-writer flows, where the ordering assumption breaks |
| Time-window dedup (5 min) | high-volume, low-stakes (search-as-you-type, autocomplete telemetry) | money or anything regulated; duplicates after the window expires = legal risk |
| Full jitter (random(0, cap)) | every retry path. Default. | only if you have a contractual reason to be deterministic (and reconsider then) |
| Circuit breaker per dependency | downstream services with measurable failure rates | in-process pure-compute calls; overhead without payoff |
| 24h cache TTL on key→result | matches Stripe’s window; covers most retry storms | regulated environments where retry windows are mandated; pin to spec |
| LangGraph node retry_policy= | LangGraph 0.2.24+ workflows where retry is per-step | older releases (<0.2.24) that lack the parameter; upgrade or wrap manually |
| Temporal activity non_retryable_error_types | classify auth + validation errors as non-retryable at workflow level | transient infra errors; let the default retry kick in |
Takeaway: every endorsed pattern has a break condition. Read the right column before copying the left.
Common Gotchas
| Gotcha | Symptom | Fix |
|---|---|---|
| Timestamp in key | Retries create duplicates | Use stable identifiers only |
| Key too broad | Different operations collide | Include operation type in key |
| Key too narrow | Same operation not deduplicated | Include all relevant context |
| No TTL on cache | Memory leak | Set 24-hour TTL (matches Stripe window) |
| Cached failures | Retrying failed ops returns the failure | Only cache successful results, or store (status, body) and check status |
| Retrying 400s | Wasted requests, never succeed | Classify errors: 429 retries, 400/422 do not |
| No jitter | Thundering herd on recovery | Always use full jitter: random(0, cap) |
| Key collision across users | User A’s retry returns User B’s result | Include user_id (or tenant id) in every key |
Takeaway: every gotcha here has been observed in production. Treat the table as a pre-merge review checklist for any tool that touches state.
The Idempotency Checklist
Before deploying an agent with external actions:
KEY GENERATION
[ ] Keys use stable identifiers (user_id, task_id, step_id)
[ ] Keys do NOT include timestamps or retry counts
[ ] Keys include operation type to prevent collisions
[ ] Keys are deterministic (same input = same key)

ERROR HANDLING
[ ] Errors classified as RETRY vs NEVER_RETRY
[ ] 4xx errors (except 408 and 429) are not retried
[ ] 5xx and network errors are retried
[ ] Max retry limit is set

BACKOFF
[ ] Exponential backoff implemented
[ ] Full jitter added to prevent thundering herd
[ ] Max delay cap prevents infinite waits
[ ] Base delay appropriate for the API

CIRCUIT BREAKERS
[ ] Circuit breaker on each external dependency
[ ] Failure threshold tuned for the service
[ ] Recovery timeout allows service to stabilize
[ ] Open circuit has graceful fallback
Takeaway: ship none of these and your agent will charge a customer twice within the first week. Ship all four and retry storms become a non-event.
Closing
Your actions are now idempotent and your retries are jittered. Two things still go wrong: the agent crashes mid-task and loses its place, and the budget overruns when the loop won’t terminate. Both are next.
Testable question: can you describe the idempotency key for every external call your agent makes in one sentence each? If not, that’s the gap.
→ Part 2: State Persistence & Checkpointing — surviving the SIGKILL.
References
- [stripe-idempotency] Stripe. Idempotent Requests. Stripe API docs. Retrieved 2026-05-13. Source for 24-hour TTL, 255-char max, header convention, replay-including-500s guarantee.
- [aws-jitter] Marc Brooker (AWS Architecture Blog). Exponential Backoff and Jitter, 2015-03-04. AWS Architecture Blog. Source for the full-jitter formula sleep = random(0, min(cap, base * 2^attempt)) and the empirical result that full jitter beats equal jitter under contention.
- [aws-builders-retries] Marc Brooker (Amazon Builders’ Library). Timeouts, Retries and Backoff with Jitter. Amazon Builders’ Library. Background on why retries amplify upstream stress.
- [langgraph-docs] LangChain. LangGraph low-level concepts: add_node, RetryPolicy. LangGraph docs. Source for langgraph.types.RetryPolicy shape and add_node(retry_policy=...) wiring.
- [temporal-docs] Temporal. Failure detection in Python. Temporal docs. Source for temporalio.common.RetryPolicy signature and non_retryable_error_types.