Idempotency & Safe Retries - The Stripe Pattern for Agents
Deep dive into idempotency: the single highest-leverage production requirement. Learn the Stripe pattern, error classification, jitter, and how to prevent cascading retry storms
Prerequisite: This is Part 1 of the Production Agents Deep Dive series. Start with Part 0: Overview for context.
Why This Matters
Your agent calls book_flight(). The API takes 35 seconds to respond. Your timeout is 30 seconds. Agent retries. API processes both requests. Customer is charged twice.
This isn’t a bug. This is correct retry logic meeting real-world latency.
Idempotency is the single most critical production requirement for agents that perform actions with side effects. Without it, retries create duplicates — double bookings, duplicate emails, corrupted state.
The Numbers:
- 68% of teams hit budget overruns in first agent deployments
- 50% cite “runaway tool loops and recursive logic” as the cause
- API downtime surged 60% between Q1 2024 and Q1 2025
- More downtime = more retries = more duplicate operations
What Goes Wrong Without This:
- Duplicate side effects: double charges, double bookings, duplicate emails
- Corrupted state when a partially completed operation is replayed
- Retry storms that amplify an outage instead of riding it out
What Idempotency Means
Idempotent: An operation that produces the same result when called multiple times with the same input.
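A toy illustration of the difference (the order dict and helper functions below are hypothetical, not from any library): writing an absolute value is idempotent, while appending or incrementing is not, and the second kind of operation is what needs deduplication on retry.

```python
order = {"address": None, "notes": []}

def set_shipping_address(order, address):
    # Idempotent: the final state is the same no matter how many times it runs.
    order["address"] = address

def append_order_note(order, note):
    # Not idempotent: every call changes the result again.
    order["notes"].append(note)

set_shipping_address(order, "221B Baker Street")
set_shipping_address(order, "221B Baker Street")   # retry: address unchanged
append_order_note(order, "expedite")
append_order_note(order, "expedite")               # retry: note duplicated
print(order)  # {'address': '221B Baker Street', 'notes': ['expedite', 'expedite']}
```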
The Stripe Pattern
Stripe processes millions of payments. They can’t afford duplicates. Their pattern is the industry standard:
```python
# Client generates a unique key
def book_flight(flight_id, user_id, task_id, step_id):
    # Key must be STABLE across retries
    # Bad:  f"{user_id}:{timestamp}"          - different on each retry
    # Bad:  f"{user_id}:{retry_count}"        - different on each retry
    # Good: f"{user_id}:{task_id}:{step_id}"  - same across retries
    idempotency_key = f"{user_id}:{task_id}:{step_id}"

    return api.book(
        flight_id=flight_id,
        idempotency_key=idempotency_key
    )
```
```python
# Server checks and stores
from datetime import timedelta

def handle_booking(request):
    key = request.idempotency_key

    # Check if we've processed this before
    cached = cache.get(key)
    if cached:
        return cached  # Return stored result, don't reprocess

    # First time: process and store result
    result = process_booking(request)
    cache.set(key, result, ttl=timedelta(hours=24))
    return result
```
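One caveat with the get-then-set pattern above: two concurrent retries can both miss the cache and both process the booking. Below is a sketch of one way to close that window using an atomic set-if-absent, assuming a Redis-style client (redis-py's `set(..., nx=True, ex=...)`); the `handle_booking_atomic` name, the `PENDING`/`DONE` statuses, and the 202 response are illustrative choices, not part of Stripe's API.

```python
import json
from datetime import timedelta

TTL_SECONDS = int(timedelta(hours=24).total_seconds())

def handle_booking_atomic(request, redis):
    key = f"idem:{request.idempotency_key}"

    # SET NX atomically claims the key: of two racing retries, exactly one wins.
    claimed = redis.set(key, json.dumps({"status": "PENDING"}), nx=True, ex=TTL_SECONDS)
    if not claimed:
        stored = json.loads(redis.get(key))
        if stored["status"] == "PENDING":
            # The first request is still in flight; tell the caller to wait and retry later.
            return {"status": 202, "detail": "original request still in flight"}
        return stored["result"]

    result = process_booking(request)  # result assumed JSON-serializable in this sketch
    redis.set(key, json.dumps({"status": "DONE", "result": result}), ex=TTL_SECONDS)
    return result
```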
Key Generation Rules
| Include | Exclude | Why |
|---|---|---|
| user_id | timestamp | Timestamps change on retry |
| task_id | retry_count | Retry count changes on retry |
| step_id | random() | Random changes on retry |
| operation_type | request_id (if regenerated) | Must be stable |
| external_reference | | |
The test: If you retry the same logical operation, does the key stay the same? If not, it’s wrong.
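That property is easy to check mechanically. A minimal sketch of the test, assuming the key-building logic is factored into a `build_key` helper (a hypothetical name):

```python
def build_key(user_id, task_id, step_id, operation_type):
    # Only stable identifiers: no timestamps, counters, or random values.
    return f"{user_id}:{task_id}:{step_id}:{operation_type}"

def test_key_is_stable_across_retries():
    first_attempt = build_key("u42", "task-9", "step-3", "book_flight")
    retry = build_key("u42", "task-9", "step-3", "book_flight")
    assert first_attempt == retry
```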
Three Idempotency Strategies
Strategy 1: Idempotency Keys (Stripe Pattern)
Best for: External APIs, payments, bookings
```python
class IdempotentClient:
    def __init__(self, cache):
        self.cache = cache

    def execute(self, operation, idempotency_key):
        # Check cache
        cached = self.cache.get(idempotency_key)
        if cached:
            return cached

        # Execute and cache
        result = operation()
        self.cache.set(idempotency_key, result, ttl=86400)  # 24 hours
        return result

# Usage
client = IdempotentClient(redis_cache)
result = client.execute(
    operation=lambda: api.book_flight(flight_id),
    idempotency_key=f"{user_id}:{task_id}:book_flight:{flight_id}"
)
```
Strategy 2: Sequence Numbers
Best for: Internal state changes, ordered operations
```python
class OutOfOrderError(Exception):
    pass

class SequencedOperations:
    def __init__(self):
        self.expected_seq = 1
        self.results = {}

    def execute(self, seq_num, operation):
        # Already processed
        if seq_num < self.expected_seq:
            return self.results[seq_num]

        # Out of order
        if seq_num > self.expected_seq:
            raise OutOfOrderError(f"Expected {self.expected_seq}, got {seq_num}")

        # Process and increment
        result = operation()
        self.results[seq_num] = result
        self.expected_seq += 1
        return result
```
Tradeoff: Simple but requires ordered processing. Doesn’t work well with concurrent clients.
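A quick usage sketch (the `api.create_order` call and `order_payload` are hypothetical): the agent assigns each step a monotonically increasing sequence number, so replaying a step that already ran returns the stored result instead of re-executing it.

```python
ops = SequencedOperations()

# First attempt at step 1 executes the operation...
order = ops.execute(1, lambda: api.create_order(order_payload))

# ...a retry of the same step (seq_num < expected_seq) returns the stored
# result instead of calling the API again.
same_order = ops.execute(1, lambda: api.create_order(order_payload))
assert same_order == order
```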
Strategy 3: Time Window Deduplication
Best for: Best-effort deduplication, high-volume low-stakes operations
```python
import time

class TimeWindowDedup:
    def __init__(self, window_seconds=300):
        self.window = window_seconds
        self.seen = {}  # hash -> (timestamp, result)

    def execute(self, request_hash, operation):
        now = time.time()

        # Check if seen within window
        if request_hash in self.seen:
            timestamp, result = self.seen[request_hash]
            if now - timestamp < self.window:
                return result  # Within window, return cached

        # Process and cache
        result = operation()
        self.seen[request_hash] = (now, result)
        return result
```
Tradeoff: Allows some duplicates (if window expires), but prevents immediate retry storms.
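The strategy depends on a request hash that is stable across retries. A minimal sketch of building one, assuming the request can be serialized to JSON; `send_receipt_email` is a hypothetical operation:

```python
import hashlib
import json

def request_fingerprint(operation_type, payload):
    # Canonical JSON (sorted keys, fixed separators) so the same logical
    # request always hashes to the same value, regardless of dict ordering.
    canonical = json.dumps({"op": operation_type, "payload": payload},
                           sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

dedup = TimeWindowDedup(window_seconds=300)
result = dedup.execute(
    request_fingerprint("send_email", {"to": "user@example.com", "template": "receipt"}),
    lambda: send_receipt_email("user@example.com"),
)
```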
Error Classification
Not all errors should be retried. Getting this wrong causes cascading failures.
```python
from http import HTTPStatus

# These exceptions are transient — retry them
RETRY_EXCEPTIONS = (
    ConnectionResetError,  # Network blip
    TimeoutError,          # Slow response
)

# These HTTP statuses are transient — retry them
RETRY_STATUS_CODES = {
    HTTPStatus.TOO_MANY_REQUESTS,    # 429 - Rate limited
    HTTPStatus.SERVICE_UNAVAILABLE,  # 503 - Server overloaded
    HTTPStatus.GATEWAY_TIMEOUT,      # 504 - Upstream timeout
    HTTPStatus.BAD_GATEWAY,          # 502 - Proxy error
}

# These errors are permanent — don't retry
NEVER_RETRY_STATUS_CODES = {
    HTTPStatus.BAD_REQUEST,           # 400 - Invalid input
    HTTPStatus.UNAUTHORIZED,          # 401 - Auth failed
    HTTPStatus.FORBIDDEN,             # 403 - Not allowed
    HTTPStatus.NOT_FOUND,             # 404 - Doesn't exist
    HTTPStatus.UNPROCESSABLE_ENTITY,  # 422 - Business rule rejected
    HTTPStatus.CONFLICT,              # 409 - State conflict
}

def should_retry(error):
    # Transient infrastructure exceptions
    if isinstance(error, RETRY_EXCEPTIONS):
        return True
    # HTTP errors: retry only the transient status codes
    status = getattr(error, "status_code", None)
    if status is not None:
        return status in {code.value for code in RETRY_STATUS_CODES}
    return False
```
The rule: Retry infrastructure errors (network, timeout, overload). Don’t retry business errors (validation, auth, not found).
Exponential Backoff with Full Jitter
Naive retry: Wait 1s, retry. All clients retry at the same time. Server overwhelmed again.
Smart retry: Wait random time, increasing with each attempt. Clients spread out. Server recovers.
```python
import random
import time

def retry_with_backoff(
    operation,
    max_retries=5,
    base_delay=0.1,
    max_delay=10.0,
    idempotency_key=None
):
    """
    Exponential backoff with full jitter.

    AWS research shows full jitter significantly reduces
    synchronized retry storms during outages.
    """
    for attempt in range(max_retries):
        try:
            return operation()
        except Exception as e:
            if not should_retry(e):
                raise  # Don't retry permanent errors
            if attempt == max_retries - 1:
                raise  # Last attempt, give up

            # Exponential backoff: 0.1, 0.2, 0.4, 0.8, 1.6... capped at max_delay
            delay = min(base_delay * (2 ** attempt), max_delay)

            # Full jitter: random value between 0 and delay
            # This spreads retries across time, preventing thundering herd
            jittered_delay = random.uniform(0, delay)
            time.sleep(jittered_delay)
```
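Putting the pieces together: a usage sketch that wraps the `IdempotentClient` from Strategy 1 in the retry loop (the `api.book_flight` call and surrounding identifiers are the same hypothetical ones used above). The retry loop rides out transient failures, and the stable key guarantees that only one attempt takes effect.

```python
idempotency_key = f"{user_id}:{task_id}:book_flight:{flight_id}"

result = retry_with_backoff(
    operation=lambda: client.execute(
        operation=lambda: api.book_flight(flight_id),
        idempotency_key=idempotency_key,
    ),
    max_retries=5,
    idempotency_key=idempotency_key,
)
```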
Why Full Jitter?
With plain exponential backoff, every client that failed at the same moment also retries at the same moment: 1s later, then 2s, then 4s, in synchronized waves that can knock the recovering server straight back over. Full jitter replaces each fixed wait with a random delay between zero and the backoff cap, so retries spread across the whole window instead of arriving in spikes. That spreading is what the AWS research cited above found so effective at damping synchronized retry storms.
Cascading Retry Storm
The nightmare scenario:
1. A downstream service slows under load and requests start timing out.
2. Every caller retries, multiplying traffic exactly when the service has the least headroom.
3. The extra load pushes the service from slow to down; its dependents start timing out and retrying too.
4. When the service finally recovers, it is hit by a synchronized wall of queued retries and falls over again.
Prevention: Circuit Breakers
```python
import time

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.last_failure_time = None
        self.state = "CLOSED"  # CLOSED, OPEN, HALF_OPEN

    def call(self, operation):
        if self.state == "OPEN":
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = "HALF_OPEN"
            else:
                raise CircuitOpenError("Circuit breaker is open")

        try:
            result = operation()
            if self.state == "HALF_OPEN":
                self.state = "CLOSED"
                self.failure_count = 0
            return result
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.failure_count >= self.failure_threshold:
                self.state = "OPEN"
            raise

# Usage
payment_breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=30)

try:
    result = payment_breaker.call(lambda: payment_api.charge(amount))
except CircuitOpenError:
    # Don't even try — circuit is open
    return escalate_to_human("Payment service unavailable")
```
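One way the breaker composes with the retry helper from earlier (a design choice, not the only valid layering): putting the breaker inside the retry loop means every failed attempt counts toward the threshold, and once the circuit opens, `CircuitOpenError` is not in the retryable set, so the loop stops immediately instead of sleeping through its remaining attempts.

```python
def safe_call(operation, breaker, **retry_kwargs):
    # Each attempt passes through the breaker; repeated failures trip it,
    # and CircuitOpenError propagates out of retry_with_backoff unretried.
    return retry_with_backoff(lambda: breaker.call(operation), **retry_kwargs)

result = safe_call(
    lambda: payment_api.charge(amount),
    breaker=payment_breaker,
    max_retries=3,
)
```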
Framework-Specific Implementation
LangGraph
```python
from langgraph.graph import StateGraph
from langgraph.checkpoint.postgres import PostgresSaver
from langgraph.types import RetryPolicy

def call_external_api(state):
    # Idempotency key built from stable fields in the graph state
    key = f"{state['user_id']}:{state['task_id']}:{state['step']}"
    return api.call(idempotency_key=key)

# LangGraph has built-in retry support
graph = StateGraph(AgentState)

# Configure per-node retry (the keyword is named retry_policy in newer releases)
graph.add_node(
    "call_external_api",
    call_external_api,
    retry=RetryPolicy(max_attempts=3, backoff_factor=2),
)

# Checkpointing enables safe retry from last known state
app = graph.compile(checkpointer=PostgresSaver.from_conn_string(DATABASE_URL))
```
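Checkpoint-based resume only helps if retries of the same task reuse the same thread. A short usage sketch under that assumption; the input field names are illustrative, and `task_id`/`user_id` come from the surrounding application.

```python
# A stable thread_id per task lets a crashed or retried run resume from the
# last checkpoint instead of replaying steps that already completed.
config = {"configurable": {"thread_id": task_id}}

final_state = app.invoke(
    {"user_id": user_id, "task_id": task_id, "step": "book_flight"},
    config,
)
```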
Temporal
```python
from datetime import timedelta

from temporalio import activity, workflow
from temporalio.common import RetryPolicy

@activity.defn
async def book_flight(flight_id: str, idempotency_key: str) -> BookingResult:
    """
    Temporal activities have an at-least-once execution guarantee.
    Your idempotent implementation provides no-more-than-once business effect.
    Together = effective exactly-once execution.
    """
    return await api.book(flight_id, idempotency_key=idempotency_key)

@workflow.defn
class BookingWorkflow:
    @workflow.run
    async def run(self, request: BookingRequest) -> BookingResult:
        # Temporal handles retries with a configurable policy
        return await workflow.execute_activity(
            book_flight,
            args=[request.flight_id, f"{request.user_id}:{request.booking_id}"],
            start_to_close_timeout=timedelta(seconds=60),  # Temporal requires a timeout
            retry_policy=RetryPolicy(
                initial_interval=timedelta(seconds=1),
                maximum_interval=timedelta(seconds=30),
                backoff_coefficient=2.0,
                maximum_attempts=5,
                non_retryable_error_types=["ValidationError", "AuthError"]
            )
        )
```
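Temporal adds another layer of deduplication at the client: starting a workflow with a stable workflow ID means a duplicate start for the same booking is rejected instead of launching a second, parallel workflow. A sketch, assuming a connected `Client` and a `bookings` task queue (the `submit_booking` helper and queue name are illustrative):

```python
from temporalio.client import Client

async def submit_booking(client: Client, request: BookingRequest) -> BookingResult:
    # The workflow ID is itself an idempotency key: re-submitting the same
    # booking_id does not start a second workflow for the same booking.
    return await client.execute_workflow(
        BookingWorkflow.run,
        request,
        id=f"booking-{request.booking_id}",
        task_queue="bookings",
    )
```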
Common Gotchas
| Gotcha | Symptom | Fix |
|---|---|---|
| Timestamp in key | Retries create duplicates | Use stable identifiers only |
| Key too broad | Different operations collide | Include operation type in key |
| Key too narrow | Same operation not deduplicated | Include all relevant context |
| No TTL on cache | Memory leak | Set 24-48 hour TTL |
| Caching failures | Retrying failed ops returns failure | Only cache successful results |
| Retrying 400s | Wasted requests, never succeeds | Classify errors properly |
| No jitter | Thundering herd on recovery | Always use full jitter |
The Idempotency Checklist
Before deploying an agent with external actions:
- Every tool call with side effects accepts an idempotency key, built only from stable identifiers (user, task, step, operation type).
- The server, or a wrapper around it, stores results by key and returns the stored result on replay.
- Errors are classified: infrastructure errors retry, business errors fail fast.
- Retries use exponential backoff with full jitter.
- External dependencies sit behind a circuit breaker with an escalation path to a human.
- Cached idempotency results have a 24-48 hour TTL, and only successful results are cached.
Key Takeaways
- Idempotency is not optional. Every action with side effects needs a deduplication strategy.
- Keys must be stable. If the key changes on retry, it's not idempotent.
- Classify errors. Retry infrastructure errors. Don't retry business errors.
- Always use jitter. Without it, you'll cause the outage you're trying to survive.
- Circuit breakers prevent cascades. One failing service shouldn't take down everything.
Next Steps
Now that your actions are idempotent, what happens when your agent crashes mid-task?
→ Part 2: State Persistence & Checkpointing
Or jump to another topic:
- Part 3: Human-in-the-Loop — When to escalate to humans
- Part 4: Cost Control — Token budgets and circuit breakers