
Idempotency & Safe Retries - The Stripe Pattern for Agents

Summary

Deep dive into idempotency: the single highest-leverage production requirement. Learn the Stripe pattern, error classification, jitter, and how to prevent cascading retry storms.

Prerequisite: This is Part 1 of the Production Agents Deep Dive series. Start with Part 0: Overview for context.

Idempotency: stable key turns at-least-once into effectively-once

Same retry pattern, two outcomes. The idempotency gate is the difference between a charged-back customer and a non-event.

Why This Matters

Your agent calls book_flight(). The API takes 35 seconds to respond. Your timeout is 30 seconds. The agent retries. The API processes both requests. The customer is charged twice.

This isn’t a bug. This is correct retry logic meeting real-world latency.

Idempotency is the single most critical production requirement for agents that perform actions with side effects. Without it, retries create duplicates — double bookings, duplicate emails, corrupted state.

What other content gets wrong: tutorials show retry=3 and call it done. The Stripe docs cover idempotency keys but stop short of key derivation across retries. The AWS Builders’ Library [aws-builders-retries] explains why retries amplify upstream stress; Brooker’s companion post [aws-jitter] gives the jitter math — neither covers the agent-loop case. This chapter ties all three together — and shows what happens when your idempotency key collides across users or your circuit breaker opens mid-cascade.

Takeaway: retry logic without idempotency keys is a charge-customers-twice machine.

What Goes Wrong Without This:

IDEMPOTENCY FAILURE PATTERNS
Symptom: Customer charged twice for the same order.
Cause:   Payment API timed out. Agent retried. Both charges processed.
         No idempotency key to deduplicate.

Symptom: User receives 47 copies of the same email.
Cause:   Email send succeeded but response was slow. Agent assumed failure.
         Retried. No deduplication on sends.

Symptom: Database has duplicate records with slight variations.
Cause:   INSERT succeeded, network dropped response. Retry created second record.
         No upsert or idempotency check.

Takeaway: every failure mode here is a retry meeting a missing dedup key. Stripe’s idempotency-keys design [stripe-idempotency] exists precisely to break this loop.
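
To make the first failure pattern concrete, here is a minimal sketch of a payment endpoint that applies the charge but loses the response, so the caller retries. FakePaymentAPI, SlowResponse, and charge_with_retry are hypothetical names invented for this illustration; they are not part of Stripe's or any other real SDK.

```python
class SlowResponse(Exception):
    """Hypothetical error: the server did the work but the client never saw the reply."""


class FakePaymentAPI:
    """Toy in-memory payment service (an assumption for illustration only)."""

    def __init__(self):
        self.charges = []        # every charge that was actually applied
        self._by_key = {}        # idempotency key -> stored charge record
        self._first_call = True

    def charge(self, amount, idempotency_key=None):
        # A known key replays the stored result without a new side effect.
        if idempotency_key is not None and idempotency_key in self._by_key:
            return self._by_key[idempotency_key]
        record = {"id": len(self.charges) + 1, "amount": amount}
        self.charges.append(record)
        if idempotency_key is not None:
            self._by_key[idempotency_key] = record
        if self._first_call:
            # Simulate "charge applied, response lost" on the first request.
            self._first_call = False
            raise SlowResponse("charge applied, response lost")
        return record


def charge_with_retry(api, amount, idempotency_key=None, attempts=2):
    for attempt in range(attempts):
        try:
            return api.charge(amount, idempotency_key=idempotency_key)
        except SlowResponse:
            if attempt == attempts - 1:
                raise


no_key = FakePaymentAPI()
charge_with_retry(no_key, 100)

with_key = FakePaymentAPI()
charge_with_retry(with_key, 100, idempotency_key="user42:task7:step3")

print(len(no_key.charges), len(with_key.charges))
```

The retry loop is identical in both runs. Only the run that sends a stable key ends with a single applied charge.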


What Idempotency Means

Idempotent: An operation whose effect is the same whether it is applied once or many times with the same input.

IDEMPOTENT vs NON-IDEMPOTENT OPERATIONS
Idempotent:
GET /user/123          Same user every time (safe to retry)
DELETE /file/abc       File deleted, stays deleted (safe to retry)
PUT /user/123 {name}   User updated to same value (safe to retry)

Not Idempotent:
POST /charge/$100  New charge every time (dangerous to retry)
POST /email/send  New email every time (dangerous to retry)
INSERT INTO orders  New row every time (dangerous to retry)

Made Idempotent:
POST /charge/$100 + idempotency_key=xyz123  Same charge on retry
POST /email/send + message_id=abc456  Same email, no duplicate
INSERT ... ON CONFLICT DO NOTHING  Same row, no duplicate

Takeaway: for calls with side effects, idempotency doesn’t come from the HTTP verb. It comes from how you derive the dedup key.
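
The INSERT ... ON CONFLICT DO NOTHING row can be tried out with Python's built-in sqlite3 module. This is a sketch under assumptions: the orders table, the order_key column, and insert_order are made up for the example, and the upsert syntax needs SQLite 3.24 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_key TEXT PRIMARY KEY, item TEXT NOT NULL)"
)


def insert_order(order_key, item):
    """Insert once; a retry with the same order_key is a no-op."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO orders (order_key, item) VALUES (?, ?) "
            "ON CONFLICT(order_key) DO NOTHING",
            (order_key, item),
        )
    return cur.rowcount == 1  # True only when a row was actually written


first = insert_order("user42:task7:create_order", "flight LH123")
retry = insert_order("user42:task7:create_order", "flight LH123")
row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

The primary key plays the role of the idempotency key here: the database, not the caller, decides whether the write is a duplicate.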


The Stripe Pattern

Stripe processes millions of payments. They can’t afford duplicates. Their pattern is the industry standard [stripe-idempotency]:

  • Idempotency key in Idempotency-Key HTTP header on POST requests
  • Up to 255 characters; UUIDv4 or random string with enough entropy
  • Stripe stores the response status + body for at least 24 hours; same key in that window replays the stored result, including 500 errors
  • Avoid using emails or personal identifiers as keys — they leak into logs

from datetime import timedelta

# Client generates a unique key
def book_flight(flight_id, user_id, task_id, step_id):
    # Key must be STABLE across retries
    # Bad:  f"{user_id}:{timestamp}"        - different each retry
    # Bad:  f"{user_id}:{retry_count}"      - different each retry
    # Good: f"{user_id}:{task_id}:{step_id}" - same across retries

    idempotency_key = f"{user_id}:{task_id}:{step_id}"

    return api.book(
        flight_id=flight_id,
        idempotency_key=idempotency_key
    )

# Server checks and stores
def handle_booking(request):
    key = request.idempotency_key

    # Check if we've processed this before.
    # Compare against None so a falsy stored result still counts as a hit.
    cached = cache.get(key)
    if cached is not None:
        return cached  # Return stored result, don't reprocess

    # First time: process and store result.
    # Note: check-then-set is not atomic. Two concurrent requests with the
    # same key can both miss the cache, so production code needs an atomic
    # claim on the key (for example set-if-absent) before processing.
    result = process_booking(request)
    cache.set(key, result, ttl=timedelta(hours=24))
    return result
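
A plain check-then-store handler can let two concurrent requests with the same key both miss the cache and both run the booking. One way to close that gap is to claim the key atomically before doing the work. The sketch below uses an in-process lock and dict purely for illustration; IdempotencyStore, run_once, KeyReusedError, FirstAttemptFailed, and the fingerprint argument are hypothetical names, and a multi-process deployment would need a shared store with an atomic set-if-absent instead.

```python
import threading


class KeyReusedError(Exception):
    """Same key sent with different request parameters (hypothetical)."""


class FirstAttemptFailed(Exception):
    """The request that owned the key failed; the caller may retry (hypothetical)."""


class IdempotencyStore:
    def __init__(self):
        self._lock = threading.Lock()
        self._entries = {}  # key -> entry dict

    def run_once(self, key, fingerprint, operation):
        # Atomically either claim the key or find the existing claim.
        with self._lock:
            entry = self._entries.get(key)
            is_owner = entry is None
            if is_owner:
                entry = {"fingerprint": fingerprint, "done": threading.Event(),
                         "result": None, "failed": False}
                self._entries[key] = entry

        if entry["fingerprint"] != fingerprint:
            raise KeyReusedError(key)

        if not is_owner:
            # Someone else is (or was) processing this key: wait, then replay.
            entry["done"].wait()
            if entry["failed"]:
                raise FirstAttemptFailed(key)
            return entry["result"]

        try:
            entry["result"] = operation()
            return entry["result"]
        except BaseException:
            entry["failed"] = True
            with self._lock:
                self._entries.pop(key, None)  # release the claim for a later retry
            raise
        finally:
            entry["done"].set()  # wake any waiters, success or failure
```

The fingerprint check mirrors the idea of rejecting a reused key whose parameters differ from the first request. Entries in this sketch never expire, so a real store would also attach the 24-hour TTL.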

Key Generation Rules

| Include | Exclude | Why |
|---|---|---|
| user_id | timestamp | Timestamps change on retry |
| task_id | retry_count | Retry count changes on retry |
| step_id | random() | Random changes on retry |
| operation_type | request_id (if regenerated) | Must be stable |
| external_reference | | |

The test: If you retry the same logical operation, does the key stay the same? If not, it’s wrong.

Takeaway: stable identifiers in, stable key out. Anything that changes per attempt (timestamp, retry count, random) breaks the contract.
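
The stability test can be turned into a small helper. This is a sketch, assuming the four identifiers are available as strings or integers; make_idempotency_key is a hypothetical name, and hashing is one option among several. The hash keeps the key well under the 255-character limit and keeps raw identifiers out of logs.

```python
import hashlib


def make_idempotency_key(user_id, task_id, step_id, operation):
    """Derive a deterministic key from identifiers that do not change on retry."""
    parts = [str(user_id), str(task_id), str(step_id), str(operation)]
    # Length-prefix each part so ("ab", "c") and ("a", "bc") cannot collide.
    canonical = "|".join(f"{len(p)}:{p}" for p in parts)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()  # 64 hex chars


first_attempt = make_idempotency_key("u42", "t7", "s3", "book_flight")
second_attempt = make_idempotency_key("u42", "t7", "s3", "book_flight")
other_operation = make_idempotency_key("u42", "t7", "s3", "send_email")
```

Nothing that varies per attempt enters the function, so the first and second attempt produce the same key, while a different operation type produces a different one.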


Three Idempotency Strategies

Strategy 1: Idempotency Keys (Stripe Pattern)

Best for: External APIs, payments, bookings

class IdempotentClient:
    def __init__(self, cache):
        self.cache = cache

    def execute(self, operation, idempotency_key):
        # Check cache
        cached = self.cache.get(idempotency_key)
        if cached is not None:  # a falsy result is still a cache hit
            return cached

        # Execute and cache
        result = operation()
        self.cache.set(idempotency_key, result, ttl=86400)  # 24 hours
        return result

# Usage
client = IdempotentClient(redis_cache)
result = client.execute(
    operation=lambda: api.book_flight(flight_id),
    idempotency_key=f"{user_id}:{task_id}:book_flight:{flight_id}"
)

Strategy 2: Sequence Numbers

Best for: Internal state changes, ordered operations

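class OutOfOrderError(Exception):
    """Raised when a sequence number arrives ahead of the expected one."""

class SequencedOperations: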
    def __init__(self):
        self.expected_seq = 1
        self.results = {}

    def execute(self, seq_num, operation):
        # Already processed
        if seq_num < self.expected_seq:
            return self.results[seq_num]

        # Out of order
        if seq_num > self.expected_seq:
            raise OutOfOrderError(f"Expected {self.expected_seq}, got {seq_num}")

        # Process and increment
        result = operation()
        self.results[seq_num] = result
        self.expected_seq += 1
        return result

Tradeoff: Simple but requires ordered processing. Doesn’t work well with concurrent clients.

Strategy 3: Time Window Deduplication

Best for: Best-effort deduplication, high-volume low-stakes operations

import time

class TimeWindowDedup:
    def __init__(self, window_seconds=300):
        self.window = window_seconds
        self.seen = {}  # hash -> (timestamp, result); entries are never evicted here

    def execute(self, request_hash, operation):
        now = time.time()

        # Check if seen within window
        if request_hash in self.seen:
            timestamp, result = self.seen[request_hash]
            if now - timestamp < self.window:
                return result  # Within window, return cached

        # Process and cache
        result = operation()
        self.seen[request_hash] = (now, result)
        return result

Tradeoff: Allows some duplicates (if window expires), but prevents immediate retry storms.
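
The time-window strategy depends on request_hash being identical for logically identical requests. A minimal sketch of one way to compute it, assuming the parameters are JSON-serializable; the request_hash function and its argument names are illustrative only.

```python
import hashlib
import json


def request_hash(operation, params):
    """Hash an operation name plus its parameters, independent of dict key order."""
    payload = json.dumps(
        {"op": operation, "params": params},
        sort_keys=True,            # canonical key order
        separators=(",", ":"),     # no whitespace differences
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


a = request_hash("search", {"q": "flights to Oslo", "page": 1})
b = request_hash("search", {"page": 1, "q": "flights to Oslo"})
c = request_hash("search", {"q": "flights to Rome", "page": 1})
```

Sorting keys matters because two dicts with the same content can serialize in different orders, which would defeat the dedup.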

Takeaway: pick the strategy by the cost of a duplicate. Payments → keys. Internal counters → sequence. Search-as-you-type → time window.


Error Classification

Not all errors should be retried. Getting this wrong causes cascading failures.

| HTTP status | Class | Retry? | Why |
|---|---|---|---|
| 429 Too Many Requests | Rate limit | yes | back off + jitter; honour Retry-After if present |
| 502 Bad Gateway / 503 Service Unavailable / 504 Gateway Timeout | Upstream | yes | infrastructure flake; jitter prevents thundering herd |
| 408 Request Timeout | Network | yes | the request timed out in transit; the same operation is safe to resend with an idempotency key |
| Connection reset / DNS / TLS handshake | Network | yes | transient |
| 400 Bad Request / 422 Unprocessable Entity | Client | no | retry never succeeds; needs a code or input fix |
| 401 Unauthorized / 403 Forbidden | Auth | no | needs token refresh or escalation, not retry |
| 404 Not Found | Client | no | resource doesn’t exist |
| 409 Conflict | State | no | logical conflict; retry without state change wastes effort |

from http import HTTPStatus

RETRY_STATUSES = {
    HTTPStatus.REQUEST_TIMEOUT,           # 408
    HTTPStatus.TOO_MANY_REQUESTS,         # 429
    HTTPStatus.BAD_GATEWAY,               # 502
    HTTPStatus.SERVICE_UNAVAILABLE,       # 503
    HTTPStatus.GATEWAY_TIMEOUT,           # 504
}

RETRY_EXCEPTIONS = (ConnectionResetError, TimeoutError)

def should_retry(error):
    if isinstance(error, RETRY_EXCEPTIONS):
        return True
    status = getattr(error, "status_code", None)
    return status in RETRY_STATUSES

The rule: Retry infrastructure errors (network, timeout, overload). Don’t retry business errors (validation, auth, not found).

Takeaway: 4xx errors (except 408 and 429) are bugs in your request, not blips in their service. Retrying a 400 is wasted budget.
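
The table asks you to honour Retry-After on a 429. The header can carry either a number of seconds or an HTTP date, so a small parser helps. This is a sketch using only the standard library; parse_retry_after and its now parameter are names invented here, and how you feed the result into your backoff is left open.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_retry_after(value, now=None):
    """Return seconds to wait, or None if the header is absent or unparseable."""
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():
        return float(value)  # delta-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


fixed_now = datetime(2015, 10, 21, 7, 27, 0, tzinfo=timezone.utc)
seconds_form = parse_retry_after("120")
date_form = parse_retry_after("Wed, 21 Oct 2015 07:28:00 GMT", now=fixed_now)
```

A reasonable policy is to wait at least the server's value and still apply your own cap, so a misbehaving server cannot stall the agent indefinitely.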


Exponential Backoff with Full Jitter

Naive retry: Wait 1s, retry. All clients retry at the same time. Server overwhelmed again.

Smart retry: Wait random time, increasing with each attempt. Clients spread out. Server recovers.

Marc Brooker’s 2015 AWS Architecture Blog post [aws-jitter] gives the canonical formula:

sleep = random_between(0, min(cap, base * 2^attempt))

This is “full jitter”. In Brooker’s simulations, exponential backoff without jitter was the clear loser on both client work and completion time. Among the jittered variants, full jitter did slightly less work than equal jitter and finished sooner; compared with decorrelated jitter it did less work but took slightly more time. The win is amplification protection, not raw speed.

import random
import time

def retry_with_backoff(
    operation,
    max_retries=5,
    base_delay=0.1,
    max_delay=10.0,
):
    """Exponential backoff with full jitter (Brooker 2015).

    max_retries counts total attempts, including the first call.
    `operation` must reuse the same idempotency key on every attempt;
    this helper only decides whether and when to call it again.
    """
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    for attempt in range(max_retries):
        try:
            return operation()
        except Exception as e:
            if not should_retry(e):
                raise  # permanent failure: surface it
            if attempt == max_retries - 1:
                raise  # exhausted

            # cap = max_delay, base = base_delay
            ceiling = min(base_delay * (2 ** attempt), max_delay)
            time.sleep(random.uniform(0, ceiling))

Why Full Jitter?

JITTER PREVENTS THUNDERING HERD
WITHOUT JITTER:
Server fails at t=0
All 1000 clients retry at t=1
Server fails again
All 1000 clients retry at t=2
Server fails again
...

WITH FULL JITTER:
Server fails at t=0
Client A retries at t=0.3
Client B retries at t=0.7
Client C retries at t=0.1
...
Load spreads across 0-1 second window
Server can handle gradual recovery

Takeaway: without jitter you cause the outage you were trying to survive. The randomness is the feature.
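
A short simulation shows the spread. It is a sketch using the defaults from retry_with_backoff above (base 0.1 s, cap 10 s); backoff_ceiling and full_jitter_delay are helper names introduced for this example, and the 1000 clients are simulated, not real.

```python
import random


def backoff_ceiling(attempt, base=0.1, cap=10.0):
    """Upper bound of the sleep for a given attempt: min(cap, base * 2^attempt)."""
    return min(cap, base * (2 ** attempt))


def full_jitter_delay(attempt, base=0.1, cap=10.0, rng=random):
    """Full jitter: uniform random delay between 0 and the ceiling."""
    return rng.uniform(0, backoff_ceiling(attempt, base, cap))


# Ceilings double each attempt until they hit the cap.
ceilings = [backoff_ceiling(a) for a in range(8)]

# 1000 simulated clients, all on their 4th attempt (attempt index 3).
rng = random.Random(0)  # seeded so the run is repeatable
no_jitter = {backoff_ceiling(3) for _ in range(1000)}
with_jitter = {full_jitter_delay(3, rng=rng) for _ in range(1000)}

print(len(no_jitter), "distinct retry time(s) without jitter")
print(len(with_jitter), "distinct retry times with full jitter")
```

Without jitter every client computes the same delay, so the set collapses to a single value and they all arrive together. With full jitter the delays spread across the whole interval from 0 to the ceiling.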


Cascading Retry Storm

CASCADING RETRY STORM
1. Payment service has 30-second outage

2. Order processing agents timeout, start retrying
  1000 agents × 3 retries = 3000 payment requests

3. Payment retries trigger inventory checks
  Each payment retry calls inventory
  3000 inventory requests

4. Inventory service overwhelmed by traffic
  Starts timing out
  Agents retry inventory calls

5. Inventory retries trigger shipping checks
  Cascade continues

6. Within 60 seconds:
  10x normal load across all services
  Multiple services failing
  Complete system degradation
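
The amplification in the storm above is multiplicative. As a rough worst-case sketch, assume every caller makes the same number of attempts at each hop and every attempt fans out to the next service; worst_case_requests is a hypothetical helper, and real systems will land somewhere below this bound.

```python
def worst_case_requests(callers, attempts_per_call, depth):
    """Requests reaching the service `depth` hops down if every hop retries."""
    return callers * attempts_per_call ** depth


for depth in (1, 2, 3):
    print(depth, worst_case_requests(1000, 3, depth))
```

With 1000 agents and 3 attempts per call, the first hop sees 3000 requests, matching step 2 above. If the second hop is also retried 3 times, it can see up to 9000, and a third hop up to 27000.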

Prevention: Circuit Breakers

import time

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failure_count = 0  # consecutive failures
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.last_failure_time = None
        self.state = "CLOSED"  # CLOSED, OPEN, HALF_OPEN

    def call(self, operation):
        if self.state == "OPEN":
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = "HALF_OPEN"  # allow one trial call through
            else:
                raise CircuitOpenError("Circuit breaker is open")

        try:
            result = operation()
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()

            # A failed trial call, or too many consecutive failures, opens the circuit
            if self.state == "HALF_OPEN" or self.failure_count >= self.failure_threshold:
                self.state = "OPEN"
            raise

        # Any success closes the circuit and resets the consecutive-failure count
        self.state = "CLOSED"
        self.failure_count = 0
        return result

# Usage
payment_breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=30)

def charge_customer(amount):
    try:
        return payment_breaker.call(lambda: payment_api.charge(amount))
    except CircuitOpenError:
        # Don't even try: circuit is open
        return escalate_to_human("Payment service unavailable")

What happens when the breaker opens mid-cascade: the payment service degrades at t=0. By t=15s, the breaker on payment_api has tripped to OPEN. The agent’s order workflow doesn’t know that; it just sees CircuitOpenError and may retry the whole workflow from step 1, re-calling inventory_api, shipping_api, and notification_api. Without idempotency keys on each of those, you have now reserved inventory and sent notifications again for an order that was never paid. The fix has two parts: put idempotency keys on every step, not just payments, and classify CircuitOpenError as NEVER_RETRY at the workflow level, because it is a state signal, not a transient blip.

Takeaway: idempotency keys protect each call. Circuit breakers protect the whole system from cascading retries when one dependency degrades. Both must agree on what “retry” means.
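
Treating CircuitOpenError as NEVER_RETRY at the workflow level can look like the sketch below. run_workflow, the step functions, and the escalation dict are hypothetical; CircuitOpenError is redefined so the example runs on its own.

```python
class CircuitOpenError(Exception):
    """Stand-in for the breaker's rejection error (redefined so this runs alone)."""


def run_workflow(steps, max_attempts=3):
    """Run named steps in order. Restart on TimeoutError, never on CircuitOpenError."""
    for attempt in range(1, max_attempts + 1):
        try:
            return {"status": "done",
                    "results": [step() for _, step in steps]}
        except CircuitOpenError as exc:
            # State signal, not a blip: stop here instead of replaying step 1.
            return {"status": "escalated", "reason": str(exc), "attempts": attempt}
        except TimeoutError:
            # Transient: replay from step 1. Safe only if every step is idempotent.
            if attempt == max_attempts:
                raise


calls = {"inventory": 0, "payment": 0}


def reserve_inventory():
    calls["inventory"] += 1
    return "reserved"


def charge_payment():
    calls["payment"] += 1
    raise CircuitOpenError("payment_api circuit is open")


outcome = run_workflow([("inventory", reserve_inventory),
                        ("payment", charge_payment)])
```

Because the open circuit ends the workflow on the first attempt, inventory is called once instead of once per workflow retry.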


Framework-Specific Implementation

LangGraph (0.2.24+)

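LangGraph’s RetryPolicy is wired through a keyword argument on add_node, not a decorator. The RetryPolicy itself is a NamedTuple in langgraph.types (added in 0.2.24) [langgraph-docs]. The example uses the retry_policy spelling; some earlier releases named the same argument retry, so check the add_node signature for the version you have installed.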

from langgraph.graph import StateGraph
from langgraph.types import RetryPolicy
from langgraph.checkpoint.postgres import PostgresSaver

def call_external_api(state):
    # Stable key — survives retries because state fields don't change
    key = f"{state['user_id']}:{state['task_id']}:{state['step']}"
    return api.call(idempotency_key=key)

graph = StateGraph(AgentState)
graph.add_node(
    "call_external_api",
    call_external_api,
    retry_policy=RetryPolicy(
        initial_interval=0.5,    # seconds
        backoff_factor=2.0,
        max_interval=30.0,
        max_attempts=3,
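        jitter=True,             # jitter is on by default
    ),
)

# Checkpointing enables safe retry from last known state.
# In current langgraph-checkpoint-postgres releases, from_conn_string is a
# context manager, so compile and run the app inside the with block.
with PostgresSaver.from_conn_string(DATABASE_URL) as checkpointer:
    checkpointer.setup()  # create the checkpoint tables on first use
    app = graph.compile(checkpointer=checkpointer)

The jitter=True default means LangGraph adds random jitter to the retry interval without you doing anything. Whether that matches Brooker’s full-jitter formula depends on the installed version, so check the implementation if the distinction matters. The per-call idempotency key is still your responsibility.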

Temporal (Python SDK 1.x)

Temporal activities are at-least-once. Your idempotency key turns at-least-once delivery into effectively-once business effect [temporal-docs]:

from datetime import timedelta
from temporalio import activity, workflow
from temporalio.common import RetryPolicy

@activity.defn
async def book_flight(flight_id: str, idempotency_key: str) -> BookingResult:
    return await api.book(flight_id, idempotency_key=idempotency_key)

@workflow.defn
class BookingWorkflow:
    @workflow.run
    async def run(self, request: BookingRequest) -> BookingResult:
        return await workflow.execute_activity(
            book_flight,
            args=[
                request.flight_id,
                f"{request.user_id}:{request.booking_id}",
            ],
            start_to_close_timeout=timedelta(seconds=30),
            retry_policy=RetryPolicy(
                initial_interval=timedelta(seconds=1),
                maximum_interval=timedelta(seconds=30),
                backoff_coefficient=2.0,
                maximum_attempts=5,
                non_retryable_error_types=["ValidationError", "AuthError"],
            ),
        )

Takeaway: framework retry policies handle backoff (and jitter, where the framework provides it); your idempotency key handles the business effect. You need both.


What to Copy, What to Skip

| Pattern | Copy when | Skip when |
|---|---|---|
| Stripe-style idempotency key | external API with side effects (payment, booking, email send) | pure reads, internal cache lookups, no-side-effect tools |
| Sequence-number dedup | single-writer ordered pipelines (e.g. event-sourced state mutations) | concurrent multi-writer flows; the ordering assumption breaks |
| Time-window dedup (5 min) | high-volume, low-stakes (search-as-you-type, autocomplete telemetry) | money or anything regulated; duplicates after the window expires = legal risk |
| Full jitter (random(0, cap)) | every retry path. Default. | only if you have a contractual reason to be deterministic (and reconsider then) |
| Circuit breaker per dependency | downstream services with measurable failure rates | in-process pure-compute calls; overhead without payoff |
| 24h cache TTL on key→result | matches Stripe’s window; covers most retry storms | regulated environments where retry windows are mandated; pin to spec |
| LangGraph node retry_policy= | LangGraph 0.2+ workflows where retry is per-step | older releases (<0.2) that lack the parameter; upgrade or wrap manually |
| Temporal activity non_retryable_error_types | classify auth + validation errors as non-retryable at workflow level | transient infra errors; let the default retry kick in |

Takeaway: every endorsed pattern has a break condition. Read the right column before copying the left.


Common Gotchas

| Gotcha | Symptom | Fix |
|---|---|---|
| Timestamp in key | Retries create duplicates | Use stable identifiers only |
| Key too broad | Different operations collide | Include operation type in key |
| Key too narrow | Same operation not deduplicated | Include all relevant context |
| No TTL on cache | Memory leak | Set 24-hour TTL (matches Stripe window) |
| Cached failures | Retrying failed ops returns the failure | Only cache successful results, or store (status, body) and check status |
| Retrying 400s | Wasted requests, never succeed | Classify errors: 429 is retried, 400/422 are not |
| No jitter | Thundering herd on recovery | Always use full jitter: random(0, cap) |
| Key collision across users | User A’s retry returns User B’s result | Include user_id (or tenant id) in every key |

Takeaway: every gotcha here has been observed in production. Treat the table as a pre-merge review checklist for any tool that touches state.


The Idempotency Checklist

Before deploying an agent with external actions:

IDEMPOTENCY DEPLOYMENT CHECKLIST
KEY GENERATION
[ ] Keys use stable identifiers (user_id, task_id, step_id)
[ ] Keys do NOT include timestamps or retry counts
[ ] Keys include operation type to prevent collisions
[ ] Keys are deterministic (same input = same key)

ERROR HANDLING
[ ] Errors classified as RETRY vs NEVER_RETRY
[ ] 4xx errors (except 408 and 429) are not retried
[ ] Transient 5xx (502, 503, 504) and network errors are retried
[ ] Max retry limit is set

BACKOFF
[ ] Exponential backoff implemented
[ ] Full jitter added to prevent thundering herd
[ ] Max delay cap prevents infinite waits
[ ] Base delay appropriate for the API

CIRCUIT BREAKERS
[ ] Circuit breaker on each external dependency
[ ] Failure threshold tuned for the service
[ ] Recovery timeout allows service to stabilize
[ ] Open circuit has graceful fallback

Takeaway: ship none of these and your agent will charge a customer twice within the first week. Ship all four and retry storms become a non-event.


Closing

Your actions are now idempotent and your retries are jittered. Two things still go wrong: the agent crashes mid-task and loses its place, and the budget overruns when the loop won’t terminate. Both are next.

Testable question: can you describe the idempotency key for every external call your agent makes in one sentence each? If not, that’s the gap.

Part 2: State Persistence & Checkpointing — surviving the SIGKILL.


References

  • [stripe-idempotency] Stripe — Idempotent Requests. Stripe API docs. Retrieved 2026-05-13. Source for 24-hour TTL, 255-char max, header convention, replay-including-500s guarantee.
  • [aws-jitter] Marc Brooker (AWS Architecture Blog) — Exponential Backoff and Jitter, 2015-03-04. AWS Architecture Blog. Source for the full-jitter formula sleep = random(0, min(cap, base * 2^attempt)) and the empirical result that full jitter beats equal jitter under contention.
  • [aws-builders-retries] Marc Brooker (Amazon Builders’ Library) — Timeouts, Retries and Backoff with Jitter. Amazon Builders’ Library. Background on why retries amplify upstream stress.
  • [langgraph-docs] LangChain — LangGraph low-level concepts: add_node, RetryPolicy. LangGraph docs. Source for langgraph.types.RetryPolicy shape and add_node(retry_policy=...) wiring.
  • [temporal-docs] Temporal — Failure detection in Python. Temporal docs. Source for temporalio.common.RetryPolicy signature and non_retryable_error_types.