Idempotency & Safe Retries - The Stripe Pattern for Agents | Intentional / Deliberate / Engineering

Prerequisite: This is Part 1 of the Production Agents Deep Dive series. Start with Part 0: Overview for context.

Left half: an agent retries an API three times because the first call timed out. Three duplicate charges land in a red ledger. A flashing red panel reads 'Customer charged 3x.' Right half: the same agent retries through an orange idempotency gate with cache + 24h TTL. Only the first call reaches the green payment API; calls two and three return cache hits with the same txn id. A green panel reads 'Customer charged once.' A divider in the middle reads 'no key → with key.' — Idempotency: stable key turns at-least-once into effectively-once

Why This Matters

Your agent calls book_flight(). The API takes 35 seconds to respond. Your timeout is 30 seconds. Agent retries. API processes both requests. Customer is charged twice.

This isn’t a bug. This is correct retry logic meeting real-world latency.

Idempotency is the single most critical production requirement for agents that perform actions with side effects. Without it, retries create duplicates — double bookings, duplicate emails, corrupted state.

What other content gets wrong: tutorials show retry=3 and call it done. The Stripe docs cover idempotency keys but stop short of key derivation across retries. The AWS Builders’ Library [aws-builders-retries] explains why retries amplify upstream stress; Brooker’s companion post [aws-jitter] gives the jitter math — neither covers the agent-loop case. This chapter ties all three together — and shows what happens when your idempotency key collides across users or your circuit breaker opens mid-cascade.

Takeaway: retry logic without idempotency keys is a charge-customers-twice machine.

What Goes Wrong Without This:

IDEMPOTENCY FAILURE PATTERNS

Symptom: Customer charged twice for the same order.
Cause:   Payment API timed out. Agent retried. Both charges processed.
         No idempotency key to deduplicate.

Symptom: User receives 47 copies of the same email.
Cause:   Email send succeeded but response was slow. Agent assumed failure.
         Retried. No deduplication on sends.

Symptom: Database has duplicate records with slight variations.
Cause:   INSERT succeeded, network dropped response. Retry created second record.
         No upsert or idempotency check.

Takeaway: every failure mode here is a retry meeting a missing dedup key. Stripe’s idempotency-keys design [stripe-idempotency] exists precisely to break this loop.

What Idempotency Means

Idempotent: An operation whose effect is the same whether it runs once or many times with the same input. Repeat calls change nothing beyond what the first call did.

IDEMPOTENT vs NON-IDEMPOTENT OPERATIONS

Idempotent:
GET /user/123         → Same user every time (safe to retry)
DELETE /file/abc      → File deleted, stays deleted (safe to retry)
PUT /user/123 {name}  → User updated to same value (safe to retry)

Not Idempotent:
POST /charge/$100 → New charge every time (dangerous to retry)
POST /email/send → New email every time (dangerous to retry)
INSERT INTO orders → New row every time (dangerous to retry)

Made Idempotent:
POST /charge/$100 + idempotency_key=xyz123 → Same charge on retry
POST /email/send + message_id=abc456 → Same email, no duplicate
INSERT ... ON CONFLICT DO NOTHING → Same row, no duplicate
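The "Made Idempotent" rows can be tried locally. What follows is a minimal sketch using Python's built-in sqlite3 module (it needs SQLite 3.24 or newer for the ON CONFLICT clause). The orders table, its columns, and the insert_order helper are illustrative assumptions, not part of any real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    " idempotency_key TEXT PRIMARY KEY,"   # hypothetical dedup column
    " user_id TEXT NOT NULL,"
    " amount_cents INTEGER NOT NULL)"
)

def insert_order(key: str, user_id: str, amount_cents: int) -> bool:
    """Insert once per key; return True only if this call created the row."""
    cur = conn.execute(
        "INSERT INTO orders (idempotency_key, user_id, amount_cents) "
        "VALUES (?, ?, ?) ON CONFLICT(idempotency_key) DO NOTHING",
        (key, user_id, amount_cents),
    )
    conn.commit()
    return cur.rowcount == 1

first = insert_order("user-1:task-9:create_order", "user-1", 10_000)
retry = insert_order("user-1:task-9:create_order", "user-1", 10_000)
row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

The retry is a no-op at the database level, so the caller can repeat it safely after a dropped response.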

Takeaway: idempotency isn’t a property of the HTTP verb — it’s a property of how you derive the dedup key.

The Stripe Pattern

Stripe processes millions of payments. They can’t afford duplicates. Their pattern is the industry standard [stripe-idempotency]:

Idempotency key in Idempotency-Key HTTP header on POST requests
Up to 255 characters; UUIDv4 or random string with enough entropy
Stripe stores the response status + body for at least 24 hours; same key in that window replays the stored result, including 500 errors
Avoid using emails or personal identifiers as keys — they leak into logs

# Client generates a unique key
def book_flight(flight_id, user_id, task_id, step_id):
    # Key must be STABLE across retries
    # Bad:  f"{user_id}:{timestamp}"        - different each retry
    # Bad:  f"{user_id}:{retry_count}"      - different each retry
    # Good: f"{user_id}:{task_id}:{step_id}" - same across retries

    idempotency_key = f"{user_id}:{task_id}:{step_id}"

    return api.book(
        flight_id=flight_id,
        idempotency_key=idempotency_key
    )

# Server checks and stores
from datetime import timedelta

def handle_booking(request):
    key = request.idempotency_key

    # Check if we've processed this before
    cached = cache.get(key)
    if cached is not None:  # "is not None" so falsy results (0, {}, "") still replay
        return cached  # Return stored result, don't reprocess

    # First time: process and store result
    result = process_booking(request)
    cache.set(key, result, ttl=timedelta(hours=24))
    return result
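One gap in a check-then-store handler like the one above: two retries that arrive at the same moment can both miss the cache and both run the booking. A minimal single-process sketch of closing that gap with one lock per key follows. InMemoryIdempotencyStore and run_once are hypothetical names, there is no TTL, and a multi-process deployment would need an atomic claim in a shared store instead of a threading lock.

```python
import threading
from typing import Any, Callable, Dict

class InMemoryIdempotencyStore:
    """Hypothetical single-process store: one lock per key serialises duplicates."""

    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}
        self._locks: Dict[str, threading.Lock] = {}
        self._guard = threading.Lock()  # protects the _locks dict

    def _lock_for(self, key: str) -> threading.Lock:
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def run_once(self, key: str, operation: Callable[[], Any]) -> Any:
        # Only one caller per key holds this lock at a time, so a
        # concurrent retry waits and then sees the stored result.
        with self._lock_for(key):
            if key in self._results:
                return self._results[key]
            result = operation()
            self._results[key] = result
            return result

calls = []
store = InMemoryIdempotencyStore()

def book() -> dict:
    calls.append(1)  # stands in for the real side effect
    return {"booking_id": "bk_1"}

threads = [
    threading.Thread(target=store.run_once, args=("u1:t1:book_flight", book))
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Eight concurrent callers with the same key produce a single call to book(); the other seven get the stored result.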

Key Generation Rules

Include	Exclude	Why
`user_id`	`timestamp`	Timestamps change on retry
`task_id`	`retry_count`	Retry count changes on retry
`step_id`	`random()`	Random changes on retry
`operation_type`	`request_id` (if regenerated)	Must be stable
`external_reference`

The test: If you retry the same logical operation, does the key stay the same? If not, it’s wrong.
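That test is easy to automate. Below is a sketch of one way to derive a key, assuming you hash the stable identifiers rather than concatenate them. The function name and field order are illustrative. Hashing keeps the key at a fixed 64 characters (well under the 255-character limit) and keeps raw identifiers out of logs.

```python
import hashlib
import json

def derive_idempotency_key(user_id: str, task_id: str, step_id: str, operation: str) -> str:
    """Hash stable identifiers into a fixed-length key (64 hex chars)."""
    # JSON list encoding keeps field boundaries unambiguous, so
    # ("ab", "c") and ("a", "bc") cannot produce the same input string.
    canonical = json.dumps([user_id, task_id, step_id, operation], separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

key_first_try = derive_idempotency_key("user-42", "task-7", "step-3", "book_flight")
key_retry = derive_idempotency_key("user-42", "task-7", "step-3", "book_flight")
key_other_user = derive_idempotency_key("user-43", "task-7", "step-3", "book_flight")
```

Same logical operation, same key; a different user gets a different key. The tradeoff is that a hashed key is harder to read when debugging, so some teams log the unhashed parts alongside it in a protected field.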

Takeaway: stable identifiers in, stable key out. Anything that changes per attempt (timestamp, retry count, random) breaks the contract.

Three Idempotency Strategies

Strategy 1: Idempotency Keys (Stripe Pattern)

Best for: External APIs, payments, bookings

class IdempotentClient:
    def __init__(self, cache):
        self.cache = cache

    def execute(self, operation, idempotency_key):
        # Check cache ("is not None" so falsy results such as 0 or {} still replay)
        cached = self.cache.get(idempotency_key)
        if cached is not None:
            return cached

        # Execute and cache
        result = operation()
        self.cache.set(idempotency_key, result, ttl=86400)  # 24 hours
        return result

# Usage
client = IdempotentClient(redis_cache)
result = client.execute(
    operation=lambda: api.book_flight(flight_id),
    idempotency_key=f"{user_id}:{task_id}:book_flight:{flight_id}"
)

Strategy 2: Sequence Numbers

Best for: Internal state changes, ordered operations

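class OutOfOrderError(Exception):
    """Raised when a sequence number arrives ahead of the expected one."""

class SequencedOperations: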
    def __init__(self):
        self.expected_seq = 1
        self.results = {}

    def execute(self, seq_num, operation):
        # Already processed
        if seq_num < self.expected_seq:
            return self.results[seq_num]

        # Out of order
        if seq_num > self.expected_seq:
            raise OutOfOrderError(f"Expected {self.expected_seq}, got {seq_num}")

        # Process and increment
        result = operation()
        self.results[seq_num] = result
        self.expected_seq += 1
        return result

Tradeoff: Simple but requires ordered processing. Doesn’t work well with concurrent clients.

Strategy 3: Time Window Deduplication

Best for: Best-effort deduplication, high-volume low-stakes operations

import time

class TimeWindowDedup:
    def __init__(self, window_seconds=300):
        self.window = window_seconds
        self.seen = {}  # hash -> (timestamp, result)

    def execute(self, request_hash, operation):
        now = time.time()

        # Check if seen within window
        if request_hash in self.seen:
            timestamp, result = self.seen[request_hash]
            if now - timestamp < self.window:
                return result  # Within window, return cached

        # Process and cache
        result = operation()
        self.seen[request_hash] = (now, result)
        return result

Tradeoff: Allows some duplicates (if window expires), but prevents immediate retry storms.

Takeaway: pick the strategy by the cost of a duplicate. Payments → keys. Internal counters → sequence. Search-as-you-type → time window.

Error Classification

Not all errors should be retried. Getting this wrong causes cascading failures.

HTTP status	Class	Retry?	Why
`429 Too Many Requests`	Rate limit	yes	back off + jitter; honour `Retry-After` if present
`502 Bad Gateway` / `503 Service Unavailable` / `504 Gateway Timeout`	Upstream	yes	infrastructure flake; jitter prevents thundering herd
`408 Request Timeout`	Network	yes	server gave up waiting for the request; the same operation is safe to resend with an idempotency key
Connection reset / DNS / TLS handshake	Network	yes	transient
`400 Bad Request` / `422 Unprocessable Entity`	Client	no	retry never succeeds — needs code or input fix
`401 Unauthorized` / `403 Forbidden`	Auth	no	needs token refresh or escalation, not retry
`404 Not Found`	Client	no	resource doesn’t exist
`409 Conflict`	State	no	logical conflict; retry without state change wastes effort

from http import HTTPStatus

RETRY_STATUSES = {
    HTTPStatus.REQUEST_TIMEOUT,           # 408
    HTTPStatus.TOO_MANY_REQUESTS,         # 429
    HTTPStatus.BAD_GATEWAY,               # 502
    HTTPStatus.SERVICE_UNAVAILABLE,       # 503
    HTTPStatus.GATEWAY_TIMEOUT,           # 504
}

RETRY_EXCEPTIONS = (ConnectionResetError, TimeoutError)

def should_retry(error):
    if isinstance(error, RETRY_EXCEPTIONS):
        return True
    status = getattr(error, "status_code", None)
    return status in RETRY_STATUSES

The rule: Retry infrastructure errors (network, timeout, overload). Don’t retry business errors (validation, auth, not found).
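The table says to honour Retry-After on a 429. That header can carry either a number of seconds or an HTTP date, so it needs a small parser. A sketch using only the standard library follows; retry_after_seconds is a hypothetical helper name, and the `now` parameter exists only to make the function testable.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional

def retry_after_seconds(header_value: Optional[str],
                        now: Optional[datetime] = None) -> Optional[float]:
    """Return the server-requested wait in seconds, or None if absent or unparseable."""
    if not header_value:
        return None
    value = header_value.strip()
    if value.isdigit():                      # delay-seconds form, e.g. "120"
        return float(value)
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):          # exception type varies by Python version
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())

fixed_now = datetime(2015, 10, 21, 7, 27, 0, tzinfo=timezone.utc)
from_seconds = retry_after_seconds("120")
from_date = retry_after_seconds("Wed, 21 Oct 2015 07:28:00 GMT", now=fixed_now)
```

One way to combine this with backoff is to sleep for the larger of the server hint and your own jittered delay, so the client never retries earlier than the server asked.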

Takeaway: 4xx errors (except 408 and 429) are bugs in your request, not blips in their service. Retrying a 400 is wasted budget.

Exponential Backoff with Full Jitter

Naive retry: Wait 1s, retry. All clients retry at the same time. Server overwhelmed again.

Smart retry: Wait random time, increasing with each attempt. Clients spread out. Server recovers.

Marc Brooker’s 2015 AWS Architecture Blog post [aws-jitter] gives the canonical formula:

sleep = random_between(0, min(cap, base * 2^attempt))

This is “full jitter”. Brooker’s simulations showed that full jitter does far less client work and puts far less load on the server than backoff with no jitter, and slightly less work than equal jitter while also finishing sooner. Against decorrelated jitter the choice is closer: full jitter does less work but takes slightly more wall-clock time. The win is amplification protection, not raw speed.

import random
import time

def retry_with_backoff(
    operation,
    max_retries=5,
    base_delay=0.1,
    max_delay=10.0,
):
    """Exponential backoff with full jitter (Brooker 2015).

    `operation` must carry its own stable idempotency key; this helper
    only decides when to call it again.
    """
    for attempt in range(max_retries):
        try:
            return operation()
        except Exception as e:
            if not should_retry(e):
                raise  # permanent failure — surface it
            if attempt == max_retries - 1:
                raise  # exhausted

            # cap = max_delay, base = base_delay
            ceiling = min(base_delay * (2 ** attempt), max_delay)
            time.sleep(random.uniform(0, ceiling))

Why Full Jitter?

JITTER PREVENTS THUNDERING HERD

WITHOUT JITTER:
Server fails at t=0
All 1000 clients retry at t=1
Server fails again
All 1000 clients retry at t=2
Server fails again
...

WITH FULL JITTER:
Server fails at t=0
Client A retries at t=0.3
Client B retries at t=0.7
Client C retries at t=0.1
...
Load spreads across 0-1 second window
Server can handle gradual recovery
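The difference is easy to see in a toy simulation. This sketch assumes 1000 clients on the same retry attempt with a 1-second ceiling, and counts how many retries land in the busiest 100 ms bucket. The numbers and the peak_arrivals helper are illustrative, not a model of any real service.

```python
import random
from collections import Counter

def peak_arrivals(retry_times, bucket_seconds=0.1):
    """Largest number of retries landing in any one time bucket."""
    buckets = Counter(int(t / bucket_seconds) for t in retry_times)
    return max(buckets.values())

rng = random.Random(42)          # fixed seed so the run is repeatable
clients = 1000
ceiling = 1.0                    # min(cap, base * 2**attempt) for this attempt

no_jitter = [ceiling for _ in range(clients)]                    # everyone waits exactly 1s
full_jitter = [rng.uniform(0, ceiling) for _ in range(clients)]  # everyone picks a random wait

peak_without = peak_arrivals(no_jitter)
peak_with = peak_arrivals(full_jitter)
```

Without jitter all 1000 retries hit the same bucket. With full jitter they spread across roughly ten buckets, so the busiest one sees a small fraction of the total.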

Takeaway: without jitter you cause the outage you were trying to survive. The randomness is the feature.

Cascading Retry Storm

CASCADING RETRY STORM

1. Payment service has 30-second outage

2. Order processing agents timeout, start retrying
 → 1000 agents × 3 retries = 3000 payment requests

3. Payment retries trigger inventory checks
 → Each payment retry calls inventory
 → 3000 inventory requests

4. Inventory service overwhelmed by traffic
 → Starts timing out
 → Agents retry inventory calls

5. Inventory retries trigger shipping checks
 → Cascade continues

6. Within 60 seconds:
 → 10x normal load across all services
 → Multiple services failing
 → Complete system degradation
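The arithmetic behind the cascade is multiplicative. A rough upper bound, assuming every layer makes the same number of attempts and every attempt fans out to the next layer (a worst case, not a measurement):

```python
def worst_case_requests(clients: int, attempts_per_layer: int, layers: int) -> int:
    """Upper bound on requests reaching the deepest layer when every
    layer makes `attempts_per_layer` attempts per incoming request."""
    return clients * attempts_per_layer ** layers

payment_load = worst_case_requests(1000, 3, 1)     # agents retrying payment
inventory_load = worst_case_requests(1000, 3, 2)   # each payment attempt retrying inventory
shipping_load = worst_case_requests(1000, 3, 3)    # each inventory attempt retrying shipping
```

Three attempts per layer looks harmless in one place. Three layers deep it is a 27x multiplier on the original 1000 requests.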

Prevention: Circuit Breakers

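import time

class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is refused."""

class CircuitBreaker: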
    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.last_failure_time = None
        self.state = "CLOSED"  # CLOSED, OPEN, HALF_OPEN

    def call(self, operation):
        if self.state == "OPEN":
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = "HALF_OPEN"
            else:
                raise CircuitOpenError("Circuit breaker is open")

        try:
            result = operation()
            # Any success resets the count, so only consecutive failures trip the breaker
            self.state = "CLOSED"
            self.failure_count = 0
            return result
        except Exception as e:
            self.failure_count += 1
            self.last_failure_time = time.time()

            if self.failure_count >= self.failure_threshold:
                self.state = "OPEN"
            raise

# Usage
payment_breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=30)

try:
    result = payment_breaker.call(lambda: payment_api.charge(amount))
except CircuitOpenError:
    # Don't even try — circuit is open
    return escalate_to_human("Payment service unavailable")

What happens when the breaker opens mid-cascade: the payment service degrades at t=0. By t=15s, the breaker on payment_api has tripped to OPEN. The agent’s order workflow doesn’t know that. It only sees CircuitOpenError and may retry the whole workflow from step 1, calling inventory_api, shipping_api, and notification_api again. Without idempotency keys on each of those calls, you have now reserved inventory and sent notifications a second time for an order that was never paid. The fix: idempotency keys at every step, not just payments. And classify CircuitOpenError as NEVER_RETRY at the workflow level, because it is a state signal, not a transient blip.
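A sketch of what that looks like at the workflow level, assuming each step takes its idempotency key as an argument and completed results live in a dict (in production that would be a durable store). run_workflow, the step functions, and the local CircuitOpenError class are all hypothetical, defined here so the snippet runs on its own.

```python
from typing import Any, Callable, Dict, List, Tuple

class CircuitOpenError(Exception):
    """Raised by a breaker that is refusing calls."""

Step = Tuple[str, Callable[[str], Any]]  # (step name, function taking the idempotency key)

def run_workflow(user_id: str, task_id: str, steps: List[Step],
                 completed: Dict[str, Any]) -> Dict[str, Any]:
    """Run steps in order, skipping any whose key already has a stored result."""
    for name, step in steps:
        key = f"{user_id}:{task_id}:{name}"
        if key in completed:
            continue  # finished on an earlier run; do not repeat the side effect
        try:
            completed[key] = step(key)
        except CircuitOpenError:
            # State signal, not a transient blip: stop here, keep progress,
            # and let the caller escalate instead of restarting from step 1.
            break
    return completed

calls: List[str] = []
payment = {"up": False}

def reserve_inventory(key: str) -> str:
    calls.append("inventory")
    return "reserved"

def charge_payment(key: str) -> str:
    if not payment["up"]:
        raise CircuitOpenError("payment breaker is open")
    calls.append("payment")
    return "charged"

steps = [("reserve_inventory", reserve_inventory), ("charge_payment", charge_payment)]
progress: Dict[str, Any] = {}
run_workflow("user-1", "task-1", steps, progress)   # stops at the open breaker
payment["up"] = True
run_workflow("user-1", "task-1", steps, progress)   # resumes; inventory is not re-run
```

The second run picks up at the payment step. Inventory is reserved exactly once even though the workflow ran twice.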

Takeaway: idempotency keys protect each call. Circuit breakers protect the whole system from cascading retries when one dependency degrades. Both must agree on what “retry” means.

Framework-Specific Implementation

LangGraph (0.2.24+)

LangGraph’s RetryPolicy is wired via add_node’s retry_policy parameter, not a decorator. The RetryPolicy itself is a NamedTuple in langgraph.types (added in 0.2.24) [langgraph-docs]:

from langgraph.graph import StateGraph
from langgraph.types import RetryPolicy
from langgraph.checkpoint.postgres import PostgresSaver

def call_external_api(state):
    # Stable key — survives retries because state fields don't change
    key = f"{state['user_id']}:{state['task_id']}:{state['step']}"
    return api.call(idempotency_key=key)

graph = StateGraph(AgentState)
graph.add_node(
    "call_external_api",
    call_external_api,
    retry_policy=RetryPolicy(
        initial_interval=0.5,    # seconds
        backoff_factor=2.0,
        max_interval=30.0,
        max_attempts=3,
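        jitter=True,             # on by default; adds random jitter to each wait
    ),
)

# Checkpointing enables safe retry from last known state.
# from_conn_string is a context manager in current langgraph-checkpoint-postgres
# releases, so compile and run the graph inside the `with` block.
with PostgresSaver.from_conn_string(DATABASE_URL) as checkpointer:
    checkpointer.setup()  # creates the checkpoint tables on first use
    app = graph.compile(checkpointer=checkpointer)

With jitter=True (the default), LangGraph randomises the wait between attempts without you doing anything. The per-call idempotency key is still your responsibility.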

Temporal (Python SDK 1.x)

Temporal activities are at-least-once. Your idempotency key turns at-least-once delivery into effectively-once business effect [temporal-docs]:

from datetime import timedelta
from temporalio import activity, workflow
from temporalio.common import RetryPolicy

@activity.defn
async def book_flight(flight_id: str, idempotency_key: str) -> BookingResult:
    return await api.book(flight_id, idempotency_key=idempotency_key)

@workflow.defn
class BookingWorkflow:
    @workflow.run
    async def run(self, request: BookingRequest) -> BookingResult:
        return await workflow.execute_activity(
            book_flight,
            args=[
                request.flight_id,
                f"{request.user_id}:{request.booking_id}",
            ],
            start_to_close_timeout=timedelta(seconds=30),
            retry_policy=RetryPolicy(
                initial_interval=timedelta(seconds=1),
                maximum_interval=timedelta(seconds=30),
                backoff_coefficient=2.0,
                maximum_attempts=5,
                non_retryable_error_types=["ValidationError", "AuthError"],
            ),
        )

Takeaway: framework retry policies handle backoff + jitter; your idempotency key handles the business effect. You need both.

What to Copy, What to Skip

Pattern	Copy when	Skip when
Stripe-style idempotency key	external API with side effects (payment, booking, email send)	pure reads, internal cache lookups, no-side-effect tools
Sequence-number dedup	single-writer ordered pipelines (e.g. event-sourced state mutations)	concurrent multi-writer flows — order assumption breaks
Time-window dedup (5 min)	high-volume, low-stakes (search-as-you-type, autocomplete telemetry)	money or anything regulated: a duplicate after the window expires is a legal risk
Full jitter (`random(0, cap)`)	every retry path. Default.	only if you have a contractual reason to be deterministic (and reconsider then)
Circuit breaker per dependency	downstream services with measurable failure rates	in-process pure-compute calls — overhead without payoff
24h cache TTL on key→result	matches Stripe’s window; covers most retry storms	regulated environments where retry windows are mandated — pin to spec
LangGraph node `retry_policy=`	LangGraph 0.2.24+ workflows where retry is per-step	older releases that lack the parameter: upgrade or wrap manually
Temporal activity `non_retryable_error_types`	classify auth + validation errors as non-retryable at workflow level	for transient infra errors — let the default retry kick in

Takeaway: every endorsed pattern has a break condition. Read the right column before copying the left.

Common Gotchas

Gotcha	Symptom	Fix
Timestamp in key	Retries create duplicates	Use stable identifiers only
Key too broad	Different operations collide	Include operation type in key
Key too narrow	Same operation not deduplicated	Include all relevant context
No TTL on cache	Memory leak	Set 24-hour TTL (matches Stripe window)
Cached failures	Retrying failed ops returns the failure	Only cache successful results, or store `(status, body)` and check status
Retrying 400s	Wasted requests that never succeed	Classify errors: retry `408`/`429`, never `400`/`422`
No jitter	Thundering herd on recovery	Always use full jitter — `random(0, cap)`
Key collision across users	User A’s retry returns User B’s result	Include `user_id` (or tenant id) in every key
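The "Cached failures" row is the one most often skipped. A minimal sketch of the "store `(status, body)` and check status" fix follows, assuming the operation returns an HTTP-style status with its body. StatusAwareCache is a hypothetical name, and treating 408/429/502/503/504 as transient mirrors the retry table in this chapter rather than any provider's rule.

```python
from typing import Any, Callable, Dict, Tuple

Response = Tuple[int, Any]  # (HTTP-style status, body)

class StatusAwareCache:
    """Hypothetical cache: remembers final outcomes, forgets transient failures."""
    TRANSIENT = {408, 429, 502, 503, 504}

    def __init__(self) -> None:
        self._store: Dict[str, Response] = {}

    def execute(self, key: str, operation: Callable[[], Response]) -> Response:
        if key in self._store:
            return self._store[key]
        status, body = operation()
        if status not in self.TRANSIENT:
            # Success or permanent failure: safe to replay on retry.
            self._store[key] = (status, body)
        return status, body

responses = iter([(503, "unavailable"), (200, {"charge_id": "ch_1"})])
attempts = []

def charge() -> Response:
    attempts.append(1)
    return next(responses)

cache = StatusAwareCache()
first = cache.execute("user-1:task-1:charge", charge)    # 503, not stored
second = cache.execute("user-1:task-1:charge", charge)   # runs again, 200 stored
third = cache.execute("user-1:task-1:charge", charge)    # replayed from cache
```

This is a client-side choice and differs from Stripe's server-side behaviour of replaying the stored result for a key, 500s included. It is only safe when the downstream call is itself deduplicated by the same key, since a transient error does not prove the side effect never happened.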

Takeaway: every gotcha here has been observed in production. Treat the table as a pre-merge review checklist for any tool that touches state.

The Idempotency Checklist

Before deploying an agent with external actions:

IDEMPOTENCY DEPLOYMENT CHECKLIST

KEY GENERATION
[ ] Keys use stable identifiers (user_id, task_id, step_id)
[ ] Keys do NOT include timestamps or retry counts
[ ] Keys include operation type to prevent collisions
[ ] Keys are deterministic (same input = same key)

ERROR HANDLING
[ ] Errors classified as RETRY vs NEVER_RETRY
[ ] 4xx errors (except 408 and 429) are not retried
[ ] 5xx and network errors are retried
[ ] Max retry limit is set

BACKOFF
[ ] Exponential backoff implemented
[ ] Full jitter added to prevent thundering herd
[ ] Max delay cap prevents infinite waits
[ ] Base delay appropriate for the API

CIRCUIT BREAKERS
[ ] Circuit breaker on each external dependency
[ ] Failure threshold tuned for the service
[ ] Recovery timeout allows service to stabilize
[ ] Open circuit has graceful fallback

Takeaway: ship none of these and your agent will charge a customer twice within the first week. Ship all four and retry storms become a non-event.

Closing

Your actions are now idempotent and your retries are jittered. Two things still go wrong: the agent crashes mid-task and loses its place, and the budget overruns when the loop won’t terminate. Both are next.

Testable question: can you describe the idempotency key for every external call your agent makes in one sentence each? If not, that’s the gap.

→ Part 2: State Persistence & Checkpointing — surviving the SIGKILL.

References

[stripe-idempotency] Stripe — Idempotent Requests. Stripe API docs. Retrieved 2026-05-13. Source for 24-hour TTL, 255-char max, header convention, replay-including-500s guarantee.
[aws-jitter] Marc Brooker (AWS Architecture Blog) — Exponential Backoff and Jitter, 2015-03-04. AWS Architecture Blog. Source for the full-jitter formula sleep = random(0, min(cap, base * 2^attempt)) and the empirical result that full jitter beats equal jitter under contention.
[aws-builders-retries] Marc Brooker (Amazon Builders’ Library) — Timeouts, Retries and Backoff with Jitter. Amazon Builders’ Library. Background on why retries amplify upstream stress.
[langgraph-docs] LangChain — LangGraph low-level concepts: add_node, RetryPolicy. LangGraph docs. Source for langgraph.types.RetryPolicy shape and add_node(retry_policy=...) wiring.
[temporal-docs] Temporal — Failure detection in Python. Temporal docs. Source for temporalio.common.RetryPolicy signature and non_retryable_error_types.