
Production-agents Series

Idempotency & Safe Retries - The Stripe Pattern for Agents

Deep dive into idempotency: the single highest-leverage production requirement. Learn the Stripe pattern, error classification, jitter, and how to prevent cascading retry storms

Prerequisite: This is Part 1 of the Production Agents Deep Dive series. Start with Part 0: Overview for context.

Why This Matters

Your agent calls book_flight(). The API takes 35 seconds to respond. Your timeout is 30 seconds. Agent retries. API processed both requests. Customer is charged twice.

This isn’t a bug. This is correct retry logic meeting real-world latency.

Idempotency is the single most critical production requirement for agents that perform actions with side effects. Without it, retries create duplicates — double bookings, duplicate emails, corrupted state.

The Numbers:

  • 68% of teams hit budget overruns in first agent deployments
  • 50% cite “runaway tool loops and recursive logic” as the cause
  • API downtime surged 60% between Q1 2024 and Q1 2025
  • More downtime = more retries = more duplicate operations

What Goes Wrong Without This:

[Diagram: Idempotency failure patterns]

What Idempotency Means

Idempotent: An operation that produces the same result when called multiple times with the same input.

[Diagram: Idempotent vs. non-idempotent operations]
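
A minimal illustration, with hypothetical names: setting a value to an absolute amount is idempotent, while incrementing it is not.

balance = {"amount": 100}

def set_balance(amount):
    # Idempotent: calling this twice with the same input leaves the same state
    balance["amount"] = amount

def add_to_balance(delta):
    # Non-idempotent: every call (including a retry) changes the state again
    balance["amount"] += delta

set_balance(150); set_balance(150)      # balance ends at 150 either way
add_to_balance(50); add_to_balance(50)  # a retry doubles the effect: 250, not 200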

The Stripe Pattern

Stripe processes millions of payments. They can’t afford duplicates. Their pattern is the industry standard:

# Client generates a unique key
def book_flight(flight_id, user_id, task_id, step_id):
    # Key must be STABLE across retries
    # Bad:  f"{user_id}:{timestamp}"        - different each retry
    # Bad:  f"{user_id}:{retry_count}"      - different each retry
    # Good: f"{user_id}:{task_id}:{step_id}" - same across retries

    idempotency_key = f"{user_id}:{task_id}:{step_id}"

    return api.book(
        flight_id=flight_id,
        idempotency_key=idempotency_key
    )

from datetime import timedelta

# Server checks and stores
def handle_booking(request):
    key = request.idempotency_key

    # Check if we've processed this before
    cached = cache.get(key)
    if cached:
        return cached  # Return stored result, don't reprocess

    # First time: process and store result
    result = process_booking(request)
    cache.set(key, result, ttl=timedelta(hours=24))
    return result

Key Generation Rules

Include             Exclude                       Why excluded
user_id             timestamp                     Timestamps change on retry
task_id             retry_count                   Retry count changes on retry
step_id             random()                      Random values change on retry
operation_type      request_id (if regenerated)   A regenerated request ID is not stable
external_reference

The test: If you retry the same logical operation, does the key stay the same? If not, it’s wrong.
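
One quick sanity check, sketched with a hypothetical make_key helper: build the key twice for the same logical step and confirm it does not change.

def make_key(user_id, task_id, step_id, operation_type):
    # Only stable identifiers: no timestamps, retry counts, or random values
    return f"{user_id}:{task_id}:{step_id}:{operation_type}"

first_attempt = make_key("u_42", "task_9", "step_3", "book_flight")
retry_attempt = make_key("u_42", "task_9", "step_3", "book_flight")
assert first_attempt == retry_attempt  # same logical operation, same key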


Three Idempotency Strategies

Strategy 1: Idempotency Keys (Stripe Pattern)

Best for: External APIs, payments, bookings

class IdempotentClient:
    def __init__(self, cache):
        self.cache = cache

    def execute(self, operation, idempotency_key):
        # Check cache
        cached = self.cache.get(idempotency_key)
        if cached:
            return cached

        # Execute and cache
        result = operation()
        self.cache.set(idempotency_key, result, ttl=86400)  # 24 hours
        return result

# Usage
client = IdempotentClient(redis_cache)
result = client.execute(
    operation=lambda: api.book_flight(flight_id),
    idempotency_key=f"{user_id}:{task_id}:book_flight:{flight_id}"
)

Strategy 2: Sequence Numbers

Best for: Internal state changes, ordered operations

class SequencedOperations:
    def __init__(self):
        self.expected_seq = 1
        self.results = {}

    def execute(self, seq_num, operation):
        # Already processed
        if seq_num < self.expected_seq:
            return self.results[seq_num]

        # Out of order
        if seq_num > self.expected_seq:
            raise OutOfOrderError(f"Expected {self.expected_seq}, got {seq_num}")

        # Process and increment
        result = operation()
        self.results[seq_num] = result
        self.expected_seq += 1
        return result
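
A usage sketch (apply_update is a stand-in for any internal state change): a redelivered sequence number returns the stored result instead of reapplying the change.

ops = SequencedOperations()

ops.execute(1, lambda: apply_update("reserve seat"))
result = ops.execute(2, lambda: apply_update("confirm booking"))

# Redelivery of seq 2 is below expected_seq, so the cached result comes back
assert ops.execute(2, lambda: apply_update("confirm booking")) == result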

Tradeoff: Simple but requires ordered processing. Doesn’t work well with concurrent clients.

Strategy 3: Time Window Deduplication

Best for: Best-effort deduplication, high-volume low-stakes operations

import time

class TimeWindowDedup:
    def __init__(self, window_seconds=300):
        self.window = window_seconds
        self.seen = {}  # hash -> (timestamp, result)

    def execute(self, request_hash, operation):
        now = time.time()

        # Check if seen within window
        if request_hash in self.seen:
            timestamp, result = self.seen[request_hash]
            if now - timestamp < self.window:
                return result  # Within window, return cached

        # Process and cache
        result = operation()
        self.seen[request_hash] = (now, result)
        return result
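
The request hash can be any stable fingerprint of the request. One option, sketched here with a hypothetical send_email call, is hashing the canonical JSON of the payload:

import hashlib
import json

def request_fingerprint(payload: dict) -> str:
    # Canonical JSON so logically identical requests hash identically
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

dedup = TimeWindowDedup(window_seconds=300)
payload = {"user_id": "u_42", "action": "send_email", "template": "welcome"}
dedup.execute(request_fingerprint(payload), lambda: send_email(payload))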

Tradeoff: Allows some duplicates (if window expires), but prevents immediate retry storms.


Error Classification

Not all errors should be retried. Getting this wrong causes cascading failures.

from http import HTTPStatus

# Transient infrastructure errors: retry these
RETRY_EXCEPTIONS = (
    ConnectionResetError,           # Network blip
    TimeoutError,                   # Slow response
)

RETRY_STATUS_CODES = {
    HTTPStatus.TOO_MANY_REQUESTS,   # 429 - Rate limited
    HTTPStatus.SERVICE_UNAVAILABLE, # 503 - Server overloaded
    HTTPStatus.GATEWAY_TIMEOUT,     # 504 - Upstream timeout
    HTTPStatus.BAD_GATEWAY,         # 502 - Proxy error
}

# Permanent business errors: never retry these
NEVER_RETRY_STATUS_CODES = {
    HTTPStatus.BAD_REQUEST,           # 400 - Invalid input
    HTTPStatus.UNAUTHORIZED,          # 401 - Auth failed
    HTTPStatus.FORBIDDEN,             # 403 - Not allowed
    HTTPStatus.NOT_FOUND,             # 404 - Doesn't exist
    HTTPStatus.UNPROCESSABLE_ENTITY,  # 422 - Business rule rejected
    HTTPStatus.CONFLICT,              # 409 - State conflict
}

def should_retry(error):
    # Exceptions raised by the network stack are transient
    if isinstance(error, RETRY_EXCEPTIONS):
        return True
    # HTTP-style errors are classified by status code
    status = getattr(error, 'status_code', None)
    if status is None:
        return False
    return int(status) in {s.value for s in RETRY_STATUS_CODES}

The rule: Retry infrastructure errors (network, timeout, overload). Don’t retry business errors (validation, auth, not found).
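
For example, assuming a hypothetical APIError type that carries a status_code attribute:

class APIError(Exception):
    def __init__(self, status_code):
        self.status_code = status_code

should_retry(TimeoutError())     # True:  transient infrastructure error
should_retry(APIError(429))      # True:  rate limited, back off and retry
should_retry(APIError(400))      # False: the request itself is invalid
should_retry(ValueError("bad"))  # False: unknown errors default to no retry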


Exponential Backoff with Full Jitter

Naive retry: Wait 1s, retry. All clients retry at the same time. Server overwhelmed again.

Smart retry: Wait random time, increasing with each attempt. Clients spread out. Server recovers.

import random
import time

def retry_with_backoff(
    operation,
    max_retries=5,
    base_delay=0.1,
    max_delay=10.0
):
    """
    Exponential backoff with full jitter.

    AWS research shows full jitter significantly reduces
    synchronized retry storms during outages.
    """
    for attempt in range(max_retries):
        try:
            return operation()
        except Exception as e:
            if not should_retry(e):
                raise  # Don't retry permanent errors

            if attempt == max_retries - 1:
                raise  # Last attempt, give up

            # Exponential backoff: 0.1, 0.2, 0.4, 0.8, 1.6... capped at max_delay
            delay = min(base_delay * (2 ** attempt), max_delay)

            # Full jitter: random value between 0 and delay
            # This spreads retries across time, preventing thundering herd
            jittered_delay = random.uniform(0, delay)

            time.sleep(jittered_delay)
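
Retries and idempotency keys are meant to be used together. A sketch combining the two helpers above (api and redis_cache are assumed to exist, as in the earlier examples):

client = IdempotentClient(redis_cache)
idempotency_key = f"{user_id}:{task_id}:book_flight:{flight_id}"

result = retry_with_backoff(
    # Every attempt reuses the same key, so a retry after a timeout returns
    # the original booking instead of creating a second one
    operation=lambda: client.execute(
        operation=lambda: api.book_flight(flight_id, idempotency_key=idempotency_key),
        idempotency_key=idempotency_key,
    ),
    max_retries=5,
)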

Why Full Jitter?

[Diagram: Jitter prevents thundering herd]
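
A quick illustrative simulation (not from the article) makes the effect concrete: 1,000 clients all fail at once and schedule a retry with a 0.4-second backoff delay.

import random
from collections import Counter

def retry_times(jitter: bool, clients: int = 1000, delay: float = 0.4):
    # Without jitter every client sleeps exactly `delay`; with full jitter
    # each client sleeps a uniform random time in [0, delay]
    return [random.uniform(0, delay) if jitter else delay for _ in range(clients)]

def peak_per_100ms(times):
    # Bucket retries into 100ms windows and report the busiest window
    return max(Counter(int(t * 10) for t in times).values())

print("no jitter:  ", peak_per_100ms(retry_times(False)))  # all 1,000 land together
print("full jitter:", peak_per_100ms(retry_times(True)))   # roughly 250 per window

Without jitter the recovering server absorbs the entire burst in a single instant; with full jitter it sees roughly a quarter of the burst in any 100ms window.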

Cascading Retry Storm

The nightmare scenario:

[Diagram: Cascading retry storm]

Prevention: Circuit Breakers

import time

class CircuitOpenError(Exception):
    """Raised when the circuit is open and the call is short-circuited."""
    pass

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.last_failure_time = None
        self.state = "CLOSED"  # CLOSED, OPEN, HALF_OPEN

    def call(self, operation):
        if self.state == "OPEN":
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = "HALF_OPEN"
            else:
                raise CircuitOpenError("Circuit breaker is open")

        try:
            result = operation()
            if self.state == "HALF_OPEN":
                self.state = "CLOSED"
                self.failure_count = 0
            return result
        except Exception as e:
            self.failure_count += 1
            self.last_failure_time = time.time()

            if self.failure_count >= self.failure_threshold:
                self.state = "OPEN"
            raise

# Usage
payment_breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=30)

try:
    result = payment_breaker.call(lambda: payment_api.charge(amount))
except CircuitOpenError:
    # Don't even try — circuit is open
    return escalate_to_human("Payment service unavailable")
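
The breaker composes with retry_with_backoff from above. One way to wire them, sketched here, is to put the breaker inside the retried operation: while the circuit is open, CircuitOpenError is classified as non-retryable by should_retry, so the agent fails fast instead of queuing more retries.

def guarded_charge(amount):
    return retry_with_backoff(
        # Breaker inside the retry loop: an open circuit raises immediately,
        # and should_retry() treats CircuitOpenError as permanent
        operation=lambda: payment_breaker.call(lambda: payment_api.charge(amount)),
        max_retries=3,
    )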

Framework-Specific Implementation

LangGraph

from langgraph.graph import StateGraph
from langgraph.types import RetryPolicy
from langgraph.checkpoint.postgres import PostgresSaver

# LangGraph has built-in retry support
graph = StateGraph(AgentState)

def call_external_api(state):
    # Idempotency key from state
    key = f"{state['user_id']}:{state['task_id']}:{state['step']}"
    return api.call(idempotency_key=key)

# Configure per-node retry (the kwarg name may differ across langgraph versions)
graph.add_node(
    "call_external_api",
    call_external_api,
    retry=RetryPolicy(max_attempts=3, backoff_factor=2),
)

# Checkpointing enables safe retry from last known state
with PostgresSaver.from_conn_string(DATABASE_URL) as checkpointer:
    checkpointer.setup()  # create checkpoint tables on first use
    app = graph.compile(checkpointer=checkpointer)

Temporal

from datetime import timedelta

from temporalio import activity, workflow
from temporalio.common import RetryPolicy

@activity.defn
async def book_flight(flight_id: str, idempotency_key: str) -> BookingResult:
    """
    Temporal activities have at-least-once execution guarantee.
    Your idempotent implementation provides no-more-than-once business effect.
    Together = effective exactly-once execution.
    """
    return await api.book(flight_id, idempotency_key=idempotency_key)

@workflow.defn
class BookingWorkflow:
    @workflow.run
    async def run(self, request: BookingRequest) -> BookingResult:
        # Temporal handles retries with configurable policy
        return await workflow.execute_activity(
            book_flight,
            args=[request.flight_id, f"{request.user_id}:{request.booking_id}"],
            start_to_close_timeout=timedelta(seconds=60),  # an activity timeout is required; 60s is illustrative
            retry_policy=RetryPolicy(
                initial_interval=timedelta(seconds=1),
                maximum_interval=timedelta(seconds=30),
                backoff_coefficient=2.0,
                maximum_attempts=5,
                non_retryable_error_types=["ValidationError", "AuthError"]
            )
        )
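
Running this needs a client and a worker; a minimal sketch follows (the task queue name and server address are placeholders). Reusing a deterministic workflow ID adds another layer of deduplication, because Temporal will not start a second run with the same ID while the first is still open.

from temporalio.client import Client
from temporalio.worker import Worker

async def main(request: BookingRequest) -> BookingResult:
    client = await Client.connect("localhost:7233")

    worker = Worker(
        client,
        task_queue="booking-tasks",
        workflows=[BookingWorkflow],
        activities=[book_flight],
    )

    async with worker:
        # Stable workflow ID: retrying the "start booking" call cannot
        # spawn a second workflow for the same booking
        return await client.execute_workflow(
            BookingWorkflow.run,
            request,
            id=f"booking-{request.booking_id}",
            task_queue="booking-tasks",
        )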

Common Gotchas

Gotcha              Symptom                               Fix
Timestamp in key    Retries create duplicates             Use stable identifiers only
Key too broad       Different operations collide          Include operation type in key
Key too narrow      Same operation not deduplicated       Include all relevant context
No TTL on cache     Memory leak                           Set 24-48 hour TTL
Caching failures    Retrying failed ops returns failure   Only cache successful results
Retrying 400s       Wasted requests, never succeeds       Classify errors properly
No jitter           Thundering herd on recovery           Always use full jitter

The Idempotency Checklist

Before deploying an agent with external actions:

[Checklist: Idempotency deployment checklist]

Key Takeaways

  1. Idempotency is not optional. Every action with side effects needs a deduplication strategy.

  2. Keys must be stable. If the key changes on retry, it’s not idempotent.

  3. Classify errors. Retry infrastructure errors. Don’t retry business errors.

  4. Always use jitter. Without it, you’ll cause the outage you’re trying to survive.

  5. Circuit breakers prevent cascades. One failing service shouldn’t take down everything.


Next Steps

Now that your actions are idempotent, what happens when your agent crashes mid-task?

Part 2: State Persistence & Checkpointing

Or jump to another topic: