
Durable Execution Frameworks - Don't Reinvent the Wheel

Summary

A deep dive into durable execution frameworks for agents: Temporal, Inngest, Restate, Azure Durable Functions, and AWS Step Functions. Covers when to use each and how they solve agent production challenges.

Prerequisite: This is Part 6 of the Production Agents Deep Dive series. Read Part 1: Idempotency and Part 2: State Persistence first — durable execution frameworks are the productised form of both.

Durable execution: from DIY plumbing to a single decorator

Every problem the prior chapters covered is already solved by Temporal, Inngest, Restate, AWS Step Functions, and Azure Durable Functions. The chapter is about which to pick and what each one costs.

Why This Matters

You’ve read about idempotency, checkpointing, retries, and state persistence. Here’s the secret: all of these problems have been solved before.

Durable execution frameworks handle:

  • State persistence automatically
  • Retries with exponential backoff built-in
  • Checkpointing at every step
  • Exactly-once semantics (your idempotent business effect + their at-least-once delivery)
  • Long-running workflows (hours to weeks)
  • Human-in-the-loop interrupts

If you’re writing your own checkpointing + retry + recovery logic, you’re probably reinventing a durable execution framework — badly.

What other content gets wrong: Temporal blog posts evangelise Temporal. Inngest blog posts evangelise Inngest. Each makes its own framework sound like the inevitable answer. The honest take is that the frameworks below solve the same problems with very different operational cost models, deployment shapes, and language ergonomics. This chapter pins the canonical API for each as of 2026-05.

Takeaway: if you’re hand-writing the retry/checkpoint/idempotency triad, you’re rebuilding a framework that already exists. The only real questions are which one and at what cost.


What Durable Execution Means

A durable execution framework guarantees:

  1. State survives failures: If your process crashes, it resumes from the last step
  2. Effectively-once side effects: steps are delivered at-least-once, so a retried side effect lands once only if the step itself is idempotent
  3. Automatic retries: Transient failures handled without your code knowing
  4. Long-running support: Workflows can pause for days, waiting for human input

TRADITIONAL vs DURABLE EXECUTION

Traditional Code:
Start → Execute → [Crash] → Start over from scratch

Durable Execution:
Start → Execute → [Crash] → Resume from last checkpoint

Takeaway: durable execution is replay + persistence + idempotency, productised. Same primitives as Ch01+Ch02, packaged for you.
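
To make "replay + persistence" concrete, here is a minimal sketch using only the standard library. It is not any framework's implementation: the Journal class, the step names, and the simulated crash are all hypothetical, and a real framework adds leases, versioning, and durable storage on top.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict, List, Tuple


class Journal:
    """Hypothetical file-backed journal: one recorded result per named step."""

    def __init__(self, path: Path) -> None:
        self.path = path
        # Load results recorded by a previous (possibly crashed) run.
        self.results: Dict[str, Any] = (
            json.loads(path.read_text()) if path.exists() else {}
        )

    def step(self, name: str, fn: Callable[[], Any]) -> Any:
        # Replay: a step that already completed returns its recorded result.
        if name in self.results:
            return self.results[name]
        result = fn()
        # Persist before moving on, so a later crash cannot lose this step.
        self.results[name] = result
        self.path.write_text(json.dumps(self.results))
        return result


calls: List[str] = []  # records which step bodies actually executed


def book() -> str:
    calls.append("book")
    return "booking-1"


def charge(fail: bool) -> str:
    calls.append("charge")
    if fail:
        raise RuntimeError("simulated crash")
    return "payment-1"


def run_workflow(journal: Journal, fail_on_charge: bool) -> Tuple[str, str]:
    booking = journal.step("book", book)
    payment = journal.step("charge", lambda: charge(fail_on_charge))
    return booking, payment


journal_path = Path(tempfile.mkdtemp()) / "journal.json"

try:
    run_workflow(Journal(journal_path), fail_on_charge=True)  # first attempt crashes
except RuntimeError:
    pass

# A restarted process builds a new Journal from the same file and resumes.
final = run_workflow(Journal(journal_path), fail_on_charge=False)
```

On the second run the "book" step is served from the journal, while "charge" executes again because it never completed. That second execution is why the charge itself still has to be idempotent.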


Framework Comparison

| Framework | Best For | Deployment | Language Support |
| --- | --- | --- | --- |
| Temporal | Complex workflows, enterprise | Self-hosted or Temporal Cloud | Go, Java, Python, TypeScript |
| Inngest | Event-driven, serverless | Fully managed | TypeScript, Python |
| Restate | Low latency, lightweight | Self-hosted or Restate Cloud | TypeScript, Java, Kotlin, Go |
| Azure Durable Functions | Azure-native | Azure Functions | C#, JavaScript, Python, PowerShell |
| AWS Step Functions | AWS-native, visual | AWS native | JSON state machine (ASL), any via Lambda |
| GCP Cloud Workflows | GCP-native, YAML | GCP native | YAML config |

Takeaway: the cells here are decisions, not slogans. Pick by deployment shape and language stack, then verify the operational model — managed vs self-hosted is a bigger cost driver than feature parity.


Temporal

The de-facto standard for complex, long-running workflows; published production users include Stripe, Snap, Datadog, and Netflix. Code below uses the Python SDK 1.x [temporal-py-docs].

Core Concepts

from datetime import timedelta
from temporalio import activity, workflow
from temporalio.common import RetryPolicy

@activity.defn
async def book_flight(flight_id: str, idempotency_key: str) -> BookingResult:
    """
    Activities are your side effects. Temporal handles retries automatically.
    Your idempotent implementation + Temporal's at-least-once = exactly-once semantics.
    """
    return await flight_api.book(flight_id, idempotency_key=idempotency_key)

@activity.defn
async def charge_payment(amount: float, idempotency_key: str) -> PaymentResult:
    return await payment_api.charge(amount, idempotency_key=idempotency_key)

@activity.defn
async def send_confirmation(email: str, booking: BookingResult) -> None:
    await email_service.send(email, template="booking_confirmation", data=booking)

@workflow.defn
class BookingWorkflow:
    @workflow.run
    async def run(self, request: BookingRequest) -> BookingResult:
        # Each step is automatically checkpointed
        # If we crash after book_flight, we resume at charge_payment

        booking = await workflow.execute_activity(
            book_flight,
            args=[request.flight_id, f"{request.user_id}:{request.booking_id}:book"],
            # Every activity call needs a start_to_close or schedule_to_close timeout.
            start_to_close_timeout=timedelta(seconds=30),
            retry_policy=RetryPolicy(
                initial_interval=timedelta(seconds=1),
                maximum_interval=timedelta(seconds=30),
                backoff_coefficient=2.0,
                maximum_attempts=5,
                non_retryable_error_types=["ValidationError", "AuthError"]
            )
        )

        payment = await workflow.execute_activity(
            charge_payment,
            args=[booking.total_amount, f"{request.user_id}:{request.booking_id}:pay"],
            start_to_close_timeout=timedelta(seconds=30),
            retry_policy=RetryPolicy(maximum_attempts=3)
        )

        # Send the confirmation email. This call is awaited, so the workflow
        # finishes only after the email activity completes or exhausts its retries.
        await workflow.execute_activity(
            send_confirmation,
            args=[request.email, booking],
            start_to_close_timeout=timedelta(minutes=5)
        )

        return booking
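
The RetryPolicy fields above describe a capped exponential schedule. As a rough sketch of the arithmetic (assuming no jitter, and that the wait before retry k is initial_interval * backoff_coefficient**k capped at maximum_interval), the delays can be computed by hand. The retry_delays helper is hypothetical and is not part of the Temporal SDK.

```python
from datetime import timedelta
from typing import List


def retry_delays(
    initial: timedelta,
    coefficient: float,
    maximum: timedelta,
    max_attempts: int,
) -> List[timedelta]:
    """Waits between attempts: retry k (0-based) waits initial * coefficient**k, capped."""
    delays = []
    for retry in range(max_attempts - 1):  # no wait after the final attempt
        delay = initial * (coefficient ** retry)
        delays.append(min(delay, maximum))
    return delays


# Same numbers as the book_flight policy: 1s initial, x2 backoff, 30s cap, 5 attempts.
delays = retry_delays(timedelta(seconds=1), 2.0, timedelta(seconds=30), 5)

# With more attempts the cap starts to apply.
capped = retry_delays(timedelta(seconds=1), 2.0, timedelta(seconds=30), 8)
```

Under these assumptions, five attempts spend 15 seconds waiting in total (1 + 2 + 4 + 8), which is worth comparing against the activity timeout and the caller's patience before raising maximum_attempts.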

Human-in-the-Loop with Signals

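from datetime import timedelta
from typing import Optional

from temporalio import workflow

# ApprovalRequest, ApprovalResult, analyze_request and complete_request
# are assumed to be defined elsewhere in the application.

@workflow.defn
class ApprovalWorkflow:
    def __init__(self) -> None:
        self.approved: Optional[bool] = None
        self.reason: str = ""

    @workflow.signal
    async def approve(self, approved: bool, reason: str) -> None:
        """Human sends this signal to approve/reject"""
        self.approved = approved
        self.reason = reason

    @workflow.run
    async def run(self, request: ApprovalRequest) -> ApprovalResult:
        # Execute some work (every activity call needs a timeout)
        analysis = await workflow.execute_activity(
            analyze_request,
            args=[request],
            start_to_close_timeout=timedelta(minutes=5),
        )

        if analysis.needs_approval:
            # Wait for human signal (can wait days)
            await workflow.wait_condition(lambda: self.approved is not None)

            if not self.approved:
                return ApprovalResult(status="rejected", reason=self.reason)

        # Continue with approved workflow
        return await workflow.execute_activity(
            complete_request,
            args=[request],
            start_to_close_timeout=timedelta(minutes=5),
        )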
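
An unbounded wait is rarely what you want in production. The sketch below shows the general "race a timer against the decision" shape in plain asyncio rather than Temporal code. ApprovalGate, its method names, and the "escalated" outcome are hypothetical. In a real Temporal workflow you would look at the timeout argument of workflow.wait_condition instead.

```python
import asyncio
from typing import Optional, Tuple


class ApprovalGate:
    """Hypothetical stand-in for a workflow waiting on a human decision."""

    def __init__(self) -> None:
        self._decided = asyncio.Event()
        self.approved: Optional[bool] = None

    def signal(self, approved: bool) -> None:
        # Called by the external system (the "human") to deliver a decision.
        self.approved = approved
        self._decided.set()

    async def wait(self, timeout_seconds: float) -> str:
        try:
            # Race the decision against a timer instead of waiting forever.
            await asyncio.wait_for(self._decided.wait(), timeout=timeout_seconds)
        except asyncio.TimeoutError:
            return "escalated"  # nobody answered in time
        return "approved" if self.approved else "rejected"


async def demo() -> Tuple[str, str]:
    answered = ApprovalGate()
    # Deliver the approval shortly after the wait starts.
    asyncio.get_running_loop().call_later(0.01, answered.signal, True)
    first = await answered.wait(timeout_seconds=1.0)

    ignored = ApprovalGate()  # no signal is ever sent
    second = await ignored.wait(timeout_seconds=0.05)
    return first, second


outcomes = asyncio.run(demo())
```

Whether a timeout should escalate, reject, or retry is a business decision. The point of the sketch is only that the wait has an explicit upper bound.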

When to Use Temporal

  • Complex multi-step workflows
  • Long-running processes (hours to weeks)
  • Enterprise requirements (audit trails, compliance)
  • Need strong consistency guarantees
  • Already have infrastructure team capacity (Temporal Server is non-trivial to self-host; Temporal Cloud removes that load at a higher per-action cost)

Takeaway: Temporal’s strength is the workflow-as-code abstraction with replay determinism; its cost is operational weight. Use it when you’d otherwise spend months building the same thing.


Inngest

Event-driven, fully managed, TypeScript-first.

Core Concepts

import { Inngest } from "inngest";

const inngest = new Inngest({ id: "my-agent" });

export const agentWorkflow = inngest.createFunction(
  {
    id: "process-customer-request",
    retries: 5, // Built-in retry
  },
  { event: "customer/request.received" },
  async ({ event, step }) => {
    // Each step is automatically checkpointed
    // If we crash after classify, we resume at route

    const classification = await step.run("classify", async () => {
      return await llm.classify(event.data.message);
    });

    const route = await step.run("route", async () => {
      if (classification.confidence < 0.7) {
        return "human";
      }
      return classification.intent;
    });

    if (route === "human") {
      // Wait for human response (can wait up to the configured timeout).
      // step.waitForEvent(stepId, options) — Inngest SDK v4+.
      const humanResponse = await step.waitForEvent("wait-for-human", {
        event: "human/responded",
        match: "data.request_id",
        timeout: "7d",
      });
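      // waitForEvent resolves to null when the timeout elapses first.
      if (humanResponse === null) {
        return { status: "timed_out" };
      }
      return humanResponse;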
    }

    // Continue with automated handling
    const result = await step.run("execute", async () => {
      return await agent.execute(classification.intent, event.data);
    });

    return result;
  }
);

Token Budget with Inngest

export const budgetedAgent = inngest.createFunction(
  { id: "budgeted-agent" },
  { event: "agent/task.started" },
  async ({ event, step }) => {
    // The handler re-runs from the top for each new step. Completed step
    // callbacks are skipped and their stored results returned, so a counter
    // mutated inside a callback is lost. Rebuild it from step results instead.
    let tokensUsed = 0;
    const maxTokens = 100000;

    const plan = await step.run("plan", async () => {
      return await llm.plan(event.data.task);
    });
    tokensUsed += plan.usage.total_tokens;

    for (const [index, action] of plan.actions.entries()) {
      if (tokensUsed >= maxTokens) {
        // Graceful shutdown within budget
        return {
          status: "budget_exceeded",
          completed: index,
        };
      }

      const result = await step.run(`execute-${action.id}`, async () => {
        return await agent.execute(action);
      });
      tokensUsed += result.usage?.total_tokens ?? 0;
    }

    return { status: "completed", tokensUsed };
  }
);
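
One subtlety of memoised steps deserves its own illustration: when a handler is re-invoked, the bodies of completed steps do not run again, so any counter updated inside a step body silently resets. The sketch below models that behaviour with a hypothetical StepRunner. It is not the Inngest SDK, and fake_llm_call is a placeholder for a real model call.

```python
from typing import Any, Callable, Dict


class StepRunner:
    """Hypothetical memoising runner: completed step bodies are skipped on replay."""

    def __init__(self, memo: Dict[str, Any]) -> None:
        self.memo = memo

    def run(self, step_id: str, body: Callable[[], Any]) -> Any:
        if step_id not in self.memo:
            self.memo[step_id] = body()
        return self.memo[step_id]


def fake_llm_call() -> Dict[str, int]:
    return {"total_tokens": 1200}


def handler_mutating_inside(step: StepRunner) -> int:
    tokens_used = 0

    def body() -> Dict[str, int]:
        nonlocal tokens_used
        result = fake_llm_call()
        tokens_used += result["total_tokens"]  # lost whenever the body is skipped
        return result

    step.run("plan", body)
    return tokens_used


def handler_accumulating_outside(step: StepRunner) -> int:
    tokens_used = 0
    result = step.run("plan", fake_llm_call)
    tokens_used += result["total_tokens"]  # recomputed from the memoised result
    return tokens_used


memo_a: Dict[str, Any] = {}
first_a = handler_mutating_inside(StepRunner(memo_a))
replay_a = handler_mutating_inside(StepRunner(memo_a))  # re-invocation after "plan" completed

memo_b: Dict[str, Any] = {}
first_b = handler_accumulating_outside(StepRunner(memo_b))
replay_b = handler_accumulating_outside(StepRunner(memo_b))
```

The first handler reports 1200 tokens on the initial run and 0 on replay. The second reports 1200 both times because the budget is derived from values the runner persisted.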

When to Use Inngest

  • Event-driven architectures
  • Serverless deployments
  • Want fully managed infrastructure
  • TypeScript/Node.js primary stack
  • Need fast iteration speed

step.waitForEvent is part of the Inngest TypeScript SDK v4+ [inngest-docs] — the match parameter is optional, timeout is required.

Takeaway: Inngest is the fastest path from prototype to durable, event-driven agent — at the cost of a managed-only deployment story and a smaller language footprint than Temporal.


Restate

Single-binary self-host, Rust runtime, low-latency journaling.

Core Concepts

// @restatedev/restate-sdk 1.x
import * as restate from "@restatedev/restate-sdk";

const agentService = restate.service({
  name: "agent",
  handlers: {
    processRequest: async (ctx: restate.Context, request: AgentRequest) => {
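      // Each ctx.run() result is journaled; on retry, completed steps are
      // replayed from the journal instead of being executed again.
      // A step that crashed mid-flight can still run twice, so keep it idempotent.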

      const classification = await ctx.run("classify", async () => {
        return await llm.classify(request.message);
      });

      if (classification.needs_approval) {
        // Await human approval (durable promise)
        // ctx.awakeable() returns { id, promise } synchronously; only the promise is awaited.
        const approval = ctx.awakeable<ApprovalResult>();

        // This ID can be used by external system to complete the awakeable
        console.log(`Awaiting approval: ${approval.id}`);

        const result = await approval.promise;
        if (!result.approved) {
          return { status: "rejected" };
        }
      }

      const result = await ctx.run("execute", async () => {
        return await agent.execute(classification.intent, request);
      });

      return result;
    },
  },
});

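// Complete the awakeable from an external system (e.g., a webhook handler).
// The ingress client lives in the separate @restatedev/restate-sdk-clients package.
import * as clients from "@restatedev/restate-sdk-clients";

async function approveRequest(awakeableId: string, approved: boolean) {
  const ingress = clients.connect({ url: "http://localhost:8080" });
  await ingress.resolveAwakeable(awakeableId, { approved });
}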

Virtual Objects for Stateful Agents

const agentSession = restate.object({
  name: "agent-session",
  handlers: {
    // State is automatically persisted per session ID
    addMessage: async (ctx: restate.ObjectContext, message: Message) => {
      const history = (await ctx.get<Message[]>("history")) || [];
      history.push(message);
      ctx.set("history", history);

      const response = await ctx.run("generate", async () => {
        return await llm.chat(history);
      });

      history.push({ role: "assistant", content: response });
      ctx.set("history", history);

      return response;
    },

    getHistory: async (ctx: restate.ObjectContext) => {
      return (await ctx.get<Message[]>("history")) || [];
    },
  },
});

When to Use Restate

  • Low-latency requirements
  • Lightweight deployment (single binary)
  • Strong consistency without heavy infrastructure
  • TypeScript or JVM stack
  • Want to self-host easily

The TypeScript SDK exposes ctx.run, ctx.awakeable<T>(), and restate.object virtual objects with per-key state [restate-docs]. Restate’s “deterministic helpers” — ctx.rand.uuidv4(), ctx.date.now() — give you replay-safe randomness and time without separate libraries.
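
Why deterministic helpers matter is easiest to see in a toy model. The sketch below is plain Python, not the Restate SDK: ReplayContext and its uuid4/now methods are hypothetical names that journal each non-deterministic value on first execution and hand back the same value on replay.

```python
import time
import uuid
from typing import Any, Callable, Dict, List, Optional


class ReplayContext:
    """Hypothetical context that journals non-deterministic values for replay."""

    def __init__(self, journal: Optional[List[Any]] = None) -> None:
        self.journal: List[Any] = list(journal or [])
        self._cursor = 0

    def _record(self, produce: Callable[[], Any]) -> Any:
        if self._cursor < len(self.journal):
            value = self.journal[self._cursor]  # replay: reuse the recorded value
        else:
            value = produce()  # first execution: generate and record
            self.journal.append(value)
        self._cursor += 1
        return value

    def uuid4(self) -> str:
        return self._record(lambda: str(uuid.uuid4()))

    def now(self) -> float:
        return self._record(time.time)


def handler(ctx: ReplayContext) -> Dict[str, Any]:
    # All randomness and time flows through the context, never uuid/time directly.
    return {"request_id": ctx.uuid4(), "started_at": ctx.now()}


first_run = ReplayContext()
original = handler(first_run)

# A replay starts from the journal the first run produced.
replayed = handler(ReplayContext(first_run.journal))
```

Calling uuid.uuid4() or time.time() directly inside the handler would produce fresh values on every replay, which is exactly the divergence the framework-provided helpers exist to prevent.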

Takeaway: Restate is the lighter of the two self-hostable options covered here (Temporal is the other); the price is a smaller ecosystem and a newer codebase. Use it when single-binary deployment matters more than vendor maturity.


Azure Durable Functions

Native Azure integration. Great for Azure-first shops [azure-durable-events].

Core Concepts

import azure.functions as func
import azure.durable_functions as df

# Orchestrator function — registered via df.Orchestrator.create wrapper
def orchestrator_function(context: df.DurableOrchestrationContext):
    request = context.get_input()

    # Each activity is checkpointed.
    classification = yield context.call_activity("classify_request", request)

    if classification["needs_approval"]:
        # Wait for external event (human approval). Returns the event payload.
        approval = yield context.wait_for_external_event("approval")

        if not approval["approved"]:
            return {"status": "rejected"}

    result = yield context.call_activity("execute_agent_action", classification)
    return result

main = df.Orchestrator.create(orchestrator_function)

# Activity functions — separate Function App entry points
def classify_request(request: dict) -> dict:
    return llm.classify(request["message"])

def execute_agent_action(classification: dict) -> dict:
    return agent.execute(classification["intent"])

Fan-out/Fan-in Pattern

import azure.functions as func
import azure.durable_functions as df

def parallel_research(context: df.DurableOrchestrationContext):
    queries = context.get_input()["queries"]

    # Fan out: run research tasks in parallel
    tasks = [context.call_activity("research_query", q) for q in queries]

    # Fan in: wait for all to complete
    results = yield context.task_all(tasks)

    # Synthesize results
    synthesis = yield context.call_activity("synthesize_results", results)

    return synthesis

main = df.Orchestrator.create(parallel_research)
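
For readers who have not met fan-out/fan-in before, the same shape in plain asyncio may help. This is a non-durable sketch, not Durable Functions code: research_query and synthesize are hypothetical stand-ins for the activities, and nothing here is checkpointed.

```python
import asyncio
from typing import List


async def research_query(query: str) -> str:
    # Hypothetical activity: a real one would call a search API or an LLM.
    await asyncio.sleep(0.01)
    return f"findings for {query}"


async def synthesize(results: List[str]) -> str:
    return " | ".join(results)


async def parallel_research(queries: List[str]) -> str:
    # Fan out: create one coroutine per query so they run concurrently.
    tasks = [research_query(q) for q in queries]
    # Fan in: gather returns results in input order, whatever the completion order.
    results = await asyncio.gather(*tasks)
    return await synthesize(list(results))


summary = asyncio.run(parallel_research(["pricing", "latency"]))
```

The durable version differs in one important way: each completed task is recorded, so a crash during the fan-in does not repeat the queries that already finished.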

When to Use Azure Durable Functions

  • Already on Azure
  • .NET or Python primary stack
  • Want serverless with durable state
  • Need tight Azure service integration
  • Cost optimization via consumption pricing

wait_for_external_event provides at-least-once delivery for human-approval events; include a unique event ID for manual deduplication, especially on the Azure Storage backend [azure-durable-events]. The Consumption Plan does not bill while an orchestrator is awaiting an external event.
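
Manual deduplication can be as small as a set of seen event IDs. The sketch below is a hypothetical illustration in plain Python. ApprovalEvent and ApprovalInbox are not part of the Durable Functions SDK, and in a real orchestrator the seen set would have to live in durable state rather than in memory.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass(frozen=True)
class ApprovalEvent:
    event_id: str  # unique per human decision; assumed to be set by the sender
    approved: bool


@dataclass
class ApprovalInbox:
    """Hypothetical dedup layer in front of approval handling."""

    seen: Set[str] = field(default_factory=set)
    applied: List[ApprovalEvent] = field(default_factory=list)

    def deliver(self, event: ApprovalEvent) -> bool:
        # At-least-once delivery means the same event can arrive more than once.
        if event.event_id in self.seen:
            return False  # duplicate: ignore it
        self.seen.add(event.event_id)
        self.applied.append(event)
        return True


inbox = ApprovalInbox()
decision = ApprovalEvent(event_id="approval-42", approved=True)

# The same decision is delivered twice; only the first delivery is applied.
results = [inbox.deliver(decision), inbox.deliver(decision)]
```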

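Takeaway: Durable Functions is the right choice if you’re already paying for Azure. The Python SDK has two programming models: the v1 model shown here (function.json plus the df.Orchestrator.create(fn) wrapper) and the v2 decorator model (@app.orchestration_trigger on a df.DFApp). Tutorials mix the two, so check which model a sample targets before copying it.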


AWS Step Functions

Visual workflows, tight AWS integration. Best for teams who like declarative state machines [aws-stepfn-callback].

Core Concepts

{
  "Comment": "Agent workflow with human approval",
  "StartAt": "ClassifyRequest",
  "States": {
    "ClassifyRequest": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123:function:classify",
      "Next": "NeedsApproval"
    },
    "NeedsApproval": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.needsApproval",
          "BooleanEquals": true,
          "Next": "WaitForApproval"
        }
      ],
      "Default": "ExecuteAction"
    },
    "WaitForApproval": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
      "Parameters": {
        "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123/approvals",
        "MessageBody": {
          "TaskToken.$": "$$.Task.Token",
          "Request.$": "$"
        }
      },
      "TimeoutSeconds": 604800,
      "ResultPath": "$.approval",
      "Next": "CheckApproval"
    },
    "CheckApproval": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.approval.approved",
          "BooleanEquals": false,
          "Next": "Rejected"
        }
      ],
      "Default": "ExecuteAction"
    },
    "ExecuteAction": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123:function:execute",
      "End": true
    },
    "Rejected": {
      "Type": "Fail",
      "Error": "ApprovalRejected",
      "Cause": "Human rejected the action"
    }
  }
}

When to Use Step Functions

  • Already on AWS
  • Visual workflow design preferred
  • Need built-in AWS service integrations
  • Long-running workflows with wait states (Standard Workflows support up to one-year execution lifetimes [aws-stepfn-callback])
  • Audit and compliance requirements

The .waitForTaskToken integration pattern + $$.Task.Token context-object accessor is the canonical human-in-the-loop mechanism — the external system completes the task via SendTaskSuccess / SendTaskFailure. Configure HeartbeatSeconds to avoid indefinite waits.
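
The task-token handshake is easier to reason about once you see it modelled in a few lines. The sketch below is an in-memory analogy in plain asyncio, not the AWS SDK: CallbackRegistry and its methods are hypothetical names that mirror the roles of the task token, SendTaskSuccess, and SendTaskFailure.

```python
import asyncio
import uuid
from typing import Any, Dict


class CallbackRegistry:
    """Hypothetical in-memory model of the task-token handshake."""

    def __init__(self) -> None:
        self._pending: Dict[str, asyncio.Future] = {}

    def issue_token(self) -> str:
        # The token identifies one paused task; it travels with the outbound message.
        token = uuid.uuid4().hex
        self._pending[token] = asyncio.get_running_loop().create_future()
        return token

    async def wait(self, token: str, timeout_seconds: float) -> Any:
        # The paused task resumes when the token is completed, or fails on timeout.
        try:
            return await asyncio.wait_for(self._pending[token], timeout_seconds)
        finally:
            self._pending.pop(token, None)

    def send_task_success(self, token: str, output: Any) -> None:
        self._pending[token].set_result(output)

    def send_task_failure(self, token: str, error: str) -> None:
        self._pending[token].set_exception(RuntimeError(error))


async def demo() -> Any:
    registry = CallbackRegistry()
    token = registry.issue_token()  # would be embedded in the queue message body
    # The approver's system calls back with the token it received.
    asyncio.get_running_loop().call_later(
        0.01, registry.send_task_success, token, {"approved": True}
    )
    return await registry.wait(token, timeout_seconds=1.0)


result = asyncio.run(demo())
```

The timeout on wait plays the role that TimeoutSeconds and HeartbeatSeconds play on the real service: without it, a lost token means a task that never resumes.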

Takeaway: Step Functions is the right answer when your stack is AWS-native and the diagram-as-workflow model is a feature, not a constraint.


Decision Framework

| If You Need… | Use |
| --- | --- |
| Complex multi-step workflows with strong guarantees | Temporal |
| Fast iteration, event-driven, serverless | Inngest |
| Low latency, lightweight, self-hosted | Restate |
| Azure-native, serverless, cost optimization | Azure Durable Functions |
| AWS-native, visual workflows, service integrations | AWS Step Functions |
| GCP-native, simple YAML workflows | GCP Cloud Workflows |

The “Build vs Buy” Decision Tree

BUILD vs BUY DECISION

Are you writing retry logic with exponential backoff?
  → Consider a durable execution framework

Are you implementing checkpointing to survive crashes?
  → Consider a durable execution framework

Are you building idempotency key management?
  → Consider a durable execution framework

Are you handling human-in-the-loop with long waits?
  → Consider a durable execution framework

If yes to 2+ of these, you're reinventing the wheel.

Takeaway: the decision table is the chapter’s most-reused artefact. The “build vs buy” question almost always points to buy — unless your operational model can’t host the chosen framework’s deployment shape.


What to Copy, What to Skip

| Pattern | Copy when | Skip when |
| --- | --- | --- |
| Temporal workflow-as-code | Multi-step workflows, replay-determinism matters, team can run Temporal Server or pay for Cloud | Single-step event handlers (overkill) |
| Inngest step.waitForEvent | Event-driven serverless agents, TypeScript primary stack, want managed-everything | Need self-hosted control plane or non-TS languages outside their support list |
| Restate ctx.awakeable | Lightweight durable HITL, single-binary self-host, low-latency needs | Established ops team already on Temporal Cloud or AWS: adoption cost outweighs gain |
| Azure Durable Functions wait_for_external_event | Azure-native shop, want serverless billing on idle, Python or .NET primary | Multi-cloud architecture: couples you to Azure orchestration |
| AWS Step Functions .waitForTaskToken | AWS-native shop, visual state-machine design preferred, audit-trail requirements | Heavy compute logic: ASL is restrictive for arbitrary code; defer to Lambda |
| DIY checkpointer + retry loop | Truly esoteric workloads where no framework fits | Anything resembling Ch01–Ch02 patterns: you’re reinventing |
| Same idempotency-key derivation across the workflow [pa-idempotency] | Every framework: the key is yours, not the framework’s | Never skip: non-idempotent activities + at-least-once delivery = double-charged customers |
| Cross-framework portability layer | Multi-cloud or multi-framework teams: abstract the workflow API behind your own interface | Single-framework shops: premature abstraction |

Takeaway: every pattern has a “skip when”. Frameworks are not interchangeable; the wrong one is more painful than building from scratch was supposed to be.


Common Gotchas

| Gotcha | Symptom | Fix |
| --- | --- | --- |
| Non-deterministic code inside workflow | Replay fails mysteriously; activities run twice or with wrong values | Move all non-deterministic ops (HTTP, random, time) into activities (Temporal) or ctx.run blocks (Restate); use framework-provided deterministic helpers (ctx.rand, ctx.date) |
| At-least-once without idempotency keys | Customer charged 3x when the worker retries a side-effecting activity | Every external-side-effect activity takes an idempotency_key argument derived from stable workflow inputs (Ch01 pattern) |
| Indefinite waitForExternalEvent / waitForTaskToken | Workflow stuck for the full one-year service quota when the human never responds | Race a durable timer against the event; on timeout, escalate or reject |
| Forgetting HeartbeatSeconds on AWS callback tasks | Task waits up to one year even when the external system is dead | Set HeartbeatSeconds and call SendTaskHeartbeat periodically from the external worker |
| Mixing the two Azure Python programming models (v1 df.Orchestrator.create wrapper with function.json, v2 @app.orchestration_trigger decorators) | Function App doesn’t recognise the orchestrator on deploy | Pick one model per Function App; the examples in this chapter use the v1 main = df.Orchestrator.create(orchestrator_function) wrapper |
| Mixing framework versions across services | Workflow replay corrupts when worker SDK version diverges from history schema | Pin SDK versions per service; gate workflow upgrades behind a version-aware routing strategy (Temporal workflow.patched() or Worker Versioning, Inngest function versions) |

Takeaway: every gotcha here is a non-determinism or version-skew bug in disguise. Frameworks reward strict discipline at workflow boundaries; they punish “just one quick time.time() call” mercilessly.
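
Since the idempotency key is the one piece every framework leaves to you, here is a minimal sketch of deriving it from stable workflow inputs. The idempotency_key helper, the FakePaymentAPI class, and the identifiers used are all hypothetical. The only assumption about the real provider is that it treats a repeated key as a repeat of the original request.

```python
import hashlib
from typing import Dict


def idempotency_key(workflow_id: str, step: str, *stable_inputs: str) -> str:
    """Derive a key from values that are identical on every retry and replay."""
    # A separator that cannot appear in normal IDs avoids accidental collisions.
    material = "\x1f".join((workflow_id, step, *stable_inputs))
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


class FakePaymentAPI:
    """Hypothetical provider that honours idempotency keys."""

    def __init__(self) -> None:
        self.charges: Dict[str, float] = {}

    def charge(self, amount: float, key: str) -> str:
        # A repeated key returns the original charge instead of creating a new one.
        if key not in self.charges:
            self.charges[key] = amount
        return f"charge:{key[:8]}"


api = FakePaymentAPI()
key = idempotency_key("booking-wf-123", "pay", "user-7", "order-9")

# Simulate the worker retrying the same activity three times.
receipts = {api.charge(49.99, key) for _ in range(3)}
```

Never mix timestamps, random values, or attempt counters into the key: anything that changes between retries turns one logical charge into several.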


Migration Path

If you have existing agent code, here’s how to migrate:

1. Identify Side Effects

# Before: Side effects scattered in code
def process_request(request):
    result = None  # stays None when no action is needed
    classification = llm.classify(request)  # LLM call
    if classification.needs_action:
        result = api.execute(classification)  # External API
        email.send(request.user, result)      # Email
    return result

2. Extract as Activities

# After: Side effects are activities
from datetime import timedelta
from temporalio import activity, workflow

@activity.defn
async def classify(request):
    return llm.classify(request)

@activity.defn
async def execute_action(classification):
    return api.execute(classification)

@activity.defn
async def send_email(user, result):
    email.send(user, result)

@workflow.defn
class RequestWorkflow:
    @workflow.run
    async def run(self, request):
        timeout = timedelta(minutes=5)  # every activity call needs a timeout
        result = None  # stays None when no action is needed
        classification = await workflow.execute_activity(
            classify, args=[request], start_to_close_timeout=timeout)
        if classification.needs_action:
            result = await workflow.execute_activity(
                execute_action, args=[classification], start_to_close_timeout=timeout)
            await workflow.execute_activity(
                send_email, args=[request.user, result], start_to_close_timeout=timeout)
        return result

Takeaway: migration is mechanical — name your side effects, wrap them as activities, compose them in a workflow. The patterns from Ch01 [pa-idempotency] (idempotency keys) and Ch02 [pa-state] (checkpoint timing) carry over unchanged; the framework handles the rest.


Closing

Durable execution productises everything in the prior chapters: retries become RetryPolicy, checkpoints become free at every step, human-in-the-loop becomes a signal, and at-least-once delivery + your idempotency key becomes effectively-once business effect. The five frameworks differ in operational shape, language ergonomics, and ecosystem maturity — not in what they solve.

Testable question: for your top three external API calls, can you name which framework primitive replaces the retry/checkpoint/idempotency code you’d otherwise hand-write? If not, the framework choice is theoretical.

Part 7: Security & Sandboxing — securing agents that execute code and call external APIs.


References
