Durable Execution Frameworks - Don't Reinvent the Wheel

Prerequisite: This is Part 6 of the Production Agents Deep Dive series. Read Part 1: Idempotency and Part 2: State Persistence first — durable execution frameworks are the productised form of both.

Figure: Durable execution, from DIY plumbing to a single decorator. Left half: an in-house DIY stack with hand-written retry loops, a custom checkpointer, ad-hoc idempotency keys, and a manual human-in-the-loop queue, with months of debugging shown as red stack-trace lines. Right half: the same capabilities provided by a durable execution framework (retry policies, automatic checkpoints, idempotency-by-replay, signal-based HITL), shipped behind a single workflow decorator. A divider in the middle reads 'roll your own → use a framework.'

Why This Matters

You’ve read about idempotency, checkpointing, retries, and state persistence. Here’s the secret: all of these problems have been solved before.

Durable execution frameworks handle:

State persistence automatically
Retries with exponential backoff built-in
Checkpointing at every step
Exactly-once semantics (your idempotent business effect + their at-least-once delivery)
Long-running workflows (hours to weeks)
Human-in-the-loop interrupts

If you’re writing your own checkpointing + retry + recovery logic, you’re probably reinventing a durable execution framework — badly.

What other content gets wrong: Temporal blog posts evangelise Temporal. Inngest blog posts evangelise Inngest. Each makes its framework sound like the inevitable answer. The honest take is that the five frameworks covered in depth below solve the same core problems with very different operational cost models, deployment shapes, and language ergonomics (GCP Cloud Workflows appears in the comparison tables only). This chapter pins the canonical API for each as of 2026-05.

Takeaway: if you’re hand-writing the retry/checkpoint/idempotency triad, you’re rebuilding a framework that already exists. The only real questions are which one and at what cost.

What Durable Execution Means

A durable execution framework guarantees:

State survives failures: If your process crashes, it resumes from the last step
Effectively-once side effects: steps are delivered at-least-once, so a retried side effect lands once only when the step itself is idempotent
Automatic retries: Transient failures handled without your code knowing
Long-running support: Workflows can pause for days, waiting for human input

TRADITIONAL vs DURABLE EXECUTION

Traditional Code:
Start → Execute → [Crash] → Start over from scratch

Durable Execution:
Start → Execute → [Crash] → Resume from last checkpoint

Takeaway: durable execution is replay + persistence + idempotency, productised. Same primitives as Ch01+Ch02, packaged for you.
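
To make "replay + persistence" concrete, here is a minimal sketch of a journaled step runner in plain Python. It is a toy model, not any framework's implementation: the Journal class, the JSON file, and the booking_flow function are all hypothetical names invented for illustration.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict, List


class Journal:
    """Persists each completed step's result to a JSON file, keyed by step name."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.entries: Dict[str, Any] = (
            json.loads(path.read_text()) if path.exists() else {}
        )

    def step(self, name: str, fn: Callable[[], Any]) -> Any:
        # Replay: a step that already completed returns its recorded result
        # and does not run again.
        if name in self.entries:
            return self.entries[name]
        result = fn()
        # Persist before moving on, so a later crash resumes after this step.
        self.entries[name] = result
        self.path.write_text(json.dumps(self.entries))
        return result


def booking_flow(journal: Journal, calls: List[str], crash_before_charge: bool) -> str:
    def book() -> str:
        calls.append("book")
        return "BK-1"

    booking = journal.step("book", book)
    if crash_before_charge:
        raise RuntimeError("simulated crash")

    def charge() -> str:
        calls.append("charge")
        return f"paid:{booking}"

    return journal.step("charge", charge)
```

Run the flow once with crash_before_charge=True, then again with a fresh Journal on the same file: the second run skips "book" and goes straight to "charge". Note the gap the sketch leaves open: a crash between fn() and the write re-runs the step, which is exactly why the step still needs an idempotency key.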

Framework Comparison

Framework	Best For	Deployment	Language Support
Temporal	Complex workflows, enterprise	Self-hosted or Temporal Cloud	Go, Java, Python, TypeScript, .NET, PHP
Inngest	Event-driven, serverless	Fully managed	TypeScript, Python
Restate	Low latency, lightweight	Self-hosted or Restate Cloud	TypeScript, Java, Kotlin, Go, Python, Rust
Azure Durable Functions	Azure-native	Azure Functions	C#, JavaScript, Python, PowerShell
AWS Step Functions	AWS-native, visual	AWS native	JSON state machine (ASL), any via Lambda
GCP Cloud Workflows	GCP-native, YAML	GCP native	YAML config

Takeaway: the cells here are decisions, not slogans. Pick by deployment shape and language stack, then verify the operational model — managed vs self-hosted is a bigger cost driver than feature parity.

Temporal

The de-facto standard for complex, long-running workflows; published production users include Stripe, Snap, Datadog, and Netflix. Code below uses the Python SDK 1.x [temporal-py-docs].

Core Concepts

from datetime import timedelta
from temporalio import activity, workflow
from temporalio.common import RetryPolicy

# BookingRequest, BookingResult, PaymentResult and the flight_api, payment_api,
# and email_service clients are application code, assumed to be defined elsewhere.

@activity.defn
async def book_flight(flight_id: str, idempotency_key: str) -> BookingResult:
    """
    Activities are your side effects. Temporal handles retries automatically.
    Your idempotent implementation + Temporal's at-least-once execution
    = effectively-once business effect.
    """
    return await flight_api.book(flight_id, idempotency_key=idempotency_key)

@activity.defn
async def charge_payment(amount: float, idempotency_key: str) -> PaymentResult:
    return await payment_api.charge(amount, idempotency_key=idempotency_key)

@activity.defn
async def send_confirmation(email: str, booking: BookingResult) -> None:
    await email_service.send(email, template="booking_confirmation", data=booking)

@workflow.defn
class BookingWorkflow:
    @workflow.run
    async def run(self, request: BookingRequest) -> BookingResult:
        # Each completed activity is recorded in the workflow history.
        # If we crash after book_flight, replay resumes at charge_payment.

        booking = await workflow.execute_activity(
            book_flight,
            args=[request.flight_id, f"{request.user_id}:{request.booking_id}:book"],
            # Every activity call needs start_to_close_timeout or
            # schedule_to_close_timeout; the SDK rejects calls without one.
            start_to_close_timeout=timedelta(seconds=30),
            retry_policy=RetryPolicy(
                initial_interval=timedelta(seconds=1),
                maximum_interval=timedelta(seconds=30),
                backoff_coefficient=2.0,
                maximum_attempts=5,
                non_retryable_error_types=["ValidationError", "AuthError"]
            )
        )

        payment = await workflow.execute_activity(
            charge_payment,
            args=[booking.total_amount, f"{request.user_id}:{request.booking_id}:pay"],
            start_to_close_timeout=timedelta(seconds=30),
            retry_policy=RetryPolicy(maximum_attempts=3)
        )

        # Awaited like the other steps: the booking is returned once the
        # confirmation email activity has succeeded.
        await workflow.execute_activity(
            send_confirmation,
            args=[request.email, booking],
            start_to_close_timeout=timedelta(minutes=5)
        )

        return booking
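
It helps to see what that RetryPolicy actually means in wall-clock terms. The sketch below computes an idealised backoff schedule (initial interval multiplied by the coefficient on each retry, capped at the maximum interval). The retry_delays helper is hypothetical, not part of the Temporal SDK, and it ignores how long each attempt itself runs.

```python
from datetime import timedelta
from typing import List


def retry_delays(
    initial: timedelta,
    coefficient: float,
    maximum: timedelta,
    max_attempts: int,
) -> List[timedelta]:
    """Waits between attempts: max_attempts total tries means max_attempts - 1 waits."""
    delays: List[timedelta] = []
    delay = initial
    for _ in range(max_attempts - 1):
        delays.append(min(delay, maximum))  # cap each wait at the maximum interval
        delay = delay * coefficient
    return delays


# Same numbers as the book_flight policy: 1s initial, x2 backoff, 30s cap, 5 attempts.
schedule = retry_delays(timedelta(seconds=1), 2.0, timedelta(seconds=30), 5)
total_wait = sum(schedule, timedelta())
```

With the book_flight settings the waits are 1s, 2s, 4s, and 8s, so the policy gives up after roughly 15 seconds of backoff plus the time spent in the five attempts. The 30-second cap only starts to matter from the sixth retry onward.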

Human-in-the-Loop with Signals

from datetime import timedelta
from typing import Optional
from temporalio import workflow

@workflow.defn
class ApprovalWorkflow:
    def __init__(self) -> None:
        self.approved: Optional[bool] = None
        self.reason: str = ""

    @workflow.signal
    async def approve(self, approved: bool, reason: str) -> None:
        """Human sends this signal to approve/reject"""
        self.approved = approved
        self.reason = reason

    @workflow.run
    async def run(self, request: ApprovalRequest) -> ApprovalResult:
        # Execute some work (activity calls need an explicit timeout)
        analysis = await workflow.execute_activity(
            analyze_request,
            args=[request],
            start_to_close_timeout=timedelta(minutes=5),
        )

        if analysis.needs_approval:
            # Wait for human signal (can wait days)
            await workflow.wait_condition(lambda: self.approved is not None)

            if not self.approved:
                return ApprovalResult(status="rejected", reason=self.reason)

        # Continue with approved workflow
        return await workflow.execute_activity(
            complete_request,
            args=[request],
            start_to_close_timeout=timedelta(minutes=5),
        )
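
An unbounded wait is rarely what you want for a human approval. The sketch below shows the shape of "wait for a signal, but give up after a deadline" in plain asyncio. ApprovalGate is a hypothetical in-process stand-in, not Temporal code and not durable; in a Temporal workflow the equivalent move is passing a timeout to workflow.wait_condition and handling the resulting asyncio.TimeoutError.

```python
import asyncio
from typing import List, Optional


class ApprovalGate:
    """In-memory stand-in for a workflow waiting on an approval signal."""

    def __init__(self) -> None:
        self.approved: Optional[bool] = None
        self._changed = asyncio.Event()

    def signal(self, approved: bool) -> None:
        # Called by the "human" side to deliver the decision.
        self.approved = approved
        self._changed.set()

    async def wait(self, timeout_seconds: float) -> str:
        try:
            await asyncio.wait_for(self._changed.wait(), timeout=timeout_seconds)
        except asyncio.TimeoutError:
            # Nobody answered in time: escalate instead of waiting forever.
            return "escalated"
        return "approved" if self.approved else "rejected"


async def demo() -> List[str]:
    silent = ApprovalGate()
    timed_out = await silent.wait(timeout_seconds=0.01)

    answered = ApprovalGate()
    asyncio.get_running_loop().call_later(0.01, answered.signal, True)
    decided = await answered.wait(timeout_seconds=1.0)
    return [timed_out, decided]
```

The design point is that the timeout path returns a real outcome ("escalated") rather than raising, so the caller has to decide what happens when the human never shows up.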

When to Use Temporal

Complex multi-step workflows
Long-running processes (hours to weeks)
Enterprise requirements (audit trails, compliance)
Need strong consistency guarantees
Already have infrastructure team capacity (Temporal Server is non-trivial to self-host; Temporal Cloud removes that load at a higher per-action cost)

Takeaway: Temporal’s strength is the workflow-as-code abstraction with replay determinism; its cost is operational weight. Use it when you’d otherwise spend months building the same thing.

Inngest

Event-driven, fully managed, TypeScript-first.

Core Concepts

import { Inngest } from "inngest";

const inngest = new Inngest({ id: "my-agent" });

export const agentWorkflow = inngest.createFunction(
  {
    id: "process-customer-request",
    retries: 5, // Built-in retry
  },
  { event: "customer/request.received" },
  async ({ event, step }) => {
    // Each step is automatically checkpointed
    // If we crash after classify, we resume at route

    const classification = await step.run("classify", async () => {
      return await llm.classify(event.data.message);
    });

    const route = await step.run("route", async () => {
      if (classification.confidence < 0.7) {
        return "human";
      }
      return classification.intent;
    });

    if (route === "human") {
      // Wait for human response (can wait up to the configured timeout).
      // step.waitForEvent(stepId, options), Inngest SDK v4+.
      const humanResponse = await step.waitForEvent("wait-for-human", {
        event: "human/responded",
        match: "data.request_id",
        timeout: "7d",
      });
      // waitForEvent resolves to null when the timeout elapses with no match.
      if (humanResponse === null) {
        return { status: "timed_out" };
      }
      return humanResponse;
    }

    // Continue with automated handling
    const result = await step.run("execute", async () => {
      return await agent.execute(classification.intent, event.data);
    });

    return result;
  }
);

Token Budget with Inngest

export const budgetedAgent = inngest.createFunction(
  { id: "budgeted-agent" },
  { event: "agent/task.started" },
  async ({ event, step }) => {
    // Accumulate outside step.run: a completed step returns its saved result
    // on re-invocation without re-running the callback, so a counter mutated
    // inside the callback would silently reset.
    let tokensUsed = 0;
    const maxTokens = 100000;

    const plan = await step.run("plan", async () => {
      return await llm.plan(event.data.task);
    });
    tokensUsed += plan.usage.total_tokens;

    for (const [index, action] of plan.actions.entries()) {
      if (tokensUsed >= maxTokens) {
        // Graceful shutdown within budget
        return {
          status: "budget_exceeded",
          completed: index,
        };
      }

      const result = await step.run(`execute-${action.id}`, async () => {
        return await agent.execute(action);
      });
      tokensUsed += result.usage?.total_tokens || 0;
    }

    return { status: "completed", tokensUsed };
  }
);
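
One subtlety of step-based execution is worth spelling out with budget counters: the handler re-runs from the top on each invocation, and completed steps return their saved result without executing the callback again. The Python sketch below is a toy model of that memoization, not Inngest's implementation; StepMemo, fake_llm_call, and both counting functions are hypothetical.

```python
from typing import Any, Callable, Dict


class StepMemo:
    """Toy step store: a step's callback runs once, later calls return the saved result."""

    def __init__(self, saved: Dict[str, Any]) -> None:
        self.saved = saved  # survives across invocations of the handler

    def run(self, step_id: str, fn: Callable[[], Any]) -> Any:
        if step_id not in self.saved:
            self.saved[step_id] = fn()
        return self.saved[step_id]


def fake_llm_call() -> Dict[str, int]:
    return {"total_tokens": 40}


def count_inside(step: StepMemo) -> int:
    tokens = 0

    def plan() -> Dict[str, int]:
        nonlocal tokens
        result = fake_llm_call()
        tokens += result["total_tokens"]  # skipped whenever the step is replayed
        return result

    step.run("plan", plan)
    return tokens


def count_outside(step: StepMemo) -> int:
    tokens = 0
    result = step.run("plan", fake_llm_call)
    tokens += result["total_tokens"]  # recomputed from the saved result every time
    return tokens
```

Calling count_inside twice against the same store yields 40 and then 0, while count_outside yields 40 both times. Derive running totals from step return values, never from mutations made inside the step body.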

When to Use Inngest

Event-driven architectures
Serverless deployments
Want fully managed infrastructure
TypeScript/Node.js primary stack
Need fast iteration speed

step.waitForEvent is part of the Inngest TypeScript SDK v4+ [inngest-docs] — the match parameter is optional, timeout is required.

Takeaway: Inngest is the fastest path from prototype to durable, event-driven agent — at the cost of a managed-only deployment story and a smaller language footprint than Temporal.

Restate

Single-binary self-host, Rust runtime, low-latency journaling.

Core Concepts

// @restatedev/restate-sdk 1.x
import * as restate from "@restatedev/restate-sdk";

const agentService = restate.service({
  name: "agent",
  handlers: {
    processRequest: async (ctx: restate.Context, request: AgentRequest) => {
      // Each ctx.run() is automatically checkpointed
      // Idempotency is built-in via deterministic execution

      const classification = await ctx.run("classify", async () => {
        return await llm.classify(request.message);
      });

      if (classification.needs_approval) {
        // Create a durable promise for the human decision. ctx.awakeable()
        // returns { id, promise } directly; only the promise is awaited.
        const approval = ctx.awakeable<ApprovalResult>();

        // This ID can be used by external system to complete the awakeable
        console.log(`Awaiting approval: ${approval.id}`);

        const result = await approval.promise;
        if (!result.approved) {
          return { status: "rejected" };
        }
      }

      const result = await ctx.run("execute", async () => {
        return await agent.execute(classification.intent, request);
      });

      return result;
    },
  },
});

// Complete awakeable from external system (e.g., webhook)
// Requires: import * as clients from "@restatedev/restate-sdk-clients";
async function approveRequest(awakeableId: string, approved: boolean) {
  const ingress = clients.connect({ url: "http://localhost:8080" });
  await ingress.resolveAwakeable(awakeableId, { approved });
}
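
The awakeable handshake is easier to reason about once you see it stripped down: the handler creates an ID plus a promise, hands the ID to the outside world, and suspends until someone resolves that ID. Here is a minimal in-memory sketch in Python. AwakeableRegistry is a hypothetical name, and unlike Restate's version nothing here survives a process restart.

```python
import asyncio
import uuid
from typing import Any, Dict, Tuple


class AwakeableRegistry:
    """Maps awakeable IDs to pending futures (in memory only, not durable)."""

    def __init__(self) -> None:
        self._pending: Dict[str, asyncio.Future] = {}

    def create(self) -> Tuple[str, asyncio.Future]:
        awakeable_id = f"awk_{uuid.uuid4().hex}"
        future = asyncio.get_running_loop().create_future()
        self._pending[awakeable_id] = future
        return awakeable_id, future

    def resolve(self, awakeable_id: str, payload: Any) -> None:
        # Called by the external system, e.g. from a webhook handler.
        self._pending.pop(awakeable_id).set_result(payload)


async def demo() -> Dict[str, bool]:
    registry = AwakeableRegistry()
    awakeable_id, promise = registry.create()
    # Stand-in for a human clicking "approve" some time later.
    asyncio.get_running_loop().call_soon(
        registry.resolve, awakeable_id, {"approved": True}
    )
    return await promise
```

What a durable framework adds on top of this shape is persistence of the pending entry, so the resolve call still finds its target after the handler's process has crashed and restarted.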

Virtual Objects for Stateful Agents

const agentSession = restate.object({
  name: "agent-session",
  handlers: {
    // State is automatically persisted per session ID
    addMessage: async (ctx: restate.ObjectContext, message: Message) => {
      const history = (await ctx.get<Message[]>("history")) || [];
      history.push(message);
      ctx.set("history", history);

      const response = await ctx.run("generate", async () => {
        return await llm.chat(history);
      });

      history.push({ role: "assistant", content: response });
      ctx.set("history", history);

      return response;
    },

    getHistory: async (ctx: restate.ObjectContext) => {
      return (await ctx.get<Message[]>("history")) || [];
    },
  },
});

When to Use Restate

Low-latency requirements
Lightweight deployment (single binary)
Strong consistency without heavy infrastructure
TypeScript or JVM stack
Want to self-host easily

The TypeScript SDK exposes ctx.run, ctx.awakeable<T>(), and restate.object virtual objects with per-key state [restate-docs]. Restate’s “deterministic helpers” — ctx.rand.uuidv4(), ctx.date.now() — give you replay-safe randomness and time without separate libraries.

Takeaway: Restate is the lightest self-hosted option covered here; the price is a smaller ecosystem and a newer codebase. Use it when single-binary deployment matters more than vendor maturity.

Azure Durable Functions

Native Azure integration. Great for Azure-first shops [azure-durable-events].

Core Concepts

import azure.functions as func
import azure.durable_functions as df

# Orchestrator function — registered via df.Orchestrator.create wrapper
def orchestrator_function(context: df.DurableOrchestrationContext):
    request = context.get_input()

    # Each activity is checkpointed.
    classification = yield context.call_activity("classify_request", request)

    if classification["needs_approval"]:
        # Wait for external event (human approval). Returns the event payload.
        approval = yield context.wait_for_external_event("approval")

        if not approval["approved"]:
            return {"status": "rejected"}

    result = yield context.call_activity("execute_agent_action", classification)
    return result

main = df.Orchestrator.create(orchestrator_function)

# Activity functions — separate Function App entry points
def classify_request(request: dict) -> dict:
    return llm.classify(request["message"])

def execute_agent_action(classification: dict) -> dict:
    return agent.execute(classification["intent"])

Fan-out/Fan-in Pattern

import azure.functions as func
import azure.durable_functions as df

def parallel_research(context: df.DurableOrchestrationContext):
    queries = context.get_input()["queries"]

    # Fan out: run research tasks in parallel
    tasks = [context.call_activity("research_query", q) for q in queries]

    # Fan in: wait for all to complete
    results = yield context.task_all(tasks)

    # Synthesize results
    synthesis = yield context.call_activity("synthesize_results", results)

    return synthesis

main = df.Orchestrator.create(parallel_research)
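
If the generator syntax obscures the pattern, the same fan-out/fan-in shape in plain asyncio may help. This is an in-process analogue only, with a hypothetical research_query coroutine standing in for the activity; it has none of the checkpointing that context.task_all gives you.

```python
import asyncio
from typing import List


async def research_query(query: str) -> str:
    # Stand-in for a slow activity such as a search or LLM call.
    await asyncio.sleep(0)
    return f"notes on {query}"


async def parallel_research(queries: List[str]) -> str:
    # Fan out: create one task per query without awaiting any of them yet.
    tasks = [research_query(q) for q in queries]
    # Fan in: gather waits for all of them and preserves input order.
    results = await asyncio.gather(*tasks)
    # Synthesize: combine the individual results into one answer.
    return " | ".join(results)
```

Calling asyncio.run(parallel_research(["pricing", "latency"])) returns the two notes joined in input order. The durable version differs in one important way: if the process dies after three of five queries finish, only the remaining two are re-run.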

When to Use Azure Durable Functions

Already on Azure
.NET or Python primary stack
Want serverless with durable state
Need tight Azure service integration
Cost optimization via consumption pricing

wait_for_external_event provides at-least-once delivery for human-approval events; include a unique event ID for manual deduplication, especially on the Azure Storage backend [azure-durable-events]. The Consumption Plan does not bill while an orchestrator is awaiting an external event.
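
At-least-once delivery means the same approval can arrive twice, so deduplication by event ID is the caller's job. A minimal sketch of that check follows, assuming each event carries a unique event_id field (a hypothetical payload shape, as is the ApprovalInbox class). In a real orchestrator the seen-set would have to live in orchestration state rather than process memory.

```python
from typing import Dict, List, Set


class ApprovalInbox:
    """Accepts each approval event once, keyed by its unique event ID."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()
        self.decisions: List[Dict] = []

    def deliver(self, event: Dict) -> bool:
        event_id = event["event_id"]
        if event_id in self._seen:
            # Duplicate delivery of an event we already processed: ignore it.
            return False
        self._seen.add(event_id)
        self.decisions.append(event)
        return True


inbox = ApprovalInbox()
first = inbox.deliver({"event_id": "evt-1", "approved": True})
duplicate = inbox.deliver({"event_id": "evt-1", "approved": True})
```

Here first is True and duplicate is False, and inbox.decisions holds a single entry. The sender has to generate the ID once per human decision, not once per delivery attempt, or the check is useless.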

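Takeaway: Durable Functions is the right choice if you’re already paying for Azure. The examples here use the Python v1 programming model (function.json plus the df.Orchestrator.create(fn) wrapper); the v2 model registers orchestrators with decorators on a df.DFApp instance instead, so check which model a tutorial targets before copying from it.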

AWS Step Functions

Visual workflows, tight AWS integration. Best for teams who like declarative state machines [aws-stepfn-callback].

Core Concepts

{
  "Comment": "Agent workflow with human approval",
  "StartAt": "ClassifyRequest",
  "States": {
    "ClassifyRequest": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123:function:classify",
      "Next": "NeedsApproval"
    },
    "NeedsApproval": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.needsApproval",
          "BooleanEquals": true,
          "Next": "WaitForApproval"
        }
      ],
      "Default": "ExecuteAction"
    },
    "WaitForApproval": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
      "Parameters": {
        "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123/approvals",
        "MessageBody": {
          "TaskToken.$": "$$.Task.Token",
          "Request.$": "$"
        }
      },
      "TimeoutSeconds": 604800,
      "ResultPath": "$.approval",
      "Next": "CheckApproval"
    },
    "CheckApproval": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.approval.approved",
          "BooleanEquals": false,
          "Next": "Rejected"
        }
      ],
      "Default": "ExecuteAction"
    },
    "ExecuteAction": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123:function:execute",
      "End": true
    },
    "Rejected": {
      "Type": "Fail",
      "Error": "ApprovalRejected",
      "Cause": "Human rejected the action"
    }
  }
}

When to Use Step Functions

Already on AWS
Visual workflow design preferred
Need built-in AWS service integrations
Long-running workflows with wait states (Standard Workflows support up to one-year execution lifetimes [aws-stepfn-callback])
Audit and compliance requirements

The .waitForTaskToken integration pattern plus the $$.Task.Token context-object accessor is the canonical human-in-the-loop mechanism: the external system completes the task via SendTaskSuccess / SendTaskFailure. Set TimeoutSeconds to bound the overall wait; add HeartbeatSeconds only if the external system can call SendTaskHeartbeat, because a missed heartbeat fails the task.
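
The other half of the callback pattern lives in whatever consumes the approval queue. The sketch below shows one way that completion step might look, assuming the message body has the TaskToken field set by the state machine above. The complete_approval helper is hypothetical; the client is passed in so the function can be exercised with a fake, and in production it would be boto3.client("stepfunctions").

```python
import json
from typing import Any, Dict


def complete_approval(
    sfn_client: Any, message_body: str, approved: bool, reason: str = ""
) -> Dict[str, str]:
    """Finish a paused .waitForTaskToken task with the reviewer's decision."""
    body = json.loads(message_body)
    params = {
        # The state machine placed the token under "TaskToken" in the message body.
        "taskToken": body["TaskToken"],
        # SendTaskSuccess expects the task output as a JSON string.
        "output": json.dumps({"approved": approved, "reason": reason}),
    }
    sfn_client.send_task_success(**params)
    return params
```

A rejection is still reported through send_task_success here, with approved set to false, because the state machine routes on that field in its Choice state. Reserve SendTaskFailure for cases where the approval step itself broke.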

Takeaway: Step Functions is the right answer when your stack is AWS-native and the diagram-as-workflow model is a feature, not a constraint.

Decision Framework

If You Need…	Use
Complex multi-step workflows with strong guarantees	Temporal
Fast iteration, event-driven, serverless	Inngest
Low latency, lightweight, self-hosted	Restate
Azure-native, serverless, cost optimization	Azure Durable Functions
AWS-native, visual workflows, service integrations	AWS Step Functions
GCP-native, simple YAML workflows	GCP Cloud Workflows

The “Build vs Buy” Decision Tree

BUILD vs BUY DECISION

Are you writing retry logic with exponential backoff?
→ Consider a durable execution framework

Are you implementing checkpointing to survive crashes?
→ Consider a durable execution framework

Are you building idempotency key management?
→ Consider a durable execution framework

Are you handling human-in-the-loop with long waits?
→ Consider a durable execution framework

If yes to 2+ of these, you're reinventing the wheel.

Takeaway: the decision table is the chapter’s most-reused artefact. The “build vs buy” question almost always points to buy — unless your operational model can’t host the chosen framework’s deployment shape.

What to Copy, What to Skip

Pattern	Copy when	Skip when
Temporal workflow-as-code	Multi-step workflows, replay-determinism matters, team can run Temporal Server or pay for Cloud	Single-step event handlers — overkill
Inngest `step.waitForEvent`	Event-driven serverless agents, TypeScript primary stack, want managed-everything	Need self-hosted control plane or non-TS languages outside their support list
Restate `ctx.awakeable`	Lightweight durable HITL, single-binary self-host, low-latency needs	Established ops team already on Temporal Cloud or AWS — adoption cost outweighs gain
Azure Durable Functions `wait_for_external_event`	Azure-native shop, want serverless billing on idle, Python or .NET primary	Multi-cloud architecture — couples you to Azure orchestration
AWS Step Functions `.waitForTaskToken`	AWS-native shop, visual state-machine design preferred, audit-trail requirements	Heavy compute logic — ASL is restrictive for arbitrary code; defer to Lambda
DIY checkpointer + retry loop	Truly esoteric workloads where no framework fits	Anything resembling Ch01–Ch02 patterns — you’re reinventing
Same idempotency-key derivation across the workflow [pa-idempotency]	Every framework — the key is yours, not the framework’s	Never skip — non-idempotent activities + at-least-once delivery = double-charged customers
Cross-framework portability layer	Multi-cloud or multi-framework teams — abstract the workflow API behind your own interface	Single-framework shops — premature abstraction

Takeaway: every pattern has a “skip when”. Frameworks are not interchangeable; the wrong one is more painful than building from scratch was supposed to be.
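
Since the idempotency key is yours in every framework, it is worth having one derivation you reuse everywhere. A minimal sketch, assuming the workflow ID and step name are stable across retries; the idempotency_key helper and its truncation to 32 hex characters are illustrative choices, not a standard.

```python
import hashlib


def idempotency_key(workflow_id: str, step: str, *stable_inputs: str) -> str:
    """Derive a key that is identical on every retry of the same logical step."""
    # The unit separator keeps ("ab", "c") from colliding with ("a", "bc").
    material = "\x1f".join((workflow_id, step, *stable_inputs))
    return hashlib.sha256(material.encode("utf-8")).hexdigest()[:32]


charge_key = idempotency_key("booking-42", "charge", "user-7")
retry_key = idempotency_key("booking-42", "charge", "user-7")
refund_key = idempotency_key("booking-42", "refund", "user-7")
```

charge_key and retry_key are equal, while refund_key differs. Keep timestamps, attempt counters, and random values out of the inputs, since anything that changes between retries defeats the purpose.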

Common Gotchas

Gotcha	Symptom	Fix
Non-deterministic code inside workflow	Replay fails mysteriously; activities run twice or wrong values	Move all non-deterministic ops (HTTP, random, time) into activities (Temporal) or `ctx.run` blocks (Restate); use framework-provided deterministic helpers (`ctx.rand`, `ctx.date`)
At-least-once without idempotency keys	Customer charged 3x when the worker retries a side-effecting activity	Every external-side-effect activity takes an `idempotency_key` argument derived from stable workflow inputs (Ch01 pattern)
Indefinite `waitForExternalEvent` / `waitForTaskToken`	Workflow stuck for the full one-year service quota when the human never responds	Race a durable timer against the event; on timeout, escalate or reject
Forgetting `HeartbeatSeconds` on AWS callback tasks	Task waits up to one year even when the external system is dead	Set `HeartbeatSeconds` and call `SendTaskHeartbeat` periodically from the external worker
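Mixing Azure Python programming models (v1 `df.Orchestrator.create` wrapper vs v2 `df.DFApp` decorators) copied from different tutorials	Function App doesn’t recognise the orchestrator on deploy	Use one model per Function App: either `main = df.Orchestrator.create(orchestrator_function)` with function.json (v1) or the `@app.orchestration_trigger` decorators (v2)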
Mixing framework versions across services	Workflow replay corrupts when worker SDK version diverges from history schema	Pin SDK versions per service; gate workflow upgrades behind a version-aware routing strategy (Temporal `workflow.patched()` or Worker Versioning, Inngest function versions)

Takeaway: every gotcha here is a non-determinism or version-skew bug in disguise. Frameworks reward strict discipline at workflow boundaries; they punish “just one quick time.time() call” mercilessly.
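
To see why a stray clock read is so destructive, consider what replay actually checks. The sketch below is a toy model of that check, not any framework's real mechanism: CommandLog, ReplayMismatch, and branching_on_clock are hypothetical. The log records which commands the workflow code schedules and, on replay, compares them against recorded history.

```python
from typing import Callable, List, Optional


class ReplayMismatch(Exception):
    """Raised when replayed code schedules a different command than history recorded."""


class CommandLog:
    def __init__(self, history: Optional[List[str]] = None) -> None:
        self.history = history  # None means first execution (recording mode)
        self.commands: List[str] = []

    def schedule(self, command: str) -> None:
        index = len(self.commands)
        self.commands.append(command)
        if self.history is None:
            return
        expected = self.history[index] if index < len(self.history) else None
        if expected != command:
            raise ReplayMismatch(
                f"step {index}: history has {expected!r}, code scheduled {command!r}"
            )


def branching_on_clock(log: CommandLog, now: Callable[[], float]) -> None:
    # Anti-pattern: the branch depends on a value that changes between runs.
    if int(now()) % 2 == 0:
        log.schedule("charge_card")
    else:
        log.schedule("send_reminder")
```

Record a run with the clock at an even second, replay it with the clock at an odd second, and the log raises ReplayMismatch. The fix in every framework is the same: read the time inside a journaled step or through a framework-provided helper, so replay sees the originally recorded value.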

Migration Path

If you have existing agent code, here’s how to migrate:

1. Identify Side Effects

# Before: Side effects scattered in code
def process_request(request):
    classification = llm.classify(request)  # LLM call
    result = None  # stays None when no action is needed
    if classification.needs_action:
        result = api.execute(classification)  # External API
        email.send(request.user, result)      # Email
    return result

2. Extract as Activities

# After: Side effects are activities
from datetime import timedelta
from temporalio import activity, workflow

@activity.defn
async def classify(request): return llm.classify(request)

@activity.defn
async def execute_action(classification): return api.execute(classification)

@activity.defn
async def send_email(user, result): email.send(user, result)

@workflow.defn
class RequestWorkflow:
    @workflow.run
    async def run(self, request):
        timeout = timedelta(minutes=1)  # every activity call needs a timeout
        classification = await workflow.execute_activity(
            classify, args=[request], start_to_close_timeout=timeout)
        result = None  # stays None when no action is needed
        if classification.needs_action:
            result = await workflow.execute_activity(
                execute_action, args=[classification], start_to_close_timeout=timeout)
            await workflow.execute_activity(
                send_email, args=[request.user, result], start_to_close_timeout=timeout)
        return result

Takeaway: migration is mechanical — name your side effects, wrap them as activities, compose them in a workflow. The patterns from Ch01 [pa-idempotency] (idempotency keys) and Ch02 [pa-state] (checkpoint timing) carry over unchanged; the framework handles the rest.

Closing

Durable execution productises everything in the prior chapters: retries become RetryPolicy, checkpoints become free at every step, human-in-the-loop becomes a signal, and at-least-once delivery + your idempotency key becomes effectively-once business effect. The five frameworks differ in operational shape, language ergonomics, and ecosystem maturity — not in what they solve.

Testable question: for your top three external API calls, can you name which framework primitive replaces the retry/checkpoint/idempotency code you’d otherwise hand-write? If not, the framework choice is theoretical.

→ Part 7: Security & Sandboxing — securing agents that execute code and call external APIs.

References

[temporal-py-docs] Temporal — Python SDK: Failure Detection & Retries. docs.temporal.io/develop/python/failure-detection. Source for temporalio.common.RetryPolicy, non_retryable_error_types, and the at-least-once activity contract.
[inngest-docs] Inngest — step.waitForEvent API reference. inngest.com/docs/reference/functions/step-wait-for-event. Source for the TypeScript SDK v4+ match + timeout parameters and the Promise<null | EventPayload> return type.
[restate-docs] Restate — Journaling Results in TypeScript. docs.restate.dev/develop/ts/journaling-results. Source for ctx.run, deterministic helpers (ctx.rand, ctx.date), and the no-nesting constraint for context calls.
[azure-durable-events] Microsoft Learn — Handle External Events in Durable Orchestrations, updated 2026-05-06. learn.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-external-events. Source for context.wait_for_external_event semantics, df.Orchestrator.create wrapper, and the at-least-once delivery guarantee.
[aws-stepfn-callback] AWS — Step Functions: Discover Service Integration Patterns. docs.aws.amazon.com/step-functions/latest/dg/connect-to-resource.html. Source for .waitForTaskToken, $$.Task.Token context-object access, SendTaskSuccess/SendTaskFailure, HeartbeatSeconds timeout, and the one-year execution lifetime ceiling.
[pa-idempotency] Production Agents — Part 1: Idempotency & Safe Retries. Source for the idempotency-key derivation that turns at-least-once delivery into effectively-once business effect.
[pa-state] Production Agents — Part 2: State Persistence & Checkpointing. The checkpoint timing patterns that durable execution frameworks automate.