Prerequisite: This is Part 6 of the Production Agents Deep Dive series. Read Part 1: Idempotency and Part 2: State Persistence first — durable execution frameworks are the productised form of both.
Every problem the prior chapters covered is already solved by Temporal, Inngest, Restate, AWS Step Functions, and Azure Durable Functions. The chapter is about which to pick and what each one costs.
Why This Matters
You’ve read about idempotency, checkpointing, retries, and state persistence. Here’s the secret: all of these problems have been solved before.
Durable execution frameworks handle:
- State persistence automatically
- Retries with exponential backoff built-in
- Checkpointing at every step
- Exactly-once semantics (your idempotent business effect + their at-least-once delivery)
- Long-running workflows (hours to weeks)
- Human-in-the-loop interrupts
If you’re writing your own checkpointing + retry + recovery logic, you’re probably reinventing a durable execution framework — badly.
What other content gets wrong: Temporal blog posts evangelise Temporal. Inngest blog posts evangelise Inngest. Each makes its own framework sound like the inevitable answer. The honest take is that the five frameworks covered in depth below solve the same set of problems with very different operational cost models, deployment shapes, and language ergonomics. This chapter pins the canonical API for each as of 2026-05.
Takeaway: if you’re hand-writing the retry/checkpoint/idempotency triad, you’re rebuilding a framework that already exists. The only real questions are which one and at what cost.
What Durable Execution Means
A durable execution framework guarantees:
- State survives failures: If your process crashes, it resumes from the last step
- Effectively-once side effects: Delivery is at-least-once, so a side effect lands exactly once only when the step itself is idempotent
- Automatic retries: Transient failures handled without your code knowing
- Long-running support: Workflows can pause for days, waiting for human input
Traditional Code: Start → Execute → [Crash] → Start over from scratch
Durable Execution: Start → Execute → [Crash] → Resume from last checkpoint
Takeaway: durable execution is replay + persistence + idempotency, productised. Same primitives as Ch01+Ch02, packaged for you.
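The resume-from-checkpoint behaviour can be illustrated with a toy journal. This is a minimal sketch, not how any of the frameworks below is implemented: `Journal`, `book`, `charge`, and the simulated crash are all hypothetical, and a real framework persists the journal durably rather than in memory.

```python
import json
from typing import Any, Callable, Dict, List


class Journal:
    """Toy in-memory journal; a real framework stores this durably."""

    def __init__(self) -> None:
        self.entries: Dict[str, str] = {}

    def run(self, step_id: str, fn: Callable[[], Any]) -> Any:
        # Replay path: a recorded step returns its stored result without re-running fn.
        if step_id in self.entries:
            return json.loads(self.entries[step_id])
        result = fn()
        self.entries[step_id] = json.dumps(result)
        return result


calls: List[str] = []


def book() -> dict:
    calls.append("book")
    return {"booking_id": "b-1"}


def charge() -> dict:
    calls.append("charge")
    if calls.count("charge") == 1:
        raise RuntimeError("simulated crash")  # first attempt dies mid-workflow
    return {"charged": True}


def workflow(journal: Journal) -> dict:
    booking = journal.run("book", book)
    payment = journal.run("charge", charge)
    return {**booking, **payment}


journal = Journal()
try:
    workflow(journal)
except RuntimeError:
    pass  # the "process" crashed after book was journaled
result = workflow(journal)  # second run replays book and re-executes only charge
```

The second run never calls `book` again, which is the whole point: the journal, not the code path, decides what still needs to execute.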
Framework Comparison
| Framework | Best For | Deployment | Language Support |
|---|---|---|---|
| Temporal | Complex workflows, enterprise | Self-hosted or Temporal Cloud | Go, Java, Python, TypeScript |
| Inngest | Event-driven, serverless | Fully managed | TypeScript, Python |
| Restate | Low latency, lightweight | Self-hosted or Restate Cloud | TypeScript, Java, Kotlin, Go |
| Azure Durable Functions | Azure-native | Azure Functions | C#, JavaScript, Python, PowerShell |
| AWS Step Functions | AWS-native, visual | AWS native | JSON state machine (ASL), any via Lambda |
| GCP Cloud Workflows | GCP-native, YAML | GCP native | YAML config |
Takeaway: the cells here are decisions, not slogans. Pick by deployment shape and language stack, then verify the operational model — managed vs self-hosted is a bigger cost driver than feature parity.
Temporal
The de-facto standard for complex, long-running workflows; published production users include Stripe, Snap, Datadog, and Netflix. Code below uses the Python SDK 1.x [temporal-py-docs].
Core Concepts
```python
from datetime import timedelta

from temporalio import activity, workflow
from temporalio.common import RetryPolicy

# BookingRequest, BookingResult, PaymentResult and the flight_api, payment_api
# and email_service clients are assumed to be defined elsewhere in your codebase.


@activity.defn
async def book_flight(flight_id: str, idempotency_key: str) -> BookingResult:
    """
    Activities are your side effects. Temporal handles retries automatically.
    Your idempotent implementation + Temporal's at-least-once delivery
    = effectively-once business effect.
    """
    return await flight_api.book(flight_id, idempotency_key=idempotency_key)


@activity.defn
async def charge_payment(amount: float, idempotency_key: str) -> PaymentResult:
    return await payment_api.charge(amount, idempotency_key=idempotency_key)


@activity.defn
async def send_confirmation(email: str, booking: BookingResult) -> None:
    await email_service.send(email, template="booking_confirmation", data=booking)


@workflow.defn
class BookingWorkflow:
    @workflow.run
    async def run(self, request: BookingRequest) -> BookingResult:
        # Each step is automatically checkpointed.
        # If we crash after book_flight, we resume at charge_payment.
        booking = await workflow.execute_activity(
            book_flight,
            args=[request.flight_id, f"{request.user_id}:{request.booking_id}:book"],
            # Every activity call needs start_to_close_timeout or schedule_to_close_timeout.
            start_to_close_timeout=timedelta(seconds=30),
            retry_policy=RetryPolicy(
                initial_interval=timedelta(seconds=1),
                maximum_interval=timedelta(seconds=30),
                backoff_coefficient=2.0,
                maximum_attempts=5,
                non_retryable_error_types=["ValidationError", "AuthError"],
            ),
        )
        payment = await workflow.execute_activity(
            charge_payment,
            args=[booking.total_amount, f"{request.user_id}:{request.booking_id}:pay"],
            start_to_close_timeout=timedelta(seconds=30),
            retry_policy=RetryPolicy(maximum_attempts=3),
        )
        # This await blocks until the email activity finishes, so it is not
        # fire-and-forget: the workflow returns only after the send completes.
        await workflow.execute_activity(
            send_confirmation,
            args=[request.email, booking],
            start_to_close_timeout=timedelta(minutes=5),
        )
        return booking
```
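To make the first RetryPolicy above concrete, the arithmetic behind its four parameters can be worked out by hand. This is a minimal sketch assuming plain capped exponential backoff; `retry_delays` is a hypothetical helper, not part of the Temporal SDK, and it ignores scheduling latency.

```python
from datetime import timedelta
from typing import List


def retry_delays(
    initial: timedelta,
    coefficient: float,
    maximum: timedelta,
    max_attempts: int,
) -> List[timedelta]:
    """Waits before attempts 2..max_attempts under capped exponential backoff."""
    delays = []
    for retry in range(max_attempts - 1):
        # Each retry multiplies the previous wait by the coefficient, up to the cap.
        delay = initial * (coefficient ** retry)
        delays.append(min(delay, maximum))
    return delays


# Same numbers as the book_flight policy: 1s initial, 2.0 coefficient, 30s cap, 5 attempts.
delays = retry_delays(timedelta(seconds=1), 2.0, timedelta(seconds=30), 5)
seconds = [d.total_seconds() for d in delays]
```

Five attempts mean four waits, so the cap of 30 seconds is never reached with these settings; it only matters once `maximum_attempts` grows past six.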
Human-in-the-Loop with Signals
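```python
from datetime import timedelta
from typing import Optional

from temporalio import workflow

# ApprovalRequest, ApprovalResult and the analyze_request / complete_request
# activities are assumed to be defined elsewhere.


@workflow.defn
class ApprovalWorkflow:
    def __init__(self) -> None:
        # None means "no decision yet"; both fields are set by the approve signal.
        self.approved: Optional[bool] = None
        self.reason: Optional[str] = None

    @workflow.signal
    async def approve(self, approved: bool, reason: str) -> None:
        """Human sends this signal to approve/reject"""
        self.approved = approved
        self.reason = reason

    @workflow.run
    async def run(self, request: ApprovalRequest) -> ApprovalResult:
        # Execute some work
        analysis = await workflow.execute_activity(
            analyze_request,
            args=[request],
            start_to_close_timeout=timedelta(minutes=1),
        )
        if analysis.needs_approval:
            # Wait for human signal (can wait days)
            await workflow.wait_condition(lambda: self.approved is not None)
            if not self.approved:
                return ApprovalResult(status="rejected", reason=self.reason)
        # Continue with approved workflow
        return await workflow.execute_activity(
            complete_request,
            args=[request],
            start_to_close_timeout=timedelta(minutes=1),
        )
```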
When to Use Temporal
- Complex multi-step workflows
- Long-running processes (hours to weeks)
- Enterprise requirements (audit trails, compliance)
- Need strong consistency guarantees
- Already have infrastructure team capacity (Temporal Server is non-trivial to self-host; Temporal Cloud removes that load at a higher per-action cost)
Takeaway: Temporal’s strength is the workflow-as-code abstraction with replay determinism; its cost is operational weight. Use it when you’d otherwise spend months building the same thing.
Inngest
Event-driven, fully managed, TypeScript-first.
Core Concepts
```typescript
import { Inngest } from "inngest";

// llm and agent are assumed application-level clients defined elsewhere.
const inngest = new Inngest({ id: "my-agent" });

export const agentWorkflow = inngest.createFunction(
  {
    id: "process-customer-request",
    retries: 5, // Built-in retry
  },
  { event: "customer/request.received" },
  async ({ event, step }) => {
    // Each step is automatically checkpointed
    // If we crash after classify, we resume at route
    const classification = await step.run("classify", async () => {
      return await llm.classify(event.data.message);
    });

    const route = await step.run("route", async () => {
      if (classification.confidence < 0.7) {
        return "human";
      }
      return classification.intent;
    });

    if (route === "human") {
      // Wait for human response (can wait up to the configured timeout).
      // step.waitForEvent(stepId, options): Inngest SDK v4+.
      const humanResponse = await step.waitForEvent("wait-for-human", {
        event: "human/responded",
        match: "data.request_id",
        timeout: "7d",
      });
      // waitForEvent resolves to null when the timeout elapses first.
      if (humanResponse === null) {
        return { status: "timed_out" };
      }
      return humanResponse;
    }

    // Continue with automated handling
    const result = await step.run("execute", async () => {
      return await agent.execute(classification.intent, event.data);
    });
    return result;
  }
);
```
Token Budget with Inngest
```typescript
// Reuses the inngest client, llm and agent from the previous example.
export const budgetedAgent = inngest.createFunction(
  { id: "budgeted-agent" },
  { event: "agent/task.started" },
  async ({ event, step }) => {
    const maxTokens = 100000;
    let tokensUsed = 0;

    const plan = await step.run("plan", async () => {
      return await llm.plan(event.data.task);
    });
    // Accumulate from the returned (memoized) result, outside step.run:
    // completed steps do not re-run their callbacks when the function
    // re-executes, so a counter mutated inside a callback would be lost.
    tokensUsed += plan.usage.total_tokens;

    for (const [index, action] of plan.actions.entries()) {
      if (tokensUsed >= maxTokens) {
        // Graceful shutdown within budget
        return {
          status: "budget_exceeded",
          completed: index,
        };
      }
      const result = await step.run(`execute-${action.id}`, async () => {
        return await agent.execute(action);
      });
      tokensUsed += result.usage?.total_tokens ?? 0;
    }

    return { status: "completed", tokensUsed };
  }
);
```
When to Use Inngest
- Event-driven architectures
- Serverless deployments
- Want fully managed infrastructure
- TypeScript/Node.js primary stack
- Need fast iteration speed
step.waitForEvent is part of the Inngest TypeScript SDK v4+ [inngest-docs] — the match parameter is optional, timeout is required.
Takeaway: Inngest is the fastest path from prototype to durable, event-driven agent — at the cost of a managed-only deployment story and a smaller language footprint than Temporal.
Restate
Single-binary self-host, Rust runtime, low-latency journaling.
Core Concepts
```typescript
// @restatedev/restate-sdk 1.x
import * as restate from "@restatedev/restate-sdk";
import * as clients from "@restatedev/restate-sdk-clients";

// llm, agent, AgentRequest and ApprovalResult are assumed to be defined elsewhere.
const agentService = restate.service({
  name: "agent",
  handlers: {
    processRequest: async (ctx: restate.Context, request: AgentRequest) => {
      // Each ctx.run() is automatically checkpointed:
      // on replay, the journaled result is returned instead of re-running the closure
      const classification = await ctx.run("classify", async () => {
        return await llm.classify(request.message);
      });

      if (classification.needs_approval) {
        // Await human approval (durable promise)
        const approval = ctx.awakeable<ApprovalResult>();
        // This ID can be used by external system to complete the awakeable
        console.log(`Awaiting approval: ${approval.id}`);
        const decision = await approval.promise;
        if (!decision.approved) {
          return { status: "rejected" };
        }
      }

      const result = await ctx.run("execute", async () => {
        return await agent.execute(classification.intent, request);
      });
      return result;
    },
  },
});

// Complete awakeable from external system (e.g., webhook)
async function approveRequest(awakeableId: string, approved: boolean) {
  // Named ingress so it does not shadow the restate SDK import above.
  const ingress = clients.connect({ url: "http://localhost:8080" });
  await ingress.resolveAwakeable(awakeableId, { approved });
}
```
Virtual Objects for Stateful Agents
```typescript
const agentSession = restate.object({
  name: "agent-session",
  handlers: {
    // State is automatically persisted per session ID
    addMessage: async (ctx: restate.ObjectContext, message: Message) => {
      const history = (await ctx.get<Message[]>("history")) || [];
      history.push(message);
      ctx.set("history", history);

      const response = await ctx.run("generate", async () => {
        return await llm.chat(history);
      });

      history.push({ role: "assistant", content: response });
      ctx.set("history", history);
      return response;
    },
    getHistory: async (ctx: restate.ObjectContext) => {
      return (await ctx.get<Message[]>("history")) || [];
    },
  },
});
```
When to Use Restate
- Low-latency requirements
- Lightweight deployment (single binary)
- Strong consistency without heavy infrastructure
- TypeScript or JVM stack
- Want to self-host easily
The TypeScript SDK exposes ctx.run, ctx.awakeable<T>(), and restate.object virtual objects with per-key state [restate-docs]. Restate’s “deterministic helpers” — ctx.rand.uuidv4(), ctx.date.now() — give you replay-safe randomness and time without separate libraries.
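Why replay-safe randomness needs a framework helper at all can be shown with a small sketch. This is an illustration of the principle only, not Restate's implementation: `replay_safe_uuid`, the invocation ID, and the call index are hypothetical names introduced here.

```python
import random
import uuid


def replay_safe_uuid(invocation_id: str, call_index: int) -> str:
    """Derive a UUID from the invocation identity, so a replay reproduces it."""
    # Seeding from stable inputs makes the value a pure function of the invocation.
    rng = random.Random(f"{invocation_id}:{call_index}")
    return str(uuid.UUID(int=rng.getrandbits(128), version=4))


first_run = replay_safe_uuid("inv-123", 0)
replayed = replay_safe_uuid("inv-123", 0)   # same invocation, same call site
next_call = replay_safe_uuid("inv-123", 1)  # later call in the same invocation
```

A plain `uuid.uuid4()` inside handler code would return a different value on every replay, so anything keyed on it (an idempotency key, a record ID) would silently diverge from what the journal recorded.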
Takeaway: Restate is the lightest self-hostable option in this chapter; the price is a smaller ecosystem and a newer codebase. Use it when single-binary deployment matters more than vendor maturity.
Azure Durable Functions
Native Azure integration. Great for Azure-first shops [azure-durable-events].
Core Concepts
```python
import azure.functions as func
import azure.durable_functions as df

# llm and agent are assumed application-level clients defined elsewhere.


# Orchestrator function, registered via the df.Orchestrator.create wrapper
def orchestrator_function(context: df.DurableOrchestrationContext):
    request = context.get_input()
    # Each activity is checkpointed.
    classification = yield context.call_activity("classify_request", request)
    if classification["needs_approval"]:
        # Wait for external event (human approval). Returns the event payload.
        approval = yield context.wait_for_external_event("approval")
        if not approval["approved"]:
            return {"status": "rejected"}
    result = yield context.call_activity("execute_agent_action", classification)
    return result


main = df.Orchestrator.create(orchestrator_function)


# Activity functions: separate function entry points in the same Function App
def classify_request(request: dict) -> dict:
    return llm.classify(request["message"])


def execute_agent_action(classification: dict) -> dict:
    return agent.execute(classification["intent"])
```
Fan-out/Fan-in Pattern
```python
import azure.functions as func
import azure.durable_functions as df


def parallel_research(context: df.DurableOrchestrationContext):
    queries = context.get_input()["queries"]
    # Fan out: run research tasks in parallel
    tasks = [context.call_activity("research_query", q) for q in queries]
    # Fan in: wait for all to complete
    results = yield context.task_all(tasks)
    # Synthesize results
    synthesis = yield context.call_activity("synthesize_results", results)
    return synthesis


main = df.Orchestrator.create(parallel_research)
```
When to Use Azure Durable Functions
- Already on Azure
- .NET or Python primary stack
- Want serverless with durable state
- Need tight Azure service integration
- Cost optimization via consumption pricing
wait_for_external_event provides at-least-once delivery for human-approval events; include a unique event ID for manual deduplication, especially on the Azure Storage backend [azure-durable-events]. The Consumption Plan does not bill while an orchestrator is awaiting an external event.
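The manual deduplication mentioned above might look like the following. It is a minimal sketch under stated assumptions: the `event_id` field and the `ApprovalDeduplicator` class are hypothetical, and in a real orchestration the set of seen IDs has to live in durable state rather than in a Python object.

```python
from typing import Any, Dict, Optional, Set


class ApprovalDeduplicator:
    """Applies each approval event once, keyed on a caller-supplied event ID."""

    def __init__(self) -> None:
        self.seen_ids: Set[str] = set()
        self.decision: Optional[bool] = None

    def handle(self, event: Dict[str, Any]) -> bool:
        """Apply the event; return False when it is a redelivered duplicate."""
        if event["event_id"] in self.seen_ids:
            return False
        self.seen_ids.add(event["event_id"])
        self.decision = bool(event["approved"])
        return True


dedup = ApprovalDeduplicator()
first = dedup.handle({"event_id": "evt-1", "approved": True})
duplicate = dedup.handle({"event_id": "evt-1", "approved": True})  # redelivery
```

The sender chooses the event ID once per human decision, so a retry of the same delivery carries the same ID and is dropped.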
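Takeaway: Durable Functions is the right choice if you’re already paying for Azure. The Python SDK has two programming models: the function.json model shown here, registered with df.Orchestrator.create(fn), and the decorator-based model built on df.DFApp. Pick one per Function App; examples written for one do not run unchanged under the other.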
AWS Step Functions
Visual workflows, tight AWS integration. Best for teams who like declarative state machines [aws-stepfn-callback].
Core Concepts
```json
{
  "Comment": "Agent workflow with human approval",
  "StartAt": "ClassifyRequest",
  "States": {
    "ClassifyRequest": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123:function:classify",
      "Next": "NeedsApproval"
    },
    "NeedsApproval": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.needsApproval",
          "BooleanEquals": true,
          "Next": "WaitForApproval"
        }
      ],
      "Default": "ExecuteAction"
    },
    "WaitForApproval": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
      "Parameters": {
        "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123/approvals",
        "MessageBody": {
          "TaskToken.$": "$$.Task.Token",
          "Request.$": "$"
        }
      },
      "ResultPath": "$.approval",
      "Next": "CheckApproval"
    },
    "CheckApproval": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.approval.approved",
          "BooleanEquals": false,
          "Next": "Rejected"
        }
      ],
      "Default": "ExecuteAction"
    },
    "ExecuteAction": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123:function:execute",
      "End": true
    },
    "Rejected": {
      "Type": "Fail",
      "Error": "ApprovalRejected",
      "Cause": "Human rejected the action"
    }
  }
}
```
When to Use Step Functions
- Already on AWS
- Visual workflow design preferred
- Need built-in AWS service integrations
- Long-running workflows with wait states (Standard Workflows support up to one-year execution lifetimes [aws-stepfn-callback])
- Audit and compliance requirements
The .waitForTaskToken integration pattern plus the $$.Task.Token context-object accessor is the canonical human-in-the-loop mechanism: the external system completes the task via SendTaskSuccess / SendTaskFailure. Configure TimeoutSeconds to bound the overall wait, and HeartbeatSeconds when the external worker can call SendTaskHeartbeat, so the task cannot wait indefinitely.
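The external side of the callback can be sketched as follows. `send_task_success` and `send_task_failure` are the boto3 Step Functions client methods; everything else is an assumption for illustration: the helper names, the `ApprovalProcessingError` code, and `RecordingClient`, which stands in for `boto3.client("stepfunctions")` so the sketch runs without AWS access.

```python
import json
from typing import Any, List, Tuple


def complete_approval(sfn_client: Any, sqs_message_body: str, approved: bool) -> None:
    """Resume the paused execution with the human decision as task output."""
    body = json.loads(sqs_message_body)
    sfn_client.send_task_success(
        taskToken=body["TaskToken"],
        output=json.dumps({"approved": approved}),
    )


def fail_approval(sfn_client: Any, sqs_message_body: str, reason: str) -> None:
    """Report that the approval could not be processed at all."""
    body = json.loads(sqs_message_body)
    sfn_client.send_task_failure(
        taskToken=body["TaskToken"],
        error="ApprovalProcessingError",  # hypothetical error code
        cause=reason,
    )


class RecordingClient:
    """Stand-in for a real Step Functions client; records calls only."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, dict]] = []

    def send_task_success(self, **kwargs: Any) -> None:
        self.calls.append(("success", kwargs))

    def send_task_failure(self, **kwargs: Any) -> None:
        self.calls.append(("failure", kwargs))


client = RecordingClient()
message = json.dumps({"TaskToken": "token-abc", "Request": {"needsApproval": True}})
complete_approval(client, message, approved=False)
```

A human rejection is still reported with `send_task_success`, because the decision is data for the following Choice state; `send_task_failure` is reserved for cases where no decision could be obtained.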
Takeaway: Step Functions is the right answer when your stack is AWS-native and the diagram-as-workflow model is a feature, not a constraint.
Decision Framework
| If You Need… | Use |
|---|---|
| Complex multi-step workflows with strong guarantees | Temporal |
| Fast iteration, event-driven, serverless | Inngest |
| Low latency, lightweight, self-hosted | Restate |
| Azure-native, serverless, cost optimization | Azure Durable Functions |
| AWS-native, visual workflows, service integrations | AWS Step Functions |
| GCP-native, simple YAML workflows | GCP Cloud Workflows |
The “Build vs Buy” Decision Tree
- Are you writing retry logic with exponential backoff? → Consider a durable execution framework
- Are you implementing checkpointing to survive crashes? → Consider a durable execution framework
- Are you building idempotency key management? → Consider a durable execution framework
- Are you handling human-in-the-loop with long waits? → Consider a durable execution framework

If yes to 2+ of these, you're reinventing the wheel.
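The "2+ yes answers" rule is simple enough to write down as a checklist function. This is only a sketch of the rule as stated; the function name and the answer keys are hypothetical.

```python
from typing import Dict


def should_adopt_framework(answers: Dict[str, bool], threshold: int = 2) -> bool:
    """True when at least `threshold` of the build-vs-buy questions are answered yes."""
    return sum(answers.values()) >= threshold


answers = {
    "retry_with_backoff": True,
    "crash_checkpointing": True,
    "idempotency_key_management": False,
    "long_human_waits": False,
}
decision = should_adopt_framework(answers)
```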
Takeaway: the decision table is the chapter’s most-reused artefact. The “build vs buy” question almost always points to buy — unless your operational model can’t host the chosen framework’s deployment shape.
What to Copy, What to Skip
| Pattern | Copy when | Skip when |
|---|---|---|
| Temporal workflow-as-code | Multi-step workflows, replay-determinism matters, team can run Temporal Server or pay for Cloud | Single-step event handlers — overkill |
| Inngest step.waitForEvent | Event-driven serverless agents, TypeScript primary stack, want managed-everything | Need self-hosted control plane or non-TS languages outside their support list |
| Restate ctx.awakeable | Lightweight durable HITL, single-binary self-host, low-latency needs | Established ops team already on Temporal Cloud or AWS: adoption cost outweighs gain |
| Azure Durable Functions wait_for_external_event | Azure-native shop, want serverless billing on idle, Python or .NET primary | Multi-cloud architecture: couples you to Azure orchestration |
| AWS Step Functions .waitForTaskToken | AWS-native shop, visual state-machine design preferred, audit-trail requirements | Heavy compute logic: ASL is restrictive for arbitrary code; defer to Lambda |
| DIY checkpointer + retry loop | Truly esoteric workloads where no framework fits | Anything resembling Ch01–Ch02 patterns — you’re reinventing |
| Same idempotency-key derivation across the workflow [pa-idempotency] | Every framework — the key is yours, not the framework’s | Never skip — non-idempotent activities + at-least-once delivery = double-charged customers |
| Cross-framework portability layer | Multi-cloud or multi-framework teams — abstract the workflow API behind your own interface | Single-framework shops — premature abstraction |
Takeaway: every pattern has a “skip when”. Frameworks are not interchangeable; the wrong one is more painful than building from scratch was supposed to be.
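The idempotency-key row in the table is the one pattern that applies to every framework, so it is worth a concrete sketch. The key derivation follows the `user_id:booking_id:step` shape used in the Temporal example; `FakePaymentApi` is a hypothetical provider that deduplicates on the key, included only to show the effect of a redelivery.

```python
import hashlib
import json
from typing import Dict


def idempotency_key(user_id: str, booking_id: str, step: str) -> str:
    """Derive a stable key from workflow inputs only (no clocks, no randomness)."""
    canonical = json.dumps([user_id, booking_id, step], separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class FakePaymentApi:
    """Hypothetical provider that deduplicates on the idempotency key."""

    def __init__(self) -> None:
        self.charges: Dict[str, float] = {}

    def charge(self, amount: float, key: str) -> float:
        # A retried call with the same key returns the original charge.
        return self.charges.setdefault(key, amount)


api = FakePaymentApi()
key = idempotency_key("u-1", "b-42", "pay")
api.charge(120.0, key)
api.charge(120.0, key)  # simulated at-least-once redelivery
```

Because the key depends only on stable inputs, a retried activity recomputes the same key on any worker, and the provider sees one charge.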
Common Gotchas
| Gotcha | Symptom | Fix |
|---|---|---|
| Non-deterministic code inside workflow | Replay fails mysteriously; activities run twice or wrong values | Move all non-deterministic ops (HTTP, random, time) into activities (Temporal) or ctx.run blocks (Restate); use framework-provided deterministic helpers (ctx.rand, ctx.date) |
| At-least-once without idempotency keys | Customer charged 3x when the worker retries a side-effecting activity | Every external-side-effect activity takes an idempotency_key argument derived from stable workflow inputs (Ch01 pattern) |
| Indefinite waitForExternalEvent / waitForTaskToken | Workflow stuck for the full one-year service quota when the human never responds | Race a durable timer against the event; on timeout, escalate or reject |
| Forgetting HeartbeatSeconds on AWS callback tasks | Task waits up to one year even when the external system is dead | Set HeartbeatSeconds and call SendTaskHeartbeat periodically from the external worker |
| Azure Python programming models mixed (decorator-based snippets pasted into a function.json app, or the reverse) | Function App doesn’t recognise the orchestrator on deploy | Use one model per app: main = df.Orchestrator.create(orchestrator_function) with function.json bindings, or the decorator model on a df.DFApp |
| Mixing framework versions across services | Workflow replay corrupts when worker SDK version diverges from history schema | Pin SDK versions per service; gate workflow upgrades behind a version-aware routing strategy (Temporal workflow.patched() or Worker Versioning, Inngest function versions) |
Takeaway: every gotcha here is a non-determinism or version-skew bug in disguise. Frameworks reward strict discipline at workflow boundaries; they punish “just one quick time.time() call” mercilessly.
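The "race a timer against the event" fix has the same shape in every framework. The sketch below shows that shape with plain asyncio, which is an assumption made for illustration: an in-process timer is not durable, so in production the race has to use the framework's own durable timer and event primitives. `wait_for_decision` is a hypothetical helper.

```python
import asyncio
from typing import Tuple


async def wait_for_decision(decided: asyncio.Event, timeout_s: float) -> str:
    """Return 'decided' if the event fires in time, otherwise 'escalated'."""
    try:
        await asyncio.wait_for(decided.wait(), timeout=timeout_s)
        return "decided"
    except asyncio.TimeoutError:
        # The human never responded: take the fallback path instead of hanging.
        return "escalated"


async def demo() -> Tuple[str, str]:
    never_set = asyncio.Event()
    timed_out = await wait_for_decision(never_set, 0.01)

    already_set = asyncio.Event()
    already_set.set()
    in_time = await wait_for_decision(already_set, 0.01)
    return timed_out, in_time


outcome = asyncio.run(demo())
```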
Migration Path
If you have existing agent code, here’s how to migrate:
1. Identify Side Effects
```python
# Before: Side effects scattered in code
def process_request(request):
    classification = llm.classify(request)  # LLM call
    if classification.needs_action:
        result = api.execute(classification)  # External API
        email.send(request.user, result)  # Email
        return result
    return None
```
2. Extract as Activities
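```python
# After: Side effects are activities
from datetime import timedelta

from temporalio import activity, workflow

# Temporal requires a timeout on every activity call; one minute is a placeholder.
ACTIVITY_TIMEOUT = timedelta(minutes=1)


@activity.defn
async def classify(request):
    return llm.classify(request)


@activity.defn
async def execute_action(classification):
    return api.execute(classification)


@activity.defn
async def send_email(user, result):
    email.send(user, result)


@workflow.defn
class RequestWorkflow:
    @workflow.run
    async def run(self, request):
        classification = await workflow.execute_activity(
            classify, args=[request], start_to_close_timeout=ACTIVITY_TIMEOUT
        )
        if not classification.needs_action:
            return None
        result = await workflow.execute_activity(
            execute_action, args=[classification], start_to_close_timeout=ACTIVITY_TIMEOUT
        )
        await workflow.execute_activity(
            send_email, args=[request.user, result], start_to_close_timeout=ACTIVITY_TIMEOUT
        )
        return result
```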
Takeaway: migration is mechanical — name your side effects, wrap them as activities, compose them in a workflow. The patterns from Ch01 [pa-idempotency] (idempotency keys) and Ch02 [pa-state] (checkpoint timing) carry over unchanged; the framework handles the rest.
Closing
Durable execution productises everything in the prior chapters: retries become RetryPolicy, checkpoints become free at every step, human-in-the-loop becomes a signal, and at-least-once delivery + your idempotency key becomes effectively-once business effect. The five frameworks differ in operational shape, language ergonomics, and ecosystem maturity — not in what they solve.
Testable question: for your top three external API calls, can you name which framework primitive replaces the retry/checkpoint/idempotency code you’d otherwise hand-write? If not, the framework choice is theoretical.
→ Part 7: Security & Sandboxing — securing agents that execute code and call external APIs.
References
- [temporal-py-docs] Temporal. Python SDK: Failure Detection & Retries. docs.temporal.io/develop/python/failure-detection. Source for temporalio.common.RetryPolicy, non_retryable_error_types, and the at-least-once activity contract.
- [inngest-docs] Inngest. step.waitForEvent API reference. inngest.com/docs/reference/functions/step-wait-for-event. Source for the TypeScript SDK v4+ match + timeout parameters and the Promise<null | EventPayload> return type.
- [restate-docs] Restate. Journaling Results in TypeScript. docs.restate.dev/develop/ts/journaling-results. Source for ctx.run, deterministic helpers (ctx.rand, ctx.date), and the no-nesting constraint for context calls.
- [azure-durable-events] Microsoft Learn. Handle External Events in Durable Orchestrations, updated 2026-05-06. learn.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-external-events. Source for context.wait_for_external_event semantics, the df.Orchestrator.create wrapper, and the at-least-once delivery guarantee.
- [aws-stepfn-callback] AWS. Step Functions: Discover Service Integration Patterns. docs.aws.amazon.com/step-functions/latest/dg/connect-to-resource.html. Source for .waitForTaskToken, $$.Task.Token context-object access, SendTaskSuccess / SendTaskFailure, the HeartbeatSeconds timeout, and the one-year execution lifetime ceiling.
- [pa-idempotency] Production Agents. Part 1: Idempotency & Safe Retries. Source for the idempotency-key derivation that turns at-least-once delivery into effectively-once business effect.
- [pa-state] Production Agents. Part 2: State Persistence & Checkpointing. The checkpoint timing patterns that durable execution frameworks automate.