
Production Agents Series

State Persistence & Agent Memory - The Complete Domain

Deep dive into agent memory systems: working memory, episodic memory, semantic memory, checkpointing patterns, context management, and long-running workflow persistence

Prerequisite: This is Part 2 of the Production Agents Deep Dive series. Start with Part 0: Overview for context.

Why This Matters

Your agent is 45 minutes into a complex research task. User closes their browser. Server restarts. All progress lost.

Or worse: agent crashes mid-booking. User refreshes. Agent starts over. Now there’s an orphaned booking in your system.

The Core Problem (from Anthropic, November 2025):

“The core challenge of long-running agents is that they must work in discrete sessions, and each new session begins with no memory of what came before.”

Imagine a software project staffed by engineers working in shifts, where each new engineer arrives with no memory of what happened on the previous shift. That’s what happens when context windows fill up or processes crash.

What Goes Wrong Without This:

[Diagram: state persistence failure patterns]

Why Context Windows Aren’t Enough

Even with 200k token windows (Claude) or 1M tokens (Gemini):

  • Complex tasks overflow: software development, research, and financial modeling need more context than any window holds
  • Token costs scale linearly: at, say, $3 per million input tokens, re-sending a 150k-token context on every turn costs roughly $0.45 per turn
  • Latency increases: larger contexts mean slower inference
  • Attention degrades: models perform worse on very long contexts

Most production tasks require work across many sessions.


The Three Challenges

| Challenge        | Problem                          | Solution                           |
| ---------------- | -------------------------------- | ---------------------------------- |
| Persistence      | State lost on crash/restart      | Checkpoint to durable storage      |
| Recovery         | Don't know which steps completed | Track progress explicitly          |
| Context Bridging | New session lacks context        | Progress files, structured handoff |

Agent Memory Systems: The Complete Picture

State management is really about memory. Understanding the different types of memory helps you design robust agents.

The Memory Taxonomy

[Diagram: agent memory systems]

Memory Type Comparison

| Memory Type | Persistence           | Scope        | Storage                 | Retrieval         |
| ----------- | --------------------- | ------------ | ----------------------- | ----------------- |
| Working     | None (context window) | Current turn | LLM context             | Automatic         |
| Episodic    | Session               | Current task | Checkpointer (Postgres) | By thread_id      |
| Semantic    | Permanent             | All tasks    | Vector DB               | Similarity search |
| Procedural  | Permanent             | All tasks    | Prompts / fine-tuning   | Always loaded     |

How They Map to Implementation

class AgentMemory:
    def __init__(self):
        # Working Memory: Current context window
        self.working_memory = []  # Just conversation turns

        # Episodic Memory: Checkpointed session state
        self.episodic = PostgresSaver.from_conn_string(DB_URL)

        # Semantic Memory: Long-term learned knowledge
        self.semantic = VectorDB(embedding_model="text-embedding-3-small")

        # Procedural Memory: Baked into the system prompt
        self.procedural = load_system_prompt("agent_instructions.md")

    def process_turn(self, user_input, thread_id):
        # 1. Load episodic memory (session state)
        session_state = self.episodic.load(thread_id) or {}  # new threads start empty

        # 2. Query semantic memory (relevant long-term knowledge)
        relevant_knowledge = self.semantic.search(user_input, k=3)

        # 3. Build working memory (context for this turn)
        self.working_memory = [
            {"role": "system", "content": self.procedural},
            *session_state.get("conversation_history", []),
            # Inject retrieved knowledge as an extra system message
            # (chat APIs have no "context" role)
            {"role": "system", "content": format_knowledge(relevant_knowledge)},
            {"role": "user", "content": user_input}
        ]

        # 4. Get response
        response = llm.chat(self.working_memory)

        # 5. Update episodic memory
        session_state["conversation_history"].append(
            {"role": "user", "content": user_input}
        )
        session_state["conversation_history"].append(
            {"role": "assistant", "content": response}
        )
        self.episodic.save(thread_id, session_state)

        # 6. Optionally update semantic memory with learned insights
        if self.should_memorize(response):
            self.semantic.insert(extract_insight(response))

        return response

The Context Management Problem

The core tradeoff: More context = better understanding, but also:

  • Higher token costs
  • Increased latency
  • Attention degradation on very long contexts

The solution hierarchy:

[Diagram: context management strategies]
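
One concrete way to read that hierarchy: spend a fixed token budget in priority order, so higher-value tiers (system prompt, recent turns) are never evicted before lower-value ones (retrieved memories, summaries of old turns). A minimal sketch, assuming messages are role/content dicts; the budget figure and the chars/4 token estimate are illustrative stand-ins:

def build_context(system_prompt, recent_turns, retrieved, summaries,
                  budget=8_000):
    """Fill a token budget in priority order; cut lower tiers first."""
    def count_tokens(msg):
        return len(msg["content"]) // 4  # rough estimate; use a real tokenizer

    context, used = [], 0
    tiers = [
        [{"role": "system", "content": system_prompt}],  # always included first
        recent_turns,   # working memory: keep verbatim
        retrieved,      # semantic memory: only what's relevant now
        summaries,      # compacted episodic history
    ]
    for tier in tiers:
        for msg in tier:
            cost = count_tokens(msg)
            if used + cost > budget:
                return context  # budget exhausted; drop remaining tiers
            context.append(msg)
            used += cost
    return context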

Memory Flow Diagram

[Diagram: memory orchestration flow]

Common Memory Anti-patterns

| Anti-pattern          | Problem                                 | Fix                                                       |
| --------------------- | --------------------------------------- | --------------------------------------------------------- |
| Everything in context | Token explosion, attention degradation  | Use semantic memory for stable knowledge                  |
| No session continuity | Agent forgets mid-conversation          | Checkpoint episodic memory                                |
| Context as database   | Slow, expensive, fragile                | Store data externally, retrieve what's needed             |
| No memory pruning     | Unbounded growth                        | TTL on episodic, compaction on working (see sketch below) |
| Ignoring procedural   | Agent reinvents wheels                  | Bake patterns into system prompt                          |
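
The pruning fix from the table can be as simple as a scheduled delete. A minimal sketch against a SQLite checkpoint store, assuming a checkpoints table with a created_at epoch column (both names are illustrative, not a real checkpointer schema):

import sqlite3
import time

TTL_SECONDS = 7 * 24 * 3600  # assumption: episodic state expires after a week

def prune_episodic(db_path="checkpoints.db"):
    """Delete checkpoints past their TTL; run from a cron job or scheduler."""
    cutoff = time.time() - TTL_SECONDS
    with sqlite3.connect(db_path) as conn:
        conn.execute("DELETE FROM checkpoints WHERE created_at < ?", (cutoff,))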

Solution 1: LangGraph Checkpointers

LangGraph is one of the most widely used frameworks for agent state management. Here's how to use it in production.

Basic Setup

from langgraph.graph import StateGraph
from langgraph.checkpoint.postgres import PostgresSaver

# Production: PostgreSQL for durability
# (recent langgraph versions expose from_conn_string as a context
#  manager, used as `with PostgresSaver.from_conn_string(...) as cp:`,
#  and need a one-time cp.setup() to create the checkpoint tables)
checkpointer = PostgresSaver.from_conn_string(
    "postgresql://user:pass@host:5432/db"
)

# Create graph with checkpointing
graph = StateGraph(AgentState)
graph.add_node("think", think_node)
graph.add_node("act", act_node)
# ... add edges ...

app = graph.compile(checkpointer=checkpointer)

# Execute with thread_id for persistence
config = {"configurable": {"thread_id": "user-123-task-456"}}
result = app.invoke({"input": "Book flight to NYC"}, config)

# Later: resume from checkpoint
# Same thread_id = same state
result = app.invoke({"input": "Make it morning flight"}, config)

What StateSnapshot Captures

# Every checkpoint stores:
{
    "channel_values": {...},     # Current state data
    "next_nodes": ["act"],       # What to execute next
    "config": {...},             # Configuration
    "metadata": {
        "writes": {...},         # Recent modifications
        "step": 5                # Progress counter
    },
    "pending_tasks": [...]       # Incomplete work
}
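
To inspect these snapshots at runtime, a compiled LangGraph app exposes get_state and get_state_history; the actual StateSnapshot attributes are values, next, metadata, and tasks. A quick sketch, reusing app and config from the setup above:

# Latest checkpoint for this thread
snapshot = app.get_state(config)
print(snapshot.values)    # current state data
print(snapshot.next)      # nodes queued to execute next

# Walk back through earlier checkpoints (most recent first)
for past in app.get_state_history(config):
    print((past.metadata or {}).get("step"), past.next)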

Storage Options

| Storage       | Use Case    | Tradeoffs                        |
| ------------- | ----------- | -------------------------------- |
| MemorySaver   | Development | Fast, lost on restart            |
| SqliteSaver   | Single-node | Local persistence, limited scale |
| PostgresSaver | Production  | Multi-node, ACID guarantees      |
| S3            | Archival    | Long-term storage, slower access |

Production rule: Always use PostgresSaver (or equivalent) in production. MemorySaver is for local development only.
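
For reference, wiring up the development-tier savers looks like this (a sketch; SqliteSaver ships separately as langgraph-checkpoint-sqlite, and recent versions expose from_conn_string as a context manager):

from langgraph.checkpoint.memory import MemorySaver
from langgraph.checkpoint.sqlite import SqliteSaver

# Development: in-memory, gone on restart
app = graph.compile(checkpointer=MemorySaver())

# Single-node: persists to a local file
with SqliteSaver.from_conn_string("checkpoints.db") as checkpointer:
    app = graph.compile(checkpointer=checkpointer)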


Solution 2: Checkpoint Timing

This is where most teams get it wrong. The timing of checkpoints matters.

Wrong: Checkpoint After Execution

# WRONG: If crash happens between execute and checkpoint,
# you don't know if step ran
def execute_step(self, step):
    result = step.run()           # Execute
    self.state['completed'].append(step.id)
    self.checkpoint()             # Save state
    # ^ If crash happens before checkpoint, step ran but state doesn't show it
    return result

Right: Checkpoint Before AND After

def execute_step(self, step):
    # BEFORE: Mark intent (crash here = know step was attempted)
    self.state['in_progress'] = step.id
    self.checkpoint()

    # Execute
    result = step.run()

    # AFTER: Mark completion
    self.state['completed'].append(step.id)
    del self.state['in_progress']
    self.state['last_result'] = result
    self.checkpoint()

    return result

Resume Logic

def resume(self):
    state = self.load_checkpoint()

    if 'in_progress' in state:
        # Crashed during execution
        step_id = state['in_progress']

        # Check if step actually completed (idempotent read)
        if self.check_step_completed_externally(step_id):
            # Step ran, just didn't checkpoint
            state['completed'].append(step_id)
            del state['in_progress']
            self.checkpoint()
        else:
            # Step didn't complete — re-execute with idempotency key
            step = self.get_step(step_id)
            self.execute_step(step)

    # Continue from last known good state
    return state

Why this works: If you crash between the two checkpoints, the in_progress marker tells you exactly what was happening. You can check if it completed and act accordingly.
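
The resume path above leans on side effects being safe to retry. A minimal sketch of the idempotency-key idea for a side-effecting step (the booking endpoint and payload are illustrative assumptions; the Idempotency-Key header is a common API convention, e.g. Stripe's):

import uuid
import requests

def run_booking_step(self, step):
    # Mint the key once per step and checkpoint it BEFORE the call,
    # so a retry after a crash reuses the same key
    key = self.state.setdefault("idempotency_keys", {}).setdefault(
        step.id, str(uuid.uuid4())
    )
    self.checkpoint()

    # A retried POST with the same key cannot create a second booking
    return requests.post(
        "https://api.example.com/bookings",              # illustrative endpoint
        json={"flight": self.state["selected_flight"]},
        headers={"Idempotency-Key": key},
        timeout=30,
    )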


Solution 3: Progress Tracking Files (Anthropic Pattern)

For multi-session tasks, explicit progress files bridge context gaps.

The Two-Agent Pattern

# Initializer Agent (first run only)
def initialize_project(task):
    # Set up environment
    setup_environment()

    # Create progress file
    progress = {
        "goal": task.description,
        "completed_steps": [],
        "blockers": [],
        "next_action": "Analyze requirements",
        "context": {"files": [], "apis": []}
    }

    write_file("claude-progress.txt", format_progress(progress))
    git_commit("Initial project setup")

# Coding Agent (every session)
def continue_work():
    # Read progress from last session
    progress = read_file("claude-progress.txt")

    # Make incremental progress
    result = work_on_next_action(progress)

    # Update progress for next session
    progress["completed_steps"].append(result.action)
    progress["next_action"] = result.next_step

    write_file("claude-progress.txt", format_progress(progress))
    git_commit(f"Completed: {result.action}")

Progress File Structure

# Progress: Book Flight to NYC

## Current Goal

Book morning flight to NYC for tomorrow

## Completed Steps

1. [x] Parsed user intent: destination=NYC, date=tomorrow
2. [x] Inferred departure: SFO (from calendar)
3. [x] Searched flights: 47 options found
4. [x] User clarified: wants LaGuardia, not JFK
5. [x] Filtered to LGA: 18 options

## Current Blocker

8am flight sold out while user was deciding

## Next Action

Present 9am alternative ($12 more)

## Context

- User prefers aisle seats
- Corporate travel policy: max $500
- Departure: SFO
- Arrival: LGA

Why this works: New session reads progress file first. Immediate context on what’s done, what’s blocked, what’s next. No wasted tokens re-discovering state.
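
Reading the file back is equally simple. A minimal sketch that splits a claude-progress.txt-style file on its "## " headings (the parser is illustrative, not part of Anthropic's pattern):

def parse_progress(text):
    """Return {section name: body} for each '## ' section in the file."""
    sections, current = {}, None
    for line in text.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(body).strip() for name, body in sections.items()}

progress = parse_progress(read_file("claude-progress.txt"))
next_action = progress.get("Next Action", "")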


Solution 4: Hybrid Memory

For sophisticated agents, combine short-term checkpoints with long-term vector memory.

class HybridMemory:
    def __init__(self, checkpointer, vector_db):
        self.checkpointer = checkpointer  # Short-term
        self.vector_db = vector_db        # Long-term

    def save_session_state(self, thread_id, state):
        """Short-term: current conversation, active task"""
        self.checkpointer.save(thread_id, state)

    def save_insight(self, insight):
        """Long-term: learned patterns, preferences"""
        embedding = embed(insight)
        self.vector_db.insert(embedding, insight)

    def recall_relevant(self, query, k=5):
        """Retrieve relevant long-term memories"""
        return self.vector_db.search(embed(query), k=k)

    def load_context(self, thread_id, current_input):
        """Combine short-term state + relevant long-term memories"""
        state = self.checkpointer.load(thread_id)
        memories = self.recall_relevant(current_input)
        return {**state, "relevant_memories": memories}

When to Use Each

| Memory Type               | Use For                                               | Don't Use For                  |
| ------------------------- | ----------------------------------------------------- | ------------------------------ |
| Short-term (Checkpointer) | Current conversation, active task state               | Preferences learned months ago |
| Long-term (Vector DB)     | User preferences, learned patterns, domain knowledge  | Ephemeral conversation turns   |

Key insight: Query long-term memory as a tool (retrieve when needed), don’t jam everything into context.
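
One way to realize "memory as a tool" is to expose the vector search as a function the model may call. A sketch using a generic tool schema; the names and the .text attribute on results are assumptions:

RECALL_TOOL = {
    "name": "recall_memory",
    "description": "Search long-term memory for preferences or past decisions",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

def handle_tool_call(memory, call):
    """Dispatch the model's recall requests to HybridMemory."""
    if call.name == "recall_memory":
        hits = memory.recall_relevant(call.input["query"], k=5)
        return "\n".join(h.text for h in hits)  # assumes results carry .text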


Observation Masking

For software engineering agents, most of the tokens in a turn are observations (test output, file contents). This explodes context fast.

def compact_history(history):
    compacted = []
    for turn in history:
        if turn.type == "observation":
            # Compress verbose output
            compacted.append({
                "type": "observation_summary",
                "content": summarize(turn.content, max_tokens=100)
            })
        else:
            # Keep action/reasoning in full
            compacted.append(turn)
    return compacted

# Before: 50k tokens of test output
# After: 100 token summary of test results

Result: Targets the token-heavy part while preserving decision history.


Common Gotchas

| Gotcha                         | Symptom                         | Fix                                                            |
| ------------------------------ | ------------------------------- | -------------------------------------------------------------- |
| Checkpoint too large           | Save/load becomes bottleneck    | Prune old observations, limit history depth                    |
| Checkpoint corruption          | State lost or inconsistent      | Atomic writes, versioning, backup checkpoints                  |
| Session resume confusion       | Agent repeats completed tasks   | Explicit progress files, structured state schema               |
| No checkpoint before execution | Can't tell if step ran on crash | Checkpoint intent BEFORE execution                             |
| No atomic writes               | Partial checkpoint on crash     | Database transactions, write-ahead logging (see sketch below)  |
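
For file-based checkpoints, the atomic-write fix is a write-then-rename; databases give you the same guarantee through transactions. A minimal sketch:

import json
import os
import tempfile

def atomic_checkpoint(state, path="checkpoint.json"):
    """Write to a temp file, fsync, then rename. On POSIX the rename is
    atomic, so a crash leaves either the old or the new checkpoint,
    never a half-written one."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # atomic swap
    except BaseException:
        os.unlink(tmp)
        raise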

Multi-Agent State (Still Fragile)

2025 Reality Check (from research):

“Multi-agent systems are not yet capable of engaging in long-context, proactive discourse with significantly more reliability than a single agent.”

Why Multi-Agent State Is Hard:

  • Context fragmentation across agents
  • Synchronization overhead
  • Network latency disrupts state updates
  • Error compounding from fragmented information

Claude Code’s Solution: Single-threaded subtasking

  • Spawns subtasks but never runs parallel work
  • Main agent retains comprehensive context
  • Prevents error compounding from fragmented state
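
A minimal sketch of that single-threaded shape (not Claude Code's actual implementation; run_subagent and summarize are illustrative stand-ins):

def run_task(main_context, subtasks):
    """Run subtasks one at a time; the main agent keeps the full picture."""
    for sub in subtasks:
        # Each subtask gets a focused brief, not the whole context
        result = run_subagent(goal=sub, brief=summarize(main_context))
        # Results flow back into one authoritative context: no parallel
        # branches to reconcile, no fragmented state to compound errors
        main_context.append({"subtask": sub, "result": result})
    return main_context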

Recommendation: Start with single-agent, add multi-agent only when necessary.


The Checkpointing Checklist

Before deploying an agent with persistent state:

[Checklist: checkpointing deployment checklist]

Key Takeaways

  1. Context windows aren’t enough. Complex tasks require state that survives sessions.

  2. Checkpoint timing matters. Checkpoint BEFORE execution to know what was attempted. Checkpoint AFTER to know what succeeded.

  3. Progress files bridge sessions. New session reads progress first. No wasted tokens rediscovering state.

  4. Hybrid memory separates concerns. Short-term state in checkpointer. Long-term knowledge in vector DB.

  5. Multi-agent state is fragile. Start single-agent. Add complexity only when necessary.


Next Steps

State persists. But what happens when the agent needs human judgment?

Part 3: Human-in-the-Loop Patterns
