State Persistence & Agent Memory - The Complete Domain

Prerequisite: This is Part 2 of the Production Agents Deep Dive series. Start with Part 1: Idempotency & Safe Retries — checkpoint timing depends on the idempotency-key contract from that chapter.

Checkpointing: durable state turns crash-restart-from-zero into crash-resume-from-step

Same agent, same crash. Without checkpoints, the work is lost. With checkpoints written before AND after each step, the resume is mechanical.

Why This Matters

Your agent is 45 minutes into a research task. User closes their browser. Server restarts. All progress lost.

Or worse: agent crashes mid-booking. User refreshes. Agent starts over. Now there’s an orphaned booking in your system.

Anthropic’s Applied AI team named the underlying mechanic in their September 2025 context-engineering post [anthropic-context2025]:

“Long-horizon tasks require agents to maintain coherence, context, and goal-directed behavior over sequences of actions where the token count exceeds the LLM’s context window.”

Their proposed answer is structured note-taking — “a technique where the agent regularly writes notes persisted to memory outside of the context window” — so that, in their phrasing, “after context resets, the agent reads its own notes and continues multi-hour training sequences or dungeon explorations.” This chapter is the production version of that idea: checkpointers + progress files + hybrid memory, with the failure modes the framework docs gloss over.

What other content gets wrong: LangGraph tutorials show checkpointer=MemorySaver() and call it durable [langgraph-persistence]. They never show what happens when from_conn_string is used without the with-context-manager and you leak Postgres connections across requests. The Anthropic post names the mechanic but doesn’t pin a framework. This chapter bridges both — and shows why checkpoint timing (before and after execute) is what separates resumable agents from agents that resume into corrupt state.

Takeaway: state in memory only is state you don’t have. The checkpoint is the agent.

What Goes Wrong Without This:

STATE PERSISTENCE FAILURE PATTERNS

Symptom: Agent starts over after user closes browser.
Cause:   State lived only in memory. Browser close = process kill = state gone.
         No checkpointing to durable storage.

Symptom: Agent repeats completed tasks in new session.
Cause:   No explicit progress tracking. New session doesn't know what's done.
         Agent re-does work, wastes time and tokens.

Symptom: Can't debug failed agent runs from yesterday.
Cause:   State was ephemeral. Once process died, context was lost.
         No audit trail, no replay capability.

Takeaway: every symptom here is a missing durable write. The fix is the same shape: checkpoint after every state mutation, separate intent from completion, give the resume path enough to act on.

Why Context Windows Aren’t Enough

Even with the long-context models on offer in mid-2026 (200k–2M tokens depending on model — check each provider’s current model card for exact figures):

Complex tasks overflow: software development, research, financial modeling require more context than any window holds
Token costs scale linearly: keeping everything in context is the most expensive way to remember
Latency increases: larger context, slower inference, slower agents
Attention degrades: Chroma’s Context Rot (Hong, Troynikov, Huber, 2025) shows performance drops non-uniformly as input length grows — even on simple tasks [chroma-rot]

Most production tasks require work across many sessions. The window is the wrong unit.

Takeaway: bigger windows don’t fix persistence; they just delay the problem by one order of magnitude.

The Three Challenges

Challenge	Problem	Solution
Persistence	State lost on crash/restart	Checkpoint to durable storage (PostgreSQL, not in-memory)
Recovery	Don’t know what completed	Track progress explicitly with intent + completion markers
Context bridging	New session lacks context	Progress files, structured handoff, hybrid memory

Takeaway: durable, recoverable, bridgeable — three properties, three patterns. Lose any one and the agent is fragile.

Agent Memory Systems: The Complete Picture

State management is really about memory. Understanding the different types of memory helps you design robust agents.

The Memory Taxonomy

AGENT MEMORY SYSTEMS

┌──────────────────────────────────────────────────────────────────────┐
│                     AGENT MEMORY SYSTEMS                             │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  WORKING MEMORY (In-Context)                                         │
│  ├── Current conversation turns                                      │
│  ├── Active task state                                               │
│  ├── Immediate observations                                          │
│  └── Token-limited, ephemeral                                        │
│                                                                      │
│  EPISODIC MEMORY (Session State)                                     │
│  ├── Conversation history                                            │
│  ├── Actions taken and results                                       │
│  ├── Decisions made and why                                          │
│  └── Checkpointed, survives crashes                                  │
│                                                                      │
│  SEMANTIC MEMORY (Long-term Knowledge)                               │
│  ├── User preferences                                                │
│  ├── Learned patterns                                                │
│  ├── Domain knowledge                                                │
│  └── Vector DB, persists across sessions                             │
│                                                                      │
│  PROCEDURAL MEMORY (How-to Knowledge)                                │
│  ├── Tool usage patterns                                             │
│  ├── Workflow sequences                                              │
│  ├── Successful strategies                                           │
│  └── Embedded in prompts/fine-tuning                                 │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘

Memory Type Comparison

Memory Type	Persistence	Scope	Storage	Retrieval
Working	None (context window)	Current turn	LLM context	Automatic
Episodic	Session	Current task	Checkpointer (Postgres)	By thread_id
Semantic	Permanent	All tasks	Vector DB	Similarity search
Procedural	Permanent	All tasks	Prompts / Fine-tuning	Always loaded

How They Map to Implementation

This is a conceptual sketch showing how the four memory types compose — see Solution 1 below for the correct with-wired PostgresSaver setup. The .load(...) / .save(...) method names here are illustrative; the real BaseCheckpointSaver API is get_tuple(config) / put(config, ...) with thread_id nested in the config dict.

class AgentMemory:
    def __init__(self, checkpointer):
        # Working Memory: Current context window
        self.working_memory = []  # Just conversation turns

        # Episodic Memory: Pre-built checkpointer injected by DI
        self.episodic = checkpointer  # See Solution 1 for construction

        # Semantic Memory: Long-term learned knowledge
        self.semantic = VectorDB(embedding_model="text-embedding-3-small")

        # Procedural Memory: Baked into the system prompt
        self.procedural = load_system_prompt("agent_instructions.md")

    def process_turn(self, user_input, thread_id):
        # 1. Load episodic memory (session state); a new thread has no state yet
        session_state = self.episodic.load(thread_id) or {}
        session_state.setdefault("conversation_history", [])

        # 2. Query semantic memory (relevant long-term knowledge)
        relevant_knowledge = self.semantic.search(user_input, k=3)

        # 3. Build working memory (context for this turn)
        self.working_memory = [
            {"role": "system", "content": self.procedural},
            *session_state.get("conversation_history", []),
            {"role": "context", "content": format_knowledge(relevant_knowledge)},
            {"role": "user", "content": user_input}
        ]

        # 4. Get response
        response = llm.chat(self.working_memory)

        # 5. Update episodic memory
        session_state["conversation_history"].append(
            {"role": "user", "content": user_input}
        )
        session_state["conversation_history"].append(
            {"role": "assistant", "content": response}
        )
        self.episodic.save(thread_id, session_state)

        # 6. Optionally update semantic memory with learned insights
        if self.should_memorize(response):
            self.semantic.insert(extract_insight(response))

        return response

The Context Management Problem

The core tradeoff: More context = better understanding, but also:

Higher token costs
Increased latency
Attention degradation on very long contexts

The solution hierarchy:

CONTEXT MANAGEMENT STRATEGIES

STRATEGY 1: Keep it small (preferred)
└── Only put what's needed for THIS turn in context

STRATEGY 2: Summarize when growing
└── Compress old conversation turns
└── Keep recent turns verbatim

STRATEGY 3: Externalize to retrieval
└── Store knowledge in vector DB
└── Retrieve relevant chunks per turn

STRATEGY 4: Multi-session handoff
└── End session with progress file
└── New session starts fresh with progress context

Memory Flow Diagram

MEMORY ORCHESTRATION FLOW

User Request
   │
   ▼
┌─────────────────────────────────────────────────────────────┐
│                    MEMORY ORCHESTRATION                     │
│                                                             │
│   ┌──────────┐   ┌──────────┐   ┌──────────┐                │
│   │ WORKING  │   │ EPISODIC │   │ SEMANTIC │                │
│   │  MEMORY  │◄──│  MEMORY  │◄──│  MEMORY  │                │
│   │ (context)│   │ (session)│   │ (vector) │                │
│   └────┬─────┘   └────┬─────┘   └────┬─────┘                │
│        │              │              │                      │
│        │    ┌─────────▼─────────┐    │                      │
│        └───►│  BUILD CONTEXT    │◄───┘                      │
│             │  (select relevant │                           │
│             │   from each type) │                           │
│             └─────────┬─────────┘                           │
│                       │                                     │
│                       ▼                                     │
│              ┌────────────────┐                             │
│              │    LLM CALL    │                             │
│              └────────┬───────┘                             │
│                       │                                     │
│                       ▼                                     │
│              ┌────────────────┐                             │
│              │ UPDATE MEMORIES│                             │
│              │ - Episodic: +turn                            │
│              │ - Semantic: +insight                         │
│              └────────────────┘                             │
└─────────────────────────────────────────────────────────────┘

Common Memory Anti-patterns

Anti-pattern	Problem	Fix
Everything in context	Token explosion, attention degradation	Use semantic memory for stable knowledge
No session continuity	Agent forgets mid-conversation	Checkpoint episodic memory
Context as database	Slow, expensive, fragile	Store data externally, retrieve what’s needed
No memory pruning	Unbounded growth	TTL on episodic, compaction on working
Ignoring procedural	Agent reinvents wheels	Bake patterns into system prompt

Takeaway: working memory is the LLM’s RAM. Episodic memory is the session log. Semantic memory is the long-term knowledge base. Mixing them collapses both performance and clarity.

Solution 1: LangGraph Checkpointers

LangGraph is one of the most widely deployed frameworks for agent state management. Here’s how to use it in production [langgraph-persistence].

Basic Setup

PostgresSaver.from_conn_string returns a context manager (it manages the connection lifecycle), so the canonical wiring uses with:

from langgraph.graph import StateGraph
from langgraph.checkpoint.postgres import PostgresSaver

DB_URI = "postgres://user:pass@host:5432/db?sslmode=disable"

with PostgresSaver.from_conn_string(DB_URI) as checkpointer:
    # First-run only: provision the tables.
    checkpointer.setup()

    graph = StateGraph(AgentState)
    graph.add_node("think", think_node)
    graph.add_node("act", act_node)
    # ... add edges ...

    app = graph.compile(checkpointer=checkpointer)

    # Execute with thread_id for persistence
    config = {"configurable": {"thread_id": "user-123-task-456"}}
    result = app.invoke({"input": "Book flight to NYC"}, config)

    # Later: resume from checkpoint with the same thread_id
    result = app.invoke({"input": "Make it morning flight"}, config)

The from_conn_string factory is decorated with @contextmanager [langgraph-postgres-source], so calling it outside a with block gives you an unentered context manager rather than a saver, and entering it by hand without a matching exit leaks the underlying psycopg connection. For long-running servers, open a connection pool in your app’s startup/shutdown lifecycle and reuse one checkpointer across requests. checkpointer.setup() is idempotent but expensive: run it once in a migration step, not on every app boot.
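
The startup/shutdown wiring can be sketched with the standard library alone. This is an illustration of the lifecycle shape, not LangGraph code: FakeSaver and fake_from_conn_string are hypothetical stand-ins that only mimic a @contextmanager factory, and AppLifecycle is an assumed name for whatever hook your web framework provides.

```python
from contextlib import ExitStack, contextmanager
from typing import Iterator, Optional


class FakeSaver:
    """Stand-in for a checkpointer; tracks whether its connection is open."""

    def __init__(self) -> None:
        self.is_open = True


@contextmanager
def fake_from_conn_string(uri: str) -> Iterator[FakeSaver]:
    """Mimics the shape of a @contextmanager factory: open, yield, close."""
    saver = FakeSaver()
    try:
        yield saver
    finally:
        saver.is_open = False  # Connection released when the context exits


class AppLifecycle:
    """Enter the context manager once at startup, exit it once at shutdown."""

    def __init__(self, uri: str) -> None:
        self._uri = uri
        self._stack = ExitStack()
        self.checkpointer: Optional[FakeSaver] = None

    def startup(self) -> None:
        self.checkpointer = self._stack.enter_context(
            fake_from_conn_string(self._uri)
        )

    def shutdown(self) -> None:
        self._stack.close()  # Runs the factory's `finally` block


lifecycle = AppLifecycle("postgresql://example")
lifecycle.startup()
saver = lifecycle.checkpointer        # Reused by every request handler
open_during_requests = saver.is_open
lifecycle.shutdown()
```

ExitStack keeps the enter and exit paired even though they happen in different functions, which is the property a bare manual `__enter__()` call loses.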

What StateSnapshot Captures

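# app.get_state(config) returns a StateSnapshot. Its main fields
# (the exact set varies by LangGraph version):
{
    "values": {...},             # Current state data (channel values)
    "next": ("act",),            # Node(s) to execute next
    "config": {...},             # Configuration (thread_id, checkpoint_id)
    "metadata": {
        "writes": {...},         # Recent modifications
        "step": 5                # Progress counter
    },
    "created_at": "...",         # Timestamp of this checkpoint
    "parent_config": {...},      # Config of the previous checkpoint
    "tasks": (...)               # Incomplete work
}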

Storage Options

Storage	Use case	Tradeoffs
`MemorySaver` (alias of `InMemorySaver`)	Development	Fast, lost on restart, no concurrency
`SqliteSaver`	Single-node, single-process	Local persistence, fine for prototypes, doesn’t scale horizontally
`PostgresSaver`	Production	Multi-node, ACID, requires `with` context manager or pool
Custom (S3, Redis, your own)	Archival or specialised	Implement the `BaseCheckpointSaver` interface; straightforward if you need it

Production rule: use PostgresSaver (or a custom subclass of BaseCheckpointSaver backed by something durable). MemorySaver is for local development only — agents that “work in tests” but lose state in prod almost always trace back to it.

Takeaway: MemorySaver is the lint-in-the-codebase. PostgresSaver is the contract.

Solution 2: Checkpoint Timing

Most teams get checkpoint timing wrong.

Wrong: Checkpoint After Execution

# WRONG: If crash happens between execute and checkpoint,
# you don't know if step ran
def execute_step(self, step):
    result = step.run()           # Execute
    self.state['completed'].append(step.id)
    self.checkpoint()             # Save state
    # ^ If crash happens before checkpoint, step ran but state doesn't show it
    return result

Right: Checkpoint Before AND After

def execute_step(self, step):
    # BEFORE: Mark intent (crash here = know step was attempted)
    self.state['in_progress'] = step.id
    self.checkpoint()

    # Execute
    result = step.run()

    # AFTER: Mark completion
    self.state['completed'].append(step.id)
    del self.state['in_progress']
    self.state['last_result'] = result
    self.checkpoint()

    return result

Resume Logic

def resume(self):
    # Load into self.state so that checkpoint() persists what we repair here
    self.state = self.load_checkpoint()

    if 'in_progress' in self.state:
        # Crashed during execution
        step_id = self.state['in_progress']

        # Check if step actually completed (idempotent read)
        if self.check_step_completed_externally(step_id):
            # Step ran, just didn't checkpoint
            self.state['completed'].append(step_id)
            del self.state['in_progress']
            self.checkpoint()
        else:
            # Step didn't complete: re-execute with idempotency key
            step = self.get_step(step_id)
            self.execute_step(step)

    # Continue from last known good state
    return self.state

Why this works: If you crash between the two checkpoints, the in_progress marker tells you exactly what was happening. You can check if it completed and act accordingly. The “check if completed externally” path is where the Part 1 idempotency key pulls its weight [pa-idempotency]: the same key that prevents double-charging on retry also tells you whether a half-checkpointed step landed downstream.
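
The crash window can be reproduced end to end in a few lines. The following is a minimal sketch, assuming a JSON file as the checkpoint store and a Python set as a stand-in for the downstream system keyed by idempotency key; StepRunner, ledger, and book_then_crash are hypothetical names used only for this illustration.

```python
import json
import os
import tempfile


class StepRunner:
    """Minimal sketch: JSON-file checkpoints around each side-effecting step."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.state = {"completed": []}
        if os.path.exists(path):
            with open(path) as f:
                self.state = json.load(f)

    def checkpoint(self) -> None:
        # Write to a temp file, then atomically swap it into place.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)

    def execute_step(self, step_id, action) -> None:
        self.state["in_progress"] = step_id
        self.checkpoint()                       # BEFORE: record intent
        action()                                # Side effect happens here
        self.state["completed"].append(step_id)
        del self.state["in_progress"]
        self.checkpoint()                       # AFTER: record completion


ledger = set()  # Stand-in for the downstream system, keyed by idempotency key


def book_then_crash():
    ledger.add("book-flight")               # Side effect lands downstream...
    raise RuntimeError("simulated crash")   # ...but we die before the AFTER checkpoint


path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
runner = StepRunner(path)
runner.execute_step("search", lambda: ledger.add("search"))
try:
    runner.execute_step("book-flight", book_then_crash)
except RuntimeError:
    pass

# "Restart": a fresh runner sees only what was checkpointed.
resumed = StepRunner(path)
pending = resumed.state.get("in_progress")
if pending is not None and pending in ledger:
    # The step landed downstream; reconcile instead of re-running it.
    resumed.state["completed"].append(pending)
    del resumed.state["in_progress"]
    resumed.checkpoint()
```

Without the BEFORE checkpoint, the restarted runner would see only "search" as completed and have no hint that a booking might already exist downstream.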

Takeaway: checkpoint intent before execute. Checkpoint completion after. The interval between them is where crashes happen; that’s exactly what in_progress is for.

Solution 3: Progress Tracking Files (Anthropic Pattern)

For multi-session tasks, explicit progress files bridge context gaps. Anthropic calls this “structured note-taking” [anthropic-context2025]: agents write notes to durable files between sessions and reload them on session start.

The Two-Agent Pattern

# Initializer Agent (first run only)
def initialize_project(task):
    # Set up environment
    setup_environment()

    # Create progress file
    progress = {
        "goal": task.description,
        "completed_steps": [],
        "blockers": [],
        "next_action": "Analyze requirements",
        "context": {"files": [], "apis": []}
    }

    write_file("claude-progress.txt", format_progress(progress))
    git_commit("Initial project setup")

# Coding Agent (every session)
def continue_work():
    # Read and parse progress from last session
    # (parse_progress is the inverse of format_progress)
    progress = parse_progress(read_file("claude-progress.txt"))

    # Make incremental progress
    result = work_on_next_action(progress)

    # Update progress for next session
    progress["completed_steps"].append(result.action)
    progress["next_action"] = result.next_step

    write_file("claude-progress.txt", format_progress(progress))
    git_commit(f"Completed: {result.action}")

Progress File Structure

# Progress: Book Flight to NYC

## Current Goal

Book morning flight to NYC for tomorrow

## Completed Steps

1. [x] Parsed user intent: destination=NYC, date=tomorrow
2. [x] Inferred departure: SFO (from calendar)
3. [x] Searched flights: 47 options found
4. [x] User clarified: wants LaGuardia, not JFK
5. [x] Filtered to LGA: 18 options

## Current Blocker

8am flight sold out while user was deciding

## Next Action

Present 9am alternative ($12 more)

## Context

- User prefers aisle seats
- Corporate travel policy: max $500
- Departure: SFO
- Arrival: LGA

Why this works: New session reads progress file first. Immediate context on what’s done, what’s blocked, what’s next. No wasted tokens re-discovering state. Git-commit the file after every update — you get free version control on agent state plus a debuggable audit trail.
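
For the file to be useful to both the agent and a human reader, it has to round-trip: what one session writes, the next must parse back into the same structure. Here is a minimal sketch of that round trip for a subset of the sections shown above. The helper names format_progress and parse_progress follow the earlier snippet, but these bodies are assumptions, not Anthropic's implementation.

```python
import os
import tempfile


def format_progress(progress: dict) -> str:
    """Render the progress dict as a small Markdown-style document."""
    lines = [f"# Progress: {progress['goal']}", "", "## Completed Steps", ""]
    lines += [
        f"{i}. [x] {step}"
        for i, step in enumerate(progress["completed_steps"], 1)
    ]
    lines += ["", "## Next Action", "", progress["next_action"], ""]
    return "\n".join(lines)


def parse_progress(text: str) -> dict:
    """Inverse of format_progress: recover the dict from the rendered text."""
    progress = {"goal": "", "completed_steps": [], "next_action": ""}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# Progress:"):
            progress["goal"] = line[len("# Progress:"):].strip()
        elif line.startswith("## "):
            section = line[3:]
        elif not line:
            continue
        elif section == "Completed Steps" and "[x]" in line:
            progress["completed_steps"].append(line.split("[x]", 1)[1].strip())
        elif section == "Next Action":
            progress["next_action"] = line
    return progress


progress = {
    "goal": "Book Flight to NYC",
    "completed_steps": [
        "Searched flights: 47 options found",
        "Filtered to LGA: 18 options",
    ],
    "next_action": "Present 9am alternative ($12 more)",
}

path = os.path.join(tempfile.mkdtemp(), "claude-progress.txt")
with open(path, "w") as f:
    f.write(format_progress(progress))
with open(path) as f:
    restored = parse_progress(f.read())
```

A round-trip check like `restored == progress` is worth keeping in the agent's own test suite, since a formatter and parser that drift apart silently corrupt every later session.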

Takeaway: progress files are the boring, durable, version-controlled answer that beats clever in-context state every time.

Solution 4: Hybrid Memory

For sophisticated agents, combine short-term checkpoints with long-term vector memory.

class HybridMemory:
    def __init__(self, checkpointer, vector_db):
        self.checkpointer = checkpointer  # Short-term
        self.vector_db = vector_db        # Long-term

    def save_session_state(self, thread_id, state):
        """Short-term: current conversation, active task"""
        self.checkpointer.save(thread_id, state)

    def save_insight(self, insight):
        """Long-term: learned patterns, preferences"""
        embedding = embed(insight)
        self.vector_db.insert(embedding, insight)

    def recall_relevant(self, query, k=5):
        """Retrieve relevant long-term memories"""
        return self.vector_db.search(embed(query), k=k)

    def load_context(self, thread_id, current_input):
        """Combine short-term state + relevant long-term memories"""
        # load() is illustrative (see the note on AgentMemory); new threads have no state
        state = self.checkpointer.load(thread_id) or {}
        memories = self.recall_relevant(current_input)
        return {**state, "relevant_memories": memories}

When to Use Each

Memory Type	Use For	Don’t Use For
Short-term (Checkpointer)	Current conversation, active task state	Preferences learned months ago
Long-term (Vector DB)	User preferences, learned patterns, domain knowledge	Ephemeral conversation turns

Key insight: Query long-term memory as a tool (retrieve when needed), don’t jam everything into context.
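
To make "memory as a tool" concrete, here is a toy sketch in which the agent calls a recall function on demand instead of preloading every stored insight. Everything here is an assumption for illustration: the bag-of-words embed function stands in for a real embedding model, and LongTermMemory stands in for a real vector DB.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class LongTermMemory:
    """In-memory stand-in for a vector DB."""

    def __init__(self) -> None:
        self._items = []  # List of (vector, original text) pairs

    def insert(self, insight: str) -> None:
        self._items.append((embed(insight), insight))

    def search(self, query: str, k: int = 2) -> list:
        q = embed(query)
        scored = sorted(
            self._items, key=lambda item: cosine(q, item[0]), reverse=True
        )
        # Drop items with no overlap at all rather than padding to k
        return [text for vec, text in scored[:k] if cosine(q, vec) > 0]


memory = LongTermMemory()
memory.insert("user prefers aisle seats on flights")
memory.insert("corporate travel policy caps flights at $500")
memory.insert("user likes dark mode in the dashboard")


def recall_tool(query: str) -> list:
    """Tool the agent calls on demand, instead of preloading every memory."""
    return memory.search(query, k=2)


hits = recall_tool("which seats does the user prefer")
```

Only the matching memories enter the context for this turn; the travel-policy entry stays in storage until a query actually needs it.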

Takeaway: the checkpointer holds the session; the vector DB holds what you’ve learned. One “memory” blob hides both.

Observation Masking

For software engineering agents, most tokens in a turn are observation (test output, file contents). This explodes context fast — exactly the “context rot” failure Chroma’s research documented [chroma-rot].

def compact_history(history):
    compacted = []
    for turn in history:
        if turn.type == "observation":
            # Compress verbose output
            compacted.append({
                "type": "observation_summary",
                "content": summarize(turn.content, max_tokens=100)
            })
        else:
            # Keep action/reasoning in full
            compacted.append(turn)
    return compacted

# Before: 50k tokens of test output
# After: 100 token summary of test results

Result: Targets the token-heavy part while preserving decision history. In production, only compact observations above a token threshold (e.g. >2,000 tokens) — short observations are cheap to keep verbatim and aggressive lossy compression on small outputs throws away signal.
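
The threshold rule can be sketched as a small variant of compact_history. This is an illustration under stated assumptions: count_tokens is a crude whitespace estimate rather than a model tokenizer, summarize is a hypothetical stand-in that simply truncates, and the Turn dataclass is assumed for the example.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    type: str      # "action", "reasoning", or "observation"
    content: str


def count_tokens(text: str) -> int:
    """Crude whitespace token estimate; swap in your model's tokenizer."""
    return len(text.split())


def summarize(text: str, max_tokens: int) -> str:
    """Hypothetical stand-in for an LLM summary: keep the first max_tokens words."""
    return " ".join(text.split()[:max_tokens])


def compact_history(history, threshold=2000, max_tokens=100):
    compacted = []
    for turn in history:
        if turn.type == "observation" and count_tokens(turn.content) > threshold:
            # Only large observations pay the lossy-compression cost
            compacted.append(
                Turn("observation_summary", summarize(turn.content, max_tokens))
            )
        else:
            # Actions, reasoning, and short observations stay verbatim
            compacted.append(turn)
    return compacted


history = [
    Turn("action", "run pytest"),
    Turn("observation", "PASSED " * 5000),
    Turn("observation", "2 files changed"),
]
compacted = compact_history(history)
```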

Takeaway: actions stay verbatim, observations get summarised above a threshold. You’re preserving the reasoning trace, not the raw output it consumed.

What to Copy, What to Skip

Pattern	Copy when	Skip when
`PostgresSaver` + `with from_conn_string`	any multi-process production agent. Default.	local prototypes — use `MemorySaver` and add `PostgresSaver` before deploy
`MemorySaver` / `InMemorySaver`	unit tests, prototypes, single-process scripts	anything that survives a restart
Checkpoint before + after execute	every step with side effects	pure-compute steps where re-running is free
Anthropic-style progress file (`claude-progress.txt`)	multi-session tasks where a human (or another agent) needs to pick up where you left off	one-shot tasks that fit in a single LLM call
Git-commit after every progress update	engineering / coding agents — gives audit trail and rollback	telemetry-heavy agents where commit-noise drowns signal
Hybrid memory (checkpointer + vector DB)	agents that need both session continuity and long-term preferences	single-session conversational bots — overkill
Observation masking / summary	code-execution and file-reading agents where observations are 80%+ of tokens	reasoning-heavy agents where every word matters
Single-orchestrator with subagent fan-out	most production cases — Claude Code’s pattern (orchestrator single-threaded, subagents parallel, returns only via final messages)	distinct domains that genuinely need shared mutable state across agents (rare; rethink the design)

Takeaway: every endorsed pattern has a “skip when” column. Read it before copying.

Common Gotchas

Gotcha	Symptom	Fix
Checkpoint too large	Save/load becomes the bottleneck	Prune old observations, limit history depth, mask non-essential payloads
Checkpoint corruption	State lost or inconsistent on resume	Atomic writes via DB transactions, schema-versioning, backup snapshots
Session resume confusion	Agent repeats completed tasks	Explicit progress files, structured state schema, intent + completion markers
No checkpoint before execution	Can’t tell if a step ran on crash	Checkpoint intent BEFORE execution; only the `in_progress` marker tells you what was happening
No atomic writes	Partial checkpoint on crash	Database transactions, write-ahead logging, never mutate state files in place
`from_conn_string` outside `with`	Postgres connection leaks across requests, eventual `too many clients`	Use the context manager, or pool the connection in your app’s startup/shutdown lifecycle
Putting tool outputs in checkpoint	Each step’s `state` grows by megabytes; save/load slow	Mask observations to summaries before persistence (see Observation Masking)

Takeaway: every gotcha here has been observed in production. Treat the table as a pre-merge review checklist for any agent that touches compile(checkpointer=...).
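
Two of the fixes in the table, atomic writes and schema versioning, fit in one small file-based sketch. It assumes a JSON envelope with a schema_version field; SCHEMA_VERSION, save_checkpoint, and load_checkpoint are hypothetical names, and a database-backed checkpointer would get the same guarantees from transactions instead.

```python
import json
import os
import tempfile

SCHEMA_VERSION = 2  # Hypothetical; bump whenever the state shape changes


def save_checkpoint(path: str, state: dict) -> None:
    """Write to a temp file in the same directory, fsync, then atomically rename."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump({"schema_version": SCHEMA_VERSION, "state": state}, f)
            f.flush()
            os.fsync(f.fileno())
        # Readers see the old file or the new one, never a partial write
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path: str) -> dict:
    with open(path) as f:
        envelope = json.load(f)
    if envelope.get("schema_version") != SCHEMA_VERSION:
        raise ValueError(
            f"unsupported checkpoint schema: {envelope.get('schema_version')}"
        )
    return envelope["state"]


workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "state.json")
save_checkpoint(path, {"completed": ["search"]})
loaded = load_checkpoint(path)

# A checkpoint written by an older release is rejected rather than misread.
old_path = os.path.join(workdir, "old.json")
with open(old_path, "w") as f:
    json.dump({"schema_version": 1, "state": {}}, f)
try:
    load_checkpoint(old_path)
    rejected = False
except ValueError:
    rejected = True
```

The temp file lives in the same directory as the target because os.replace is only atomic within a single filesystem.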

Multi-Agent State (Still Fragile)

Multi-agent systems share state across processes — and that’s where most production bugs concentrate. The single-agent-with-subtasks pattern is still the most reliable in production.

Why multi-agent state is hard:

Context fragmentation across agents
Synchronization overhead
Network latency disrupts state updates
Error compounding from fragmented information

Claude Code’s approach: a single-threaded orchestrator loop that can fan out to parallel subagents but keeps comprehensive context in the parent. The orchestrator never multitasks on its own thread; parallel work happens via isolated subagent contexts whose only return is a final message. This is why most state-sharing pathologies don’t reach the parent — the subagents can’t write back into its context except through their return value.
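
The isolation property described above can be sketched with a thread pool. This is an illustration of the pattern only, not Claude Code's implementation: run_subagent and orchestrate are hypothetical names, and each "subagent" is a plain function standing in for a model-driven loop.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subagent(task: str) -> str:
    """Each subagent builds its own private context and returns one final message."""
    private_context = [f"task: {task}"]          # Never shared with the parent
    private_context.append(f"notes on {task}")   # Intermediate work stays local
    return f"done: {task} ({len(private_context)} context items used)"


def orchestrate(goal: str, subtasks: list) -> list:
    parent_context = [f"goal: {goal}"]
    with ThreadPoolExecutor(max_workers=4) as pool:
        # map() returns results in input order, regardless of completion order
        final_messages = list(pool.map(run_subagent, subtasks))
    # Only the orchestrator thread ever mutates the parent's state
    parent_context.extend(final_messages)
    return parent_context


context = orchestrate("audit repo", ["scan deps", "run tests", "check licenses"])
```

Because the subagents receive no reference to parent_context, there is no shared mutable state to synchronize; the return value is the entire interface.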

Recommendation: start with single-agent. Add multi-agent only when you have a domain reason — distinct context boundaries, isolation requirements, or independent failure domains — not because parallelism sounds nice.

Takeaway: multi-agent is a state-coordination problem dressed up as a concurrency feature. Default to single-agent + subtasks until you can name the boundary that forces otherwise.

The Checkpointing Checklist

Before deploying an agent with persistent state:

CHECKPOINTING DEPLOYMENT CHECKLIST

CHECKPOINT STORAGE
[ ] Using PostgreSQL (not MemorySaver) in production
[ ] Connection pooling configured
[ ] Backup strategy defined
[ ] TTL on old checkpoints to prevent unbounded growth

CHECKPOINT TIMING
[ ] Checkpoint BEFORE execution (mark intent)
[ ] Checkpoint AFTER execution (mark completion)
[ ] Resume logic handles in_progress state
[ ] Idempotent external checks for crash recovery

PROGRESS TRACKING
[ ] Explicit progress file for multi-session tasks
[ ] Git commits after significant progress (audit trail)
[ ] Clear next_action for new sessions

CONTEXT MANAGEMENT
[ ] Observation masking for verbose outputs
[ ] History pruning strategy
[ ] Long-term memory separate from session state

Takeaway: ship none of these and the first server restart loses every in-flight agent task. Ship all four and the restart becomes a non-event.
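
The TTL item is the one most often left as a to-do, so here is a minimal sketch of a pruning job using sqlite3. The agent_checkpoints table and its columns are hypothetical and do not reflect LangGraph's schema; the point is the rule that a thread's latest checkpoint survives no matter how old it is.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE agent_checkpoints "
    "(thread_id TEXT, step INTEGER, created_at REAL, state TEXT)"
)


def prune_checkpoints(conn, ttl_seconds, now=None):
    """Delete checkpoints older than the TTL, but always keep each thread's latest."""
    now = time.time() if now is None else now
    cutoff = now - ttl_seconds
    with conn:  # Commits on success, rolls back on error
        cur = conn.execute(
            """
            DELETE FROM agent_checkpoints
            WHERE created_at < ?
              AND step < (SELECT MAX(step) FROM agent_checkpoints AS latest
                          WHERE latest.thread_id = agent_checkpoints.thread_id)
            """,
            (cutoff,),
        )
    return cur.rowcount


conn.executemany(
    "INSERT INTO agent_checkpoints VALUES (?, ?, ?, ?)",
    [
        ("a", 1, 0.0, "{}"),
        ("a", 2, 100.0, "{}"),
        ("a", 3, 900.0, "{}"),
        ("b", 1, 0.0, "{}"),   # Old, but the only checkpoint for thread b
    ],
)
deleted = prune_checkpoints(conn, ttl_seconds=500, now=1000.0)
remaining = conn.execute(
    "SELECT thread_id, step FROM agent_checkpoints ORDER BY thread_id, step"
).fetchall()
```

Keeping the latest checkpoint per thread means a long-idle task is still resumable after pruning; only its history is dropped.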

Closing

State now survives crashes, sessions, and humans. Two things still go wrong: the agent burns through your budget mid-loop, and a human needs to step in for the calls the agent can’t make. Both are next.

Testable question: kill your agent’s process with SIGKILL mid-task. Restart it. Does it resume from the last completed step, or start over? If it starts over, the checkpoint is theatre.

→ Part 3: Human-in-the-Loop Patterns — when the agent should escalate.

References

[anthropic-context2025] Anthropic Applied AI (Prithvi Rajasekaran, Ethan Dixon, Carly Ryan, Jeremy Hadfield) — Effective Context Engineering for AI Agents, 2025-09-29. anthropic.com/engineering/effective-context-engineering-for-ai-agents. Source for structured-note-taking definition, compaction definition, long-horizon task framing, and the “agent reads its own notes after context resets” pattern.
[langgraph-persistence] LangChain — LangGraph Persistence: Checkpoints & Threads. langchain-ai.github.io/langgraph/concepts/persistence. Source for PostgresSaver.from_conn_string context-manager API, thread_id config pattern, and StateSnapshot field semantics. MemorySaver is the public alias for InMemorySaver in langgraph.checkpoint.memory.
[langgraph-postgres-source] LangChain — PostgresSaver source. github.com/langchain-ai/langgraph/blob/main/libs/checkpoint-postgres/langgraph/checkpoint/postgres/__init__.py. Source for from_conn_string signature and context-manager wiring.
[chroma-rot] Kelly Hong, Anton Troynikov, Jeff Huber — Context Rot: How Increasing Input Tokens Impacts LLM Performance. Chroma Technical Report, 2025-07-14. trychroma.com/research/context-rot. Source for the attention-degradation-at-long-context finding (performance degrades non-uniformly as input length grows, even on simple tasks).
[pa-idempotency] Production Agents — Part 1: Idempotency & Safe Retries. Cross-reference for the idempotency_key contract that resume-after-crash logic relies on.