Connected Patterns: Understanding Agents Through State That Survives Failure
“A long-running agent without checkpoints is a short-running agent in disguise.”
An agent that runs for five minutes can afford to be careless. If it crashes, you rerun it.
An agent that runs for five hours cannot.
Long tasks fail for normal reasons:
- Network blips.
- API timeouts.
- Rate limits.
- Process restarts.
- Model server hiccups.
- A human approval that arrives later than expected.
- A tool that returns a partial response.
- A machine that gets patched and rebooted.
If your agent loses its place every time any of that happens, it will never be trusted with real work. People will not hand important tasks to something that collapses the moment the world behaves like the world.
Checkpoints and resumability are how you turn an agent run into an operable pipeline.
What resumability really means
Resumability is not “save the chat.”
Resumability means:
- The agent can stop at any step and later continue without losing commitments, constraints, or evidence.
- The agent can replay tool calls safely without duplicating side effects.
- The agent can justify what it has already done and what remains.
- The agent can survive restarts without drifting into a different task.
A checkpoint is a promise: when you restart, you get the same run back, not a new run with the same name.
The checkpoint inside the story of production
Resumability is a reliability primitive. It turns unpredictable environments into bounded progress.
| Failure | Without checkpoints | With checkpoints |
|---|---|---|
| Process restart | Full restart and repeated work | Resume from last safe boundary |
| Tool timeout | Agent loops or guesses | Retry safely with recorded context |
| Human delay | Agent idles or forgets | Pause, persist, and continue later |
| Long tasks | Memory grows until it breaks | Compact state and write snapshots |
| Side effects | Duplicate actions on retry | Idempotent execution tied to state |
A resumable agent is not more intelligent. It is more durable.
What belongs in checkpointed state
A checkpoint should be structured. It should be small enough to store often and precise enough to restore deterministically.
A strong checkpoint state usually includes:
Goal and success criteria
- The target outcome.
- The definition of “done.”
- Stop rules and acceptable failure modes.
Constraints and commitments
- Permissions and boundaries.
- Risk tier and approval requirements.
- Decisions already made that must not be revisited unless explicitly reopened.
Plan and progress
- Current plan items.
- Completed items with timestamps.
- Remaining items with dependencies.
Working memory
- Key facts discovered.
- Open questions and blocked items.
- Task-specific vocabulary and entity IDs.
Evidence bundle pointers
- Source IDs, hashes, timestamps.
- Tool outputs referenced by later steps.
- Citations or excerpts used in claims.
Budgets and counters
- Tokens used.
- Tool calls used.
- Retry counts per tool and per step.
Audit trail pointers
- Link to the event log.
- Approval tokens and reviewer decisions.
The checkpoint is not a transcript. It is a state machine snapshot.
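A minimal sketch of such a snapshot in Python, assuming a JSON-serialized dataclass; all field names here are illustrative, not a standard:

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class Checkpoint:
    # Goal and success criteria
    goal: str
    done_definition: str
    schema_version: int = 1
    # Constraints and commitments
    constraints: list = field(default_factory=list)
    commitments: list = field(default_factory=list)
    # Plan and progress
    completed: list = field(default_factory=list)
    remaining: list = field(default_factory=list)
    # Working memory and evidence bundle pointers (IDs and hashes, not blobs)
    facts: dict = field(default_factory=dict)
    evidence_ids: list = field(default_factory=list)
    # Budgets and counters
    tokens_used: int = 0
    tool_calls: int = 0

def serialize(cp: Checkpoint) -> str:
    # JSON keeps the snapshot small, inspectable, and diffable.
    return json.dumps(asdict(cp), indent=2)
```

Note what is absent: no chat transcript, no raw tool outputs. Evidence lives elsewhere; the snapshot only points to it.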
Snapshot versus event sourcing
There are two common approaches to resumable state.
Snapshot-first
- You store the full structured state at a checkpoint boundary.
- You restore the latest snapshot and continue.
Event-sourced
- You store a stream of events and rebuild state by replaying events.
- You can reconstruct any point in time.
Many teams end up with a hybrid:
- Events for the detailed audit trail.
- Snapshots for fast recovery.
The key is consistency. Whichever method you use, it must restore a coherent state that does not change meaning under replay.
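A sketch of the hybrid recovery path, assuming dict-shaped state and a small illustrative event vocabulary:

```python
import json

def rebuild_state(snapshot: dict, events_after_snapshot: list) -> dict:
    # Hybrid recovery: start from the latest snapshot for speed,
    # then replay only the events recorded after it for completeness.
    state = json.loads(json.dumps(snapshot))  # deep copy; replay must not mutate the stored snapshot
    for ev in events_after_snapshot:
        if ev["type"] == "step_completed":
            state.setdefault("completed", []).append(ev["step"])
        elif ev["type"] == "fact_recorded":
            state.setdefault("facts", {})[ev["key"]] = ev["value"]
        # Unknown event types are silently skipped here; a production
        # replayer should fail loudly instead, so replay cannot
        # silently change the meaning of the state.
    return state
```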
Checkpoint boundaries that prevent corruption
A checkpoint should only be written at safe boundaries.
Safe boundaries are points where:
- A step finished.
- All tool calls in that step completed or failed decisively.
- Side effects are either committed with an idempotency key or not attempted at all.
- The agent has a clear next step.
Unsafe boundaries are points where:
- A tool call is in flight.
- A side effect is partially applied.
- The agent has an ambiguous plan.
- The agent’s “next action” depends on transient context that is not stored.
If you checkpoint at unsafe boundaries, you will resume into contradictions.
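One cheap defense is a boundary guard in front of the snapshot writer. A sketch, assuming state flags with these hypothetical names:

```python
def at_safe_boundary(state: dict) -> bool:
    # Only checkpoint when no tool call is in flight, no side effect is
    # half-applied, and the next action is concrete and stored.
    return (
        not state.get("inflight_tool_calls")
        and not state.get("pending_side_effects")
        and bool(state.get("next_action"))
    )

def maybe_checkpoint(state: dict, write_snapshot) -> bool:
    if not at_safe_boundary(state):
        return False  # wait for the boundary rather than snapshot a contradiction
    write_snapshot(state)
    return True
```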
Idempotency is part of resumability
If an agent can create side effects, resumability must protect against duplication.
That means:
- Every side effect is tied to an idempotency key.
- The idempotency key is stored in state before the side effect is attempted.
- The tool is called with that key.
- The result is recorded so that retries can detect “already done.”
A useful mental model is:
- The checkpoint defines what the agent intends to have happened.
- The idempotency system ensures that intent is safe to replay.
Without idempotency, “resume” becomes “repeat.”
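A sketch of a side-effect ledger keyed this way, using an in-memory dict where a real system would use durable storage:

```python
import uuid

class IdempotentLedger:
    """Records side effects by key so a retry can detect 'already done'."""

    def __init__(self):
        self._results = {}

    def new_key(self) -> str:
        # Persist this key into checkpoint state BEFORE attempting the effect.
        return str(uuid.uuid4())

    def execute(self, key: str, side_effect):
        if key in self._results:       # replay: return the recorded result,
            return self._results[key]  # do not repeat the effect
        result = side_effect()
        self._results[key] = result    # record so future retries short-circuit
        return result
```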
A practical resumability protocol
A resumable run often follows a simple protocol.
- Start run and write an initial checkpoint with goals and constraints.
- For each step:
- Write step_started event.
- Execute tool calls.
- Validate outputs.
- Update structured state.
- Write checkpoint_written event and store snapshot.
- If a high-risk action is required:
- Request approval.
- Persist the plan and the approval request.
- Pause.
- On approval, resume and execute the approved step using idempotency keys.
- On failure:
- Record failure event.
- Retry within caps.
- If still failing, pause with a clear stop reason and preserved state.
This protocol feels boring, which is a compliment. Boring systems are the ones that run.
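The per-step portion of that protocol can be sketched as a single function. Everything here is illustrative: the event shapes, the retry cap, and the `execute_tool`, `write_event`, and `write_snapshot` callables are stand-ins for whatever your stack provides:

```python
def run_step(state, step, execute_tool, write_event, write_snapshot, retry_cap=3):
    # One protocol iteration: record the start, execute with capped retries,
    # update structured state, then checkpoint at the safe boundary.
    write_event({"type": "step_started", "step": step["name"]})
    succeeded = False
    for attempt in range(1, retry_cap + 1):
        try:
            output = execute_tool(step)
            succeeded = True
            break
        except Exception as exc:
            write_event({"type": "step_failed", "step": step["name"],
                         "attempt": attempt, "error": str(exc)})
    if succeeded:
        state["completed"].append(step["name"])
        state["status"] = "running"
    else:
        # Still failing: pause with a clear stop reason and preserved state.
        state["status"] = "paused"
        state["stop_reason"] = f"{step['name']} failed {retry_cap} times"
    write_event({"type": "checkpoint_written", "step": step["name"]})
    write_snapshot(dict(state))
    return state
```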
Schema versioning matters more than you think
Your checkpoint state is a contract between today and tomorrow.
When your state schema changes, you must handle:
Migrations
- Old snapshots must be upgraded to new schema versions.
Compatibility
- A newer agent should be able to read older state or refuse clearly.
Validation
- State should be validated on restore, not assumed correct.
If you skip this, you will eventually create a run you cannot resume because you changed a field name.
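A sketch of version-aware restore, assuming integer schema versions and one migration function per version step; the added `retry_counts` field is a made-up example of a v2 change:

```python
CURRENT_VERSION = 2

# Each migration upgrades a snapshot by exactly one version.
MIGRATIONS = {
    1: lambda s: {**s, "retry_counts": {}, "schema_version": 2},
}

def restore(snapshot: dict) -> dict:
    version = snapshot.get("schema_version", 1)
    if version > CURRENT_VERSION:
        # A newer agent wrote this state: refuse clearly rather than guess.
        raise ValueError(
            f"snapshot is schema v{version}; this agent reads up to v{CURRENT_VERSION}"
        )
    while version < CURRENT_VERSION:
        snapshot = MIGRATIONS[version](snapshot)
        version = snapshot["schema_version"]
    # Validate on restore, never assume correctness.
    for required in ("goal", "schema_version"):
        if required not in snapshot:
            raise ValueError(f"restored state missing required field: {required}")
    return snapshot
```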
Resuming without drift
One subtle failure is drift after resume.
The run resumes, but the agent reinterprets the goal and chooses a different path.
To prevent this, store:
- A concise goal statement and “what not to do.”
- A list of commitments already made.
- A list of assumptions that were accepted.
- A “next action” pointer that is concrete.
A strong resume prompt is not a long chat history. It is:
- Here is the run.
- Here is where we are.
- Here is the next step.
- Here are the constraints we must not violate.
That clarity is what keeps the agent from treating resume as a new conversation.
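A resume prompt built that way can be sketched as a pure function of structured state; the field names are illustrative:

```python
def build_resume_prompt(state: dict) -> str:
    # Reconstruct intent from structured state, not from chat history.
    lines = [
        f"Run goal: {state['goal']}",
        f"Do not: {state['do_not']}",
        f"Completed: {', '.join(state['completed']) or 'nothing yet'}",
        f"Next step: {state['next_action']}",
        "Constraints that must not be violated:",
    ]
    lines += [f"- {c}" for c in state["constraints"]]
    return "\n".join(lines)
```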
Checkpoints and context compaction work together
As runs get longer, you cannot keep everything in working context. Checkpointing lets you compact context without losing meaning.
A useful pattern is:
- Store full evidence and logs outside the model context.
- Store a compact state snapshot inside the checkpoint.
- On resume, load only the snapshot plus the minimal evidence needed for the next step.
This is how you keep the agent stable without bloating context.
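A sketch of that selective load, assuming the snapshot carries a hypothetical `next_step_evidence` list of evidence IDs:

```python
def context_for_resume(snapshot: dict, evidence_store: dict) -> dict:
    # Full evidence and logs stay outside the model context; load only the
    # compact snapshot plus the evidence the next step actually needs.
    needed = snapshot.get("next_step_evidence", [])
    return {
        "state": snapshot,
        "evidence": {eid: evidence_store[eid] for eid in needed if eid in evidence_store},
    }
```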
A state table you can use
| State component | Why it exists | Example |
|---|---|---|
| Goal and success criteria | Prevent drift | “Produce a verified run report with citations and stop reasons.” |
| Constraints | Prevent unsafe actions | “No external messages without approval.” |
| Plan and progress | Maintain momentum | “Completed: tool contract validation; Next: checkpoint write.” |
| Evidence pointers | Defend claims | “Source hash for page excerpt used in decision.” |
| Budgets and counters | Prevent runaway | “Retries: search tool 2 of 3; Tokens: 18k of 30k.” |
| Approval tokens | Preserve authority | “Approved by on-call at 14:32 UTC.” |
This table is small, but it contains what breaks most resumability attempts.
The payoff: long tasks become normal
When checkpoints and resumability are done well, long tasks stop being scary.
The agent can:
- Pause for humans without forgetting.
- Survive restarts without drama.
- Resume with the same intent.
- Prove what it did and why.
- Avoid duplicating side effects.
That is what it means to move from a demo to an operator-grade system.
Keep Exploring Reliable Agent Systems
• Context Compaction for Long-Running Agents
https://ai-rng.com/context-compaction-for-long-running-agents/
• Agent Logging That Makes Failures Reproducible
https://ai-rng.com/agent-logging-that-makes-failures-reproducible/
• Reliable Retries and Fallbacks in Agent Systems
https://ai-rng.com/reliable-retries-and-fallbacks-in-agent-systems/
• Human Approval Gates for High-Risk Agent Actions
https://ai-rng.com/human-approval-gates-for-high-risk-agent-actions/
• Agent Memory: What to Store and What to Recompute
https://ai-rng.com/agent-memory-what-to-store-and-what-to-recompute/
• Multi-Step Planning Without Infinite Loops
https://ai-rng.com/multi-step-planning-without-infinite-loops/
