Connected Patterns: Recovering From Failure Without Making It Worse
“Retries are not reliability. Retries are risk unless they are disciplined.”
When an agent system fails, the instinct is to try again.
That instinct is natural. It is also one of the fastest ways to turn a small issue into a large incident.
A single tool call times out.
The agent retries immediately.
The next call also times out.
The agent retries again, faster than the upstream can recover.
Soon you have a retry storm: rising load, rising errors, growing cost, and no progress.
Retries and fallbacks are not a minor implementation detail. They are the reliability core of any tool-using agent.
The question is not whether you retry. The question is whether your retry policy is safe, bounded, and idempotent.
The Failure Modes Retries Create
Retries create three common failure classes.
Retry storms
An agent that retries aggressively increases load precisely when upstream services are already struggling. This can turn a transient glitch into sustained downtime.
Duplicate side effects
Many agent actions have side effects: sending messages, writing files, updating records, creating tickets. If you retry blindly, you create duplicates.
The agent may be “doing its best,” but your users experience spam, double-writes, and confusing states.
Masked root causes
Retries can hide the real problem. The system eventually succeeds, so nobody investigates. Then the same failure returns at a worse moment.
A mature system uses retries as a controlled tool, not as a substitute for diagnosis.
The Pattern Inside the Story of Reliable Systems
The broader world of distributed systems has learned a few rules the hard way. Agents need those rules because agents amplify mistakes by acting repeatedly and confidently.
Here are the core patterns translated into agent terms.
| Pattern | What it does | What it prevents |
|---|---|---|
| Exponential backoff with jitter | Spreads retries over time and avoids synchronized spikes | Storms and cascading failures |
| Retry caps | Limits attempts per action and per run | Infinite loops and runaway cost |
| Idempotency keys | Makes repeated commits safe | Duplicate emails and double-writes |
| Circuit breakers | Stops calling a failing dependency temporarily | Wasting time and worsening outages |
| Timeouts | Defines how long to wait before declaring failure | Hanging runs and deadlocks |
| Fallback chains | Provides alternate routes when a tool fails | Total run failure from a single dependency |
| Checkpointed progress | Persists safe milestones | Restarting from zero after failure |
A reliable agent harness implements these patterns centrally so every tool call inherits them.
Designing Retries That Do Not Hurt You
A safe retry strategy is shaped by the kind of action being attempted.
Separate read actions from write actions
Read actions are typically safe to retry. Write actions may not be.
Examples:
• Read: fetch a web page, query a database read-only, list files
• Write: send an email, submit a form, create a record, update a document
The harness should tag tools and actions with a side-effect level. Routing can then apply stricter rules to higher-risk actions.
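A minimal sketch of that tagging, assuming a hypothetical tool registry (the tool names and the `ToolSpec` type are illustrative, not a real API):

```python
from dataclasses import dataclass
from enum import Enum

class SideEffect(Enum):
    READ = "read"    # safe to retry freely
    WRITE = "write"  # retry only with idempotency or approval

@dataclass(frozen=True)
class ToolSpec:
    name: str
    side_effect: SideEffect

# Hypothetical registry; in a real harness this comes from tool contracts.
TOOLS = {
    "fetch_page": ToolSpec("fetch_page", SideEffect.READ),
    "send_email": ToolSpec("send_email", SideEffect.WRITE),
}

def max_auto_retries(tool: ToolSpec) -> int:
    """Stricter retry budget for higher-risk actions."""
    return 3 if tool.side_effect is SideEffect.READ else 0
```

The key design choice is that the retry budget is derived from the tag, so no individual call site can opt into a riskier policy.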
Make commits idempotent or gate them
If an action can cause an external change, you need one of these:
• Idempotency: repeating the action with the same key applies the change at most once
• Gating: a human approval step before the write
Idempotency can be implemented by attaching a unique key to each intended commit and having the receiving system dedupe. If you cannot guarantee dedupe, do not allow automatic retries for that action.

Use exponential backoff with jitter
Backoff means each retry waits longer than the last. Jitter means the wait time includes randomness so many agents do not retry at the same moment.
This is a reliability gift to your dependencies and to yourself. It reduces load and increases the chance of recovery.
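A minimal sketch of full-jitter backoff (the parameter defaults are illustrative; tune them to your dependencies):

```python
import random
import time

def retry_with_backoff(call, attempts=5, base=0.5, cap=30.0):
    """Retry a callable on timeout, sleeping a random amount drawn from
    [0, min(cap, base * 2**attempt)] between tries. The randomness
    (jitter) keeps many agents from retrying at the same moment."""
    for i in range(attempts):
        try:
            return call()
        except TimeoutError:
            if i == attempts - 1:
                raise  # budget exhausted: surface the failure
            time.sleep(random.uniform(0, min(cap, base * 2 ** i)))
```

Note that the delay is sampled from the full interval rather than fixed at the exponential value; this "full jitter" variant spreads retries more evenly than adding a small random offset.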
Apply retry caps per failure class
Not all failures should be retried. The harness should classify failures:
• Transient: timeouts, temporary network issues, 5xx errors
• Persistent: 4xx errors, permission denied, invalid input
• Unknown: malformed responses, unexpected formats
Transient failures can be retried with backoff and a cap. Persistent failures should stop quickly and surface the cause. Unknown failures should trigger a verification gate or escalation, not endless retries.
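The classification above can be sketched as a simple mapping (the status-code buckets are illustrative; a real taxonomy depends on each tool's error contract):

```python
from enum import Enum

class FailureClass(Enum):
    TRANSIENT = "transient"
    PERSISTENT = "persistent"
    UNKNOWN = "unknown"

def classify(status=None, exc=None):
    """Map a failure to a class. Timeouts and 5xx are transient;
    4xx are persistent; anything unrecognized is unknown."""
    if exc is not None and isinstance(exc, (TimeoutError, ConnectionError)):
        return FailureClass.TRANSIENT
    if status is not None:
        if 500 <= status < 600:
            return FailureClass.TRANSIENT
        if 400 <= status < 500:
            return FailureClass.PERSISTENT
    return FailureClass.UNKNOWN

# Per-class posture: transient gets bounded retries, persistent stops,
# unknown escalates to a verification gate instead of retrying.
RETRY_POLICY = {
    FailureClass.TRANSIENT: {"max_retries": 3, "backoff": True},
    FailureClass.PERSISTENT: {"max_retries": 0, "backoff": False},
    FailureClass.UNKNOWN: {"max_retries": 0, "escalate": True},
}
```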
Add circuit breakers around unstable tools
A circuit breaker tracks recent failures. If a tool fails repeatedly, the circuit opens and the harness stops calling it for a cooling period.
This prevents the agent from thrashing and forces it to consider fallbacks. It also makes incidents visible: if circuits open often, you have a dependency problem that needs attention.
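A minimal circuit breaker might look like this (the threshold and cooldown values are placeholders; the injectable clock exists only to make the behavior testable):

```python
import time

class CircuitBreaker:
    """Opens after `threshold` consecutive failures, stays open for
    `cooldown` seconds, then allows a single trial (half-open) call."""

    def __init__(self, threshold=3, cooldown=60.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown:
            return True  # half-open: permit one trial call
        return False

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()
```

The harness checks `allow()` before each call; an open circuit routes the agent to a fallback instead of another doomed attempt.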
Fallbacks That Preserve Correctness
Fallbacks can be even more dangerous than retries because they can change semantics. A fallback that returns “something” is not helpful if it returns the wrong thing.
A safe fallback chain has two rules:
• Each fallback must declare what it can and cannot guarantee.
• The harness must verify that the fallback output still meets requirements.
Examples of fallbacks:
• If a web source is blocked, switch to an alternate authoritative source.
• If a primary tool is down, switch to a read-only cache.
• If a structured tool fails, switch to a simpler tool plus a validation step.
When fallbacks change confidence, the harness should surface that change in the run report.
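One way to sketch such a chain, assuming the harness supplies a requirements check and each step declares its guarantee (all names here are illustrative):

```python
def run_with_fallbacks(steps, meets_requirements):
    """Try each (name, callable, guarantee) step in order. A step's
    output counts only if it passes verification; the winning route
    and its guarantee are recorded so the run report can surface
    any drop in confidence."""
    for name, call, guarantee in steps:
        try:
            result = call()
        except Exception:
            continue  # step failed outright; try the next route
        if meets_requirements(result):
            return {"result": result, "route": name, "guarantee": guarantee}
    raise RuntimeError("all fallbacks exhausted")
```

Because verification runs on every route, a fallback that returns the wrong thing fails the chain instead of silently changing semantics.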
Checkpoints: The Quiet Partner of Retries
Retries are often a symptom of missing checkpoints.
If an agent loses progress after a tool hiccup, it may repeat steps that were already done, increasing load and risk. A checkpointed system can resume from the last safe milestone.
A good pattern is:
• Draft work is allowed to be messy.
• Verified work is checkpointed.
• Committed work is tracked with idempotency keys.
This lets the agent be persistent without being reckless.
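Checkpointing itself must survive failure, so a common sketch writes atomically: the state goes to a temporary file first, then a rename makes it the new milestone (the JSON-on-disk format is an assumption for illustration):

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    """Write state atomically: a crash mid-save leaves the previous
    checkpoint intact because the rename either happens or it doesn't."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic on POSIX and Windows

def load_checkpoint(path, default=None):
    """Resume from the last safe milestone, or start fresh."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default
```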
Idempotency in Practice: The Difference Between Safe and Spam
Idempotency sounds abstract until you watch an agent send the same message five times.
In practice, you implement idempotency by making every intended commit uniquely identifiable.
A simple approach:
• Before a side-effecting action, the harness generates a commit key tied to the run ID and the action intent.
• The tool call includes that key.
• The receiver stores the key and refuses to apply the same key twice.
• The harness records the key in state so a restart does not generate a new identity for the same intent.
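The steps above can be sketched as follows, assuming a toy receiver you control (the key derivation and the `DedupingReceiver` class are illustrative):

```python
import hashlib

def commit_key(run_id, action_intent):
    """Derive a stable key from run ID plus a canonical description of
    the intent, so a restart regenerates the same identity for the
    same intended commit."""
    fingerprint = f"{run_id}:{action_intent}"
    return hashlib.sha256(fingerprint.encode()).hexdigest()[:32]

class DedupingReceiver:
    """Toy receiving system: applies each key at most once."""

    def __init__(self):
        self.seen = set()
        self.applied = []

    def apply(self, key, action):
        if key in self.seen:
            return False  # dedupe hit: report it, do not reapply
        self.seen.add(key)
        self.applied.append(action)
        return True
```

The important property is determinism: deriving the key from durable state, rather than generating a fresh UUID per attempt, is what keeps a retried or restarted run from minting a new identity for the same intent.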
If you control the receiving system, this is straightforward. If you do not, you can still approximate safety by adding a “preflight read” before the write.
Example: before creating a ticket, search for an existing ticket with the same fingerprint. Before sending a message, check whether a message with the same subject and timestamp window already exists. These checks are not perfect, but they move you from “guaranteed duplicates” to “rare duplicates,” which is often the difference between acceptable and unusable.
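A preflight check can be sketched generically; the `search_existing` and `create` callables are caller-supplied stand-ins for whatever the real system offers:

```python
def safe_create(fingerprint, search_existing, create):
    """Read before write: if a record matching the fingerprint already
    exists, return it instead of creating a duplicate. Not a guarantee
    (a race can still slip through), but it turns guaranteed duplicates
    into rare ones."""
    existing = search_existing(fingerprint)
    if existing is not None:
        return existing, False  # found: no write performed
    return create(fingerprint), True
```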
Idempotency also shapes your retry caps. A write action that is idempotent can be retried more safely than one that is not.
| Action type | Idempotency available | Retry posture |
|---|---|---|
| Read-only query | Not needed | Retry with backoff and a cap |
| Write with idempotency key | Yes | Retry cautiously, report dedupe events |
| Write with preflight check | Partial | Retry sparingly, prefer escalation |
| Write without protection | No | Do not auto-retry, require approval |
A reliable harness makes this classification explicit so the agent cannot “decide” to gamble.
Monitoring and Alarms for Retry Behavior
Retry logic that is not monitored becomes invisible until it becomes expensive.
Your agent system should track:
• Retry counts per tool
• Time spent in backoff
• Circuit breaker open rates
• Duplicate prevention hits (idempotency dedupe events)
• Fallback usage frequency
• Runs that stop due to retry caps
These are not vanity metrics. They are the heartbeat of reliability.
If you see a rise in fallback usage, upstream reliability may be slipping. If you see repeated dedupe hits, your system is behaving safely but may be stuck in repeated attempts. Both signal where to invest.
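Even a minimal in-process counter, tagged by metric name and tool, is enough to start answering those questions (this sketch assumes you export the counts to whatever monitoring backend you already run):

```python
from collections import Counter

class RetryMetrics:
    """Minimal counters for the signals listed above: retries,
    circuit openings, dedupe hits, fallback usage, cap stops."""

    def __init__(self):
        self.counts = Counter()

    def incr(self, metric, tool):
        self.counts[(metric, tool)] += 1

    def report(self):
        # Flattened view suitable for logging or export.
        return {f"{metric}.{tool}": n for (metric, tool), n in self.counts.items()}
```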
Reliability Without Panic
The goal of retries and fallbacks is not to “never fail.” The goal is to fail in a way that is safe, bounded, and explainable.
A disciplined retry policy creates a calmer system:
• Fewer runaway loops
• Fewer duplicate side effects
• Faster escalation when failure is persistent
• Lower costs under stress
• More trust in automation
When an agent fails after a well-designed retry strategy, it fails with a clear reason and a clear record. That is a success condition in its own right.
A team that can trust failure reports can improve quickly. A team that only sees silent retries and intermittent success will live in confusion, because the system never tells the truth about how brittle it is.
Keep Exploring Reliability Patterns
• Sandbox Design for Agent Tools
https://ai-rng.com/sandbox-design-for-agent-tools/
• Monitoring Agents: Quality, Safety, Cost, Drift
https://ai-rng.com/monitoring-agents-quality-safety-cost-drift/
• Designing Tool Contracts for Agents
https://ai-rng.com/designing-tool-contracts-for-agents/
• Agent Checkpoints and Resumability
https://ai-rng.com/agent-checkpoints-and-resumability/
• Verification Gates for Tool Outputs
https://ai-rng.com/verification-gates-for-tool-outputs/
• The Agent That Wouldn’t Stop: A Failure Story and the Fix
https://ai-rng.com/the-agent-that-wouldnt-stop-a-failure-story-and-the-fix/