The Agent That Wouldn’t Stop: A Failure Story and the Fix

Connected Patterns: Understanding Agents Through Failure-Resistant Design
“A good agent is not the one that never fails. It is the one that cannot run away.”

Runaway agents do not begin as disasters.

They begin as a quiet success: a small automation that saves ten minutes, a helper that drafts a reply, a tool caller that fetches a few facts. Then one day the same loop meets a messy reality: a flaky API, an ambiguous instruction, a deadline, a human who is busy, and a system that does not know how to stop.

That is when the agent keeps going.

It retries the same action until rate limits harden into a wall.
It “helpfully” creates duplicates because it cannot prove what already happened.
It keeps searching because the answer is never quite confident enough.
It keeps writing because it cannot tell the difference between progress and motion.

A runaway agent is rarely a model intelligence problem. It is almost always a harness design problem. The loop is missing explicit boundaries.

This is a failure story, but it is also good news. When you understand why an agent would not stop, you can build the simple constraints that make the same class of failure nearly impossible.

The Failure Story

A team built an agent to triage support tickets.

The agent’s job was straightforward:

  • Read a new ticket
  • Pull account context from internal tools
  • Suggest a response
  • If the ticket looked risky, request human approval before sending

In early testing, it worked. The tool calls returned quickly. The agent produced clean drafts. Approvals came back in minutes.

Then the real world arrived.

One evening, an internal account tool started timing out intermittently. The agent would request context, wait, and then retry. When it did get data, the data was sometimes incomplete. So the agent would retry again, hoping for the full picture.

At the same time, the ticket queue was rising, and the approval reviewer was away from their desk.

The agent did what it was trained to do: it tried to be helpful.

It retried tool calls until rate limits kicked in.
It created multiple draft responses for the same ticket because it lost track of which draft was “the draft.”
It escalated more tickets than necessary because the partial context increased uncertainty.
It then retried escalation messages when it did not see an acknowledgment.
It kept going through the night, producing a pile of noise.

The next morning, the team saw the damage:

  • Tool usage costs had spiked
  • Rate limits were exhausted
  • Internal logs were hard to interpret
  • Duplicate artifacts were scattered across systems
  • The agent had not shipped better outcomes; it had shipped motion

The agent did not stop because the system never gave it a clear definition of done, a budget it could not exceed, or a safe way to pause.

Why Agents Don’t Stop

When an agent runs away, it is tempting to treat it as a single bug. In practice, it is usually the overlap of several missing constraints.

No “Done” Predicate

A loop that cannot prove completion will keep trying to complete.

Agents are often given objectives like “resolve the ticket,” “collect the information,” or “write the report.” Those are goals, but they are not stopping rules.

A stopping rule is something an agent can evaluate mechanically:

  • The response has been drafted and queued for human review
  • The tool output matches the schema and passes validation checks
  • A human approval token has been received, or the approval window has expired and the run is paused
  • The run has produced the required artifacts and a final status summary

Without a done predicate, the agent replaces certainty with more attempts.

Ambiguity Without Escalation

Ambiguity is normal in real workflows. The failure is not ambiguity. The failure is having no safe action when ambiguity remains.

If an agent faces conflicting signals, it needs a defined branch:

  • Ask the user a clarifying question
  • Escalate to a human
  • Pause the run with a clear reason and a compact state snapshot

If none of these exist, the agent will invent a fourth option: keep working.
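
The three safe branches can be encoded as an explicit decision function, so "keep working" is never the default. The confidence threshold and branch names here are assumptions for illustration:

```python
def next_action(confidence: float,
                conflicting_signals: bool,
                can_ask_user: bool) -> str:
    """Pick a safe branch under ambiguity. Threshold of 0.8 is illustrative."""
    if confidence >= 0.8 and not conflicting_signals:
        return "proceed"
    if can_ask_user:
        return "ask_clarifying_question"   # branch 1: ask the user
    if conflicting_signals:
        return "escalate_to_human"         # branch 2: escalate
    return "pause_with_snapshot"           # branch 3: pause with state
```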

Retries Without Idempotency

Retries are not the problem. Retries without idempotency become duplication.

If “send escalation message” is retried, it must either:

  • Be idempotent by design, meaning a repeat does not create a new side effect
  • Or be guarded by a check that proves the message already exists

If neither is true, a retry is a duplication machine.

No Budget, No Backoff, No Circuit Breaker

An agent that can spend infinite tokens and infinite tool calls will eventually do so.

Budgets and circuit breakers are not about stinginess. They are about safety.

  • A maximum number of tool calls per run prevents a loop from turning into an outage
  • Exponential backoff prevents retry storms
  • A circuit breaker turns repeated failures into a deliberate pause and a clear alert

The model cannot invent these reliably at runtime. The harness must enforce them.
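
A capped-retry wrapper with exponential backoff is a small piece of harness code, not a prompt instruction. A minimal sketch, assuming the tool is a zero-argument callable:

```python
import time

def call_with_backoff(tool, max_retries: int = 3, base_delay: float = 0.5):
    """Capped retries with exponential backoff; never retries forever."""
    for attempt in range(max_retries + 1):
        try:
            return tool()
        except Exception:
            if attempt == max_retries:
                raise  # retry budget exhausted: surface the failure, do not loop
            time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
```

Because the cap lives in the harness, no amount of model "helpfulness" can turn a flaky dependency into a retry storm.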

No Pause State

Many runaway loops are really “I should pause” loops.

If a human approval is pending, the correct behavior is to stop doing new actions and wait. If an external system is unhealthy, the correct behavior is to stop doing new actions and wait.

If the agent does not have a real pause state with saved context and a resume path, it keeps trying to make progress anyway.

The Fix: Constrain the Loop, Then Teach It to Work Inside the Box

You fix runaway behavior by putting the agent in a box that has hard edges, and then you help it succeed inside those edges.

This is the core shift:

  • From “try until solved”
  • To “attempt within constraints, then stop with a trustworthy report”

A production agent earns trust by being stoppable.

Define the Run Contract

Every run should have a contract that is visible to humans and enforceable by the system.

A run contract answers:

  • What counts as success
  • What artifacts must be produced
  • What counts as failure
  • What counts as “paused, waiting for external input”
  • What budgets apply
  • What actions require approval

When the agent is uncertain, the run contract gives it a safe default: pause, summarize, and ask.
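
A run contract can be as simple as an immutable record that both humans and the harness read. This sketch uses hypothetical field names and values to show the shape:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunContract:
    """Visible to humans, enforceable by the harness. Fields are illustrative."""
    success_criteria: str                      # what counts as success
    required_artifacts: tuple[str, ...]        # what must be produced
    max_tool_calls: int                        # budget: tool calls per run
    max_retries_per_call: int                  # budget: retries per tool call
    actions_requiring_approval: frozenset[str] # human gate

contract = RunContract(
    success_criteria="draft queued for human review",
    required_artifacts=("draft", "run_report"),
    max_tool_calls=25,
    max_retries_per_call=3,
    actions_requiring_approval=frozenset({"send_response", "escalate"}),
)
```

Freezing the dataclass matters: the contract is set before the run starts, and nothing inside the loop can loosen it.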

Add an Explicit Stop Ladder

A stop ladder is a small set of ordered outcomes the agent can land on.

Typical ladder outcomes:

  • Completed: response drafted and queued
  • Completed: response drafted and sent with approval token
  • Paused: human approval pending
  • Paused: external dependency unhealthy
  • Failed: validation errors or missing required fields
  • Aborted: budget exceeded or stop signal received

The key is that “paused” is a success state for safety. It is not a failure. It is the correct behavior under uncertainty.
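
The ladder can be a closed enumeration, so every run ends on exactly one of these rungs and "paused" is explicitly marked safe. The names below mirror the list above and are illustrative:

```python
from enum import Enum

class Outcome(Enum):
    COMPLETED_QUEUED = "completed: response drafted and queued"
    COMPLETED_SENT = "completed: sent with approval token"
    PAUSED_APPROVAL = "paused: human approval pending"
    PAUSED_DEPENDENCY = "paused: external dependency unhealthy"
    FAILED_VALIDATION = "failed: validation errors or missing fields"
    ABORTED_BUDGET = "aborted: budget exceeded or stop signal"

# Paused rungs are success states for safety, not failures.
SAFE_OUTCOMES = {Outcome.PAUSED_APPROVAL, Outcome.PAUSED_DEPENDENCY}
```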

Enforce Budgets at the Harness Level

Budgets must be enforced by the harness, not politely requested in the prompt.

Budgets that matter:

  • Max tool calls per run
  • Max total tokens per run
  • Max wall-clock time per run
  • Max retries per tool call
  • Max consecutive failures before circuit break

If the agent hits a budget, it must stop and produce a run report that explains exactly what happened.
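
Harness-level enforcement means the budget check runs before every action, outside the model's control. A minimal sketch covering two of the budgets above (tool calls and wall-clock time), with illustrative names:

```python
import time

class BudgetExceeded(Exception):
    """Raised by the harness; the run must stop and produce a report."""

class Budget:
    """Hard limits checked before every tool call."""
    def __init__(self, max_tool_calls: int, max_seconds: float):
        self.max_tool_calls = max_tool_calls
        self.deadline = time.monotonic() + max_seconds
        self.tool_calls = 0

    def charge_tool_call(self) -> None:
        self.tool_calls += 1
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded(
                f"tool-call budget of {self.max_tool_calls} exceeded")
        if time.monotonic() > self.deadline:
            raise BudgetExceeded("wall-clock budget exceeded")
```

The exception maps directly to the "Aborted: budget exceeded" rung of a stop ladder, so hitting a limit produces a defined outcome instead of an opaque crash.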

Make Side Effects Idempotent

Any tool that causes an external change should accept an idempotency key and be safe to repeat.

If a tool cannot be made idempotent, the harness needs a preflight check:

  • Does the artifact already exist
  • Was this ticket already updated
  • Is this message already posted
  • Does the system already have a record of this side effect

An agent should never “assume” a side effect succeeded. It should verify.
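
The verify-then-act pattern can be sketched with an idempotency key and a ledger of completed side effects. Here the ledger is an in-memory set for illustration; in production it would be durable storage:

```python
def send_once(ledger: set, idempotency_key: str, send) -> bool:
    """Guarded side effect: verify before acting, record proof after acting.

    Returns True only when the send actually happened on this call.
    """
    if idempotency_key in ledger:   # preflight: was this already done?
        return False
    send()                          # perform the external side effect
    ledger.add(idempotency_key)     # record that it happened
    return True
```

With a key like `"escalate:T-123"`, retrying the escalation for ticket T-123 becomes harmless: the second call verifies the record and does nothing.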

Add a Health Gate and Circuit Breaker

If a dependency is failing, your best move is to stop asking it for help.

A simple health gate:

  • Track tool failures by tool name
  • If failures cross a threshold in a window, open the circuit
  • When the circuit is open, do not call the tool
  • Pause the run with an explanation and a next check time

This protects the dependency and protects your budget.

A Practical Diagnostic Table

When you see a runaway agent, map symptoms to missing constraints.

What you observe | Likely missing constraint | The fix that works
The agent keeps retrying the same tool | No retry cap, no backoff, no circuit breaker | Cap retries, exponential backoff, circuit breaker
Duplicate messages or duplicated artifacts | No idempotency key, no preflight "already done" check | Idempotency keys and verification checks
The agent keeps searching forever | No done predicate, no confidence threshold, no escalation path | Done rule plus an "ask or pause" branch
The agent escalates everything | No uncertainty policy, no risk grading, no partial-context handling | Risk rubric, partial-data strategy, human gate
The agent does work while waiting for approval | No pause state, no workflow state machine | Pause state with resumable checkpoints
System costs spike overnight | No budgets, no alerts, no stop ladder | Harness budgets plus monitoring and stop outcomes

The Moment Your Agent Should Stop, Not Try Harder

There is a simple principle that prevents most runaways.

When progress is blocked by missing information, external failure, or pending human judgment, the agent should stop, not grind.

Stopping should look like:

  • A compact state snapshot: what is known, what was attempted, what remains
  • A clear reason for pause
  • A minimal set of next actions for a human to approve or correct
  • A safe resume token and a resume plan

That is not quitting. That is reliability.
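
A stopping report of that shape can be a single serializable record. This sketch uses hypothetical field names; the only requirement is that a human, or the resumed run, can reconstruct the situation from it:

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass
class PauseReport:
    """Compact, resumable stopping report; fields are illustrative."""
    known: list[str] = field(default_factory=list)       # what is known
    attempted: list[str] = field(default_factory=list)   # what was attempted
    remaining: list[str] = field(default_factory=list)   # what remains
    reason: str = ""                                     # why the run paused
    next_actions: list[str] = field(default_factory=list)  # for a human
    resume_token: str = ""                               # handle to saved state

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)
```

Usage might look like building the report at the pause branch, logging the JSON, and handing the `resume_token` to whatever restarts the run after approval.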

A Minimal “No Runaway” Checklist

Before you let an agent run unattended, confirm these are true:

  • Every run has a done predicate the harness can evaluate
  • Every tool call has capped retries and a backoff policy
  • Every side effect tool is idempotent or guarded by verification checks
  • Every run has budgets and a stop ladder with a real paused state
  • Human approvals pause the run, they do not create loops
  • Tool failures can open a circuit breaker and halt further calls
  • Every run produces a run report that a different person can audit

If those are true, the agent can still fail, but it cannot run away.

Keep Exploring Agent Reliability

If you want to go deeper on the ideas connected to this topic, these posts will help you build the full mental model.

• Guardrails for Tool-Using Agents
https://ai-rng.com/guardrails-for-tool-using-agents/

• Reliable Retries and Fallbacks in Agent Systems
https://ai-rng.com/reliable-retries-and-fallbacks-in-agent-systems/

• Multi-Step Planning Without Infinite Loops
https://ai-rng.com/multi-step-planning-without-infinite-loops/

• Preventing Task Drift in Agents
https://ai-rng.com/preventing-task-drift-in-agents/

• Monitoring Agents: Quality, Safety, Cost, Drift
https://ai-rng.com/monitoring-agents-quality-safety-cost-drift/

• Sandbox Design for Agent Tools
https://ai-rng.com/sandbox-design-for-agent-tools/

Books by Drew Higgins