Agent Error Taxonomy: The Failures You Will Actually See

Connected Patterns: Turning “It Failed” Into Actionable Fixes
“Most agent failures repeat. They only feel random because you are not classifying them.”

When an agent fails, teams often describe the event with vague language:

  • “The model got confused.”
  • “The tool call went weird.”
  • “The agent hallucinated.”
  • “The run drifted.”

Those phrases may be emotionally accurate, but they are operationally useless. If you cannot name a failure mode, you cannot prevent it. If you cannot separate failure families, you cannot measure improvement.

An error taxonomy is a practical map of what breaks in agent systems and what to do about it. The goal is not academic completeness. The goal is to reduce the number of times a human has to manually rescue a run.

Why Agents Fail Differently Than Traditional Software

Traditional software fails when code is wrong.

Agent systems fail when behavior is wrong. Behavior is shaped by prompts, tools, retrieval, budgets, and environment. This creates failure modes that look like “reasoning” but are really system design flaws.

A useful taxonomy recognizes that many failures are not model failures at all. They are missing contracts, missing verification, missing budgets, or missing guardrails.

The Core Failure Families

Most production failures fall into a small set of families. If you can tag runs with these families, you can measure reliability in a meaningful way.

  • Target failures: the agent misunderstood or silently changed the goal.
  • Retrieval failures: the agent fetched the wrong information or trusted a bad source.
  • Tool failures: the tool returned malformed output, partial output, or unsafe behavior.
  • State failures: the agent lost constraints, forgot decisions, or carried stale memory forward.
  • Planning failures: the agent looped, over-planned, or never committed to execution.
  • Safety failures: the agent attempted an action outside approved boundaries.
  • Integration failures: timeouts, rate limits, concurrency conflicts, or environment mismatch.
  • Human-interface failures: unclear requests, missing approvals, or ambiguous acceptance criteria.

These families sound broad, but they become concrete when you attach symptoms and remedies.
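Tagging runs with these families is easiest when the families are a fixed, typed vocabulary rather than free-text labels. A minimal sketch in Python, with illustrative names (the `FailureFamily` and `RunFailure` types are assumptions, not a standard):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class FailureFamily(Enum):
    """The core failure families as a closed vocabulary for run tagging."""
    TARGET = "target"
    RETRIEVAL = "retrieval"
    TOOL = "tool"
    STATE = "state"
    PLANNING = "planning"
    SAFETY = "safety"
    INTEGRATION = "integration"
    HUMAN_INTERFACE = "human_interface"

@dataclass
class RunFailure:
    """One tagged failure: a primary family plus an optional contributor."""
    run_id: str
    symptom: str
    primary: FailureFamily
    secondary: Optional[FailureFamily] = None

# Example: the stale-dashboard report failure described later in this piece.
failure = RunFailure(
    run_id="run-1042",
    symptom="report cited stale dashboard numbers",
    primary=FailureFamily.RETRIEVAL,
    secondary=FailureFamily.TOOL,
)
```

An enum keeps dashboards consistent: two reviewers cannot tag the same symptom as “retrieval” and “search problem” and split the count.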

A Taxonomy You Can Use in Run Reviews

A taxonomy is only valuable if it changes what you do after a failure.

A practical run review asks:

  • What was the first observable symptom?
  • Which family does that symptom belong to?
  • What upstream condition made the failure likely?
  • What guardrail would have detected it earlier?
  • What contract or policy change prevents recurrence?

The table below is intentionally biased toward failures you will see repeatedly.

| Failure mode | What it looks like | Root cause you can fix | Mitigation pattern |
| --- | --- | --- | --- |
| Silent goal swap | Output is “good,” but not what was asked | No explicit success criteria | Restate target each phase, require acceptance checklist |
| Constraint loss | Agent ignores a requirement mid-run | No durable state snapshot | Checkpoints, constraint reminders, compaction policy |
| Confident wrong fact | Agent states something unverifiable | Missing retrieval gate | Tool routing: search or ask, cite evidence |
| Fabricated citation | Source link does not support claim | No citation validation | Store evidence snippets, require URL verification |
| Retry storm | Tool called repeatedly with same failure | No retry cap or backoff | Retry policy with typed errors and idempotency |
| Duplicate side effect | Same action executed twice | No idempotency contract | Idempotency keys, dry runs, commit step |
| Infinite loop | Agent keeps planning or rechecking | No stop rule | Step budgets, stop conditions, done definition |
| Partial results hidden | Agent presents incomplete work as complete | No partial flag | Contracted partial markers and run report format |
| “Tool succeeded” but wrong output | Schema fits but semantics wrong | Weak validation invariants | Output invariants, cross-checks, sanity rules |
| Unsafe action attempted | Agent tries to delete, send, or purchase | Missing guardrails | Approval gates, read-only defaults, sandboxing |

Detecting Failures Before They Become Incidents

Most failures have early signals. The reason teams miss them is that they do not log the right things.

Early signals you can capture without expensive instrumentation:

  • The agent repeatedly asks the same question in slightly different words.
  • The agent’s tool calls oscillate between two tools without producing new evidence.
  • The run’s “open questions” list grows while deliverables remain unchanged.
  • Tool outputs contain warnings that never appear in the final response.
  • The agent’s state snapshot stops changing even though steps continue.
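Several of these signals reduce to one cheap check: does a short step pattern keep recurring? A sketch of that check, assuming you already log an ordered list of step names per run (the function name and thresholds are illustrative):

```python
from collections import Counter

def repeated_step_pattern(steps, window=2, threshold=3):
    """Return True if any consecutive pattern of `window` steps occurs
    at least `threshold` times -- an early sign of an oscillation or loop."""
    patterns = Counter(
        tuple(steps[i:i + window]) for i in range(len(steps) - window + 1)
    )
    return any(count >= threshold for count in patterns.values())

# Two tools oscillating without producing new evidence:
steps = ["search", "summarize", "search", "summarize", "search", "summarize"]
print(repeated_step_pattern(steps))  # True
```

This costs one pass over the step log, so it can run after every step without meaningful overhead.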

If you store these signals, you can trigger stop rules automatically:

  • Pause and request human input when the agent repeats a step pattern.
  • Reduce tool permissions when warnings accumulate.
  • Force a “progress summary” checkpoint when the deliverable does not advance.
  • Escalate when the agent attempts a side effect outside the planned action set.
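These stop rules can live in a plain signal-to-action table so interventions are deterministic and auditable. A minimal sketch, where both the signal names and the action names are illustrative conventions, not a standard:

```python
def stop_rule_action(signal: str) -> str:
    """Map an early-warning signal to an automatic intervention.
    Unknown signals default to continuing the run."""
    rules = {
        "repeated_step_pattern": "pause_and_request_human_input",
        "warnings_accumulating": "reduce_tool_permissions",
        "deliverable_stalled": "force_progress_summary",
        "unplanned_side_effect": "escalate",
    }
    return rules.get(signal, "continue")

print(stop_rule_action("unplanned_side_effect"))  # escalate
```

Keeping the mapping in data rather than scattered `if` statements makes it easy to review the full intervention policy in one place.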

The goal is not to punish the agent. The goal is to prevent silent failure from becoming expensive failure.

A Failure Story That Shows the Value of Classification

Imagine an agent assigned to “compile a weekly operations report from logs and tickets.” It starts by retrieving tickets, then pulls a dashboard screenshot from an internal system. The screenshot is stale, but the agent does not know that. It writes a report confidently, and a manager makes a staffing decision based on the wrong numbers.

If you only say “the agent hallucinated,” you will reach for prompt tweaks.

If you classify the failure, the fix becomes obvious:

  • Primary family: retrieval failure.
  • Contributing family: tool failure because the dashboard tool did not return a timestamp.
  • Preventive guardrail: contract requires every metric payload to include an “as_of” time.
  • Verification gate: cross-check the dashboard metric against the ticket system counts.
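The “as_of” guardrail from the story above can be enforced in a few lines at the tool boundary. A sketch assuming metric payloads are dicts and timestamps are ISO 8601 strings (the field name `as_of` and the 24-hour freshness window are illustrative choices):

```python
from datetime import datetime, timedelta, timezone

def validate_metric_payload(payload: dict, max_age_hours: int = 24) -> list:
    """Return a list of contract violations for a metric payload.
    An empty list means the payload passed the freshness gate."""
    problems = []
    as_of = payload.get("as_of")
    if as_of is None:
        problems.append("missing as_of timestamp")
    else:
        age = datetime.now(timezone.utc) - datetime.fromisoformat(as_of)
        if age > timedelta(hours=max_age_hours):
            problems.append(f"stale data: {age} old")
    return problems

print(validate_metric_payload({"open_tickets": 42}))  # ['missing as_of timestamp']
```

Returning violations instead of raising lets the agent route the result: retry the tool, switch sources, or flag the report as partial.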

The next run becomes safer because the system learned a rule, not a vibe.

The Most Underestimated Category: State Failures

Teams often think memory is a model feature. In practice, memory is a systems feature.

State failures include:

  • The agent forgets a decision it made earlier and contradicts itself.
  • The agent continues with a stale assumption after conditions change.
  • The agent carries forward an outdated summary that overwrote important nuance.
  • The agent bloats context until it loses the thread entirely.

These failures get misdiagnosed as “the model is not smart enough.” The fix is usually better state design: what to store, how to compact it, and how to validate it.
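Better state design usually starts with a durable snapshot that survives compaction. A minimal sketch, where the `RunState` structure and its fields are one illustrative way to carve up state, not a prescribed schema:

```python
from dataclasses import dataclass, field

@dataclass
class RunState:
    """Durable state snapshot: goal, constraints, and decisions live
    outside the prompt, so compaction cannot silently drop them."""
    goal: str
    constraints: list = field(default_factory=list)
    decisions: dict = field(default_factory=dict)

    def compact_summary(self) -> str:
        """A compacted context block that always re-states the goal,
        every constraint, and every decision made so far."""
        lines = [f"Goal: {self.goal}"]
        lines += [f"Constraint: {c}" for c in self.constraints]
        lines += [f"Decision: {k} -> {v}" for k, v in self.decisions.items()]
        return "\n".join(lines)

state = RunState(
    goal="weekly ops report",
    constraints=["cite only timestamped metrics"],
)
state.decisions["format"] = "markdown table"
print(state.compact_summary())
```

The point of the design is the invariant: no matter how aggressively you summarize conversation history, the summary is rebuilt from this structure, so constraints and decisions cannot be overwritten by a lossy paraphrase.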

Retrieval Failures Are Usually Policy Failures

When agents retrieve from the web or a private knowledge base, the failure is not simply “bad search.”

Common retrieval problems are policy problems:

  • The agent accepts the first source instead of cross-checking.
  • The agent pulls an outdated page and treats it as current.
  • The agent mixes sources and does not resolve contradictions.
  • The agent cannot separate primary sources from commentary.

These problems are prevented by a retrieval policy, not by asking the model to “be careful.” Policies make carefulness enforceable.
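A retrieval policy can be a small, testable function rather than a prompt instruction. A sketch under stated assumptions: sources carry a publication date and a primary/commentary flag, and the thresholds (two recent sources, a cutoff year) are illustrative knobs:

```python
from dataclasses import dataclass

@dataclass
class Source:
    url: str
    published: str   # ISO date, e.g. "2024-03-01"
    primary: bool    # primary source vs commentary

def retrieval_policy(sources, min_sources=2, min_year=2023) -> str:
    """Decide what to do with the evidence gathered for one claim."""
    recent = [s for s in sources if int(s.published[:4]) >= min_year]
    if len(recent) < min_sources:
        return "ask_or_search_more"       # not enough corroboration yet
    if not any(s.primary for s in recent):
        return "flag_no_primary_source"   # only commentary found
    return "accept"

sources = [
    Source("https://example.com/spec", "2024-01-01", primary=True),
    Source("https://example.com/blog", "2024-02-01", primary=False),
]
print(retrieval_policy(sources))  # accept
```

Because the policy returns an action rather than a verdict, the agent knows exactly what to do next when evidence is thin: keep searching or ask, instead of guessing.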

Planning Failures: The Agent That Can’t Commit

Many agents can plan. Fewer can finish.

Planning failures show up as:

  • Endless decomposition into sub-tasks.
  • Replanning the same plan because uncertainty never drops.
  • Optimizing the plan rather than executing.
  • Rewriting deliverables repeatedly without shipping.

The fix is to treat planning like a bounded phase with budgets and stop rules, then commit to execution with verification gates.
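A bounded planning phase can be as simple as a replan counter with a hard cap. A minimal sketch (the class name and the cap of two replans are illustrative):

```python
class PlanningBudget:
    """Bound the planning phase: once the replan budget is spent,
    the agent must commit to the current plan and execute."""

    def __init__(self, max_replans: int = 2):
        self.max_replans = max_replans
        self.replans = 0

    def request_replan(self) -> bool:
        """Return True if one more replan is allowed, else False."""
        if self.replans >= self.max_replans:
            return False
        self.replans += 1
        return True

budget = PlanningBudget(max_replans=2)
print(budget.request_replan())  # True
print(budget.request_replan())  # True
print(budget.request_replan())  # False: commit and execute
```

The budget does not make the plan better; it forces the transition to execution, where verification gates can produce real evidence instead of more uncertainty.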

Tool Failures: When the System Blames the Model

Tool failures often get blamed on the model because the tool output was ambiguous.

But tool failures are predictable:

  • Unstructured errors.
  • Missing fields.
  • Rate limits.
  • Partial returns without labels.
  • Side effects hidden behind friendly names.

If a tool can fail in a way that makes the agent guess, the tool is unsafe for automation. A contract envelope and typed errors are the fastest path to reliability.
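A contract envelope is a fixed result shape every tool must return. A sketch with illustrative field names and a hypothetical `fetch_dashboard` wrapper (neither is a standard, just one way to make failures typed):

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class ToolResult:
    """Contract envelope: every tool returns this shape, so the agent
    never has to parse prose to learn what happened."""
    ok: bool
    data: Any = None
    error_type: Optional[str] = None  # e.g. "rate_limited", "stale_data"
    retryable: bool = False
    partial: bool = False

def fetch_dashboard(stale: bool) -> ToolResult:
    """Hypothetical tool wrapper that surfaces a typed error instead of
    returning ambiguous or partially-correct output."""
    if stale:
        return ToolResult(ok=False, error_type="stale_data", retryable=True)
    return ToolResult(ok=True, data={"open_tickets": 42})

result = fetch_dashboard(stale=True)
print(result.ok, result.error_type)  # False stale_data
```

With typed errors, retry policy becomes mechanical: retry only when `retryable` is true, cap the attempts, and escalate everything else instead of letting the model improvise.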

Safety Failures: The Cost of Implicit Permission

Safety failures occur when an agent assumes it is allowed to act.

A safe system makes permission explicit:

  • The agent defaults to read-only actions.
  • The agent uses dry runs to preview changes.
  • The agent requires human approval for high-risk actions.
  • The system logs every side effect with traceable intent.

If you cannot explain why an action was permitted, the permission model is broken.
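Explicit permission can be an allowlist check that runs before every side effect. A minimal sketch, where the action names and categories are illustrative placeholders for your own action set:

```python
# Read-only actions the agent may take without asking.
APPROVED_ACTIONS = {"read_logs", "read_tickets", "draft_report"}
# Actions that always require a human in the loop.
HIGH_RISK_ACTIONS = {"delete", "send_email", "purchase"}

def check_permission(action: str, human_approved: bool = False) -> str:
    """Explicit permission model: read-only by default, approval gates
    for high-risk actions, and deny for anything unrecognized."""
    if action in APPROVED_ACTIONS:
        return "allow"
    if action in HIGH_RISK_ACTIONS:
        return "allow" if human_approved else "require_approval"
    return "deny"

print(check_permission("delete"))  # require_approval
```

Because every decision maps to a named rule (allowlisted, approved, or denied by default), you can always explain why an action was permitted, which is exactly the test the paragraph above sets.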

Making the Taxonomy Operational

A taxonomy becomes real when it is embedded into your platform.

Practical steps:

  • Tag every failed run with a primary failure family and a secondary contributor.
  • Add those tags to dashboards so you can see dominant failure patterns.
  • For each dominant pattern, define one policy change and one tool change.
  • Create a “known issues” playbook with the mitigation for each category.
  • Require run reports that include: what happened, evidence, and what remains uncertain.
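Once runs carry family tags, finding the dominant pattern is a one-line aggregation. A sketch over hypothetical run records (the dict shape mirrors the tagging step above and is illustrative):

```python
from collections import Counter

# Hypothetical tagged failures pulled from run logs.
failed_runs = [
    {"run_id": "r1", "primary": "retrieval", "secondary": "tool"},
    {"run_id": "r2", "primary": "retrieval", "secondary": None},
    {"run_id": "r3", "primary": "state", "secondary": "planning"},
]

# The most common primary family is where the next policy change goes.
dominant = Counter(run["primary"] for run in failed_runs).most_common(1)[0]
print(dominant)  # ('retrieval', 2)
```

Even this crude count answers the key prioritization question: which family gets the next policy change and tool change.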

When you run this loop, reliability improves quickly because you are fixing repeatable problems rather than arguing about “model intelligence.”

Keep Exploring Reliable Agent Workflows

• Agent Logging That Makes Failures Reproducible
https://ai-rng.com/agent-logging-that-makes-failures-reproducible/

• Reliable Retries and Fallbacks in Agent Systems
https://ai-rng.com/reliable-retries-and-fallbacks-in-agent-systems/

• Preventing Task Drift in Agents
https://ai-rng.com/preventing-task-drift-in-agents/

• Monitoring Agents: Quality, Safety, Cost, Drift
https://ai-rng.com/monitoring-agents-quality-safety-cost-drift/

• The Agent That Wouldn’t Stop: A Failure Story and the Fix
https://ai-rng.com/the-agent-that-wouldnt-stop-a-failure-story-and-the-fix/

Books by Drew Higgins