Agent Evaluation

Articles in This Topic

Agent Evaluation: Task Success, Cost, Latency

Agent systems can look impressive in a demo while failing quietly in production. The gap is not only model quality. It is evaluation discipline. A deployed agent is a workflow engine that reads, plans, calls tools, and produces outcomes under constraints. Evaluating an agent means evaluating the workflow, not […]

Agent Handoff Design: Clarity of Responsibility

Handoffs are where agent systems either become trustworthy infrastructure or become a source of quiet risk. A handoff happens whenever responsibility moves from one actor to another: from agent to human, from agent to another service, from agent to a different role, or from one stage of a workflow […]

Memory Systems: Short-Term, Long-Term, Episodic, Semantic

Memory is the difference between an agent that answers questions and an agent that can carry work across time. It is also the difference between a system that quietly accumulates risk and a system that stays accountable. “Memory” is not a single feature. It is a set of storage […]

Planning Patterns: Decomposition, Checklists, Loops

An agent that takes action without a plan is fast until it is wrong. An agent that plans without acting is safe until it is useless. The practical craft is not “planning” as a philosophical concept, but planning as a set of patterns that keep multi-step work inside budgets while […]

Related Topics

Failure Recovery Patterns

Error Recovery: Resume Points and Compensating Actions

Guardrails and Policies

Guardrails: Policies, Constraints, Refusal Boundaries

Human-in-the-Loop Design

Agents and Orchestration

Tool-using systems, planning, memory, orchestration, and operational guardrails.

Failure Recovery Patterns

Concepts, patterns, and practical guidance on Failure Recovery Patterns within Agents and Orchestration.

Guardrails and Policies

Concepts, patterns, and practical guidance on Guardrails and Policies within Agents and Orchestration.

Human-in-the-Loop Design

Concepts, patterns, and practical guidance on Human-in-the-Loop Design within Agents and Orchestration.

Memory and State

Concepts, patterns, and practical guidance on Memory and State within Agents and Orchestration.

Multi-Agent Coordination

Concepts, patterns, and practical guidance on Multi-Agent Coordination within Agents and Orchestration.

Multi-Step Reliability

Concepts, patterns, and practical guidance on Multi-Step Reliability within Agents and Orchestration.

Planning and Task Decomposition

Concepts, patterns, and practical guidance on Planning and Task Decomposition within Agents and Orchestration.

Sandbox and Permissions

Concepts, patterns, and practical guidance on Sandbox and Permissions within Agents and Orchestration.

Tool Use Patterns

Concepts, patterns, and practical guidance on Tool Use Patterns within Agents and Orchestration.

AI Foundations and Concepts

Core concepts and measurement discipline that keep AI claims grounded in reality.

AI Product and UX

Design patterns that turn capability into useful, trustworthy user experiences.

Business, Strategy, and Adoption

Adoption strategy, economics, governance, and organizational change driven by AI.

Data, Retrieval, and Knowledge

Data pipelines, retrieval systems, and grounding techniques for trustworthy outputs.

Hardware, Compute, and Systems

Compute, hardware constraints, and systems engineering behind AI at scale.

AI

A structured directory of AI topics, organized around innovation and the infrastructure shift shaping what comes next.