How do we evaluate what these agents are doing, not just for technical accuracy, but also for independence, safety, and real business value?
As shown in the matrix above, a Level 3 tool-calling agent can be assigned to any Oversight Level from 1 through 5; the assignment is driven by the risk of the deployment, not by the agent's capability.
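
To make that distinction concrete, here is a minimal sketch of risk-based oversight assignment. The risk-tier names, the direction of the oversight scale (1 is treated as the tightest oversight here), and the example tasks are all illustrative assumptions, not part of the framework above.

```python
# Hypothetical sketch: oversight is chosen from the risk of the deployment,
# not from the agent's capability level. Tier names, the scale direction
# (1 = tightest oversight here), and the examples are illustrative only.

def assign_oversight_level(risk_tier: str) -> int:
    """Map a deployment's risk tier to an oversight level."""
    oversight_by_risk = {
        "critical": 1,  # every action reviewed before execution
        "high": 2,      # human approves irreversible actions
        "medium": 3,    # human monitors and intervenes on alerts
        "low": 4,       # periodic audits of sampled traces
        "minimal": 5,   # automated checks with occasional spot reviews
    }
    return oversight_by_risk[risk_tier]

# The same Level 3 tool-calling agent lands at different oversight levels
# depending on where it is deployed:
print(assign_oversight_level("critical"))  # 1 - e.g., an agent issuing refunds
print(assign_oversight_level("minimal"))   # 5 - e.g., an agent drafting internal summaries
```
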
| Agent Level | What to Measure | Autonomy Range | Common Failure Modes | Executive Metric |
|---|---|---|---|---|
| Level 1 — Basic Responder | Relevance, factual accuracy, latency, token efficiency | Intelligent Automation | Hallucination, repetition loops, prompt injection | Customer satisfaction, cost per interaction |
| Level 2 — Router | Precision/recall/F1, confusion matrix, multi-intent detection, fallback handling | Conditional Autonomy | Intent misclassification, over-confident routing | First-contact resolution rate |
| Level 3 — Tool-Calling Agent | Tool selection accuracy, parameter extraction, error recovery, cost optimization | Conditional → High Autonomy | Wrong tool selection, parameter hallucination, cascading failures | Task completion rate, automation ROI |
| Level 4 — Multi-Agent System | Orchestration efficiency, handoff success, system-level goal completion, credit assignment | High Autonomy | Deadlocks, communication overhead, diffusion of responsibility | End-to-end process efficiency |
| Level 5 — Autonomous Agent | Independent goal achievement, cross-domain generalization, adaptation rate, novel solutions | Full Autonomy (theoretical/high-risk) | Goal misalignment, value drift, overconfidence | Human intervention hours saved |
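
To make the "What to Measure" column concrete, the sketch below shows how two of the simpler metrics could be computed from labeled traces: precision/recall/F1 for a Level 2 router, and tool-selection accuracy for a Level 3 agent. The trace format (dictionaries with `predicted`/`expected` fields) is an assumption for illustration, not a specific framework's schema.

```python
# Minimal sketch: computing two metrics from the table above on labeled traces.
# The trace dictionaries and their field names are illustrative assumptions.

def router_precision_recall_f1(traces, intent):
    """Level 2 router: precision, recall, and F1 for a single intent."""
    tp = sum(1 for t in traces if t["predicted"] == intent and t["expected"] == intent)
    fp = sum(1 for t in traces if t["predicted"] == intent and t["expected"] != intent)
    fn = sum(1 for t in traces if t["predicted"] != intent and t["expected"] == intent)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def tool_selection_accuracy(traces):
    """Level 3 agent: fraction of calls where the expected tool was chosen."""
    correct = sum(1 for t in traces if t["predicted_tool"] == t["expected_tool"])
    return correct / len(traces) if traces else 0.0

# Example: a router evaluated on three labeled interactions.
traces = [
    {"predicted": "billing", "expected": "billing"},
    {"predicted": "billing", "expected": "refund"},
    {"predicted": "refund", "expected": "refund"},
]
print(router_precision_recall_f1(traces, "billing"))  # ≈ (0.5, 1.0, 0.67)
```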
Beyond per-level metrics, evaluation also splits by scope: assessing the full workflow end to end versus drilling into individual components.

| Aspect | End-to-End Evaluation | Component-Level Evaluation |
|---|---|---|
| Purpose | Validate overall system performance and business value | Diagnose and optimize individual components |
| Focus Area | Full workflow from input to final output | LLM, retriever, tool-caller, orchestrator, etc. |
| Best For | User experience, compliance, reliability | Debugging, accuracy improvements, performance tuning |
| Scope | Broad, holistic view of the system | Narrow, deep investigation of one subsystem |
| Failure Detection | Detects user-visible or system-wide failures | Identifies root causes and bottlenecks |
| Cost & Time | More expensive and slower to run | Faster iterations with lower cost |
| When to Use | Before release, during production monitoring | During development, troubleshooting, optimization |
| Output Quality Signals | Task success rate, latency, user satisfaction | Precision@k, F1 score, error-handling quality |
| Risk Indicators | Workflow-level breakage, compliance gaps | Misrouting, tool-call errors, retrieval drift |
| Recommended Share | ~70% of total evaluation effort | ~30% of total evaluation effort |
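
As a rough illustration of the split, the sketch below contrasts an end-to-end check (task success and latency across the full workflow) with a component-level check (precision@k on the retriever in isolation). The `run_workflow` and `retrieve` callables and the example fields are placeholders for your own system, not a particular framework's API.

```python
# Illustrative sketch of the two evaluation scopes. The callables and example
# fields below are placeholders; wire in your own workflow and retriever.
import time

def end_to_end_eval(examples, run_workflow):
    """End-to-end: score the whole workflow on task success and latency."""
    successes, latencies = 0, []
    for ex in examples:
        start = time.perf_counter()
        output = run_workflow(ex["input"])
        latencies.append(time.perf_counter() - start)
        successes += int(output == ex["expected_output"])
    latencies.sort()
    return {
        "task_success_rate": successes / len(examples),
        "median_latency_s": latencies[len(latencies) // 2],  # approximate p50
    }

def retriever_precision_at_k(examples, retrieve, k=5):
    """Component-level: precision@k for the retriever alone."""
    scores = []
    for ex in examples:
        retrieved = retrieve(ex["query"], k=k)   # list of returned document ids
        relevant = set(ex["relevant_doc_ids"])
        scores.append(sum(1 for d in retrieved if d in relevant) / k)
    return sum(scores) / len(scores)
```

In line with the table, the end-to-end harness is the slower, broader check run before release and on production samples, while the component-level check is the fast inner loop run on every change to the retriever, router, or tool-caller.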