Tracking and optimizing agentic workflows using W&B Weave and NVIDIA NeMo Agent Toolkit
Learn how to build an alert triage agent with W&B Weave and NVIDIA NeMo Agent Toolkit
This post was authored by Anuradha Karuppiah, Hsin Chen, and Kyle Zheng from NVIDIA alongside Ayush Thakur from Weights & Biases
Agentic workflows, where large language models make decisions, call tools, and reason over multi-step trajectories, are rapidly becoming central to modern AI systems. However, building, debugging, and refining these workflows remains complex.
AI agents often span multiple layers, involve different architectural styles, and exhibit non-deterministic behavior. Small tweaks can significantly impact accuracy, latency, or token usage, making it hard to optimize through iteration.
This blog explores how W&B Weave and NVIDIA NeMo Agent Toolkit can be used to evaluate and optimize agentic workflows. The goal is to enable fast, informed iteration grounded in real metrics.
What is the NVIDIA NeMo Agent Toolkit?
The NVIDIA NeMo Agent Toolkit is a framework-agnostic library designed to help developers build, evaluate, profile, and optimize agentic AI workflows. Whether you're using LangChain, LlamaIndex, CrewAI, or your own custom setup, the NeMo Agent Toolkit integrates easily and provides tools to track latency, token usage, and tool calls across complex, multi-step agents. It comes with built-in evaluation, profiling, observability, and front-end plugins to facilitate faster and more structured iteration.
With the NeMo Agent Toolkit, you can define an agentic workflow with a YAML configuration file. This file can be used to:
- Run a single input through the workflow
- Start the workflow as a server with a built-in front-end, such as FastAPI or MCP
- Evaluate a dataset for accuracy and performance

The NeMo Agent Toolkit also provides deep insight into workflow behavior, enabling the visualization and analysis of evaluation and observability metrics through W&B Weave.
W&B Weave
W&B Weave helps developers evaluate, monitor, and iterate continuously to deliver AI and agentic applications with confidence. W&B Weave offers a suite of developer tools to support every stage of your agentic AI workflows. You can run robust application evaluations, keep pace with new LLMs, and monitor applications in production while collaborating securely.
W&B Weave is designed to overcome the limitations of traditional software development tools and meet the needs of non-deterministic, AI-powered applications. It is framework- and model-agnostic, so little to no extra code is needed to work with popular AI frameworks, LLMs, and software libraries like the NVIDIA NeMo Agent Toolkit.
To get started with W&B Weave, install the SDK and authenticate your machine with your W&B API key.
First, install the W&B Weave SDK in your Python environment by running pip install weave
Then authenticate using:
- wandb login in the terminal, or
- wandb.login() if you’re in an interactive environment like a Jupyter notebook.
Paste your API key when prompted to complete the authentication process.
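Once authenticated, a quick way to confirm that traces are being logged is a minimal sketch like the following (the project name and function are placeholders, not part of the toolkit):
import weave

# Initialize Weave with a project name (placeholder; use your own).
weave.init("my-weave-project")

# Any function decorated with @weave.op is traced automatically:
# its inputs, outputs, and latency show up in the Weave UI.
@weave.op()
def classify_alert(alert_summary: str) -> str:
    # Placeholder logic standing in for a real LLM or tool call.
    return "hardware" if "hardware" in alert_summary.lower() else "unknown"

classify_alert("Instance down, hardware fault suspected")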
Building an example agent using NeMo Agent Toolkit
To demonstrate how evaluation works in practice, we’ll examine two workflows created with the NeMo Agent Toolkit. The first, the Alert Triage Agent, benchmarks operations-focused tasks using labeled datasets and automated evaluators. The second, the AI-Q Research Assistant, extends this evaluation approach to open-ended research workflows that rely on custom metrics and human annotations. Together, these examples illustrate how the NeMo Agent Toolkit and Weave integrate to make agentic systems more transparent and easier to optimize.
Step-by-step guide to evaluating the alert triage agent using W&B Weave
In this section, we’ll walk through how to evaluate and iterate on an agentic system with Weave, using the alert triage agent example from the NVIDIA NeMo Agent Toolkit repository.
Alert triage agent

The alert triage agent is a multi-agent system built with the NVIDIA NeMo Agent Toolkit that automates the investigation of monitoring alerts in large-scale server environments. It aims to reduce manual effort, accelerate incident response, and standardize triage decisions across teams.
When an alert is triggered, for example, when a host goes offline or there is a spike in CPU usage, the system initiates a multistep investigation that mirrors the logic a human analyst might follow.
The workflow starts by checking for planned maintenance. If no maintenance is detected, the core alert triage agent analyzes the alert and dynamically selects the appropriate diagnostic tools. These tools include hardware status checks, telemetry analysis, network diagnostics, and service health verifications. Each tool provides contextual information to help the agent reason about the root cause.
If needed, a supporting cloud metric analysis agent is invoked to conduct a more detailed analysis of infrastructure metrics. After all relevant data is collected and interpreted, the agent generates a structured report containing an alert summary, evidence, diagnostic findings, and a root cause classification. This report can be reviewed by human operators or passed downstream for automated remediation.
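As a rough illustration of this control flow (a sketch only; the helper functions below are stubs and do not correspond to the toolkit's actual implementation):
# Hypothetical sketch of the triage control flow described above; all helpers are stubs.
def maintenance_check(host_id: str) -> bool:
    return False  # stub: look up planned-maintenance records for the host

def run_diagnostics(alert: dict) -> list[str]:
    # stub: the real agent dynamically picks tools (hardware, network, telemetry, ...)
    return [f"hardware status for {alert['host_id']}: ok"]

def needs_deeper_metrics(findings: list[str]) -> bool:
    return False  # stub: decide whether to invoke the cloud metric analysis agent

def categorize(alert: dict, findings: list[str]) -> dict:
    # stub: the real agent produces a structured report with a root cause category
    return {"alert": alert["alert_name"], "findings": findings, "category": "false_positive"}

def triage_alert(alert: dict) -> dict:
    if maintenance_check(alert["host_id"]):
        return {"alert": alert["alert_name"], "category": "maintenance"}
    findings = run_diagnostics(alert)
    if needs_deeper_metrics(findings):
        findings.append("cloud metric analysis: no anomalies")
    return categorize(alert, findings)

print(triage_alert({"alert_name": "InstanceDown", "host_id": "test-instance-0.example.com"}))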
Alert triage agent installation
To reproduce the experiments in this section, clone the NeMo Agent Toolkit repository and install the Alert Triage Agent Example by following the instructions in the example README.
Alert triage agent workflow
The workflow and evaluation are driven by a YAML configuration file. The configuration files used in this blog are available in the NeMo Agent Toolkit repository for reference.
Snippet of the configuration file showing the Alert Triage Agent workflow and tools:
functions:
  hardware_check:
    _type: hardware_check
    llm_name: tool_reasoning_llm
    offline_mode: true
  host_performance_check:
    _type: host_performance_check
    llm_name: tool_reasoning_llm
    offline_mode: true
  monitoring_process_check:
    _type: monitoring_process_check
    llm_name: tool_reasoning_llm
    offline_mode: true
  network_connectivity_check:
    _type: network_connectivity_check
    llm_name: tool_reasoning_llm
    offline_mode: true
  telemetry_metrics_host_heartbeat_check:
    _type: telemetry_metrics_host_heartbeat_check
    llm_name: tool_reasoning_llm
    offline_mode: true
    metrics_url: http://your-monitoring-server:9090 # Replace with your monitoring system URL if running in live mode
  telemetry_metrics_host_performance_check:
    _type: telemetry_metrics_host_performance_check
    llm_name: tool_reasoning_llm
    offline_mode: true
    metrics_url: http://your-monitoring-server:9090 # Replace with your monitoring system URL if running in live mode
  telemetry_metrics_analysis_agent:
    _type: telemetry_metrics_analysis_agent
    tool_names:
      - telemetry_metrics_host_heartbeat_check
      - telemetry_metrics_host_performance_check
    llm_name: telemetry_metrics_analysis_agent_llm
  maintenance_check:
    _type: maintenance_check
    llm_name: maintenance_check_llm
    static_data_path: examples/advanced_agents/alert_triage_agent/data/maintenance_static_dataset.csv
  categorizer:
    _type: categorizer
    llm_name: categorizer_llm

workflow:
  _type: alert_triage_agent
  tool_names:
    - hardware_check
    - host_performance_check
    - monitoring_process_check
    - network_connectivity_check
    - telemetry_metrics_analysis_agent
  llm_name: ata_agent_llm
W&B Weave configuration
To enable exporting workflow data to the W&B Weave server, the following configuration is added to the Alert Triage Agent YAML file:
general:
  use_uvloop: true
  telemetry:
    tracing:
      weave:
        _type: weave
        project: "nat-ata"
Alert triage agent evaluation
Evaluation configuration:
eval:
  general:
    output_dir: .tmp/nat/examples/advanced_agents/alert_triage_agent/output/llama_31/
    workflow_alias: alert_triage_agent_llama_31_8b
    dataset:
      _type: json
      # JSON representation of the offline CSV data (including just the alerts, the expected output, and the label)
      file_path: examples/advanced_agents/alert_triage_agent/data/offline_data.json
    profiler:
      base_metrics: true
  evaluators:
    rag_accuracy:
      _type: ragas
      metric: AnswerAccuracy
      llm_name: nim_rag_eval_llm
    classification_accuracy:
      _type: classification_accuracy
The alert triage agent is evaluated using a labeled dataset and multiple scoring dimensions.
Dataset
The evaluation dataset simulates production-like alerting conditions to test the agent’s reasoning and decision-making in a controlled environment. Each data point includes:
- A structured alert input in JSON format
- Mocked API responses that resemble real-world diagnostics (such as IPMI hardware status, network ping results, and CPU telemetry)
- A ground truth label indicating the correct root cause category for the alert
[{"id": "0","question": "{ \"alert_id\": 0, \"alert_name\": \"InstanceDown\", \"host_id\": \"test-instance-0.example.com\", \"severity\": \"critical\", \"description\": \"Instance test-instance-0.example.com is not available for scrapping for the last 5m. Please check: - instance is up and running; - monitoring service is in place and running; - network connectivity is ok\", \"summary\": \"Instance test-instance-0.example.com is down\", \"timestamp\": \"2025-04-28T05:00:00.000000\" }","answer": "This alert is a false positive.","label": "false_positive"},{"id": "1","question": "{ \"alert_id\": 1, \"alert_name\": \"InstanceDown\", \"host_id\": \"test-instance-1.example.com\", \"severity\": \"critical\", \"description\": \"Instance test-instance-1.example.com is not available for scrapping for the last 5m. Please check: - instance is up and running; - monitoring service is in place and running; - network connectivity is ok\", \"summary\": \"Instance test-instance-1.example.com is down\", \"timestamp\": \"2025-04-28T05:00:00.000000\" }","answer": "The root cause of this alert is a hardware issue.","label": "hardware"},
The dataset is designed to cover a wide range of root cause scenarios, enabling both accuracy benchmarking and failure mode analysis. It supports reproducible evaluation of the agent’s full decision pipeline—from parsing inputs and invoking tools to generating a final classification.
Evaluators
To evaluate the agent’s outputs meaningfully, we use two complementary evaluation methods:
- LLM-based answer accuracy: Measures whether the agent’s response matches the expected answer. This evaluator is useful when the output is unstructured or partially subjective, such as in free-form explanations.
- Label-based classification prediction accuracy: A rule-based evaluator that extracts the predicted root cause category from the generated report and compares it directly to the ground truth label. It outputs a binary accuracy score (1.0 for a match, 0.0 for a mismatch), along with reasoning for interpretability.
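As an illustration of how the second evaluator works (a minimal sketch; the category list and extraction logic are assumptions, not the toolkit's built-in classification_accuracy implementation):
# Illustrative rule-based classification scorer (assumed category names).
CATEGORIES = ["false_positive", "hardware", "network_connectivity", "software"]

def classification_score(report: str, label: str) -> dict:
    # Extract the predicted root cause category from the generated report
    # and compare it directly to the ground-truth label.
    text = report.lower().replace(" ", "_")
    predicted = next((c for c in CATEGORIES if c in text), None)
    score = 1.0 if predicted == label else 0.0
    return {"score": score, "reasoning": f"predicted={predicted!r}, expected={label!r}"}

print(classification_score("Root cause classification: hardware failure on the host.", "hardware"))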
Running evaluation
With the dataset and evaluation framework in place, running an end-to-end test is straightforward. The agent can be evaluated in offline mode using this command:
nat eval --config_file=examples/advanced_agents/alert_triage_agent/configs/config_offline_llama_31.yml
This loads alerts and corresponding labels from the JSON evaluation dataset and executes the full workflow using mock API responses for the tools. The workflow outputs and evaluation results are exported to Weave for further analysis and visualization.
Alert triage agent experiments: Comparing models
In this section, we evaluate the Alert Triage Agent workflow using two different large language models and compare the results.
To reproduce the experiments, run the following:
Run with meta/llama-3.1-8b-instruct:
nat eval --config_file=examples/advanced_agents/alert_triage_agent/configs/config_offline_llama_31.yml
Run with meta/llama-3.3-70b-instruct:
nat eval --config_file=examples/advanced_agents/alert_triage_agent/configs/config_offline_llama_33.yml
Viewing the results with W&B Weave
As the workflow runs, you will find a Weave URL (starting with a 🍩 emoji). Click on the URL to access your logged trace timeline. Select the Eval tab to view the evaluation results.
Eval summary
This tab shows the aggregate results across all entries in the dataset:

The results show a significant improvement in accuracy across the two runs for both evaluators: classification_accuracy and rag_accuracy.
In addition to accuracy metrics, the Summary tab includes several key performance indicators:
- wf_runtime_p95: 95th percentile workflow runtime for a dataset entry
- llm_latency_p95: 95th percentile latency of LLM calls for a dataset entry
- total_runtime: Total runtime for evaluating the entire dataset. Since entries were processed in parallel, with a max concurrency of 8, the total_runtime is close to wf_runtime_p95.
- Total Tokens: Total number of tokens used across all entries
Eval individual dataset results
The Results tab shows accuracy and performance metrics for each dataset entry. You can view and plot these metrics across both runs to compare agent behavior at the example level.

Workflow Traces
The Traces tab provides a detailed breakdown of each dataset entry's execution, including tool calls, LLM responses, and intermediate reasoning steps.

The Flamegraph page visualizes the execution time of each call within the workflow, making it easy to identify performance bottlenecks across tool invocations and LLM steps.

Evaluation demo
The experiments with the alert triage agent show how we can evaluate and visualize the performance of agentic systems using a labeled dataset and automated evaluators. However, not every use case has a clear ground truth or labels to measure against. In many domains, especially research-oriented tasks, outputs are long-form, interpretive, and sometimes subjective.
In such cases, programmatic or LLM-based accuracy metrics are insufficient, and evaluation relies more heavily on expert review and feedback. The next example, the AI-Q deep research assistant, illustrates how human annotation can complement automated metrics, with W&B Weave providing the structure to make the process seamless.
AI-Q deep research assistant
The AI-Q research assistant is an intelligent research copilot built using the NVIDIA NeMo Agent Toolkit. It helps automate literature review, information synthesis, and context-driven reasoning for research workflows. By chaining together specialized tools and reasoning steps, it goes beyond simple question answering to orchestrate structured, multi-step research tasks.
A typical workflow involves tasks such as:
- Document retrieval and summarization: The assistant retrieves relevant papers, reports, or documents from enterprise sources via the NVIDIA RAG Blueprint and, optionally, web search (Tavily), then produces succinct summaries.
- Comparative synthesis: It analyzes findings across multiple sources and creates structured comparisons with citations, highlighting both the alignment and differences in results.
- Iterative research conversation: It maintains context across turns, enabling the refinement of queries, exploration of hypotheses, and targeted follow-ups.
- Markdown-structured reporting: It generates a final, source-attributed report organized according to a specified outline, including a consolidated bibliography.
Under the hood, the assistant plans the report, searches in parallel across sources with an LLM-as-a-judge for relevance, writes, reflects to identify gaps, and iterates until coverage is sufficient.
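A simplified sketch of that loop (illustrative only; the inline stubs and stopping criterion are hypothetical, not the blueprint's actual code):
# Hypothetical sketch of the plan -> search -> write -> reflect loop described above.
# All steps are inline stubs; the real blueprint uses RAG, web search, and LLM calls.
def run_research(topic: str, max_iterations: int = 3) -> str:
    outline = [f"Background on {topic}", f"Key findings on {topic}"]  # stub: plan the report
    report = ""
    gaps = outline                                   # everything is "missing" at the start
    for _ in range(max_iterations):
        if not gaps:                                 # stop once reflection finds no gaps
            break
        queries = [f"{topic}: {gap}" for gap in gaps]          # targeted search queries
        results = [f"[source] notes on {q}" for q in queries]  # stub: parallel search + LLM relevance judge
        report += "\n".join(results) + "\n"                    # stub: write/extend the report
        gaps = []                                              # stub: reflect to identify remaining gaps
    return report

print(run_research("cystic fibrosis treatments"))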

Tracing and evaluating with W&B Weave
Evaluation for the AI-Q Research Assistant utilizes the same configuration-driven setup shown earlier with the Alert Triage Agent. Runs are launched from a YAML configuration, and all workflow traces are logged to Weave for inspection. What’s different here is the use of custom, research-specific evaluation metrics:
- Coverage: Does the report capture all key facts from the ground truth?
- Synthesis: Does it integrate multiple sources in a meaningful way, showing alignment or differences?
- Hallucination: Does the output introduce any unsupported claims?
- Citation quality: Are references correctly attributed and verifiable?
- RAGAS metrics: Do retrieval and factuality hold up across context relevance, answer accuracy, and groundedness?
Because Weave tracks our evaluator outputs alongside per-example traces, it’s easy to compare these metrics across runs, correlate failures to specific steps (e.g., retrieval vs. synthesis), and configure evaluations per project (e.g., swap templates, thresholds, or weightings) while preserving full provenance.
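As a concrete example of what one of these custom metrics might look like when logged through Weave (a sketch under assumptions; this is not the blueprint's actual coverage evaluator):
import weave

@weave.op()
def coverage_score(report: str, key_facts: list[str]) -> dict:
    # Fraction of ground-truth key facts that appear in the generated report.
    found = [fact for fact in key_facts if fact.lower() in report.lower()]
    return {
        "coverage": len(found) / max(len(key_facts), 1),
        "missing": [fact for fact in key_facts if fact not in found],
    }

weave.init("nat-bp-project")  # project name from the config shown later in this post
coverage_score(
    report="Trikafta improved lung function across the reviewed trials...",
    key_facts=["Trikafta", "lung function", "sweat chloride"],
)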

Human annotations with W&B Weave
For research-oriented agents like AI-Q, automated metrics alone rarely tell the full story. Many outputs require expert judgment, evaluating nuance, clarity, or usefulness in ways that metrics can’t fully capture. To address this, we use the Weave API to attach human annotations directly to runs. These annotations serve as arbitrary tags that are written alongside each run’s metrics and traces, enabling instant filtering and grouping on the dashboard without the need to manually open individual runs.
A recommended set of annotations is included in the eval config to enable consistent filtering and comparison of runs in the Weave dashboard.
telemetry:
  logging:
    console:
      _type: console
      level: DEBUG
  tracing:
    weave:
      _type: weave
      project: "nat-bp-project" # Name of the project in weave, runs will be grouped under this project name
      # Custom attributes that will become human annotation scorers in Weave UI (might throw an error if you are creating a new project with custom attributes)
      custom_attributes:
        dataset_version: "standard cystic fibrosis dataset"
        release_version: "1.0.0"
        evaluation_purpose: "Eval Purpose here"
        git_commit_hash: "1234567890"
        deployed_instance: "local"
        # Example arbitrary tags - these will become human annotation scorers
        evaluation_rating: 4 # Number input (1-5 scale)
        needs_review: true # Boolean checkbox
These fields appear as human annotation scorers and filter controls in the UI, allowing researchers and developers to quickly slice runs by dataset version, release version, or flags such as 'needs_review'.
Because annotations are attached at log time through the API, every run is consistently tagged with provenance (e.g., git_commit_hash, deployed_instance) and subjective scorers (e.g., evaluation_rating), which makes side-by-side comparisons much easier, especially in a team setting where runs created by others also populate the project space.
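In addition to the log-time custom attributes shown above, the Weave Python SDK exposes a feedback API that can attach notes and scores to a specific logged call after the fact. A rough sketch (the call ID is a placeholder, and exact method availability may vary by Weave version):
import weave

client = weave.init("nat-bp-project")

# Fetch a previously logged call by its ID (copy it from the Weave UI; placeholder shown here).
call = client.get_call("<call-id-from-the-weave-ui>")

# Attach human feedback; it appears alongside the call's traces and metrics in the dashboard.
call.feedback.add_note("Citations verified; synthesis section needs another review pass.")
call.feedback.add("evaluation_rating", {"value": 4})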
Combined with the logged evaluator outputs and per-example traces, this run-level annotation pattern makes it simple to:
- Filter by purpose, dataset version, or environment to isolate comparable runs.
- Correlate regressions or improvements to specific code commits or deployments.
- Standardize human review workflows (e.g., triage all needs_review=true runs) without manual inspection.
Human annotations on the Weave dashboard:

Note: Human annotation in AI-Q is an experimental feature and will be available in the NeMo Agent Toolkit repository in an upcoming release.
Conclusion
Agentic workflows offer powerful capabilities, but their complexity demands thoughtful evaluation. The NVIDIA NeMo Agent Toolkit provides the foundation for designing and running agents across diverse domains, while W&B Weave adds the visibility needed to understand and improve them. Together, they form a powerful loop for experimentation:
- In structured environments, such as the alert triage agent, you can benchmark workflows against labeled datasets, compare models head-to-head, and diagnose performance bottlenecks.
- In open-ended settings such as the AI-Q research assistant, evaluation extends beyond accuracy to include coverage, synthesis, hallucination, and citation quality, while human annotations provide the nuanced judgment that automated metrics cannot capture.
Whether your agents handle precise diagnostics or interpretive research, this combination enables you to iterate quickly, debug effectively, and build systems you can trust. Try out the alert triage agent example, explore the AI-Q research assistant blueprint, or apply the same approach to your own NeMo Agent Toolkit workflows, and use W&B Weave to see not just what your agent produces, but also how it gets there.
Authors
Weights & Biases
Ayush Thakur is a Manager of AI Engineering at Weights & Biases, where he primarily leads open-source integration efforts. He has co-created wandbot, W&B’s LLM-powered Q&A chatbot, and has co-created two courses focused on RAG and Evaluations.
NVIDIA
Anuradha Karuppiah is a Principal Software Engineer at NVIDIA, where she develops agentic AI systems and is a maintainer of the NeMo Agent Toolkit. Previously, she was a Linux developer focused on datacenter software and served as a maintainer for open source networking projects, including Free Range Routing.
Hsin Chen is a Senior Data Scientist at NVIDIA, where she develops AI solutions for cybersecurity. Her current work focuses on applying large language model agents to automate threat analysis and support security operations. Previously, she built machine learning and deep learning systems for cyber threat detection. Her work contributes to open-source projects such as NVIDIA Morpheus and the NeMo Agent Toolkit.
Kyle Zheng is an intern at NVIDIA, where he develops agentic AI solutions primarily for the AI-Q Research Assistant, NeMo Agent Toolkit, and NV-Ingest. He is currently in his second internship at NVIDIA and is pursuing a master’s degree at Clemson University.