
Evaluating Google ADK Agents with W&B Weave for reliable insurance workflows

This article provides a practical, end-to-end walkthrough of building, testing, and evaluating an AI insurance agent using the Agent Development Kit (ADK) and W&B Weave.
In the insurance industry, artificial intelligence agents are increasingly responsible for complex workflows, from answering customer inquiries to automating claims processing. However, evaluating these agents requires more than traditional “pass/fail” unit testing. Insurance agents rely on probabilistic models and tool orchestration, so they require nuanced, multi-step assessments to ensure correctness, reliability, and regulatory compliance.
Google's Agent Development Kit (ADK) framework addresses this need with a modular, scalable approach tailored for evaluating AI agents’ behavior and final outputs. When paired with W&B Weave’s powerful tracking and analytics, ADK enables organizations to monitor agent performance, streamline testing, and continuously improve systems at scale.
In this article, we'll be building just such an agent. Let's jump in.


Understanding ADK and its role in agent evaluation

The Agent Development Kit is a modular framework designed to help AI teams build, manage, and evaluate advanced AI agents. With ADK, developers can define various types of agents, orchestrate complex workflows, and integrate memory, tools, and logic to enable rich, intelligent behaviors. ADK's structure is especially valuable for applications where agents need to perform multi-step tasks, interact with external systems, and produce explainable, auditable outcomes.
One of ADK’s core strengths is its built-in evaluation system. This system enables users to create structured scenarios that test not only what an agent outputs, but also how it reasons and acts throughout a task. ADK supports two primary methods for agent evaluation: test files, which enable you to check single interactions or behaviors (like unit tests), and evalset files, which capture more complex, multi-step workflows for broader assessments. These tools help ensure that AI agents behave as expected in real-world situations, making it easier to identify and resolve issues early.
ADK offers a web interface for interactively building, editing, and running these test cases. It also supports code-based and automated workflows, allowing teams to integrate agent evaluation into their development pipelines. Using W&B Weave makes it easy to store, track, and visualize evaluation results over time, allowing you to understand how changes to your system impact performance and reliability. Using ADK and Weave together provides both rigorous test scaffolding and continuous visibility into how agents perform in real-world scenarios.

Key differences between evaluating traditional software and AI agents

Evaluating traditional software typically relies on deterministic, rule-based logic. Given the same input, the program always produces the same output. As a result, testing focuses on unit and integration tests with clear “pass/fail” outcomes, emphasizing code correctness, coverage, and performance under well-defined conditions.
In contrast, evaluating AI agents presents unique challenges. These agents are inherently probabilistic. Given the same input, they may produce multiple valid (or invalid) outputs, each influenced by context, prior conversation, and even randomness due to sampling temperature. Their responses are not just a direct function of code but a reflection of learned behaviors and reasoning patterns.
This fundamental difference means that evaluating LLM-based agents cannot rely solely on traditional assertions or binary pass/fail checks. Instead, effective assessment requires a blend of qualitative and quantitative methods:
  • Qualitative evaluation: It is critical to analyze not only what the agent answered but also how it arrived at its answer. This involves reviewing the sequence of actions (the "trajectory"), decisions on tool usage, and the logical path taken. For industries like insurance, understanding this decision-making process is crucial for accuracy, trustworthiness, and compliance.
  • Flexible output criteria: Since there may be multiple acceptable responses, evaluation should allow for partial credit, fuzzy matching, or variations in the correct answers.
  • Attention to reasoning and edge cases: LLMs may generalize or make unexpected inferences. Evaluations must test the quality of the agent's reasoning, its ability to handle ambiguity, and its adherence to established guidelines or constraints.
Overall, evaluating AI agents shifts the emphasis from simply checking final outputs to thoroughly examining both the process and the results. Agent evaluation becomes more nuanced and context-dependent, which requires specialized tools like ADK that can capture these complexities and support continuous improvement.

The importance of defining clear objectives and success criteria

Defining clear objectives and success criteria is crucial before automating the evaluation of AI agents. By establishing upfront what success looks like, you create a structured process for assessing agent performance that aligns with your business needs. These foundations help you identify critical tasks, determine which agent behaviors are most important, and establish clear benchmarks for acceptable performance.
It can be challenging to determine whether changes, such as a new model version or prompt modification, actually lead to better results. Without specific goals and measurable success criteria, your evaluation may become subjective or inconsistent. You may observe different outputs, but struggle to determine whether the system has truly improved, regressed, or simply changed.
This is why it is also crucial to track evaluation metrics over time. Continuous tracking enables you to visualize trends, assess whether your agent is improving or deteriorating, and understand the actual impact of your updates in the real world. Persistent record-keeping also helps pinpoint when performance shifts occurred and what might have caused them, providing valuable insights for both debugging and future iterations.
Tools like W&B Weave make this process both streamlined and reliable. Weave offers robust evaluation workflows, allowing you to log predictions, scores, and summaries directly from your code. You can store, query, and compare evaluation results across experiments, model versions, and changes to prompts. The platform provides dashboards, trace views, and leaderboards, allowing you to drill into performance details, spot trends, and share results with your team. By pairing clear objectives and continuous tracking with tools like Weave, you ensure your agent evaluations are not only rigorous and meaningful but also actionable and repeatable as your system evolves.

How ADK facilitates agent evaluation

The Agent Development Kit is a toolkit designed for evaluating the end-to-end behavior of AI agents, particularly in complex settings such as insurance. Unlike ordinary application testing, agent evaluation with ADK considers not only what the agent outputs, but also how it arrives at its answer, tracking the steps it takes, the tools it uses, and whether it follows important business rules. ADK offers a flexible, schema-backed system for defining test cases and evaluation scenarios, and makes it simple to organize and manage these using an interactive web interface.
ADK supports two main methods for agent evaluation: test files and evalset files. With test files, you can define focused sessions for unit testing, providing quick feedback on individual agent behaviors; with evalset files, you can capture larger, multi-turn sessions that reflect real-world workflows, making it possible to test more complex, multi-step scenarios. Both methods are supported in the ADK web UI, which enables you to create, edit, and run evaluations interactively. ADK also supports programmatic and command-line workflows for automation and integration into CI pipelines.
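For the command-line route, the ADK CLI exposes an eval command. The invocation below is a sketch with placeholder paths; exact arguments and flags may differ slightly between ADK versions, so check the output of adk eval --help for your install:

# Evaluate an agent module against an eval set from the command line (paths are placeholders)
adk eval path/to/my_agent path/to/my_tests.evalset.json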
By supporting evaluation at various levels of detail, ADK enables teams to identify issues early, track changes across agent versions, and ensure that agents behave as expected, not just in simple cases, but also in complex, multi-step scenarios.

Benefits and challenges of using test files versus evalset files

Test files in ADK are ideal for unit testing. Each one captures a single agent session, making it easy to check isolated behaviors or simple interactions as you iterate on your agent’s logic. The benefit here is speed: these tests run quickly and can be easily integrated into development workflows, allowing you to spot regressions or mistakes immediately. The main challenge is that test files are best for straightforward cases; they don’t capture the complexity of full production workflows.
Evalset files, on the other hand, are well-suited for broader, integration-level testing. By collecting multiple, potentially lengthy sessions, they can simulate the real-world complexity your agent needs to handle, including multi-step conversations or intricate decision sequences. This makes evalset files ideal for regression testing, ensuring that updates don’t break existing workflows. However, creating and maintaining evalset files can be more involved and time-consuming, particularly as your use cases and datasets grow.
Key fields typically found in an evalset file:
  • A top-level object or array containing multiple eval cases
  • For each case:
    • a conversation: the sequence of user and agent messages, optionally with expected intermediate data
    • the expected tool usage (tool_uses) for each agent response
    • optional session metadata, such as session_input with user and app identifiers
Here’s a minimal example of an evalset file:
[
  {
    "eval_id": "case001",
    "conversation": [
      {
        "role": "user",
        "user_content": {
          "parts": [
            {"text": "I'd like to file a claim for my lost luggage."}
          ]
        }
      },
      {
        "role": "agent",
        "final_response": {
          "parts": [
            {"text": "Sure, I can help you file a claim for your lost luggage. Can you provide more details about your trip?"}
          ]
        },
        "intermediate_data": {
          "tool_uses": [
            {
              "name": "claim_intake",
              "args": {
                "type": "lost_luggage"
              }
            }
          ]
        }
      }
    ],
    "session_input": {
      "user_id": "U123",
      "app_name": "insurance_demo"
    }
  },
  {
    "eval_id": "case002",
    "conversation": [
      {
        "role": "user",
        "user_content": {
          "parts": [
            {"text": "When is my next premium payment due?"}
          ]
        }
      },
      {
        "role": "agent",
        "final_response": {
          "parts": [
            {"text": "Your next payment is $85.35, due July 15th."}
          ]
        },
        "intermediate_data": {
          "tool_uses": [
            {
              "name": "fetch_payment_due",
              "args": {
                "policy_id": "P10005"
              }
            }
          ]
        }
      }
    ]
  }
]
Both approaches require maintaining up-to-date evaluation criteria, aligning expected tool use and responses with your current agent logic, and periodically revisiting tests to ensure continued relevance as your agents evolve.

ADK's built-in evaluation feature

ADK’s built-in evaluation feature provides powerful tools and a clear workflow for assessing agent performance. Through the ADK web UI, developers can generate, edit, and organize test cases or eval sets, run evaluations, and immediately visualize where agent actions or answers deviate from expectations. The platform supports customizing evaluation metrics, such as tool usage accuracy and response similarity, so teams can tailor assessments to their precise needs.
For those who prefer automation or programmatic control, ADK also supports command-line and code-based evaluation. This makes it easy to include agent testing as part of routine development and CI/CD deployments. Flexible configuration files and detailed output reporting enable in-depth analysis of failure cases and provide actionable feedback.
By combining interactive tools and automation options, ADK’s evaluation features ensure that AI agents are rigorously tested, that results are understandable and auditable, and that teams have a solid foundation for tracking and improving agent reliability. When paired with tracking platforms like Weave, ADK’s results can be stored, visualized, and compared over time, supporting both immediate debugging and long-term quality assurance.
When you run evaluations in ADK, several core metrics are calculated to give a comprehensive view of your agent’s performance. First, tool usage accuracy measures whether the agent called the right tools, in the expected order, and with the correct arguments. This metric reflects how closely the system’s internal actions align with your intended workflow.
Next, the response similarity metric assesses the closeness of the agent’s final response to a human-provided reference answer, utilizing standard language similarity metrics, such as ROUGE-1. This provides a direct indication of whether the agent is generating responses that align with business or user expectations.
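To build intuition for what a ROUGE-1-style check measures, here is a simplified sketch of unigram overlap scoring. This is an illustration only, not ADK's actual scoring code:

def unigram_f1(candidate: str, reference: str) -> float:
    # Simplified ROUGE-1-style score: clipped unigram overlap between candidate and reference
    cand = candidate.lower().split()
    ref = reference.lower().split()
    overlap = sum(min(cand.count(w), ref.count(w)) for w in set(cand))
    if not cand or not ref or overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(unigram_f1("Your next payment is $85.35, due July 15th.",
                 "Your next premium payment of $85.35 is due on July 15th."))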
In some configurations, ADK also supports an LLM-based coherence score, where another large language model judges the agent’s response for overall clarity, correctness, and logical soundness.
Each metric can be customized or required at different thresholds, enabling teams to establish clear quality bars for both internal task execution and user-facing output. These quantitative measurements, together with ADK’s flexible tools for setup and reporting, provide the backbone for both rapid agent development and ongoing production assurance.
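In practice, these thresholds live in a small configuration file alongside your tests. The sketch below uses the criteria format described in ADK's documentation (tool_trajectory_avg_score and response_match_score); treat the exact key names as version-dependent and verify them against your ADK install:

{
  "criteria": {
    "tool_trajectory_avg_score": 1.0,
    "response_match_score": 0.8
  }
}

A tool trajectory score of 1.0 effectively requires the expected tool calls to match exactly, while a response match threshold below 1.0 allows some variation in wording.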

Integrating W&B Weave for enhanced evaluation

Integrating W&B Weave with ADK provides powerful new capabilities for tracking, analyzing, and visualizing your agent evaluations. While ADK’s built-in evaluation framework supports flexible testing and well-defined metrics, it does not include out-of-the-box integration with Weave for visualizing results. To address this, we implemented a custom evaluator class that mimics the standard ADK workflow, but automatically logs rich evaluation results to Weave. Feel free to check out the source code in the GitHub repo here.
Using this integration, each test run captures and stores detailed metrics, including whether the agent used the correct tools, tool recall (i.e., the proportion of required tools used), and a custom LLM-based judgment for response quality and correctness. For every test case, Weave records the ground-truth tools and answers, the agent's output, LLM-based similarity assessments, and summary statistics, such as overall tool coverage.
With your results stored in Weave, you can visualize agent performance over time, drill into failure cases, compare experiment runs, and create leaderboards of different agent versions or prompt strategies. This tighter feedback loop uncovers trends and subtle regressions that might be missed by reviewing logs alone, and provides a single source of truth for tracking improvements as you iterate on your agents.
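To make that concrete, here is a minimal sketch of how per-case metrics could be logged with Weave's evaluation API. The dataset fields, the tool_recall scorer, and the run_agent stub are illustrative placeholders rather than the actual LiteAgentEvaluator implementation from the repo, and depending on your Weave version the scorer's output argument may be named model_output:

import asyncio
import weave

weave.init("adk-insurance-eval")  # assumed project name

# Hypothetical dataset rows: user query, expected tools, and a reference answer
examples = [
    {"input": "When is my next premium payment due?",
     "gt_tools": ["premium_payment_system"],
     "reference": "Your next payment is $120.00, due May 5th."},
]

@weave.op()
def tool_recall(gt_tools: list, output: dict) -> float:
    # Fraction of expected tools that the agent actually called
    used = set(output.get("agent_tools", []))
    return len(used & set(gt_tools)) / max(len(gt_tools), 1)

@weave.op()
def run_agent(input: str) -> dict:
    # Placeholder for running the ADK agent and collecting its tool calls and final answer
    return {"agent_tools": ["premium_payment_system"],
            "answer": "Your next payment is $120.00, due May 5th."}

evaluation = weave.Evaluation(dataset=examples, scorers=[tool_recall])
asyncio.run(evaluation.evaluate(run_agent))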
By combining ADK’s rigorous evaluation structure with Weave’s analytics and visualization tools, your team gains deeper insights into agent behavior and can confidently transition from prototype to production.

Step-by-step tutorial for using ADK with W&B Weave

To set up a robust agent evaluation scenario, the first step is to construct a comprehensive, lifelike insurance agent using Google’s Agent Development Kit. The goal is to create a modern customer support system that can handle a diverse range of insurance tasks and queries, including policy lookup, claims tracking, quoting new coverage options, and answering frequently asked questions.
Instead of connecting to real backend systems, I created a detailed mock database that represents insurance data across multiple domains, including auto, health, homeowners, and life insurance. This database contains sample policies, claims, premium payments, customer appointments, FAQ entries, repair shop locations, and insurance quotes. By using a static dataset, the agent can simulate core workflows in a controlled and repeatable environment, making it ideal for evaluation and testing.
Here’s the database I used for my agent:
{
  "policy_lookup": [
    {"policy_id":"P100001","customer_id":"C20001","type":"Auto","start":"2022-05-01","end":"2023-05-01","coverage":{"liability":50000,"collision":1000,"deductible":500},"exclusions":["rental cars"],"status":"active"},
    ...
  ],
  "claims_status_checker": [
    {"claim_id":"CL5001","policy_id":"P100001","date_filed":"2023-04-15","status":"pending","type":"collision","adjuster":"Alex Wu","next_step":"awaiting police report"},
    ...
  ],
  "coverage_calculator": [
    {"policy_id":"P100001","option":"raise liability","change":25000,"new_premium":103.50,"new_deductible":500},
    ...
  ],
  "premium_payment_system": [
    {"invoice_id":"INV3001","policy_id":"P100001","due_date":"2023-05-05","amount_due":120.00,"status":"unpaid"},
    ...
  ],
  "appointment_schedule_checker": [
    {"appointment_id":"A4101","customer_id":"C20001","type":"callback","date":"2023-05-06T10:00:00","status":"scheduled"},
    ...
  ],
  "faq_search": [
    {"faq_id":"F001","question":"How can I file a claim?","answer":"You can file a claim online, by phone, or with your agent. Have your policy and event details ready."},
    ...
  ],
  "find_nearby_repair_shop": [
    {"shop_id":"R01","name":"AutoFix Pros","zip":"60636","approved":true,"phone":"773-555-1010"},
    ...
  ],
  "insurance_quote_data": [
    {"quote_id":"Q3001","customer_id":"C20001","coverage":"auto-basic","premium":109.50,"deductible":500,"valid_until":"2023-06-01"},
    ...
  ],
  "customer_profile_lookup": [
    {"customer_id":"C20001","name":"Jane Smith","email":"jsmith@email.com","phone":"555-4410","dob":"1982-03-01"},
    ...
  ]
}
Here is the agent code:
import json
from typing import Any, Dict, List, Optional

# ---- Load data ----
with open("insurancedata.json", "r", encoding="utf-8") as f:
    INSURANCE_DATA = json.load(f)

def policy_lookup(policy_id: str) -> Optional[Dict[str, Any]]:
    return next((p for p in INSURANCE_DATA['policy_lookup'] if p['policy_id'] == policy_id), None)

def customer_policies(customer_id: str, status: Optional[str] = None) -> List[Dict[str, Any]]:
    return [
        p for p in INSURANCE_DATA['policy_lookup']
        if p['customer_id'] == customer_id and (status is None or p['status'] == status)
    ]

def claims_status_checker(policy_id: str, status: Optional[str] = None) -> List[Dict[str, Any]]:
    return [
        c for c in INSURANCE_DATA['claims_status_checker']
        if c['policy_id'] == policy_id and (status is None or c['status'] == status)
    ]

def coverage_calculator(policy_id: str, option: Optional[str] = None) -> List[Dict[str, Any]]:
    return [
        cc for cc in INSURANCE_DATA['coverage_calculator']
        if cc['policy_id'] == policy_id and (option is None or cc['option'] == option)
    ]

def premium_payment_system(policy_id: str) -> List[Dict[str, Any]]:
    return [
        i for i in INSURANCE_DATA['premium_payment_system']
        if i['policy_id'] == policy_id
    ]

def appointment_schedule_checker(customer_id: str) -> List[Dict[str, Any]]:
    return [
        a for a in INSURANCE_DATA['appointment_schedule_checker']
        if a['customer_id'] == customer_id
    ]

def faq_search(query: str, topk: int = 1) -> List[Dict[str, Any]]:
    matches = [f for f in INSURANCE_DATA['faq_search'] if query.lower() in f['question'].lower()]
    return matches[:topk]

def find_nearby_repair_shop(zip_code: str, approved_only: bool = True, topk: int = 3) -> List[Dict[str, Any]]:
    shops = [s for s in INSURANCE_DATA['find_nearby_repair_shop'] if s['zip'] == zip_code]
    if approved_only:
        shops = [s for s in shops if s['approved']]
    return shops[:topk]

def insurance_quote_data(customer_id: str) -> List[Dict[str, Any]]:
    return [
        q for q in INSURANCE_DATA['insurance_quote_data']
        if q['customer_id'] == customer_id
    ]


agent_prompt = """
policy_lookup(policy_id)
# Use when you need all details for a specific policy using its policy_id (such as viewing coverages, start/end, type, etc).
print(policy_lookup("P100001"))
# Returns:
# {'policy_id': 'P100001', 'customer_id': 'C20001', ..., 'status': 'active'}

customer_policies(customer_id, status=None)
# Use when you want all policies (optionally filtered by status, e.g. 'active') for a given customer_id.
print(customer_policies("C20001", status="active"))
# Returns:
# [{'policy_id': 'P100001', ...}]

claims_status_checker(policy_id, status=None)
# Use when you need all claims for a given policy. Optionally filter by claim status ('pending', 'approved', etc).
print(claims_status_checker("P100001", status="pending"))
# Returns:
# [{'claim_id': 'CL5001', 'policy_id': 'P100001', 'status': 'pending', ...}]

coverage_calculator(policy_id, option=None)
# Use when you want to see alternate coverage/premium options for a policy, like how much premium changes if you change something.
print(coverage_calculator("P100001"))
# Returns:
# [{'policy_id': 'P100001', 'option': 'raise liability', 'change': 25000, 'new_premium': 103.50, 'new_deductible': 500}]

premium_payment_system(policy_id)
# Use when you need to see billing or payment info for a policy (invoices, due dates, amounts).
print(premium_payment_system("P100001"))
# Returns:
# [{'invoice_id': 'INV3001', 'policy_id': 'P100001', 'due_date': '2023-05-05', 'amount_due': 120.00, 'status': 'unpaid'}]

appointment_schedule_checker(customer_id)
# Use to list all scheduled or completed appointments (like adjuster visits or callbacks) for a customer.
print(appointment_schedule_checker("C20001"))
# Returns:
# [{'appointment_id': 'A4101', 'customer_id': 'C20001', 'type': 'callback', 'date': '2023-05-06T10:00:00', 'status': 'scheduled'}]

faq_search(query, topk=1)
# Use to look up answers to common insurance questions by keyword.
print(faq_search("deductible", topk=2))
# Returns:
# [{'faq_id': 'F002', 'question': 'What is a deductible?', 'answer': ...},
# {'faq_id': 'F007', ...}]

find_nearby_repair_shop(zip_code, approved_only=True, topk=3)
# Use when you want to find repair shops near a ZIP code, optionally only approved ones. Useful for auto claims.
print(find_nearby_repair_shop("60636"))
# Returns:
# [{'shop_id': 'R01', 'name': 'AutoFix Pros', ...},
# {'shop_id': 'R02', ...}]

insurance_quote_data(customer_id)
# Use when you wish to view all (recent/past) insurance quotes for a customer.
print(insurance_quote_data("C20001"))
# Returns:
# [{'quote_id': 'Q3001', 'customer_id': 'C20001', 'coverage': 'auto-basic', ...}]
"""

from google.adk.agents import Agent
import os

# Configure environment if needed
os.environ["GOOGLE_CLOUD_PROJECT"] = "dsports-6ab79"
os.environ["GOOGLE_CLOUD_LOCATION"] = "us-central1"
os.environ["GOOGLE_GENAI_USE_VERTEXAI"] = "True"  # or set as needed

root_agent = Agent(
    model="gemini-2.5-pro",
    name="insurance_agent",
    instruction="""
You are an expert insurance assistant. You can answer customer questions and perform insurance-related tasks
using the available tools. For every question, carefully select and use the most relevant tool(s) from your toolkit.
Never guess—return only what is found in the database or tools. Here are your tool instructions and usage examples:

""" + agent_prompt,
    description="An insurance assistant that can answer any insurance query using the provided tools.",
    tools=[
        policy_lookup,
        customer_policies,
        claims_status_checker,
        coverage_calculator,
        premium_payment_system,
        appointment_schedule_checker,
        faq_search,
        find_nearby_repair_shop,
        insurance_quote_data,
    ],
)
The agent itself is configured to use a suite of specialized tools, each corresponding to a different backend function or service that an insurance assistant would rely on in a production setting. These tools include functions to look up policies by ID, retrieve all claims on a given policy, check payment history, schedule appointments, search FAQs, and more. Each tool operates directly on the mock database, ensuring that the agent’s responses always come from the underlying structured data rather than from hallucination or guesswork.
With these building blocks in place, the insurance agent can process complex, multi-step user requests fully autonomously, drawing on a realistic yet safe and manageable dataset. This setup lays the foundation for consistent, high-quality agent evaluation, making it possible to assess both the correctness of the agent’s outputs and the soundness of its tool usage or decision-making path in a manner that mirrors real-world requirements.
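With the agent defined, it's worth a quick smoke test before building any evaluations. Assuming a standard ADK project layout (the package name below is a placeholder for your own agent folder), the ADK CLI lets you chat with the agent directly:

# Run from the parent directory of your agent package (folder name is a placeholder)
adk run insurance_agent   # chat with the agent in the terminal
adk web                   # or launch the interactive dev UI in the browser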

Creating a test set

After setting up the insurance agent and mock environment, I used the ADK web UI to create an initial set of ten test scenarios. These test examples were designed to cover a wide range of realistic insurance queries and workflows, ensuring that both common and edge cases were represented. The web interface makes it straightforward to create and manage each session, allowing me to specify the user inputs, expected agent behavior, and the sequence of tools that should be called to produce a correct response.
While building these tests, I occasionally noticed that the agent didn’t always use all of the necessary tools or sometimes took shortcuts in its decision-making.

In these situations, I iteratively nudged the agent with more explicit instructions, adjusted the prompt, or tweaked the test cases themselves to better guide the agent toward the ideal solution path. This hands-on, interactive evaluation process was especially valuable for uncovering subtle issues in tool usage and ensuring that each test case reflected an optimal, real-world approach to completing the task. After each query and final answer, I used the ADK UI to add my sample to the eval set file.
Through this cycle of rapid prototyping and refinement, I assembled a robust set of test cases that not only verified whether the agent delivered the correct answers but also validated that it employed the appropriate tools and reasoning steps along the way. This set the stage for more systematic evaluation and further improvements down the line.
After iteratively refining my test examples in the ADK web UI, I often ended up with multiple intermediate agent responses and helper prompts in the test cases. These were valuable during the test creation process, as they allowed me to guide the agent step by step toward the correct use of tools and proper reasoning. However, once the agent’s final answers and tool usage were confirmed to be correct, these intermediate helper messages were no longer necessary and could actually clutter the evaluation data.
To streamline and clean up my evaluation dataset, I wrote a script that systematically strips away all non-essential responses from each test case. This script processes each test example, removing any incorrect or intermediate helper responses, and leaves behind only the original user query, the complete set of tool calls made during the session, and the agent’s final correct answer. By condensing each conversation to just these critical elements, the pruned evaluation set becomes easier to interpret and more suitable for reliable, automated testing.
import os
import json

# HARDCODE your input file here:
infile = "/path_to/original_insurance_eval_set.evalset.json"

# Create output filename ending with _cleaned.evalset.json
dirname, basename = os.path.split(infile)
if basename.endswith('evalset.json'):
    core = basename[:-len('evalset.json')].rstrip('.-_')
    outbase = f"{core}_cleaned.evalset.json"
else:
    core = os.path.splitext(basename)[0]
    outbase = f"{core}_cleaned.evalset.json"
outfile = os.path.join(dirname, outbase)

with open(infile, "r", encoding="utf-8") as f:
    data = json.load(f)

for case in data.get("eval_cases", []):
    conv = case.get("conversation", [])
    if not conv:
        continue
    # Collect all tool_uses for this case
    merged_tool_uses = []
    for turn in conv:
        merged_tool_uses.extend(turn.get("intermediate_data", {}).get("tool_uses", []))
    # Extract first user message and last model response
    first = conv[0]
    last = conv[-1]
    # Build new single-turn conversation:
    # user_content from first, final_response from last, new intermediate_data
    new_turn = {
        "invocation_id": last.get("invocation_id", first.get("invocation_id")),
        "user_content": first.get("user_content", {}),
        "final_response": last.get("final_response", {}),
        "intermediate_data": {
            "tool_uses": merged_tool_uses,
            "intermediate_responses": []
        },
        "creation_timestamp": last.get("creation_timestamp", first.get("creation_timestamp"))
    }
    case["conversation"] = [new_turn]

with open(outfile, "w", encoding="utf-8") as f:
    json.dump(data, f, indent=2, ensure_ascii=False)

print(f'Pruned/cleaned eval set written as: {outfile}')
This post-processing step ensures that the evaluation data remains focused, accurate, and free from any artifacts introduced during test creation and troubleshooting. As a result, the cleaned eval set provides a strong foundation for ongoing agent assessment, continuous integration workflows, and consistent benchmarking.
Here’s the previous example after running the post-processing script:
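As an illustration of the output format, the lost-luggage case from the earlier minimal eval set would collapse into a single turn along these lines (invocation IDs and timestamps omitted for brevity):

{
  "eval_id": "case001",
  "conversation": [
    {
      "user_content": {
        "parts": [{"text": "I'd like to file a claim for my lost luggage."}]
      },
      "final_response": {
        "parts": [{"text": "Sure, I can help you file a claim for your lost luggage. Can you provide more details about your trip?"}]
      },
      "intermediate_data": {
        "tool_uses": [{"name": "claim_intake", "args": {"type": "lost_luggage"}}],
        "intermediate_responses": []
      }
    }
  ],
  "session_input": {"user_id": "U123", "app_name": "insurance_demo"}
}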


After creating this test set, you can navigate to the eval tab inside the ADK web UI, select the test cases you would like to try out, and run these test cases inside the web UI without any code.

While the ADK web UI provides a convenient "no code" solution for running and visually inspecting test results, I wanted more granular control over how my agent's responses are scored. In particular, I aimed to implement custom grading logic that could more accurately assess the nuances of the insurance domain, and I also sought to utilize Weave for evaluation tracking and visualization. This would allow me to easily view metrics such as which tool calls were missed, directly compare model outputs with ground truth answers, and visually explore patterns or failures across many runs.
To accomplish this, I decided to build a new Evaluator class specifically designed for my use case. This Evaluator not only grades each test case using both LLM-based semantic checks and tool usage analysis, but also logs each prediction and its associated metrics to Weave. By integrating my custom evaluation pipeline with Weave, I was able to generate interactive dashboards that make it simple to monitor performance trends, investigate specific error cases, and compare the agent's behavior across different builds or prompt configurations. This approach ensures full flexibility and transparency in the evaluation process, supporting rigorous model improvement efforts far beyond what traditional test pass/fail summaries can provide. If you would like to use this evaluator, feel free to clone the repo I made here.
To evaluate my insurance agent using the custom logic I had developed, I created a simple evaluation script that imports the new evaluator class and uses it to score the agent on my cleaned test set. This script is designed for flexibility, supporting both standalone runs and easy integration with automated test workflows such as pytest.
The evaluation script loads environment variables, brings in my LiteAgentEvaluator, and then runs the agent on each test case in the evaluation set. By setting use_weave=True, the evaluation results and key metrics are automatically logged to Weave for later analysis and visualization. This approach lets me track not only the final accuracy of each agent response, but also deeper details like tool usage correctness and semantic similarity judged by an external LLM.
Running this evaluation loop provides a detailed, consistent scoring of my agent's abilities on a curated set of insurance workflows. Thanks to the modular script structure, it's easy to expand the suite, automate regular checks, or compare different agent configurations as development continues. The end result is a reliable, auditable workflow for measuring agent quality and surfacing actionable insights throughout the agent improvement process.
Here’s the code for the eval:
import dotenv
import pytest
from google.adk.evaluation.agent_evaluator import AgentEvaluator
from lite_custom_evaluator import LiteAgentEvaluator  # see the GitHub repo for this script

pytest_plugins = ("pytest_asyncio",)

use_weave = True  # Set to True to use LiteAgentEvaluator (with Weave), or False to use AgentEvaluator

@pytest.fixture(scope="session", autouse=True)
def load_env():
    dotenv.load_dotenv()

@pytest.mark.asyncio
async def test_all():
    """Test the agent's basic ability on a few examples."""
    agent_name = "seq_agent"
    eval_path = "/Users/brettyoung/Desktop/dev25/tutorials/adk/adk-samples/python/agents/seq_agent/seq_agent/insurance_eval_set_cleaned.evalset.json"

    if use_weave:
        await LiteAgentEvaluator.evaluate(
            agent_name,
            eval_path,
            num_runs=2,
            use_weave=True
        )
    else:
        await AgentEvaluator.evaluate(
            agent_name,
            eval_path,
            num_runs=2
        )
You can run the eval with the following command:
python -m pytest -s eval
Once your evaluation script has finished running, you can open the Weave dashboard to explore the results of your evaluation in detail. Weave does much more than simply provide an overall score for your agent. It offers a comprehensive and interactive interface where you can review not only the final evaluation metrics but also dig into the specifics for each test case.
Inside Weave, you can view a breakdown of every model response your agent generated during testing. For each example, you will see the actual output from the agent, all the tool calls that were made, and how these compare with the expected ground truth answers. Individual response scores are displayed clearly, covering aspects such as whether all the required tools were used, how closely the output matches the correct answer according to the LLM grader, and any other custom metrics that have been tracked.
This level of insight enables you to quickly identify recurring issues, understand your agent's strengths and weaknesses, and pinpoint exactly where they may need further improvement. You can easily filter to focus on specific categories of errors, investigate problem cases, and track your progress over multiple evaluation runs. By using Weave’s powerful visualization tools, you transform raw evaluation data into meaningful feedback that supports ongoing development and refinement of your agent.
Here are a few screenshots inside Weave showing the results of the evaluation:


Inside Weave, we see that the evaluation interface breaks down results per test case, allowing you to inspect each query alongside the agent’s tool usage, outputs, and associated metrics.
Each row represents a test case, including its input, the expected tools (gt_tools), and the tools the agent actually used (agent_tools). When you select a specific example, the right-hand pane expands to show full detail: the original input, all tool calls made by the agent, and a comparison against the expected tool usage.
Metrics such as llm_correctness, tool_correctness, and tool_recall are shown clearly per case, providing immediate insight into what went right or wrong. In the selected example, all tools were used correctly, and the output was correct, resulting in perfect scores across the board. These per-example evaluations make it easy to identify edge cases or patterns of failure that need attention.
With this level of granularity, Weave turns evaluations into a navigable log of your agent’s behavior, allowing you to filter, search, and compare across time. This helps you understand not only aggregate scores but also the specific reasons behind them, providing a clear path for iterative debugging and optimization.

Conclusion

Evaluating AI agents in complex domains such as insurance requires a much more nuanced approach than standard software testing. With the Agent Development Kit, teams have the tools to create realistic environments, define detailed test sets, and precisely track how an agent reasons, uses tools, and delivers outputs at every step. By complementing ADK’s structured evaluation capabilities with Weave’s powerful analytics and visualization, organizations can move beyond checking for simple correctness and instead build a culture of continuous, data-driven improvement.
The combination of ADK and Weave enables you to monitor not only whether your agent produces the correct answers, but also whether it is reasoning correctly, utilizing the appropriate workflows, and adhering to regulatory and business standards. You gain detailed visibility into each step the agent takes, as well as clear metrics that reveal both overall trends and specific areas for refinement. This empowers development teams to catch problems early, measure real progress with every update, and maintain confidence in deploying AI-powered workflows.
Ultimately, establishing a rigorous and transparent evaluation process helps ensure your AI agents are trustworthy, reliable, and ready for production. By using tools like ADK and Weave together, you can accelerate development, rapidly respond to new challenges, and deliver the high level of quality that today’s complex, regulated industries demand.