
Building and evaluating AI agents with Azure AI Foundry Agent Service and W&B Weave

A hands-on guide to building and evaluating real-time, tool-using AI agents with Azure AI Foundry Agent Service, SerpAPI, and W&B Weave. This is a translated version of the article. Feel free to report any possible mis-translations in the comments section
Created on August 26|Last edited on August 26
Building intelligent AI agents that can fetch real-time information, connect to external APIs, and orchestrate complex workflows on the fly is no simple task. But with Microsoft’s Azure AI Foundry Agent Service, you can rapidly assemble and deploy robust agent applications with all the security, scalability, and integration benefits of the Azure ecosystem.
In this walkthrough, we will build a travel assistant using Foundry Agent Service that helps users plan trips by finding hotels and nearby restaurants based on their preferences. The agent interacts with real APIs hosted on Azure Functions to fetch live data, uses reasoning over the user’s input to determine what information is needed, and then calls the appropriate tools to deliver recommendations. We will also integrate W&B Weave to track the agent’s behavior in detail—namely, capturing inputs and outputs—so we can understand and debug its decisions throughout the conversation.
You’ll learn how to set up and configure the Azure service, register your search tools, implement endpoints for live hotel and dining recommendations, and finally, how to connect and interact with your agent using Python.


What is an agent?

An AI agent is an intelligent system designed to achieve specific goals through planning, the use of external tools, retaining memory, and adapting over time. Unlike traditional programmatic automation (which follows rigid, predefined rules), AI agents dynamically process information, make decisions, and refine their approach based on feedback.
While chatbots primarily engage in conversations and require user input at every step, AI agents operate independently. They don’t just generate responses; they take action, interact with external systems, and manage multi-step workflows without constant supervision.
Some key components of AI agents include:
  • Tools: Connect to APIs, databases, and software to extend functionality.
  • Memory: Store information across tasks to improve consistency and recall.
  • Continual learning: Adapt and refine strategies based on past performance.
  • Orchestration: Manage multi-step processes, break down tasks, and coordinate with other agents.

Building our agent with Azure AI Foundry Agent Service

There are quite a few steps to build out our agent, so let's dive right into the tutorial.
To start, create an Azure account, set up billing, and navigate to Azure AI Foundry. Next, create a project:

Now, you'll want to navigate to the “Agents” pane on the left hand side inside Azure AI Foundry. You can click the “New Agent” button to create a new agent.

After creating an agent, you should see a window like the following:

This page lets you change the agent's base model, configure its behavior, connect it to external data sources through the Knowledge section, and enable runtime tool usage via Actions. You can also adjust how creative or deterministic the agent is with the Temperature and Top P sliders.
In this tutorial, we will explore how to connect our agent to different “Actions” (also known as tools), which enable it to fetch real-time information from specific data sources. Before configuring the agent, we’ll build the underlying APIs it will call.
One endpoint handles hotel searches. It takes in a location, check-in and check-out dates, number of adults, and an optional currency. It builds a request to SerpAPI for hotel data, parses the response, and returns hotel names, rates, coordinates, and booking links. If no results are found, it returns an appropriate message.
Another endpoint provides food recommendations based on a query, location coordinates, and optional cuisine. It formats the search string, uses SerpAPI to find local food spots, and returns names, ratings, and review counts. If there are no matches, it responds with an error message. These endpoints will be added to the agent later so it can use them when responding to user travel queries.

Hosting our tools with Azure Functions

We will use Azure Functions to host our tools, along with SerpAPI to provide real-time data for our agent. You will first need to obtain a SerpAPI key here. Additionally, make sure your local development environment is ready for deploying Azure Functions; the requirements are listed here.

Setting up the Azure CLI

To start, we will need to install the Azure CLI. I’m on a Mac, so I will show the install steps for macOS; if you are on a different platform, here are the docs for other platforms.
brew update && brew install azure-cli
Then log into your Azure account:
az login
Now, we'll install a few packages for Azure functions:
brew tap azure/functions
brew install azure-functions-core-tools@4
brew link --overwrite azure-functions-core-tools@4
Note, in order to install the above packages, you may need to update your Xcode command line tools. Also keep in mind that there are a few bugs preventing the execution of local functions on Mac devices, so we will skip testing our tools locally in this tutorial.
Next, we need our Azure subscription ID. This can be found by running the following command:
az account show --query id -o tsv
Then, copy this value and export it to your environment using the following command:
export AZURE_SUBSCRIPTION_ID=your_id_from_above
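Now we are ready to create a template that will serve as the boilerplate for hosting our functions. This step uses the Azure Developer CLI (azd), which is a separate tool from the Azure CLI, so install it first if you don't already have it. Start by running: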
azd init --template functions-quickstart-python-http-azd -e flexquickstart-py
Next, set up a virtual environment with:
python3 -m venv .venv
source .venv/bin/activate
Make sure the required Azure resource provider is registered:
az provider register --namespace Microsoft.App

Writing our tool functions

Next, we'll replace our function_app.py file with the following code, which is the logic for our tools that our travel agent will use:
import azure.functions as func
import logging
import requests
import json

API_KEY = "your_serpapi_api_key"  # placeholder; replace with your own SerpAPI key

app = func.FunctionApp(http_auth_level=func.AuthLevel.ANONYMOUS)

@app.route(route="foodsearch", methods=["POST"])
def food_search(req: func.HttpRequest) -> func.HttpResponse:
    logging.info("🔵 food_search triggered")
    try:
        data = req.get_json()
        query = data.get("query")
        latitude = data.get("latitude")
        longitude = data.get("longitude")
        cuisine = data.get("cuisine")

        # Compare coordinates against None so that a value of 0 is still accepted
        if not query or latitude is None or longitude is None:
            return func.HttpResponse(
                json.dumps({"error": "Missing required parameters. Must include query, latitude, and longitude."}),
                status_code=400
            )

        search_query = f"{query} {cuisine}" if cuisine else query
        search_params = f"ll=@{latitude},{longitude},14z"
        url = f"https://serpapi.com/search?engine=google_local&q={requests.utils.quote(search_query)}&{search_params}&api_key={API_KEY}"
        # Log the query rather than the full URL, which contains the API key
        logging.info(f"🟠 Requesting google_local results for: {search_query}")

        res = requests.get(url, timeout=30)
        res.raise_for_status()
        results = res.json().get("local_results", [])

        if not results:
            return func.HttpResponse(json.dumps({"error": "No food places found"}), status_code=404)

        places = [{
            "name": r.get("title", "Unknown"),
            "rating": r.get("rating", "N/A"),
            "reviews": r.get("reviews", 0)
        } for r in results]

        return func.HttpResponse(json.dumps({"food_places": places}), mimetype="application/json", status_code=200)

    except Exception as e:
        logging.exception("food_search failed")
        return func.HttpResponse(json.dumps({"error": "Server error", "details": str(e)}), status_code=500)

@app.route(route="hotelsearch", methods=["POST"])
def hotel_search(req: func.HttpRequest) -> func.HttpResponse:
    logging.info("🔵 hotel_search triggered")
    try:
        data = req.get_json()
        location = data.get("location")
        check_in = data.get("check_in_date")
        check_out = data.get("check_out_date")
        adults = data.get("adults")
        currency = data.get("currency", "USD")

        if not (location and check_in and check_out and adults):
            return func.HttpResponse(json.dumps({"error": "Missing required parameters", "received": data}), status_code=400)

        url = (
            f"https://serpapi.com/search?engine=google_hotels"
            f"&q={requests.utils.quote(location)}"
            f"&check_in_date={check_in}"
            f"&check_out_date={check_out}"
            f"&adults={adults}"
            f"&currency={currency}"
            f"&api_key={API_KEY}"
        )
        # Log the search terms rather than the full URL, which contains the API key
        logging.info(f"🟠 Requesting google_hotels results for: {location} ({check_in} to {check_out})")

        res = requests.get(url, timeout=30)
        res.raise_for_status()
        properties = res.json().get("properties", [])

        if not properties:
            return func.HttpResponse(json.dumps({"error": "No hotels found"}), status_code=404)

        # Keep only the fields the agent needs, including coordinates for food_search
        hotels = [{
            "name": h.get("name", "No Name"),
            "price": h.get("rate_per_night", {}).get("lowest", "N/A"),
            "currency": currency,
            "lat": h.get("gps_coordinates", {}).get("latitude"),
            "lon": h.get("gps_coordinates", {}).get("longitude"),
            "link": h.get("link", "N/A")
        } for h in properties]

        return func.HttpResponse(json.dumps({"hotels": hotels}), mimetype="application/json", status_code=200)

    except Exception as e:
        logging.exception("hotel_search failed")
        return func.HttpResponse(json.dumps({"error": "Server error", "details": str(e)}), status_code=500)

This script defines two Azure Functions:
  1. one for hotel search, and
  2. one for food search.
Each function listens for incoming POST requests and uses parameters from the request body to fetch relevant data from SerpAPI.
The hotel_search function takes a location, travel dates, and number of adults. It queries Google Hotels via SerpAPI and returns a list of hotel options. Crucially, it also includes the exact latitude and longitude of each hotel in the response. This is important because we’ll use these coordinates in the next step - to help the food search tool find nearby restaurants with precision. Rather than relying on vague location names, we’re anchoring our food queries to the hotel’s exact geographic location.
The food_search function uses a search query, coordinates, and optional cuisine type to search for local dining spots. It calls SerpAPI’s Google Local endpoint and returns restaurant names, ratings, and review counts.
Together, these tools allow the agent to follow a reasoning chain - first find the hotel, then use its coordinates to accurately recommend nearby food. This helps the assistant avoid common errors like suggesting places that are technically in the same city but far from the user’s actual location.
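The hand-off between the two tools can be sketched in a few lines. The snippet below is a minimal illustration, assuming a hotel dictionary shaped like the entries returned by hotel_search (with lat and lon keys). The helper name food_request_from_hotel and the sample values are hypothetical and not part of the deployed code.

```python
from typing import Any, Dict, Optional


def food_request_from_hotel(
    hotel: Dict[str, Any], query: str, cuisine: Optional[str] = None
) -> Dict[str, Any]:
    """Build a food_search request body anchored to a hotel's coordinates."""
    lat, lon = hotel.get("lat"), hotel.get("lon")
    if lat is None or lon is None:
        # hotel_search may return entries without GPS data; fail loudly here
        raise ValueError(f"Hotel {hotel.get('name')!r} has no coordinates")
    body: Dict[str, Any] = {"query": query, "latitude": lat, "longitude": lon}
    if cuisine:
        body["cuisine"] = cuisine
    return body


# Made-up hotel entry in the shape hotel_search returns
hotel = {"name": "Example Inn", "price": "$150", "lat": 30.2672, "lon": -97.7431}
print(food_request_from_hotel(hotel, "dinner", cuisine="Mexican"))
```

In the real system the agent performs this step itself, guided by the instruction prompt; the function only makes the data flow explicit.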
Finally, you will need to add the requests library to the requirements.txt file that was created alongside our function_app.py file. Simply add "requests" on its own line in the file.
Now we can deploy our functions with the following command:
azd up
After deployment, you will see a URL printed in the terminal. This is the base URL for your hosted functions. You’ll use this to register your tools in the agent configuration so they can be called during runtime. Make sure to copy and save it - you’ll need it shortly.
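Before registering the tools, it can help to send one request to the deployed function and confirm that it responds. The following is a rough smoke test using only the standard library. BASE_URL is a placeholder for the URL printed by azd up, and the /api route prefix is an assumption based on the Azure Functions default; adjust both to match your deployment.

```python
import json
import urllib.request

BASE_URL = "https://your-function-app.azurewebsites.net"  # placeholder, not a real app


def build_post(base_url: str, route: str, payload: dict) -> urllib.request.Request:
    """Create a JSON POST request for one of the tool endpoints."""
    url = f"{base_url.rstrip('/')}/api/{route}"  # assumes the default 'api' prefix
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_tool(base_url: str, route: str, payload: dict, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON body (raises on HTTP errors)."""
    request = build_post(base_url, route, payload)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


hotel_payload = {
    "location": "Austin, Texas",
    "check_in_date": "2025-07-01",
    "check_out_date": "2025-07-04",
    "adults": 2,
}

if __name__ == "__main__" and "your-function-app" not in BASE_URL:
    # Only runs once BASE_URL has been replaced with a real deployment
    print(call_tool(BASE_URL, "hotelsearch", hotel_payload))
```

A 200 response with a hotels list suggests the function and the SerpAPI key are both working; a 400 usually points to a missing field in the payload.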

Linking our tools to our agent in Azure AI Foundry Agent Service

Now that our functions are deployed, we’ll define OpenAPI schemas to describe how our agent should interact with them. These schemas act like contracts that specify what input each tool expects and what output it returns. This allows the agent to understand how to call the tools correctly. Each schema includes metadata about the tool, the server URL where it’s hosted, the endpoint path, HTTP method, expected input structure, and the format of the expected response.
For the hotel search function, the schema defines a POST request that takes a location, check-in and check-out dates, number of adults, and an optional currency. The response includes hotel details such as name, price, location coordinates, and booking link.
Here’s the schema for the hotel_search tool:
{
  "openapi": "3.0.0",
  "info": {
    "title": "Hotel Search Tool",
    "version": "1.0.0",
    "description": "Find hotels by location and travel dates."
  },
  "servers": [
    { "url": "url to your api" }
  ],
  "paths": {
    "/api/hotelsearch": {
      "post": {
        "operationId": "hotelSearch",
        "summary": "Search for hotels",
        "description": "Search for hotels using location, check-in/check-out dates, and number of adults.",
        "requestBody": {
          "required": true,
          "content": {
            "application/json": {
              "schema": {
                "type": "object",
                "required": ["location", "check_in_date", "check_out_date", "adults"],
                "properties": {
                  "location": { "type": "string", "example": "New York, NY" },
                  "check_in_date": { "type": "string", "format": "date", "example": "2025-07-01" },
                  "check_out_date": { "type": "string", "format": "date", "example": "2025-07-04" },
                  "adults": { "type": "integer", "example": 2 },
                  "currency": { "type": "string", "example": "USD" }
                }
              }
            }
          }
        },
        "responses": {
          "200": {
            "description": "Success",
            "content": {
              "application/json": {
                "schema": {
                  "type": "object",
                  "properties": {
                    "hotels": {
                      "type": "array",
                      "items": {
                        "type": "object",
                        "properties": {
                          "name": { "type": "string" },
                          "price": { "type": "string" },
                          "currency": { "type": "string" },
                          "lat": { "type": "number" },
                          "lon": { "type": "number" },
                          "link": { "type": "string" }
                        }
                      }
                    }
                  }
                }
              }
            }
          },
          "400": {
            "description": "Bad request",
            "content": {
              "application/json": {
                "schema": {
                  "type": "object",
                  "properties": {
                    "error": { "type": "string" }
                  }
                }
              }
            }
          },
          "500": {
            "description": "Server error",
            "content": {
              "application/json": {
                "schema": {
                  "type": "object",
                  "properties": {
                    "error": { "type": "string" },
                    "details": { "type": "string" }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
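The food search schema also uses a POST request. It accepts a query plus cuisine, location, latitude, and longitude fields. Note that the food_search function shown earlier returns a 400 error unless query, latitude, and longitude are all present, and it does not read the location field. The response is an array of food places with their name, rating, and number of reviews.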
Here’s the schema for the food_search tool:
{
  "openapi": "3.0.0",
  "info": {
    "title": "Food Search API",
    "version": "1.0.0",
    "description": "Searches for food places using a query and either a location string or coordinates."
  },
  "servers": [
    {
      "url": "url to your api",
      "description": "Azure Function App endpoint"
    }
  ],
  "paths": {
    "/api/foodsearch": {
      "post": {
        "summary": "Search food places",
        "operationId": "foodSearch",
        "requestBody": {
          "required": true,
          "content": {
            "application/json": {
              "schema": {
                "type": "object",
                "required": ["query", "latitude", "longitude"],
                "properties": {
                  "query": { "type": "string", "example": "pizza" },
                  "cuisine": { "type": "string", "example": "Mexican" },
                  "location": { "type": "string", "example": "Chicago, IL" },
                  "latitude": { "type": "number", "example": 41.8781 },
                  "longitude": { "type": "number", "example": -87.6298 }
                }
              }
            }
          }
        },
        "responses": {
          "200": {
            "description": "Successful response",
            "content": {
              "application/json": {
                "schema": {
                  "type": "object",
                  "properties": {
                    "food_places": {
                      "type": "array",
                      "items": {
                        "type": "object",
                        "properties": {
                          "name": { "type": "string" },
                          "rating": { "type": "number" },
                          "reviews": { "type": "integer" }
                        }
                      }
                    }
                  }
                }
              }
            }
          },
          "400": {
            "description": "Bad request",
            "content": {
              "application/json": {
                "schema": {
                  "type": "object",
                  "properties": {
                    "error": { "type": "string" }
                  }
                }
              }
            }
          },
          "500": {
            "description": "Server error",
            "content": {
              "application/json": {
                "schema": {
                  "type": "object",
                  "properties": {
                    "error": { "type": "string" },
                    "details": { "type": "string" }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
These OpenAPI specs will be linked in the agent configuration so the agent knows how to call each endpoint properly when responding to user queries.
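To make the "contract" idea concrete, here is a small sketch that checks a candidate request body against the required fields and basic types declared in a schema. It is a simplified stand-in for a real OpenAPI validator. REQUEST_SCHEMA is a trimmed copy of the hotel tool's request schema, and check_payload is a hypothetical helper written only for this illustration.

```python
from typing import Any, Dict, List

# Trimmed copy of the hotel tool's request schema (only the parts checked here)
REQUEST_SCHEMA: Dict[str, Any] = {
    "type": "object",
    "required": ["location", "check_in_date", "check_out_date", "adults"],
    "properties": {
        "location": {"type": "string"},
        "check_in_date": {"type": "string"},
        "check_out_date": {"type": "string"},
        "adults": {"type": "integer"},
        "currency": {"type": "string"},
    },
}

# Map JSON Schema type names to Python types
PY_TYPES = {"string": str, "integer": int, "number": (int, float)}


def check_payload(payload: Dict[str, Any], schema: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means it passed."""
    problems = [
        f"missing required field: {name}"
        for name in schema.get("required", [])
        if name not in payload
    ]
    for name, value in payload.items():
        spec = schema.get("properties", {}).get(name)
        if spec is None:
            continue  # undeclared fields are ignored in this sketch
        expected = PY_TYPES.get(spec.get("type"))
        # bool is a subclass of int in Python, so reject it explicitly
        if expected and (isinstance(value, bool) or not isinstance(value, expected)):
            problems.append(f"wrong type for {name}: expected {spec['type']}")
    return problems


print(check_payload({"location": "New York, NY", "adults": "2"}, REQUEST_SCHEMA))
```

The agent service performs its own interpretation of the schema; a check like this is mainly useful for testing payloads by hand before wiring up the tool.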
In order to link these functions to our agent, we will now return to the agent builder dashboard inside Azure AI Foundry, and then click the add button near the “Actions” panel for our newly created agent.


Click the “OpenAPI 3.0 specified tool” and then you will see the following screen:

Here, you can give your tool a name and description of how it should be used by the agent. After entering this information and clicking the “next” button, you will see the following screen where you can enter your OpenAPI schema JSON we created earlier:

After pasting the schema into the textfield, you will be able to create your tool. Repeat these steps to add your food search tool.

Prompting our agent

Now we will give our agent a set of core instructions describing how it should operate. Here’s the full instruction prompt I used for my agent:
You are a smart travel assistant designed to help users find hotels and food spots in specific areas. Always respond clearly and helpfully, guiding users step by step if needed. Your main tools are hotel_searcher for hotels and food_searcher for food options. Use them precisely and effectively.

When helping a user:

Use hotel_searcher to find hotels. Always ensure the location includes the city and state/province (e.g., "Austin, Texas" not just "Austin").

Use food_searcher to find nearby places to eat. If the user is referencing a hotel, extract the latitude and longitude from the hotel result and pass them to food_searcher. This ensures an accurate nearby search.

If the user is planning a trip, find the hotel first, then use its coordinates to search for food.

Always explain what you’re doing in plain language so the user understands. If something is missing (like a city name or query), ask the user clearly and continue once the info is provided.

Be efficient, but conversational - your goal is to solve the user’s request using the tools as needed.

The instructions you provide here are a core part of how your agent behaves, and they have a direct impact on the user experience. This isn’t just metadata - it’s the operational logic the model uses to interpret user input, decide when and how to use tools, and determine what kind of language to use in responses. The better and more targeted your instructions are, the more reliable and helpful your agent will be.
Spending time refining this prompt is one of the best ways to improve agent performance. A small change in wording can guide the model to handle edge cases more gracefully or prioritize the right tool for the job. If users are confused, getting irrelevant results, or the agent is skipping tool calls it should be making, the instruction prompt is often the place to look.
This example prompt gives the agent a clear hierarchy of actions: when to use each tool, what context to extract, and how to handle incomplete requests. It's also written in plain language to ensure interpretability by the model, which is key to maintaining consistent behavior. Iterating on this over time based on user feedback or real test cases will lead to a noticeably better agent.
Ok, now we are ready to test out our agent! You can click the “Try in playground” option which will enable you to quickly test out the agent in a chat window:


We can also click the “View Code” button at the top which will give us some sample code for accessing our agent:


Accessing our agent with Python

I modified the code slightly to support a multi-round chat conversation. To use it, fill in your agent’s unique ID (found on the agent configuration page) and the connection string (found in the sample code provided by the “View Code” button). Here’s the code with Weave added to track our agent's responses:
import weave
weave.init("az_agent")

from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential

project_client = AIProjectClient.from_connection_string(
    credential=DefaultAzureCredential(),
    conn_str="your_connection_string"
)

agent = project_client.agents.get_agent("asst_MVmqhbpGlAay9Bs2QBf7WLGJ")
thread = project_client.agents.create_thread()

@weave.op()
def ask_foundry_agent(project_client, agent, thread_id, user_input):
    # Add the user's message to the thread
    project_client.agents.create_message(
        thread_id=thread_id,
        role="user",
        content=user_input
    )
    # Run the agent on the thread and wait for it to finish
    project_client.agents.create_and_process_run(
        thread_id=thread_id,
        agent_id=agent.id
    )
    messages = project_client.agents.list_messages(thread_id=thread_id).text_messages
    print(str(messages))
    if messages:
        # Messages are listed newest first (as in the evaluation script later on),
        # so index 0 is the latest reply
        return messages[0].text.value
    else:
        return "No response from assistant."

print("Type 'exit' to end the conversation.")

while True:
    user_input = input("You: ")
    if user_input.strip().lower() in ("exit", "quit"):
        print("Chat ended.")
        break
    assistant_reply = ask_foundry_agent(project_client, agent, thread.id, user_input)
    print("Assistant:", assistant_reply)

This code sets up a simple command-line chat loop that lets you interact with your deployed Azure agent in real time, using multi-turn conversation tracking and basic logging via Weave. The script starts by initializing Weave to track interactions using weave.init("az_agent").
It authenticates to Azure using DefaultAzureCredential and connects to your specific agent project via AIProjectClient.from_connection_string(...), where you need to provide the correct connection string for your workspace. The agent object is fetched using your agent's unique ID, and a new thread is created. The thread allows multi-turn conversations by maintaining context between user inputs and assistant responses.
The core logic is in the ask_foundry_agent function. Overall, it does the following:
1. Sends the user's message to the agent using create_message.
2. Triggers a run on the agent with create_and_process_run.
3. Retrieves the assistant’s response from the thread using list_messages.
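One detail worth handling explicitly in step 3 is which message to read, since the position of the newest reply depends on the sort order the list call uses. The sketch below picks the latest assistant message by timestamp instead of by index. It works on plain dictionaries with role, created_at, and text keys, which is an assumed simplification for illustration rather than the SDK's actual message type.

```python
from typing import Dict, List, Optional


def latest_assistant_text(messages: List[Dict]) -> Optional[str]:
    """Return the text of the most recent assistant message, if any."""
    replies = [m for m in messages if m.get("role") == "assistant"]
    if not replies:
        return None
    # max() on created_at works whether the list is oldest-first or newest-first
    return max(replies, key=lambda m: m["created_at"])["text"]


# Made-up thread contents, listed newest first
thread_messages = [
    {"role": "assistant", "created_at": 1003, "text": "Here are three hotels in Austin."},
    {"role": "user", "created_at": 1002, "text": "Find me a hotel in Austin, Texas."},
    {"role": "assistant", "created_at": 1001, "text": "Hi! Where are you travelling?"},
    {"role": "user", "created_at": 1000, "text": "Hello"},
]
print(latest_assistant_text(thread_messages))
```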
The while loop allows the user to keep chatting with the agent until they type "exit" or "quit", printing each agent response as it comes back. After running our script, and chatting with the model, we will see our traces inside Weave:

Weave is helpful because it lets you see exactly what happened during every interaction with your agent. In the screenshot, you can track each call to the function, along with the full inputs and outputs. This includes the user input, the thread ID, the agent config, and the response from the model. You can see how the agent interpreted the input, what it returned, and how long it took. This makes it easier to debug issues like missing responses, incorrect tool usage, or slow performance. Instead of guessing why the agent did something, you can inspect the trace and confirm exactly what was passed in and what came out. It also helps verify that the system prompt and tool instructions were properly included. For multi-round interactions, you can follow the entire flow message by message and pinpoint where something went wrong.

Evaluating web search capabilities with BrowseComp

One of my favorite features of Azure AI Foundry Agent Service is the out-of-the-box support for web search tools, which I believe is a necessary capability for any modern AI agent. We'll build an agent that can perform live searches using Bing, and then benchmark it on OpenAI’s BrowseComp benchmark, which is specifically designed to test agents on difficult-to-find questions that require reasoning, persistence, and creative search strategies.
BrowseComp includes a set of questions that are intentionally hard to solve with a single query, making it a strong evaluation set for agents that rely on dynamic web retrieval. By connecting each agent to the Bing Search knowledge source, we can simulate real-world search conditions and see how well different models perform when tasked with navigating and synthesizing real-time web data.
For this evaluation, I chose to benchmark an agent with gpt-4o as the backbone model. After setting up the agent in Azure AI Foundry, I selected the appropriate model deployment from the dropdown menu. To give the agent access to real-time web data, we will add the Bing Search knowledge source, which allows it to ground its responses using live internet results. Click the “Add” button under the Knowledge section. This will bring up a list of available data sources you can connect to your agent. From the options, choose “Grounding with Bing Search.”

This will allow your agent to access live web content through Bing. After selecting it, you’ll be prompted to choose an existing Bing Search connection or create a new one. If you already have a connection set up with your API key, you can simply select it and hit “Connect.”
Once connected, your agent will be able to pull fresh search results directly from the web, enabling it to respond with up-to-date information based on what it finds online.

Now we are ready to evaluate the agent with Weave and BrowseComp. The following script will prepare a portion of the dataset, initialize our agent, and use Weave Evaluations to measure its performance on the dataset. Note that I only use a portion of the dataset for evaluation, so to get the agent's exact performance, I recommend running this script on the full dataset.
Here’s the code for the eval:
import os
import asyncio
import base64
import hashlib
import pandas as pd
import json
import weave
from litellm import completion

from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential

CONNECTION_STRING = "your_connection_string"

AGENT_IDS = {
    "4o_web_searcher": "asst_vw3S54YgIzoy1WbdJWsceFiz"
}

# One client instance for efficiency
project_client = AIProjectClient.from_connection_string(
    credential=DefaultAzureCredential(),
    conn_str=CONNECTION_STRING
)

##############################
# 1. Data Preparation
##############################

def derive_key(password: str, length: int) -> bytes:
    key = hashlib.sha256(password.encode()).digest()
    return (key * (length // len(key))) + key[: length % len(key)]

def decrypt(ciphertext_b64: str, password: str) -> str:
    encrypted = base64.b64decode(ciphertext_b64)
    key = derive_key(password, len(encrypted))
    decrypted = bytes(a ^ b for a, b in zip(encrypted, key))
    return decrypted.decode(errors="ignore")

def load_eval_dataset(n=10):
    csv_url = "https://openaipublic.blob.core.windows.net/simple-evals/browse_comp_test_set.csv"
    df = pd.read_csv(csv_url)
    dataset = []
    for i, row in df.iterrows():
        if i >= n:
            break
        try:
            question = decrypt(row['problem'], row['canary'])
            answer = decrypt(row['answer'], row['canary'])  # the answer must be decrypted too
            dataset.append({"text": question, "label": answer})
        except Exception as e:
            print(f"Failed to decrypt row {i}: {e}")
    return dataset

##############################
# 2. Agent Interface
##############################

def agent_inference(agent_id: str, prompt: str, thread=None) -> str:
    """
    Run an inference with Azure AI agent and return the most recent message string value.
    Uses the last message.
    """
    if not thread:
        thread = project_client.agents.create_thread()
    # Post user's message
    project_client.agents.create_message(
        thread_id=thread.id,
        role="user",
        content=prompt
    )
    project_client.agents.create_and_process_run(
        thread_id=thread.id,
        agent_id=agent_id
    )
    messages = project_client.agents.list_messages(thread_id=thread.id)
    if messages.text_messages:
        latest = messages.text_messages[0].as_dict()
        text = latest.get("text", {}).get("value", "")
        return text
    return "[NO RESPONSE]"

##############################
# 3. Weave Model Wrapper
##############################

class FourOWebSearcherModel(weave.Model):
    @weave.op
    def predict(self, text: str) -> str:
        return agent_inference(AGENT_IDS["4o_web_searcher"], text)

##############################
# 4. Scoring Function
##############################

@weave.op
def gpt4o_scorer(label: str, model_output: str) -> dict:
    """Score the model's output by comparing it with the ground truth."""
    query = f"""
YOU ARE A LLM JUDGE DETERMINING IF THE FOLLOWING MODEL GENERATED ANSWER IS THE SAME AS THE CORRECT ANSWER

Model's Answer: {str(model_output)}
Correct Answer: {label}

Your task:
1. State the model's predicted answer (answer only).
2. State the ground truth (answer only).
3. Determine if the model's final answer is correct (ignore formatting differences, etc.). RESPOND with the predicted and ground truth answer, followed with a JSON object containing the correctness encapsulated within these delimiters:
```json
{{ "correctness": true/false }}
```
"""
    response = completion(
        model="gpt-4o-2024-08-06",
        temperature=0.0,
        messages=[{"role": "user", "content": query}]
    )
    # Parse response
    try:
        result = response.choices[0].message.content
        json_start = result.index("```json") + 7
        json_end = result.index("```", json_start)
        correctness = json.loads(result[json_start:json_end].strip()).get("correctness", False)
    except Exception:
        correctness = False

    return {"correctness": correctness, "reasoning": response.choices[0].message.content}

##############################
# 5. Main Evaluation Loop
##############################
async def run_evaluations():
    weave.init("azure_agent_eval")

    dataset = load_eval_dataset(100)
    print(f"Loaded {len(dataset)} examples.")

    models = {
        "4o_web_searcher": FourOWebSearcherModel(),
    }

    scorers = [gpt4o_scorer]

    for model_name, model in models.items():
        print(f"\n\n=== EVALUATING {model_name.upper()} ===")
        evaluation = weave.Evaluation(
            dataset=dataset,
            scorers=scorers,
            name=f"{model_name} Evaluation"
        )
        results = await evaluation.evaluate(model)
        print(f"Results for {model_name}: {results}")

if __name__ == "__main__":
    # os.environ["OPENAI_API_KEY"] = "your_openai_api_key"
    asyncio.run(run_evaluations())
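The scorer's parsing step depends on the judge wrapping its verdict in a fenced JSON block, which models occasionally skip. As a hedged alternative, the sketch below first looks for the fenced block and then falls back to any JSON object in the reply that carries a boolean correctness key. parse_correctness is a hypothetical helper, not part of the script above, and the sample replies are invented.

```python
import json
import re

FENCE = "`" * 3  # built programmatically so no literal fence appears in this example


def parse_correctness(reply: str) -> bool:
    """Extract the judge's boolean verdict; default to False if nothing parses."""
    # Prefer a fenced json block, then fall back to any flat {...} in the reply
    fenced = re.findall(
        re.escape(FENCE) + r"json\s*(\{.*?\})\s*" + re.escape(FENCE), reply, re.DOTALL
    )
    candidates = fenced or re.findall(r"\{[^{}]*\}", reply)
    for raw in reversed(candidates):
        try:
            verdict = json.loads(raw).get("correctness")
        except (json.JSONDecodeError, AttributeError):
            continue
        if isinstance(verdict, bool):
            return verdict
    return False


fenced_reply = f'Predicted: Paris\nTruth: Paris\n{FENCE}json\n{{ "correctness": true }}\n{FENCE}'
print(parse_correctness(fenced_reply))
print(parse_correctness('The answers differ. {"correctness": false}'))
print(parse_correctness("no verdict here"))
```

Defaulting to False keeps the score conservative: an unparseable judge reply counts as a miss rather than silently inflating accuracy.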
BrowseComp is an exceptionally difficult benchmark. It's specifically designed to push web-search agents beyond surface-level lookup tasks. Many of the questions require multi-step reasoning, creative search queries, and the ability to synthesize fragmented information spread across obscure sources. In OpenAI’s own testing, even advanced models like GPT-4o and GPT-4.5 performed poorly despite having browsing tools.

Surfacing bugs quickly with Weave

At first, I ran into a few issues with my evaluation script. One key bug was that I hadn't properly decrypted the labels in the dataset - so when the model output was being compared to the ground truth, it was always marked wrong. The bug was subtle because the model responses looked fine at a glance, but the scores were coming out as 0% across the board.
This is where Weave really helped. The comparisons view made it obvious something was off - the model’s output was being evaluated against what looked like gibberish or empty values. Once I opened up a specific trace and saw that the label field hadn’t been decrypted while the text field had, the problem was immediately clear.

The fix was simple: I just needed to call the decrypt function on both the question and the answer fields in the loop that loads the dataset. After that, the labels were properly displayed in Weave, and I could clearly see which examples were actually correct or not. Weave made that debugging process fast and visual - no guessing, just direct evidence that let me patch the script in seconds.
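Because the dataset's obfuscation is a simple XOR against a SHA-256-derived key stream, the same operation both encrypts and decrypts, which makes it easy to sanity-check the loader offline. The sketch below reuses the derive_key logic from the script. The encrypt helper, the xor_bytes name, and the sample row are made up for illustration and are not part of BrowseComp.

```python
import base64
import hashlib


def derive_key(password: str, length: int) -> bytes:
    """Repeat the SHA-256 digest of the password until it covers `length` bytes."""
    key = hashlib.sha256(password.encode()).digest()
    return key * (length // len(key)) + key[: length % len(key)]


def xor_bytes(data: bytes, password: str) -> bytes:
    """XOR data with the derived key stream (the operation is its own inverse)."""
    return bytes(a ^ b for a, b in zip(data, derive_key(password, len(data))))


def encrypt(plaintext: str, password: str) -> str:
    return base64.b64encode(xor_bytes(plaintext.encode(), password)).decode()


def decrypt(ciphertext_b64: str, password: str) -> str:
    return xor_bytes(base64.b64decode(ciphertext_b64), password).decode()


# Invented row in the same shape as the CSV: both fields are stored encrypted
row = {
    "problem": encrypt("Which city hosted the event?", "canary-123"),
    "answer": encrypt("Lisbon", "canary-123"),
    "canary": "canary-123",
}
example = {
    "text": decrypt(row["problem"], row["canary"]),
    "label": decrypt(row["answer"], row["canary"]),
}
assert example["label"] == "Lisbon"  # a still-encrypted label would fail here
print(example)
```

A check like the final assert, run on one known row, would have caught the undecrypted-label bug before any evaluation was launched.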
Here are the results for my evaluation:

Overall, our agent scored 6% accuracy on the 100-example subset of BrowseComp. Without testing on the full benchmark, it's difficult to tell how the final score would pan out, and with only 100 questions the estimate carries considerable uncertainty. Still, this seems like a promising result compared to some of the scores OpenAI has reported. According to OpenAI, their GPT-4o model with browsing enabled scored 1.9% accuracy, roughly a third of our agent's estimated score.
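To get a feel for how much a 100-question sample can be trusted, here is a small sketch that computes a Wilson score interval for 6 correct answers out of 100. This is a standard binomial approximation that I am applying here as an illustration; it was not part of the original evaluation, and it ignores any error introduced by the LLM judge. By my calculation the 95% interval comes out at roughly 3% to 12%.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


low, high = wilson_interval(6, 100)
print(f"6/100 correct -> 95% interval: {low:.1%} to {high:.1%}")
```

The width of that range is the main reason to rerun on the full benchmark before drawing firm conclusions.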


Conclusion

This walkthrough isn’t just about building a travel assistant - it’s about expanding what agents can actually do when connected to real-world data and tools. By combining Azure AI Foundry Agent Service’s orchestration features with live-function endpoints, OpenAPI schemas, and observability via Weave, you’re not just experimenting - you’re deploying agents with real agency. The emphasis shifts from chatting to acting. As you build more of these systems, what matters most isn’t which model you use, but how well your tools, logic, and data flow together to serve a purpose.