Going from demo to production with Google Cloud's Vertex AI Agent Builder
This article walks through building an AI-powered travel assistant with Vertex AI Agent Builder, integrating real-time data retrieval, tool orchestration, and monitoring to tackle the challenges of making an AI agent work reliably in production.
Building an AI agent demo is straightforward for many use cases. But making that same demo work reliably in production is a whole different story. The gap between a promising prototype and a reliable, scalable agent is what separates a demo from a useful product. Many teams showcase impressive capabilities in controlled environments, only to watch their agents break down when faced with real-world challenges.
In this article, the goal is to build an AI travel agent with Vertex AI Agent Builder, integrate W&B Weave for monitoring, and tackle the many challenges that come with deploying a production-ready system. We'll focus on giving the agent the ability to retrieve real-time hotel and restaurant recommendations, process user inputs dynamically, and intelligently manage tool calls based on context. We'll also go through many of the unexpected challenges that came up, including issues with input handling, tool interactions, and overall agent behavior.

Table of contents
Configuring Google Cloud IAM permissions
Setting up an agent with Vertex AI Agent Builder
1. Obtain a SerpAPI key
2. Create Google Cloud Run Functions (using Cloud Run UI)
3. Create OpenAPI schemas and tool descriptions
4. Create a "Playbook" inside the Vertex Conversational Agents Dashboard
5. Configure an LLM for the agent
Accessing our Vertex Agent in Python
Tracking costs and usage with Weave
Takeaways from building and refining the agent
1. Logging inputs and outputs is important for debugging
2. Agent confidence requires tuning
3. Tool functions must be created thoughtfully
4. Prompts are important for using the right tools
5. Edge cases need to be proactively tested
6. Exposing errors to users can improve usability
Final thoughts
Configuring Google Cloud IAM permissions
To start, make sure you have a Google Cloud account, have set up the Google Cloud CLI on your system, and have added the necessary IAM permissions. For more details on how to set these up, check out this article, which walks through the steps in more detail.
In addition to the IAM roles specified in those steps, you will also need to add the roles shown below:

Setting up an agent with Vertex AI Agent Builder
Now we’re ready to start working on our agent. This involves setting up APIs, defining tools, and integrating them into a “Playbook” (more on this later) to enable multi-step workflows. I will cover the steps below:
1. Obtain a SerpAPI key
SerpAPI is used to fetch hotel and restaurant data. You need an API key to access it. If you don’t have one, sign up at SerpAPI and generate a key. This key will be required in the cloud functions to make API calls.
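Before wiring the key into any cloud functions, it can help to sanity-check it with a direct SerpAPI request. Here's a minimal Python sketch; the key is a placeholder, and the parameters mirror the ones the functions below will use:

import requests

API_KEY = "YOUR_SERPAPI_KEY"  # placeholder

# Hit SerpAPI's Google Local engine directly to confirm the key works
params = {
    "engine": "google_local",
    "q": "sushi",
    "location": "Kansas City, MO",
    "api_key": API_KEY,
}
resp = requests.get("https://serpapi.com/search", params=params, timeout=30)
resp.raise_for_status()
print(f"Got {len(resp.json().get('local_results', []))} local results")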
2. Create Google Cloud Run Functions (using Cloud Run UI)
To handle API calls for searching hotels and food places, we create two functions using the Cloud Run Functions web UI editor. Cloud Run functions were handy for this project thanks to their serverless design, auto-scaling, and pay-per-use model. They also handle traffic spikes efficiently and support containerized apps without infrastructure management.
Each function extracts parameters from user queries and makes an external API call. Below is the implementation for foodSearch:
const https = require('https');
const API_KEY = "YOUR_SERPAPI_KEY";

exports.foodSearch = (req, res) => {
  console.log("🔵 Function triggered");

  try {
    const { location, latitude, longitude, query, cuisine } = req.body;

    if (!query) {
      console.error("❌ Missing 'query' parameter:", req.body);
      return res.status(400).json({ error: "Missing required parameter 'query'" });
    }

    // Construct search query
    let searchQuery = query;
    if (cuisine) {
      searchQuery += ` ${cuisine}`;
    }

    let searchParams;
    if (latitude && longitude) {
      console.log(`🌍 Using latitude/longitude: ${latitude}, ${longitude}`);
      searchParams = `ll=@${latitude},${longitude},14z`;
    } else if (location) {
      console.log(`📍 Using location string: ${location}`);
      searchParams = `location=${encodeURIComponent(location)}`;
    } else {
      console.error("❌ Missing location parameters (either 'location' or 'latitude/longitude' required)");
      return res.status(400).json({ error: "Missing location parameters. Provide 'location' or 'latitude/longitude'." });
    }

    const url = `https://serpapi.com/search?engine=google_local&q=${encodeURIComponent(searchQuery)}&${searchParams}&api_key=${API_KEY}`;
    console.log("🟠 Requesting:", url);

    https.get(url, (apiRes) => {
      let data = '';

      apiRes.on('data', (chunk) => {
        data += chunk;
      });

      apiRes.on('end', () => {
        console.log("🟢 API Response received");

        try {
          const jsonResponse = JSON.parse(data);

          if (!jsonResponse.local_results || jsonResponse.local_results.length === 0) {
            console.warn("⚠️ No food places found in API response.");
            return res.status(404).json({ error: "No food places found" });
          }

          const foodPlaces = jsonResponse.local_results.map(place => ({
            name: place.title || "Unknown",
            rating: place.rating || "N/A",
            reviews: place.reviews || 0
          }));

          console.log("✅ Sending response:", foodPlaces);
          res.setHeader('Content-Type', 'application/json');
          res.status(200).json({ food_places: foodPlaces });
        } catch (err) {
          console.error("🔴 JSON Parse Error:", err.message);
          res.status(500).json({ error: "Failed to parse API response", details: data });
        }
      });
    }).on('error', (err) => {
      console.error("❌ API Request Error:", err.message);
      res.status(500).json({ error: "Failed to fetch data from API", details: err.message });
    });
  } catch (error) {
    console.error("❌ Function Error:", error.message);
    res.status(500).json({ error: "Server error", details: error.message });
  }
};
I apologize for not sticking with Python here. I ran into a few issues using the requests library in my Python Cloud Function, so I had to convert the function to Node.js.
The foodSearch function extracts the user’s query (e.g., "sushi"), location details (latitude/longitude or a city name), and makes an external API request to SerpAPI. It validates that the query is present, builds the API request URL dynamically, and then fetches results from Google’s local search. The function formats the response before sending it back to the user.
My hotelSearch cloud function follows the same pattern, but retrieves hotels based on location, check-in date, check-out date, and the number of adults. The full implementation is also available in the GitHub repo.
const https = require('https');
const API_KEY = "YOUR_SERPAPI_KEY";

exports.hotelSearch = (req, res) => {
  console.log("🔵 Function triggered");

  try {
    const { location, check_in_date, check_out_date, adults } = req.body;
    let { currency } = req.body;

    if (!location || !check_in_date || !check_out_date || !adults) {
      console.error("❌ Missing parameters:", req.body);
      return res.status(400).json({ error: "Missing required parameters", received: req.body });
    }

    // Default currency to USD if not provided
    currency = currency || "USD";

    const url = `https://serpapi.com/search?engine=google_hotels&q=${encodeURIComponent(location)}&check_in_date=${encodeURIComponent(check_in_date)}&check_out_date=${encodeURIComponent(check_out_date)}&adults=${encodeURIComponent(adults)}&currency=${encodeURIComponent(currency)}&api_key=${API_KEY}`;
    console.log("🟠 Requesting:", url);

    https.get(url, (apiRes) => {
      let data = '';

      apiRes.on('data', (chunk) => {
        data += chunk;
      });

      apiRes.on('end', () => {
        console.log("🟢 API Response received");

        try {
          const jsonResponse = JSON.parse(data);

          if (!jsonResponse.properties || jsonResponse.properties.length === 0) {
            console.warn("⚠️ No hotels found in API response.");
            return res.status(404).json({ error: "No hotels found" });
          }

          const hotels = jsonResponse.properties.map(hotel => ({
            name: hotel.name || "No Name",
            price: hotel.rate_per_night?.lowest || "N/A",
            currency: currency, // Include currency for clarity
            lat: hotel.gps_coordinates?.latitude || null,
            lon: hotel.gps_coordinates?.longitude || null,
            link: hotel.link || "N/A"
          }));

          console.log("✅ Sending response:", hotels);
          res.setHeader('Content-Type', 'application/json');
          res.status(200).json({ hotels });
        } catch (err) {
          console.error("🔴 JSON Parse Error:", err.message);
          res.status(500).json({ error: "Failed to parse API response", details: data });
        }
      });
    }).on('error', (err) => {
      console.error("❌ API Request Error:", err.message);
      res.status(500).json({ error: "Failed to fetch data from API", details: err.message });
    });
  } catch (error) {
    console.error("❌ Function Error:", error.message);
    res.status(500).json({ error: "Server error", details: error.message });
  }
};
The hotelSearch function takes in the user’s location, check-in and check-out dates, and the number of adults. It constructs an API call to SerpAPI’s hotel search, retrieves the results, and formats them before returning them to the user. The function ensures required parameters are present and defaults the currency to USD if not provided.
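Once both functions are deployed, a quick smoke test with plain HTTP requests is a good way to confirm they behave as expected before hooking them up to the agent. Below is a minimal Python sketch; the Cloud Run URLs are placeholders for your own deployments, and the request bodies match the parameters the functions expect:

import requests

HOTEL_URL = "https://hotelsearch-xxxxxxxx-uc.a.run.app"  # placeholder
FOOD_URL = "https://foodsearch-xxxxxxxx-uc.a.run.app"    # placeholder

# hotelSearch requires location, check_in_date, check_out_date, and adults;
# currency is optional and defaults to USD
hotel_body = {
    "location": "Kansas City, MO",
    "check_in_date": "2025-03-10",
    "check_out_date": "2025-03-12",
    "adults": 2,
}
print(requests.post(HOTEL_URL, json=hotel_body, timeout=60).json())

# foodSearch requires query, plus either a location string or latitude/longitude
food_body = {"query": "sushi", "location": "Kansas City, MO"}
print(requests.post(FOOD_URL, json=food_body, timeout=60).json())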
3. Create OpenAPI schemas and tool descriptions
Now, navigate to the "Conversational Agents" section in the Vertex AI Agent Builder dashboard:

After selecting the "Tools" pane and clicking "Create a Tool," you can configure how the agent will use the tool. Here, you create the OpenAPI schemas that let the agent call our functions. These schemas define how the agent interacts with each function, specifying request parameters, endpoints, and expected responses. I'll share the full schemas for the foodSearcher and the hotelSearcher in the GitHub repo, along with some helpful docs on how to create schemas for your tools.
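To give a sense of the shape these schemas take, here's a trimmed, illustrative sketch of what a foodSearch schema might look like, built as a Python dict and dumped to YAML for pasting into the tool editor. The server URL is a placeholder and the request-body fields mirror the cloud function above; refer to the repo for the actual schemas.

import yaml  # pip install pyyaml

# Trimmed, illustrative OpenAPI 3.0 schema for the foodSearch tool
food_search_schema = {
    "openapi": "3.0.0",
    "info": {"title": "Food Searcher", "version": "1.0.0"},
    "servers": [{"url": "https://foodsearch-xxxxxxxx-uc.a.run.app"}],  # placeholder
    "paths": {
        "/": {
            "post": {
                "summary": "Find restaurants near a location",
                "operationId": "foodSearch",
                "requestBody": {
                    "required": True,
                    "content": {
                        "application/json": {
                            "schema": {
                                "type": "object",
                                "required": ["query"],
                                "properties": {
                                    "query": {"type": "string"},
                                    "location": {"type": "string"},
                                    "latitude": {"type": "number"},
                                    "longitude": {"type": "number"},
                                    "cuisine": {"type": "string"},
                                },
                            }
                        }
                    },
                },
                "responses": {
                    "200": {
                        "description": "List of food places",
                        "content": {"application/json": {"schema": {"type": "object"}}},
                    }
                },
            }
        }
    },
}

# Print YAML that can be pasted into the tool's schema field
print(yaml.safe_dump(food_search_schema, sort_keys=False))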
Here’s a screenshot of my “Food Searcher” tool in the Vertex Conversational Agents dashboard:

Additionally, here's a screenshot of the “hotel searcher” tool:

Once the schemas and descriptions are ready, deploy the functions in Cloud Run. While I allowed unauthenticated function requests for initial testing, it's strongly recommended to use authenticated invocations in a production environment. To secure the functions, configure them to require authentication and grant your service account the "Cloud Run Invoker" role on the IAM page of the Google Cloud console.
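For reference, here's a minimal sketch of how a client can call one of the functions once unauthenticated access is turned off: it uses google-auth to fetch an identity token whose audience is the Cloud Run service URL, then sends it as a bearer token. The URL is a placeholder, and the credentials are assumed to come from a service account (for example, via GOOGLE_APPLICATION_CREDENTIALS).

import requests
import google.auth.transport.requests
from google.oauth2 import id_token

SERVICE_URL = "https://foodsearch-xxxxxxxx-uc.a.run.app"  # placeholder

# Fetch an ID token for the Cloud Run service (requires service account
# credentials or a GCP environment with a metadata server)
auth_req = google.auth.transport.requests.Request()
token = id_token.fetch_id_token(auth_req, SERVICE_URL)

resp = requests.post(
    SERVICE_URL,
    json={"query": "sushi", "location": "Kansas City, MO"},
    headers={"Authorization": f"Bearer {token}"},
    timeout=60,
)
print(resp.status_code, resp.json())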
4. Create a "Playbook" inside the Vertex Conversational Agents Dashboard
Once the functions are deployed, add them as tools in Vertex AI Agent Builder and define instructions in a Playbook that outlines when and how each tool should be used.
A playbook in Vertex AI is a structured set of instructions that guides the agent’s behavior in specific scenarios. It defines how the agent should respond to different user intents, when to invoke tools, and how to handle conversations dynamically. Playbooks help ensure the agent follows a consistent logic when managing queries, making interactions more reliable and context-aware.

Here’s a screenshot of the main playbook used for the agent. The goal prompt used is fairly generic, which may or may not be optimal for you, but it worked well in testing for this agent. In the instructions section, we tell the agent how we want it to use the tools available.
With this setup complete, the agent can now answer travel-related queries using real-time API calls while following the structured logic outlined in the playbook.
5. Configure an LLM for the agent
Once the tools and playbook are set up, the next step is to configure the LLM that powers the agent's responses. In Vertex AI Agent Builder, you can select a generative model under the Generative AI settings.

For this project, we'll use Gemini 1.5 Flash (Preview) as the LLM. By the time you read this, Gemini 2.0 Flash is likely available inside Agent Builder, but at the time of building this project I only had access to Gemini 1.5 Flash. The token limits can also be adjusted for both input and output. Here, the input token limit is set to 8k and the output token limit to 512. You may want to raise the output limit, since it caps how many tokens the model can use in its reply; as a rough rule of thumb, a token corresponds to about four characters of English text, so a word is usually one or two tokens. This balance allows the model to handle relatively long input prompts while keeping responses concise. Higher token limits can increase latency and cost, so it's important to test and find a configuration that works well for your specific use case.
The temperature setting controls response variability - lower values make the model more deterministic, while higher values allow for more variation in responses. The temperature is set to 1, which provides a mix of predictability and flexibility. This means the agent won't be overly rigid but will still follow structured instructions relatively well.
With the LLM configured, the agent is now ready to process queries, call tools dynamically, and generate responses based on real-time API data while adhering to the logic outlined in the playbook.
After several iterations of the prompts and tools, I was able to build an agent that works decently well for finding hotels and restaurants nearby. I’ll share a screenshot below of a conversation I had with my agent:

Accessing our Vertex Agent in Python
Now we're ready to write some code that will use our agent. In addition to setting up a chat system between the user and the agent, we will also set up logging with Weave in order to log conversations as well as token usage. This will allow us to monitor how the agent interacts with tools, track how much context is being used in each interaction, and better understand how efficiently the agent is operating. By logging function inputs, outputs, and token consumption, we can catch errors, optimize costs, and refine our prompt strategies.
Here’s the code for running inference with our agent:
import uuid
import json
import weave
from google.cloud import dialogflowcx_v3beta1 as dialogflow

weave_client = weave.init("vertex-agent")

# Add token cost configuration
weave_client.add_cost(
    llm_id="gemini_flash_002",
    prompt_token_cost=0.00001875,   # Cost per 1,000 characters when <= 128k input tokens
    completion_token_cost=0.000075  # Cost per 1,000 characters
)

# Dialogflow CX configuration
PROJECT_ID = "your gcloud project"
LOCATION_ID = "us-central1"
AGENT_ID = "your agent ID"
LANGUAGE_CODE = "en"
SESSION_ID = str(uuid.uuid4())  # Generate a persistent session ID

def simple_token_count(text: str) -> int:
    """Simple token count excluding whitespaces."""
    return len(text.replace(" ", ""))

@weave.op
def run_dialogflow_inference(text, session_id=SESSION_ID):
    """Sends a text query to Dialogflow CX and returns bot reply, tool input params, and tool output."""
    api_endpoint = f"{LOCATION_ID}-dialogflow.googleapis.com"
    client_options = {"api_endpoint": api_endpoint}
    session_client = dialogflow.SessionsClient(client_options=client_options)
    session_path = f"projects/{PROJECT_ID}/locations/{LOCATION_ID}/agents/{AGENT_ID}/sessions/{session_id}"

    text_input = dialogflow.TextInput(text=text)
    query_input = dialogflow.QueryInput(text=text_input, language_code=LANGUAGE_CODE)
    request = dialogflow.DetectIntentRequest(session=session_path, query_input=query_input)

    # Count input tokens before making the request
    input_tokens = simple_token_count(text)

    response = session_client.detect_intent(request=request)

    # Convert response to JSON for inspection
    response_dict = json.loads(response.__class__.to_json(response))
    print(json.dumps(response_dict, indent=2))  # Print full response for debugging

    # Extract bot reply safely
    bot_reply = "No response from agent."
    qr = response_dict.get("queryResult", {})
    rm = qr.get("responseMessages", [])
    if rm and rm[0].get("text", {}).get("text"):
        bot_reply = rm[0]["text"]["text"][0]

    # Count output tokens
    output_tokens = simple_token_count(bot_reply)

    # Extract toolCall data from actionTracingInfo
    tool_input_params = None
    tool_output = None
    action_tracing_info = qr.get("generativeInfo", {}).get("actionTracingInfo", {})
    if "actions" in action_tracing_info:
        for action in action_tracing_info["actions"]:
            if "toolUse" in action:
                tool_call_data = action["toolUse"]
                tool_input_params = tool_call_data.get("inputActionParameters", {}).get("requestBody", {})
                tool_output = tool_call_data.get("outputActionParameters", {}).get("200", {})
                print("TOOL INPUT PARAMS:", json.dumps(tool_input_params, indent=2))
                print("TOOL OUTPUT:", json.dumps(tool_output, indent=2))

                # Adjust token counts for tool input/output
                output_tokens += simple_token_count(str(tool_input_params))
                input_tokens += simple_token_count(str(tool_output))

    # Return response with token usage information
    return {
        "bot_reply": bot_reply,
        "tool_input_params": tool_input_params,
        "tool_output": tool_output,
        "model": "gemini_flash_002",
        "usage": {
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "total_tokens": input_tokens + output_tokens
        }
    }

# Example chat loop
if __name__ == "__main__":
    print("Start chatting with the agent (type 'exit' to stop):")
    while True:
        user_text = input("You: ")
        if user_text.lower() in ["exit", "quit"]:
            break

        # Set AGENT_ID and SESSION_ID for each call inside main
        with weave.attributes({"AGENT_ID": AGENT_ID, "SESSION_ID": SESSION_ID}):
            response = run_dialogflow_inference(user_text, SESSION_ID)

        print(f"Bot: {response['bot_reply']}")
        if response["tool_input_params"]:
            print(f"Function Input Params: {response['tool_input_params']}")
        if response["tool_output"]:
            print(f"Function Output: {response['tool_output']}")
        print(f"Token Usage: {response['usage']}")
The script initializes a chat loop where users can interact with the agent, while also tracking token usage per interaction. Each query is sent to Dialogflow CX, and responses are parsed for bot replies, tool calls, and outputs. If tools are used, their inputs and outputs are extracted and displayed, ensuring full visibility into how the agent is making decisions. Weave attributes are used to log session and agent details, enabling better observability when running the system in production.
With this setup, we can continuously evaluate and refine the agent based on real interactions, ensuring that it behaves as expected while keeping resource usage under control.
Tracking costs and usage with Weave
LLMs can be expensive to run, and logging input and output token usage helps us analyze efficiency and optimize the system. However, unlike many other LLMs that use tokens as the pricing metric, Gemini 1.5 models are priced based on character count.
To account for this difference, we use a simple function that approximates token usage by counting non-whitespace characters, giving us a rough estimate of how much processing power each interaction consumes. The Dialogflow CX API currently does not return token usage, so for now, we will manually estimate usage to track resource consumption. This information is captured in the usage dictionary of the output of our agent inference function, which logs input, output, and total character counts for each response.
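As a quick back-of-the-envelope check, the same character counts can be turned into a rough cost estimate using the per-1,000-character rates configured in add_cost() above. The prompt and reply strings below are purely illustrative; treat the output as a demonstration of the mechanism rather than a billing figure.

# Rough cost estimate based on the whitespace-stripped character counts logged above.
# Rates are the per-1,000-character values from the add_cost() configuration.
PROMPT_RATE_PER_1K_CHARS = 0.00001875
COMPLETION_RATE_PER_1K_CHARS = 0.000075

def simple_token_count(text: str) -> int:
    """Same whitespace-stripped character count used in the inference script."""
    return len(text.replace(" ", ""))

prompt = "find me a sushi place near my hotel in kansas city"          # illustrative
reply = "Here are a few highly rated sushi spots near your hotel..."   # illustrative

input_chars = simple_token_count(prompt)
output_chars = simple_token_count(reply)
estimated_cost = (
    input_chars / 1000 * PROMPT_RATE_PER_1K_CHARS
    + output_chars / 1000 * COMPLETION_RATE_PER_1K_CHARS
)
print(f"{input_chars} input chars, {output_chars} output chars, ~${estimated_cost:.8f}")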

The Weave integration also plays a key role here by logging session details (AGENT_ID and SESSION_ID), helping us track conversations over time. This makes it easier to debug issues, identify inefficient tool usage, and monitor how the agent is interacting with different tools in production. To simplify tracking, we use the with weave.attributes({...}) context manager to avoid passing the session ID as a function argument while still making sure it’s logged correctly. This way, later on, we can filter and analyze interactions by session, making it easier to investigate specific conversations without needing to modify function signatures. Here’s a screenshot inside Weave of how we can filter by session ID:

To ensure that token cost and counting work properly, you must include the model identifier and the usage dictionary in the return statement. The usage field tracks input tokens, output tokens, and total token usage, while the model identifier ("gemini_flash_002") is required for cost calculations. Without these values, Weave won’t be able to properly associate the request with the correct pricing model or estimate token consumption.
"model": "gemini_flash_002","usage": {"input_tokens": input_tokens,"output_tokens": output_tokens,"total_tokens": input_tokens + output_tokens}
Additionally, pricing information is explicitly configured using Weave, since Gemini models charge based on character count instead of tokens. This is done by initializing Weave and using add_cost() to define the cost per 1,000 characters for both prompt and completion tokens:
weave_client = weave.init("vertex-agent")

weave_client.add_cost(
    llm_id="gemini_flash_002",
    prompt_token_cost=0.00001875,   # Cost per 1,000 characters when <= 128k input tokens
    completion_token_cost=0.000075  # Cost per 1,000 characters when <= 128k input tokens
)
Without the model identifier and token usage values, cost tracking will not function correctly, making it difficult to monitor and optimize expenses.
Takeaways from building and refining the agent
Building a reliable AI agent isn’t just about integrating tools; it’s about ensuring they work consistently and accurately under real-world conditions. Through testing and iteration, a few key takeaways emerged that highlight the challenges of tool orchestration, tool input/output dependencies, and handling ambiguity.
1. Logging inputs and outputs is important for debugging
Monitoring AI agents is more complex than tracking traditional LLM-based applications. Agents that rely on multiple tools, APIs, and retrieval systems introduce new complexities, which aren’t obvious in a well-rehearsed demo. Costs, response latency, and failure modes all become magnified at scale, and without proper observability, debugging is a nightmare.
One of the biggest lessons is the importance of logging function inputs and outputs. Without clear logging, it’s difficult to diagnose why the agent is failing to retrieve correct data or why certain tool calls aren’t working as expected. During the initial debugging phases, the chat simulator inside Vertex AI Agent Builder was extremely helpful because it showed the exact tool calls, their inputs, and outputs in real time. This made it easy to pinpoint errors and refine how the agent handled tool selection. However, once the agent is deployed to production, ongoing monitoring is just as important—which is where Weave comes in. By integrating Weave, we can capture detailed logs of tool usage, input patterns, and failures to maintain observability even after deployment. This ensures we have all the necessary data to debug without having to rely solely on user reports or manually reproducing issues.
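One pattern that can make these logs even easier to work with is re-emitting each tool call as its own Weave op, so tool inputs and outputs show up as separate, filterable traces instead of being buried inside the top-level inference call. Here's a minimal sketch meant to live alongside the inference script; the helper name and the tool label are hypothetical.

import weave

@weave.op
def log_tool_call(tool_name: str, tool_input: dict, tool_output: dict) -> dict:
    """Hypothetical helper: re-emit a tool call as its own Weave op so its
    inputs and outputs appear as a separate trace under the parent call."""
    return {"tool": tool_name, "input": tool_input, "output": tool_output}

# Inside run_dialogflow_inference, after extracting tool_input_params and
# tool_output, something like this would surface the call in Weave:
# log_tool_call("foodSearch", tool_input_params or {}, tool_output or {})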
2. Agent confidence requires tuning
One of the more nuanced challenges in building a production-ready agent is tuning how the agent balances asking for more information against proceeding with what it already has in context. If the agent automatically triggers tools without verifying key details, it risks acting on incorrect assumptions—like misinterpreting "kc" as the wrong location. But if the agent asks too many clarifying questions, it slows down the user experience and feels inefficient. Below is an example of my agent asking a fairly obvious question:

Through testing, it became clear that the agent already has some built-in handling for this, but adjusting the tool and playbook prompts can make it more or less cautious when deciding whether to ask for additional details. If the function requires a high degree of confidence in certain parameters, the prompt should encourage explicit confirmation. If the tool can reasonably infer missing details, the prompt can be tweaked to let it proceed without over-questioning the user. This level of control allows for fine-tuning how much validation happens before a tool is called, ensuring the agent balances accuracy and efficiency based on the specific use case.
3. Tool functions must be created thoughtfully
Building reliable tool functions isn’t just about ensuring they work individually—it’s about making sure they interact smoothly within the agent's broader workflow. Since an LLM acts as a sort of middleman between function calls, it has to infer how to connect outputs from one tool to inputs for another. This introduces complexity that isn't present in traditional applications, where function outputs are explicitly passed in code with well-defined input types. If the LLM doesn't receive structured data in a way that aligns with its reasoning process, function calls can become unreliable or fail outright.
One of the clearest examples of this issue came from searching for restaurants near a hotel. Initially, the food search tool used broad location inputs (e.g., "Kansas City, MO"), which led to less relevant results because it didn't have specific details about the hotel’s location. The food search tool also wasn’t robust enough to return accurate results based on hotel names alone. The fix was modifying the hotel search function to return latitude/longitude coordinates, allowing the food searcher to narrow down recommendations to precise locations instead of relying on vague city-wide searches.
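Here's a small sketch of that output-to-input dependency: a trimmed hotel entry shaped like what hotelSearch returns supplies the coordinates for the follow-up foodSearch call, instead of a vague city-wide location string. The Cloud Run URL and the example hotel values are placeholders.

import requests

FOOD_URL = "https://foodsearch-xxxxxxxx-uc.a.run.app"  # placeholder

# Example shape of one hotelSearch result (fields match the Node.js function above)
hotel = {"name": "Example Hotel", "price": "$150", "lat": 39.0997, "lon": -94.5786}

# In the agent, the LLM performs this handoff; here we do it explicitly
food = requests.post(FOOD_URL, json={
    "query": "sushi",
    "latitude": hotel["lat"],    # precise coordinates instead of "Kansas City, MO"
    "longitude": hotel["lon"],
}, timeout=60).json()

print(food.get("food_places", [])[:3])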
Beyond structuring tool outputs correctly, there’s also the challenge of handling large amounts of returned data. When tool searches return too much information, it can overwhelm the LLM, causing context window issues and making it harder for the agent to extract relevant insights. On the other hand, returning too little data can lead to incomplete or unhelpful responses. The best approach is to dynamically adjust the level of detail depending on the number of tool calls that will be used, the context length of the LLM, as well as cost constraints.
Since the LLM is responsible for orchestrating function calls, carefully crafted prompts are just as important as well-structured function outputs. In addition to modifying the hotel search function, I also had to adjust the prompts for the food search tool so that it would use latitude and longitude coordinates if they were available, ensuring searches were based on precise locations rather than just city names. These small but necessary refinements make a huge difference in ensuring tool outputs are actually usable by the next function call, reducing failures, and improving overall accuracy.
Even seemingly minor changes to function outputs can affect how the LLM uses subsequent tools, leading to unexpected failures or incorrect tool selections. Since the LLM interprets responses dynamically rather than following strict programmatic rules, a small modification - such as changing a field name or slightly altering the output format - can cause the agent to misinterpret or fail to pass required parameters. Because of this, rigorously testing the agent after every change is essential, no matter how small the update may seem. Ensuring consistency across tool outputs and prompts helps prevent cascading failures that can degrade the agent’s reliability over time.
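A lightweight way to enforce this is a handful of regression-style smoke tests that re-run canned queries through run_dialogflow_inference after every prompt or tool change. The sketch below assumes it lives in the same file as the inference script above; the queries and expectations are purely illustrative.

import uuid

# (query, whether we expect a tool call) - extend as new failure modes appear
SMOKE_TESTS = [
    ("find me a hotel in Kansas City, MO from March 10 to March 12 for 2 adults", True),
    ("what sushi places are near downtown Kansas City?", True),
    ("hello!", False),  # small talk should not trigger a tool call
]

def run_smoke_tests():
    failures = []
    for query, expect_tool in SMOKE_TESTS:
        result = run_dialogflow_inference(query, session_id=str(uuid.uuid4()))
        used_tool = result["tool_input_params"] is not None
        if not result["bot_reply"] or used_tool != expect_tool:
            failures.append((query, result["bot_reply"], used_tool))
    assert not failures, f"Smoke test failures: {failures}"

run_smoke_tests()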
4. Prompts are important for using the right tools
The effectiveness of the agent isn’t just about having the right tools - it’s about how well the prompts instruct the agent to use them. Many of the challenges discussed earlier, like ensuring the food search tool used latitude and longitude coordinates when available, were mostly resolved through prompt adjustments rather than changes to the underlying functions. This highlights just how important well-crafted prompts are in guiding tool execution.
For example:
- In our example, if the orchestration prompt doesn’t clearly specify that hotel searches should return latitude/longitude, the agent might continue relying on city names, leading to less precise restaurant searches.
- Each tool-specific prompt needs to include explicit instructions on when and how the tool should be used, preventing unnecessary or redundant calls.
Even small tweaks in prompt wording can dramatically impact tool selection and execution accuracy. Ensuring that each tool’s prompt clearly defines input requirements, expected outputs, and how results should be used by other tools is just as important as the function’s implementation itself.
5. Edge cases need to be proactively tested
Like the challenges faced above, there are always unexpected scenarios that can break the agent or cause it to return inaccurate results. These issues aren’t always obvious during development, especially when testing with controlled inputs. In reality, different people will interact with the agent in surprising and unpredictable ways - they might phrase requests differently, provide incomplete details, or use shorthand the agent isn’t designed to handle.
Because of this, testing with real users is one of the best ways to uncover edge cases that wouldn't be caught otherwise. Observing how users naturally engage with the system helps refine prompts, improve tool execution, and ensure the agent is actually usable in real-world conditions. The more the agent is tested under real usage scenarios, the better it can adapt and handle unexpected inputs without failing.
One key failure mode that needs to be addressed is when users ask detailed, factual questions that fall outside the scope of the implemented tools. For example, if a user asks, "Does the hotel have a gym?" but there’s no specific tool that retrieves hotel amenities, the agent might fail to respond meaningfully. To handle this, a catch-all tool - similar to a default case in a switch statement - should be implemented. This tool would attempt to process queries that aren't directly covered by any specialized function, either by querying an external API, triggering a general knowledge retrieval function, or guiding the user toward alternative resources.
Without a fallback mechanism, the agent will inevitably hit dead ends when asked about details it wasn’t explicitly designed to handle. By proactively accounting for these gaps, the system can maintain robustness and avoid situations where users feel the agent is unhelpful or unreliable.
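To make the idea concrete, here's an illustrative sketch of what such a catch-all tool could look like as a Python Cloud Run function. The real functions in this project are Node.js; this is just a sketch of the fallback pattern using the functions_framework entry point, with the general-search logic left as a stub.

import functions_framework

@functions_framework.http
def fallback_search(request):
    """Illustrative catch-all tool: accept any query the specialized tools
    don't cover, log it, and return a structured payload the agent can turn
    into a helpful reply."""
    body = request.get_json(silent=True) or {}
    query = body.get("query", "")
    if not query:
        return ({"error": "Missing required parameter 'query'"}, 400)

    # Stub: a general knowledge or web-search API call would go here
    print(f"Fallback tool received out-of-scope query: {query!r}")
    return ({
        "answer": None,
        "message": (
            "I don't have a dedicated tool for that yet. "
            "Try asking about hotels or restaurants, or check the hotel's website."
        ),
        "query": query,
    }, 200)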
6. Exposing errors to users can improve usability
When tool calls fail, should the user know? The ideal solution is for the agent to handle failures automatically and retry, but if that’s not possible, informing the user and allowing them to correct the agent is often better than forcing them to start over (in my opinion).
One approach is to implement an error summarizer that can analyze failures, explain what went wrong in simple terms, and determine whether the agent should retry or prompt the user for clarification. This prevents situations where the agent silently fails or returns a vague error message, leaving the user stuck.
While relying on the user to correct an issue isn’t ideal, it’s still a better experience than forcing them to restart entirely. An error summarizer makes it easier to surface actionable insights, helping either the agent to attempt another tool call with adjusted parameters or the user to provide more precise input without unnecessary frustration.
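Here's a minimal sketch of that error-summarizer idea: a small helper that maps a failed tool response to a user-facing explanation and a retry decision. The rules and messages below are illustrative placeholders; a production version might use the LLM itself to write the explanation.

# Illustrative error summarizer: map raw tool failures to a user message and a
# retry decision. The rules are placeholders; in practice this could be another LLM call.
def summarize_tool_error(tool_name: str, status_code: int, error_body: dict) -> dict:
    detail = error_body.get("error", "unknown error")
    if status_code == 400 and "Missing" in detail:
        return {
            "user_message": f"I need a bit more information to search ({detail}). "
                            "Could you provide it?",
            "should_retry": False,  # ask the user instead of retrying blindly
        }
    if status_code == 404:
        return {
            "user_message": f"I couldn't find any results with {tool_name}. "
                            "Want me to try a nearby area or different dates?",
            "should_retry": False,
        }
    return {
        "user_message": "Something went wrong on my side; let me try that again.",
        "should_retry": True,  # transient/server errors are worth one retry
    }

# Example: the foodSearch function returns 404 with {"error": "No food places found"}
print(summarize_tool_error("foodSearch", 404, {"error": "No food places found"}))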
Final thoughts
Building a production-ready AI agent is an iterative process. The biggest challenges aren’t just about calling APIs; they're about ensuring inputs and outputs work together seamlessly, balancing automation against user confirmation, and structuring prompts effectively.
By logging tool interactions, testing edge cases, refining input validation, and exposing failures when necessary, the agent becomes far more reliable and adaptable in real-world scenarios.
Our agent works well enough to demonstrate the concept, but it’s far from perfect. There are still plenty of issues to fix, edge cases to handle, and improvements to make before it’s actually useful in a real setting. Hopefully, this breakdown gives a sense of what it takes to move from a basic demo to something more reliable.
If you’re working on something similar or have ideas on making these kinds of agents more robust, please let me know in the comments below.