LLM Observability with W&B and Ploomber
Introduction
In machine learning and artificial intelligence, observability is a crucial capability: it enables developers to monitor, understand, and optimize the performance of their models. One tool that stands out in this domain is Weights & Biases (W&B), a platform designed to track experiments, visualize data, and share insights.
In this blog, we'll explore how W&B can be leveraged to enhance the observability of applications powered by large language models (LLMs) like GPT-3.5.
We can interact with LLMs using a technique called "prompting." Essentially, this is where we provide the model with a prompt (a piece of text) and the model generates a response based on the input.
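For instance, a minimal prompting round trip with OpenAI's Chat Completions API (via the openai Python package) might look like the sketch below; the model name and prompt text are illustrative.
# Minimal prompting example (illustrative); assumes the openai package is
# installed and OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Recommend a science fiction novel."},
    ],
)
print(response.choices[0].message.content)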
One of the challenges with LLMs is that they can sometimes produce incorrect or biased outputs ("hallucinations"), so it's important to monitor and understand their behavior through observability. Observability goes beyond traditional monitoring; it's about gaining insights into the internal states of the model, understanding its behavior, and identifying areas for improvement.
Benefits of Using W&B for LLM Observability
Weights and Biases offers a suite of features that can significantly enhance the observability of LLMs. Here are some key benefits:
- Experiment tracking: W&B allows for systematic tracking of experiments, making it easier to compare different model versions and configurations.
- Real-time monitoring: Developers can monitor the model's performance in real time, identifying issues as they arise.
- Insightful visualizations: W&B provides a range of visualization tools that help in understanding the model's behavior and identifying patterns.
- Collaboration and sharing: Insights and results can be easily shared with team members or the broader community, fostering collaboration.
These features make W&B an ideal platform for enhancing the observability of LLMs. In the following sections, we'll explore how W&B can be integrated with LLMs like GPT-3.5 to achieve these benefits.
Understanding Tracing in Weights & Biases
Tracing is a feature in Weights & Biases that provides a structured way to log and visualize the execution flow of large language models and other machine learning pipelines. It helps with debugging, monitoring, reproducibility, and performance optimization by capturing detailed information about each step in the execution process.
The process can be broken down into the following steps:
- Initialization: A W&B run is started using wandb.init(), which sets up the environment for logging data to the W&B dashboard.
- Logging Traces: Traces are logged using the Trace class from wandb.sdk.data_types.trace_tree. A trace represents a single execution step or span in the pipeline. Each trace captures information such as the start and end times, inputs and outputs, status, and additional metadata.
- Nested spans: Traces can have nested child spans, allowing for a hierarchical representation of the pipeline. For example, a root span for an agent might have child spans for a chain, which in turn has child spans for individual LLM calls (a minimal sketch follows this list).
- Viewing Traces: Once logged, traces can be viewed in the W&B dashboard, where they are presented in a trace table and trace timeline. This visualization helps in understanding the sequence of steps and their dependencies.
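Putting the first three steps together, a minimal sketch might look like this; the project name, span names, timings, and the hard-coded response are illustrative.
# Minimal tracing sketch: a root "agent" span with a nested "llm" child span.
import time
import wandb
from wandb.sdk.data_types.trace_tree import Trace

wandb.init(project="llm-observability-demo")   # 1. initialization

query = "Recommend a science fiction novel."
start_ms = int(time.time() * 1000)
answer = "You might enjoy Dune by Frank Herbert."   # stand-in for a real LLM call
end_ms = int(time.time() * 1000)

# 2. log traces; 3. attach a nested child span for the LLM call
root_span = Trace(name="agent", kind="agent", status_code="success",
                  start_time_ms=start_ms, end_time_ms=end_ms,
                  inputs={"query": query}, outputs={"response": answer})
llm_span = Trace(name="llm_call", kind="llm", status_code="success",
                 start_time_ms=start_ms, end_time_ms=end_ms,
                 inputs={"prompt": query}, outputs={"response": answer})
root_span.add_child(llm_span)
root_span.log(name="demo_trace")               # 4. view it in the W&B dashboard
wandb.finish()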
In summary, tracing in W&B provides a structured way to capture and visualize the execution flow of LLMs and other machine learning pipelines, enabling observability, debugging, and performance optimization.
In the next section, we will explore how tracing can be integrated with LLMs to enhance their observability.
Sample Application: GPT-3.5 Book Recommender Assistant
To demonstrate the integration of W&B with GPT-3.5, we'll create a practical application that uses the OpenAI API to build an assistant capable of recommending books to a user based on natural language queries. We'll then leverage W&B to enhance the observability of the assistant, tracking its performance and interactions with users.
Here's a breakdown of how the application works:
Initialization
The script starts by importing necessary libraries and initializing the W&B project. This sets the stage for logging data and tracking experiments. Using a dataset of top books from Goodreads, it generates embeddings that it can use to look up book recommendations.
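The embedding generation might look roughly like the sketch below. The column names, the choice of OpenAI's embedding model, and the output path are assumptions for illustration; the actual logic lives in the project's generate_assets.py script, covered later.
# Rough sketch of building description embeddings from the Goodreads CSV.
# Column names ("title", "description") and the embedding model are assumptions.
import json
import pandas as pd
from openai import OpenAI

client = OpenAI()
books = pd.read_csv("goodreads.csv")

embeddings = {}
for title, description in zip(books["title"], books["description"]):
    result = client.embeddings.create(model="text-embedding-ada-002",
                                      input=description)
    embeddings[title] = result.data[0].embedding

with open("assets/embeddings.json", "w") as f:
    json.dump(embeddings, f)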
Creating the agent
A GPT-3.5 assistant is created using OpenAI's Chat Completions API. The assistant is prompted with the task of generating book recommendations based on the user's query.
Generating recommendations
The book_recommender_agent function is where the assistant handles the user's query. It processes the query, interacts with the assistant, and logs the trace to W&B. More specifically, it detects whether the user mentioned a specific author and narrows the query accordingly. It then builds a list of relevant books from the embeddings and prompts the assistant to generate recommendations, with book summaries, based on that list.
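A simplified sketch of this flow is shown below. The helpers detect_author() and top_k_similar() are hypothetical stand-ins for the app's actual lookup logic, and the prompt wording is illustrative.
# Simplified sketch of the recommendation flow; helper functions are stubs.
from openai import OpenAI

def detect_author(query: str):
    # Stub: the real app checks the query against its author lookup data.
    return None

def top_k_similar(query: str, embeddings: dict, k: int = 5):
    # Stub: the real app ranks titles by embedding similarity to the query.
    return list(embeddings)[:k]

def book_recommender_agent(query: str, client: OpenAI, embeddings: dict) -> str:
    author = detect_author(query)
    if author is not None:
        query = f"{query} (books by {author})"         # tune the query to the author

    candidates = top_k_similar(query, embeddings, k=5)  # embedding-based lookup
    prompt = (
        "Recommend books to the user and include a short summary of each. "
        f"User query: {query}\nRelevant books: {', '.join(candidates)}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content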
Tracing and logging
The core of the observability enhancement lies in the _wandb.py file and specifically the trace_log function. This function creates a Trace object, which is a structured representation of a computational step. In this case, it captures details about the assistant's run, including the status, token usage, start and end times, input query, and response. This trace is then logged to W&B using root_span.log(name="openai_trace").
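A hedged sketch of what such a trace_log function could look like is below; the argument names and metadata keys are illustrative, and the real code lives in _wandb.py.
# Sketch of a trace_log helper; assumes wandb.init() has already been called.
from wandb.sdk.data_types.trace_tree import Trace

def trace_log(query, response, status, token_usage, start_time_ms, end_time_ms):
    root_span = Trace(
        name="BookRecommenderAgent",
        kind="agent",
        status_code=status,                      # e.g. "success" or "error"
        metadata={"token_usage": token_usage},   # tokens consumed by the request
        start_time_ms=start_time_ms,
        end_time_ms=end_time_ms,
        inputs={"query": query},
        outputs={"response": response},
    )
    root_span.log(name="openai_trace")           # shows up in the W&B trace table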
Visualization and analysis
Once the data is logged to W&B, users can leverage its powerful dashboard to visualize the performance metrics, compare different runs, and gain insights into the model's behavior.
Packaging and sharing
The final step involves packaging the application and sharing it with the community. By sharing the application and its W&B dashboard, developers can foster collaboration and enable others to benefit from the observability enhancements.
What Does Tracing Look Like for This Example?
Below is a snapshot of the W&B dashboard showing the trace table and trace timeline for the Book Recommender application. The trace table provides a structured view of the execution flow, including details such as the system prompt, the user prompt, the response, the start and end times, status, and metadata.

The trace records whether the query was successful, how long it took to process, and the token usage. This information can be used to identify patterns, understand the assistant's behavior, and optimize its performance. We can also compare different runs to see how the assistant's behavior changes with different queries or configurations. The trace timeline provides a visual representation of the execution flow, making it easier to understand the sequence of steps and their dependencies.
Additionally, we can observe network traffic and computational usage:

Try it with Ploomber Cloud
Ploomber Cloud makes it easy to download and deploy applications like the Book Recommender. It takes minimal steps to get started and handles all of the nitty-gritty so that you can build and deploy your own apps without headaches. To get started, make sure you've signed up for an account and obtained an API key.
Download the application
Install Ploomber Cloud and set your API key:
pip install ploomber-cloud
ploomber-cloud key YOUR-KEY
Now we can download the book recommender app by running:
ploomber-cloud examples panel/book-recommender
You'll be prompted for a location to download the application files. To download in your current directory, just hit "Enter." Navigate to the book-recommender directory.
API keys
Before we can get the application to work, we need a few things. First, you'll need an OpenAI API Key. Sign up for an account with OpenAI, then get your key here. Create a .env file and enter your key. It should look like this:
OPENAI_API_KEY=your-openai-api-key
You will also need to create an account in Weights & Biases. You can do this by visiting the Weights & Biases sign up page (it's free!). Note your username. You will need to generate an API key, which you can obtain after you log in by visiting https://wandb.ai/authorize.
Once you log in, you can create a new project by visiting your user profile, selecting the project tab, and clicking on the "New Project" button. Note the project name.
The app assumes the following environment variable names. Add them to your .env file. After all this, your .env file should look like this:
OPENAI_API_KEY=your-openai-api-key
WANDB_API_KEY=your-wandb-api-key
WANDB_PROJECT=your-project-name
WANDB_ENTITY=your-wandb-username
One more note: Weights & Biases tracking is disabled by default. To enable it, open app.py and set WEIGHTS_AND_BIASES_TRACKING to True.
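In app.py, this is a single flag, along the lines of:
# In app.py: tracing is off by default; flip the flag to enable it.
WEIGHTS_AND_BIASES_TRACKING = True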
Dataset
Finally, we need the dataset. Download it to the book-recommender/ folder and rename it goodreads.csv.
Generate lookup files by running the following script:
python generate_assets.py --embeddings --verbose
Running this command should generate author_to_title.json, title_to_description.json, and embeddings.json in the assets/ folder.
If you want to generate embeddings for only the first N rows, run the command below (here, N=100):
python generate_assets.py -n 100 --embeddings --verbose
Deploy
Now we're ready to deploy the application. Initialize the project and follow the prompts. This is a Panel application:
ploomber-cloud init
This should have created a ploomber-cloud.json file. Now, deploy the app:
ploomber-cloud deploy --watch
The deploy process should take a few minutes. Once it's finished, you can navigate to the app and interact with it! You can navigate to your project page on Weights & Biases, where you should see the traces based on your input. Overall, your application should look like this:

Conclusion
That's it! In this blog, we learned about the advantages of tracking and observability with Weights & Biases. We demonstrated tracing by building a book recommender agent using OpenAI's GPT-3.5, and finally, using Ploomber Cloud, we downloaded, deployed, and explored the demo application.