How to use Azure OpenAI and Azure AI Studio with Weights & Biases Weave
In this step-by-step tutorial, we'll look at how to use W&B Weave alongside Microsoft's suite of Azure AI offerings.
Introduction
As the complexity and scale of GenAI models grow, the need for robust experimentation frameworks and scalable infrastructure becomes paramount. This is where the integration of Weights & Biases (W&B) with the Microsoft Azure AI suite provides a powerful solution.
Weights & Biases offers tools to fine-tune these models (W&B Models) and also to build GenAI applications on top of them (W&B Weave). Weave is designed specifically for the complex tasks associated with large language models (LLMs), simplifying the process of building, debugging, and evaluating LLM chains and prompts, making it an essential tool for data scientists and machine learning engineers.
In tandem, Microsoft Azure provides the necessary infrastructure and services to support large-scale AI/ML workloads. Azure's comprehensive suite, including Azure Machine Learning and Azure AI Services, ensures that GenAI models can be developed, trained, and deployed efficiently and effectively.
This tutorial will cover the technical aspects of using Azure OpenAI Service and Azure AI Studio in conjunction with W&B Weave. By exploring the integration of these powerful tools, the tutorial will illustrate how to streamline the GenAI experimentation process, improve model accuracy, and ensure reliable deployment of sophisticated AI applications.
Table of contents
- Introduction
- Using W&B Weave
- Understanding Azure for GenAI
- Azure AI Services
- Azure OpenAI Service
- Azure Machine Learning
- Azure AI Studio
- Tracking, tracing, and evaluating LLM outputs from Azure to Weave
- Prerequisite setup: Weave
- Prerequisite setup: Azure
- Azure OpenAI Service configuration
- Azure AI Studio configuration
- Step 1: Auto-logging LLM calls
- Step 2: Building functional LLM apps
- Step 3: Experimenting on LLM app attributes via model classes
- Step 4: Use evaluation data to determine the best LLM app configuration
- Conclusion
Using W&B Weave

Weave by Weights & Biases is a robust toolkit designed to help developers manage and evaluate their large language models (LLMs) efficiently. Specifically tailored for the complexities of LLMs, Weave integrates seamlessly with Microsoft Azure to leverage its powerful AI/ML infrastructure. This combination offers a comprehensive solution for tracking, experimenting, and optimizing GenAI workflows.
You can use Weave to:
- Log and version LLM interactions and surrounding data, from development to production
- Experiment with prompting techniques, model changes, and parameters
- Evaluate your models and measure your progress
At a high level, Weave tracks:
- Code: ensure all code surrounding generative AI API calls is versioned and stored
- Data: where possible, version and store any datasets, knowledge stores, etc
- Traces: permanently capture traces of functions surrounding generative AI calls
Wrap any Python function with @weave.op() and Weave will capture and version the function’s code and log a trace of every call, including its inputs and outputs.
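For example, here is a minimal sketch of wrapping an ordinary function; the project name and function logic are illustrative, not taken from the tutorial's notebook:
import weave

weave.init('azure-weave-cookbook')  # illustrative project name; initialize once per session

@weave.op()
def extract_keywords(text: str) -> list:
    # Any Python function can be wrapped; this toy logic just picks out long words
    return [word for word in text.split() if len(word) > 6]

# Each call is now traced in Weave, with inputs, outputs, and code version logged
extract_keywords("Weave captures and versions this function and records its calls")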
Understanding Azure for GenAI
Microsoft Azure offers a robust suite of services designed for developing and deploying generative AI applications. Key services include Azure OpenAI Service, Azure AI Studio, Azure Machine Learning, and Azure AI Services.

Azure AI Services
Azure AI Services offer a wide range of APIs for natural language processing, computer vision, speech recognition, and more. These pre-built services allow developers to quickly integrate sophisticated AI capabilities into their applications, enhancing functionalities and automating complex tasks.
Azure OpenAI Service

Azure OpenAI Service, part of Azure AI Services, provides access to advanced OpenAI models like GPT-4o, capable of handling multimodal tasks across text, images, audio, and video. This service is ideal for enhancing applications with advanced language understanding and generation capabilities, backed by Azure's enterprise-grade security and infrastructure.
Azure Machine Learning

Azure Machine Learning is a powerful, flexible end-to-end platform for accelerating data science and machine learning innovation while providing the enterprise governance that every organization needs in the era of AI. ML professionals, data scientists, and engineers can use this platform in their day-to-day workflows to train and deploy models and manage machine learning operations (MLOps).
Azure AI Studio

Azure AI Studio is a hub for developing generative AI solutions. It features a catalog of models from partners like OpenAI and Hugging Face, enabling developers to experiment with, compare, and fine-tune models using their own data. This platform simplifies the creation of custom AI solutions by providing a unified interface for generative AI app development and deployment.
By integrating these Azure services with W&B Weave, developers can streamline their GenAI workflows, improve model accuracy, and ensure reliable deployment. This combination provides a comprehensive solution for tracking, debugging, and evaluating large language models, making AI development more efficient and effective.
Tracking, tracing, and evaluating LLM outputs from Azure to Weave
Prerequisite setup: Weave
To integrate Weave with our LLM vendor SDK, start by installing the necessary packages. These include Weave and the OpenAI SDK:
!pip install weave openai
Next, initialize your Weave project to house all code, data, and traces:
import weave

weave.init('azure-weave-cookbook')
Prerequisite setup: Azure
You can easily access Azure services via the OpenAI Python SDK.
Azure OpenAI Service configuration
Configure the client to use Azure OpenAI services:
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version="2024-02-01",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT")
)
Azure AI Studio configuration
Alternatively, if you are using Azure AI Studio, configure the client as follows:
import os
from openai import OpenAI

client = OpenAI(
    base_url=f"{os.getenv('AZURE_AI_STUDIO_API_ENDPOINT')}/v1",
    api_key=os.getenv('AZURE_AI_STUDIO_API_KEY')
)
Step 1: Auto-logging LLM calls

The first step is to establish a robust logging mechanism for all interactions with the LLM. This is crucial for monitoring, debugging, and improving the model's performance over time. By auto-logging LLM calls, you can capture detailed information about each request and response, which simplifies troubleshooting and performance analysis.
@weave.op()
def call_azure_chat(model_id: str, messages: list, max_tokens: int = 1000, temperature: float = 0.5):
    response = client.chat.completions.create(
        model=model_id,
        messages=messages,
        max_tokens=max_tokens,
        temperature=temperature
    )
    return {"status": "success", "response": response.choices[0].message.content}
In this first experiment, we ask the LLM to generate a proper recipe for a made-up dish. This involves sending a structured prompt to the model and receiving a detailed response, which is then logged for analysis.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Create a snack recipe for a dish called the Azure Weav-e-ohs"}
]

if "mistral" in model_id.lower():
    messages = format_messages_for_mistral(messages)

result = call_azure_chat(model_id, messages)
print(result)
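The snippet above (and several later ones) calls a format_messages_for_mistral helper whose implementation isn't shown in this tutorial. Assuming it exists because some Mistral deployments only accept alternating user/assistant roles, a hypothetical sketch might fold the system prompt into the first user message:
@weave.op()
def format_messages_for_mistral(messages: list) -> list:
    # Hypothetical helper: merge any system prompts into the first user message
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    chat = [m for m in messages if m["role"] != "system"]
    if system_parts and chat and chat[0]["role"] == "user":
        chat[0] = {"role": "user", "content": "\n\n".join(system_parts + [chat[0]["content"]])}
    return chat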
Auto-logging provides a clear audit trail for each interaction with the model, making it easier to identify issues, understand model behavior, and improve future iterations.
Step 2: Building functional LLM apps

Next, we focus on creating utility functions that standardize the formatting of prompts. Consistent prompt formatting is essential for reliable model performance and ensures that each input is structured in a way that the LLM can effectively process.
@weave.op()
def format_prompt(prompt: str):
    "A formatting function for OpenAI models"
    system_prompt_formatted = "You are a helpful assistant."
    human_prompt = "{prompt}"
    human_prompt_formatted = human_prompt.format(prompt=prompt)
    messages = [
        {"role": "system", "content": system_prompt_formatted},
        {"role": "user", "content": human_prompt_formatted}
    ]
    return messages
We then use this function to run the chat model with formatted prompts:
@weave.op()
def run_chat(model_id: str, prompt: str):
    formatted_messages = format_prompt(prompt=prompt)
    if "mistral" in model_id.lower():
        formatted_messages = format_messages_for_mistral(formatted_messages)
    result = call_azure_chat(model_id, formatted_messages, max_tokens=1000)
    return result
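With these ops in place, generating a traced response is a single call; the prompt below is just an example:
result = run_chat(model_id, "Write a two-sentence product description for a cloud-based note-taking app.")
print(result["response"])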
This pattern applies to a variety of LLM applications where prompts need consistent formatting, such as generating customer service responses or crafting content that follows specific guidelines.
Standardized formatting functions ensure that all prompts are consistent, which enhances the reliability and comparability of model outputs. This consistency is key to maintaining high-quality interactions with the LLM.
Step 3: Experimenting on LLM app attributes via model classes

To further streamline the development process, we encapsulate the logic and parameters associated with our LLMs into reusable classes. This modular approach allows for easier experimentation with different configurations and simplifies the management of model attributes.
from dataclasses import dataclass

@dataclass
class PromptTemplate:
    system_prompt: str
    human_prompt: str

    @weave.op()
    def format_prompt(self, email_content: str):
        system_prompt_formatted = self.system_prompt.format()
        human_prompt_formatted = self.human_prompt.format(email_content=email_content)
        messages = [
            {"role": "system", "content": system_prompt_formatted},
            {"role": "user", "content": human_prompt_formatted}
        ]
        return messages
The AzureEmailAssistant class uses the prompt template to respond to emails, allowing us to encapsulate the model's behavior and parameters in a single, reusable class:
from weave import Model

class AzureEmailAssistant(Model):
    model_id: str = model_id
    prompt_template: PromptTemplate
    max_tokens: int = 2048
    temperature: float = 0.0

    @weave.op()
    def format_doc(self, doc: str) -> list:
        messages = self.prompt_template.format_prompt(doc)
        return messages

    @weave.op()
    def respond(self, doc: str) -> dict:
        messages = self.format_doc(doc)
        if "mistral" in self.model_id.lower():
            messages = format_messages_for_mistral(messages)
        output = call_azure_chat(
            self.model_id,
            messages=messages,
            max_tokens=self.max_tokens,
            temperature=self.temperature
        )
        return output

    @weave.op()
    async def predict(self, email_content: str) -> str:
        return self.respond(email_content)["response"]
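Putting the pieces together might look like the following; the prompt wording and sample email are illustrative rather than taken from the original notebook:
email_template = PromptTemplate(
    system_prompt="You are a customer support assistant. Reply politely and concisely.",
    human_prompt="Draft a reply to the following email:\n\n{email_content}"
)

model = AzureEmailAssistant(prompt_template=email_template)

# predict is async, so await it directly in a notebook cell
# (or wrap it with asyncio.run in a script)
reply = await model.predict("Subject: Order Delay\n\nHello, where is my order #12345?\n\nJane Doe")
print(reply)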
This approach is particularly useful for applications like customer service automation, where responses need to follow a specific format and style. By encapsulating the prompt logic and model parameters, we can easily adapt the system to different scenarios and requirements.
Encapsulation in model classes allows for easy modification and testing of different model parameters and configurations. This modular approach enhances code maintainability and flexibility, making it easier to adapt and scale your LLM applications.
Step 4: Use evaluation data to determine the best LLM app configuration

Finally, we leverage Weave to systematically evaluate the performance of different model configurations using structured datasets. This step is crucial for identifying the best setup for your specific use case.
from weave import Dataset

dataset = Dataset(name=eval_dataset_name, rows=[
    {'id': '1', 'email_content': 'Subject: Inquiry about Order Delay\n\nHello,\n\nI placed an order last week for the new UltraGlow Skin Serum, but I have not received a shipping update yet. My order number is 12345. Could you please update me on the status of my shipment?\n\nThank you,\nJane Doe'},
    {'id': '2', 'email_content': 'Subject: Damaged Item Received\n\nHello,\n\nI received my order yesterday, but one of the items, a glass vase, was broken. My order number is 67890. How can I get a replacement or a refund?\n\nBest regards,\nJohn Smith'},
    # Additional rows
])
weave.publish(dataset)

dataset_uri = f"weave:///{wandb_entity}/{weave_project}/object/{eval_dataset_name}:latest"
dataset = weave.ref(dataset_uri).get()
To evaluate specific attributes, such as the conciseness of the model's output, we define scoring functions:
@weave.op()
def check_conciseness(model_output: str) -> dict:
    result = len(model_output.split()) < 300
    return {'conciseness': result}
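Additional scorers follow the same pattern and can be added to the evaluation's scorer list. For instance, a rough politeness check; the keyword list here is purely illustrative:
@weave.op()
def check_politeness(model_output: str) -> dict:
    # Crude heuristic: look for a courteous closing phrase in the reply
    polite_markers = ["thank you", "best regards", "sincerely", "happy to help"]
    return {'politeness': any(marker in model_output.lower() for marker in polite_markers)}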
We then create an evaluation object and run it on our dataset:

evaluation = weave.Evaluation(
    dataset=dataset,
    scorers=[check_conciseness],
)
await evaluation.evaluate(model)
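Note that evaluation.evaluate is a coroutine, so the await above works as-is in a notebook; in a plain Python script you would drive it with asyncio instead:
import asyncio

asyncio.run(evaluation.evaluate(model))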
This evaluation framework can be applied to various metrics such as response accuracy, relevance, and user satisfaction. Assessing customer service email responses for conciseness and clarity, as we do here, is one example.
Systematic evaluation with structured datasets allows for data-driven decision-making. By assessing model performance across various metrics, you can identify the most effective configurations and optimize your LLM applications accordingly.
Conclusion
The integration of Weave with Azure AI offers a powerful combination of tools for building, debugging, and evaluating large language models. This step-by-step approach not only ensures robust logging and standardized prompt formatting but also facilitates modular development and systematic evaluation. By leveraging these best practices, prompt engineers and machine learning engineers can streamline their workflows, improve model performance, and ensure reliable deployment of sophisticated AI applications.