How Azure, OpenAI, and Weights & Biases work together
Learn why Azure, OpenAI, and W&B work so well together in this introductory report

A quick overview of Azure, OpenAI, and Weights & Biases
Today, we'll be working with Azure, OpenAI's GPT, and Weights & Biases. In case you need an explainer on any, we've got you covered:
Microsoft Azure, a leading cloud computing platform, delivers an extensive array of services including computing, storage, databases, and AI, supported by a vast network of global data centers and robust infrastructure, ensuring scalability, reliability, and security vital for AI and ML deployments.
OpenAI, at the forefront of AI research, is celebrated for its pioneering work in natural language processing (NLP), reinforcement learning, and the development of advanced models like the Generative Pre-trained Transformer (GPT), which are known for their ability to produce human-like text and execute complex language tasks.
Weights & Biases specializes in tools that enhance the machine learning workflow, offering functionalities for tracking, visualizing, and optimizing experiments, thus fostering collaboration among data scientists and streamlining model development with capabilities in experiment logging, hyperparameter tuning, and results analysis.
By leveraging Azure's cloud capabilities, developers and data scientists gain access to a scalable infrastructure for deploying and scaling OpenAI's advanced models. Azure's flexibility and security measures ensure the seamless operation of AI applications, regardless of scale or complexity.
The integration enables users to leverage Azure OpenAI's flexible models for fine-tuning and prompt-based interactions with the added benefits of W&B's experiment tracking tools. It ensures that developers can effectively monitor their fine-tuning processes, adjust hyperparameters as needed, and make informed decisions on training.
Leveraging Azure’s cloud capabilities with OpenAI’s advanced models
Azure's cloud infrastructure enhances AI model development and deployment through its high-performance computing resources, including GPUs and other specialized accelerators, which are crucial for processing large datasets and training complex language models like OpenAI's GPT series.
This computational power accelerates AI application development, reducing time-to-market and boosting efficiency. Azure also enables seamless scaling of AI workloads to meet growing computational demands, offering scalable storage and flexible compute options for dynamic resource allocation. This ensures organizations can manage demand fluctuations efficiently, whether during peak periods or in diverse deployment environments.
Streamlining AI development workflows
The integration simplifies and accelerates the AI development lifecycle, from model training to deployment:
- Unified platform for model training: Integrating OpenAI's models with Azure's scalable compute resources allows for efficient model training and fine-tuning on large datasets. Weights & Biases can be used to track these training sessions, log metrics, and save model versions, creating a seamless workflow from experimentation to final model selection.
- Collaboration and version control: Weights & Biases's platform acts as a centralized hub for teams to monitor experiments, share results, and collaborate on AI projects. This facilitates knowledge sharing and accelerates the iterative process of model improvement. Version control ensures that changes to models and datasets are tracked, allowing teams to manage the evolution of their AI projects systematically.
- Simplified deployment: Azure ML provides tools for deploying models into production environments, whether as web services or part of larger applications. The integration with OpenAI and Weights & Biases means that the transition from model development to deployment is streamlined, with tools for monitoring model performance and managing updates or rollbacks as necessary.
Advantages of using a unified platform
- Efficiency: A unified platform reduces the complexity of managing different aspects of AI development, from data preprocessing and model training to deployment and monitoring, saving time and resources.
- Scalability: Azure's cloud infrastructure provides the flexibility to scale resources up or down based on the project's needs, accommodating everything from small experiments to large-scale production deployments.
- Reproducibility: W&B's tracking and versioning capabilities ensure that experiments are reproducible, which is critical for debugging, regulatory compliance, and academic verification.
- Collaboration: A shared workspace in Weights & Biases, combined with Azure's collaboration features, enhances teamwork by allowing seamless sharing of datasets, models, and experiments among team members, regardless of their location.
- Innovation: Access to OpenAI's advanced models and APIs encourages innovation, allowing teams to experiment with new approaches and applications of AI, leveraging the latest advancements in the field.
Fine-tuning AI models with Weights & Biases on Azure
Let's run through a step-by-step tutorial. Before we start, make sure you have an Azure account with an Azure ML workspace, as well as a Weights & Biases account.
In Part I, we'll fine-tune our models using our OpenAI API key and log our results to Weights & Biases. In Part II, we'll deploy our fine-tuned model to Azure.
Let's get going.
Part I: Fine-tuning our models and logging our results
Before diving in, let's install the necessary libraries for our fine-tuning jobs. Run the following code in your notebook environment:
!pip install wandb
!pip install openai azureml-core
Configure and initialize our OpenAI and Azure workspace
Now, let's import the modules we'll be using and configure some basic settings. From your OpenAI account, find your API key and either set it as an environment variable or assign it directly in the code below.
Also configure Weights & Biases and Azure as shown below. You can find the required details in your respective accounts.
import os

import openai
import wandb
from azureml.core import Workspace

# Set your OpenAI API key (ideally read from an environment variable rather than hard-coded)
openai.api_key = os.environ.get("OPENAI_API_KEY", "your-api-key")

# The rest of this tutorial uses the module-level openai API, so no separate client object is needed
Log into Weights & Biases:
To log into Weights & Biases, you will be prompted for your W&B API key, which you can find at https://wandb.ai/authorize.
wandb.login()
wandb.init(project='project_name', entity='entity_name')
Convert our data to JSONL:
Our question-answering dataset (here, the SQuAD 2.0 files train-v2.0.json and dev-v2.0.json) is in JSON format, but for fine-tuning we need JSONL. JSONL is favored for fine-tuning large language models because each data point lives on its own line in a structured yet flexible format, which makes large datasets efficient to process, and it is compatible with a wide range of machine learning tools, simplifying data processing and model training workflows.
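For reference, each line of the converted file is a self-contained JSON object with a prompt and a completion. An illustrative (made-up) record looks like this:

{"prompt": "Context: Normandy is a region in France.\nQuestion: In what country is Normandy located?\nAnswer:", "completion": " France"}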
Let's convert it:
import json

def convert_dataset_to_jsonl(input_json_path, output_jsonl_path):
    with open(input_json_path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    with open(output_jsonl_path, 'w', encoding='utf-8') as outfile:
        for article in data['data']:
            for paragraph in article['paragraphs']:
                context = paragraph['context']
                for qa in paragraph['qas']:
                    question = qa['question']
                    is_impossible = qa.get('is_impossible', False)
                    prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"
                    # Handle both possible and impossible questions
                    if is_impossible:
                        completion = " Impossible"
                    else:
                        # Using the first answer for simplicity
                        answer = qa['answers'][0]['text'] if qa['answers'] else "Unknown"
                        completion = f" {answer}"
                    # Write the JSONL entry
                    jsonl_entry = json.dumps({"prompt": prompt, "completion": completion})
                    outfile.write(jsonl_entry + '\n')

# Convert your dataset
convert_dataset_to_jsonl('/train-v2.0.json', '/train.jsonl')
convert_dataset_to_jsonl('/dev-v2.0.json', '/dev.jsonl')
Upload training and validation datasets to OpenAI for model fine-tuning:
Now that we have our dataset ready in JSONL format, it's time to upload it to OpenAI. Upload the data once and save the train_file_id and dev_file_id so you can reuse the same files across multiple runs, rather than uploading them again and again until you hit your file storage limit.
def upload_file_to_openai(file_path, purpose='fine-tune'):
    response = openai.File.create(file=open(file_path, "rb"), purpose=purpose)
    return response.id

train_file_id = upload_file_to_openai("/train.jsonl")
dev_file_id = upload_file_to_openai("/dev.jsonl")
print(train_file_id)
Define hyperparameters and logging to Weights & Biases:
# Define hyperparameters
hyperparameters = {
    "n_epochs": 2,  # Number of training epochs
    "batch_size": 4,  # Batch size for training
    "learning_rate_multiplier": 0.1,  # Learning rate adjustment factor
}

# Log hyperparameters to wandb
wandb.config.update(hyperparameters)
Initiate a fine-tuning job on OpenAI and log the Job ID with Weights & Biases:
Now it's time to initiate our fine-tuning job! For that, we'll use the openai.FineTuningJob.create() method.
fine_tune_response = openai.FineTuningJob.create(
    training_file=train_file_id,
    validation_file=dev_file_id,
    model="babbage-002",
    hyperparameters=hyperparameters
)
print(f"Fine-tuning started with ID: {fine_tune_response['id']}")
wandb.log({"fine_tune_id": fine_tune_response["id"]})

fine_tune_id = fine_tune_response['id']
fine_tune_status = openai.FineTuningJob.retrieve(fine_tune_id)
print(f"Fine-tuning job status: {fine_tune_status['status']}")
Monitoring and fetching results of fine-tuning job
Now that our fine-tuning job has started, we want to know its status, whether it has completed, and, once it has, the details of its events. The script below captures the fine-tuning job ID, then registers a signal handler to catch interrupt signals (SIGINT).
Upon receiving an interrupt, it retrieves and reports the current status of the fine-tuning job. It then requests the events associated with the fine-tuning job, formatting and printing each event's timestamp and message. If the streaming process is interrupted or an error occurs, it reports the disruption.
import signal
import datetime

fine_tune_id = fine_tune_response['id']

def signal_handler(sig, frame):
    status = openai.FineTuningJob.retrieve(fine_tune_id)['status']  # Access status correctly
    print(f"Stream interrupted. Job is still {status}.")
    return

print(f"Streaming events for the fine-tuning job: {fine_tune_id}")
signal.signal(signal.SIGINT, signal_handler)

try:
    events_response = openai.FineTuningJob.list_events(id=fine_tune_id)
    events = events_response['data']  # Access the list of events
    for event in events:
        event_time = datetime.datetime.fromtimestamp(event['created_at']).strftime('%Y-%m-%d %H:%M:%S')
        print(f"{event_time} {event['message']}")
except Exception as e:
    print(f"Stream interrupted (client disconnected). Error: {str(e)}")
Here is the output. Next, we'll log these metrics to Weights & Biases.

Source: Author
Log the final metrics on Weights & Biases:
Now that we have the final results of the fine-tuning job, it's time to log them to Weights & Biases and look at the visualizations it creates.
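The logging script below expects a list of message strings named messages. Assuming the events list fetched in the monitoring step is still in scope, you can build it like this:

# Pull out just the message text from the events fetched earlier
messages = [event['message'] for event in events]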
import re
import wandb

# Assuming the messages are stored in a list variable named `messages`
# messages = [
#     "2024-02-21 17:33:45 Step 5901/5937: training loss=1.78, validation loss=0.69",
#     "2024-02-21 17:33:26 Step 5801/5937: training loss=1.29, validation loss=0.62",
# ]

for message in messages:
    # Extract step, training loss, and validation loss using regex
    match = re.search(r"Step (\d+/\d+): training loss=([\d.]+), validation loss=([\d.]+)", message)
    if match:
        step, training_loss, validation_loss = match.groups()
        step = int(step.split('/')[0])  # Extract the current step number
        training_loss = float(training_loss)
        validation_loss = float(validation_loss)
        # Log the metrics to wandb
        wandb.log({"Step": step, "Training Loss": training_loss, "Validation Loss": validation_loss})

# Finish the wandb run
wandb.finish()

Source: Author
Viewing the graphs on Weights & Biases:
Now that fine-tuning is complete and all the metrics have been logged to Weights & Biases, we can open our project dashboard and see our runs:

Source: Author
In the charts panel, you can compare your different runs:

Source: Author
Here we can see two graphs, from two different runs with different hyperparameters. You can also try fine-tuning different models and comparing the runs with each other, or create a sweep: a Weights & Biases feature that lets you define a range of hyperparameters, searches for the best combination, and compares the results. A minimal sweep sketch follows.
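Here's a minimal sketch of what such a sweep could look like for this tutorial. The sweep method, metric name, and value ranges are assumptions, and it reuses the train_file_id and dev_file_id uploaded earlier; each sweep trial simply launches one fine-tuning job with the sampled hyperparameters.

import openai
import wandb

# Hypothetical sweep configuration: the method, metric name, and value ranges are placeholders
sweep_config = {
    "method": "grid",
    "metric": {"name": "Validation Loss", "goal": "minimize"},
    "parameters": {
        "n_epochs": {"values": [1, 2, 3]},
        "learning_rate_multiplier": {"values": [0.05, 0.1, 0.2]},
    },
}

def launch_fine_tune():
    # Each sweep trial starts one OpenAI fine-tuning job with the sampled hyperparameters
    run = wandb.init()
    hyperparameters = {
        "n_epochs": run.config.n_epochs,
        "learning_rate_multiplier": run.config.learning_rate_multiplier,
    }
    job = openai.FineTuningJob.create(
        training_file=train_file_id,  # reuse the file IDs uploaded earlier
        validation_file=dev_file_id,
        model="babbage-002",
        hyperparameters=hyperparameters,
    )
    run.log({"fine_tune_id": job["id"]})

sweep_id = wandb.sweep(sweep_config, project="project_name")
wandb.agent(sweep_id, function=launch_fine_tune, count=3)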
Evaluating the fine-tuned model:
Now that we've fine-tuned our model, let's evaluate it. For the evaluation, we've prepared a small dataset of questions and answers. The script queries the model with each question and its context, collects the model's answers, and compiles them into a pandas DataFrame.
Finally, the results are logged as a table in a Weights & Biases project and optionally saved to a CSV file for local use.
import re
from collections import Counter

import openai
import pandas as pd
import wandb

# Function to normalize answers (removing punctuation, lowercase, etc.)
def normalize_answer(s):
    def remove_articles(text):
        return re.sub(r'\b(a|an|the)\b', ' ', text)
    def white_space_fix(text):
        return ' '.join(text.split())
    def remove_punct(text):
        return re.sub(r'[\W]', ' ', text)
    def lower(text):
        return text.lower()
    return white_space_fix(remove_articles(remove_punct(lower(s))))

# Calculate F1 score
def f1_score(prediction, truth):
    prediction_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(truth).split()
    common_tokens = Counter(prediction_tokens) & Counter(truth_tokens)
    num_same = sum(common_tokens.values())
    if num_same == 0:
        return 0
    precision = 1.0 * num_same / len(prediction_tokens)
    recall = 1.0 * num_same / len(truth_tokens)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1

# Calculate Exact Match score
def exact_match_score(prediction, truth):
    return int(normalize_answer(prediction) == normalize_answer(truth))

# Function to query the model and get the answer
def query_model(question, context, model):
    openai.api_key = '<key-here>'
    response = openai.Completion.create(
        model=model,
        prompt=f"Question: {question}\nContext: {context}\nAnswer:",
        temperature=0,
        max_tokens=50,
        top_p=1.0,
        frequency_penalty=0.0,
        presence_penalty=0.0,
        stop=["\n"]
    )
    return response.choices[0].text.strip()

# Test dataset
test_data = [
    {
        "context": "The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (\"Norman\" comes from \"Norseman\") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.",
        "qas": [
            {"question": "In what country is Normandy located?", "answer": "France"},
            {"question": "When were the Normans in Normandy?", "answer": "10th and 11th centuries"},
            {"question": "From which countries did the Norse originate?", "answer": "Denmark, Iceland and Norway"},
            {"question": "Who was the Norse leader?", "answer": "Rollo"},
            {"question": "What century did the Normans first gain their separate identity?", "answer": "10th century"}
        ]
    },
    {
        "context": "The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.",
        "qas": [
            {"question": "Who was the duke in the battle of Hastings?", "answer": "William the Conqueror"},
            {"question": "Who ruled the duchy of Normandy", "answer": "Richard I"}
        ]
    }
]

# Compile results into a DataFrame
detailed_results = []  # List to store detailed results
for item in test_data:
    context = item['context']
    for qa in item['qas']:
        question = qa['question']
        true_answer = qa['answer']
        model_answer = query_model(question, context, model="modelid")
        # Append detailed result for each question
        detailed_results.append({
            "question": question,
            "model_answer": model_answer,
            "true_answer": true_answer
        })

# Convert detailed results list to DataFrame
df_results = pd.DataFrame(detailed_results)

# Log the entire DataFrame as a table to W&B
wandb.log({"results_table": wandb.Table(dataframe=df_results)})

# Optional: Save the DataFrame to CSV for local use
df_results.to_csv('evaluation_results.csv', index=False)
Here is the final table from Weights & Biases, which shows both the true answer and the answer given by the model. We can use this table to evaluate our fine-tuned model and the quality of its answers.

Source: Author
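The evaluation script above defines f1_score and exact_match_score helpers but never calls them. As an optional extension (a sketch; the column and metric names are just illustrative), you could score each row and log aggregate metrics alongside the table:

# Score each answer with the helpers defined above and log the averages to W&B
df_results["f1"] = [
    f1_score(pred, truth)
    for pred, truth in zip(df_results["model_answer"], df_results["true_answer"])
]
df_results["exact_match"] = [
    exact_match_score(pred, truth)
    for pred, truth in zip(df_results["model_answer"], df_results["true_answer"])
]
wandb.log({
    "Mean F1": df_results["f1"].mean(),
    "Exact Match Rate": df_results["exact_match"].mean(),
})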
Part II: Deploying to Azure
In the Azure portal, create a new Azure Machine Learning workspace if you haven't already. This workspace will be the central place for managing your models, datasets, and deployments.
Make sure you have the Azure ML SDK installed in your development environment to interact with Azure ML services. You can install it with pip install azureml-sdk.
from azureml.core import Workspace

# Provide your Azure subscription ID, resource group name, and workspace name
subscription_id = 'your-subscription-id'
resource_group = 'your-resource-group'
workspace_name = 'your-workspace-name'

# Access the workspace
ws = Workspace(subscription_id=subscription_id, resource_group=resource_group, workspace_name=workspace_name)
print("Loaded workspace:", ws.name)
1: Prepare your model for deployment
Ensure you have the model ID of your fine-tuned OpenAI model. This ID is needed to load and use your model through the OpenAI API. You can find your model ID with the openai.FineTuningJob.list() command.
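For example, assuming the fine_tune_id from Part I is still available, one way to look up the model ID is:

# Retrieve the finished job and read its fine-tuned model name
job = openai.FineTuningJob.retrieve(fine_tune_id)
print(job['fine_tuned_model'])

# Or list recent fine-tuning jobs and inspect them
print(openai.FineTuningJob.list(limit=5))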
Write a scoring script (score.py) that loads your fine-tuned model using the OpenAI API and defines how to use the model to make predictions.
import openai
import json

def init():
    global openai_model
    openai.api_key = 'your_openai_api_key'
    openai_model = 'your_model_id'  # Replace with your fine-tuned model ID (find it with openai.FineTuningJob.list())

def run(raw_data):
    data = json.loads(raw_data)
    response = openai.Completion.create(
        model=openai_model,
        prompt=data['prompt'],
        temperature=0.7,
        max_tokens=150
    )
    return response.choices[0].text
2: Prepare our scoring script
The scoring script plays a crucial role in deploying machine learning models for inference, especially in cloud environments like Azure Machine Learning (Azure ML). It serves as the entry point for processing incoming data, running predictions, and returning the results.
The following script includes functions for model initialization and running predictions.
%%writefile score.py
import openai
import json

def init():
    global openai_model
    openai.api_key = 'your_openai_api_key'  # Replace with your OpenAI API key; avoid committing real keys
    openai_model = 'ft:babbage-002:personal::8ug5NMO9'  # Replace with your fine-tuned model ID

def run(raw_data):
    data = json.loads(raw_data)
    response = openai.Completion.create(
        model=openai_model,
        prompt=data['prompt'],
        temperature=0.7,
        max_tokens=150
    )
    return response.choices[0].text
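Hard-coding the key in score.py is convenient for a demo but risky. One option, assuming your azureml-core version exposes the Environment.environment_variables attribute, is to inject the key into the container as an environment variable and read it in init(); a rough sketch:

# Sketch (assumed attribute): set the key on the Environment object created in the deployment step below
env.environment_variables = {"OPENAI_API_KEY": "your_openai_api_key"}

# And in score.py's init(), read it from the environment instead of hard-coding it:
# openai.api_key = os.environ["OPENAI_API_KEY"]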
3: Deploy the model as a web service:
Now it's time to deploy our model on Azure. The following code snippet demonstrates how to deploy a machine learning model as a web service on Azure Machine Learning, utilizing Azure's scalable infrastructure.
We first create an inference environment from a Conda specification, set up an inference configuration with our custom scoring script, and deploy the model using Azure Container Instances (ACI) for real-time predictions.
This encapsulates the model deployment lifecycle on Azure ML, making the model's capabilities accessible to applications via a REST API endpoint. The code expects a Conda specification file named myenv.yml; a minimal sketch of that file is shown below.
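Here's a minimal sketch of myenv.yml, written from the notebook in the same %%writefile style used for score.py. The package list and versions are assumptions, not requirements from the original tutorial:

%%writefile myenv.yml
name: myenv
channels:
  - conda-forge
dependencies:
  - python=3.9
  - pip
  - pip:
      - azureml-defaults  # inference server dependencies for Azure ML
      - openai==0.28.1    # legacy OpenAI SDK used by score.py (assumed version)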
from azureml.core import Workspace, Model
from azureml.core import Environment
from azureml.core.model import InferenceConfig
from azureml.core.webservice import AciWebservice

# Create an environment from the yml file
env = Environment.from_conda_specification(name="myenv", file_path="myenv.yml")

# Create an inference configuration
inference_config = InferenceConfig(entry_script="score.py", environment=env)

# Set the deployment configuration
deployment_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)

# Deploy the model as a web service
service = Model.deploy(
    workspace=ws,
    name="openai-model-service",
    models=[],  # No models are registered in Azure ML since we're using an external OpenAI model
    inference_config=inference_config,
    deployment_config=deployment_config,
    overwrite=True
)
service.wait_for_deployment(show_output=True)

print(f"Service state: {service.state}")
print(f"Scoring URI: {service.scoring_uri}")
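Once the service reports a healthy state, you can sanity-check it by posting a prompt to the scoring URI. This is a hypothetical example; match the prompt format to what your fine-tuned model was trained on:

import json
import requests

payload = {
    "prompt": "Context: Normandy is a region in France.\nQuestion: In what country is Normandy located?\nAnswer:"
}
response = requests.post(
    service.scoring_uri,
    data=json.dumps(payload),
    headers={"Content-Type": "application/json"},
)
print(response.json())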
Conclusion:
In conclusion, the synergy between Azure, OpenAI, and Weights & Biases offers a comprehensive and powerful framework for the entire machine learning lifecycle, from development to deployment and monitoring. Azure's robust cloud infrastructure provides the backbone for scalable computing and storage solutions, OpenAI brings AI models within reach, and Weights & Biases enhances the process with experiment tracking and optimization capabilities.
We successfully fine-tuned our OpenAI models, logged the metrics to Weights & Biases, examined the visualizations, and tracked our hyperparameters. We then deployed our fine-tuned model to Azure, leveraging the integrated capabilities of Azure, OpenAI, and Weights & Biases from fine-tuning through deployment. Together, these tools offer a robust framework for developers, data scientists, and organizations to push the boundaries of artificial intelligence.