
A guide to using the Azure AI model inference API

Using the Microsoft Azure AI model inference API for serverless inference!
The Azure AI model inference API provides a uniform way to interact with a diverse set of models, enhancing efficiency and flexibility. In this guide, we'll explore how to use the Azure AI model inference API, leveraging Weave to gain insights into the models we're working with.





What is the Azure AI model inference API?

The Azure AI model inference API is a cloud-based service by Microsoft that allows developers to deploy and consume AI models via API endpoints. It provides a uniform interface for interacting with a variety of AI models, making model deployment and inference straightforward and efficient.
It supports a wide range of AI models, enabling developers to leverage advanced machine learning capabilities without needing extensive infrastructure setup. API endpoints are the specific URLs provided by the API through which developers can interact with the deployed models. These endpoints allow for easy integration of AI models into applications, enabling real-time predictions and insights.
The Azure AI model inference API facilitates model deployment and inference by offering a consistent way to interact with different models. This means developers can switch between models to compare performance without changing the underlying code, enhancing flexibility and efficiency. By using this API, developers can focus on building intelligent applications without worrying about the complexities of model management and infrastructure.
These models can be deployed to serverless API endpoints or managed inference, providing scalability and flexibility for different use cases. With the Azure AI model inference API, developers can easily determine and deploy the best model for a specific task, improving performance and efficiency while maintaining a consistent interface for model interaction. This capability significantly reduces the overhead associated with managing multiple models and their respective integrations, making it a powerful tool for developers aiming to harness the full potential of AI in their applications.
Note that we will be covering the serverless version of the inference API. Microsoft also provides an API for managed compute, which is priced hourly instead of per token. Using the managed compute API is somewhat similar to the serverless API, with the main difference being the input format for sending queries to the model.

What models are available for the Azure AI model inference API?

The Azure AI model inference API offers a comprehensive selection of models that cater to a wide range of AI applications. These range from traditional AI models to advanced foundation models, including small language models (SLMs) like Phi-3 and large language models (LLMs) like GPT-4o. This diversity allows developers to choose the most suitable model for their specific use case, enhancing performance and efficiency.
The API supports models deployed to serverless API endpoints, which are highly scalable and do not require infrastructure management from the user. Serverless endpoints simplify the deployment process, enabling developers to focus on integrating AI capabilities into their applications without worrying about the underlying infrastructure. The models available for serverless deployment include:
  • Cohere Embed V3 family of models
  • Cohere Command R family of models
  • Meta Llama 2 chat family of models
  • Meta Llama 3 instruct family of models
  • Mistral-Small
  • Mistral-Large
  • Jais family of models
  • Jamba family of models
  • Phi-3 family of models
Serverless deployments allow for automatic scaling based on the workload, ensuring that resources are efficiently utilized. This is particularly beneficial for applications with variable or unpredictable traffic, as it eliminates the need for manual scaling and reduces costs associated with idle infrastructure. Managed inference, on the other hand, offers a stable and predictable environment, making it suitable for mission-critical applications where performance consistency is paramount.

What are the benefits of using the Azure AI model inference API?

The Azure AI model inference API brings several key benefits to developers and organizations looking to leverage AI in their applications. One of the primary advantages is scalability. The API allows applications to automatically scale based on demand, ensuring that resources are used efficiently and performance remains consistent even during peak usage periods. This scalability is crucial for applications with variable or unpredictable traffic patterns.
Cost-effectiveness is another significant benefit. By utilizing serverless endpoints, developers can reduce the costs associated with maintaining and managing infrastructure. Serverless deployments ensure that you only pay for what you use, eliminating expenses related to idle resources. This model is especially beneficial for startups and smaller organizations that need to manage budgets carefully while still leveraging powerful AI capabilities.
Ease of use is a standout feature of the Azure AI model inference API. The API abstracts much of the complexity involved in deploying and managing AI models. Developers can focus on building and improving their applications without needing deep expertise in AI infrastructure. The API provides a consistent interface for interacting with various models, simplifying the integration process and reducing development time.
W&B Weave is another tool we will use; it tracks the inputs and outputs of functions. By annotating functions with @weave.op, Weave logs the data passing through them, including the queries sent to the model and the final responses generated by the model. These integrations streamline the development process, helping teams iterate faster and achieve better results.
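As a quick illustration of the pattern, here is a minimal sketch (the function and prompt are placeholders; the real model calls come later in the tutorial):

import weave

weave.init("azure-api")  # logs will appear under this Weave project

@weave.op
def answer(prompt: str) -> str:
    # Inputs and outputs of this function are automatically logged to Weave
    return f"You asked: {prompt}"

answer("How many languages are in the world?")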

Tutorial: Implementing the Azure AI model inference API with W&B Weave

Setting Up Your Azure Environment

In order to use the model inference API, you will need an Azure account with billing information added. After you have created your account, search for "Azure AI Studio" in the search bar of the Azure console.

After navigating to the Azure AI Studio page, click the "New Azure AI hub" button at the top left.

Next, we will create our workspace. One important detail is the region that you choose for deployment. You can check the availability for each region here.
I will use East US 2 and select my existing Subscription and Resource group. If you don't have a resource group, you can click the “Create new” button to create one.

After creating your workspace, you can see more details in the main Azure AI Studio page. Here, we are presented with information on our workspace, with a button at the bottom right which will allow us to launch the Azure AI Studio.


Deploying a Model

Once inside the studio, we can click the “Model Catalog” button, and filter models that support serverless inference.

Note that not all of these models are supported by the serverless inference API. Simply click the model that you would like to deploy (and that is supported by the serverless inference API). After clicking the model, you will be presented with another page that will allow you to deploy it:

You will need a project in order to deploy a model. On the right-hand side, above the project name dropdown, there is a button that says "Create a new project" which will allow you to create a new project.

After deploying the model, you can navigate to the “Deployments” tab in the Azure AI Studio. Once you arrive at this page, you will be able to see each of the models you’ve deployed.


We will need to retrieve our endpoint URL and the primary key for our model, which will allow us to authenticate with the API when making requests. Simply click the model you would like to use, and then you will see a screen that looks like the following:

Here, we can copy our primary key along with our endpoint URL and save these for later, as we will use them to run inference on the model using Python.
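A quick aside: rather than hardcoding the key in your script, you can read it from environment variables. A minimal sketch (the variable names AZURE_INFERENCE_ENDPOINT and AZURE_INFERENCE_KEY are just examples, not required by Azure):

import os

# Read the endpoint and key from environment variables (names are examples)
ENDPOINT_URL = os.environ["AZURE_INFERENCE_ENDPOINT"]
PRIMARY_KEY = os.environ["AZURE_INFERENCE_KEY"]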

Making inference requests

We will use Python along with the azure-ai-inference library, which provides the ChatCompletionsClient used to send chat completion requests to our endpoint. Here's the script that will allow us to run inference on the model. Depending on which model you are using, you can simply change the endpoint URL and primary key while leaving all other code the same.
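If you don't already have the required packages installed, you can get them from PyPI (package names as of this writing; versions in your environment may differ):

pip install azure-ai-inference weave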

import weave
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import SystemMessage, UserMessage
from azure.core.credentials import AzureKeyCredential

# Initialize Weave for logging
weave.init('azure-api')

# Define the endpoint URL and API key (hardcoded for this example)
ENDPOINT_URL = "your endpoint url"
PRIMARY_KEY = "your primary key"

PROMPT = "How many languages are in the world?"

# Initialize the ChatCompletionsClient
client = ChatCompletionsClient(
    endpoint=ENDPOINT_URL,
    credential=AzureKeyCredential(PRIMARY_KEY)
)

@weave.op
def run_inference(prompt, client):
    """
    Perform inference using the provided client and prompt.
    """
    response = client.complete(
        messages=[
            SystemMessage(content="You are a helpful assistant."),
            UserMessage(content=prompt),
        ]
    )
    try:
        content = response.choices[0].message.content
        print("Response:", content)
        return content
    except Exception as e:
        print(f"Failed to get response: {e}")
        return None


# Perform inference using the hardcoded prompt
run_inference(PROMPT, client)


This script is designed to make inference requests to an Azure AI model using Python and the azure.ai.inference library, with integration of Weave for logging. The script imports necessary libraries including ChatCompletionsClient from Azure AI for handling inference requests, weave for logging, and AzureKeyCredential for authentication.
Global variables ENDPOINT_URL and PRIMARY_KEY are defined to hold the Azure AI model's endpoint details, which should be set with actual values. The script initializes a ChatCompletionsClient using these credentials, allowing it to send requests to the Azure AI service.
The core function run_inference handles the process of sending a prompt to the model. This function is decorated with @weave.op, which enables logging of the operation. The function sends a request containing a system message, indicating the assistant's role, and the user's prompt. The function captures the response, extracts the content, and prints it to the console.
The main block of the script defines a hardcoded prompt and calls the run_inference function to get the model's response, displaying the result or an error message if the response fails to be retrieved.
You can run the script using the following command:
python inference.py

Using Weave to better understand your output

Weave is a powerful tool used for logging inputs and outputs of your model in a central location. The @weave.op() decorator is used to log the inputs and outputs of the decorated function, allowing you to easily track and analyze the data passing through your models.
The @weave.op decorator is applied to the function that handles the model inference. This decorator logs the function’s inputs and outputs, providing a detailed trace of the function’s execution. By logging this data, Weave helps in debugging and analyzing the model's performance, ensuring that you can trace the flow of data and identify any issues.
Here's what our logs look like inside of Weave:



Comparing Multiple Models

One of the great aspects of the Inference API is the consistency of the API across different models. This consistency means that once you've integrated with one model, you can easily switch to another model or even test multiple models with minimal changes to your code. The API's uniform interface ensures that the same code structure can be reused, which significantly reduces development time and effort.
This is particularly beneficial in scenarios where you need to experiment with different models to find the best fit for your application. For example, you might start with a smaller model to quickly test your application's functionality and then switch to a more powerful model as your requirements evolve. The consistent API interface allows you to make these transitions smoothly, without the need to rewrite your integration logic.
I followed the previous steps to create serverless deployments of the Mistral-Small Instruct and Phi-3 Instruct models so that I could compare the two on the same prompt. The script provided below is designed to facilitate this comparison. Additionally, the script integrates logging with Weave, allowing you to maintain a detailed record of the inputs and outputs for each model and making it easier to understand their behavior and performance. Here is the script:
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import SystemMessage, UserMessage
from azure.core.credentials import AzureKeyCredential
import weave

# Initialize Weave for logging
weave.init('azure-api')

# Define global variables for the two models
MODEL_1_ENDPOINT_URL = "your first model endpoint url"
MODEL_1_PRIMARY_KEY = "your first model key"

MODEL_2_ENDPOINT_URL = "your second model endpoint url"
MODEL_2_PRIMARY_KEY = "your second model key"

PROMPT = "How many languages are in the world?"

# Initialize the ChatCompletionsClient for each model
client1 = ChatCompletionsClient(
    endpoint=MODEL_1_ENDPOINT_URL,
    credential=AzureKeyCredential(MODEL_1_PRIMARY_KEY)
)

client2 = ChatCompletionsClient(
    endpoint=MODEL_2_ENDPOINT_URL,
    credential=AzureKeyCredential(MODEL_2_PRIMARY_KEY)
)


@weave.op
def get_model_prediction(prompt, client):
    response = client.complete(
        messages=[
            SystemMessage(content="You are a helpful assistant."),
            UserMessage(content=prompt),
        ]
    )
    try:
        return parse_response(response)
    except Exception as e:
        print(f"Failed to get response: {e}")
        return None


def parse_response(response):
    if response.choices:
        choice = response.choices[0]
        if choice.message and choice.message.content:
            content = choice.message.content
            print(f"Model response content: {content}")
            return content
        if choice.finish_reason:
            finish_reason = choice.finish_reason
            print(f"Finish reason: {finish_reason}")
    return "No valid response"


if __name__ == "__main__":
    if not MODEL_1_ENDPOINT_URL or not MODEL_1_PRIMARY_KEY or not MODEL_2_ENDPOINT_URL or not MODEL_2_PRIMARY_KEY:
        print("Error: Missing endpoint URL or primary key for one or both models")
    else:
        print("Response from Model 1:")
        response1 = get_model_prediction(PROMPT, client1)
        if response1:
            print(response1)
        else:
            print("No response or failed to decode the response from Model 1.")

        print("\nResponse from Model 2:")
        response2 = get_model_prediction(PROMPT, client2)
        if response2:
            print(response2)
        else:
            print("No response or failed to decode the response from Model 2.")

Inside Weave, we can also display responses based on different inputs and outputs! For this example, we will display only the prompt, endpoint, and output in the Weave trace viewer using the column manager button on the top right-hand side, as shown below:




Next, we will add a filter for a specific prompt. In practice, you can filter on any key passed to the function being logged! We add the filter for the prompt as shown below:

And after adding this filter, we will see the results down below!

As we can see, the responses from the Mistral Small Instruct and Phi-3 Instruct models are quite similar, with the only difference being the formatting of the numeric result: the Phi-3 response includes a comma in the number. Now, this is a very "small" detail, but "small" details mean everything in Machine Learning! Weave allows you to visualize these nuanced responses from your models and make more informed decisions about choosing the right models and training data for your product.

Evaluating AI Models on a Geography Dataset Using Weave

Now, we’ll demonstrate how to use Weave's evaluation capabilities to compare the performance of two AI models on a custom geography dataset I made using GPT-4o. By leveraging Weave's Evaluation class, we can systematically assess how well each model performs and make informed decisions about which model is better suited for our application.

Introduction to Weave Evaluations

Evaluation-driven development helps you reliably iterate on an application by systematically testing changes against a consistent dataset. The Evaluation class in Weave is designed to assess the performance of a Model on a given dataset using custom scoring functions. This approach ensures that you are comparing results accurately and provides a rich UI to drill into individual outputs and scores, enabling deeper insights into model performance.
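Before diving into the full comparison script, here is a minimal sketch of the Evaluation pattern we'll follow (the toy model and scorer are placeholders, and the exact scorer argument names can differ between Weave versions):

import asyncio
import weave
from weave import Evaluation

weave.init("azure-api-eval-geo")

# Each row's keys are passed by name to the model function and the scorers
examples = [{"question": "What is the capital of Japan?", "expected": "Tokyo"}]

@weave.op()
def exact_match(expected: str, model_output: str) -> dict:
    # Scorer: compare the model's output with the expected answer
    return {"correct": expected.strip().lower() == str(model_output).strip().lower()}

@weave.op()
def toy_model(question: str) -> str:
    # Placeholder for a real model call (e.g., an Azure inference request)
    return "Tokyo"

evaluation = Evaluation(dataset=examples, scorers=[exact_match])
asyncio.run(evaluation.evaluate(toy_model))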

The Custom Geography Dataset

The dataset we’ll use consists of geography-related questions paired with their expected answers. Here are a few examples:
[
  {"question": "What is the largest desert in the world?", "expected": "Sahara"},
  {"question": "Which river is the longest in the world?", "expected": "Nile"},
  {"question": "What is the capital of Japan?", "expected": "Tokyo"},
  {"question": "Which ocean is the largest by surface area?", "expected": "Pacific Ocean"},
  {"question": "Which desert is known as the 'Cold Desert'?", "expected": "Gobi Desert"}
]

This dataset is designed to test the models' ability to generate accurate and relevant responses to common geographical queries.
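Since the evaluation script below loads the dataset from a dataset.json file, one quick way to create that file is to dump the list above with the json module, for example:

import json

geography_dataset = [
    {"question": "What is the largest desert in the world?", "expected": "Sahara"},
    {"question": "Which river is the longest in the world?", "expected": "Nile"},
    {"question": "What is the capital of Japan?", "expected": "Tokyo"},
    {"question": "Which ocean is the largest by surface area?", "expected": "Pacific Ocean"},
    {"question": "Which desert is known as the 'Cold Desert'?", "expected": "Gobi Desert"}
]

# Write the dataset to the file the evaluation script expects
with open("dataset.json", "w") as f:
    json.dump(geography_dataset, f, indent=2)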

The Full Script

Below is the Python script that sets up the evaluation, defines the models, and runs the comparison using Weave:
import json
import asyncio
import weave
from weave import Evaluation, Model
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import SystemMessage, UserMessage
from azure.core.credentials import AzureKeyCredential


# Initialize Weave for logging
weave.init('azure-api-eval-geo')


# Define global variables for the two models
MODEL_1_ENDPOINT_URL = "https://Mistral-small-achqr.eastus2.models.ai.azure.com/chat/completions"
MODEL_1_PRIMARY_KEY = "your key"

MODEL_2_ENDPOINT_URL = "https://Phi-3-small-128k-instruct-fcsez.eastus2.models.ai.azure.com/chat/completions"
MODEL_2_PRIMARY_KEY = "your key"


# Initialize a ChatCompletionsClient for a given endpoint
def init_client(endpoint, key):
    return ChatCompletionsClient(
        endpoint=endpoint,
        credential=AzureKeyCredential(key)
    )


def get_model_prediction(prompt, client, model_name):
    response = client.complete(
        messages=[
            SystemMessage(content="You are a helpful assistant."),
            UserMessage(content=prompt),
        ]
    )
    try:
        content = response.choices[0].message.content
        prompt_tokens = response.usage.prompt_tokens
        completion_tokens = response.usage.completion_tokens
        total_tokens = response.usage.total_tokens
        # Return the generated text along with token usage statistics
        return {
            'output': content,
            'model': model_name,
            'usage': {
                'input_tokens': prompt_tokens,
                'output_tokens': completion_tokens,
                'total_tokens': total_tokens
            }
        }
    except Exception as e:
        print(f"Failed to get response: {e}")
        return {
            'output': "No valid response",
            'usage': {
                'input_tokens': 0,
                'output_tokens': 0,
                'total_tokens': 0
            }
        }


# Define custom scoring functions
@weave.op()
def substring_match(expected: str, model_output: dict) -> dict:
    match = expected.lower() in model_output['output'].lower()
    return {
        'substring_match': match
    }


# Define the base model for Weave
class MistralSmall(Model):
    model_name: str = "Mistral-small"
    endpoint: str = MODEL_1_ENDPOINT_URL
    primary_key: str = MODEL_1_PRIMARY_KEY
    client: ChatCompletionsClient = None  # Declare the client field

    def __init__(self):
        super().__init__()  # Initialize the base class
        # Initialize the client and assign it explicitly
        object.__setattr__(self, 'client', init_client(self.endpoint, self.primary_key))

    @weave.op()
    def predict(self, question: str):
        result = get_model_prediction(question, self.client, self.model_name)
        return result


class Phi3Small(Model):
    model_name: str = "Phi-3-small"
    endpoint: str = MODEL_2_ENDPOINT_URL
    primary_key: str = MODEL_2_PRIMARY_KEY
    client: ChatCompletionsClient = None  # Declare the client field

    def __init__(self):
        super().__init__()  # Initialize the base class
        # Initialize the client and assign it explicitly
        object.__setattr__(self, 'client', init_client(self.endpoint, self.primary_key))

    @weave.op()
    def predict(self, question: str):
        result = get_model_prediction(question, self.client, self.model_name)
        return result


def run_evaluation(model_class):
    # Load the dataset
    with open('dataset.json', 'r') as f:
        dataset = json.load(f)

    # Instantiate the model
    model = model_class()

    # Create the evaluation for the model
    evaluation = Evaluation(dataset=dataset, scorers=[substring_match])

    # Run the evaluation for the model
    print(f"Evaluating {model.model_name}:")
    asyncio.run(evaluation.evaluate(model.predict))
    print(f"{model.model_name} evaluation completed.")


if __name__ == "__main__":
    # Run evaluation for both models sequentially
    run_evaluation(MistralSmall)
    run_evaluation(Phi3Small)

Explanation of the script

In the initialization and setup phase, Weave is initialized for logging with the name azure-api-eval-geo. The endpoints and primary keys for two models, Mistral-Small and Phi-3-small, are defined. For model prediction, the get_model_prediction function sends a prompt to the specified model endpoint and retrieves the response, using the appropriate API key based on the endpoint provided.
In addition to retrieving the generated text, the get_model_prediction function also captures token usage statistics from the model output. Specifically, it logs the number of input_tokens (tokens used for the prompt), output_tokens (tokens generated in response), and total_tokens (sum of input and output tokens). This information is structured in the model output and passed to Weave for logging and analysis, enabling easy tracking of resource usage during model evaluations.
A custom scoring function, substring_match, is defined to check if the expected answer is a substring of the generated text from the model. This function is used to evaluate the accuracy of the models' responses. Two model classes, MistralSmall and Phi3Small, are defined, each including a predict method decorated with @weave.op() to log the inputs and outputs during evaluation.
In the evaluation execution phase, the run_evaluation function loads the dataset, instantiates the model class, and runs the evaluation, logging all the data in Weave for further analysis. In the __main__ block, evaluations for both models are run sequentially, allowing for a comparison of their performance on the same dataset.
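As with the earlier example, you can run the evaluation script from the command line (here saved as evaluate.py; the filename is just an example):
python evaluate.py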

The Evaluations dashboard

Once the evaluations are complete, Weave provides a rich dashboard to visualize and compare the results. To access it, navigate to the project and click the Evaluations tab in the top left corner. (Note: you will need to select the cells corresponding to your evaluation runs, after which you can click the 'Compare' button at the top right-hand corner of the screen.)

Metrics such as substring match accuracy and model latency can be viewed and compared side by side for each model. This helps in identifying which model is more accurate, faster, or more efficient in terms of token usage. The evaluation dashboard presents a detailed comparison of the models, including accuracy (substring match) and latency, aiding in making data-driven decisions on model selection.


Weave's output comparison feature lets you compare the specific responses generated by each model for the same input. For example, when asked "Which desert is known as the 'Cold Desert'?", you can see how each model responds and how well it matches the expected answer.

By leveraging Weave's evaluation tools, you gain a comprehensive understanding of each model's strengths and weaknesses, enabling you to select the best model for your application.

Conclusion

The Azure AI Model Inference API provides a robust and flexible solution for deploying and consuming machine learning models via API endpoints. This quick start guide has outlined the key steps to get started, from setting up your Azure environment and deploying models, to making inference requests and leveraging tools like Weave for enhanced logging and analysis.
Additionally, the Weave platform offers powerful evaluation features that enable you to systematically assess your models' capabilities. By integrating Weave's Evaluation class, you can run detailed comparisons between models and quickly visualize key metrics. These features ensure that you can make data-driven decisions when selecting the best model for your specific use case.
By following these steps, developers can efficiently integrate AI capabilities into their applications, benefiting from the scalability, cost-effectiveness, and ease of use provided by Azure's serverless endpoints. The Azure AI Model Inference API simplifies the process, allowing you to focus on building intelligent and responsive applications with pre-trained models.
To maximize your understanding and usage of the Azure AI Model Inference API, it is highly recommended to explore the official Azure documentation. The comprehensive resources available will help you navigate the intricacies of model deployment, API integration, and performance optimization, empowering you to fully harness the potential of Azure AI services in your projects.


Iterate on AI agents and models faster. Try Weights & Biases today.