Comparing GPT Models on Azure AI Foundry with W&B Weave
Learn how to compare and evaluate OpenAI’s GPT models on Azure using W&B Weave for text summarization tasks, leveraging Azure’s managed infrastructure and Weave’s customizable evaluation tools.
As organizations increasingly rely on AI to streamline operations, the ability to effectively compare and evaluate language models has become critical. Selecting the right model for specific use cases, such as summarizing research papers, financial reports, or business documents, can significantly impact efficiency and outcomes. While text summarization is a compelling example, the broader focus is on leveraging tools to systematically analyze and compare model performance.
This article provides a guide to evaluating and comparing large language models (LLMs) using OpenAI’s GPT models on Azure AI Foundry, integrated with W&B Weave’s robust evaluation platform. By combining Azure AI Foundry’s scalable infrastructure with Weave’s visualization and analysis tools, organizations can conduct detailed comparisons across models, refine configurations, and optimize workflows to align with their unique goals and operational demands.

Table of contents
- Foundation models available in the Azure AI Foundry
- Comparing GPT model summarization on Azure with W&B Weave
- Step 1: Accessing GPT models via the Azure AI Foundry
- Step 2: Generate a dataset to test summarization by multiple GPT models
- Evaluating the models with Weave Evaluations
- Comprehensive evaluation metrics
- Custom scoring integration
- Intuitive visualization with Weave
- Why choose Azure AI?
- Performance evaluation with Azure AI and W&B Weave
Foundation models available in the Azure AI Foundry
Azure AI Foundry provides a comprehensive platform to access a diverse array of foundation models, enabling organizations to tackle summarization and other language tasks with ease. With Azure, you benefit from a managed infrastructure that eliminates the complexity of maintaining and scaling AI systems, allowing you to focus on deriving insights and optimizing workflows.
The platform supports a mix of proprietary and open-source models, offering flexibility for a variety of use cases. Azure’s catalog includes advanced options from providers such as OpenAI, Meta, and Mistral, as well as Microsoft’s own Phi series models. These models are optimized for tasks ranging from conversational AI and summarization to document processing and high-throughput applications. Whether you require cutting-edge performance for complex tasks or cost-effective solutions for simpler needs, Azure’s extensive selection caters to a wide spectrum of operational requirements.
Combined with Weights & Biases Weave’s advanced evaluation tools, Azure AI Foundry empowers you to perform side-by-side comparisons of models, visualize performance metrics, and understand strengths and limitations. This seamless integration allows organizations to make informed decisions, selecting the models that align best with their unique goals and workflows.
Comparing GPT model summarization on Azure with W&B Weave
W&B Weave provides a powerful platform for logging and analyzing generated summaries during evaluation, enabling a centralized dashboard for performance comparison. This setup allows for a detailed side-by-side analysis of model outputs, highlighting differences in coherence, relevance, and overall quality.
To ensure a thorough evaluation, we'll employ a diverse range of metrics:
- ROUGE: Measures the overlap of key phrases and word sequences between generated and reference summaries.
- BERTScore: Assesses semantic similarity by comparing the contextual embeddings of the texts.
- Compression Ratio: Evaluates how concise the generated summaries are.
- Coverage: Examines how effectively the summaries capture critical content.
Additionally, a specialized GPT-4o scoring method enhances the analysis by providing qualitative evaluations of the summaries. This method rates each output on a scale from 1 to 5, considering factors such as accuracy, completeness, and alignment with the reference summary. Together, these metrics offer a comprehensive view of model performance, empowering teams to identify strengths, address weaknesses, and select the optimal model for their summarization needs.
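To get a quick feel for how the overlap-based metrics behave before running the full evaluation, here is a minimal sketch that scores a single candidate summary against a reference. It uses the rouge-score and bert-score packages installed in Step 1 below; the two example sentences are made up purely for illustration.

from rouge_score.rouge_scorer import RougeScorer
import bert_score

reference = "We propose a method that improves summarization quality on long documents."
candidate = "The paper introduces a technique that boosts summary quality for long documents."

# ROUGE: n-gram and longest-common-subsequence overlap with the reference
scorer = RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)
print({name: round(score.fmeasure, 3) for name, score in rouge.items()})

# BERTScore: semantic similarity from contextual embeddings (downloads a model on first use)
P, R, F1 = bert_score.score([candidate], [reference], lang="en")
print("BERTScore F1:", round(float(F1.mean()), 3))

ROUGE rewards exact word overlap, while BERTScore can still give credit when the candidate paraphrases the reference, which is why the evaluation later uses both.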
Step 1: Accessing GPT models via the Azure AI Foundry
To set up and deploy GPT models like GPT-4o and GPT-4o Mini on Azure, start by navigating to Azure AI Foundry and logging in with your Azure credentials. Once logged in, you’ll arrive at the dashboard where you can begin creating your project.

Click the "Create project" button to initialize a new project. In the dialog that appears, enter a project name and choose a hub to associate with it, or create a new hub if necessary. Once this is done, click "Create" to finalize the setup.

After creating your project, open it and navigate to the "Model catalog" from the sidebar on the left. This is the area where you can browse and explore various available AI models.
In the model catalog, filter the options by selecting "Serverless API" under the deployment options. This narrows the list to models that can be deployed with serverless infrastructure. Locate GPT-4o and GPT-4o Mini from the list of models displayed.

I will select GPT-4o and open its details page. From here, click "Deploy" and enter a deployment name, such as "gpt-4o-deployment." Choose "Global Standard" as the deployment type and confirm the setup. Repeat the process for GPT-4o Mini to deploy both models.

Once the deployments are complete, go to the "Models + Endpoints" section in the left-hand sidebar. Click on the deployed models to view their details. Copy the endpoint URL and API key for each model. These will be needed later for integration with your application.
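One convenient pattern, which the scripts later in this article follow, is to keep the endpoint and key in environment variables rather than hard-coding them. Here is a minimal sketch; the variable names are just a convention, not anything required by Azure, and the api_version matches the one used in the code below.

import os
from openai import AzureOpenAI

# Assumes you exported these beforehand, for example:
#   export AZURE_OPENAI_ENDPOINT="https://<your-resource-name>.openai.azure.com/"
#   export AZURE_OPENAI_API_KEY="<your-api-key>"
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-09-01-preview",
)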

As a final step, install the following Python libraries:
pip install openai==1.54.5 arxiv==2.1.3 PyMuPDF==1.24.9 weave==0.51.18 fitz==0.0.1.dev2 bert-score==0.3.13 rouge-score==0.1.2
At this point, both GPT-4o and GPT-4o Mini are successfully deployed on Azure and ready to be accessed through their API endpoints. Now we are ready to write some code!
Step 2: Generate a dataset to test summarization by multiple GPT models
We will benchmark the text summarization capabilities of the GPT-4o and GPT-4o mini models by testing their ability to accurately generate an abstract for a research paper when the original abstract is removed. This approach allows us to compare how effectively the models can produce concise, relevant summaries of key information from the main content of each paper.
This is an effective test of the model’s summarization abilities because it mirrors the task a human would face when summarizing complex information: distilling the core objectives, methods, and findings of a paper into a brief, coherent abstract. By withholding the abstract, we can assess whether the model can independently identify and convey the most essential aspects of the paper, demonstrating a human-like ability to process, evaluate, and summarize academic content in a structured and concise format. This setup allows us to evaluate not only the model’s accuracy in capturing information but also its skill in organizing it succinctly, as a human expert would.
To start, I'll share a basic script that will show how to run inference with the model:
import weave
import os
from openai import AzureOpenAI
import json

# Initialize Weave for logging
weave.init('azure-api')

# Initialize the AzureOpenAI client
client = AzureOpenAI(
    azure_endpoint="your endpoint url",
    api_key="your key",
    api_version="2024-09-01-preview"
)

@weave.op
def run_inference(prompt, client):
    """Function to perform inference using the provided client and prompt."""
    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}]
        )
        # Parse the response
        response_json = json.loads(response.model_dump_json(indent=2))
        choices = response_json.get("choices", [])
        if choices:
            content = choices[0].get("message", {}).get("content", "")
            print("Generated Content:")
            print(content)
            return content
        else:
            print("No content found in response")
            return None
    except Exception as e:
        print(f"Failed to get response: {e}")
        return None

# Define the prompt and perform inference
PROMPT = "What steps should I think about when writing my first Python API?"
run_inference(PROMPT, client)
Here, we also use Weave to track the inputs and outputs of our model. This demonstrates how to use the Traces component of Weave. Later on, I will demonstrate how to use Weave Evaluations, which is specifically designed for comparing models on the same dataset.
To create our benchmark dataset, we first collect research papers from arXiv, focusing on AI and machine learning topics. From each paper, we extract only the first page, where the abstract is typically located, and use GPT-4o to isolate this section and structure it as a JSON object. These extracted abstracts will serve as "gold standard" reference points, saved in a JSONL file format for easy loading and consistent evaluation.
This file format makes it easy to load and process the summaries when we evaluate the GPT-4o models, and ensures our reference data is properly versioned and easily shareable. Here's the code that will download the papers and extract the abstracts from the first page of each paper using GPT-4o:
import os
import arxiv
import fitz  # PyMuPDF
import json
from openai import AzureOpenAI
import weave
import re
from time import sleep

# Initialize Weave for logging
weave.init('azure_paper_abstract_gen')

# Set up Azure OpenAI
os.environ["AZURE_OPENAI_ENDPOINT"] = "your endpoint url"
os.environ["AZURE_OPENAI_API_KEY"] = "your api key"

client = AzureOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version="2024-09-01-preview"
)

# Directory to save downloaded papers
download_dir = "arxiv_papers"
os.makedirs(download_dir, exist_ok=True)

# Define AI-specific search queries
search_queries = [
    "Large Language Models for vision tasks AND cat:cs.AI",
    "Multimodal AI techniques AND cat:cs.CV",
    "Applications of Transformers in healthcare AI AND cat:cs.LG",
    "Few-shot learning in AI and ML AND cat:cs.LG",
    "Vision and language models integration AND cat:cs.CV",
    "Domain-specific fine-tuning for ML models AND cat:cs.LG",
    "Foundational models in AI and CV applications AND cat:cs.AI",
    "NLP in robotics and vision systems AND cat:cs.AI",
    "Bias and fairness in AI for CV AND cat:cs.CV",
    "Evaluation metrics for multimodal AI AND cat:cs.LG"
]

def download_papers(max_pages=15, max_attempts_per_query=20):
    """Download one suitable paper for each query, retrying if papers exceed page limit."""
    papers = []
    downloaded_titles = set()
    client = arxiv.Client()
    for query in search_queries:
        paper_found = False
        attempt = 0
        while not paper_found and attempt < max_attempts_per_query:
            search = arxiv.Search(
                query=query,
                max_results=100,
                sort_by=arxiv.SortCriterion.SubmittedDate
            )
            try:
                results = list(client.results(search))
                start_idx = attempt * 5
                end_idx = start_idx + 5
                current_batch = results[start_idx:end_idx]
                for result in current_batch:
                    if result.title not in downloaded_titles:
                        print(f"Downloading: {result.title}")
                        paper_id = result.entry_id.split('/')[-1]
                        pdf_filename = f"{paper_id}.pdf"
                        pdf_path = os.path.join(download_dir, pdf_filename)
                        result.download_pdf(dirpath=download_dir, filename=pdf_filename)
                        try:
                            with fitz.open(pdf_path) as pdf:
                                if pdf.page_count <= max_pages:
                                    papers.append({
                                        "title": result.title,
                                        "file_path": pdf_path,
                                        "arxiv_id": paper_id
                                    })
                                    downloaded_titles.add(result.title)
                                    print(f"Accepted: {result.title}")
                                    paper_found = True
                                    break
                                else:
                                    os.remove(pdf_path)
                                    print(f"Skipped (too many pages: {pdf.page_count}): {result.title}")
                        except Exception as e:
                            print(f"Error checking PDF {pdf_path}: {e}")
                            if os.path.exists(pdf_path):
                                os.remove(pdf_path)
                attempt += 1
                if not paper_found:
                    print(f"Attempt {attempt}/{max_attempts_per_query} for query: {query}")
                    sleep(3)
            except Exception as e:
                print(f"Error during download: {e}")
                sleep(3)
                attempt += 1
                continue
        if not paper_found:
            print(f"Failed to find suitable paper for query after {max_attempts_per_query} attempts: {query}")
    print(f"\nSuccessfully downloaded {len(papers)} papers")
    return papers

def extract_first_page_text(pdf_path):
    """Extract text from only the first page of the PDF."""
    with fitz.open(pdf_path) as pdf:
        if pdf.page_count > 0:
            page = pdf[0]
            return page.get_text()
    return ""

@weave.op
def extract_abstract_with_azure(text, title):
    """Extract abstract using Azure GPT-4o with JSON output."""
    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "You are a research paper analysis assistant. Extract ONLY the abstract section from the paper content provided. Return the result in JSON format with 'abstract' as the key. If you cannot find the abstract, return an empty string as the value."},
                {"role": "user", "content": f"Paper title: {title}\n\nPaper content:\n\n{text}"}
            ],
            response_format={"type": "json_object"}
        )
        content = response.choices[0].message.content
        return json.loads(content)
    except Exception as e:
        print(f"Error extracting abstract for {title}: {e}")
        sleep(3)
        return {"abstract": ""}

def count_words(text):
    """Count words excluding punctuation and special characters."""
    cleaned_text = re.sub(r'[^\w\s]', ' ', text.lower())
    words = [word for word in cleaned_text.split() if word.strip()]
    return len(words)

def main():
    # Download papers
    papers = download_papers()
    print(f"\nDownloaded {len(papers)} papers. Processing abstracts...\n")

    # Process papers and extract abstracts
    paper_data = []
    for paper in papers:
        title = paper["title"]
        pdf_path = paper["file_path"]
        print(f"Processing: {title}")
        first_page_text = extract_first_page_text(pdf_path)
        abstract_json = extract_abstract_with_azure(first_page_text, title)
        abstract_text = abstract_json.get('abstract', '')
        word_count = count_words(abstract_text)
        paper_data.append({
            "title": title,
            "file_path": pdf_path,
            "abstract": abstract_text,
            "word_count": word_count,
            "arxiv_id": paper["arxiv_id"]
        })
        sleep(2)

    # Save to JSONL file
    output_file = "paper_abstracts.jsonl"
    with open(output_file, "w") as f:
        for entry in paper_data:
            json.dump(entry, f)
            f.write("\n")
    print(f"\nProcessed {len(paper_data)} papers. Results saved to {output_file}")

if __name__ == "__main__":
    main()
This script establishes a dataset of reference summaries to evaluate the Azure GPT-4o models. We extracted the original abstracts directly from AI research papers sourced from arXiv, providing a consistent and authentic ground truth for our evaluation. These abstracts are saved in JSONL format, serving as benchmarks for comparison.
The process involves downloading research papers, extracting only the abstracts from each PDF, and organizing them in a structured format. By creating this structured dataset, we ensure a reliable and standardized basis for assessing the performance of the GPT-4o models. With this ground-truth dataset in place, we can now generate and compare summaries from the GPT-4o models against these original abstracts to comprehensively evaluate their summarization capabilities.
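For reference, each line of paper_abstracts.jsonl is a standalone JSON object with the fields written by the script above. A short sketch showing how one (made-up, shortened) record can be parsed:

import json

# Example of a single line from paper_abstracts.jsonl; the values here are illustrative only
line = '{"title": "Few-Shot Learning for Vision-Language Models", "file_path": "arxiv_papers/2411.01234v1.pdf", "abstract": "We study ...", "word_count": 182, "arxiv_id": "2411.01234v1"}'
record = json.loads(line)
print(record["title"], record["word_count"])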
Evaluating the models with Weave Evaluations
To evaluate the text summarization capabilities of GPT-4o models in Azure's AI Foundry, we will use a range of metrics to provide a comprehensive analysis. These metrics combine traditional text similarity, neural semantic similarity, and an LLM-based scoring system to effectively assess performance.
GPT-4o will act as an automated evaluator, rating generated abstracts on a scale from 1 to 5 based on how well they capture key elements like objectives, methodologies, and findings. For neural semantic similarity, we will use BERTScore to evaluate contextual alignment, while ROUGE scores will measure lexical and structural overlaps. Additional metrics, such as coverage and compression ratios, will assess information retention and summary conciseness.
I recommend setting the WEAVE_PARALLELISM environment variable to a low value before running the evaluation code, for example with the command export WEAVE_PARALLELISM=1. This limits how many evaluation calls Weave runs concurrently, which helps avoid hitting API rate limits. The resulting metrics will be visualized in Weave's evaluation dashboard, providing a multi-dimensional view of performance and highlighting areas for improvement. Below is the code we will use for the evaluation:
import weave
from weave import Model, Scorer
import json
import asyncio
import time
from time import sleep
from typing import Dict, Any
from rouge_score.rouge_scorer import RougeScorer
import bert_score
import fitz
from weave.trace.box import unbox
from openai import AzureOpenAI

# Initialize Weave
weave.init('azure_abstract_eval')

gpt4o_client = AzureOpenAI(
    azure_endpoint="https://<your-resource-name>.openai.azure.com/openai/deployments/<gpt-4o-deployment-name>/chat/completions?api-version=2024-08-01-preview",
    api_key="<your-api-key>",
    api_version="2024-08-01-preview"
)

gpt4o_mini_client = AzureOpenAI(
    azure_endpoint="https://<your-resource-name>.openai.azure.com/openai/deployments/<gpt-4o-mini-deployment-name>/chat/completions?api-version=2024-08-01-preview",
    api_key="<your-api-key>",
    api_version="2024-08-01-preview"
)

def create_prediction_prompt(text, title, target_length):
    """Create a prompt for abstract generation with target length."""
    return (
        f"You are tasked with generating an abstract for a research paper titled '{title}'. "
        f"The abstract should be approximately {target_length} words long.\n\n"
        f"Generate an abstract that summarizes the key points of the paper, including the "
        f"research objective, methodology, and main findings. The abstract should be "
        f"self-contained and clearly communicate the paper's contribution. Respond only with the ABSTRACT, and NOT the title\n\n"
        f"Paper content:\n\n{text}"
        f"Respond only with the ABSTRACT!"
    )

def create_evaluation_prompt(gt_abstract: str, generated_abstract: str) -> str:
    """Create standardized evaluation prompt."""
    return f'''You are evaluating how well a generated abstract captures the information from a ground truth abstract.

Ground Truth Abstract:
{gt_abstract}

Generated Abstract:
{generated_abstract}

Rate the generated abstract on a scale from 1-5, where:
1: Poor - Missing most key information or seriously misrepresenting the research
2: Fair - Captures some information but misses crucial elements
3: Good - Captures most key points but has some gaps or inaccuracies
4: Very Good - Accurately captures nearly all key information with minor omissions
5: Excellent - Perfectly captures all key information and maintains accuracy

Respond ONLY with a JSON object containing a single "score" field with an integer value from 1-5.
Example response format:
{{"score": 4}}'''

def extract_text_after_page_one(pdf_path: str) -> str:
    """Extract text from page 2 onwards."""
    if isinstance(pdf_path, weave.trace.box.BoxedStr):
        pdf_path = unbox(pdf_path)
    text = ""
    try:
        with fitz.open(pdf_path) as pdf:
            if pdf.page_count > 1:
                for page_num in range(1, pdf.page_count):  # Start from page 2
                    page = pdf[page_num]
                    text += page.get_text()
    except Exception as e:
        print(f"Error extracting text from PDF {pdf_path}: {e}")
    return text

def run_inference(client: AzureOpenAI, model: str, prompt: str, max_retries: int = 10, base_wait: int = 10) -> str:
    """Function to perform inference using specified Azure model with exponential backoff retry.

    Args:
        client: AzureOpenAI client
        model: Model name/id
        prompt: Input prompt
        max_retries: Maximum number of retry attempts (default: 10)
        base_wait: Initial wait time in seconds (default: 10)
    """
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": "You are a research paper abstract writer. Write clear, concise, and informative abstracts."},
                    {"role": "user", "content": prompt}
                ],
                temperature=0.0
            )
            response_json = json.loads(response.model_dump_json(indent=2))
            choices = response_json.get("choices", [])
            if choices:
                content = choices[0].get("message", {}).get("content", "")
                return content
            else:
                print("No content found in response")
        except Exception as e:
            wait_time = base_wait * (2 ** attempt)  # Exponential backoff
            print(f"Attempt {attempt + 1}/{max_retries} failed: {e}")
            print(f"Waiting {wait_time} seconds before retrying...")
            time.sleep(wait_time)
            continue
    print(f"Failed to get response after {max_retries} retries")
    return None

class GPT4o(Model):
    @weave.op
    def predict(self, title: str, pdf_path: str, word_count: int) -> dict:
        """Predict abstract using GPT-4o."""
        paper_text = extract_text_after_page_one(pdf_path)
        prompt = create_prediction_prompt(paper_text, title, word_count)
        prediction = run_inference(gpt4o_client, "gpt-4o", prompt)
        return {"model_output": prediction}

class GPT4oMini(Model):
    @weave.op
    def predict(self, title: str, pdf_path: str, word_count: int) -> dict:
        """Predict abstract using GPT-4o mini."""
        paper_text = extract_text_after_page_one(pdf_path)
        prompt = create_prediction_prompt(paper_text, title, word_count)
        prediction = run_inference(gpt4o_mini_client, "gpt-4o-mini", prompt)
        return {"model_output": prediction}

@weave.op
def bert_scorer(gt_abstract: str, model_output: dict) -> Dict[str, float]:
    """Calculate BERTScore for the abstract."""
    if not model_output or 'model_output' not in model_output:
        return {'bert_score': 0.0}
    try:
        P, R, F1 = bert_score.score(
            [model_output['model_output']],
            [gt_abstract],
            lang='en',
            model_type='microsoft/deberta-xlarge-mnli'
        )
        return {'bert_score': float(F1.mean())}
    except Exception as e:
        print(f"Error calculating BERTScore: {e}")
        return {'bert_score': 0.0}

# You can also use the "function" format for your scorer:
# @weave.op
# def GPT4oScorer(gt_abstract: str, model_output: dict) -> dict:
#     """Evaluate abstract using GPT-4o."""
#     if not model_output or 'model_output' not in model_output:
#         return {'gpt4o_score': 0.0}
#     try:
#         prompt = create_evaluation_prompt(gt_abstract, model_output["model_output"])
#         response = run_inference(gpt4o_client, "gpt-4o", prompt)
#         if response:
#             # Clean the response text
#             response_text = response.strip()
#             # Remove any additional text before or after the JSON
#             response_text = response_text.split('{')[1].split('}')[0]
#             response_text = '{' + response_text + '}'
#             try:
#                 result = json.loads(response_text)
#                 if 'score' in result and isinstance(result['score'], (int, float)):
#                     score = float(result['score'])
#                     if 1 <= score <= 5:
#                         return {'gpt4o_score': score}
#             except json.JSONDecodeError:
#                 print(f"Invalid JSON response: {response_text}")
#         print("Using default score due to invalid response")
#         return {'gpt4o_score': 0.0}
#     except Exception as e:
#         print(f"Error in GPT-4o evaluation: {e}")
#         return {'gpt4o_score': 0.0}

class GPT4oScorer(Scorer):
    model_id: str = "gpt-4o"
    system_prompt: str = "You are evaluating how well a generated abstract captures the information from a ground truth abstract."

    @weave.op
    def create_evaluation_prompt(self, gt_abstract: str, generated_abstract: str) -> str:
        """Create standardized evaluation prompt."""
        return f'''Ground Truth Abstract:
{gt_abstract}

Generated Abstract:
{generated_abstract}

Rate the generated abstract on a scale from 1-5, where:
1: Poor - Missing most key information or seriously misrepresenting the research
2: Fair - Captures some information but misses crucial elements
3: Good - Captures most key points but has some gaps or inaccuracies
4: Very Good - Accurately captures nearly all key information with minor omissions
5: Excellent - Perfectly captures all key information and maintains accuracy

Respond ONLY with a JSON object containing a single "score" field with an integer value from 1-5.
Example response format:
{{"score": 4}}'''

    @weave.op
    def call_llm(self, gt_abstract: str, model_output: str) -> dict:
        """Call GPT-4o for evaluation."""
        try:
            prompt = self.create_evaluation_prompt(gt_abstract, model_output)
            response = run_inference(gpt4o_client, self.model_id, prompt)
            if response:
                # Clean the response text
                response_text = response.strip()
                # Remove any additional text before or after the JSON
                response_text = response_text.split('{')[1].split('}')[0]
                response_text = '{' + response_text + '}'
                try:
                    result = json.loads(response_text)
                    if 'score' in result and isinstance(result['score'], (int, float)):
                        score = float(result['score'])
                        if 1 <= score <= 5:
                            return {'gpt4o_score': score}
                except json.JSONDecodeError:
                    print(f"Invalid JSON response: {response_text}")
            print("Using default score due to invalid response")
            return {'gpt4o_score': 0.0}
        except Exception as e:
            print(f"Error in GPT-4o evaluation: {e}")
            return {'gpt4o_score': 0.0}

    @weave.op
    def score(self, model_output: dict, gt_abstract: str) -> dict:
        """Score the generated abstract against the ground truth.

        Args:
            model_output: Dictionary containing the generated abstract under 'model_output' key
            gt_abstract: The ground truth abstract to compare against
        """
        if not model_output or 'model_output' not in model_output:
            return {'gpt4o_score': 0.0}
        return self.call_llm(gt_abstract, model_output['model_output'])

@weave.op
def rouge_scorer(gt_abstract: str, model_output: dict) -> Dict[str, float]:
    """Calculate ROUGE scores for the abstract."""
    if not model_output or 'model_output' not in model_output:
        return {'rouge1_f': 0.0, 'rouge2_f': 0.0, 'rougeL_f': 0.0}
    try:
        scorer = RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
        scores = scorer.score(gt_abstract, model_output['model_output'])
        return {
            'rouge1_f': float(scores['rouge1'].fmeasure),
            'rouge2_f': float(scores['rouge2'].fmeasure),
            'rougeL_f': float(scores['rougeL'].fmeasure)
        }
    except Exception as e:
        print(f"Error calculating ROUGE scores: {e}")
        return {'rouge1_f': 0.0, 'rouge2_f': 0.0, 'rougeL_f': 0.0}

@weave.op
def compression_scorer(gt_abstract: str, model_output: dict) -> Dict[str, float]:
    """Calculate compression ratio of the abstract."""
    if not model_output or 'model_output' not in model_output:
        return {'compression_ratio': 0.0}
    try:
        gt_words = len(gt_abstract.split())
        generated_words = len(model_output['model_output'].split())
        compression_ratio = min(gt_words, generated_words) / max(gt_words, generated_words)
        return {'compression_ratio': float(compression_ratio)}
    except Exception as e:
        print(f"Error calculating compression ratio: {e}")
        return {'compression_ratio': 0.0}

@weave.op
def coverage_scorer(gt_abstract: str, model_output: dict) -> Dict[str, float]:
    """Calculate content coverage using word overlap."""
    if not model_output or 'model_output' not in model_output:
        return {'coverage_score': 0.0}
    try:
        gt_words = set(gt_abstract.lower().split())
        generated_words = set(model_output['model_output'].lower().split())
        intersection = len(gt_words.intersection(generated_words))
        union = len(gt_words.union(generated_words))
        coverage_score = intersection / union if union > 0 else 0.0
        return {'coverage_score': float(coverage_score)}
    except Exception as e:
        print(f"Error calculating coverage score: {e}")
        return {'coverage_score': 0.0}

def create_evaluation_dataset(gt_file: str):
    """Create dataset from ground truth file."""
    dataset = []
    with open(gt_file, 'r') as f:
        for line in f:
            entry = json.loads(line)
            dataset.append({
                "title": entry["title"],
                "gt_abstract": entry["abstract"],
                "pdf_path": entry["file_path"],
                "word_count": entry["word_count"]
            })
    return dataset

async def run_evaluations(gt_file: str):
    """Run evaluations for each model."""
    eval_dataset = create_evaluation_dataset(gt_file)

    # Initialize models
    models = {
        "gpt4o": GPT4o(),
        "gpt4o_mini": GPT4oMini()
    }
    gpt_scorer = GPT4oScorer()

    # Setup scorers
    scorers = [
        gpt_scorer,  # class-based scorer
        rouge_scorer,
        compression_scorer,
        coverage_scorer,
        bert_scorer
    ]

    # Run evaluations
    results = {}
    for model_name, model in models.items():
        print(f"\nEvaluating {model_name}...")
        evaluation = weave.Evaluation(
            dataset=eval_dataset,
            scorers=scorers,
            name=model_name + " Eval"
        )
        results[model_name] = await evaluation.evaluate(model)

    # Print results
    print("\nEvaluation Results:")
    for model_name, result in results.items():
        print(f"\n{model_name} Results:")
        print(json.dumps(result, indent=2))

    # Save results to file
    output_file = "gpt4o_evaluation_results.json"
    with open(output_file, 'w') as f:
        json.dump(results, f, indent=2)
    print(f"\nResults saved to {output_file}")
    return results

if __name__ == "__main__":
    gt_file = "paper_abstracts.jsonl"
    asyncio.run(run_evaluations(gt_file))
Our evaluation leverages Weave’s robust framework to systematically compare summaries generated by Azure’s GPT-4o models against ground truth abstracts. Because every model output and score is logged as a trace, results can be revisited and compared after the run, which simplifies debugging, facilitates iterative refinement of metrics, and ensures consistency in evaluating generated summaries.
To differentiate model performance, we define two distinct classes - GPT4o and GPT4oMini - which are tailored to their respective configurations and displayed as separate entities within Weave’s evaluation dashboard. This ensures clarity and precision when analyzing their outputs side by side.
Comprehensive evaluation metrics
The evaluation incorporates a suite of metrics to provide a holistic view of model performance:
- Coverage: Measures how well key content from the original abstracts is preserved in the generated summaries.
- Compression Ratio: Evaluates the trade-off between summary conciseness and detail retention.
- BERTScore: Assesses semantic similarity by comparing contextual embeddings of the texts.
- LLM Judge: A custom scoring method using GPT-4o to rate summaries on alignment, accuracy, and completeness.
By combining lexical, semantic, and structural evaluations, this framework offers an in-depth assessment of each model’s strengths and weaknesses.
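The compression-ratio and coverage scorers used in the evaluation script above boil down to two short calculations, extracted here for clarity (the standalone function names are just for illustration):

def compression_ratio(reference: str, generated: str) -> float:
    """Ratio of the shorter word count to the longer; 1.0 means the lengths match."""
    ref_len, gen_len = len(reference.split()), len(generated.split())
    return min(ref_len, gen_len) / max(ref_len, gen_len)

def coverage(reference: str, generated: str) -> float:
    """Jaccard overlap of the lowercased word sets of the two texts."""
    ref_words, gen_words = set(reference.lower().split()), set(generated.lower().split())
    union = ref_words | gen_words
    return len(ref_words & gen_words) / len(union) if union else 0.0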
Custom scoring integration
To enhance the evaluation, a custom LLM judge scorer was developed, allowing GPT-4o to rate its own outputs based on their alignment with ground truth abstracts. Integrated seamlessly into Weave’s pipeline, this scoring method enables automated logging and comparison across datasets, making it indispensable for iterative model analysis. While custom classes provide flexibility for complex workflows, basic scoring functions can also be used for simpler scenarios, offering adaptability to diverse needs.
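For simpler scenarios, a scorer can be a plain function decorated with @weave.op and passed to weave.Evaluation alongside the class-based GPT4oScorer. Here is a minimal sketch; the brevity check is purely illustrative and not part of the evaluation above.

import weave

@weave.op
def brevity_scorer(gt_abstract: str, model_output: dict) -> dict:
    """Toy function-style scorer: flags generated abstracts that run much longer than the reference."""
    if not model_output or "model_output" not in model_output:
        return {"within_length_budget": 0.0}
    over_budget = len(model_output["model_output"].split()) > 1.2 * len(gt_abstract.split())
    return {"within_length_budget": 0.0 if over_budget else 1.0}

Function scorers like this appear in the Weave dashboard next to class-based scorers, so both styles can be mixed freely within a single evaluation.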
Intuitive visualization with Weave
Weave’s interactive dashboard simplifies analysis, offering detailed visualizations of evaluation results. Teams can explore model outputs, performance metrics, and comparative insights in an intuitive interface. This enables clear identification of areas for improvement and practical decision-making.
This combination of structured evaluations, tailored scoring methods, and insightful visualizations demonstrates how Weave empowers organizations to analyze and refine GPT models effectively. By bridging quantitative rigor with usability, Weave provides a critical framework for optimizing model performance across use cases.


Here, we observe that the GPT-4o model slightly outperforms the GPT-4o Mini model in several key metrics, including the LLM judge score, ROUGE-L, ROUGE-1, compression ratio, and BERTScore. GPT-4o demonstrates better alignment with ground truth abstracts, capturing more structural and phrase-level similarities and achieving a stronger balance between conciseness and information retention. These results highlight its effectiveness in producing semantically accurate and well-structured outputs.
On the other hand, GPT-4o Mini offers advantages in efficiency and detail preservation. It maintains a higher coverage score, indicating better preservation of input details, and is significantly faster, with much lower inference latency, making it more suitable for applications where speed and resource efficiency are critical.
Weave Evaluations has proven itself to be invaluable for uncovering these insights, showcasing where the models excel or fall short and highlighting how these differences can impact practical, real-world decision-making. By enabling detailed, side-by-side comparisons across both metrics and outputs, Weave provides a practical framework for organizations to identify models that best align with their performance and efficiency goals.
This type of analysis is essential for understanding the trade-offs between models, allowing organizations to optimize their AI investments. With Weave’s ability to provide both quantitative and qualitative comparisons, ML engineers can look beyond metrics and focus on operational alignment to select the most suitable model for their specific use case. Here's a screenshot of the comparisons view!
Why choose Azure AI?
Azure AI offers a robust platform for working with large language models, providing access to a diverse array of models designed for various tasks. This variety allows you to select models that best align with your unique requirements. Seamless integration with Microsoft Azure's ecosystem enables straightforward deployment and data management, enhancing productivity and operational efficiency.
Azure AI's scalability ensures that model deployments can be easily adjusted to meet fluctuating demands, delivering optimal performance across diverse workloads. The platform prioritizes security and compliance, implementing strong data protection measures that adhere to Microsoft's rigorous standards, ensuring the safety of sensitive information. Supported by Azure's high-performance cloud infrastructure, the platform guarantees reliable and efficient model operations, delivering consistent, high-quality results for a wide range of applications.
Performance evaluation with Azure AI and W&B Weave
By leveraging Azure AI and W&B Weave, we assessed the performance of GPT-4o and GPT-4o Mini across critical metrics, including an LLM-based judge score. The results reveal that GPT-4o excels in producing accurate and concise summaries, outperforming GPT-4o Mini on metrics such as ROUGE-L, ROUGE-1, and compression ratio, as well as the LLM judge score.
However, GPT-4o Mini showcases its strengths in efficiency, offering faster inference times and achieving higher coverage scores, which highlight its ability to preserve critical input details. This makes it a highly effective option for speed-sensitive or resource-constrained applications.
Through the combined power of Azure AI’s scalable infrastructure and Weave’s advanced evaluation tools, these insights empower users to select models that align with their priorities, whether the focus is on speed, accuracy, or overall efficiency.