Comparing GPT Models on Azure AI Foundry with W&B Weave
Learn how to compare and evaluate OpenAI’s GPT models on Azure using W&B Weave for text summarization tasks, leveraging Azure’s managed infrastructure and Weave’s customizable evaluation tools.
As organizations increasingly rely on AI to streamline operations, the ability to effectively compare and evaluate language models has become critical. Selecting the right model for specific use cases, such as summarizing research papers, financial reports, or business documents, can significantly impact efficiency and outcomes. While text summarization is a compelling example, the broader focus is on leveraging tools to systematically analyze and compare model performance.
This article provides a guide to evaluating and comparing large language models (LLMs) using OpenAI’s GPT models on Azure AI Foundry, integrated with W&B Weave’s robust evaluation platform. By combining Azure AI Foundry’s scalable infrastructure with Weave’s visualization and analysis tools, organizations can conduct detailed comparisons across models, refine configurations, and optimize workflows to align with their unique goals and operational demands.

Table of contents
- Foundation models available in the Azure AI Foundry
- Comparing GPT model summarization on Azure with W&B Weave
- Step 1: Accessing GPT models via the Azure AI Foundry
- Step 2: Generate a dataset to test summarization by multiple GPT models
- Evaluating the models with Weave Evaluations
- Comprehensive evaluation metrics
- Custom scoring integration
- Intuitive visualization with Weave
- Why choose Azure AI?
- Performance evaluation with Azure AI and W&B Weave
Foundation models available in the Azure AI Foundry
Azure AI Foundry provides a comprehensive platform to access a diverse array of foundation models, enabling organizations to tackle summarization and other language tasks with ease. With Azure, you benefit from a managed infrastructure that eliminates the complexity of maintaining and scaling AI systems, allowing you to focus on deriving insights and optimizing workflows.
The platform supports a mix of proprietary and open-source models, offering flexibility for a variety of use cases. Azure’s catalog includes advanced options from providers such as OpenAI, Meta, and Mistral, as well as Microsoft’s own Phi series models. These models are optimized for tasks ranging from conversational AI and summarization to document processing and high-throughput applications. Whether you require cutting-edge performance for complex tasks or cost-effective solutions for simpler needs, Azure’s extensive selection caters to a wide spectrum of operational requirements.
Combined with Weights & Biases Weave’s advanced evaluation tools, Azure AI Foundry empowers you to perform side-by-side comparisons of models, visualize performance metrics, and understand strengths and limitations. This seamless integration allows organizations to make informed decisions, selecting the models that align best with their unique goals and workflows.
Comparing GPT model summarization on Azure with W&B Weave
W&B Weave provides a powerful platform for logging and analyzing generated summaries during evaluation, enabling a centralized dashboard for performance comparison. This setup allows for a detailed side-by-side analysis of model outputs, highlighting differences in coherence, relevance, and overall quality.
To ensure a thorough evaluation, we'll employ a diverse range of metrics:
- ROUGE: Measures the overlap of key phrases and word sequences between generated and reference summaries.
- BERTScore: Assesses semantic similarity by comparing the contextual embeddings of the texts.
- Compression Ratio: Evaluates how concise the generated summaries are.
- Coverage: Examines how effectively the summaries capture critical content.
Additionally, a specialized GPT-4o scoring method enhances the analysis by providing qualitative evaluations of the summaries. This method rates each output on a scale from 1 to 5, considering factors such as accuracy, completeness, and alignment with the reference summary. Together, these metrics offer a comprehensive view of model performance, empowering teams to identify strengths, address weaknesses, and select the optimal model for their summarization needs.
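To get a quick feel for how the overlap-based metrics behave before running the full evaluation, here is a minimal sketch that scores a single candidate summary against a reference. It uses the rouge-score and bert-score packages installed in Step 1 below; the two example sentences are made up purely for illustration.

from rouge_score.rouge_scorer import RougeScorer
import bert_score

reference = "We propose a method that improves summarization quality on long documents."
candidate = "The paper introduces a technique that boosts summary quality for long documents."

# ROUGE: n-gram and longest-common-subsequence overlap with the reference
scorer = RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)
print({name: round(score.fmeasure, 3) for name, score in rouge.items()})

# BERTScore: semantic similarity from contextual embeddings (downloads a model on first use)
P, R, F1 = bert_score.score([candidate], [reference], lang="en")
print("BERTScore F1:", round(float(F1.mean()), 3))

ROUGE rewards exact word overlap, while BERTScore can still give credit when the candidate paraphrases the reference, which is why the evaluation later uses both.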
Step 1: Accessing GPT models via the Azure AI Foundry
To set up and deploy GPT models like GPT-4o and GPT-4o Mini on Azure, start by navigating to Azure AI Foundry and logging in with your Azure credentials. Once logged in, you’ll arrive at the dashboard where you can begin creating your project.

Click the "Create project" button to initialize a new project. In the dialog that appears, enter a project name and choose a hub to associate with it, or create a new hub if necessary. Once this is done, click "Create" to finalize the setup.

After creating your project, open it and navigate to the "Model catalog" from the sidebar on the left. This is the area where you can browse and explore various available AI models.
In the model catalog, filter the options by selecting "Serverless API" under the deployment options. This narrows the list to models that can be deployed with serverless infrastructure. Locate GPT-4o and GPT-4o Mini from the list of models displayed.

I will select GPT-4o and open its details page. From here, click "Deploy" and enter a deployment name, such as "gpt-4o-deployment." Choose "Global Standard" as the deployment type and confirm the setup. Repeat the process for GPT-4o Mini to deploy both models.

Once the deployments are complete, go to the "Models + Endpoints" section in the left-hand sidebar. Click on the deployed models to view their details. Copy the endpoint URL and API key for each model. These will be needed later for integration with your application.
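One convenient pattern, which the scripts later in this article follow, is to keep the endpoint and key in environment variables rather than hard-coding them. Here is a minimal sketch; the variable names are just a convention, not anything required by Azure, and the api_version matches the one used in the code below.

import os
from openai import AzureOpenAI

# Assumes you exported these beforehand, for example:
#   export AZURE_OPENAI_ENDPOINT="https://<your-resource-name>.openai.azure.com/"
#   export AZURE_OPENAI_API_KEY="<your-api-key>"
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-09-01-preview",
)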

As a final step, install the following Python libraries:
pip install openai==1.54.5 arxiv==2.1.3 PyMuPDF==1.24.9 weave==0.51.18 fitz==0.0.1.dev2 bert-score==0.3.13 rouge-score==0.1.2
At this point, both GPT-4o and GPT-4o Mini are successfully deployed on Azure and ready to be accessed through their API endpoints. Now we are ready to write some code!
Step 2: Generate a dataset to test summarization by multiple GPT models
We will benchmark the text summarization capabilities of the GPT-4o and GPT-4o mini models by testing their ability to accurately generate an abstract for a research paper when the original abstract is removed. This approach allows us to compare how effectively the models can produce concise, relevant summaries of key information from the main content of each paper.
This is an effective test of the model’s summarization abilities because it mirrors the task a human would face when summarizing complex information: distilling the core objectives, methods, and findings of a paper into a brief, coherent abstract. By withholding the abstract, we can assess whether the model can independently identify and convey the most essential aspects of the paper, demonstrating a human-like ability to process, evaluate, and summarize academic content in a structured and concise format. This setup allows us to evaluate not only the model’s accuracy in capturing information but also its skill in organizing it succinctly, as a human expert would.
To start, I'll share a basic script that will show how to run inference with the model:
import weave
import os
from openai import AzureOpenAI
import json

# Initialize Weave for logging
weave.init('azure-api')

# Initialize the AzureOpenAI client
client = AzureOpenAI(
    azure_endpoint="your endpoint url",
    api_key="your key",
    api_version="2024-09-01-preview"
)

@weave.op
def run_inference(prompt, client):
    """Function to perform inference using the provided client and prompt."""
    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}]
        )
        # Parse the response
        response_json = json.loads(response.model_dump_json(indent=2))
        choices = response_json.get("choices", [])
        if choices:
            content = choices[0].get("message", {}).get("content", "")
            print("Generated Content:")
            print(content)
            return content
        else:
            print("No content found in response")
            return None
    except Exception as e:
        print(f"Failed to get response: {e}")
        return None

# Define the prompt and perform inference
PROMPT = "What steps should I think about when writing my first Python API?"
run_inference(PROMPT, client)
Here, we also use Weave to track the inputs and outputs of our model. This demonstrates how to use the Traces component of Weave. Later on, I will demonstrate how to use Weave Evaluations, which is specifically designed for comparing models on the same dataset.
To create our benchmark dataset, we first collect research papers from arXiv, focusing on AI and machine learning topics. From each paper, we extract only the first page, where the abstract is typically located, and use GPT-4o to isolate this section and structure it as a JSON object. These extracted abstracts will serve as "gold standard" reference points, saved in a JSONL file format for easy loading and consistent evaluation.
This file format makes it easy to load and process the summaries when we evaluate the GPT-4o models, and ensures our reference data is properly versioned and easily shareable. Here's the code that will download the papers and extract the abstracts from the first page of each paper using GPT-4o:
import os
import arxiv
import fitz  # PyMuPDF
import json
from openai import AzureOpenAI
import weave
import re
from time import sleep

# Initialize Weave for logging
weave.init('azure_paper_abstract_gen')

# Set up Azure OpenAI
os.environ["AZURE_OPENAI_ENDPOINT"] = "your endpoint url"
os.environ["AZURE_OPENAI_API_KEY"] = "your api key"

client = AzureOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version="2024-09-01-preview"
)

# Directory to save downloaded papers
download_dir = "arxiv_papers"
os.makedirs(download_dir, exist_ok=True)

# Define AI-specific search queries
search_queries = [
    "Large Language Models for vision tasks AND cat:cs.AI",
    "Multimodal AI techniques AND cat:cs.CV",
    "Applications of Transformers in healthcare AI AND cat:cs.LG",
    "Few-shot learning in AI and ML AND cat:cs.LG",
    "Vision and language models integration AND cat:cs.CV",
    "Domain-specific fine-tuning for ML models AND cat:cs.LG",
    "Foundational models in AI and CV applications AND cat:cs.AI",
    "NLP in robotics and vision systems AND cat:cs.AI",
    "Bias and fairness in AI for CV AND cat:cs.CV",
    "Evaluation metrics for multimodal AI AND cat:cs.LG"
]

def download_papers(max_pages=15, max_attempts_per_query=20):
    """Download one suitable paper for each query, retrying if papers exceed page limit."""
    papers = []
    downloaded_titles = set()
    client = arxiv.Client()
    for query in search_queries:
        paper_found = False
        attempt = 0
        while not paper_found and attempt < max_attempts_per_query:
            search = arxiv.Search(
                query=query,
                max_results=100,
                sort_by=arxiv.SortCriterion.SubmittedDate
            )
            try:
                results = list(client.results(search))
                start_idx = attempt * 5
                end_idx = start_idx + 5
                current_batch = results[start_idx:end_idx]
                for result in current_batch:
                    if result.title not in downloaded_titles:
                        print(f"Downloading: {result.title}")
                        paper_id = result.entry_id.split('/')[-1]
                        pdf_filename = f"{paper_id}.pdf"
                        pdf_path = os.path.join(download_dir, pdf_filename)
                        result.download_pdf(dirpath=download_dir, filename=pdf_filename)
                        try:
                            with fitz.open(pdf_path) as pdf:
                                if pdf.page_count <= max_pages:
                                    papers.append({
                                        "title": result.title,
                                        "file_path": pdf_path,
                                        "arxiv_id": paper_id
                                    })
                                    downloaded_titles.add(result.title)
                                    print(f"Accepted: {result.title}")
                                    paper_found = True
                                    break
                                else:
                                    os.remove(pdf_path)
                                    print(f"Skipped (too many pages: {pdf.page_count}): {result.title}")
                        except Exception as e:
                            print(f"Error checking PDF {pdf_path}: {e}")
                            if os.path.exists(pdf_path):
                                os.remove(pdf_path)
                attempt += 1
                if not paper_found:
                    print(f"Attempt {attempt}/{max_attempts_per_query} for query: {query}")
                    sleep(3)
            except Exception as e:
                print(f"Error during download: {e}")
                sleep(3)
                attempt += 1
                continue
        if not paper_found:
            print(f"Failed to find suitable paper for query after {max_attempts_per_query} attempts: {query}")
    print(f"\nSuccessfully downloaded {len(papers)} papers")
    return papers

def extract_first_page_text(pdf_path):
    """Extract text from only the first page of the PDF."""
    with fitz.open(pdf_path) as pdf:
        if pdf.page_count > 0:
            page = pdf[0]
            return page.get_text()
    return ""

@weave.op
def extract_abstract_with_azure(text, title):
    """Extract abstract using Azure GPT-4o with JSON output."""
    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "You are a research paper analysis assistant. Extract ONLY the abstract section from the paper content provided. Return the result in JSON format with 'abstract' as the key. If you cannot find the abstract, return an empty string as the value."},
                {"role": "user", "content": f"Paper title: {title}\n\nPaper content:\n\n{text}"}
            ],
            response_format={"type": "json_object"}
        )
        content = response.choices[0].message.content
        return json.loads(content)
    except Exception as e:
        print(f"Error extracting abstract for {title}: {e}")
        sleep(3)
        return {"abstract": ""}

def count_words(text):
    """Count words excluding punctuation and special characters."""
    cleaned_text = re.sub(r'[^\w\s]', ' ', text.lower())
    words = [word for word in cleaned_text.split() if word.strip()]
    return len(words)

def main():
    # Download papers
    papers = download_papers()
    print(f"\nDownloaded {len(papers)} papers. Processing abstracts...\n")

    # Process papers and extract abstracts
    paper_data = []
    for paper in papers:
        title = paper["title"]
        pdf_path = paper["file_path"]
        print(f"Processing: {title}")
        first_page_text = extract_first_page_text(pdf_path)
        abstract_json = extract_abstract_with_azure(first_page_text, title)
        abstract_text = abstract_json.get('abstract', '')
        word_count = count_words(abstract_text)
        paper_data.append({
            "title": title,
            "file_path": pdf_path,
            "abstract": abstract_text,
            "word_count": word_count,
            "arxiv_id": paper["arxiv_id"]
        })
        sleep(2)

    # Save to JSONL file
    output_file = "paper_abstracts.jsonl"
    with open(output_file, "w") as f:
        for entry in paper_data:
            json.dump(entry, f)
            f.write("\n")
    print(f"\nProcessed {len(paper_data)} papers. Results saved to {output_file}")

if __name__ == "__main__":
    main()
This script establishes a dataset of reference summaries to evaluate the Azure GPT-4o models. We extracted the original abstracts directly from AI research papers sourced from arXiv, providing a consistent and authentic ground truth for our evaluation. These abstracts are saved in JSONL format, serving as benchmarks for comparison.
The process involves downloading research papers, extracting only the abstracts from each PDF, and organizing them in a structured format. By creating this structured dataset, we ensure a reliable and standardized basis for assessing the performance of the GPT-4o models. With this ground-truth dataset in place, we can now generate and compare summaries from the GPT-4o models against these original abstracts to comprehensively evaluate their summarization capabilities.
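For reference, each line of paper_abstracts.jsonl is a standalone JSON object with the fields written by the script above. A short sketch showing how one (made-up, shortened) record can be parsed:

import json

# Example of a single line from paper_abstracts.jsonl; the values here are illustrative only
line = '{"title": "Few-Shot Learning for Vision-Language Models", "file_path": "arxiv_papers/2411.01234v1.pdf", "abstract": "We study ...", "word_count": 182, "arxiv_id": "2411.01234v1"}'
record = json.loads(line)
print(record["title"], record["word_count"])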
Evaluating the models with Weave Evaluations
To evaluate the text summarization capabilities of GPT-4o models in Azure's AI Foundry, we will use a range of metrics to provide a comprehensive analysis. These metrics combine traditional text similarity, neural semantic similarity, and an LLM-based scoring system to effectively assess performance.
GPT-4o will act as an automated evaluator, rating generated abstracts on a scale from 1 to 5 based on how well they capture key elements like objectives, methodologies, and findings. For neural semantic similarity, we will use BERTScore to evaluate contextual alignment, while ROUGE scores will measure lexical and structural overlaps. Additional metrics, such as coverage and compression ratios, will assess information retention and summary conciseness.
I recommend setting the WEAVE_PARALLELISM environment variable to a low value before running the evaluation code, for example with the command export WEAVE_PARALLELISM=1. This limits how many evaluation calls Weave runs concurrently, which helps avoid hitting API rate limits. The resulting metrics will be visualized in Weave's evaluation dashboard, providing a multi-dimensional view of performance and highlighting areas for improvement. Below is the code we will use for the evaluation:
import weave
from weave import Model, Scorer
import json
import asyncio
import time
from time import sleep
from typing import Dict, Any
from rouge_score.rouge_scorer import RougeScorer
import bert_score
import fitz
from weave.trace.box import unbox
from openai import AzureOpenAI

# Initialize Weave
weave.init('azure_abstract_eval')

gpt4o_client = AzureOpenAI(
    azure_endpoint="https://<your-resource-name>.openai.azure.com/openai/deployments/<gpt-4o-deployment-name>/chat/completions?api-version=2024-08-01-preview",
    api_key="<your-api-key>",
    api_version="2024-08-01-preview"
)

gpt4o_mini_client = AzureOpenAI(
    azure_endpoint="https://<your-resource-name>.openai.azure.com/openai/deployments/<gpt-4o-mini-deployment-name>/chat/completions?api-version=2024-08-01-preview",
    api_key="<your-api-key>",
    api_version="2024-08-01-preview"
)

def create_prediction_prompt(text, title, target_length):
    """Create a prompt for abstract generation with target length."""
    return (
        f"You are tasked with generating an abstract for a research paper titled '{title}'. "
        f"The abstract should be approximately {target_length} words long.\n\n"
        f"Generate an abstract that summarizes the key points of the paper, including the "
        f"research objective, methodology, and main findings. The abstract should be "
        f"self-contained and clearly communicate the paper's contribution. Respond only with the ABSTRACT, and NOT the title\n\n"
        f"Paper content:\n\n{text}"
        f"Respond only with the ABSTRACT!"
    )

def create_evaluation_prompt(gt_abstract: str, generated_abstract: str) -> str:
    """Create standardized evaluation prompt."""
    return f'''You are evaluating how well a generated abstract captures the information from a ground truth abstract.

Ground Truth Abstract:
{gt_abstract}

Generated Abstract:
{generated_abstract}

Rate the generated abstract on a scale from 1-5, where:
1: Poor - Missing most key information or seriously misrepresenting the research
2: Fair - Captures some information but misses crucial elements
3: Good - Captures most key points but has some gaps or inaccuracies
4: Very Good - Accurately captures nearly all key information with minor omissions
5: Excellent - Perfectly captures all key information and maintains accuracy

Respond ONLY with a JSON object containing a single "score" field with an integer value from 1-5.
Example response format:
{{"score": 4}}'''

def extract_text_after_page_one(pdf_path: str) -> str:
    """Extract text from page 2 onwards."""
    if isinstance(pdf_path, weave.trace.box.BoxedStr):
        pdf_path = unbox(pdf_path)
    text = ""
    try:
        with fitz.open(pdf_path) as pdf:
            if pdf.page_count > 1:
                for page_num in range(1, pdf.page_count):  # Start from page 2
                    page = pdf[page_num]
                    text += page.get_text()
    except Exception as e:
        print(f"Error extracting text from PDF {pdf_path}: {e}")
    return text

def run_inference(client: AzureOpenAI, model: str, prompt: str, max_retries: int = 10, base_wait: int = 10) -> str:
    """Function to perform inference using specified Azure model with exponential backoff retry.

    Args:
        client: AzureOpenAI client
        model: Model name/id
        prompt: Input prompt
        max_retries: Maximum number of retry attempts (default: 10)
        base_wait: Initial wait time in seconds (default: 10)
    """
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": "You are a research paper abstract writer. Write clear, concise, and informative abstracts."},
                    {"role": "user", "content": prompt}
                ],
                temperature=0.0
            )
            response_json = json.loads(response.model_dump_json(indent=2))
            choices = response_json.get("choices", [])
            if choices:
                content = choices[0].get("message", {}).get("content", "")
                return content
            else:
                print("No content found in response")
        except Exception as e:
            wait_time = base_wait * (2 ** attempt)  # Exponential backoff
            print(f"Attempt {attempt + 1}/{max_retries} failed: {e}")
            print(f"Waiting {wait_time} seconds before retrying...")
            time.sleep(wait_time)
            continue
    print(f"Failed to get response after {max_retries} retries")
    return None

class GPT4o(Model):
    @weave.op
    def predict(self, title: str, pdf_path: str, word_count: int) -> dict:
        """Predict abstract using GPT-4o."""
        paper_text = extract_text_after_page_one(pdf_path)
        prompt = create_prediction_prompt(paper_text, title, word_count)
        prediction = run_inference(gpt4o_client, "gpt-4o", prompt)
        return {"model_output": prediction}

class GPT4oMini(Model):
    @weave.op
    def predict(self, title: str, pdf_path: str, word_count: int) -> dict:
        """Predict abstract using GPT-4o mini."""
        paper_text = extract_text_after_page_one(pdf_path)
        prompt = create_prediction_prompt(paper_text, title, word_count)
        prediction = run_inference(gpt4o_mini_client, "gpt-4o-mini", prompt)
        return {"model_output": prediction}

@weave.op
def bert_scorer(gt_abstract: str, model_output: dict) -> Dict[str, float]:
    """Calculate BERTScore for the abstract."""
    if not model_output or 'model_output' not in model_output:
        return {'bert_score': 0.0}
    try:
        P, R, F1 = bert_score.score(
            [model_output['model_output']],
            [gt_abstract],
            lang='en',
            model_type='microsoft/deberta-xlarge-mnli'
        )
        return {'bert_score': float(F1.mean())}
    except Exception as e:
        print(f"Error calculating BERTScore: {e}")
        return {'bert_score': 0.0}

# You can also use the "function" format for your scorer:
# @weave.op
# def GPT4oScorer(gt_abstract: str, model_output: dict) -> dict:
#     """Evaluate abstract using GPT-4o."""
#     if not model_output or 'model_output' not in model_output:
#         return {'gpt4o_score': 0.0}
#     try:
#         prompt = create_evaluation_prompt(gt_abstract, model_output["model_output"])
#         response = run_inference(gpt4o_client, "gpt-4o", prompt)
#         if response:
#             # Clean the response text
#             response_text = response.strip()
#             # Remove any additional text before or after the JSON
#             response_text = response_text.split('{')[1].split('}')[0]
#             response_text = '{' + response_text + '}'
#             try:
#                 result = json.loads(response_text)
#                 if 'score' in result and isinstance(result['score'], (int, float)):
#                     score = float(result['score'])
#                     if 1 <= score <= 5:
#                         return {'gpt4o_score': score}
#             except json.JSONDecodeError:
#                 print(f"Invalid JSON response: {response_text}")
#         print("Using default score due to invalid response")
#         return {'gpt4o_score': 0.0}
#     except Exception as e:
#         print(f"Error in GPT-4o evaluation: {e}")
#         return {'gpt4o_score': 0.0}

class GPT4oScorer(Scorer):
    model_id: str = "gpt-4o"
    system_prompt: str = "You are evaluating how well a generated abstract captures the information from a ground truth abstract."

    @weave.op
    def create_evaluation_prompt(self, gt_abstract: str, generated_abstract: str) -> str:
        """Create standardized evaluation prompt."""
        return f'''Ground Truth Abstract:
{gt_abstract}

Generated Abstract:
{generated_abstract}

Rate the generated abstract on a scale from 1-5, where:
1: Poor - Missing most key information or seriously misrepresenting the research
2: Fair - Captures some information but misses crucial elements
3: Good - Captures most key points but has some gaps or inaccuracies
4: Very Good - Accurately captures nearly all key information with minor omissions
5: Excellent - Perfectly captures all key information and maintains accuracy

Respond ONLY with a JSON object containing a single "score" field with an integer value from 1-5.
Example response format:
{{"score": 4}}'''

    @weave.op
    def call_llm(self, gt_abstract: str, model_output: str) -> dict:
        """Call GPT-4o for evaluation."""
        try:
            prompt = self.create_evaluation_prompt(gt_abstract, model_output)
            response = run_inference(gpt4o_client, self.model_id, prompt)
            if response:
                # Clean the response text
                response_text = response.strip()
                # Remove any additional text before or after the JSON
                response_text = response_text.split('{')[1].split('}')[0]
                response_text = '{' + response_text + '}'
                try:
                    result = json.loads(response_text)
                    if 'score' in result and isinstance(result['score'], (int, float)):
                        score = float(result['score'])
                        if 1 <= score <= 5:
                            return {'gpt4o_score': score}
                except json.JSONDecodeError:
                    print(f"Invalid JSON response: {response_text}")
            print("Using default score due to invalid response")
            return {'gpt4o_score': 0.0}
        except Exception as e:
            print(f"Error in GPT-4o evaluation: {e}")
            return {'gpt4o_score': 0.0}

    @weave.op
    def score(self, model_output: dict, gt_abstract: str) -> dict:
        """Score the generated abstract against the ground truth.

        Args:
            model_output: Dictionary containing the generated abstract under 'model_output' key
            gt_abstract: The ground truth abstract to compare against
        """
        if not model_output or 'model_output' not in model_output:
            return {'gpt4o_score': 0.0}
        return self.call_llm(gt_abstract, model_output['model_output'])

@weave.op
def rouge_scorer(gt_abstract: str, model_output: dict) -> Dict[str, float]:
    """Calculate ROUGE scores for the abstract."""
    if not model_output or 'model_output' not in model_output:
        return {'rouge1_f': 0.0, 'rouge2_f': 0.0, 'rougeL_f': 0.0}
    try:
        scorer = RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
        scores = scorer.score(gt_abstract, model_output['model_output'])
        return {
            'rouge1_f': float(scores['rouge1'].fmeasure),
            'rouge2_f': float(scores['rouge2'].fmeasure),
            'rougeL_f': float(scores['rougeL'].fmeasure)
        }
    except Exception as e:
        print(f"Error calculating ROUGE scores: {e}")
        return {'rouge1_f': 0.0, 'rouge2_f': 0.0, 'rougeL_f': 0.0}

@weave.op
def compression_scorer(gt_abstract: str, model_output: dict) -> Dict[str, float]:
    """Calculate compression ratio of the abstract."""
    if not model_output or 'model_output' not in model_output:
        return {'compression_ratio': 0.0}
    try:
        gt_words = len(gt_abstract.split())
        generated_words = len(model_output['model_output'].split())
        compression_ratio = min(gt_words, generated_words) / max(gt_words, generated_words)
        return {'compression_ratio': float(compression_ratio)}
    except Exception as e:
        print(f"Error calculating compression ratio: {e}")
        return {'compression_ratio': 0.0}

@weave.op
def coverage_scorer(gt_abstract: str, model_output: dict) -> Dict[str, float]:
    """Calculate content coverage using word overlap."""
    if not model_output or 'model_output' not in model_output:
        return {'coverage_score': 0.0}
    try:
        gt_words = set(gt_abstract.lower().split())
        generated_words = set(model_output['model_output'].lower().split())
        intersection = len(gt_words.intersection(generated_words))
        union = len(gt_words.union(generated_words))
        coverage_score = intersection / union if union > 0 else 0.0
        return {'coverage_score': float(coverage_score)}
    except Exception as e:
        print(f"Error calculating coverage score: {e}")
        return {'coverage_score': 0.0}

def create_evaluation_dataset(gt_file: str):
    """Create dataset from ground truth file."""
    dataset = []
    with open(gt_file, 'r') as f:
        for line in f:
            entry = json.loads(line)
            dataset.append({
                "title": entry["title"],
                "gt_abstract": entry["abstract"],
                "pdf_path": entry["file_path"],
                "word_count": entry["word_count"]
            })
    return dataset

async def run_evaluations(gt_file: str):
    """Run evaluations for each model."""
    eval_dataset = create_evaluation_dataset(gt_file)

    # Initialize models
    models = {
        "gpt4o": GPT4o(),
        "gpt4o_mini": GPT4oMini()
    }
    gpt_scorer = GPT4oScorer()

    # Setup scorers
    scorers = [
        gpt_scorer,  # class-based scorer
        rouge_scorer,
        compression_scorer,
        coverage_scorer,
        bert_scorer
    ]

    # Run evaluations
    results = {}
    for model_name, model in models.items():
        print(f"\nEvaluating {model_name}...")
        evaluation = weave.Evaluation(
            dataset=eval_dataset,
            scorers=scorers,
            name=model_name + " Eval"
        )
        results[model_name] = await evaluation.evaluate(model)

    # Print results
    print("\nEvaluation Results:")
    for model_name, result in results.items():
        print(f"\n{model_name} Results:")
        print(json.dumps(result, indent=2))

    # Save results to file
    output_file = "gpt4o_evaluation_results.json"
    with open(output_file, 'w') as f:
        json.dump(results, f, indent=2)
    print(f"\nResults saved to {output_file}")
    return results

if __name__ == "__main__":
    gt_file = "paper_abstracts.jsonl"
    asyncio.run(run_evaluations(gt_file))
Our evaluation leverages Weave’s robust framework to systematically compare summaries generated by Azure’s GPT-4o models against ground truth abstracts. Because every model output and score is logged as a trace, results can be revisited and compared after the run, which simplifies debugging, facilitates iterative refinement of metrics, and ensures consistency in evaluating generated summaries.
To differentiate model performance, we define two distinct classes - GPT4o and GPT4oMini - which are tailored to their respective configurations and displayed as separate entities within Weave’s evaluation dashboard. This ensures clarity and precision when analyzing their outputs side by side.
Comprehensive evaluation metrics
The evaluation incorporates a suite of metrics to provide a holistic view of model performance:
- Coverage: Measures how well key content from the original abstracts is preserved in the generated summaries.
- Compression Ratio: Evaluates the trade-off between summary conciseness and detail retention.
- BERTScore: Assesses semantic similarity by comparing contextual embeddings of the texts.
- LLM Judge: A custom scoring method using GPT-4o to rate summaries on alignment, accuracy, and completeness.
By combining lexical, semantic, and structural evaluations, this framework offers an in-depth assessment of each model’s strengths and weaknesses.
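The compression-ratio and coverage scorers used in the evaluation script above boil down to two short calculations, extracted here for clarity (the standalone function names are just for illustration):

def compression_ratio(reference: str, generated: str) -> float:
    """Ratio of the shorter word count to the longer; 1.0 means the lengths match."""
    ref_len, gen_len = len(reference.split()), len(generated.split())
    return min(ref_len, gen_len) / max(ref_len, gen_len)

def coverage(reference: str, generated: str) -> float:
    """Jaccard overlap of the lowercased word sets of the two texts."""
    ref_words, gen_words = set(reference.lower().split()), set(generated.lower().split())
    union = ref_words | gen_words
    return len(ref_words & gen_words) / len(union) if union else 0.0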
Custom scoring integration
To enhance the evaluation, a custom LLM judge scorer was developed, allowing GPT-4o to rate its own outputs based on their alignment with ground truth abstracts. Integrated seamlessly into Weave’s pipeline, this scoring method enables automated logging and comparison across datasets, making it indispensable for iterative model analysis. While custom classes provide flexibility for complex workflows, basic scoring functions can also be used for simpler scenarios, offering adaptability to diverse needs.
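For simpler scenarios, a scorer can be a plain function decorated with @weave.op and passed to weave.Evaluation alongside the class-based GPT4oScorer. Here is a minimal sketch; the brevity check is purely illustrative and not part of the evaluation above.

import weave

@weave.op
def brevity_scorer(gt_abstract: str, model_output: dict) -> dict:
    """Toy function-style scorer: flags generated abstracts that run much longer than the reference."""
    if not model_output or "model_output" not in model_output:
        return {"within_length_budget": 0.0}
    over_budget = len(model_output["model_output"].split()) > 1.2 * len(gt_abstract.split())
    return {"within_length_budget": 0.0 if over_budget else 1.0}

Function scorers like this appear in the Weave dashboard next to class-based scorers, so both styles can be mixed freely within a single evaluation.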
Intuitive visualization with Weave
Weave’s interactive dashboard simplifies analysis, offering detailed visualizations of evaluation results. Teams can explore model outputs, performance metrics, and comparative insights in an intuitive interface. This enables clear identification of areas for improvement and practical decision-making.
This combination of structured evaluations, tailored scoring methods, and insightful visualizations demonstrates how Weave empowers organizations to analyze and refine GPT models effectively. By bridging quantitative rigor with usability, Weave provides a critical framework for optimizing model performance across use cases.


Here, we observe that the GPT-4o model slightly outperforms the GPT-4o Mini model in several key metrics, including the LLM judge score, ROUGE-L, ROUGE-1, compression ratio, and BERTScore. GPT-4o demonstrates better alignment with ground truth abstracts, capturing more structural and phrase-level similarities and achieving a stronger balance between conciseness and information retention. These results highlight its effectiveness in producing semantically accurate and well-structured outputs.
On the other hand, GPT-4o Mini offers advantages in efficiency and detail preservation. It maintains a higher coverage score, indicating better preservation of input details, and is significantly faster, with much lower inference latency, making it more suitable for applications where speed and resource efficiency are critical.
Weave Evaluations has proven itself to be invaluable for uncovering these insights, showcasing where the models excel or fall short and highlighting how these differences can impact practical, real-world decision-making. By enabling detailed, side-by-side comparisons across both metrics and outputs, Weave provides a practical framework for organizations to identify models that best align with their performance and efficiency goals.
This type of analysis is essential for understanding the trade-offs between models, allowing organizations to optimize their AI investments. With Weave’s ability to provide both quantitative and qualitative comparisons, ML engineers can look beyond metrics and focus on operational alignment to select the most suitable model for their specific use case. Here's a screenshot of the comparisons view!
Why choose Azure AI?
Azure AI offers a robust platform for working with large language models, providing access to a diverse array of models designed for various tasks. This variety allows you to select models that best align with your unique requirements. Seamless integration with Microsoft Azure's ecosystem enables straightforward deployment and data management, enhancing productivity and operational efficiency.
Azure AI's scalability ensures that model deployments can be easily adjusted to meet fluctuating demands, delivering optimal performance across diverse workloads. The platform prioritizes security and compliance, implementing strong data protection measures that adhere to Microsoft's rigorous standards, ensuring the safety of sensitive information. Supported by Azure's high-performance cloud infrastructure, the platform guarantees reliable and efficient model operations, delivering consistent, high-quality results for a wide range of applications.
Performance evaluation with Azure AI and W&B Weave
By leveraging Azure AI and W&B Weave, we assessed the performance of GPT-4o and GPT-4o Mini across critical metrics, including an LLM-based judge score. The results reveal that GPT-4o excels in producing accurate and concise summaries, outperforming GPT-4o Mini on metrics such as ROUGE-L, ROUGE-1, and compression ratio, as well as the LLM judge score.
However, GPT-4o Mini showcases its strengths in efficiency, offering faster inference times and achieving higher coverage scores, which highlight its ability to preserve critical input details. This makes it a highly effective option for speed-sensitive or resource-constrained applications.
Through the combined power of Azure AI’s scalable infrastructure and Weave’s advanced evaluation tools, these insights empower users to select models that align with their priorities, whether the focus is on speed, accuracy, or overall efficiency.