
LLM Evaluation on Google Vertex AI

This guide explores large language model (LLM) evaluation with Google Vertex AI and W&B Weave, focusing on comparing different Gemini models for text summarization!
In this article, we will build a robust framework for LLM evaluation that lets you rigorously assess and compare language model performance on Google Vertex AI. By building an end-to-end pipeline, this tutorial demonstrates how to leverage Vertex AI’s scalable infrastructure alongside W&B Weave’s evaluation toolset.
While we'll illustrate the process with a text summarization use case, the principles and methodologies discussed here are applicable to a wide range of LLM evaluation tasks. You'll learn how to set up your Vertex AI environment, deploy different Gemini models, and define detailed evaluation metrics - establishing a reproducible workflow.
By the end of this tutorial, you'll be equipped to not only quantify model performance with metrics like ROUGE and BERTScore but also to qualitatively analyze outputs, ensuring you select the best model for your needs.



Foundation models available on Google Vertex AI

Google Vertex AI offers a variety of models suited for summarization, our primary use case for LLM evaluation in this example, as well as other AI-driven tasks. Among them, the Gemini series is particularly optimized for summarization.
  • Gemini-1.5-Flash provides fast processing with high-quality summarization, making it well-suited for handling large volumes of text efficiently. As of February 15, 2025, its pricing is $0.075 per 1 million input tokens and $0.30 per 1 million output tokens for prompts up to 128,000 tokens, with higher rates for longer inputs.
  • Gemini-2.0-Flash builds upon these capabilities with even faster response times and enhanced multimodal features. It is priced at $0.10 per 1 million input tokens and $0.40 per 1 million output tokens, making it ideal for applications requiring rapid, high-quality responses at scale.
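As a rough illustration of what these rates mean in practice, the snippet below estimates the cost of a summarization workload at the prices quoted above. The token counts are made-up assumptions for illustration only, and prices may change over time.
# Rough cost estimate at the rates quoted above (prompts up to 128K tokens).
PRICES_PER_1M = {
    "gemini-1.5-flash": {"input": 0.075, "output": 0.30},
    "gemini-2.0-flash": {"input": 0.10, "output": 0.40},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICES_PER_1M[model]
    return (input_tokens / 1e6) * p["input"] + (output_tokens / 1e6) * p["output"]

# Hypothetical job: 1,000 papers at ~10K input tokens and ~300 output tokens each
print(f"gemini-1.5-flash: ${estimate_cost('gemini-1.5-flash', 10_000 * 1_000, 300 * 1_000):.2f}")
print(f"gemini-2.0-flash: ${estimate_cost('gemini-2.0-flash', 10_000 * 1_000, 300 * 1_000):.2f}")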
Beyond the Gemini models, Vertex AI provides access to models from other providers:
  • Meta’s Llama series: A versatile alternative for text summarization, offering unique performance characteristics compared to Google’s native models.
  • Anthropic’s Claude models, such as Claude 3.5 Haiku, which excel in generating efficient, conversational outputs - ideal for interactive summarization tasks where quick response times are essential.
  • Text embedding models (e.g., E5 Text Embedding): These models convert text into vector representations, enabling tasks like semantic search, classification, and clustering.
In addition to text-based models, Vertex AI supports other modalities. For example, Stable Diffusion provides high-quality text-to-image generation for creating or modifying images from textual prompts.
With this diverse selection, Vertex AI allows users to choose the most suitable models for text processing, image generation, or embedding-based applications, offering flexibility across a wide range of AI-driven tasks.

Evaluating LLM summarization on Vertex AI using W&B Weave

When evaluating multiple language models for tasks like text summarization, robust tools are essential for effectively assessing and comparing their performance. W&B Weave offers a comprehensive LLM evaluation framework that simplifies this process by enabling you to define evaluation criteria, automatically collect results, and visualize model performance across various dimensions.
Weave supports tracing, allowing you to monitor LLM inputs, outputs, and model behavior throughout the workflow using the @weave.op decorator. By adding @weave.op above any function, you can automatically log and track inputs and outputs, creating a detailed execution trace. This capability is highly useful for debugging, as it captures each stage of the data flow, making it easy to see precisely how inputs transform into outputs. The trace data is logged and visualized within Weave, offering insights into model behavior and highlighting areas that may need tuning.
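For example, a minimal sketch of tracing with @weave.op might look like the following; the project name and function here are purely illustrative and not part of this tutorial's pipeline:
import weave

weave.init("vertex-llm-eval-demo")  # illustrative project name

@weave.op
def word_count(text: str) -> int:
    # Inputs and outputs of every call to this function are logged as a trace in Weave
    return len(text.split())

word_count("Weave traces this call automatically.")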
In addition to tracing, Weave Evaluations streamline the comparison of language model outputs by providing a structured approach to define evaluation metrics and gather results without requiring manual setup or custom evaluation loops.
The evaluation process in Weave involves several key components (a minimal end-to-end sketch follows this list):
  • Models: You can define models by subclassing the Model class and implementing a predict function that processes input examples and returns outputs. This setup allows for versioning and tracking of model attributes such as prompts and temperatures.
  • Datasets: A collection of examples, often representing failure cases or specific scenarios, is organized into datasets. These examples serve as test cases to evaluate model performance systematically.
  • Scorers: Evaluation metrics are defined using scorers, which can be simple Python functions decorated with @weave.op or more complex classes inheriting from weave.Scorer. Scorers analyze model outputs and return dictionaries containing evaluation metrics, facilitating the assessment of various aspects of model performance.
  • Evaluations: By combining models, datasets, and scorers, you can create Evaluation objects that manage the evaluation process. The evaluate method runs the model's predict function on each example in the dataset and applies the defined scorers to assess the outputs.
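To make these pieces concrete, here is a minimal, self-contained sketch of a Weave evaluation with a toy model and scorer. The names and data are illustrative assumptions; the real models, dataset, and scorers for this tutorial appear later in the article.
import asyncio
import weave
from weave import Model

weave.init("weave-eval-sketch")  # illustrative project name

class EchoModel(Model):
    @weave.op
    def predict(self, text: str) -> dict:
        # A toy "summary": just the first 50 characters of the input
        return {"model_output": text[:50]}

@weave.op
def length_scorer(text: str, model_output: dict) -> dict:
    # Scorers receive dataset columns by name plus the model output
    return {"length_ratio": len(model_output["model_output"]) / max(len(text), 1)}

dataset = [{"text": "Weave evaluations combine models, datasets, and scorers into one workflow."}]
evaluation = weave.Evaluation(dataset=dataset, scorers=[length_scorer])
print(asyncio.run(evaluation.evaluate(EchoModel())))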
This structured approach allows for consistent and repeatable evaluations, making it easier to compare different models or model versions. Weave's visualization capabilities further enhance this process by providing interactive dashboards that display evaluation results, enabling you to drill down into specific examples, analyze performance metrics, and identify areas for improvement.
By integrating Weave Evaluations into your workflow, you can build rigorous, apples-to-apples evaluations for language model use cases, organize information generated across the LLM workflow, and confidently iterate on your applications.

Evaluating Gemini models on text summarization

Next, we'll use Google Vertex AI and W&B Weave to perform LLM evaluation by comparing the performance of two Gemini models—Gemini-1.5-Flash and Gemini-2.0-Flash—on text summarization tasks.
We'll start by generating a ground truth dataset to serve as a benchmark, giving us a reliable baseline for assessing each model's strengths and limitations. Then, using Weave’s metrics and visualization tools, we’ll analyze model outputs to determine which model best suits our summarization task.
First, we'll cover the setup of Vertex AI, starting with creating a Google Cloud project, enabling the necessary APIs, and configuring the Google Cloud CLI. This foundation will ensure that you have the tools and permissions required to fully utilize Vertex AI's features. I'll cover the main steps for setting up your Google Cloud project and development environment below:

Step 1: Create a Google Cloud project

Begin by creating a new project in the Google Cloud console. Navigate to the project selector page and either select an existing project or create a new one. Ensure that billing is enabled for your project, as this is required for using Vertex AI services. If you haven't yet created a project, simply search 'create project' in the Google Cloud search bar and click the first result, which will guide you through creating one.
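If you prefer the command line, you can also create and select a project with the gcloud CLI (the project ID below is a placeholder; Step 3 covers installing the CLI):
gcloud projects create my-llm-eval-project
gcloud config set project my-llm-eval-project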



Step 2: Enable the Vertex AI API

Next, enable the Vertex AI API for your project. In the Google Cloud console, enter “Vertex AI” in the search bar. Select Vertex AI from the results, which will bring you to the Vertex AI dashboard. Click on “Enable All Recommended APIs” to activate the necessary APIs for Vertex AI. (This process may take a few moments to complete.)
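Alternatively, once the gcloud CLI is configured (see Step 3), the Vertex AI API can be enabled from the terminal:
gcloud services enable aiplatform.googleapis.com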



Step 3: Set up the Google Cloud CLI

To interact with Google Cloud services from your local development environment, you need to install the Google Cloud CLI. Download and install the CLI from the Google Cloud documentation. Once installed, initialize the CLI by running gcloud init in your terminal. This command will guide you through selecting your project and configuring your settings.
You can update the CLI components to ensure you have the latest tools and features by running:
gcloud components update
gcloud components install beta
gcloud auth login

Step 4: Configure IAM Roles

Your Google Cloud administrator must ensure the appropriate IAM roles are assigned. Depending on your specific needs and intended use of Vertex AI, these roles include:
  • Vertex AI User or Vertex AI Administrator, and
  • Service Account User
For this tutorial, I recommend the Vertex AI Administrator and Service Account User roles.
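If you prefer to grant these roles from the command line instead of the console, something like the following should work (the project ID and email address are placeholders you would replace with your own):
gcloud projects add-iam-policy-binding my-llm-eval-project \
    --member="user:you@example.com" \
    --role="roles/aiplatform.admin"
gcloud projects add-iam-policy-binding my-llm-eval-project \
    --member="user:you@example.com" \
    --role="roles/iam.serviceAccountUser"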
To assign these roles in the console, simply search "IAM" in the Google Cloud search bar and open the IAM page for your project.

You will then select the edit button next to your user account, which looks like the following:

And assign the appropriate roles:


To support this summarization evaluation workflow using Google Vertex AI and W&B Weave, you'll need to install several key Python packages. Here’s a command that will install the main libraries to set up your environment:
pip install google-cloud google-cloud-aiplatform openai wandb weave arxiv pymupdf rouge-score
After this initial setup, you can view the "Model Garden" in the Google Cloud console, which lists all of the models available on Vertex AI. We are almost ready to use some of these models for evaluation, but first we need to create a dataset for comparing their performance.

Generating ground-truth summaries

We will benchmark the Gemini models' summarization capabilities by testing their ability to accurately generate an abstract for a research paper when the original abstract is removed. This approach allows us to directly evaluate how well Gemini models can produce concise, relevant summaries of key information from the main content of each paper.
This is an effective test of the model’s summarization abilities because it mirrors the task a human would face when summarizing complex information: distilling the core objectives, methods, and findings of a paper into a brief, coherent abstract. By withholding the abstract, we can assess whether the model can independently identify and convey the most essential aspects of the paper, demonstrating a human-like capacity to process, evaluate, and summarize academic content in a structured and concise format. This setup allows us to evaluate not only the model’s accuracy in capturing information but also its skill in organizing it succinctly, as a human expert would.
To create our benchmark dataset, we first collect research papers from arXiv, focusing on AI and machine learning topics. From each paper, we extract only the first page, where the abstract is typically located, and use Gemini-1.5 to isolate this section and structure it as a JSON object. These extracted abstracts will serve as "gold standard" reference points, saved in a JSONL file format for easy loading and consistent evaluation.
This file format makes it easy to load and process the summaries when we evaluate the Gemini models, and ensures our reference data is properly versioned and easily shareable.
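For reference, each line of the resulting paper_abstracts.jsonl will look roughly like the following. The values are illustrative; the field names match what the script below writes.
{"title": "Example Paper Title", "file_path": "arxiv_papers/2501.01234v1.pdf", "abstract": "We propose ...", "word_count": 182, "arxiv_id": "2501.01234v1"}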
Here's the code that will download the papers, and extract the abstracts from the first page of each paper, using Gemini 1.5 Pro:
import os
import arxiv
import fitz  # PyMuPDF
import json
from vertexai.generative_models import GenerativeModel, GenerationConfig
import vertexai
import weave; weave.init('paper_abstract_gen')
import re
from time import sleep


# Set up Vertex AI
PROJECT_ID = "dsports-6ab79"
LOCATION = "us-central1"
vertexai.init(project=PROJECT_ID, location=LOCATION)


# Directory to save downloaded papers
download_dir = "arxiv_papers"
os.makedirs(download_dir, exist_ok=True)


# Define AI-specific search queries
search_queries = [
    "Large Language Models for vision tasks AND cat:cs.AI",
    "Multimodal AI techniques AND cat:cs.CV",
    "Applications of Transformers in healthcare AI AND cat:cs.LG",
    "Few-shot learning in AI and ML AND cat:cs.LG",
    "Vision and language models integration AND cat:cs.CV",
    "Domain-specific fine-tuning for ML models AND cat:cs.LG",
    "Foundational models in AI and CV applications AND cat:cs.AI",
    "NLP in robotics and vision systems AND cat:cs.AI",
    "Bias and fairness in AI for CV AND cat:cs.CV",
    "Evaluation metrics for multimodal AI AND cat:cs.LG"
]


def download_papers(max_pages=15, max_attempts_per_query=20):
    """Download papers or use existing ones from the download directory."""
    papers = []
    downloaded_titles = set()

    # First check for existing papers
    if os.path.exists(download_dir):
        existing_pdfs = [f for f in os.listdir(download_dir) if f.endswith('.pdf')]
        if existing_pdfs:
            print(f"Found {len(existing_pdfs)} existing papers, checking validity...")
            for pdf_file in existing_pdfs:
                pdf_path = os.path.join(download_dir, pdf_file)
                try:
                    with fitz.open(pdf_path) as pdf:
                        if pdf.page_count <= max_pages:
                            arxiv_id = pdf_file.replace('.pdf', '')
                            # Try to get a clean title from the first page
                            title = pdf[0].get_text().split('\n')[0].strip()
                            papers.append({
                                "title": title,
                                "file_path": pdf_path,
                                "arxiv_id": arxiv_id
                            })
                            downloaded_titles.add(title)
                            print(f"Using existing paper: {title}")
                except Exception as e:
                    print(f"Error checking existing PDF {pdf_path}: {e}")
                    if os.path.exists(pdf_path):
                        os.remove(pdf_path)

    # If we have enough papers (one per query), return early
    if len(papers) >= len(search_queries):
        print(f"\nUsing {len(papers)} existing papers")
        return papers[:len(search_queries)]  # Return only what we need

    # Otherwise, download remaining papers
    print(f"\nNeed {len(search_queries) - len(papers)} more papers, downloading...")
    client = arxiv.Client()
    for query in search_queries[len(papers):]:  # Only process remaining queries
        paper_found = False
        attempt = 0
        while not paper_found and attempt < max_attempts_per_query:
            search = arxiv.Search(
                query=query,
                max_results=100,
                sort_by=arxiv.SortCriterion.SubmittedDate
            )
            try:
                results = list(client.results(search))
                start_idx = attempt * 5
                end_idx = start_idx + 5
                current_batch = results[start_idx:end_idx]
                for result in current_batch:
                    if result.title not in downloaded_titles:
                        print(f"Downloading: {result.title}")
                        paper_id = result.entry_id.split('/')[-1]
                        pdf_filename = f"{paper_id}.pdf"
                        pdf_path = os.path.join(download_dir, pdf_filename)
                        result.download_pdf(dirpath=download_dir, filename=pdf_filename)
                        try:
                            with fitz.open(pdf_path) as pdf:
                                if pdf.page_count <= max_pages:
                                    papers.append({
                                        "title": result.title,
                                        "file_path": pdf_path,
                                        "arxiv_id": paper_id
                                    })
                                    downloaded_titles.add(result.title)
                                    print(f"Accepted: {result.title}")
                                    paper_found = True
                                    break
                                else:
                                    os.remove(pdf_path)
                                    print(f"Skipped (too many pages: {pdf.page_count}): {result.title}")
                        except Exception as e:
                            print(f"Error checking PDF {pdf_path}: {e}")
                            if os.path.exists(pdf_path):
                                os.remove(pdf_path)
                attempt += 1
                if not paper_found:
                    print(f"Attempt {attempt}/{max_attempts_per_query} for query: {query}")
                    sleep(3)
            except Exception as e:
                print(f"Error during download: {e}")
                sleep(3)
                attempt += 1
                continue
        if not paper_found:
            print(f"Failed to find suitable paper for query after {max_attempts_per_query} attempts: {query}")

    print(f"\nTotal papers available: {len(papers)} ({len(papers) - len(search_queries)} existing, {len(search_queries) - (len(papers) - len(search_queries))} new)")
    return papers


def extract_first_page_text(pdf_path):
    """Extract text from only the first page of the PDF."""
    with fitz.open(pdf_path) as pdf:
        if pdf.page_count > 0:
            page = pdf[0]
            return page.get_text()
    return ""


def extract_abstract_with_gemini(text, title, max_retries=3):
    """Extract abstract using Gemini with simple retry logic."""
    model = GenerativeModel("gemini-1.5-pro-002")
    prompt = (
        f"From the following first page of the research paper titled '{title}', "
        f"extract ONLY the abstract section. Return the result in JSON format with 'abstract' as the key. "
        f"If you cannot find the abstract, return an empty string as the value.\n\n"
        f"Paper content:\n\n{text}"
    )
    for attempt in range(max_retries):
        try:
            response = model.generate_content(
                prompt,
                generation_config=GenerationConfig(
                    temperature=0,
                    response_mime_type="application/json"
                )
            )
            return json.loads(response.text)
        except Exception as e:
            if attempt == max_retries - 1:  # Last attempt
                print(f"Error extracting abstract for {title}: {e}")
                return {"abstract": ""}
            # Simple exponential backoff: 5s, 10s, 20s
            sleep_time = 5 * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed, retrying in {sleep_time}s...")
            sleep(sleep_time)
    return {"abstract": ""}


def count_words(text):
    """Count words excluding punctuation and special characters."""
    cleaned_text = re.sub(r'[^\w\s]', ' ', text.lower())
    words = [word for word in cleaned_text.split() if word.strip()]
    return len(words)


def main():
    # Download papers
    papers = download_papers()
    print(f"\nDownloaded {len(papers)} papers. Processing abstracts...\n")

    # Process papers and extract abstracts
    paper_data = []
    for paper in papers:
        title = paper["title"]
        pdf_path = paper["file_path"]
        print(f"Processing: {title}")
        first_page_text = extract_first_page_text(pdf_path)
        abstract_json = extract_abstract_with_gemini(first_page_text, title)
        abstract_text = abstract_json.get('abstract', '')
        word_count = count_words(abstract_text)
        paper_data.append({
            "title": title,
            "file_path": pdf_path,
            "abstract": abstract_text,
            "word_count": word_count,
            "arxiv_id": paper["arxiv_id"]
        })
        sleep(2)

    # Save to JSONL file
    output_file = "paper_abstracts.jsonl"
    with open(output_file, "w") as f:
        for entry in paper_data:
            json.dump(entry, f)
            f.write("\n")

    print(f"\nProcessed {len(paper_data)} papers. Results saved to {output_file}")


if __name__ == "__main__":
    main()

This script establishes a dataset of reference summaries to evaluate the Gemini models. We extracted the original abstracts directly from AI research papers sourced from arXiv, providing a consistent and authentic ground truth for our evaluation. These abstracts are saved in JSONL format, serving as benchmarks for comparison. The process involves downloading research papers, extracting only the abstracts from each PDF, and organizing them in a structured format. With this ground-truth dataset in place, we can now generate and compare summaries from the Gemini models against these original abstracts to assess their performance.

Generation of summaries using Gemini models

We will compare the performance and cost of two Gemini models - Gemini-1.5-Flash and Gemini-2.0-Flash - on text summarization tasks using Google Vertex AI and W&B Weave. Gemini-2.0-Flash is the newer, slightly more expensive version of the Gemini Flash model, offering potential improvements in efficiency and multimodal capabilities.
To maintain consistency, we use a structured prompt that instructs both models to generate concise abstracts summarizing the key content of each research paper. In our setup, we start text extraction from the second page onward, omitting the original abstract to prompt the models to create their own summaries. This approach ensures that the generated abstracts are self-contained and provide a clear summary of the paper’s contributions.
For consistency in evaluation, both models are set to a temperature of 0.0, ensuring deterministic outputs for a direct performance comparison. Each model’s generated abstracts are saved in JSONL files - gemini_1_5_flash_abstract_predictions.jsonl and gemini_2_flash_abstract_predictions.jsonl - following the same format as the reference abstracts. This structured output makes it easier to compare the models' performance side by side.
We’ve also included safeguards for API rate limits, with a 2-second delay between requests to avoid throttling. By processing the same set of papers through both models, we build a dataset that enables a structured evaluation of their performance in abstract generation. This approach allows us to identify which model better captures the core content and the types of content where each model excels.
import json
import fitz  # PyMuPDF
import time
from vertexai.generative_models import GenerativeModel, GenerationConfig
import vertexai
import weave; weave.init('vertex_abstract_prediction')


# Configuration
PROJECT_ID = "dsports-6ab79"
LOCATION = "us-central1"
vertexai.init(project=PROJECT_ID, location=LOCATION)


# Model configurations
MODELS = {
    "gemini_2_flash": {
        "name": "gemini-2.0-flash-001",
        "type": "vertex",
        "delay": 2,
        "temperature": 0.0
    },
    "gemini_1_5_flash": {
        "name": "gemini-1.5-flash-002",
        "type": "vertex",
        "delay": 2,
        "temperature": 0.0
    }
}


def load_paper_data():
    """Load papers with their abstracts and word counts."""
    with open("paper_abstracts.jsonl", "r") as f:
        return [json.loads(line) for line in f]


def extract_text_after_page_one(pdf_path):
    """Extract text from page 2 onwards."""
    text = ""
    try:
        with fitz.open(pdf_path) as pdf:
            if pdf.page_count > 1:
                for page_num in range(1, pdf.page_count):  # Start from page 2
                    page = pdf[page_num]
                    text += page.get_text()
    except Exception as e:
        print(f"Error extracting text from PDF {pdf_path}: {e}")
    return text


def create_abstract_prompt(text, title, target_length):
    """Create a prompt for abstract generation with target length."""
    return (
        f"You are tasked with generating an abstract for a research paper titled '{title}'. "
        f"The abstract should be approximately {target_length} words long.\n\n"
        f"Generate an abstract that summarizes the key points of the paper, including the "
        f"research objective, methodology, and main findings. The abstract should be "
        f"self-contained and clearly communicate the paper's contribution. Respond only with the ABSTRACT, and NOT the title\n\n"
        f"Paper content:\n\n{text}"
        f"Respond only with the ABSTRACT!"
    )


@weave.op
def predict_abstract_with_gemini(text, title, target_length, model_info):
    """Generate abstract prediction using specified Gemini model."""
    model = GenerativeModel(model_info["name"])
    generation_config = GenerationConfig(
        temperature=model_info["temperature"]
    )
    try:
        response = model.generate_content(
            create_abstract_prompt(text, title, target_length),
            generation_config=generation_config
        )
        time.sleep(model_info["delay"])
        return response.text
    except Exception as e:
        print(f"Error generating abstract with {model_info['name']}: {e}")
        return ""


def process_papers(start_index=0, end_index=None):
    """Process papers and generate abstract predictions."""
    papers = load_paper_data()
    model_predictions = {model_name: [] for model_name in MODELS.keys()}
    papers_to_process = papers[start_index:end_index] if end_index else papers[start_index:]
    for i, paper in enumerate(papers_to_process, start=start_index):
        title = paper["title"]
        pdf_path = paper["file_path"]
        target_length = paper["word_count"]
        print(f"\nProcessing paper {i+1}/{len(papers)}: {title}")

        # Extract text from page 2 onwards
        paper_text = extract_text_after_page_one(pdf_path)
        if not paper_text:
            print(f"Skipping file {pdf_path} - no text found after page 1.")
            continue

        # Generate predictions using each model
        for model_name, model_info in MODELS.items():
            try:
                print(f"Generating abstract prediction using {model_name}...")
                predicted_abstract = predict_abstract_with_gemini(
                    paper_text,
                    title,
                    target_length,
                    model_info
                )
                model_predictions[model_name].append({
                    "title": title,
                    "file_path": pdf_path,
                    "abstract": predicted_abstract
                })
                print(f"Successfully generated {model_name} abstract prediction")
            except Exception as e:
                print(f"Error processing paper {title} with {model_name}: {e}")

        # Save progress after each paper
        for model_name, predictions in model_predictions.items():
            output_file = f"{model_name}_abstract_predictions.jsonl"
            with open(output_file, "w") as f:
                for entry in predictions:
                    json.dump(entry, f)
                    f.write("\n")

    return model_predictions


if __name__ == "__main__":
    print("Starting abstract prediction...")
    print("Using temperature settings:")
    for model, config in MODELS.items():
        print(f"- {model}: temperature={config['temperature']}, delay={config['delay']}s")
    predictions = process_papers()
    print("\nProcessing completed. Predictions saved to individual JSONL files.")

In this script, we generated summaries for research papers using Gemini-1.5-Flash and Gemini-2.0-Flash models. Both models were tasked with creating concise, structured summaries based on consistent prompts and settings, ensuring a fair comparison. Outputs were saved in JSONL files in a similar format to our ground truth dataset, enabling straightforward evaluation. This provides a solid foundation for analyzing each model’s performance across key summarization criteria in the next stages.

Defining evaluation metrics in Weave for summarization

To compare the abstracts generated by our Gemini models against the ground truth abstracts from the original papers, we use a range of evaluation metrics in Weave. These metrics combine traditional text similarity scores, neural semantic similarity, and an LLM-based scoring system to give us a comprehensive view of summary quality.
One core metric is using Gemini-1.5-Pro as an automated LLM evaluator. We prompt this model to rate each generated abstract on a scale from 1 to 5 based on how accurately it captures the main points of the original abstract. The model considers key elements such as research objectives, methodologies, datasets, findings, implications, limitations, and future directions. This LLM-as-judge approach allows us to go beyond word overlap, as it considers the semantic understanding of each abstract.
For neural semantic similarity, we use BERTScore, which leverages BERT’s contextual embeddings to measure similarity between the generated and original abstracts. BERTScore is especially useful here because it captures semantic similarity even when different words are used to express the same concept, which is valuable in research summarization where technical terminology may vary.
We also use ROUGE scores (Recall-Oriented Understudy for Gisting Evaluation), a standard metric in summarization tasks. ROUGE-1 measures word overlap, ROUGE-2 captures phrase matches, and ROUGE-L identifies the longest matching sequences between the generated abstracts and the original ones. This approach provides insight into lexical overlap and structural similarities.
To gain additional insight into summary quality, we implement coverage and compression scores. Coverage is calculated using Jaccard similarity, comparing the set of words in each generated abstract to the original abstract to assess information preservation. A high coverage score suggests that the generated summary retains most of the core content from the original, though it’s based on word overlap rather than deeper semantic similarity.
Finally, we use a compression ratio to examine length and conciseness. This metric compares the lengths of the generated and original abstracts by calculating the ratio of the shorter to the longer one, yielding a score between 0 and 1. A score closer to 1 indicates that the generated summary matches the length of the original, whereas lower scores may suggest over-compression or missing information.
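As a quick illustration of how these two scores behave, here is a toy example with made-up sentences; the actual scorers used in the evaluation appear in the script below.
gt = "we propose a new method for fast summarization"
gen = "a new fast summarization method is proposed"

gt_words, gen_words = set(gt.split()), set(gen.split())
coverage = len(gt_words & gen_words) / len(gt_words | gen_words)  # Jaccard similarity over word sets
compression = min(len(gt.split()), len(gen.split())) / max(len(gt.split()), len(gen.split()))  # length ratio in [0, 1]
print(round(coverage, 2), round(compression, 2))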
Viewed together on Weave’s evaluation dashboard, these metrics provide a multi-dimensional view of each Gemini model’s performance. By combining LLM-based semantic evaluation, neural similarity metrics, and traditional text overlap measures, we gain a comprehensive understanding of not only which model performs better overall but also where each model excels or needs improvement.
Before running the following code, I recommend setting the WEAVE_PARALLELISM environment variable to a low value, depending on the capabilities of your system. On my M1 MacBook Pro, a value of 1 worked well to prevent memory issues when calculating BERTScore. You can set this value with the command export WEAVE_PARALLELISM=1. Here is the code for the evaluation:
import weave
from weave import Model
import json
import vertexai
from vertexai.generative_models import GenerativeModel, GenerationConfig
from time import sleep
import asyncio
from rouge_score.rouge_scorer import RougeScorer
from typing import Dict, Any
import bert_score


# Initialize Vertex AI and Weave
PROJECT_ID = "dsports-6ab79"
LOCATION = "us-central1"
vertexai.init(project=PROJECT_ID, location=LOCATION)
weave.init('abstract_metrics_eval')


class BaseJsonModel(Model):
    """Base model class for loading abstracts from JSON files."""
    abstract_file: str = ""

    def get_abstracts(self) -> dict:
        """Load abstracts from the JSON file."""
        abstracts = {}
        with open(self.abstract_file, 'r') as f:
            for line in f:
                entry = json.loads(line)
                abstracts[entry['title']] = entry['abstract']
        return abstracts


class GeminiFlash_1_5_Model(BaseJsonModel):
    """Specific model class for Gemini 1.5 Flash abstracts."""

    @weave.op
    def predict(self, title: str) -> dict:
        """Return the pre-generated abstract for a given title."""
        abstracts = self.get_abstracts()
        return {"model_output": abstracts.get(title, "")}


class GeminiFlash2Model(BaseJsonModel):
    """Specific model class for Gemini 2.0 Flash abstracts."""

    @weave.op
    def predict(self, title: str) -> dict:
        """Return the pre-generated abstract for a given title."""
        abstracts = self.get_abstracts()
        return {"model_output": abstracts.get(title, "")}


@weave.op
def bert_scorer(gt_abstract: str, model_output: dict) -> Dict[str, float]:
    """Calculate BERTScore for the abstract."""
    if not model_output or 'model_output' not in model_output:
        return {'bert_score': 0.0}
    try:
        P, R, F1 = bert_score.score(
            [model_output['model_output']],
            [gt_abstract],
            lang='en',
            model_type='microsoft/deberta-xlarge-mnli'
        )
        return {'bert_score': float(F1.mean())}
    except Exception as e:
        print(f"Error calculating BERTScore: {e}")
        return {'bert_score': 0.0}


async def gemini_scorer(gt_abstract: str, model_output: dict) -> dict:
    """Evaluate abstract using Gemini model."""
    if not model_output or 'model_output' not in model_output:
        print("Invalid model output")
        return {'gemini_score': 0}

    model_id = "gemini-1.5-pro-002"
    model = GenerativeModel(model_id)
    response_schema = {
        "type": "object",
        "properties": {
            "score": {"type": "integer", "minimum": 1, "maximum": 5}
        },
        "required": ["score"]
    }

    formatted_text = (
        f"Given these two research paper abstracts:\n\n"
        f"Ground Truth Abstract:\n{gt_abstract}\n\n"
        f"Generated Abstract:\n{model_output['model_output']}\n\n"
        f"Rate how well the generated abstract captures the key information from the ground truth abstract "
        f"on a scale from 1-5, where 1 is poor and 5 is excellent. Consider:\n"
        f"Respond with ONLY a JSON object in this format: {{'score': X}} where X is your integer rating."
    )
    max_attempts = 10
    attempt = 0
    base_delay = 3
    while attempt < max_attempts:
        print(f"evaluating with gemini attempt: {attempt}")
        try:
            response = model.generate_content(
                formatted_text,
                generation_config=GenerationConfig(
                    temperature=0.0,
                    response_mime_type="application/json",
                    response_schema=response_schema
                )
            )
            eval_result = json.loads(response.text)
            score = int(eval_result.get('score', 0))
            if not 1 <= score <= 5:
                raise ValueError(f"Invalid score: {score}")
            print(f"Sleeping for {base_delay}s between calls")
            sleep(base_delay)
            return {'gemini_score': score}
        except Exception as e:
            attempt += 1
            if "429" in str(e) and attempt < max_attempts:
                delay = base_delay * (2 ** attempt)
                print(f"Rate limit hit. Attempt {attempt}/{max_attempts}. "
                      f"Retrying in {delay}s...")
                sleep(delay)
            else:
                if attempt == max_attempts:
                    print("Max attempts reached")
                print(f"Error in evaluation: {e}")
                return {'gemini_score': 0}
    return {'gemini_score': 0}


@weave.op
def rouge_scorer(gt_abstract: str, model_output: dict) -> Dict[str, float]:
    """Calculate ROUGE scores for the abstract."""
    if not model_output or 'model_output' not in model_output:
        return {
            'rouge1_f': 0.0,
            'rouge2_f': 0.0,
            'rougeL_f': 0.0
        }
    try:
        scorer = RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
        scores = scorer.score(gt_abstract, model_output['model_output'])
        return {
            'rouge1_f': float(scores['rouge1'].fmeasure),
            'rouge2_f': float(scores['rouge2'].fmeasure),
            'rougeL_f': float(scores['rougeL'].fmeasure)
        }
    except Exception as e:
        print(f"Error calculating ROUGE scores: {e}")
        return {
            'rouge1_f': 0.0,
            'rouge2_f': 0.0,
            'rougeL_f': 0.0
        }


@weave.op
def compression_scorer(gt_abstract: str, model_output: dict) -> Dict[str, float]:
    """Calculate compression ratio of the abstract."""
    if not model_output or 'model_output' not in model_output:
        return {'compression_ratio': 0.0}
    try:
        gt_words = len(gt_abstract.split())
        generated_words = len(model_output['model_output'].split())
        compression_ratio = min(gt_words, generated_words) / max(gt_words, generated_words)
        return {'compression_ratio': float(compression_ratio)}
    except Exception as e:
        print(f"Error calculating compression ratio: {e}")
        return {'compression_ratio': 0.0}


@weave.op
def coverage_scorer(gt_abstract: str, model_output: dict) -> Dict[str, float]:
    """Calculate content coverage using word overlap."""
    if not model_output or 'model_output' not in model_output:
        return {'coverage_score': 0.0}
    try:
        gt_words = set(gt_abstract.lower().split())
        generated_words = set(model_output['model_output'].lower().split())
        intersection = len(gt_words.intersection(generated_words))
        union = len(gt_words.union(generated_words))
        coverage_score = intersection / union if union > 0 else 0.0
        return {'coverage_score': float(coverage_score)}
    except Exception as e:
        print(f"Error calculating coverage score: {e}")
        return {'coverage_score': 0.0}


def create_evaluation_dataset(gt_file: str):
    """Create dataset from ground truth file."""
    dataset = []
    with open(gt_file, 'r') as f:
        for line in f:
            entry = json.loads(line)
            dataset.append({
                "title": entry["title"],
                "gt_abstract": entry["abstract"]
            })
    return dataset


async def run_evaluations(gt_file: str):
    """Run separate evaluations for each model."""
    eval_dataset = create_evaluation_dataset(gt_file)
    scorers = [
        gemini_scorer,
        rouge_scorer,
        compression_scorer,
        coverage_scorer,
        bert_scorer
    ]

    # Create and evaluate the Gemini 1.5 Flash model
    print("\nEvaluating Gemini 1.5 Flash abstracts...")
    flash_1_5_model = GeminiFlash_1_5_Model(abstract_file="gemini_1_5_flash_abstract_predictions.jsonl")
    flash_1_5_evaluation = weave.Evaluation(
        dataset=eval_dataset,
        scorers=scorers
    )
    flash1_5_results = await flash_1_5_evaluation.evaluate(flash_1_5_model)

    # Create and evaluate the Gemini 2.0 Flash model
    print("\nEvaluating Gemini 2.0 Flash abstracts...")
    flash_model = GeminiFlash2Model(abstract_file="gemini_2_flash_abstract_predictions.jsonl")
    flash_2_evaluation = weave.Evaluation(
        dataset=eval_dataset,
        scorers=scorers
    )
    flash2_results = await flash_2_evaluation.evaluate(flash_model)

    # Print results
    print("\nEvaluation Results:")
    print("\nGemini Flash 1.5 Results:")
    print(json.dumps(flash1_5_results, indent=2))
    print("\nGemini Flash 2.0 Results:")
    print(json.dumps(flash2_results, indent=2))
    return {
        "gemini_flash1_5": flash1_5_results,
        "gemini_flash2": flash2_results
    }


if __name__ == "__main__":
    gt_file = "paper_abstracts.jsonl"
    asyncio.run(run_evaluations(gt_file))
If you are running this evaluation without a GPU, the BERT scorer can be quite slow depending on the speed of your CPU, so I recommend either using a GPU if one is available or omitting bert_scorer from the list of scorers when running on CPU only.
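One simple way to do this is to include bert_scorer only when a GPU is detected; here's a sketch, assuming PyTorch (a dependency of bert-score) is installed:
import torch

scorers = [gemini_scorer, rouge_scorer, compression_scorer, coverage_scorer]
if torch.cuda.is_available():
    scorers.append(bert_scorer)  # BERTScore is much faster on a GPU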
Our evaluation implementation leverages Weave's framework to compare summaries from both Gemini models against our ground truth abstracts. By creating model classes that read from our previously generated abstracts (gemini_1_5_flash_abstract_predictions.jsonl and gemini_2_flash_abstract_predictions.jsonl), we've separated the model inference stage from evaluation. This decoupling provides several advantages: it allows us to run evaluations multiple times without incurring additional API costs, enables easier debugging and analysis of model outputs, and lets us iteratively refine our evaluation metrics without regenerating summaries.
We create two specific model classes, GeminiFlash_1_5_Model and GeminiFlash2Model, which inherit from BaseJsonModel. This structure allows each model to appear distinctly in Weave's evaluation dashboard.
The evaluation process runs for both models using our suite of metrics: ROUGE scores for lexical overlap, coverage scores for content preservation, compression ratios for length analysis, BERTScore for capturing semantic similarity, and intelligent similarity scores using Gemini-1.5-Pro as a judge. Weave’s dashboard presents these results in an interactive format, enabling detailed comparisons between models across different papers and metrics.
After running our evaluation script, we can analyze the results inside Weave. Here's a screenshot of the evaluation dashboard. The interface provides a clear, direct picture of how the models compare across a variety of metrics.



The Gemini-2.0-Flash model demonstrates improved performance over the Gemini-1.5-Flash model across key evaluation metrics, including the LLM judge score, ROUGE-1, ROUGE-2, ROUGE-L, BERTScore, and coverage, when compared against ground truth abstracts. While both models maintain high efficiency, Gemini-2.0-Flash generates summaries that better preserve semantic content and structural fidelity, indicating advancements in coherence and alignment with the original abstracts.
Choosing an LLM is a bit like hiring an employee. Employers look not only at basic data like qualifications and experience, but also at how a candidate "feels" as a fit for the role. This blend of quantitative and qualitative assessment ensures you select someone who not only has the skills but also aligns with the environment and expectations. Similarly, selecting the right LLM means looking both at numerical scores on key metrics and at the actual quality of the outputs each model generates.
Weave Evaluations facilitates this selection with a comparison dashboard that places model outputs side by side for each entry in the dataset. This gives ML engineers a more "hands-on" understanding of each model's strengths and weaknesses when faced with identical inputs. While metrics provide the quantitative foundation, the comparison view serves as the LLM's "qualitative interview," where the nuanced approach of each model can be assessed. Here's a screenshot of the comparisons view:


Catching bugs with Weave

Initially, I ran this LLM evaluation with an older model from the Gemini family, Gemini 1.0 Pro. On examining some of its outputs, I noticed that the abstracts read more like section-by-section summaries of each paper rather than just the abstract I requested in my prompt. The same issue did not appear with Gemini Flash. Without the Weave comparisons dashboard, I probably would never have noticed this; the side-by-side view made the issue immediately clear.
After further examination of my code, the prompt itself looked mostly correct. However, because the entire research paper was appended after the original instruction, the model appeared to be 'forgetting' that instruction. Here's the original prompt that was causing the bug:
def create_abstract_prompt(text, title, target_length):
    """Create a prompt for abstract generation with target length."""
    return (
        f"You are tasked with generating an abstract for a research paper titled '{title}'. "
        f"The abstract should be approximately {target_length} words long.\n\n"
        f"Generate an abstract that summarizes the key points of the paper, including the "
        f"research objective, methodology, and main findings. The abstract should be "
        f"self-contained and clearly communicate the paper's contribution. Respond only with the ABSTRACT, and NOT the title\n\n"
        f"Paper content:\n\n{text}"
    )
Here, the paper content is the last chunk of text added to the prompt. Evidently, Gemini Flash can handle this, but Gemini 1.0 Pro struggled with the large amount of context. To resolve the issue, I added a final reminder to write only the abstract at the end of the prompt, which fixed the behavior. Here's the new prompt:
def create_abstract_prompt(text, title, target_length):
    """Create a prompt for abstract generation with target length."""
    return (
        f"You are tasked with generating an abstract for a research paper titled '{title}'. "
        f"The abstract should be approximately {target_length} words long.\n\n"
        f"Generate an abstract that summarizes the key points of the paper, including the "
        f"research objective, methodology, and main findings. The abstract should be "
        f"self-contained and clearly communicate the paper's contribution. Respond only with the ABSTRACT, and NOT the title\n\n"
        f"Paper content:\n\n{text}"
        f"Respond only with the ABSTRACT!"  # prompt 'engineering'
    )

Why choose Google Vertex AI?

Google Vertex AI offers a comprehensive platform for working with large language models, providing access to a diverse array of models tailored to various tasks. This variety enables users to select models that best align with their specific requirements. The platform's seamless integration with Google Cloud facilitates straightforward deployment and data management within the Google ecosystem, enhancing operational efficiency.
Vertex AI's scalability allows for easy adjustment of model deployments to accommodate varying demands, ensuring optimal performance across different workloads. Security and compliance are also prioritized, with robust data protection measures that adhere to Google Cloud's stringent standards, safeguarding sensitive information. Additionally, the platform's high-performance infrastructure, supported by Google's cloud technology, ensures reliable and efficient model operations, contributing to consistent and dependable outcomes.

Conclusion

Using Google Vertex AI and W&B Weave, we evaluated the performance of Gemini-2.0-Flash and Gemini-1.5-Flash across key metrics, including an LLM-based judge score. The analysis shows that Gemini-2.0-Flash consistently outperforms Gemini-1.5-Flash on ROUGE, coverage, and BERTScore, demonstrating its ability to generate summaries that more accurately reflect the reference content. The LLM judge score also favors Gemini-2.0-Flash, indicating improvements in semantic alignment and overall coherence.
Gemini-1.5-Flash, while slightly behind in these metrics, remains a competitive option, particularly for use cases where a balance of efficiency and quality is required. Overall, Vertex AI’s model advancements and Weave’s evaluation framework continue to provide meaningful insights, enabling users to select the best model for their summarization needs, whether prioritizing accuracy, semantic richness, or processing speed.



Iterate on AI agents and models faster. Try Weights & Biases today.