Evaluating LLMs on Amazon Bedrock
Discover how to use Amazon Bedrock in combination with W&B Weave to evaluate and compare Large Language Models (LLMs) for summarization tasks, leveraging Bedrock’s managed infrastructure and Weave’s advanced evaluation features.
As organizations grapple with vast volumes of information, summarization tasks have become a critical component of workflows across industries. Whether summarizing research papers, compressing financial reports, or extracting insights from business documents, the ability to generate concise and coherent summaries is key to staying competitive in a fast-paced world. Large language models are at the forefront of this challenge, offering automated summarization capabilities that rival human performance.

Weave Evaluations dashboard.
This article dives into using Amazon Bedrock and W&B Weave for LLM evaluation on summarization tasks. By combining Bedrock’s infrastructure with Weave’s visualization and analysis tools, we can systematically compare models to find the best fit for different use cases.
Jump to the tutorial
What we're covering
- Foundation models available on Amazon Bedrock
- Evaluating LLM summarization on Bedrock using W&B Weave
- Tutorial: Evaluating LLM summarization on Bedrock using W&B Weave
- Step 1: Set up an AWS account and billing
- Step 2: Access Amazon Bedrock
- Step 3: Configure the AWS CLI
- Step 4: Install the required Python libraries
- Step 5: Request access to Bedrock models and configure inference profiles
- Generating a dataset
- Evaluating Llama models with Weave for summarization
- Llama vs. Amazon Nova
- Why choose Amazon Bedrock?
- Conclusion
- Related Articles
Foundation models available on Amazon Bedrock
Amazon Bedrock offers a flexible platform to access a wide variety of LLMs, catering to diverse requirements without the need to manage underlying infrastructure. The platform includes both closed-source and open-source models, providing options for a broad range of use cases. Bedrock supports models from leading providers such as Anthropic, Meta, and Mistral, as well as Amazon’s own Titan models. This diverse selection ensures users have access to tools optimized for tasks like conversational AI, summarization, and high-throughput applications, offering the flexibility to choose models that align with specific operational goals.
Beyond generative tasks, Bedrock also supports specialized models for embeddings, enabling semantic search, clustering, and classification tasks. With this diverse array of offerings, Bedrock ensures that users can select the most suitable tools, whether they require cutting-edge closed-source performance or customizable open-source flexibility, along with support for both text generation and embedding-based applications.
Evaluating LLM summarization on Bedrock using W&B Weave
Weave Evaluations, a dedicated tool within the Weave framework, is designed to benchmark and compare generative AI models effectively. When combined with Amazon Bedrock, which provides streamlined access to a wide variety of models, this pairing becomes a powerful solution for evaluating model performance.
Bedrock’s ability to provide easy access to both open-source and proprietary models enables users to quickly experiment with numerous options. By leveraging Weave Evaluations alongside Bedrock, you can efficiently benchmark these models, analyze outputs, and visualize performance across key metrics. This combination allows for a deeper understanding of the tradeoffs between models, such as differences in cost, accuracy, speed, and output quality.
Through side-by-side comparisons and dynamic visualizations, Weave Evaluations empowers users to make informed decisions about which model best suits their specific use case, streamlining the process of navigating Bedrock’s extensive model catalog.
Here's an example of the Weave Evaluations dashboard, which provides a fantastic way to visualize the performance of our models:

Tutorial: Evaluating LLM summarization on Bedrock using W&B Weave
Getting started with Amazon Bedrock involves a few straightforward steps to set up your account, access foundational AI models, and configure the tools needed for integration. Here we'll walk through the initial setup process, from creating an AWS account and enabling billing to configuring the AWS CLI and exploring Bedrock’s diverse range of models.
If you already have an AWS account with Bedrock access, you can:
Jump past the AWS & Bedrock setup
By following these steps, you can quickly begin leveraging Bedrock’s powerful AI capabilities for tasks like text generation, embeddings, and multimodal processing.
Step 1: Set up an AWS account and billing
Start by creating an AWS account if you do not already have one. Sign up on the AWS website and ensure that billing is enabled within your account. Billing is essential for accessing AWS services—including Bedrock—which operates on a pay-per-use model. Navigate to the Billing and Cost Management section in the AWS Console to verify that your account is ready for use.
Step 2: Access Amazon Bedrock
In the AWS Management Console, search for "Bedrock" in the services menu. Once you open the Bedrock console, you will see an overview of the available foundation models. Bedrock provides access to models from Anthropic, Meta, AI21 Labs, Mistral, Stability AI, and Amazon’s Titan. These models support tasks like text generation, embeddings, and multimodal processing, giving you a range of options for your use case.

Step 3: Configure the AWS CLI
To interact with Bedrock programmatically, install the AWS Command Line Interface. Follow the AWS CLI installation guide for your operating system. After installation, initialize the CLI by running the following command in your terminal:
aws configure
Provide your access key, secret key, default region (e.g., us-east-1 for this tutorial), and output format (json). These keys can be generated in the AWS Console under Security Credentials by creating a new access key. To create a new access key, click on your account name in the top-right corner of the AWS Management Console and select Security Credentials from the dropdown menu. Scroll down to the Access Keys section and click on Create Access Key. Once created, save the access key ID and secret key in a secure location, as you will need them to configure the AWS CLI or SDK.

Next, you will see the "Access Keys" section:

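Once the CLI is configured, an optional sanity check (assuming your IAM user has the relevant permissions) is to confirm that your credentials resolve correctly:
aws sts get-caller-identity
If this returns your account ID and user ARN, the CLI is ready for the Bedrock steps below.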
Step 4: Install the required Python libraries
To integrate Bedrock with W&B Weave, install the necessary Python libraries. Run the following command in your Python environment:
pip install boto3 botocore weave wandb
These libraries allow you to send requests to Bedrock models, log responses, and visualize evaluation results in W&B Weave.
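As a quick sanity check, here is a minimal sketch that confirms the libraries import cleanly and that a Bedrock runtime client can be created (the Weave project name is arbitrary; actual model calls come after Step 5):

import boto3
import weave

# Start (or connect to) a Weave project; evaluations in this tutorial are logged here
weave.init("bedrock_abstract_eval")

# Create the Bedrock runtime client used throughout the rest of this tutorial
client = boto3.client("bedrock-runtime", region_name="us-east-1")
print(client.meta.region_name)  # should print "us-east-1"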
Step 5: Request access to Bedrock models and configure inference profiles
Next, you will need to request access to your desired models. To do so, open the "Providers" tab shown in the screenshot below; each available model has a "Request model access" button you can use to request access.
Next, in the Bedrock console, navigate to the "Cross-Region Inference" section to view the available inference profiles for your models. Here, you will find descriptions of each model and their corresponding profile IDs. You will use these IDs to route API requests to specific models.
For instance, if you are evaluating both Claude and Llama, you will need their respective profile IDs when making requests via the Bedrock API.

Here are some of my inference profiles:

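As a minimal sketch of how these profile IDs are used (assuming you have been granted access to Llama 3.1 8B Instruct and that "us.meta.llama3-1-8b-instruct-v1:0" is the profile ID shown in your console), a request routed through a cross-region inference profile looks like this:

import json
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

# The "us." prefix marks a cross-region inference profile ID rather than a base model ID
request = {
    "prompt": "Summarize in one sentence: large language models can condense long documents.",
    "max_gen_len": 256,
    "temperature": 0.0,
}

response = client.invoke_model(
    modelId="us.meta.llama3-1-8b-instruct-v1:0",
    body=json.dumps(request),
)
print(json.loads(response["body"].read())["generation"])

The same pattern, with the appropriate request schema for each provider, applies to the Claude and Nova profiles used later in this article.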
Generating a dataset
To evaluate the performance of LLMs on summarization tasks, we need a reliable dataset that reflects real-world challenges. In this setup, we use research papers from arXiv, a rich repository of academic content, as our data source.
The goal is to extract and summarize papers relevant to machine learning and artificial intelligence topics. These papers serve as a diverse and challenging testbed for assessing the summarization abilities of LLMs accessed through Amazon Bedrock. Using a combination of automated paper downloads and summaries generated by Anthropic's Claude model on Bedrock, we create a structured dataset that includes research titles, extracted text, and concise summaries for each paper.
Claude's advanced capabilities ensure that the generated summaries are not only coherent but also capture the essence of the research effectively. The dataset is stored in JSONL format for ease of processing and evaluation, leveraging Claude's ability to synthesize critical information into a compact and structured output. Here's the code for generating our dataset:
import os
import arxiv
import fitz  # PyMuPDF
import json
import boto3
from botocore.exceptions import ClientError
import re
import random
import time
from time import sleep

# Directory to save downloaded papers
download_dir = "arxiv_papers"
os.makedirs(download_dir, exist_ok=True)

# Set up Amazon Bedrock client
bedrock_client = boto3.client("bedrock-runtime", region_name="us-east-1")
MODEL_ID = "anthropic.claude-3-sonnet-20240229-v1:0"

# Fixed questions for paper analysis
FIXED_QUESTIONS = """What is the primary objective of this research?
What methodologies or algorithms are proposed or evaluated?
What datasets or experimental setups are used in this study?
What are the key findings and contributions of this research?
What are the implications of these findings for the broader field of AI?
What limitations or challenges are acknowledged by the authors?
What are the proposed future directions or next steps in this research?"""

# Define AI-specific search queries
search_queries = [
    "Large Language Models for vision tasks AND cat:cs.AI",
    "Multimodal AI techniques AND cat:cs.CV",
    "Applications of Transformers in healthcare AI AND cat:cs.LG",
    "Few-shot learning in AI and ML AND cat:cs.LG",
    "Vision and language models integration AND cat:cs.CV",
    "Domain-specific fine-tuning for ML models AND cat:cs.LG",
    "Foundational models in AI and CV applications AND cat:cs.AI",
    "NLP in robotics and vision systems AND cat:cs.AI",
    "Bias and fairness in AI for CV AND cat:cs.CV",
    "Evaluation metrics for multimodal AI AND cat:cs.LG"
]


def download_papers(max_pages=15, max_attempts_per_query=20):
    """Download one suitable paper for each query, retrying if papers exceed page limit."""
    papers = []
    downloaded_titles = set()
    client = arxiv.Client()
    for query in search_queries:
        paper_found = False
        attempt = 0
        while not paper_found and attempt < max_attempts_per_query:
            search = arxiv.Search(
                query=query,
                max_results=100,
                sort_by=arxiv.SortCriterion.SubmittedDate
            )
            try:
                results = list(client.results(search))
                start_idx = attempt * 5
                end_idx = start_idx + 5
                current_batch = results[start_idx:end_idx]
                for result in current_batch:
                    if result.title not in downloaded_titles:
                        print(f"Downloading: {result.title}")
                        paper_id = result.entry_id.split('/')[-1]
                        pdf_filename = f"{paper_id}.pdf"
                        pdf_path = os.path.join(download_dir, pdf_filename)
                        result.download_pdf(dirpath=download_dir, filename=pdf_filename)
                        try:
                            with fitz.open(pdf_path) as pdf:
                                if pdf.page_count <= max_pages:
                                    papers.append({
                                        "title": result.title,
                                        "file_path": pdf_path,
                                        "arxiv_id": paper_id
                                    })
                                    downloaded_titles.add(result.title)
                                    print(f"Accepted: {result.title}")
                                    paper_found = True
                                    break
                                else:
                                    os.remove(pdf_path)
                                    print(f"Skipped (too many pages: {pdf.page_count}): {result.title}")
                        except Exception as e:
                            print(f"Error checking PDF {pdf_path}: {e}")
                            if os.path.exists(pdf_path):
                                os.remove(pdf_path)
                attempt += 1
                if not paper_found:
                    print(f"Attempt {attempt}/{max_attempts_per_query} for query: {query}")
                    sleep(3)
            except Exception as e:
                print(f"Error during download: {e}")
                sleep(3)
                attempt += 1
                continue
        if not paper_found:
            print(f"Failed to find suitable paper for query after {max_attempts_per_query} attempts: {query}")
    print(f"\nSuccessfully downloaded {len(papers)} papers")
    return papers


def extract_text(pdf_path):
    """Extract text from the entire PDF."""
    with fitz.open(pdf_path) as pdf:
        text = ""
        for page in pdf:
            text += page.get_text()
    return text


def generate_summary_with_claude(text, title):
    """Generate a 300-word summary using Claude via Amazon Bedrock with exponential backoff."""
    prompt = (
        f"Please analyze the following research paper titled '{title}' and provide a comprehensive 300-word summary. "
        f"Consider these key aspects when analyzing the paper:\n\n{FIXED_QUESTIONS}\n\n"
        f"Based on these questions, synthesize a coherent summary that captures the essential elements "
        f"of the research while maintaining a natural flow. Ensure the summary is 300 words.\n\n"
        f"Paper content:\n\n{text}"
    )
    request = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 4096,
        "temperature": 0.0,
        "messages": [
            {
                "role": "user",
                "content": [{"type": "text", "text": prompt}]
            }
        ]
    }
    max_retries = 15
    backoff_time = 10  # Start with a 10-second delay
    for attempt in range(max_retries):
        try:
            response = bedrock_client.invoke_model(
                modelId=MODEL_ID,
                body=json.dumps(request)
            )
            response_body = json.loads(response["body"].read())
            summary = response_body["content"][0]["text"]
            return {"summary": summary}
        except ClientError as e:
            if e.response['Error']['Code'] == 'ThrottlingException':
                print(f"ThrottlingException encountered. Retrying in {backoff_time} seconds...")
                time.sleep(backoff_time + random.uniform(0, 1))  # Add jitter
                backoff_time *= 2  # Exponential backoff
            else:
                print(f"Error generating summary for {title}: {e}")
                break
    print(f"Failed to generate summary for {title} after {max_retries} retries.")
    return {"summary": ""}


def count_words(text):
    """Count words excluding punctuation and special characters."""
    cleaned_text = re.sub(r'[^\w\s]', ' ', text.lower())
    words = [word for word in cleaned_text.split() if word.strip()]
    return len(words)


def main():
    # Download papers
    papers = download_papers()
    print(f"\nDownloaded {len(papers)} papers. Generating summaries...\n")

    # Process papers and generate summaries
    paper_data = []
    for paper in papers:
        title = paper["title"]
        pdf_path = paper["file_path"]
        print(f"Processing: {title}")
        paper_text = extract_text(pdf_path)
        summary_json = generate_summary_with_claude(paper_text, title)
        summary_text = summary_json.get('summary', '')
        word_count = count_words(summary_text)
        paper_data.append({
            "title": title,
            "file_path": pdf_path,
            "summary": summary_text,
            "word_count": word_count,
            "arxiv_id": paper["arxiv_id"]
        })
        sleep(5)  # Wait before processing the next paper to avoid throttling

    # Save to JSONL file
    output_file = "paper_summaries.jsonl"
    with open(output_file, "w") as f:
        for entry in paper_data:
            json.dump(entry, f)
            f.write("\n")
    print(f"\nProcessed {len(paper_data)} papers. Results saved to {output_file}")


if __name__ == "__main__":
    main()
The script begins by setting up a directory to store downloaded research papers and initializing the Amazon Bedrock client for accessing models like Anthropic’s Claude. It defines fixed questions for guiding the summarization process, ensuring consistent and targeted outputs across all papers. These questions address key aspects such as the research objectives, methodologies, datasets, and findings, serving as a framework for generating structured and comprehensive summaries.
The script automates the process of downloading research papers using the arxiv library. Specific search queries are used to filter papers on machine learning and AI topics. The function download_papers retrieves papers, ensuring that only those within a specified page limit are processed. Extracted PDFs are stored locally, and text is extracted using the PyMuPDF library.
For each downloaded paper, the generate_summary_with_claude function sends the extracted text to the Claude model via Amazon Bedrock, using a structured prompt designed to elicit a 300-word summary. The prompt emphasizes clarity and coherence, encouraging the model to summarize the research while addressing predefined questions. The script includes a retry mechanism with increasing wait times to handle cases where the Bedrock API is temporarily overloaded, ensuring smooth and reliable communication with the service.
The script takes each paper, extracts its title and content, generates a summary using the Claude model, and combines these elements into a well-organized dataset containing the title, full text, and the corresponding summary. Summaries are stored in JSONL format, enabling easy retrieval and use later on for our evaluation workflows.
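For reference, each line of paper_summaries.jsonl is a self-contained JSON object; a minimal sketch for loading it back into memory looks like this:

import json

with open("paper_summaries.jsonl") as f:
    papers = [json.loads(line) for line in f]

# Each record carries the fields written by the script above
print(list(papers[0].keys()))  # ['title', 'file_path', 'summary', 'word_count', 'arxiv_id']
print(papers[0]["title"], "-", papers[0]["word_count"], "words")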
Evaluating Llama models with Weave for summarization
To effectively compare the performance of multiple Llama models available through Amazon Bedrock, we employ W&B Weave as the evaluation framework. Of course, you can extend this to run comparisons on any models you're interested in exploring.
The Weave platform provides a structured and efficient system for benchmarking by enabling detailed analysis of model outputs against predefined scoring metrics. In this evaluation, we compare three distinct models:
- Llama-1B, a lightweight model designed for cost-effective and high-throughput applications;
- Llama-8B, a balanced model offering strong performance and efficiency; and
- Llama-11B, a high-capacity model optimized for generating detailed and comprehensive outputs.
This setup ensures a repeatable process for identifying trade-offs and strengths between these LLMs, providing insights into their suitability for different summarization tasks.
Here's the evaluation script:
import weave
from weave import Model
import json
import boto3
from botocore.exceptions import ClientError
from time import sleep
import asyncio
from rouge_score.rouge_scorer import RougeScorer
from typing import Dict, Any
import bert_score
import fitz
import os
from weave.trace.box import unbox
import time

# Initialize Weave
weave.init('bedrock_abstract_eval')

client = boto3.client("bedrock-runtime", region_name="us-east-1")


def extract_paper_text(pdf_path: str) -> str:
    """Extract text from PDF paper."""
    if isinstance(pdf_path, weave.trace.box.BoxedStr):
        pdf_path = unbox(pdf_path)
    text = ""
    try:
        with fitz.open(unbox(pdf_path)) as pdf:
            for page in pdf:
                text += page.get_text()
    except Exception as e:
        print(f"Error extracting text from PDF {pdf_path}: {e}")
    return text


def format_prompt(text: str, title: str) -> str:
    """Format prompt for model."""
    return f"""<|begin_of_text|><|start_header_id|>user<|end_header_id|>
Please analyze this research paper and provide a comprehensive 300-word summary that covers:
- Primary research objective
- Methodology and approach
- Key findings and results
- Main contributions to the field

Title: {title}
Content: {text}

Please analyze this research paper and provide a comprehensive 300-word summary that covers:
- Primary research objective
- Methodology and approach
- Key findings and results
- Main contributions to the field

Generate a clear, coherent summary that captures the essence of the research.
<|eot_id|><|start_header_id|>assistant<|end_header_id|>"""


def model_forward(model_id: str, title: str, pdf_path: str) -> str:
    """Core prediction logic to be called by predict methods."""
    max_retries = 15
    backoff_time = 10  # Start with a 10-second delay
    for attempt in range(max_retries):
        try:
            # Extract text from paper
            paper_text = extract_paper_text(pdf_path)

            # Prepare request
            request = {
                "prompt": format_prompt(paper_text, title),
                "max_gen_len": 4096,
                "temperature": 0.0,
            }

            print(f"Invoking model (Attempt {attempt + 1}/{max_retries})...")

            # Make prediction
            response = client.invoke_model(
                modelId=model_id,
                body=json.dumps(request)
            )
            print("Done invoking")

            # Extract and clean prediction
            response_body = json.loads(response["body"].read())
            prediction = response_body["generation"].strip()
            return prediction
        except ClientError as e:
            if e.response['Error']['Code'] == 'ThrottlingException':
                print(f"ThrottlingException encountered. Retrying in {backoff_time} seconds...")
                time.sleep(backoff_time)
                backoff_time *= 2  # Exponential backoff
            else:
                print(f"Error generating prediction with {model_id}: {e}")
                break
        except Exception as e:
            print(f"Unexpected error: {e}")
            break
    print(f"Failed to generate prediction after {max_retries} retries.")
    return ""


class Llama8B(Model):
    @weave.op
    def predict(self, title: str, pdf_path: str) -> dict:
        prediction = model_forward("us.meta.llama3-1-8b-instruct-v1:0", title, pdf_path)
        return {"model_output": prediction}


class Llama11B(Model):
    """Llama 11B model."""
    @weave.op
    def predict(self, title: str, pdf_path: str) -> dict:
        """Generate prediction using Llama 11B."""
        prediction = model_forward("us.meta.llama3-2-11b-instruct-v1:0", title, pdf_path)
        return {"model_output": prediction}


class Llama1B(Model):
    """Llama 1B model."""
    @weave.op
    def predict(self, title: str, pdf_path: str) -> dict:
        """Generate prediction using Llama 1B."""
        prediction = model_forward("us.meta.llama3-2-1b-instruct-v1:0", title, pdf_path)
        return {"model_output": prediction}


@weave.op
def bert_scorer(gt_abstract: str, model_output: dict) -> Dict[str, float]:
    """Calculate BERTScore for the abstract."""
    if not model_output or 'model_output' not in model_output:
        return {'bert_score': 0.0}
    try:
        P, R, F1 = bert_score.score(
            [model_output['model_output']],
            [gt_abstract],
            lang='en',
            model_type='microsoft/deberta-xlarge-mnli'
        )
        return {'bert_score': float(F1.mean())}
    except Exception as e:
        print(f"Error calculating BERTScore: {e}")
        return {'bert_score': 0.0}


@weave.op
def claude_scorer(gt_abstract: str, model_output: dict) -> dict:
    """Evaluate abstract using Claude."""
    if not model_output or 'model_output' not in model_output:
        return {'claude_score': 0.0}
    print("claude evaluating")
    # client = boto3.client("bedrock-runtime", region_name="us-east-1")
    prompt = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,
        "temperature": 0.0,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": f'''Rate how well this generated abstract captures the key information from the ground truth abstract on a scale from 1-5, where:
1: Poor - Missing most key information or seriously misrepresenting the research
2: Fair - Captures some information but misses crucial elements
3: Good - Captures most key points but has some gaps or inaccuracies
4: Very Good - Accurately captures nearly all key information with minor omissions
5: Excellent - Perfectly captures all key information and maintains accuracy

Ground Truth Abstract:
{gt_abstract}

Generated Abstract:
{model_output["model_output"]}

Provide your rating as a JSON object with this schema:
{{"score": <integer 1-5>}}'''
                    }
                ]
            }
        ]
    })
    try:
        response = client.invoke_model(
            modelId="anthropic.claude-3-sonnet-20240229-v1:0",
            body=prompt
        )
        result = json.loads(response["body"].read())
        score = json.loads(result["content"][0]["text"])["score"]
        print(score)
        sleep(2)  # Rate limiting
        return {'claude_score': float(score)}
    except Exception as e:
        print(f"Error in Claude evaluation: {e}")
        return {'claude_score': 0.0}


@weave.op
def rouge_scorer(gt_abstract: str, model_output: dict) -> Dict[str, float]:
    """Calculate ROUGE scores for the abstract."""
    if not model_output or 'model_output' not in model_output:
        return {'rouge1_f': 0.0, 'rouge2_f': 0.0, 'rougeL_f': 0.0}
    try:
        scorer = RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
        scores = scorer.score(gt_abstract, model_output['model_output'])
        return {
            'rouge1_f': float(scores['rouge1'].fmeasure),
            'rouge2_f': float(scores['rouge2'].fmeasure),
            'rougeL_f': float(scores['rougeL'].fmeasure)
        }
    except Exception as e:
        print(f"Error calculating ROUGE scores: {e}")
        return {'rouge1_f': 0.0, 'rouge2_f': 0.0, 'rougeL_f': 0.0}


@weave.op
def compression_scorer(gt_abstract: str, model_output: dict) -> Dict[str, float]:
    """Calculate compression ratio of the abstract."""
    if not model_output or 'model_output' not in model_output:
        return {'compression_ratio': 0.0}
    try:
        gt_words = len(gt_abstract.split())
        generated_words = len(model_output['model_output'].split())
        compression_ratio = min(gt_words, generated_words) / max(gt_words, generated_words)
        return {'compression_ratio': float(compression_ratio)}
    except Exception as e:
        print(f"Error calculating compression ratio: {e}")
        return {'compression_ratio': 0.0}


@weave.op
def coverage_scorer(gt_abstract: str, model_output: dict) -> Dict[str, float]:
    """Calculate content coverage using word overlap."""
    if not model_output or 'model_output' not in model_output:
        return {'coverage_score': 0.0}
    try:
        gt_words = set(gt_abstract.lower().split())
        generated_words = set(model_output['model_output'].lower().split())
        intersection = len(gt_words.intersection(generated_words))
        union = len(gt_words.union(generated_words))
        coverage_score = intersection / union if union > 0 else 0.0
        return {'coverage_score': float(coverage_score)}
    except Exception as e:
        print(f"Error calculating coverage score: {e}")
        return {'coverage_score': 0.0}


def create_evaluation_dataset(gt_file: str):
    """Create dataset from ground truth file."""
    dataset = []
    with open(gt_file, 'r') as f:
        for line in f:
            entry = json.loads(line)
            dataset.append({
                "title": entry["title"],
                "gt_abstract": entry["summary"],
                "pdf_path": entry["file_path"]
            })
    return dataset


async def run_evaluations(gt_file: str):
    """Run evaluations for each model."""
    eval_dataset = create_evaluation_dataset(gt_file)

    # Initialize models
    models = {
        "llama_8b": Llama8B(),
        "llama_11b": Llama11B(),
        "llama_1b": Llama1B()
    }

    # Setup scorers
    scorers = [
        claude_scorer,
        rouge_scorer,
        compression_scorer,
        coverage_scorer,
        bert_scorer
    ]

    # Run evaluations
    results = {}
    for model_name, model in models.items():
        print(f"\nEvaluating {model_name}...")
        evaluation = weave.Evaluation(
            dataset=eval_dataset,
            scorers=scorers,
            name=model_name + " Eval"
        )
        results[model_name] = await evaluation.evaluate(model)

    # Print results
    print("\nEvaluation Results:")
    for model_name, result in results.items():
        print(f"\n{model_name} Results:")
        print(json.dumps(result, indent=2))

    # Save results to file
    output_file = "llama_evaluation_results.json"
    with open(output_file, 'w') as f:
        json.dump(results, f, indent=2)
    print(f"\nResults saved to {output_file}")

    return results


if __name__ == "__main__":
    gt_file = "paper_summaries.jsonl"
    asyncio.run(run_evaluations(gt_file))
We start by loading the pre-generated dataset of ground-truth summaries from the JSONL file we created earlier with Claude. This dataset contains research paper titles, file paths to the full text, and the Claude-generated reference summaries that serve as benchmarks for evaluation. Loading the dataset keeps the workflow focused on analyzing model outputs rather than duplicating effort in data preparation.
Once the dataset is loaded, the script processes each entry, sending the text of research papers to different Llama models for summarization. The outputs are then compared against the ground-truth summaries using a range of evaluation metrics.
These metrics include:
- ROUGE scores to measure lexical overlap,
- BERTScore for semantic similarity,
- Compression ratios for assessing conciseness, and
- Coverage metrics to evaluate the amount of retained information.
Additionally, a scoring function powered by Anthropic's Claude rates the summaries for human-like quality and alignment with the original content (a short standalone sketch of the string-based metrics follows below).
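To make the string-based metrics concrete, here is a minimal sketch on toy strings, using the same rouge_score library and the same Jaccard-style coverage calculation as the evaluation script above:

from rouge_score.rouge_scorer import RougeScorer

reference = "The paper proposes a transformer model for medical image segmentation."
candidate = "This work introduces a transformer-based model to segment medical images."

# Lexical overlap: ROUGE F-measures, as computed in rouge_scorer above
scorer = RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)
print({name: round(score.fmeasure, 3) for name, score in scores.items()})

# Coverage: word-level Jaccard overlap, as computed in coverage_scorer above
ref_words = set(reference.lower().split())
cand_words = set(candidate.lower().split())
print(round(len(ref_words & cand_words) / len(ref_words | cand_words), 3))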
After scoring the outputs, Weave’s dynamic dashboard can be used to visualize the results. This interface allows for an in-depth comparison of metrics across models and examples, helping to identify trends and trade-offs, ultimately empowering developers to make informed decisions about which model best fits their specific requirements.
By combining the model diversity of Amazon Bedrock with the benchmarking capabilities of W&B Weave, this workflow delivers a robust and scalable solution for evaluating LLMs on summarization tasks:


The results show that Llama-8B outperforms both Llama-1B and Llama-11B across several evaluation metrics, including the Claude Score, ROUGE-1, ROUGE-2, and Coverage Score. This demonstrates its ability to generate summaries that align more closely with the ground truth, retaining key content and ensuring structural coherence.
Llama-11B, however, shows slight advantages over Llama-8B in ROUGE-L, Compression Ratio, and BERTScore. Llama-1B, on the other hand, falls behind both Llama-8B and Llama-11B across all metrics, affirming its role as a lightweight model optimized for efficiency rather than maximum performance.
Overall, Llama-8B and Llama-11B perform quite well in this evaluation, excelling in critical dimensions while maintaining reasonable latency and efficiency. One nice feature of Weave Evaluations is that it lets you dive deeper into the exact responses from each model using the comparison view. For example, we can compare the responses of each model side by side, and against the ground truth, in a single interface.
Here's a screenshot of the comparison view:

Llama vs. Amazon Nova
Amazon recently unveiled the Nova series, a diverse lineup of AI models aimed at meeting a wide range of needs. Nova Micro focuses on speed and affordability, making it ideal for straightforward tasks like summarization and translation, while Nova Lite handles multimodal input, including text, images, and video, for real-time analysis. Nova Pro strikes a balance between cost and performance, excelling in complex reasoning and multimodal workflows. Meanwhile, Nova Premier, expected to launch in early 2025, promises to handle the most intricate challenges with advanced capabilities. Let's see how these Nova models stack up against Llama.
We will write an evaluation script similar to the one above to test Amazon Nova Pro against the previous Llama models. Note that since we used Weave Evaluations, we can simply write a new script using just the Nova Pro model and then select the previous evaluations to compare against later in the Weave dashboard. Here's the code:
import weave
from weave import Model
import json
import boto3
from botocore.exceptions import ClientError
from time import sleep
import asyncio
from rouge_score.rouge_scorer import RougeScorer
from typing import Dict, Any
import bert_score
import fitz
import os
from weave.trace.box import unbox
import time
import logging

logging.basicConfig(level=logging.DEBUG)

# Initialize Weave
weave.init('bedrock_abstract_eval')

client = boto3.client(service_name="bedrock-runtime", region_name="us-east-1")


def extract_paper_text(pdf_path: str) -> str:
    """Extract text from PDF paper."""
    if isinstance(pdf_path, weave.trace.box.BoxedStr):
        pdf_path = unbox(pdf_path)
    text = ""
    try:
        with fitz.open(unbox(pdf_path)) as pdf:
            for page in pdf:
                text += page.get_text()
    except Exception as e:
        print(f"Error extracting text from PDF {pdf_path}: {e}")
    return text


def format_prompt(text: str, title: str) -> str:
    """Format prompt for model."""
    return f"""<|begin_of_text|><|start_header_id|>user<|end_header_id|>
Please analyze this research paper and provide a comprehensive 300-word summary that covers:
- Primary research objective
- Methodology and approach
- Key findings and results
- Main contributions to the field

Title: {title}
Content: {text}

Please analyze this research paper and provide a comprehensive 300-word summary that covers:
- Primary research objective
- Methodology and approach
- Key findings and results
- Main contributions to the field

Generate a clear, coherent summary that captures the essence of the research.
<|eot_id|><|start_header_id|>assistant<|end_header_id|>"""


def model_forward_nova(title: str, pdf_path: str) -> str:
    """Core prediction logic for Amazon Nova Pro."""
    max_retries = 15
    backoff_time = 10  # Start with a 10-second delay
    for attempt in range(max_retries):
        try:
            paper_text = extract_paper_text(pdf_path)
            messages = [
                {"role": "user", "content": [{"text": format_prompt(paper_text, title)}]},
            ]

            # Make prediction
            print(f"Invoking Nova Pro model (Attempt {attempt + 1}/{max_retries})...")
            response = client.converse(
                modelId="us.amazon.nova-pro-v1:0",
                messages=messages
            )
            prediction = response["output"]["message"]["content"][0]["text"].strip()
            print("Done invoking")
            return prediction
        except Exception as e:
            print(f"Error generating prediction with Nova Pro (Attempt {attempt + 1}/{max_retries}): {e}")
            if attempt < max_retries - 1:  # Avoid sleeping on the last attempt
                time.sleep(backoff_time)
                backoff_time *= 2  # Exponential backoff
    print(f"Failed to generate prediction after {max_retries} retries.")
    return ""


class NovaPro(Model):
    """Amazon Nova Pro model."""
    @weave.op
    def predict(self, title: str, pdf_path: str) -> dict:
        """Generate prediction using Amazon Nova Pro."""
        prediction = model_forward_nova(title, pdf_path)
        return {"model_output": prediction}


@weave.op
def bert_scorer(gt_abstract: str, model_output: dict) -> Dict[str, float]:
    """Calculate BERTScore for the abstract."""
    if not model_output or 'model_output' not in model_output:
        return {'bert_score': 0.0}
    try:
        P, R, F1 = bert_score.score(
            [model_output['model_output']],
            [gt_abstract],
            lang='en',
            model_type='microsoft/deberta-xlarge-mnli'
        )
        return {'bert_score': float(F1.mean())}
    except Exception as e:
        print(f"Error calculating BERTScore: {e}")
        return {'bert_score': 0.0}


@weave.op
def rouge_scorer(gt_abstract: str, model_output: dict) -> Dict[str, float]:
    """Calculate ROUGE scores for the abstract."""
    if not model_output or 'model_output' not in model_output:
        return {'rouge1_f': 0.0, 'rouge2_f': 0.0, 'rougeL_f': 0.0}
    try:
        scorer = RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
        scores = scorer.score(gt_abstract, model_output['model_output'])
        return {
            'rouge1_f': float(scores['rouge1'].fmeasure),
            'rouge2_f': float(scores['rouge2'].fmeasure),
            'rougeL_f': float(scores['rougeL'].fmeasure)
        }
    except Exception as e:
        print(f"Error calculating ROUGE scores: {e}")
        return {'rouge1_f': 0.0, 'rouge2_f': 0.0, 'rougeL_f': 0.0}


@weave.op
def compression_scorer(gt_abstract: str, model_output: dict) -> Dict[str, float]:
    """Calculate compression ratio of the abstract."""
    if not model_output or 'model_output' not in model_output:
        return {'compression_ratio': 0.0}
    try:
        gt_words = len(gt_abstract.split())
        generated_words = len(model_output['model_output'].split())
        compression_ratio = min(gt_words, generated_words) / max(gt_words, generated_words)
        return {'compression_ratio': float(compression_ratio)}
    except Exception as e:
        print(f"Error calculating compression ratio: {e}")
        return {'compression_ratio': 0.0}


@weave.op
def coverage_scorer(gt_abstract: str, model_output: dict) -> Dict[str, float]:
    """Calculate content coverage using word overlap."""
    if not model_output or 'model_output' not in model_output:
        return {'coverage_score': 0.0}
    try:
        gt_words = set(gt_abstract.lower().split())
        generated_words = set(model_output['model_output'].lower().split())
        intersection = len(gt_words.intersection(generated_words))
        union = len(gt_words.union(generated_words))
        coverage_score = intersection / union if union > 0 else 0.0
        return {'coverage_score': float(coverage_score)}
    except Exception as e:
        print(f"Error calculating coverage score: {e}")
        return {'coverage_score': 0.0}


@weave.op
def claude_scorer(gt_abstract: str, model_output: dict) -> dict:
    """Evaluate abstract using Claude."""
    if not model_output or 'model_output' not in model_output:
        return {'claude_score': 0.0}
    print("claude evaluating")
    # client = boto3.client("bedrock-runtime", region_name="us-east-1")
    prompt = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,
        "temperature": 0.0,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": f'''Rate how well this generated abstract captures the key information from the ground truth abstract on a scale from 1-5, where:
1: Poor - Missing most key information or seriously misrepresenting the research
2: Fair - Captures some information but misses crucial elements
3: Good - Captures most key points but has some gaps or inaccuracies
4: Very Good - Accurately captures nearly all key information with minor omissions
5: Excellent - Perfectly captures all key information and maintains accuracy

Ground Truth Abstract:
{gt_abstract}

Generated Abstract:
{model_output["model_output"]}

Provide your rating as a JSON object with this schema:
{{"score": <integer 1-5>}}'''
                    }
                ]
            }
        ]
    })
    try:
        response = client.invoke_model(
            modelId="anthropic.claude-3-sonnet-20240229-v1:0",
            body=prompt
        )
        result = json.loads(response["body"].read())
        score = json.loads(result["content"][0]["text"])["score"]
        print(score)
        sleep(2)  # Rate limiting
        return {'claude_score': float(score)}
    except Exception as e:
        print(f"Error in Claude evaluation: {e}")
        return {'claude_score': 0.0}


def create_evaluation_dataset(gt_file: str):
    """Create dataset from ground truth file."""
    dataset = []
    with open(gt_file, 'r') as f:
        for line in f:
            entry = json.loads(line)
            dataset.append({
                "title": entry["title"],
                "gt_abstract": entry["summary"],
                "pdf_path": entry["file_path"]
            })
    return dataset


async def run_evaluations(gt_file: str):
    """Run evaluations for each model."""
    eval_dataset = create_evaluation_dataset(gt_file)

    # Initialize models
    models = {
        "nova_pro": NovaPro(),
    }

    # Setup scorers
    scorers = [
        claude_scorer,
        rouge_scorer,
        compression_scorer,
        coverage_scorer,
        bert_scorer
    ]

    # Run evaluations
    results = {}
    for model_name, model in models.items():
        print(f"\nEvaluating {model_name}...")
        evaluation = weave.Evaluation(
            dataset=eval_dataset,
            scorers=scorers,
            name=model_name + " Eval"
        )
        results[model_name] = await evaluation.evaluate(model)

    # Print results
    print("\nEvaluation Results:")
    for model_name, result in results.items():
        print(f"\n{model_name} Results:")
        print(json.dumps(result, indent=2))

    # Save results to file
    output_file = "llama_vs_nova_evaluation_results.json"
    with open(output_file, 'w') as f:
        json.dump(results, f, indent=2)
    print(f"\nResults saved to {output_file}")

    return results


if __name__ == "__main__":
    gt_file = "paper_summaries.jsonl"
    asyncio.run(run_evaluations(gt_file))
Here are the results as seen in the Weave Evaluations comparisons dashboard:


The results indicate that Nova Pro generally outperforms the two Llama-based models on multiple evaluation metrics. Nova Pro’s Claude Score of 4.8 surpasses Llama-8B (4.6) and Llama-11B (4.4), suggesting its summaries align more closely with human-like judgment as reflected by the Claude evaluation criteria. It also excels in ROUGE metrics, achieving a ROUGE-1 score of 0.6308, a ROUGE-2 score of 0.3038, and a ROUGE-L score of 0.3543—higher than both Llama models.
Although Nova Pro’s compression ratio (0.8198) is slightly lower than Llama-8B (0.822) and Llama-11B (0.8233), it compensates with stronger content retention, as evidenced by a coverage score of 0.31, exceeding Llama-8B’s 0.2819 and Llama-11B’s 0.2791. Nova Pro also attains the highest BERTScore of 0.7252, reflecting a closer semantic match to the reference summaries.
It’s important to note that while metrics like ROUGE and BERTScore capture lexical and semantic overlap, they do not fully measure readability, coherence, or overall user satisfaction. The Claude Score provides a more human-aligned benchmark, helping to highlight Nova Pro’s ability to produce well-structured, coherent, and appealing summaries.
Overall, Nova Pro’s strong showing across these metrics—combined with its superior alignment with human evaluators—makes it a compelling choice for tasks that prioritize both technical precision and human-centered quality.
Why choose Amazon Bedrock?
Amazon Bedrock combines a diverse selection of foundation models with seamless integration into the AWS ecosystem, making it a powerful platform for generative AI. It offers access to models from leading providers like Anthropic, Meta, AI21 Labs, Mistral, and Amazon’s Titan, allowing users to tailor their model choice to tasks like text summarization, embedding generation, or multimodal processing. By integrating with AWS services, Bedrock simplifies workflows, leverages trusted infrastructure, and enables scalable deployments without the complexity of managing underlying systems.
With a focus on scalability, security, and reliability, Bedrock ensures businesses can easily adjust resources to meet growing demands while maintaining data protection through advanced encryption and compliance with industry standards. Backed by AWS’s robust global infrastructure, it delivers consistent performance and uptime, making it an ideal solution for organizations seeking a dependable and flexible platform for their AI initiatives.
Conclusion
Amazon Bedrock and W&B Weave represent a compelling combination for evaluating and comparing the performance of large language models in diverse use cases such as text summarization. Bedrock’s expansive array of foundation models, coupled with its integration into the AWS ecosystem, provides flexibility and scalability for businesses seeking robust AI solutions.
By leveraging Weave’s sophisticated benchmarking and visualization capabilities, organizations can methodically analyze the trade-offs between models, gaining insights that go beyond raw metrics to understand practical strengths and limitations. As AI technologies continue to evolve, platforms like Bedrock and Weave pave the way for businesses to harness the power of generative AI, ensuring they remain competitive in a data-driven world.
Related Articles
Building an LLM Python debugger agent with the new Claude 3.5 Sonnet
Building an AI-powered coding agent with Claude 3.5 Sonnet!
Training a KANFormer: KAN's Are All You Need?
We will dive into a new experimental architecture, replacing the MLP layers in transformers with KAN layers!
Building reliable apps with GPT-4o and structured outputs
Learn how to enforce consistency on GPT-4o outputs, and build reliable Gen-AI Apps.
How to train and evaluate an LLM router
This tutorial explores LLM routers, inspired by the RouteLLM paper, covering training, evaluation, and practical use cases for managing LLMs effectively.