Evaluating LLMs on Amazon Bedrock
Discover how to use Amazon Bedrock in combination with W&B Weave to evaluate and compare Large Language Models (LLMs) for summarization tasks, leveraging Bedrock’s managed infrastructure and Weave’s advanced evaluation features.
As organizations grapple with vast volumes of information, summarization tasks have become a critical component of workflows across industries. Whether summarizing research papers, compressing financial reports, or extracting insights from business documents, the ability to generate concise and coherent summaries is key to staying competitive in a fast-paced world. Large language models are at the forefront of this challenge, offering automated summarization capabilities that rival human performance.

Weave Evaluations dashboard.
This article dives into using Amazon Bedrock and W&B Weave for LLM evaluation on summarization tasks. By combining Bedrock’s infrastructure with Weave’s visualization and analysis tools, we can systematically compare models to find the best fit for different use cases.
Jump to the tutorial
What we're covering
- Foundation models available on Amazon Bedrock
- Evaluating LLM summarization on Bedrock using W&B Weave
- Tutorial: Evaluating LLM summarization on Bedrock using W&B Weave
- Step 1: Set up an AWS account and billing
- Step 2: Access Amazon Bedrock
- Step 3: Configure the AWS CLI
- Step 4: Install the required Python libraries
- Step 5: Request access to Bedrock models and configure inference profiles
- Generating a dataset
- Evaluating Llama models with Weave for summarization
- Llama vs. Amazon Nova
- Why choose Amazon Bedrock?
- Conclusion
- Related Articles
Foundation models available on Amazon Bedrock
Amazon Bedrock offers a flexible platform to access a wide variety of LLMs, catering to diverse requirements without the need to manage underlying infrastructure. The platform includes both closed-source and open-source models, providing options for a broad range of use cases. Bedrock supports models from leading providers such as Anthropic, Meta, and Mistral, as well as Amazon’s own Titan models. This diverse selection ensures users have access to tools optimized for tasks like conversational AI, summarization, and high-throughput applications, offering the flexibility to choose models that align with specific operational goals.
Beyond generative tasks, Bedrock also supports specialized models for embeddings, enabling semantic search, clustering, and classification tasks. With this diverse array of offerings, Bedrock ensures that users can select the most suitable tools, whether they require cutting-edge closed-source performance or customizable open-source flexibility, along with support for both text generation and embedding-based applications.
Evaluating LLM summarization on Bedrock using W&B Weave
Weave Evaluations, a dedicated tool within the Weave framework, is designed to benchmark and compare generative AI models effectively. When combined with Amazon Bedrock, which provides streamlined access to a wide variety of models, this pairing becomes a powerful solution for evaluating model performance.
Bedrock’s ability to provide easy access to both open-source and proprietary models enables users to quickly experiment with numerous options. By leveraging Weave Evaluations alongside Bedrock, you can efficiently benchmark these models, analyze outputs, and visualize performance across key metrics. This combination allows for a deeper understanding of the tradeoffs between models, such as differences in cost, accuracy, speed, and output quality.
Through side-by-side comparisons and dynamic visualizations, Weave Evaluations empowers users to make informed decisions about which model best suits their specific use case, streamlining the process of navigating Bedrock’s extensive model catalog.
Here's an example of the Weave Evaluations dashboard, which provides a fantastic way to visualize the performance of our models:

Tutorial: Evaluating LLM summarization on Bedrock using W&B Weave
Getting started with Amazon Bedrock involves a few straightforward steps to set up your account, access foundational AI models, and configure the tools needed for integration. Here we'll walk through the initial setup process, from creating an AWS account and enabling billing to configuring the AWS CLI and exploring Bedrock’s diverse range of models.
If you already have an AWS account with Bedrock access, you can:
Jump past the AWS & Bedrock setup
By following these steps, you can quickly begin leveraging Bedrock’s powerful AI capabilities for tasks like text generation, embeddings, and multimodal processing.
Step 1: Set up an AWS account and billing
Start by creating an AWS account if you do not already have one. Sign up on the AWS website and ensure that billing is enabled within your account. Billing is essential for accessing AWS services—including Bedrock—which operates on a pay-per-use model. Navigate to the Billing and Cost Management section in the AWS Console to verify that your account is ready for use.
Step 2: Access Amazon Bedrock
In the AWS Management Console, search for "Bedrock" in the services menu. Once you open the Bedrock console, you will see an overview of the available foundation models. Bedrock provides access to models from Anthropic, Meta, AI21 Labs, Mistral, Stability AI, and Amazon’s Titan. These models support tasks like text generation, embeddings, and multimodal processing, giving you a range of options for your use case.

Step 3: Configure the AWS CLI
To interact with Bedrock programmatically, install the AWS Command Line Interface. Follow the AWS CLI installation guide for your operating system. After installation, initialize the CLI by running the following command in your terminal:
aws configure
Provide your access key, secret key, default region (e.g., us-east-1 for this tutorial), and output format (json). These keys can be generated in the AWS Console under Security Credentials by creating a new access key. To create a new access key, click on your account name in the top-right corner of the AWS Management Console and select Security Credentials from the dropdown menu. Scroll down to the Access Keys section and click on Create Access Key. Once created, save the access key ID and secret key in a secure location, as you will need them to configure the AWS CLI or SDK.

Next, you will see the "Access Keys" section:

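Once the CLI is configured, an optional sanity check (assuming your IAM user has the relevant permissions) is to confirm that your credentials resolve correctly:
aws sts get-caller-identity
If this returns your account ID and user ARN, the CLI is ready for the Bedrock steps below.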
Step 4: Install the required Python libraries
To integrate Bedrock with W&B Weave, install the necessary Python libraries. Run the following command in your Python environment:
pip install boto3 botocore weave wandb
These libraries allow you to send requests to Bedrock models, log responses, and visualize evaluation results in W&B Weave.
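As a quick sanity check, here is a minimal sketch that confirms the libraries import cleanly and that a Bedrock runtime client can be created (the Weave project name is arbitrary; actual model calls come after Step 5):

import boto3
import weave

# Start (or connect to) a Weave project; evaluations in this tutorial are logged here
weave.init("bedrock_abstract_eval")

# Create the Bedrock runtime client used throughout the rest of this tutorial
client = boto3.client("bedrock-runtime", region_name="us-east-1")
print(client.meta.region_name)  # should print "us-east-1"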
Step 5: Request access to Bedrock models and configure inference profiles
Next, you will need to request access to your desired models. To do so, open the "Providers" tab shown in the screenshot below; each available model has a "Request model access" button you can use to request access.
Next, in the Bedrock console, navigate to the "Cross-Region Inference" section to view the available inference profiles for your models. Here, you will find descriptions of each model and their corresponding profile IDs. You will use these IDs to route API requests to specific models.
For instance, if you are evaluating both Claude and Llama, you will need their respective profile IDs when making requests via the Bedrock API.

Here are some of my inference profiles:

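As a minimal sketch of how these profile IDs are used (assuming you have been granted access to Llama 3.1 8B Instruct and that "us.meta.llama3-1-8b-instruct-v1:0" is the profile ID shown in your console), a request routed through a cross-region inference profile looks like this:

import json
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

# The "us." prefix marks a cross-region inference profile ID rather than a base model ID
request = {
    "prompt": "Summarize in one sentence: large language models can condense long documents.",
    "max_gen_len": 256,
    "temperature": 0.0,
}

response = client.invoke_model(
    modelId="us.meta.llama3-1-8b-instruct-v1:0",
    body=json.dumps(request),
)
print(json.loads(response["body"].read())["generation"])

The same pattern, with the appropriate request schema for each provider, applies to the Claude and Nova profiles used later in this article.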
Generating a dataset
To evaluate the performance of LLMs on summarization tasks, we need a reliable dataset that reflects real-world challenges. In this setup, we use research papers from arXiv, a rich repository of academic content, as our data source.
The goal is to extract and summarize papers relevant to machine learning and artificial intelligence topics. These papers serve as a diverse and challenging testbed for assessing the summarization abilities of LLMs accessed through Amazon Bedrock. Using a combination of automated paper downloads and summaries generated by Anthropic's Claude model on Bedrock, we create a structured dataset that includes research titles, extracted text, and concise summaries for each paper.
Claude's advanced capabilities ensure that the generated summaries are not only coherent but also capture the essence of the research effectively. The dataset is stored in JSONL format for ease of processing and evaluation, leveraging Claude's ability to synthesize critical information into a compact and structured output. Here's the code for generating our dataset:
import os
import arxiv
import fitz  # PyMuPDF
import json
import boto3
from botocore.exceptions import ClientError
import re
import random
import time
from time import sleep

# Directory to save downloaded papers
download_dir = "arxiv_papers"
os.makedirs(download_dir, exist_ok=True)

# Set up Amazon Bedrock client
bedrock_client = boto3.client("bedrock-runtime", region_name="us-east-1")
MODEL_ID = "anthropic.claude-3-sonnet-20240229-v1:0"

# Fixed questions for paper analysis
FIXED_QUESTIONS = """What is the primary objective of this research?
What methodologies or algorithms are proposed or evaluated?
What datasets or experimental setups are used in this study?
What are the key findings and contributions of this research?
What are the implications of these findings for the broader field of AI?
What limitations or challenges are acknowledged by the authors?
What are the proposed future directions or next steps in this research?"""

# Define AI-specific search queries
search_queries = [
    "Large Language Models for vision tasks AND cat:cs.AI",
    "Multimodal AI techniques AND cat:cs.CV",
    "Applications of Transformers in healthcare AI AND cat:cs.LG",
    "Few-shot learning in AI and ML AND cat:cs.LG",
    "Vision and language models integration AND cat:cs.CV",
    "Domain-specific fine-tuning for ML models AND cat:cs.LG",
    "Foundational models in AI and CV applications AND cat:cs.AI",
    "NLP in robotics and vision systems AND cat:cs.AI",
    "Bias and fairness in AI for CV AND cat:cs.CV",
    "Evaluation metrics for multimodal AI AND cat:cs.LG"
]


def download_papers(max_pages=15, max_attempts_per_query=20):
    """Download one suitable paper for each query, retrying if papers exceed page limit."""
    papers = []
    downloaded_titles = set()
    client = arxiv.Client()
    for query in search_queries:
        paper_found = False
        attempt = 0
        while not paper_found and attempt < max_attempts_per_query:
            search = arxiv.Search(
                query=query,
                max_results=100,
                sort_by=arxiv.SortCriterion.SubmittedDate
            )
            try:
                results = list(client.results(search))
                start_idx = attempt * 5
                end_idx = start_idx + 5
                current_batch = results[start_idx:end_idx]
                for result in current_batch:
                    if result.title not in downloaded_titles:
                        print(f"Downloading: {result.title}")
                        paper_id = result.entry_id.split('/')[-1]
                        pdf_filename = f"{paper_id}.pdf"
                        pdf_path = os.path.join(download_dir, pdf_filename)
                        result.download_pdf(dirpath=download_dir, filename=pdf_filename)
                        try:
                            with fitz.open(pdf_path) as pdf:
                                if pdf.page_count <= max_pages:
                                    papers.append({
                                        "title": result.title,
                                        "file_path": pdf_path,
                                        "arxiv_id": paper_id
                                    })
                                    downloaded_titles.add(result.title)
                                    print(f"Accepted: {result.title}")
                                    paper_found = True
                                    break
                                else:
                                    os.remove(pdf_path)
                                    print(f"Skipped (too many pages: {pdf.page_count}): {result.title}")
                        except Exception as e:
                            print(f"Error checking PDF {pdf_path}: {e}")
                            if os.path.exists(pdf_path):
                                os.remove(pdf_path)
                attempt += 1
                if not paper_found:
                    print(f"Attempt {attempt}/{max_attempts_per_query} for query: {query}")
                    sleep(3)
            except Exception as e:
                print(f"Error during download: {e}")
                sleep(3)
                attempt += 1
                continue
        if not paper_found:
            print(f"Failed to find suitable paper for query after {max_attempts_per_query} attempts: {query}")
    print(f"\nSuccessfully downloaded {len(papers)} papers")
    return papers


def extract_text(pdf_path):
    """Extract text from the entire PDF."""
    with fitz.open(pdf_path) as pdf:
        text = ""
        for page in pdf:
            text += page.get_text()
    return text


def generate_summary_with_claude(text, title):
    """Generate a 300-word summary using Claude via Amazon Bedrock with exponential backoff."""
    prompt = (
        f"Please analyze the following research paper titled '{title}' and provide a comprehensive 300-word summary. "
        f"Consider these key aspects when analyzing the paper:\n\n{FIXED_QUESTIONS}\n\n"
        f"Based on these questions, synthesize a coherent summary that captures the essential elements "
        f"of the research while maintaining a natural flow. Ensure the summary is 300 words.\n\n"
        f"Paper content:\n\n{text}"
    )
    request = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 4096,
        "temperature": 0.0,
        "messages": [
            {
                "role": "user",
                "content": [{"type": "text", "text": prompt}]
            }
        ]
    }
    max_retries = 15
    backoff_time = 10  # Start with a 10-second delay
    for attempt in range(max_retries):
        try:
            response = bedrock_client.invoke_model(
                modelId=MODEL_ID,
                body=json.dumps(request)
            )
            response_body = json.loads(response["body"].read())
            summary = response_body["content"][0]["text"]
            return {"summary": summary}
        except ClientError as e:
            if e.response['Error']['Code'] == 'ThrottlingException':
                print(f"ThrottlingException encountered. Retrying in {backoff_time} seconds...")
                time.sleep(backoff_time + random.uniform(0, 1))  # Add jitter
                backoff_time *= 2  # Exponential backoff
            else:
                print(f"Error generating summary for {title}: {e}")
                break
    print(f"Failed to generate summary for {title} after {max_retries} retries.")
    return {"summary": ""}


def count_words(text):
    """Count words excluding punctuation and special characters."""
    cleaned_text = re.sub(r'[^\w\s]', ' ', text.lower())
    words = [word for word in cleaned_text.split() if word.strip()]
    return len(words)


def main():
    # Download papers
    papers = download_papers()
    print(f"\nDownloaded {len(papers)} papers. Generating summaries...\n")

    # Process papers and generate summaries
    paper_data = []
    for paper in papers:
        title = paper["title"]
        pdf_path = paper["file_path"]
        print(f"Processing: {title}")
        paper_text = extract_text(pdf_path)
        summary_json = generate_summary_with_claude(paper_text, title)
        summary_text = summary_json.get('summary', '')
        word_count = count_words(summary_text)
        paper_data.append({
            "title": title,
            "file_path": pdf_path,
            "summary": summary_text,
            "word_count": word_count,
            "arxiv_id": paper["arxiv_id"]
        })
        sleep(5)  # Wait before processing the next paper to avoid throttling

    # Save to JSONL file
    output_file = "paper_summaries.jsonl"
    with open(output_file, "w") as f:
        for entry in paper_data:
            json.dump(entry, f)
            f.write("\n")
    print(f"\nProcessed {len(paper_data)} papers. Results saved to {output_file}")


if __name__ == "__main__":
    main()
The script begins by setting up a directory to store downloaded research papers and initializing the Amazon Bedrock client for accessing models like Anthropic’s Claude. It defines fixed questions for guiding the summarization process, ensuring consistent and targeted outputs across all papers. These questions address key aspects such as the research objectives, methodologies, datasets, and findings, serving as a framework for generating structured and comprehensive summaries.
The script automates the process of downloading research papers using the arxiv library. Specific search queries are used to filter papers on machine learning and AI topics. The function download_papers retrieves papers, ensuring that only those within a specified page limit are processed. Extracted PDFs are stored locally, and text is extracted using the PyMuPDF library.
For each downloaded paper, the generate_summary_with_claude function sends the extracted text to the Claude model via Amazon Bedrock, using a structured prompt designed to elicit a 300-word summary. The prompt emphasizes clarity and coherence, encouraging the model to summarize the research while addressing predefined questions. The script includes a retry mechanism with increasing wait times to handle cases where the Bedrock API is temporarily overloaded, ensuring smooth and reliable communication with the service.
The script takes each paper, extracts its title and content, generates a summary using the Claude model, and combines these elements into a well-organized dataset containing the title, full text, and the corresponding summary. Summaries are stored in JSONL format, enabling easy retrieval and use later on for our evaluation workflows.
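For reference, each line of paper_summaries.jsonl is a self-contained JSON object; a minimal sketch for loading it back into memory looks like this:

import json

with open("paper_summaries.jsonl") as f:
    papers = [json.loads(line) for line in f]

# Each record carries the fields written by the script above
print(list(papers[0].keys()))  # ['title', 'file_path', 'summary', 'word_count', 'arxiv_id']
print(papers[0]["title"], "-", papers[0]["word_count"], "words")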
Evaluating Llama models with Weave for summarization
To effectively compare the performance of multiple Llama models available through Amazon Bedrock, we employ W&B Weave as the evaluation framework. Of course, you can extend this to run comparisons on any models you're interested in exploring.
The Weave platform provides a structured and efficient system for benchmarking by enabling detailed analysis of model outputs against predefined scoring metrics. In this evaluation, we compare three distinct models:
- Llama-1B, a lightweight model designed for cost-effective and high-throughput applications;
- Llama-8B, a balanced model offering strong performance and efficiency; and
- Llama-11B, a high-capacity model optimized for generating detailed and comprehensive outputs.
This setup ensures a repeatable process for identifying trade-offs and strengths between these LLMs, providing insights into their suitability for different summarization tasks.
Here's the evaluation script:
import weave
from weave import Model
import json
import boto3
from botocore.exceptions import ClientError
from time import sleep
import asyncio
from rouge_score.rouge_scorer import RougeScorer
from typing import Dict, Any
import bert_score
import fitz
import os
from weave.trace.box import unbox
import time

# Initialize Weave
weave.init('bedrock_abstract_eval')

client = boto3.client("bedrock-runtime", region_name="us-east-1")


def extract_paper_text(pdf_path: str) -> str:
    """Extract text from PDF paper."""
    if isinstance(pdf_path, weave.trace.box.BoxedStr):
        pdf_path = unbox(pdf_path)
    text = ""
    try:
        with fitz.open(unbox(pdf_path)) as pdf:
            for page in pdf:
                text += page.get_text()
    except Exception as e:
        print(f"Error extracting text from PDF {pdf_path}: {e}")
    return text


def format_prompt(text: str, title: str) -> str:
    """Format prompt for model."""
    return f"""<|begin_of_text|><|start_header_id|>user<|end_header_id|>
Please analyze this research paper and provide a comprehensive 300-word summary that covers:
- Primary research objective
- Methodology and approach
- Key findings and results
- Main contributions to the field

Title: {title}
Content: {text}

Please analyze this research paper and provide a comprehensive 300-word summary that covers:
- Primary research objective
- Methodology and approach
- Key findings and results
- Main contributions to the field

Generate a clear, coherent summary that captures the essence of the research.
<|eot_id|><|start_header_id|>assistant<|end_header_id|>"""


def model_forward(model_id: str, title: str, pdf_path: str) -> str:
    """Core prediction logic to be called by predict methods."""
    max_retries = 15
    backoff_time = 10  # Start with a 10-second delay
    for attempt in range(max_retries):
        try:
            # Extract text from paper
            paper_text = extract_paper_text(pdf_path)

            # Prepare request
            request = {
                "prompt": format_prompt(paper_text, title),
                "max_gen_len": 4096,
                "temperature": 0.0,
            }

            print(f"Invoking model (Attempt {attempt + 1}/{max_retries})...")

            # Make prediction
            response = client.invoke_model(
                modelId=model_id,
                body=json.dumps(request)
            )
            print("Done invoking")

            # Extract and clean prediction
            response_body = json.loads(response["body"].read())
            prediction = response_body["generation"].strip()
            return prediction
        except ClientError as e:
            if e.response['Error']['Code'] == 'ThrottlingException':
                print(f"ThrottlingException encountered. Retrying in {backoff_time} seconds...")
                time.sleep(backoff_time)
                backoff_time *= 2  # Exponential backoff
            else:
                print(f"Error generating prediction with {model_id}: {e}")
                break
        except Exception as e:
            print(f"Unexpected error: {e}")
            break
    print(f"Failed to generate prediction after {max_retries} retries.")
    return ""


class Llama8B(Model):
    @weave.op
    def predict(self, title: str, pdf_path: str) -> dict:
        prediction = model_forward("us.meta.llama3-1-8b-instruct-v1:0", title, pdf_path)
        return {"model_output": prediction}


class Llama11B(Model):
    """Llama 11B model."""
    @weave.op
    def predict(self, title: str, pdf_path: str) -> dict:
        """Generate prediction using Llama 11B."""
        prediction = model_forward("us.meta.llama3-2-11b-instruct-v1:0", title, pdf_path)
        return {"model_output": prediction}


class Llama1B(Model):
    """Llama 1B model."""
    @weave.op
    def predict(self, title: str, pdf_path: str) -> dict:
        """Generate prediction using Llama 1B."""
        prediction = model_forward("us.meta.llama3-2-1b-instruct-v1:0", title, pdf_path)
        return {"model_output": prediction}


@weave.op
def bert_scorer(gt_abstract: str, model_output: dict) -> Dict[str, float]:
    """Calculate BERTScore for the abstract."""
    if not model_output or 'model_output' not in model_output:
        return {'bert_score': 0.0}
    try:
        P, R, F1 = bert_score.score(
            [model_output['model_output']],
            [gt_abstract],
            lang='en',
            model_type='microsoft/deberta-xlarge-mnli'
        )
        return {'bert_score': float(F1.mean())}
    except Exception as e:
        print(f"Error calculating BERTScore: {e}")
        return {'bert_score': 0.0}


@weave.op
def claude_scorer(gt_abstract: str, model_output: dict) -> dict:
    """Evaluate abstract using Claude."""
    if not model_output or 'model_output' not in model_output:
        return {'claude_score': 0.0}
    print("claude evaluating")
    # client = boto3.client("bedrock-runtime", region_name="us-east-1")
    prompt = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,
        "temperature": 0.0,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": f'''Rate how well this generated abstract captures the key information from the ground truth abstract on a scale from 1-5, where:
1: Poor - Missing most key information or seriously misrepresenting the research
2: Fair - Captures some information but misses crucial elements
3: Good - Captures most key points but has some gaps or inaccuracies
4: Very Good - Accurately captures nearly all key information with minor omissions
5: Excellent - Perfectly captures all key information and maintains accuracy

Ground Truth Abstract:
{gt_abstract}

Generated Abstract:
{model_output["model_output"]}

Provide your rating as a JSON object with this schema:
{{"score": <integer 1-5>}}'''
                    }
                ]
            }
        ]
    })
    try:
        response = client.invoke_model(
            modelId="anthropic.claude-3-sonnet-20240229-v1:0",
            body=prompt
        )
        result = json.loads(response["body"].read())
        score = json.loads(result["content"][0]["text"])["score"]
        print(score)
        sleep(2)  # Rate limiting
        return {'claude_score': float(score)}
    except Exception as e:
        print(f"Error in Claude evaluation: {e}")
        return {'claude_score': 0.0}


@weave.op
def rouge_scorer(gt_abstract: str, model_output: dict) -> Dict[str, float]:
    """Calculate ROUGE scores for the abstract."""
    if not model_output or 'model_output' not in model_output:
        return {'rouge1_f': 0.0, 'rouge2_f': 0.0, 'rougeL_f': 0.0}
    try:
        scorer = RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
        scores = scorer.score(gt_abstract, model_output['model_output'])
        return {
            'rouge1_f': float(scores['rouge1'].fmeasure),
            'rouge2_f': float(scores['rouge2'].fmeasure),
            'rougeL_f': float(scores['rougeL'].fmeasure)
        }
    except Exception as e:
        print(f"Error calculating ROUGE scores: {e}")
        return {'rouge1_f': 0.0, 'rouge2_f': 0.0, 'rougeL_f': 0.0}


@weave.op
def compression_scorer(gt_abstract: str, model_output: dict) -> Dict[str, float]:
    """Calculate compression ratio of the abstract."""
    if not model_output or 'model_output' not in model_output:
        return {'compression_ratio': 0.0}
    try:
        gt_words = len(gt_abstract.split())
        generated_words = len(model_output['model_output'].split())
        compression_ratio = min(gt_words, generated_words) / max(gt_words, generated_words)
        return {'compression_ratio': float(compression_ratio)}
    except Exception as e:
        print(f"Error calculating compression ratio: {e}")
        return {'compression_ratio': 0.0}


@weave.op
def coverage_scorer(gt_abstract: str, model_output: dict) -> Dict[str, float]:
    """Calculate content coverage using word overlap."""
    if not model_output or 'model_output' not in model_output:
        return {'coverage_score': 0.0}
    try:
        gt_words = set(gt_abstract.lower().split())
        generated_words = set(model_output['model_output'].lower().split())
        intersection = len(gt_words.intersection(generated_words))
        union = len(gt_words.union(generated_words))
        coverage_score = intersection / union if union > 0 else 0.0
        return {'coverage_score': float(coverage_score)}
    except Exception as e:
        print(f"Error calculating coverage score: {e}")
        return {'coverage_score': 0.0}


def create_evaluation_dataset(gt_file: str):
    """Create dataset from ground truth file."""
    dataset = []
    with open(gt_file, 'r') as f:
        for line in f:
            entry = json.loads(line)
            dataset.append({
                "title": entry["title"],
                "gt_abstract": entry["summary"],
                "pdf_path": entry["file_path"]
            })
    return dataset


async def run_evaluations(gt_file: str):
    """Run evaluations for each model."""
    eval_dataset = create_evaluation_dataset(gt_file)

    # Initialize models
    models = {
        "llama_8b": Llama8B(),
        "llama_11b": Llama11B(),
        "llama_1b": Llama1B()
    }

    # Setup scorers
    scorers = [
        claude_scorer,
        rouge_scorer,
        compression_scorer,
        coverage_scorer,
        bert_scorer
    ]

    # Run evaluations
    results = {}
    for model_name, model in models.items():
        print(f"\nEvaluating {model_name}...")
        evaluation = weave.Evaluation(
            dataset=eval_dataset,
            scorers=scorers,
            name=model_name + " Eval"
        )
        results[model_name] = await evaluation.evaluate(model)

    # Print results
    print("\nEvaluation Results:")
    for model_name, result in results.items():
        print(f"\n{model_name} Results:")
        print(json.dumps(result, indent=2))

    # Save results to file
    output_file = "llama_evaluation_results.json"
    with open(output_file, 'w') as f:
        json.dump(results, f, indent=2)
    print(f"\nResults saved to {output_file}")

    return results


if __name__ == "__main__":
    gt_file = "paper_summaries.jsonl"
    asyncio.run(run_evaluations(gt_file))
We start by loading the pre-generated dataset of ground-truth summaries from the JSONL file we created earlier with Claude. This dataset contains research paper titles, file paths to the full text, and the Claude-generated reference summaries that serve as benchmarks for evaluation. Loading the dataset keeps the workflow focused on analyzing model outputs rather than duplicating effort in data preparation.
Once the dataset is loaded, the script processes each entry, sending the text of research papers to different Llama models for summarization. The outputs are then compared against the ground-truth summaries using a range of evaluation metrics.
These metrics include:
- ROUGE scores to measure lexical overlap,
- BERTScore for semantic similarity,
- Compression ratios for assessing conciseness, and
- Coverage metrics to evaluate the amount of retained information.
Additionally, a scoring function powered by Anthropic's Claude rates the summaries for human-like quality and alignment with the original content (a short standalone sketch of the string-based metrics follows below).
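To make the string-based metrics concrete, here is a minimal sketch on toy strings, using the same rouge_score library and the same Jaccard-style coverage calculation as the evaluation script above:

from rouge_score.rouge_scorer import RougeScorer

reference = "The paper proposes a transformer model for medical image segmentation."
candidate = "This work introduces a transformer-based model to segment medical images."

# Lexical overlap: ROUGE F-measures, as computed in rouge_scorer above
scorer = RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)
print({name: round(score.fmeasure, 3) for name, score in scores.items()})

# Coverage: word-level Jaccard overlap, as computed in coverage_scorer above
ref_words = set(reference.lower().split())
cand_words = set(candidate.lower().split())
print(round(len(ref_words & cand_words) / len(ref_words | cand_words), 3))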
After scoring the outputs, Weave’s dynamic dashboard can be used to visualize the results. This interface allows for an in-depth comparison of metrics across models and examples, helping to identify trends and trade-offs, ultimately empowering developers to make informed decisions about which model best fits their specific requirements.
By combining the model diversity of Amazon Bedrock with the benchmarking capabilities of W&B Weave, this workflow delivers a robust and scalable solution for evaluating LLMs on summarization tasks:


The results show that Llama-8B outperforms both Llama-1B and Llama-11B across several evaluation metrics, including the Claude Score, ROUGE-1, ROUGE-2, and Coverage Score. This demonstrates its ability to generate summaries that align more closely with the ground truth, retaining key content and ensuring structural coherence.
Llama-11B, however, shows slight advantages over Llama-8B in ROUGE-L, Compression Ratio, and BERTScore. Llama-1B, on the other hand, falls behind both Llama-8B and Llama-11B across all metrics, affirming its role as a lightweight model optimized for efficiency rather than maximum performance.
Overall, Llama-8B and Llama-11B perform quite well in this evaluation, excelling in critical dimensions while maintaining reasonable latency and efficiency. One nice feature of Weave Evaluations is that it lets you dive deeper into the exact responses from each model using the comparison view. For example, we can compare the responses of each model side by side, and against the ground truth, in a single interface.
Here's a screenshot of the comparison view:

Llama vs. Amazon Nova
Amazon recently unveiled the Nova series, a diverse lineup of AI models aimed at meeting a wide range of needs. Nova Micro focuses on speed and affordability, making it ideal for straightforward tasks like summarization and translation, while Nova Lite handles multimodal input, including text, images, and video, for real-time analysis. Nova Pro strikes a balance between cost and performance, excelling in complex reasoning and multimodal workflows. Meanwhile, Nova Premier, expected to launch in early 2025, promises to handle the most intricate challenges with advanced capabilities. Let's see how these Nova models stack up against Llama.
We will write an evaluation script similar to the one above to test Amazon Nova Pro against the previous Llama models. Note that since we used Weave Evaluations, we can simply write a new script using just the Nova Pro model and then select the previous evaluations to compare against later in the Weave dashboard. Here's the code:
import weave
from weave import Model
import json
import boto3
from botocore.exceptions import ClientError
from time import sleep
import asyncio
from rouge_score.rouge_scorer import RougeScorer
from typing import Dict, Any
import bert_score
import fitz
import os
from weave.trace.box import unbox
import time
import logging

logging.basicConfig(level=logging.DEBUG)

# Initialize Weave
weave.init('bedrock_abstract_eval')

client = boto3.client(service_name="bedrock-runtime", region_name="us-east-1")


def extract_paper_text(pdf_path: str) -> str:
    """Extract text from PDF paper."""
    if isinstance(pdf_path, weave.trace.box.BoxedStr):
        pdf_path = unbox(pdf_path)
    text = ""
    try:
        with fitz.open(unbox(pdf_path)) as pdf:
            for page in pdf:
                text += page.get_text()
    except Exception as e:
        print(f"Error extracting text from PDF {pdf_path}: {e}")
    return text


def format_prompt(text: str, title: str) -> str:
    """Format prompt for model."""
    return f"""<|begin_of_text|><|start_header_id|>user<|end_header_id|>
Please analyze this research paper and provide a comprehensive 300-word summary that covers:
- Primary research objective
- Methodology and approach
- Key findings and results
- Main contributions to the field

Title: {title}
Content: {text}

Please analyze this research paper and provide a comprehensive 300-word summary that covers:
- Primary research objective
- Methodology and approach
- Key findings and results
- Main contributions to the field

Generate a clear, coherent summary that captures the essence of the research.
<|eot_id|><|start_header_id|>assistant<|end_header_id|>"""


def model_forward_nova(title: str, pdf_path: str) -> str:
    """Core prediction logic for Amazon Nova Pro."""
    max_retries = 15
    backoff_time = 10  # Start with a 10-second delay
    for attempt in range(max_retries):
        try:
            paper_text = extract_paper_text(pdf_path)
            messages = [
                {"role": "user", "content": [{"text": format_prompt(paper_text, title)}]},
            ]

            # Make prediction
            print(f"Invoking Nova Pro model (Attempt {attempt + 1}/{max_retries})...")
            response = client.converse(
                modelId="us.amazon.nova-pro-v1:0",
                messages=messages
            )
            prediction = response["output"]["message"]["content"][0]["text"].strip()
            print("Done invoking")
            return prediction
        except Exception as e:
            print(f"Error generating prediction with Nova Pro (Attempt {attempt + 1}/{max_retries}): {e}")
            if attempt < max_retries - 1:  # Avoid sleeping on the last attempt
                time.sleep(backoff_time)
                backoff_time *= 2  # Exponential backoff
    print(f"Failed to generate prediction after {max_retries} retries.")
    return ""


class NovaPro(Model):
    """Amazon Nova Pro model."""
    @weave.op
    def predict(self, title: str, pdf_path: str) -> dict:
        """Generate prediction using Amazon Nova Pro."""
        prediction = model_forward_nova(title, pdf_path)
        return {"model_output": prediction}


@weave.op
def bert_scorer(gt_abstract: str, model_output: dict) -> Dict[str, float]:
    """Calculate BERTScore for the abstract."""
    if not model_output or 'model_output' not in model_output:
        return {'bert_score': 0.0}
    try:
        P, R, F1 = bert_score.score(
            [model_output['model_output']],
            [gt_abstract],
            lang='en',
            model_type='microsoft/deberta-xlarge-mnli'
        )
        return {'bert_score': float(F1.mean())}
    except Exception as e:
        print(f"Error calculating BERTScore: {e}")
        return {'bert_score': 0.0}


@weave.op
def rouge_scorer(gt_abstract: str, model_output: dict) -> Dict[str, float]:
    """Calculate ROUGE scores for the abstract."""
    if not model_output or 'model_output' not in model_output:
        return {'rouge1_f': 0.0, 'rouge2_f': 0.0, 'rougeL_f': 0.0}
    try:
        scorer = RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
        scores = scorer.score(gt_abstract, model_output['model_output'])
        return {
            'rouge1_f': float(scores['rouge1'].fmeasure),
            'rouge2_f': float(scores['rouge2'].fmeasure),
            'rougeL_f': float(scores['rougeL'].fmeasure)
        }
    except Exception as e:
        print(f"Error calculating ROUGE scores: {e}")
        return {'rouge1_f': 0.0, 'rouge2_f': 0.0, 'rougeL_f': 0.0}


@weave.op
def compression_scorer(gt_abstract: str, model_output: dict) -> Dict[str, float]:
    """Calculate compression ratio of the abstract."""
    if not model_output or 'model_output' not in model_output:
        return {'compression_ratio': 0.0}
    try:
        gt_words = len(gt_abstract.split())
        generated_words = len(model_output['model_output'].split())
        compression_ratio = min(gt_words, generated_words) / max(gt_words, generated_words)
        return {'compression_ratio': float(compression_ratio)}
    except Exception as e:
        print(f"Error calculating compression ratio: {e}")
        return {'compression_ratio': 0.0}


@weave.op
def coverage_scorer(gt_abstract: str, model_output: dict) -> Dict[str, float]:
    """Calculate content coverage using word overlap."""
    if not model_output or 'model_output' not in model_output:
        return {'coverage_score': 0.0}
    try:
        gt_words = set(gt_abstract.lower().split())
        generated_words = set(model_output['model_output'].lower().split())
        intersection = len(gt_words.intersection(generated_words))
        union = len(gt_words.union(generated_words))
        coverage_score = intersection / union if union > 0 else 0.0
        return {'coverage_score': float(coverage_score)}
    except Exception as e:
        print(f"Error calculating coverage score: {e}")
        return {'coverage_score': 0.0}


@weave.op
def claude_scorer(gt_abstract: str, model_output: dict) -> dict:
    """Evaluate abstract using Claude."""
    if not model_output or 'model_output' not in model_output:
        return {'claude_score': 0.0}
    print("claude evaluating")
    # client = boto3.client("bedrock-runtime", region_name="us-east-1")
    prompt = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,
        "temperature": 0.0,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": f'''Rate how well this generated abstract captures the key information from the ground truth abstract on a scale from 1-5, where:
1: Poor - Missing most key information or seriously misrepresenting the research
2: Fair - Captures some information but misses crucial elements
3: Good - Captures most key points but has some gaps or inaccuracies
4: Very Good - Accurately captures nearly all key information with minor omissions
5: Excellent - Perfectly captures all key information and maintains accuracy

Ground Truth Abstract:
{gt_abstract}

Generated Abstract:
{model_output["model_output"]}

Provide your rating as a JSON object with this schema:
{{"score": <integer 1-5>}}'''
                    }
                ]
            }
        ]
    })
    try:
        response = client.invoke_model(
            modelId="anthropic.claude-3-sonnet-20240229-v1:0",
            body=prompt
        )
        result = json.loads(response["body"].read())
        score = json.loads(result["content"][0]["text"])["score"]
        print(score)
        sleep(2)  # Rate limiting
        return {'claude_score': float(score)}
    except Exception as e:
        print(f"Error in Claude evaluation: {e}")
        return {'claude_score': 0.0}


def create_evaluation_dataset(gt_file: str):
    """Create dataset from ground truth file."""
    dataset = []
    with open(gt_file, 'r') as f:
        for line in f:
            entry = json.loads(line)
            dataset.append({
                "title": entry["title"],
                "gt_abstract": entry["summary"],
                "pdf_path": entry["file_path"]
            })
    return dataset


async def run_evaluations(gt_file: str):
    """Run evaluations for each model."""
    eval_dataset = create_evaluation_dataset(gt_file)

    # Initialize models
    models = {
        "nova_pro": NovaPro(),
    }

    # Setup scorers
    scorers = [
        claude_scorer,
        rouge_scorer,
        compression_scorer,
        coverage_scorer,
        bert_scorer
    ]

    # Run evaluations
    results = {}
    for model_name, model in models.items():
        print(f"\nEvaluating {model_name}...")
        evaluation = weave.Evaluation(
            dataset=eval_dataset,
            scorers=scorers,
            name=model_name + " Eval"
        )
        results[model_name] = await evaluation.evaluate(model)

    # Print results
    print("\nEvaluation Results:")
    for model_name, result in results.items():
        print(f"\n{model_name} Results:")
        print(json.dumps(result, indent=2))

    # Save results to file
    output_file = "llama_vs_nova_evaluation_results.json"
    with open(output_file, 'w') as f:
        json.dump(results, f, indent=2)
    print(f"\nResults saved to {output_file}")

    return results


if __name__ == "__main__":
    gt_file = "paper_summaries.jsonl"
    asyncio.run(run_evaluations(gt_file))
Here are the results as seen in the Weave Evaluations comparisons dashboard:


The results indicate that Nova Pro generally outperforms the two Llama-based models on multiple evaluation metrics. Nova Pro’s Claude Score of 4.8 surpasses Llama-8B (4.6) and Llama-11B (4.4), suggesting its summaries align more closely with human-like judgment as reflected by the Claude evaluation criteria. It also excels in ROUGE metrics, achieving a ROUGE-1 score of 0.6308, a ROUGE-2 score of 0.3038, and a ROUGE-L score of 0.3543—higher than both Llama models.
Although Nova Pro’s compression ratio (0.8198) is slightly lower than Llama-8B (0.822) and Llama-11B (0.8233), it compensates with stronger content retention, as evidenced by a coverage score of 0.31, exceeding Llama-8B’s 0.2819 and Llama-11B’s 0.2791. Nova Pro also attains the highest BERTScore of 0.7252, reflecting a closer semantic match to the reference summaries.
It’s important to note that while metrics like ROUGE and BERTScore capture lexical and semantic overlap, they do not fully measure readability, coherence, or overall user satisfaction. The Claude Score provides a more human-aligned benchmark, helping to highlight Nova Pro’s ability to produce well-structured, coherent, and appealing summaries.
Overall, Nova Pro’s strong showing across these metrics—combined with its superior alignment with human evaluators—makes it a compelling choice for tasks that prioritize both technical precision and human-centered quality.
Why choose Amazon Bedrock?
Amazon Bedrock combines a diverse selection of foundation models with seamless integration into the AWS ecosystem, making it a powerful platform for generative AI. It offers access to models from leading providers like Anthropic, Meta, AI21 Labs, Mistral, and Amazon’s Titan, allowing users to tailor their model choice to tasks like text summarization, embedding generation, or multimodal processing. By integrating with AWS services, Bedrock simplifies workflows, leverages trusted infrastructure, and enables scalable deployments without the complexity of managing underlying systems.
With a focus on scalability, security, and reliability, Bedrock ensures businesses can easily adjust resources to meet growing demands while maintaining data protection through advanced encryption and compliance with industry standards. Backed by AWS’s robust global infrastructure, it delivers consistent performance and uptime, making it an ideal solution for organizations seeking a dependable and flexible platform for their AI initiatives.
Conclusion
Amazon Bedrock and W&B Weave represent a compelling combination for evaluating and comparing the performance of large language models in diverse use cases such as text summarization. Bedrock’s expansive array of foundation models, coupled with its integration into the AWS ecosystem, provides flexibility and scalability for businesses seeking robust AI solutions.
By leveraging Weave’s sophisticated benchmarking and visualization capabilities, organizations can methodically analyze the trade-offs between models, gaining insights that go beyond raw metrics to understand practical strengths and limitations. As AI technologies continue to evolve, platforms like Bedrock and Weave pave the way for businesses to harness the power of generative AI, ensuring they remain competitive in a data-driven world.
Related Articles
Building an LLM Python debugger agent with the new Claude 3.5 Sonnet
Building an AI-powered coding agent with Claude 3.5 Sonnet!
Training a KANFormer: KAN's Are All You Need?
We will dive into a new experimental architecture, replacing the MLP layers in transformers with KAN layers!
Building reliable apps with GPT-4o and structured outputs
Learn how to enforce consistency on GPT-4o outputs, and build reliable Gen-AI Apps.
How to train and evaluate an LLM router
This tutorial explores LLM routers, inspired by the RouteLLM paper, covering training, evaluation, and practical use cases for managing LLMs effectively.