Automated PDF summarization of arXiv papers with Claude 3.5 Sonnet and W&B Weave

Learn how to build an automated PDF summarization system for arXiv papers with Anthropic's API, W&B Weave, and Chain of Density summarization.
Anish Shah
Created on July 24|Last edited on July 30
As the pace of research accelerates, keeping up with machine learning research is increasingly difficult. Thankfully, we have arXiv to help. 
It's of course impossible to read every paper that gets published there. That's why our goal today is to build an automated PDF summarization and question-answering system tailored for arXiv papers, using W&B Weave and Anthropic's API.
You can follow along with this interactive Colab and also check it out in the Weave project here.
We'll also be showcasing some powerful new Weave features like our evaluation comparison tool: 
﻿
Before you dig into the implementation details, here's what our model produces as an output (with some light formatting for easier readability):
﻿
﻿Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?﻿
SliCK addresses limitations in LLM knowledge integration through a granular four-category system (HighlyKnown, MaybeKnown, WeaklyKnown, Unknown) based on PCorrect, surpassing P(True) (Kadavath et al., 2022) with nuanced understanding of knowledge types. Multi-exemplar sampling (Nex=10, 4-shot prompts, Nsample=16, T=0.5) improves Unknown example identification, achieving 2% post-fine-tuning accuracy at 35% Unknown vs. P(True) methods.
Controlled closed-book QA with disjoint train/test splits addresses capability misalignment (Huang et al., 2023), enabling precise examination of knowledge integration. Exact Match evaluation mitigates paraphrase detection issues. Linear regression model (Accuracy = β0 + βkn * (Nkn / |D|) + βunk * (Nunk / |D|)) quantifies new knowledge impact (R² = 0.86 in-distribution, 0.95 OOD), surpassing qualitative assessments.
Out-of-distribution evaluation using 7 unseen relations reveals similar trends in Unknown example impact (6-point OOD drop vs. 14-point in-distribution), challenging the superficial alignment hypothesis (Zhou et al., 2023). Analysis of training dynamics shows slower fitting of Unknown examples (25% vs. 75% Known at early stopping), addressing gaps in LLM knowledge acquisition literature.
Category-specific fine-tuning demonstrates MaybeKnown examples' superiority (43.6% accuracy) over HighlyKnown-only approaches (40.5%), challenging simplistic knowledge category assumptions. Uncertainty expression relabeling mitigates overfitting (61.8% accuracy maintained vs. 43.0% to 38.8% drop in standard fine-tuning).
SliCK's advantages include: granular categorization, robust multi-exemplar sampling, controlled experimental design, quantitative impact analysis, comprehensive in-distribution and OOD evaluation, practical mitigation strategies (early stopping, filtering), and uncertainty expression integration. These advancements offer a precise, empirically-grounded approach to improving LLM performance and reliability, addressing current challenges in knowledge integration and fine-tuning practices.
﻿
With that preamble out of the way, let's introduce the tools we'll be using: 
Tools for our PDF summarization project
W&B Weave: An LLMOps platform for seamless integration of LLM components and efficient data lineage tracking.
Anthropic's API: Providing advanced language understanding and generation capabilities.
arXiv API: For retrieving relevant papers based on user queries.
PDF Processing: Extracting textual and visual content from research papers.
Chain-of-Density Summarization: A technique for generating increasingly dense and informative summaries.
Weave automatically captures the input Request and output Response objects from the Anthropic API, making experimentation as simple as import weave; weave.init(...).
Goals for our project: We want to create a pipeline that can: 
Fetch relevant arXiv papers using user-defined queries that are expanded through LLM query enhancement.
Extract and process text and images from PDFs, leveraging the LLM’s vision capabilities to convert image content to text.
Generate concise, technical summaries of papers using multiple layers of iterative refinement conditioned on specific user instruction.
Evaluate the condensed answer using an LLM as a judge, so we can optimize parameters such as model choice, and understand at what points the iterative refinement using Chain of Density ceases to be gainful.
Claude 3.5 Sonnet
Claude 3.5 Sonnet, Anthropic's latest model (as of July 11, 2024), significantly improves performance and computational efficiency over its predecessors, operating at twice the speed of Claude 3 Opus. It excels in complex tasks, such as advanced reasoning and coding, and includes enhanced vision capabilities for interpreting charts, graphs, and transcribing text from images, which is essential for various technical applications. This is ideal for our use case.
What is Chain of Density summarization?
Chain of Density (CoD) summarization is a prompting technique that starts from a short, entity-sparse summary and repeatedly rewrites it to fold in salient entities that are still missing, without letting the summary grow longer. Each pass therefore raises the "density" of the summary: the amount of significant content per word.
Here's how it works:
Initial summary creation: First, start with an entity-sparse summary. This initial summary gets generated using a large language model (LLM) like GPT-4 but includes minimal details.
Identification of missing entities: Next, review the initial summary to identify key details or entities that are missing. These entities are critical for understanding the main content of the text.
Chained prompts: Construct additional prompts to iteratively increase the information density of the summary. Importantly, each prompt aims to add more entities without increasing the summary length.
Execution and refinement: Feed these chained prompts back to the LLM to produce denser summaries. This process continues iteratively, refining the summary until it is both concise and rich in information (hence the name).
Final review: The final summary should capture all essential points while maintaining readability and coherence. The process balances informativeness with clarity, ensuring that the summary is not overly dense or hard to follow.
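The steps above boil down to a simple loop. Here is a minimal sketch in which a toy function stands in for the LLM call; chain_of_density and toy_rewrite are hypothetical names used only for illustration, and the toy rewriter simply treats capitalized words as entities.

```python
from typing import Callable, List


def chain_of_density(document: str, rewrite: Callable[[str, str], str], steps: int = 3) -> List[str]:
    """Run `steps` densification passes; `rewrite(document, summary)` stands in for an LLM call."""
    summaries, summary = [], ""
    for _ in range(steps):
        summary = rewrite(document, summary)
        summaries.append(summary)
    return summaries


def toy_rewrite(document: str, summary: str, length: int = 6) -> str:
    """Toy stand-in for the LLM: add one missing 'entity' and pad with filler to a fixed length."""
    entities = [w for w in document.split() if w[0].isupper()]
    kept = [w for w in summary.split() if w != "filler"]
    missing = [e for e in entities if e not in kept]
    kept += missing[:1]  # fold in one new entity per pass
    return " ".join(kept + ["filler"] * (length - len(kept)))


doc = "SliCK sorts facts into HighlyKnown MaybeKnown Unknown"
for step, text in enumerate(chain_of_density(doc, toy_rewrite), start=1):
    print(step, text)
```

Each pass keeps the word count fixed while swapping filler for entities, which is the property the real prompts later in this article try to enforce.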
Why we're using chain of density summarization
This method should let us create summaries that are both dense and concise, making it ideal for tasks that require detailed and precise information extraction, which is exactly what summarizing academic papers from arXiv requires.
Let's build the pipeline. 
﻿
How to set up Weave and Anthropic
We'll begin by initializing our Weave project to enable comprehensive tracing of our PDF pre-processing and summarization pipeline:
import weave
weave.init("arxiv-paper-summarization-anthropic")
This initialization sets up Weave's experiment tracking capabilities, allowing us to log and analyze function inputs and outputs, as well as pipeline performance (including cost and speed).
Next, let's initialize the Anthropic client:
import anthropic
import os

# Reads the API key from the ANTHROPIC_API_KEY environment variable
client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
Weave is integrated with Anthropic so it automatically logs LLM requests and responses from the Anthropic client, providing valuable insights into token usage, associated costs, and the content of requests/responses. This integration facilitates detailed analysis of our model's performance and resource utilization.
Optional: How to fetch arXiv papers using the API
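The contents of this optional section are not reproduced here. As a stand-in, here is a minimal sketch of one way to query the public arXiv API (an Atom feed served from export.arxiv.org) using only the standard library. The helper names build_query_url and parse_feed are hypothetical, the returned dictionaries cover only a few of the fields used later in this article, and this is not necessarily how the original notebook fetches papers.

```python
import urllib.parse
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = {"atom": "http://www.w3.org/2005/Atom"}


def build_query_url(query: str, max_results: int = 5) -> str:
    """Build an arXiv API search URL for a free-text query."""
    params = {"search_query": f"all:{query}", "start": 0, "max_results": max_results}
    return f"{ARXIV_API}?{urllib.parse.urlencode(params)}"


def parse_feed(atom_xml: str) -> list:
    """Extract id, title, summary, and PDF link from an arXiv Atom feed."""
    papers = []
    for entry in ET.fromstring(atom_xml).findall("atom:entry", ATOM):
        pdf_url = None
        for link in entry.findall("atom:link", ATOM):
            if link.get("title") == "pdf":
                pdf_url = link.get("href")
        papers.append({
            "entry_id": entry.findtext("atom:id", default="", namespaces=ATOM).strip(),
            # Titles in the feed can wrap across lines, so collapse whitespace
            "title": " ".join(entry.findtext("atom:title", default="", namespaces=ATOM).split()),
            "summary": entry.findtext("atom:summary", default="", namespaces=ATOM).strip(),
            "pdf_url": pdf_url,
        })
    return papers


# Fetching requires network access:
# import urllib.request
# with urllib.request.urlopen(build_query_url("chain of density summarization")) as resp:
#     papers = parse_feed(resp.read().decode("utf-8"))
```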
Selecting and serializing our document to summarize
We use Pydantic models to represent arXiv papers, ensuring type safety and easy data validation. Here's how it looks in Weave:
﻿
The code: 
from datetime import datetime, timezone
from pydantic import BaseModel, Field
from typing import List, Optional
﻿
class Author(BaseModel):
    full_name: str
﻿
class Link(BaseModel):
    href: str
    title: Optional[str] = None
    rel: Optional[str] = None
    content_type: Optional[str] = None
﻿
class ArxivPaper(BaseModel):
    entry_id: str
    updated: datetime
    published: datetime
    title: str
    authors: List[Author]
    summary: str
    comment: Optional[str] = None
    journal_ref: Optional[str] = None
    doi: Optional[str] = None
    primary_category: str
    categories: List[str]
    links: List[Link]
    pdf_url: Optional[str] = None
﻿
    def __getitem__(self, key):
        return getattr(self, key)
﻿
arxiv_paper = ArxivPaper(
    entry_id="http://arxiv.org/abs/2406.04744v1",
    updated=datetime(2024, 6, 7, 8, 43, 7, tzinfo=timezone.utc),
    published=datetime(2024, 6, 7, 8, 43, 7, tzinfo=timezone.utc),
    title="CRAG -- Comprehensive RAG Benchmark",
    authors=[Author(full_name="Xiao Yang"), Author(full_name="Kai Sun"), Author(full_name="Hao Xin")],
    summary="The long summary from the paper",
    doi="10.48550/arXiv.2406.04744",
    primary_category="cs.CL",
    categories=["cs.CL"],
    links=[
        Link(href="https://arxiv.org/abs/2406.04744", title="Abstract", rel="alternate"),
        Link(href="https://arxiv.org/pdf/2406.04744", title="pdf", rel="related")
    ],
    pdf_url="https://arxiv.org/pdf/2406.04744"
)
Memory efficient PDF loading and processing
We use PyPDF2.PdfReader for PDF manipulation, which offers a Pythonic interface to PDF content:
import requests
import io
import PyPDF2
﻿
def load_pdf(arxiv_result):
    pdf_url = arxiv_result["pdf_url"]
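    response = requests.get(pdf_url, timeout=30)
    response.raise_for_status()  # fail early on HTTP errors instead of parsing an error page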
    pdf_file = io.BytesIO(response.content)
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    return pdf_reader
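The function above downloads the PDF and wraps its bytes in an in-memory io.BytesIO buffer, so the file never touches disk. Note that requests.get reads the whole response into memory rather than streaming it, which is acceptable for typical arXiv papers but worth revisiting for very large files.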
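For very large PDFs, one option is to read the response in chunks and let the buffer spill to disk once it passes a size threshold. A minimal sketch, assuming a hypothetical helper named spool_chunks and an arbitrary 5 MB threshold (neither is part of the original pipeline):

```python
import tempfile
from typing import IO, Iterable


def spool_chunks(chunks: Iterable[bytes], max_in_memory: int = 5 * 1024 * 1024) -> IO[bytes]:
    """Collect byte chunks in a buffer that stays in memory until it exceeds max_in_memory."""
    buffer = tempfile.SpooledTemporaryFile(max_size=max_in_memory)
    for chunk in chunks:
        if chunk:  # skip empty keep-alive chunks
            buffer.write(chunk)
    buffer.seek(0)  # rewind so the caller can read from the start
    return buffer


# Usage with requests (requires network access):
# with requests.get(pdf_url, stream=True, timeout=30) as response:
#     response.raise_for_status()
#     pdf_file = spool_chunks(response.iter_content(chunk_size=64 * 1024))
# pdf_reader = PyPDF2.PdfReader(pdf_file)
```

Taking an iterable of chunks rather than a URL keeps the helper easy to test without network access.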
Handling vector graphics in PDFs
PDFs can contain vector graphics (figures drawn with PDF drawing operators rather than embedded as raster images) that PyPDF2 doesn't recognize as images. To capture these, the function below takes a "screenshot" of the page whenever it finds a Form XObject, which we treat as a sign that vector graphics are present, and converts it to a PNG that Claude can process:
from pdf2image import convert_from_bytes
from PIL import Image
import base64
﻿
def convert_vector_graphic_page_to_image(pdf_page: PyPDF2.PageObject, scale_factor: float = 0.5) -> Optional[str]:
    def get_object(obj):
        return obj.get_object() if isinstance(obj, PyPDF2.generic.IndirectObject) else obj
﻿
    resources = get_object(pdf_page.get('/Resources', {}))
    xobject = get_object(resources.get('/XObject', {}))
﻿
    for obj in xobject.values():
        obj = get_object(obj)
        if isinstance(obj, dict) and obj.get('/Subtype') == '/Form':  # Indicates a vector graphic
            pdf_bytes = io.BytesIO()
            writer = PyPDF2.PdfWriter()
            writer.add_page(pdf_page)  # add_page does not return the writer, so the calls cannot be chained
            writer.write(pdf_bytes)
            pdf_bytes.seek(0)
﻿
            images = convert_from_bytes(pdf_bytes.getvalue(), fmt='png')
            if images:
                image = images[0]
                new_size = tuple(int(dim * scale_factor) for dim in image.size)
                image = image.resize(new_size, Image.LANCZOS)
                img_byte_arr = io.BytesIO()
                image.save(img_byte_arr, format='PNG')
                img_str = base64.b64encode(img_byte_arr.getvalue()).decode("utf-8")
                data_url = f"data:image/png;base64,{img_str}"
                return data_url
﻿
    return None
Converting images into text descriptions with Claude 3.5 Sonnet
In our pipeline, we leverage Claude's advanced vision capabilities to interpret images from scientific papers. This process is crucial for converting visual information into textual data that can be easily processed and analyzed. We use two distinct approaches for different scenarios: standalone figures and full PDF pages that may contain multiple vector graphics.
﻿
Let's start with the standalone images.
Scenario 1: Standalone images
The prompt below is designed to extract comprehensive, technically accurate information from scientific figures:
1. Structured analysis: The numbered list guides Claude through a systematic examination of the figure, ensuring all crucial aspects are covered.
2. Technical focus: By explicitly requesting a "detailed technical description," we prime Claude to use domain-specific language and avoid superficial observations.
3. Versatility: The prompt is adaptable to various types of scientific figures across different research domains.
4. Quantitative emphasis: Special attention is given to quantitative information, crucial in scientific research.
5. Contextual understanding: By inquiring about methodology, implications, and limitations, we encourage Claude to interpret the figure within the broader research context.
6. Precision instruction: The final instruction pushes Claude to provide specific, scientifically relevant information rather than vague or general observations.
@weave.op()
def process_figure_image(data_url: str, model: str = "claude-3-5-sonnet-20240620") -> str:
    img_str = data_url.split(",")[1]
    prompt = """Analyze this image as if it's a figure from a scientific research paper. Provide a detailed technical description addressing:
    1. Type of figure (e.g., graph, diagram, flowchart, experimental setup)
    2. Key components or variables represented
    3. Relationships or trends depicted
    4. Quantitative information (if present)
    5. Methodology or process illustrated (if applicable)
    6. Potential implications or conclusions that can be drawn
    7. Any limitations or assumptions evident in the figure
    Focus on technical accuracy and relevance to scientific research. Avoid general descriptions and concentrate on the specific scientific content presented."""
﻿
    response = client.messages.create(
        model=model,
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": img_str}},
                {"type": "text", "text": prompt}
            ]
        }]
    )
    return response.content[0].text
Scenario 2: Full PDF pages with vector graphics
The prompt below is tailored for the complex task of extracting information from full PDF pages:
1. Context setting: It informs Claude that the image is a full page from a PDF, setting the stage for analysis.
2. Focused attention: Claude is instructed to identify and focus only on vector graphic figures or charts, crucial for pages with mixed content.
3. Structured analysis per image: A detailed technical analysis is requested for each figure, maintaining consistency with the standalone figure approach.
4. Handling multiple images: The prompt addresses the possibility of multiple figures on a single page, ensuring clear structure in the output.
5. Exclusion instruction: Claude is explicitly told to ignore text and other non-vector graphic elements, focusing solely on the visual data of interest.
6. Technical focus: The final instruction reinforces the need for accurate, technical descriptions, maintaining high scientific relevance.
@weave.op()
def process_vector_image_pdf(data_url: str, model: str = "claude-3-5-sonnet-20240620") -> str:
    img_str = data_url.split(",")[1]
    prompt = """This image is a full page from a scientific paper PDF, converted to PNG format. It may contain one or more vector graphic figures or charts. Your task is to:
    1. Identify and focus solely on the vector graphic figures or charts within the page.
    2. For each identified figure or chart, provide a detailed technical analysis addressing:
       a. Type of figure (e.g., graph, diagram, flowchart)
       b. Key components or variables represented
       c. Relationships or trends depicted
       d. Quantitative information (if present)
       e. Methodology or process illustrated (if applicable)
       f. Potential implications or conclusions that can be drawn
    3. Ignore any text or other elements on the page that are not part of the vector graphic figures.
    4. If multiple figures are present, analyze each separately and clearly indicate which figure you are describing.
    Focus on providing accurate, technical descriptions of the vector graphic content only."""
﻿
    response = client.messages.create(
        model=model,
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": img_str}},
                {"type": "text", "text": prompt}
            ]
        }]
    )
    return response.content[0].text
Both these prompts serve as a crucial bridge between unstructured visual data in scientific papers and structured, analyzable text data. This approach enables:
Extraction of structured, technical information from figures, valuable for tasks like automated literature review or data extraction from research papers.
Creation of rich, multi-modal datasets linking textual descriptions with visual elements in scientific documents.
Potential training of more specialized models for scientific figure interpretation.
Handling of complex, multi-element PDF pages, a common challenge in scientific document processing.
Scalability across various scientific domains, making the system adaptable for different types of research papers.
This ensures that our chain of density summarization process can incorporate information from both textual and visual elements, crucial for our project where research often contains tons of both. 
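Both functions above send the same kind of message: one base64 image block followed by one text block. That shared step could be factored into a helper. The sketch below assumes a hypothetical helper named build_vision_message and reads the media type from the data URL header instead of hardcoding image/png; it only gives correct results if that header matches the encoded bytes.

```python
def build_vision_message(data_url: str, prompt: str) -> list:
    """Build a messages list with one image block and one text block from a base64 data URL."""
    if not data_url.startswith("data:") or "," not in data_url:
        raise ValueError("expected a data URL such as 'data:image/png;base64,...'")
    header, b64_data = data_url.split(",", 1)
    media_type = header[len("data:"):].split(";")[0]  # e.g. "image/png"
    return [{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": media_type, "data": b64_data}},
            {"type": "text", "text": prompt},
        ],
    }]


# The result can be passed as the `messages` argument of client.messages.create(...)
messages = build_vision_message("data:image/png;base64,QUJD", "Describe this figure.")
```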
﻿
﻿
Comprehensive image extraction and processing
We combine all the steps above into a single function that processes an entire PDF:
import filetype
﻿
@weave.op()
def extract_images(paper, model="claude-3-5-sonnet-20240620"):
    """Extract text and images from PDF content."""
    pdf_reader = load_pdf(paper)
    all_images = []
﻿
    aspect_ratio_sizes = {
        (1, 1): (1092, 1092),
        (3, 4): (951, 1268),
        (2, 3): (896, 1344),
        (9, 16): (819, 1456),
        (1, 2): (784, 1568)
    }
﻿
    def get_closest_aspect_ratio(width, height):
        img_ratio = width / height
        return min(aspect_ratio_sizes.keys(), key=lambda x: abs(x[0]/x[1] - img_ratio))
﻿
    for page in pdf_reader.pages:
        images = []
﻿
        for image in page.images:
            img_data = image.data
            kind = filetype.guess(img_data)
            if kind is None:
                print("Cannot guess file type!")
                continue
﻿
            # Resize to the closest supported aspect-ratio size if necessary
            img = Image.open(io.BytesIO(img_data))
            closest_ratio = get_closest_aspect_ratio(img.width, img.height)
            new_size = aspect_ratio_sizes[closest_ratio]
            if img.size != new_size:
                img = img.resize(new_size, Image.LANCZOS)

            # Always re-encode as PNG so the bytes match the "image/png"
            # media type that process_figure_image sends to Claude
            if img.mode not in ("RGB", "RGBA", "L"):
                img = img.convert("RGB")
            img_byte_arr = io.BytesIO()
            img.save(img_byte_arr, format="PNG")

            img_str = base64.b64encode(img_byte_arr.getvalue()).decode("utf-8")
            data_url = f"data:image/png;base64,{img_str}"
            try:
                images.append(
                    {"image": data_url, "description": process_figure_image(data_url, model=model)}
                )
            except Exception as e:
                print(f"Error processing image: {e}")
                images.append({"image": data_url, "description": ""})
﻿
        vector_graphics_image_data_url = convert_vector_graphic_page_to_image(page)
        if vector_graphics_image_data_url:
            images.append({"image": vector_graphics_image_data_url, "description": process_vector_image_pdf(vector_graphics_image_data_url, model=model)})
        all_images.append(images)
﻿
    return all_images
Note: this function handles both raster images and vector graphics, generating a detailed description of each with Claude, and resizes every raster image to the closest size in aspect_ratio_sizes so that it matches the input sizes Claude expects.
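To make the size selection concrete, here is a small standalone sketch of the same lookup, plus a hypothetical variant (fit_within, not part of the function above) that scales an image down to fit inside the chosen size while keeping its aspect ratio and never upscaling. The size table is copied from extract_images.

```python
ASPECT_RATIO_SIZES = {
    (1, 1): (1092, 1092),
    (3, 4): (951, 1268),
    (2, 3): (896, 1344),
    (9, 16): (819, 1456),
    (1, 2): (784, 1568),
}


def closest_target_size(width: int, height: int) -> tuple:
    """Return the target size whose aspect ratio is closest to width / height."""
    ratio = width / height
    key = min(ASPECT_RATIO_SIZES, key=lambda r: abs(r[0] / r[1] - ratio))
    return ASPECT_RATIO_SIZES[key]


def fit_within(width: int, height: int) -> tuple:
    """Hypothetical variant: shrink to fit the closest target size, preserving aspect ratio."""
    max_w, max_h = closest_target_size(width, height)
    scale = min(max_w / width, max_h / height, 1.0)  # 1.0 cap means small images are left alone
    return max(1, round(width * scale)), max(1, round(height * scale))


print(closest_target_size(600, 800))
print(fit_within(1600, 900))
```

Because every ratio in the table is portrait or square, landscape images fall back to the 1:1 entry; resizing them to exactly that size would distort them, which is the case the variant is meant to avoid.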
Replacing images with text descriptions
Finally, we integrate the image descriptions into the text of the paper. The function below extracts the text from each page, then inserts the image descriptions at appropriate points, creating a comprehensive text representation of the entire paper, including visual content:
@weave.op()
def replace_images_with_descriptions(paper, images):
    pdf_reader = load_pdf(paper)
    text = ""
    for page_num, page in enumerate(pdf_reader.pages):
        text += page.extract_text() + "\n\n"
        if images[page_num] and len(images[page_num]) > 0:
            text += f"\n\n[Image Descriptions for page {page_num+1}]\n"
            for image_num, image in enumerate(images[page_num]):
                text += f"\n[Image {image_num+1}]: {image['description']}\n"
            text += "[END OF IMAGE DESCRIPTIONS]\n"
﻿
    return text
Here's what it looks like in Weave: 
﻿
Additional context about context windows
The text of a PDF can exceed the LLM's context window. Even with a long-context model, passing too much text at once can cause the model to miss relevant information.
Model | Context window (tokens) | Input cost / 1M tokens | Output cost / 1M tokens
Claude 3 Opus | 200,000 | $15.00 | $75.00
Claude 3 Sonnet | 200,000 | $3.00 | $15.00
Claude 3 Haiku | 200,000 | $0.25 | $1.25
Claude 3.5 Sonnet | 200,000 | $3.00 | $15.00
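As a rough worked example of what these numbers mean for a single call, the sketch below estimates cost from token counts and checks whether a document is likely to fit in the context window. The prices and window size come from the table above; the dictionary keys, the function names, and the four-characters-per-token rule of thumb are assumptions made for illustration, not exact tokenizer behavior.

```python
# (input $, output $) per 1M tokens, taken from the table above
PRICING = {
    "claude-3-opus": (15.00, 75.00),
    "claude-3-sonnet": (3.00, 15.00),
    "claude-3-haiku": (0.25, 1.25),
    "claude-3-5-sonnet": (3.00, 15.00),
}
CONTEXT_WINDOW = 200_000


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in dollars for one request."""
    in_price, out_price = PRICING[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def rough_token_count(text: str) -> int:
    """Very rough heuristic (assumption): about 4 characters per token for English text."""
    return len(text) // 4


def fits_context(text: str, reserved_output_tokens: int = 4_096) -> bool:
    """Check whether the text plus the reserved response budget fits in the window."""
    return rough_token_count(text) + reserved_output_tokens <= CONTEXT_WINDOW


# Each chain of density iteration re-sends the full document, so input cost scales with iterations
per_call = estimate_cost("claude-3-5-sonnet", input_tokens=60_000, output_tokens=4_096)
print(f"~${per_call:.2f} per iteration, ~${3 * per_call:.2f} for 3 iterations")
```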
﻿
It's worth remembering: 
﻿
We have multiple layers in our approach to summarization, so we'll need to investigate the quality at each layer as we refine it.
Implementing chain of density summarization
We're ready to implement chain of density now. You can see how it looks in Weave, with the code and details to follow:
The core of our summarization pipeline is implemented in the following functions:
1. Generate a summary given the previous summary and the current missing entities from the document
The summarize_current_summary step:
Forms the foundation of our chain of density implementation
Utilizes a carefully crafted prompt to guide the language model
Instructs the model to identify new technical entities
Incorporates new entities into the summary
Increases overall information density while maintaining relevance to the given instruction
The code: 
@weave.op()
def summarize_current_summary(document, instruction, current_summary="", iteration=1, model="claude-3-5-sonnet-20240620"):
    # Define the maximum number of tokens for the model's response
    max_tokens = 4096
    
    # Construct the prompt for the LLM
    prompt = f"""
    Document: {document}
    Current summary: {current_summary}
    Instruction to focus on: {instruction}
    Iteration: {iteration}
    
    Generate an increasingly concise, entity-dense, and highly technical summary from the provided document that specifically addresses the given instruction using the below approach:
    1. Carefully read the current summary and the instruction.
    2. Identify 1-3 new, important technical entities or ideas from the original text that:
       - Are directly relevant to the instruction
       - Are not yet present in the current summary
       - Add significant, specific information to the summary
       - Are preferably 5 words or fewer
       - May include methodologies, algorithms, metrics, or key findings
       - Ensure to include this in the output before the summary
    3. Write a new summary that:
       - Incorporates the newly identified entities/ideas
       - Retains all crucial information from the current summary
       - Increases overall information density
       - Remains focused on addressing the instruction
       - Utilizes the response window of {max_tokens} tokens
    
    Guidelines:
    - Prioritize technical accuracy and specificity over general readability
    - Use precise terminology, domain-specific jargon, and include quantitative details where relevant
    - Ensure all information is directly related to the instruction
    - Make every word count: rewrite to improve density and make space for new technical entities
    - Employ fusion, compression, and removal of less informative phrases to increase density
    - Never drop entities or technical details from the current summary that are relevant to the instruction
    - Maintain coherence while maximizing information density
    
    Your goal is to create a summary that is noticeably denser, more technical, and more informative than the previous one, utilizing the response window of {max_tokens} tokens while staying laser-focused on the instruction. The summary should be suitable for an expert audience in the field.
    """
    
    # Make the API call to the LLM
    response = client.messages.create(
        model=model,
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}]
    )
    
    # Return the generated summary
    return response.content[0].text
﻿
2. Accumulate our dense summary over N iterations
The iterative_density_summarization step:
Orchestrates the iterative refinement process
Repeatedly calls summarize_current_summary
Uses each iteration's output as input for the next
Allows for gradual accumulation of technical details
Increases density of information progressively
The code:
@weave.op()
def iterative_density_summarization(document, instruction, current_summary, density_iterations, model):
    # Initialize a list to store summaries from each iteration
    iteration_summaries = []
﻿
    # Iterate through the specified number of density iterations
    for iteration in range(1, density_iterations + 1):
        # Generate a new summary based on the current summary and document
        current_summary = summarize_current_summary(document, instruction, current_summary, iteration, model)
        
        # Add the new summary to the list of iteration summaries
        iteration_summaries.append(current_summary)
        
        # Print the current iteration and summary for monitoring
        print(f"Iteration {iteration}:\n{current_summary}\n")
﻿
    # Return the final summary and the list of all iteration summaries
    return current_summary, iteration_summaries
3. Take the result of the accumulation and do one final pass to clean the output
The final_summary step:
Performs a final condensation step after the iterative process
Aims to reduce summary length by 30-40%
Retains all critical technical content
Optimizes for maximum information density and relevance to the instruction
The code: 
@weave.op()
def final_summary(instruction, current_summary, model):
    # Construct the prompt for the final summary generation
    prompt = (
        f"""Given this summary:{current_summary}
        And this instruction to focus on:{instruction}
        Create an extremely dense, final summary that captures all key technical information in the most concise form possible, while specifically addressing the given instruction. Follow these guidelines:
        1. Aim to reduce length by 30-40% while retaining all critical technical content relevant to the instruction.
        2. Prioritize highly specific methodologies, algorithms, metrics, and findings that directly address the instruction.
        3. Preserve precise quantitative data, including statistical significance and error margins where applicable and relevant to the instruction.
        4. Maintain the use of domain-specific terminology and technical jargon pertinent to the instruction.
        5. Ensure that all key entities and concepts from the original summary that relate to the instruction are represented.
        6. Use compact phrasing and remove any remaining non-essential information that doesn't directly contribute to addressing the instruction.
        7. If relevant to the instruction, include brief mentions of limitations, assumptions, or conflicting viewpoints.
        8. Optimize for information density while maintaining coherence for an expert audience, always keeping the focus on the given instruction.
        The final summary should be a highly concentrated, technical distillation of the research that specifically addresses the given instruction, suitable for specialists in the field."""
    )
﻿
    # Make the API call to the LLM for the final summary
    response = client.messages.create(
        model=model,
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}]
    )
﻿
    # Return the generated final summary
    return response.content[0].text
4. Combine together as chain of density summarization
The chain_of_density_summarization step:
Serves as the main entry point for the summarization process
Coordinates the entire summarization pipeline
Initiates the iterative summarization
Applies the final condensation
Returns a comprehensive result set including:
Final summary
Accumulated summary
All intermediate summaries
The code: 
@weave.op()
def chain_of_density_summarization(document, instruction, current_summary="", model="claude-3-5-sonnet-20240620", density_iterations=2):
    # Perform iterative density summarization
    current_summary, iteration_summaries = iterative_density_summarization(
        document, instruction, current_summary, density_iterations, model
    )
    
    # Generate the final, highly condensed summary
    final_summary_text = final_summary(instruction, current_summary, model)
    
    # Print the final summary for monitoring
    print(f"Final Summary:\n{final_summary_text}\n")
    
    # Return a dictionary containing all generated summaries
    return {
        "final_summary": final_summary_text,
        "accumulated_summary": current_summary,
        "iteration_summaries": iteration_summaries,
    }
﻿
﻿
Put together, this implementation leverages the chain of density technique to produce increasingly dense and informative summaries. 
By iteratively refining the summary and focusing on technical entities and ideas, it generates concise yet highly informative summaries tailored to specific instructions. The process prioritizes technical accuracy, domain-specific terminology, and quantitative details, making it particularly suitable for summarizing complex scientific documents for expert audiences.
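One rough way to see when further iterations stop paying off is to track how "entity-like" each intermediate summary is. The sketch below is a crude illustrative heuristic, not a metric from this pipeline: it assumes a token is entity-like if it contains a digit or a capital letter after its first character (acronyms, CamelCase names), and the function name crude_entity_density is hypothetical.

```python
import re


def crude_entity_density(summary: str) -> float:
    """Fraction of whitespace-separated tokens that look entity-like (digits or internal capitals)."""
    tokens = summary.split()
    if not tokens:
        return 0.0
    entity_like = [t for t in tokens if re.search(r"\d|.[A-Z]", t)]
    return len(entity_like) / len(tokens)


# Example: compare summaries from successive iterations (placeholder strings)
iteration_summaries = [
    "the paper studies how models learn facts",
    "SliCK groups facts into HighlyKnown and Unknown with R2 of 0.86",
]
for i, text in enumerate(iteration_summaries, start=1):
    print(i, round(crude_entity_density(text), 2))
```

If the value stops rising between iterations, the extra passes are probably adding little, though a heuristic like this is no substitute for the LLM-based evaluation described later.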
﻿
﻿
﻿
How it works in W&B Weave
We'll be using W&B Weave to interrogate our summaries, compare LLM evaluations, and more.
Weave model object
First, we need to create a model object to encapsulate our summarization pipeline. Here's how that looks in Weave:
﻿
The class below encapsulates our summarization pipeline as a Weave Model. By inheriting from weave.Model and using the @weave.op() decorator, we enable automatic versioning and tracking of inputs, outputs, and code changes. This makes it easy to reproduce experiments and compare results across different model versions or parameter settings.
class ArxivChainOfDensityPipeline(weave.Model):
    model: str = "claude-3-5-sonnet-20240620"
    density_iterations: int = 3

    def __init__(self, model: str = "claude-3-5-sonnet-20240620", density_iterations: int = 3):
        super().__init__()
        self.model = model
        self.density_iterations = density_iterations

    @weave.op()
    def predict(self, paper: ArxivPaper, instruction: str) -> dict:
        # Use the configured model for image descriptions as well as summarization
        extracted_images = extract_images(paper, model=self.model)
        cleaned_text = replace_images_with_descriptions(paper, extracted_images)
        result = chain_of_density_summarization(cleaned_text, instruction, model=self.model, density_iterations=self.density_iterations)
        return result
Evaluation dataset
Next, let's create an evaluation dataset using sample arXiv papers and instructions. It will look something like this:
﻿
The code below creates a Weave Dataset object that combines papers, instructions, and original summaries for evaluation. The weave.Dataset class allows us to version and track our evaluation data, ensuring reproducibility of our experiments. By publishing the dataset with weave.publish(), we make it available for future use and comparison.
from itertools import product  # pairs every paper with every instruction
eval_papers = [arxiv_paper3]
eval_instructions = [
    "Summarize the key methodologies and novel contributions of this research, focusing on their potential impact in the field.",
]
# One evaluation row per (paper, instruction) pair
eval_data = list(product(eval_papers, eval_instructions))
dataset = weave.Dataset(
    name="we-paper-reading-eval-data",
    rows=[
        {"paper": arxiv_paper, "instruction": instruction, "summary": arxiv_paper.summary}
        for arxiv_paper, instruction in eval_data
    ],
)
weave.publish(dataset)
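To see what shape the rows take, here is a small sketch of the same pairing logic using plain strings in place of ArxivPaper objects. The paper names and instructions below are made up for illustration; only the use of itertools.product mirrors the code above.

```python
from itertools import product

# Hypothetical stand-ins for ArxivPaper objects and instructions
papers = ["paper_a", "paper_b"]
instructions = ["Summarize the methodology.", "Summarize the key results."]

# One row per (paper, instruction) pair, grouped by paper
rows = [{"paper": p, "instruction": i} for p, i in product(papers, instructions)]

for row in rows:
    print(row)
```

Two papers and two instructions give four rows, so the dataset grows multiplicatively as you add either.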
Evaluating our iterative summarization pipeline over the refinement process
Next, we want to implement several metrics to assess how good our summaries are.
﻿
We'll use a function that prompts an OpenAI GPT-4-class model (the scorer shown later defaults to gpt-4o) to evaluate individual summaries on three criteria:
Relevance
Technical quality
Concision
We get a few benefits here. First, this helps capture the nuanced aspects of summary quality. It also provides a holistic assessment of how well the summary addresses the given instruction while evaluating the technical accuracy with concision in mind. 
The code in question: 
@weave.op()
def score_summary(summary, summary_type, instruction, model):
    openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    # Construct a detailed prompt for the GPT model to evaluate the summary
    prompt = f"""Evaluate the quality of the following {summary_type} based on how well it addresses the given instruction. Use the scoring rules below to calculate three numerical scores between 0 and 10.
Instruction: {instruction}
{summary_type}:
{summary}
Scoring Rules:
1. Relevance (0-5): [Detailed scoring criteria for relevance]
2. Technical Quality (0-5): [Detailed scoring criteria for technical quality]
3. Conciseness (0-5): [Detailed scoring criteria for conciseness]
Provide your evaluation in the following JSON format:
{{
    "relevance": {{
        "score": <float>
    }},
    "technical_quality": {{
        "score": <float>
    }},
    "conciseness": {{
        "score": <float>
    }}
}}
Ensure your response is ONLY valid JSON. Do not include any other text outside the JSON object.
Ensure you have the keys: relevance, technical_quality, conciseness, each containing only a score.
Ensure each score is a float between 0 and 10, using the scoring rules provided above."""
    # Make an API call to the GPT model for evaluation
    response = openai_client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"}
    )
    # Parse and return the JSON response
    return json.loads(response.choices[0].message.content)
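Even with a JSON response format, it can be worth checking the parsed result before it flows into later analysis. The helper below is a hypothetical addition, not part of the original pipeline: validate_scores and its parameters are names invented here, it assumes the three keys named in the prompt, and its default upper bound of 5 follows the per-aspect scoring rules, so adjust it if your rubric differs.

```python
import json

EXPECTED_ASPECTS = ("relevance", "technical_quality", "conciseness")

def validate_scores(raw_json, min_score=0.0, max_score=5.0):
    """Parse a scorer response and check its structure (hypothetical helper)."""
    parsed = json.loads(raw_json)
    if not isinstance(parsed, dict):
        raise ValueError("Expected a JSON object at the top level")
    for aspect in EXPECTED_ASPECTS:
        entry = parsed.get(aspect)
        if not isinstance(entry, dict) or "score" not in entry:
            raise ValueError(f"Missing score for aspect: {aspect}")
        score = float(entry["score"])
        if not min_score <= score <= max_score:
            raise ValueError(f"{aspect} score {score} outside [{min_score}, {max_score}]")
        entry["score"] = score  # normalize ints to floats
    return parsed

# Made-up response text, for illustration only
example = '{"relevance": {"score": 4}, "technical_quality": {"score": 4.5}, "conciseness": {"score": 3.5}}'
validated = validate_scores(example)
print(validated)
```

Raising early on a malformed response keeps a single bad evaluation from silently skewing the aggregate statistics computed later.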
Our next function analyzes the distribution of scores across multiple summaries. For each aspect (relevance, technical quality, conciseness), it calculates the mean score and a “tail ratio”: the average of the top 5% of scores divided by the overall mean.
This is useful because it helps us identify potential outliers or exceptionally high-quality summaries. It also provides insight into overall performance of our summarization process and highlights areas where the model excels—or needs to be improved.
The code: 
@weave.op()
def calculate_long_tail_stats(scores):
    if not scores:
        return None
    aspects = ['relevance', 'technical_quality', 'conciseness']
    stats = {}
    for aspect in aspects:
        try:
            # Handle different input formats (list of lists or list of dicts)
            if isinstance(scores[0], list):
                flattened_scores = [score[aspect]['score'] for sublist in scores for score in sublist]
            elif isinstance(scores[0], dict):
                flattened_scores = [score[aspect]['score'] for score in scores]
            else:
                print(f"Unexpected format for scores: {scores}")
                return None
            # Calculate statistics for each aspect
            stats[aspect] = {
                "mean": np.mean(flattened_scores),
                "tail_ratio": np.mean(sorted(flattened_scores)[-max(1, int(len(flattened_scores)*0.05)):]) / np.mean(flattened_scores),
            }
        except Exception as e:
            print(f"Error calculating stats for {aspect}: {str(e)}")
            stats[aspect] = None
    return stats
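The tail ratio is easier to reason about with concrete numbers. The sketch below reproduces the same arithmetic in plain Python (no NumPy) on a made-up list of scores; tail_ratio and tail_fraction are names chosen here for illustration.

```python
from statistics import mean

def tail_ratio(values, tail_fraction=0.05):
    """Mean of the top `tail_fraction` of values divided by the overall mean."""
    ordered = sorted(values)
    tail_size = max(1, int(len(ordered) * tail_fraction))  # always at least one value
    return mean(ordered[-tail_size:]) / mean(ordered)

scores = [3.0, 3.5, 4.0, 4.5, 5.0]  # hypothetical per-iteration scores
print(tail_ratio(scores))  # 1.25: the single top score (5.0) over the mean (4.0)
```

Note that with fewer than 40 scores the tail contains a single value, so for a handful of iterations the ratio is simply the maximum score over the mean.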
Our next function assesses how summaries improve across iterations. For each aspect, it reports two metrics: the point of diminishing returns (the first step at which the improvement is zero or negative) and the cumulative improvement summed over all steps.
This helps optimize the number of iterations in the chain of density process and determines when further iterations may no longer lead to significant improvements.
@weave.op()
def analyze_iteration_impact(scores):
    if len(scores) < 2:
        return {aspect: {"diminishing_returns_point": 0, "cumulative_improvement": 0} for aspect in ['relevance', 'technical_quality', 'conciseness']}
    aspects = ['relevance', 'technical_quality', 'conciseness']
    results = {}
    for aspect in aspects:
        aspect_scores = [s[aspect]['score'] for s in scores]
        improvements = [aspect_scores[i+1] - aspect_scores[i] for i in range(len(aspect_scores)-1)]
        results[aspect] = {
            "diminishing_returns_point": next((i for i, imp in enumerate(improvements) if imp <= 0), len(improvements)),
            "cumulative_improvement": sum(improvements),
        }
    return results
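As a worked example of those two metrics, here is the same calculation on a single, made-up sequence of scores for one aspect:

```python
# Hypothetical relevance scores for five successive iterations
aspect_scores = [3.0, 3.5, 4.0, 4.0, 3.5]

# Change in score from each iteration to the next
improvements = [b - a for a, b in zip(aspect_scores, aspect_scores[1:])]

# Index of the first non-positive improvement (or the list length if none)
diminishing_returns_point = next(
    (i for i, imp in enumerate(improvements) if imp <= 0), len(improvements)
)
cumulative_improvement = sum(improvements)

print(improvements)               # [0.5, 0.5, 0.0, -0.5]
print(diminishing_returns_point)  # 2
print(cumulative_improvement)     # 0.5
```

The cumulative improvement always equals the last score minus the first, so it can hide a peak in the middle of the run; the diminishing-returns index is what points to where the gains stopped.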
Our next code block determines the most effective range of iterations for improvement. We use a moving average of improvements to identify sustained progress with the aim of finding the optimal range where improvements are above a certain threshold.
This helps with fine-tuning our chain of density process and identifies the most productive iteration range for each aspect of summary quality.
@weave.op()
def find_optimal_improvement_range(scores):
    if len(scores) < 3:
        return {aspect: {"optimal_range_start": 0, "optimal_range_end": 0, "score_at_start": 0, "score_at_end": 0, "improvement_in_range": 0} for aspect in ['relevance', 'technical_quality', 'conciseness']}
    aspects = ['relevance', 'technical_quality', 'conciseness']
    results = {}
    for aspect in aspects:
        aspect_scores = [s[aspect]['score'] for s in scores]
        improvements = [aspect_scores[i+1] - aspect_scores[i] for i in range(len(aspect_scores)-1)]
        # Calculate moving average of improvements
        window_size = min(3, len(aspect_scores) - 1)
        moving_avg = np.convolve(improvements, np.ones(window_size), 'valid') / window_size
        # Find range where improvements are above a threshold
        threshold = 0.1 * np.mean(improvements)
        above_threshold = [i for i, avg in enumerate(moving_avg) if avg >= threshold]
        if not above_threshold:
            optimal_start, optimal_end = 0, 0
        else:
            optimal_start = above_threshold[0]
            optimal_end = above_threshold[-1] + 1
        results[aspect] = {
            "optimal_range_start": optimal_start,
            "optimal_range_end": optimal_end,
            "score_at_start": aspect_scores[optimal_start],
            "score_at_end": aspect_scores[optimal_end] if optimal_end < len(aspect_scores) else aspect_scores[-1],
            "improvement_in_range": sum(improvements[optimal_start:optimal_end])
        }
    return results
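To make the moving-average step concrete, the sketch below walks through it in plain Python on made-up improvements. The moving_average helper is written here for illustration; it computes the mean of each full window, which is what the 'valid' convolution with a window of ones divided by the window size produces.

```python
def moving_average(values, window):
    """Mean of each full window of `window` consecutive values."""
    return [sum(values[i:i + window]) / window for i in range(len(values) - window + 1)]

improvements = [1.0, 0.5, 0.0, -0.5]  # hypothetical per-step improvements
window = min(3, len(improvements))
averages = moving_average(improvements, window)

# Threshold: 10% of the mean improvement, as in the function above
threshold = 0.1 * (sum(improvements) / len(improvements))
above = [i for i, avg in enumerate(averages) if avg >= threshold]
start, end = (above[0], above[-1] + 1) if above else (0, 0)

print(averages)    # [0.5, 0.0]
print(start, end)  # 0 1
```

Because the threshold is relative to the mean improvement, a run whose scores decline on average yields a negative threshold, which is worth keeping in mind when reading the reported range.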
Next, we identify the iteration range producing the highest quality summaries. We find the range leading up to the highest score for each aspect and consider cumulative improvement within that range. This shows which iterations contribute most to final summary quality, which in turn helps optimize the summarization process.
@weave.op()
def find_optimal_score_range(scores):
    if len(scores) < 2:
        return {aspect: {"optimal_range_start": 0, "optimal_range_end": 0, "highest_score": 0, "improvement_in_range": 0} for aspect in ['relevance', 'technical_quality', 'conciseness']}
    aspects = ['relevance', 'technical_quality', 'conciseness']
    results = {}
    for aspect in aspects:
        aspect_scores = [s[aspect]['score'] for s in scores]
        improvements = [aspect_scores[i+1] - aspect_scores[i] for i in range(len(aspect_scores)-1)]
        highest_score = max(aspect_scores)
        highest_score_index = aspect_scores.index(highest_score)
        # Find the best range leading up to the highest score
        best_start = 0
        best_end = highest_score_index
        best_improvement = sum(improvements[:highest_score_index])
        for start in range(highest_score_index):
            current_improvement = sum(improvements[start:highest_score_index])
            if current_improvement > best_improvement:
                best_start = start
                best_improvement = current_improvement
        results[aspect] = {
            "optimal_range_start": best_start,
            "optimal_range_end": highest_score_index,
            "score_at_start": aspect_scores[best_start],
            "score_at_end": highest_score,
            "improvement_in_range": best_improvement
        }
    return results
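Here is the same search traced on a single, made-up score sequence, as a sketch of what the loop selects:

```python
aspect_scores = [3.0, 2.5, 3.5, 4.5, 4.0]  # hypothetical scores per iteration
improvements = [b - a for a, b in zip(aspect_scores, aspect_scores[1:])]

peak_index = aspect_scores.index(max(aspect_scores))

# Try every start before the peak and keep the one with the largest total gain
best_start, best_gain = 0, sum(improvements[:peak_index])
for start in range(peak_index):
    gain = sum(improvements[start:peak_index])
    if gain > best_gain:
        best_start, best_gain = start, gain

print(best_start, peak_index, best_gain)  # 1 3 2.0
```

Since the gain over a range equals the peak score minus the score at the start, the search effectively picks the lowest-scoring iteration before the peak as the start of the range.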
Now, we aggregate and analyze scores across all summarization iterations. This gives us a holistic view of summary quality evolution throughout our chain of density iterations. We can use this to understand the overall effectiveness of our process and identify trends that improve our quality. 
@weave.op()
def process_iteration_summaries(model_output, instruction, model):
    iteration_scores = [score_summary(summary, f"Iteration Summary {i+1}", instruction, model)
                        for i, summary in enumerate(model_output["iteration_summaries"])]
    return {
        "long_tail_stats": calculate_long_tail_stats(iteration_scores),
        # Additional analyses can be added here if needed
    }
Our last function serves as the main entry point for evaluating summarization quality. It combines all previous metrics into a comprehensive evaluation, analyzing iteration summaries, accumulated summary, and final summary. 
This provides a detailed, multi-faceted assessment of the summarization pipeline’s performance, brings insight into various aspects of summary quality, and evaluates the effectiveness of the entirety of our chain of density process. 
@weave.op()
def quality_scorer(instruction, model_output, model="gpt-4o"):
    scores = {
        "iteration_summaries_analysis": {},
        "accumulated_summary": {},
        "final_summary": {}
    }
    try:
        # Process iteration summaries
        scores["iteration_summaries_analysis"] = process_iteration_summaries(model_output, instruction, model)
        # Score accumulated summary
        scores["accumulated_summary"] = score_summary(model_output["accumulated_summary"], "Accumulated Summary", instruction, model)
        # Score final summary
        scores["final_summary"] = score_summary(model_output["final_summary"], "Final Summary", instruction, model)
        # Flatten the scores dictionary for easier analysis
        flattened_scores = {}
        for key, value in scores.items():
            if isinstance(value, dict):
                flattened_scores[key] = flatten_dict(value)
            else:
                flattened_scores[key] = value
        scores = flatten_dict(flattened_scores)
    except Exception as e:
        print(f"Error in quality_scorer: {str(e)}")
        scores["error"] = str(e)
    return scores
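quality_scorer relies on a flatten_dict helper that is not shown in this section. As a rough idea of what such a helper does, here is a minimal sketch. The name flatten_dict_sketch, the underscore separator, and the resulting key format are assumptions made here, not the notebook's actual implementation, and the scores are made up.

```python
def flatten_dict_sketch(nested, parent_key="", sep="_"):
    """Collapse nested dicts into one level, joining keys with `sep` (assumed format)."""
    flat = {}
    for key, value in nested.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict):
            # Recurse into nested dicts, carrying the joined key as the prefix
            flat.update(flatten_dict_sketch(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat

nested_scores = {
    "final_summary": {"relevance": {"score": 4.0}, "conciseness": {"score": 3.5}},
}
print(flatten_dict_sketch(nested_scores))
# {'final_summary_relevance_score': 4.0, 'final_summary_conciseness_score': 3.5}
```

Flat keys of this kind give each score its own column, which is generally easier to compare across models than nested objects.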
﻿
Collectively, these evaluation metrics provide a robust framework for assessing the quality and effectiveness of our chain of density summarization pipeline. By examining multiple aspects of summary quality across different stages of the process, we can gain valuable insights into the strengths and weaknesses of our approach, identify areas for improvement, and optimize the summarization process for maximum effectiveness.
Running the evaluation
One of the newer Weave features we're really proud of is evaluation comparisons. It's a visual, interactive way to compare LLM outputs and drill into performance, both at an overall level and on individual examples.
﻿
The setup looks like this:
models = [
    "claude-3-opus-20240229",
    "claude-3-haiku-20240307",
    "claude-3-5-sonnet-20240620"]
evaluation = weave.Evaluation(dataset=dataset, scorers=[quality_scorer])
for model in models:
    arxiv_chain_of_density_pipeline = ArxivChainOfDensityPipeline(model=model, density_iterations=8)
    await evaluation.evaluate(arxiv_chain_of_density_pipeline)
Essentially, this code sets up a Weave Evaluation object and runs the evaluation for each model in our list.
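The top-level await above works in a notebook such as Colab, where an event loop is already running. In a plain Python script, the same loop would typically be wrapped in a coroutine and started with asyncio.run. The sketch below shows only that pattern: FakeEvaluation and run_all are stand-ins invented here so the snippet runs without Weave or API keys, and they are not Weave's API.

```python
import asyncio

class FakeEvaluation:
    """Stand-in for an evaluation object with an async evaluate() method."""
    async def evaluate(self, model_name):
        await asyncio.sleep(0)  # placeholder for the real, awaited work
        return {"model": model_name, "status": "done"}

async def run_all(evaluation, model_names):
    results = []
    for name in model_names:  # evaluate models one after another, as above
        results.append(await evaluation.evaluate(name))
    return results

models = ["claude-3-opus-20240229", "claude-3-haiku-20240307", "claude-3-5-sonnet-20240620"]
results = asyncio.run(run_all(FakeEvaluation(), models))
print([r["model"] for r in results])
```

In a real script, the body of run_all would construct the pipeline for each model and await the evaluation on it, exactly as in the notebook cell.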
Results and outputs
Here's how our models stacked up against each other:
Model	Strengths	Weaknesses	Best Use Case
claude-3-5-sonnet-20240620	• Highest relevance (4.1944) • Best technical quality (4.2500) • Most comprehensive summaries	• Lower conciseness than Opus (3.5833) • Highest latency (556.4930 ms) • Highest token usage (271,323)	Applications prioritizing summary quality over computational efficiency
claude-3-opus-20240229	• Balanced performance • Best conciseness (4.0000) • Good relevance (4.0556) and technical quality (4.0000)	• Moderate latency (478.1886 ms) • Moderate token usage (221,489)	Scenarios requiring a balance between summary quality and resource usage
claude-3-haiku-20240307	• Lowest latency (241.1572 ms) • Lowest token usage (204,852)	• Lowest relevance (3.4444) • Lowest technical quality (3.4167) • Lowest conciseness (3.5444)	Applications with strict latency requirements or limited computational resources
﻿
And here is another example of our pipeline in action:
﻿
﻿Many-Shot In-Context Learning﻿
Experimental setup: Gemini 1.5 Pro (1M token context) for many-shot ICL across NLP tasks. Methodology: Random sampling with replacement for K-shot prompts, 3-5 seeds, greedy decoding, KV caching. Evaluation: Task-specific metrics, paired bootstrap resampling for significance, Bonferroni correction.
Key results:
1. Low-resource MT: 997-shot ICL improved Bemba (28.3% to 47.7%, p<0.001, d=2.8) and Kurdish (39.5% to 44.0%, p<0.01, d=1.2).
2. Summarization (XSum): 50-shot ICL peaked at ROUGE-L 32.1% ± 0.4% (95% CI [31.3%, 32.9%]).
3. GPQA: 125-shot ICL achieved 43.8% accuracy (95% CI [41.2%, 46.4%]), comparable to Claude-3 Opus (p=0.08).
4. BIG-Bench Hard: Reinforced ICL outperformed 3-shot CoT (83% vs. 72.1%, p<0.001, d=1.7).
5. Sentiment analysis: 2048-shot ICL reached 95.2% accuracy (95% CI [94.1%, 96.3%]), overcoming label flips.
6. High-dimensional classification: 2048-shot ICL approached k-NN performance (N=16: 89.7% ± 1.2%, N=64: 79.8% ± 1.8%).
7. Sequential parity: 8192-shot ICL achieved 40.2% accuracy (95% CI [38.9%, 41.5%]), surpassing GPT-2 Medium (p<0.001, d=2.1).
Chrono-ablation: Performance plateaus beyond certain shot counts (XSum: 50 shots, MATH: 125 shots, p>0.05 for additional shots).
Limitations:
1. Example ordering sensitivity: MATH500 performance varied significantly across subareas (ANOVA p<0.001, SD=3.2% across 10 orderings).
2. Next-token prediction loss unreliable for ICL performance prediction (r=-0.18, p=0.23).
3. Limited generalizability due to focus on Gemini 1.5 Pro.
4. Hallucination in summarization: XSum fabrications increased from 0% (1-shot) to 18.7% (500-shot), χ² p<0.001.
5. Statistical power limitations: Inadequate power (1-β<0.8) for small effect sizes (d<0.3) in some comparisons.
Statistical robustness:
- Multiple seeds (3-5), insufficient for robust analysis.
- Standard deviation reported for MT (0.1%-0.5%) and visualized via error bars.
- Coefficient of variation: 2.1% (GPQA) to 8.7% (BIG-Bench Hard).
- Bootstrapped 95% CIs provided for key metrics.
Methodological innovations:
1. Reinforced ICL outperformed few-shot ICL with human rationales (MATH: Δ=5.2%, p<0.001, d=1.3).
2. Unsupervised ICL showed domain-specific effectiveness (MATH: 83% relative performance, p<0.01).
Future work recommendations: Increase statistical rigor (G*Power analysis, false discovery rate control), enhance cross-model validation, optimize prompt engineering, refine evaluation metrics, conduct comprehensive ablation studies, and quantify hallucination propensity.
﻿
Conclusion
Chain of density prompting turned out to be an ideal technique for understanding dense, complicated text with supporting imagery. In fact, the entire approach above could easily be ported to use cases like:
Reviewing large corpuses of research—think a massive volume of clinical trial data
Assisting in peer review by providing concise overviews 
Enhancing search and discovery for relevant research in any field 
Notably, it's worth remembering our implementation can handle text and visual elements, comes complete with robust eval metrics, and can be easily customized for non-arXiv use cases. 
Thanks so much for reading. Here's that Colab link one last time if you want to give it all a try yourself: 
﻿
﻿
Models	Context Window	Input Cost / 1M tokens	Output Cost / 1M tokens
Claude 3 Opus	200,000	$15.00	$75.00
Claude 3 Sonnet	200,000	$3.00	$15.00
Claude 3 Haiku	200,000	$0.25	$1.25
Claude 3.5 Sonnet	200,000	$3.00	$15.00
Tags: Articles, Weave, LLM, Text Generation, Experiment