Supercharging LLM summarization
A guide to making the most of LLMs for summarization tasks
In the age of information overload, the ability to distill vast amounts of text into concise summaries is invaluable. Whether it's for digesting news articles, research papers, or even legal documents, summarization allows individuals to quickly grasp the essence of content without wading through the minutiae.
This is one area where large language models really shine. Unlike tasks such as performing mathematical calculations, where LLMs tend to struggle, these models excel at summarization, making them particularly effective when large amounts of information need to be condensed into a smaller, more digestible format.
In this tutorial, we will cover practical techniques for improving summary quality, using AI research paper summarization as our running example.

Table of contents
Fine-tuning: Tailoring the model to specific needs
In-context learning: A lightweight approach
Prompt engineering: Crafting effective instructions
Chunking, summarizing, and merging: Handling large texts
Self-reflection for avoiding factual errors
Some thoughts on evaluation
Example 1: Fixed QA summarization
Example 2: Fixed + dynamic QA summarization
Example 3: Fixed + dynamic QA summarization + chunking
Building an LLM summarizer web app
Comparing multiple responses in Weave
Conclusion
Fine-tuning: Tailoring the model to specific needs
One of the ways to enhance an LLM's summarization capabilities is through fine-tuning. Fine-tuning involves further training the model on a smaller, specialized dataset that is closely related to the desired task—in this case, summarization. By exposing the model to numerous examples of summaries and their corresponding full texts, it can learn to produce more accurate and contextually relevant summaries.
This approach allows the model to be tailored to specific domains or styles, such as legal documents or academic papers. However, fine-tuning can be resource-intensive and requires a well-curated dataset to avoid introducing biases or inaccuracies. Additionally, this process will need to be repeated whenever new, smarter models are released, as the fine-tuning from a previous model won’t necessarily carry over to the new version. This can make maintaining high-quality, domain-specific models both costly and time-consuming.
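If you do decide to go the fine-tuning route, here is a minimal sketch using the Hugging Face transformers and datasets libraries, with a small seq2seq model (t5-small) and the public cnn_dailymail dataset chosen purely for illustration. In practice you would swap in your own domain-specific text/summary pairs; note this is separate from the Azure-hosted chat models used later in this tutorial.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

# Illustrative choices: a small model and a public summarization dataset
model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
dataset = load_dataset("cnn_dailymail", "3.0.0", split="train[:1000]")

def preprocess(batch):
    # Tokenize the full texts as inputs and the reference summaries as labels
    inputs = tokenizer(["summarize: " + a for a in batch["article"]],
                       max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["highlights"],
                       max_length=128, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, batched=True,
                        remove_columns=dataset.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="summarizer-ft",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    learning_rate=3e-4,
    logging_steps=50,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()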
In-context learning: A lightweight approach
For those looking for a less resource-intensive method, in-context learning offers a practical alternative. This approach involves providing the model with examples directly within the prompt. For instance, by including a few pairs of full texts and their summaries in the prompt, the model can infer the desired format and style for the summary it is about to generate.
This method is flexible and can be adapted on the fly to suit different tasks or domains without the need for retraining the model. Interestingly, in-context learning usually doesn't even require the full input text. The model can often learn from just the output summary and then match a similar style when generating a new summary, based on the new input. While this may be more expensive during inference in comparison to fine-tuning (due to a longer input prompt), in-context learning is a powerful technique that leverages the model's ability to learn from examples presented within the prompt itself.
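To make this concrete, here is a minimal sketch of an in-context prompt built the same way as the scripts later in this article. The endpoint URL, key, and file names are placeholders you would replace with your own.
import requests

# Placeholders: replace with your own serverless endpoint and key
ENDPOINT_URL = "https://<your-deployment>.models.ai.azure.com/chat/completions"
PRIMARY_KEY = "your key"

# A summary you wrote previously (the in-context example) and a hypothetical new input
example_summary = open("in_context_example.txt").read()
new_text = open("new_paper.txt").read()

prompt = (
    "Here is an example of the summary style I want:\n"
    f"{example_summary}\n\n"
    "Now summarize the following text in the same style:\n"
    f"{new_text}"
)

headers = {"Authorization": f"Bearer {PRIMARY_KEY}", "Content-Type": "application/json"}
payload = {"messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt},
]}
response = requests.post(ENDPOINT_URL, headers=headers, json=payload)
print(response.json()["choices"][0]["message"]["content"])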
Prompt engineering: Crafting effective instructions
The way a prompt is structured can significantly influence the quality of the summaries generated by an LLM. Several factors need to be considered when crafting a prompt:
Length: Specify the desired length of the summary to guide the model in producing a concise output that meets your needs. This helps ensure the summary is neither too brief nor too detailed, striking the right balance based on the given text and your specific requirements.
Fixed QA (Question-Answer) Prompts: This involves giving the model a fixed set of questions related to the content being summarized, which effectively guides the summarization. By asking the model to answer these questions in the prompt, you gain more consistency in the model's ability to generate helpful summaries.
Dynamic QA Prompts: These prompts adapt in real time to the content of the input, enabling the model to provide more tailored and context-aware responses. For summarization, this means the questions given to the model will be more relevant to the subject matter of the text, without the need to manually create specialized questions for each input (see the prompt sketch below).
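These ingredients can be combined into a single prompt template, as in the minimal sketch below; the question lists and word limit here are purely illustrative, and the full versions appear in the examples later in this article.
summary_length = 400  # target length in words

fixed_questions = """
What is the primary objective of this research?
What are the key findings and contributions?
What limitations do the authors acknowledge?
"""

# Illustrative placeholder: in practice these come from a first LLM call over the input text
dynamic_questions = "How does the proposed method differ from prior work?"

def build_summary_prompt(text, dynamic_questions=""):
    # Combine the length constraint, the fixed question set, and any dynamic questions
    return (
        f"Summarize the following text in about {summary_length} words, "
        f"and address these questions:\n{fixed_questions}\n{dynamic_questions}\n\n"
        f"Text:\n{text}"
    )

print(build_summary_prompt("<paper text here>", dynamic_questions))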
Chunking, summarizing, and merging: Handling large texts
When dealing with particularly long texts, a single-pass summarization might not be practical or effective. Instead, a more sophisticated approach involves chunking the text into smaller sections, summarizing each chunk, and then merging these summaries into a cohesive whole.
This strategy helps maintain the context and continuity across longer documents, ensuring that important details aren't lost in a single, overly condensed summary. While this approach may require more steps, it can produce more comprehensive and accurate summaries, particularly for complex or lengthy documents, and also save costs during inference.
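Here is a rough sketch of the chunk-summarize-merge pattern; summarize_chunk is a hypothetical stand-in for an LLM call like the ones implemented later in this article.
def chunk_text_by_words(text, chunk_size=800):
    # Split the text into chunks of roughly chunk_size words
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def summarize_chunk(chunk, target_words):
    # Placeholder: in the full examples below, this is an LLM API call
    raise NotImplementedError

def summarize_long_document(text, chunk_size=800, summary_pct=10, final_words=400):
    chunks = chunk_text_by_words(text, chunk_size)
    # Summarize each chunk down to roughly summary_pct percent of its length
    partial = [summarize_chunk(c, max(1, len(c.split()) * summary_pct // 100)) for c in chunks]
    # Merge the partial summaries and condense them one more time
    return summarize_chunk(" ".join(partial), target_words=final_words)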
Self-reflection for avoiding factual errors
Incorporating a self-reflection step for factual errors into your summarization workflow enhances the reliability of the generated summaries. After the initial summary is created using an in-context example or dynamic QA prompts, the summary is passed through an additional process where the model checks its output against the original text. This self-reflection involves prompting the model to identify and correct any factual inaccuracies by comparing the summary with the original document. The model then produces a revised summary that aims to be more accurate.
This step can be incredibly valuable, especially when dealing with critical content where factual accuracy is paramount, such as in legal, medical, or academic summaries. By adding this layer of verification, you can significantly improve the trustworthiness of the output, ensuring that the summaries not only capture the essence of the content but do so without introducing misleading or incorrect information. This approach leverages the model's understanding to refine its own outputs, leading to higher-quality and more dependable results.
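The full implementation appears in the web app section later in this article; as a minimal sketch, the reflection step boils down to a second prompt along these lines (the wording here is illustrative).
def build_reflection_prompt(original_text, summary):
    # Ask the model to compare its own summary against the source and fix factual errors
    return (
        f"Original Document:\n{original_text}\n\n"
        f"Summary:\n{summary}\n\n"
        "Identify any factual errors in the summary by comparing it to the original document, "
        "then rewrite the summary with those errors corrected.\n"
        "Revised Summary:"
    )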
Some thoughts on evaluation
When evaluating the quality of summaries generated by large language models, it's important to recognize that quality is inherently subjective, making evaluation a complex challenge. No single metric can fully capture the nuances of what makes a summary effective. This complexity means that multiple approaches are often needed to get a comprehensive assessment.
One useful method involves comparing generated summaries to reference summaries using metrics like ROUGE and BLEU. ROUGE measures the overlap of n-grams between the generated and reference summaries, while BLEU assesses how well the generated summary matches the reference in terms of word choice and phrasing. Both metrics offer valuable insights, but they mainly focus on surface-level similarities and may not fully reflect the summary's overall quality.
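As a quick sketch of how these metrics can be computed, assuming the rouge-score and nltk packages and using placeholder strings:
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "The paper introduces a new benchmark for evaluating LLM summarization."
candidate = "The authors propose a benchmark to evaluate summarization with LLMs."

# ROUGE: n-gram overlap between the candidate and the reference summary
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)
print({name: round(score.fmeasure, 3) for name, score in rouge.items()})

# BLEU: precision-oriented match on word choice and phrasing (smoothed for short texts)
bleu = sentence_bleu(
    [reference.split()],
    candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)
print(round(bleu, 3))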
A more advanced evaluation approach involves using a powerful model like GPT-4 to answer a set of reference questions based on the full original document, and then testing whether the model can produce similar answers when given only the summary. By comparing the two sets of answers, you can gauge how well the summary captures essential information and whether it effectively conveys key details. This method provides a more nuanced evaluation, combining quantitative signals with an assessment of specific content.
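Here is a hedged sketch of that idea, using a hypothetical ask_llm helper in place of a real API call; any of the get_model_prediction functions shown later in this article could fill that role.
QUESTIONS = [
    "What is the primary objective of this research?",
    "What are the key findings?",
]

def ask_llm(prompt):
    # Hypothetical helper: call your preferred LLM API and return its text response
    raise NotImplementedError

def qa_based_evaluation(full_text, summary):
    agreements = []
    for q in QUESTIONS:
        answer_from_full = ask_llm(f"Answer based only on this document:\n{full_text}\n\nQuestion: {q}")
        answer_from_summary = ask_llm(f"Answer based only on this summary:\n{summary}\n\nQuestion: {q}")
        # Use the model itself as a judge of whether the two answers agree
        verdict = ask_llm(
            "Do these two answers convey the same information? Reply YES or NO.\n"
            f"Answer A: {answer_from_full}\nAnswer B: {answer_from_summary}"
        )
        agreements.append(verdict.strip().upper().startswith("YES"))
    return sum(agreements) / len(agreements)  # fraction of questions the summary preserves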
Finally, I recommend human evaluation on your specific tasks in order to pick the best model for your needs. While automated metrics like ROUGE and BLEU offer valuable insights, they can only go so far in assessing the true quality of a summary. Human judgment is essential for capturing the subjective elements of a "good" summary—such as coherence, readability, and the ability to convey the core message in a way that resonates with the intended audience.
In practice, human evaluators can review summaries generated by different models and provide feedback on which ones best meet the specific requirements of the task at hand. This could involve rating the summaries on various criteria such as clarity, relevance, and conciseness, or even directly comparing multiple summaries side-by-side to determine which one offers the most accurate and engaging distillation of the source material.
Human evaluation is particularly important for domain-specific tasks where nuances and context are important, such as legal, medical, or academic content. By incorporating human insights into the evaluation process, you can ensure that the chosen model not only performs well on objective metrics but also aligns with the subjective expectations of the end users, leading to more effective and reliable summaries in real-world applications.
Example 1: Fixed QA summarization
Now, we will dive into some code covering how to implement these concepts. For our models, we will use the Azure AI serverless inference API, which you can read more about here. This API is great for testing out many of the best LLMs without having to make code changes or manage multiple API accounts.
Here, we will implement the fixed QA approach for summarizing an AI research paper. We'll create a set of questions that are commonly helpful for understanding a research paper and prompt the model to generate a summary of the paper while also addressing those questions. Additionally, I give the model an in-context example, which is a text file containing a summary of a previous research paper that I personally wrote. Even without giving the model the full research paper corresponding to the in-context example, I find that this approach still improves the quality of the summary, as the model is able to generalize from my writing style.
Here's the code:
import fitz  # PyMuPDF
import requests
import weave
import os

# Initialize Weave for logging
weave.init('summarizer')

ENDPOINT_URL = "https://Mistral-small-achqr.eastus2.models.ai.azure.com/chat/completions"
PRIMARY_KEY = "your key"
DEFAULT_PAPER_URL = "https://arxiv.org/pdf/2407.20183.pdf"
IN_CONTEXT_FILE = "in_context_example.txt"

summary_length = 400  # Set the desired maximum length of the summary in words

FIXED_QUESTIONS = """What is the primary objective of this research?
What methodologies or algorithms are proposed or evaluated?
What datasets or experimental setups are used in this study?
What are the key findings and contributions of this research?
What are the implications of these findings for the broader field of AI?
What limitations or challenges are acknowledged by the authors?
What are the proposed future directions or next steps in this research?"""

@weave.op()
def get_model_prediction(prompt):
    headers = {
        "Authorization": f"Bearer {PRIMARY_KEY}",
        "Content-Type": "application/json"
    }
    payload = {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ]
    }
    response = requests.post(ENDPOINT_URL, headers=headers, json=payload)
    return response.json()

def download_pdf(url, filename):
    response = requests.get(url)
    with open(filename, 'wb') as file:
        file.write(response.content)
    print(f"Downloaded PDF from {url}")

def load_pdf_text(pdf_path):
    doc = fitz.open(pdf_path)
    text = ""
    for page_num in range(len(doc)):
        page = doc.load_page(page_num)
        text += page.get_text()
    return text

def load_in_context_example(filepath):
    if os.path.exists(filepath):
        with open(filepath, 'r') as file:
            return file.read()
    return ""

if __name__ == "__main__":
    # Download the PDF if it doesn't exist
    pdf_path = "2407.20183.pdf"
    if not os.path.exists(pdf_path):
        download_pdf(DEFAULT_PAPER_URL, pdf_path)

    # Load the PDF text
    pdf_text = load_pdf_text(pdf_path)
    print("PDF text loaded.")

    # Load the in-context example
    in_context_example = load_in_context_example(IN_CONTEXT_FILE)

    # Combine the in-context example with the fixed questions
    prompt = (
        f"Here's a previous in-context example paper summary:\n{in_context_example}\n\n"
        f"Please summarize the following text and address these questions:\n{FIXED_QUESTIONS} in {summary_length} words\n\n"
        f"Text:\n{pdf_text}"
    )

    # Get the model prediction
    response = get_model_prediction(prompt)
    print("Model response:", response)
This script is designed to automate the summarization of academic papers by combining an in-context example with a set of fixed questions that guide the summarization process. The script first checks if a specified PDF exists, and if not, it downloads it. After extracting the text from the PDF, the in-context example is loaded from a file. The core of the script revolves around constructing a prompt that includes the in-context example, a set of predefined questions (such as the primary objective, methodologies, key findings, etc.), and the extracted text.
This prompt is then sent to an Azure model endpoint using a function that handles the API request. The model processes this input, generating a summary that directly addresses the specified questions. The inclusion of a fixed set of questions ensures that the summary is focused and structured, making it particularly useful for consistently extracting key information from similar types of documents. Finally, the response from the model, which contains the summary, is printed out, providing a concise overview of the document that is aligned with the user’s specified requirements.
The @weave.op() decorator is applied to the get_model_prediction function, indicating that this function is an "operation" tracked by Weave. When the decorator is added, Weave automatically logs all inputs to and outputs from this function. This logging includes the prompt sent to the model and the response generated, allowing for detailed tracking and analysis. This feature is particularly useful for debugging and optimizing the model's performance, as it provides insights into how different inputs affect the output and enables you to fine-tune the interaction with the language model effectively.
To use this code, you must replace the PRIMARY_KEY and ENDPOINT_URL with the key and URL specific to your model. This ensures that API calls are authenticated and directed to the correct endpoint, enabling the script to function properly.
Example 2: Fixed + dynamic QA summarization
Now we will add dynamic QA generation to complement the fixed set of questions, creating a more adaptable and comprehensive summarization process. In the original approach, the script relied on a predefined set of fixed questions to guide the summarization, ensuring that specific aspects of the text were consistently addressed. While effective for maintaining structure, this method could miss important, context-specific details within the document.
We will extend the script so that the model first generates a set of questions that an AI researcher might be interested in. We then include these questions in the prompt for a second call to the model, which asks it to generate a summary of the paper that addresses both the fixed and the newly generated questions:
import fitz
import requests
import os
import weave  # Import Weave for logging and tracking

# Initialize Weave for logging
weave.init('summarizer')

ENDPOINT_URL = "https://Mistral-small-achqr.eastus2.models.ai.azure.com/chat/completions"
PRIMARY_KEY = "your key"
DEFAULT_PAPER_URL = "https://arxiv.org/pdf/2407.20183.pdf"
IN_CONTEXT_FILE = "in_context_example.txt"

summary_length = 400  # Set the desired maximum length of the summary in words

FIXED_QUESTIONS = """What is the primary objective of this research?
What methodologies or algorithms are proposed or evaluated?
What datasets or experimental setups are used in this study?
What are the key findings and contributions of this research?
What are the implications of these findings for the broader field of AI?
What limitations or challenges are acknowledged by the authors?
What are the proposed future directions or next steps in this research?"""

@weave.op()
def get_model_prediction(prompt):
    headers = {
        "Authorization": f"Bearer {PRIMARY_KEY}",
        "Content-Type": "application/json"
    }
    payload = {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ]
    }
    response = requests.post(ENDPOINT_URL, headers=headers, json=payload)
    return response.json()

def download_pdf(url, filename):
    response = requests.get(url)
    with open(filename, 'wb') as file:
        file.write(response.content)
    print(f"Downloaded PDF from {url}")

def load_pdf_text(pdf_path):
    doc = fitz.open(pdf_path)
    text = ""
    for page_num in range(len(doc)):
        page = doc.load_page(page_num)
        text += page.get_text()
    return text

def load_in_context_example(filepath):
    if os.path.exists(filepath):
        with open(filepath, 'r') as file:
            return file.read()
    return ""

if __name__ == "__main__":
    # Download the PDF if it doesn't exist
    pdf_path = "2407.20183.pdf"
    if not os.path.exists(pdf_path):
        download_pdf(DEFAULT_PAPER_URL, pdf_path)

    # Load the PDF text
    pdf_text = load_pdf_text(pdf_path)
    print("PDF text loaded.")

    # Load the in-context example
    in_context_example = load_in_context_example(IN_CONTEXT_FILE)

    # Generate dynamic questions
    prompt_for_questions = f"Generate key questions that a researcher would ask based on the following text:\n{pdf_text}"
    questions_response = get_model_prediction(prompt_for_questions)

    generated_questions = ""  # Fall back to no extra questions if the call fails
    if questions_response:
        generated_questions = questions_response.get("choices")[0].get("message").get("content")
        print("Generated Questions:")
        print(generated_questions)

    # Combine fixed and generated questions to create the final prompt
    final_prompt = (
        f"Here's a previous in-context example paper summary:\n{in_context_example}\n\n"
        f"Please summarize the following text and address these questions:\n{FIXED_QUESTIONS}\n{generated_questions} in {summary_length} words\n\n"
        f"Text:\n{pdf_text}"
    )

    # Get the final model prediction
    final_response = get_model_prediction(final_prompt)
    print("Model response:", final_response)
With the updated script, after extracting the text from the PDF, we use the model to dynamically generate additional questions based on the document's content. These dynamically generated questions are then combined with the fixed questions to form a more tailored prompt. This approach allows the model to adapt to the unique nuances of the document, resulting in a summary that is both consistent and contextually relevant.
The final prompt, which now includes the in-context example along with both fixed and dynamic questions, is sent to the model to produce a more detailed and responsive summary. This method enhances the summarization process by ensuring that the generated summary reflects both the general structure provided by the fixed questions and the specific nuances captured by the dynamic questions.
Example 3: Fixed + dynamic QA summarization + chunking
Now we will add chunking and dynamic summarization to handle longer documents more effectively. In the original approach, the script processed the entire document in one go, which could be inefficient or result in overly condensed summaries for lengthy texts. With the updated script, the text is first split into smaller, manageable chunks based on a specified word count. Each chunk is summarized individually, allowing for more focused and coherent summaries that retain important details across different sections of the document.
import fitz  # PyMuPDF
import requests
import weave
import os

# Initialize Weave for logging
weave.init('summarizer')

ENDPOINT_URL = "https://Mistral-small-achqr.eastus2.models.ai.azure.com/chat/completions"
PRIMARY_KEY = "your key"
DEFAULT_PAPER_URL = "https://arxiv.org/pdf/2407.20183.pdf"
IN_CONTEXT_FILE = "in_context_example.txt"

summary_length = 400  # Set the desired maximum length of the summary in words
chunk_size = 800      # Example chunk size in words
summary_pct = 10      # Example summary percentage for chunked summarization

FIXED_QUESTIONS = """What is the primary objective of this research?
What methodologies or algorithms are proposed or evaluated?
What datasets or experimental setups are used in this study?
What are the key findings and contributions of this research?
What are the implications of these findings for the broader field of AI?
What limitations or challenges are acknowledged by the authors?
What are the proposed future directions or next steps in this research?"""

@weave.op()
def get_model_prediction(prompt):
    headers = {
        "Authorization": f"Bearer {PRIMARY_KEY}",
        "Content-Type": "application/json"
    }
    payload = {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ]
    }
    response = requests.post(ENDPOINT_URL, headers=headers, json=payload)
    return response.json()

def download_pdf(url, filename):
    response = requests.get(url)
    with open(filename, 'wb') as file:
        file.write(response.content)
    print(f"Downloaded PDF from {url}")

def load_pdf_text(pdf_path):
    doc = fitz.open(pdf_path)
    text = ""
    for page_num in range(len(doc)):
        page = doc.load_page(page_num)
        text += page.get_text()
    return text

def load_in_context_example(filepath):
    if os.path.exists(filepath):
        with open(filepath, 'r') as file:
            return file.read()
    return ""

def chunk_text_by_words(text, chunk_size):
    words = text.split()
    return [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def calculate_summary_length(chunk_text, summary_pct):
    word_count = len(chunk_text.split())
    return max(1, int(word_count * summary_pct / 100))

if __name__ == "__main__":
    # Download the PDF if it doesn't exist
    pdf_path = "2407.20183.pdf"
    if not os.path.exists(pdf_path):
        download_pdf(DEFAULT_PAPER_URL, pdf_path)

    # Load the PDF text
    pdf_text = load_pdf_text(pdf_path)
    print("PDF text loaded.")

    # Load the in-context example
    in_context_example = load_in_context_example(IN_CONTEXT_FILE)

    # Chunked summarization
    chunks = chunk_text_by_words(pdf_text, chunk_size)
    chunk_summaries = []
    for i, chunk in enumerate(chunks):
        print(f"Processing chunk {i+1}/{len(chunks)}")
        # Calculate the dynamic summary length based on the chunk size
        summary_len = calculate_summary_length(chunk, summary_pct)
        prompt = f"Summarize the following section of a research paper in {summary_len} words:\n{chunk}"
        response = get_model_prediction(prompt)
        chunk_summaries.append(response.get("choices")[0].get("message").get("content"))

    # Combine chunk summaries and generate dynamic questions
    combined_summary = " ".join(chunk_summaries)
    prompt_for_questions = (
        f"In-context example:\n{in_context_example}\n\n"
        f"Generate a few key questions a researcher might ask about the following summarized sections:\n{combined_summary}"
    )
    questions_response = get_model_prediction(prompt_for_questions)

    generated_questions = ""  # Fall back to no extra questions if the call fails
    if questions_response:
        generated_questions = questions_response.get("choices")[0].get("message").get("content")
        print("Generated Questions:")
        print(generated_questions)

    # Use the combined summary and generated questions to create the final prompt
    final_prompt = (
        f"In-context example:\n{in_context_example}\n\n"
        f"Please summarize the following text and address these questions:\n{FIXED_QUESTIONS}\n{generated_questions} (Word Limit: {summary_length} words)\n\n"
        f"Text:\n{combined_summary}"
    )

    # Get the final model prediction
    final_response = get_model_prediction(final_prompt)
    print("Model response:", final_response)
After summarizing each chunk, the summaries are combined, and the script uses the model to dynamically generate key questions about the summarized sections. These generated questions, along with a fixed set of questions, are used to create a final prompt that guides the summarization process.
This approach not only makes the summarization process more scalable for large documents but also enhances the relevance of the summary by considering both general and specific aspects of the text. The final output is a summary that addresses both the fixed and dynamically generated questions, providing a comprehensive overview of the document while maintaining clarity and coherence across different sections. This method ensures that even with complex or lengthy documents, the resulting summaries remain accurate, informative, and aligned with the intended focus of the content.
Building an LLM summarizer web app
Now we will build a web app capable of summarizing research papers, ensuring that the generated summaries are not only concise but also accurate and aligned with the specific needs of the user. Leveraging the power of language models through Azure's API, the app allows for flexible and dynamic interactions, making it a robust tool for content summarization across various domains.
We'll also use Weave to track model responses and provide a mechanism for users to give feedback on those responses as they use the app. This tracking and feedback system enables users to evaluate the model on realistic data continuously. By integrating Weave, the app not only logs each interaction with the model but also allows users to upvote or downvote the generated summaries. This feedback loop is crucial for refining the model’s performance, as it provides valuable insights into how well the model meets user expectations in real-world scenarios. I will omit the front-end HTML code for this tutorial and share the backend code for the Flask app here:
import requests
import json
import random
import os
from flask import Flask, request, jsonify, render_template
import weave

# Initialize Weave
client = weave.init('summarizer')

app = Flask(__name__)

# Define global variables for the models and cache
MODELS = [
    {"endpoint": "https://Meta-Llama-3-1-405B-Instruct-lqf.eastus2.models.ai.azure.com/chat/completions", "key": "your key"},
    {"endpoint": "https://Meta-Llama-3-1-70B-Instruct-xwib.eastus2.models.ai.azure.com/chat/completions", "key": "your key"},
    # {"endpoint": "your fourth model endpoint url", "key": "your fourth model key"},
    # {"endpoint": "your fifth model endpoint url", "key": "your fifth model key"},
]

IN_CONTEXT_FILE = "./in_context_example.txt"
SUMMARY_LENGTH = 400  # Maximum length of the summary in words

FIXED_QUESTIONS = """What is the primary objective of this research?
What methodologies or algorithms are proposed or evaluated?
What datasets or experimental setups are used in this study?
What are the key findings and contributions of this research?
What are the implications of these findings for the broader field of AI?
What limitations or challenges are acknowledged by the authors?
What are the proposed future directions or next steps in this research?"""

# Cache to store used prompts and models
used_prompts_cache = {}

@weave.op()
def get_model_prediction(prompt, endpoint, key, original_input=None):
    headers = {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json"
    }
    payload = {
        "messages": [
            {"role": "system", "content": "You are a summarization assistant."},
            {"role": "user", "content": prompt}
        ]
    }
    response = requests.post(endpoint, headers=headers, json=payload)

    # Get the current call ID
    current_call = weave.get_current_call()
    call_id = current_call.id

    try:
        return parse_response(response.json()), call_id
    except json.JSONDecodeError as e:
        print(f"Failed to decode JSON response: {e}")
        return None, call_id

def parse_response(response):
    if 'choices' in response:
        choices = response['choices']
        for choice in choices:
            if 'message' in choice and 'content' in choice['message']:
                content = choice['message']['content']
                print(f"Model response content: {content}")
                return content
            if 'finish_reason' in choice:
                finish_reason = choice['finish_reason']
                print(f"Finish reason: {finish_reason}")
    return "No valid response"

def load_in_context_example(filepath):
    if os.path.exists(filepath):
        with open(filepath, 'r') as file:
            return file.read()
    return ""

@weave.op()
def perform_self_reflection(summary, original_text, endpoint, key, original_input=None):
    reflection_prompt = (
        f"The following is a summary of the original document. "
        f"Original Document:\n{original_text}\n\n"
        f"Summary:\n{summary}\n\n"
        f"If there are mistakes, simply remove incorrect sections and REWRITE it COMPLETELY as it was originally with the only change being the removal of incorrect sentences:"
        f"If everything is correct, simply rewrite the summary EXACTLY as it was:"
    )
    revised_summary, _ = get_model_prediction(reflection_prompt, endpoint, key)
    return revised_summary

def select_random_model(prompt):
    available_models = [model for model in MODELS if prompt not in used_prompts_cache.get(model['endpoint'], [])]
    if not available_models:
        # If all models have been used, reset the cache for the prompt
        for model in MODELS:
            if model['endpoint'] in used_prompts_cache:
                used_prompts_cache[model['endpoint']].remove(prompt)
        available_models = MODELS

    selected_model = random.choice(available_models)

    # Cache the selected model for the prompt
    if selected_model['endpoint'] not in used_prompts_cache:
        used_prompts_cache[selected_model['endpoint']] = []
    used_prompts_cache[selected_model['endpoint']].append(prompt)

    return selected_model

@app.route('/')
def index():
    return render_template('index.html')

@app.route('/chat', methods=['POST'])
def chat():
    data = request.json
    prompt = data['prompt']

    # Load the in-context example
    in_context_example = load_in_context_example(IN_CONTEXT_FILE)

    # Select a random model that hasn't been used with this prompt yet
    model = select_random_model(prompt)

    # Generate dynamic questions based on the input text
    prompt_for_questions = f"Generate key questions that a researcher would ask based on the following text:\n{prompt}"
    questions_response, _ = get_model_prediction(prompt_for_questions, model['endpoint'], model['key'])

    generated_questions = ""
    if questions_response:
        generated_questions = questions_response

    # Combine fixed and generated questions to create the final prompt
    final_prompt = (
        f"Here's a previous in-context example paper summary:\n{in_context_example}\n\n"
        f"Please summarize the following text and address these questions:\n{FIXED_QUESTIONS}\n{generated_questions} in {SUMMARY_LENGTH} words.\n\n"
        f"Text:\n{prompt}"
    )

    # Get the initial summary and call ID from the selected model
    summary, call_id = get_model_prediction(final_prompt, model['endpoint'], model['key'], original_input=prompt)

    # Perform self-reflection to remove factual errors
    revised_summary = perform_self_reflection(summary, prompt, model['endpoint'], model['key'], original_input=prompt)
    # revised_summary = summary

    return jsonify({
        "response": revised_summary,
        "call_id": call_id,
        "model_used": model['endpoint']
    })

@app.route('/feedback', methods=['POST'])
def feedback():
    data = request.json
    call_id = data['call_id']
    feedback_type = data['feedback']

    if feedback_type == "upvote":
        client.call(call_id).feedback.add_reaction("👍")
    elif feedback_type == "downvote":
        client.call(call_id).feedback.add_reaction("👎")

    return jsonify({"status": "success"})

if __name__ == "__main__":
    app.run(debug=True)
At its core, the app offers several key features that significantly improve the summarization process. One important aspect is model switching, where the app randomly selects from a list of predefined models for each request. This allows users to test different models' effectiveness in generating summaries, which can be particularly useful for comparative studies or finding the best-suited model for a particular type of content.
This approach is similar to the LMSYS Chatbot Arena, where models are tested without users knowing which one is being used. By removing the bias of knowing the model's identity, this method provides a more valid and objective way to evaluate models. Users can focus solely on the quality of the output, ensuring that their assessments are based on performance rather than preconceived notions about a particular model. This makes the evaluation process more reliable and helps in identifying the most effective model for specific tasks.
Weave plays a crucial role throughout the app. Weave is used to track the responses generated by the models and record which models are used for each request. By logging this information, Weave allows us to monitor and analyze the performance of different models, track the quality of their outputs, and gather data that can be used to optimize the app over time.
This tracking is invaluable for understanding how different models perform across various tasks and for making data-driven decisions about which models to prioritize or fine-tune further. An important detail in the script is that we pass the original input prompt to our functions with the Weave decorator, which will enable us to later compare responses with the same input. This is particularly useful when comparing different models for the same input!
The app also includes feedback mechanisms powered by Weave that allow users to rate the quality of the summaries. By upvoting or downvoting a particular summary, users can provide real-time feedback that can be used to fine-tune the models or adjust the summarization parameters. This feedback loop is essential for continuously improving the app's performance and ensuring that it meets user expectations.
Here we get the UUID of the specific trace in Weave:
current_call = weave.get_current_call()
call_id = current_call.id
And in the feedback function, we can utilize the call_id to set feedback on the response:
@app.route('/feedback', methods=['POST'])
def feedback():
    data = request.json
    call_id = data['call_id']
    feedback_type = data['feedback']

    if feedback_type == "upvote":
        client.call(call_id).feedback.add_reaction("👍")
    elif feedback_type == "downvote":
        client.call(call_id).feedback.add_reaction("👎")

    return jsonify({"status": "success"})
Another key feature is self-reflection for factual accuracy. After the initial summary is generated, the app performs a self-reflection step where it revisits the original document and identifies any potential factual inaccuracies in the summary. The model then corrects these errors and rewrites the summary, ensuring that the final output is both precise and trustworthy. This feature is especially valuable in scenarios where accuracy is paramount, such as legal or academic summaries.
Here is the code for performing self-reflection:
def perform_self_reflection(summary, original_text, endpoint, key):
    reflection_prompt = (
        f"The following is a summary of the original document. Identify and remove any factual errors "
        f"from the summary, and rewrite it to be accurate:\n\n"
        f"Original Document:\n{original_text}\n\n"
        f"Summary:\n{summary}\n\n"
        f"Revised Summary:"
    )
    revised_summary, _ = get_model_prediction(reflection_prompt, endpoint, key)
    return revised_summary
In addition, the app employs a combination of fixed and dynamic question generation. It begins with a set of fixed questions that guide the summarization process, ensuring consistency in the output. Then, it dynamically generates additional questions based on the input text, allowing the summary to be more responsive to the document's specific content. This dual approach ensures that the summary is comprehensive, covering both standard and context-specific aspects of the text.
In the app, you can generate summaries, shuffle between models, and give feedback to the different summaries, which will allow you to make more informed decisions about which models to choose:

After running the app and navigating to the W&B Weave dashboard, you will see the following logs (note that I also upvoted one of the model responses from within the app):

By clicking one of the cells, we can see more detailed logs of the inputs and outputs to our function. Below I'll share another screenshot showing how we can easily view the responses from the model!

Comparing multiple responses in Weave
Now, you may want to compare different outputs for the same input across different models. Inside Weave, you can use a handy feature for quickly filtering traces with the same input value by simply holding Option and clicking the input in any of the Weave cells. Recall that we passed the original input to both our self-reflection function and our model prediction function, which allows us to filter by a specific input from the user. After adding our filter, we can see a comparison of model responses!

Here, we see I used the same input for both the Llama 3.1 70B model and the 405B model, and I was able to easily compare the two responses!
Conclusion
When choosing a specific model for your own applications, I highly recommend trying out a few of the popular models and choosing based on your personal opinion of summary quality. I personally use GPT-4o for most of my summarization tasks, but Claude 3.5 Sonnet is also a great choice. Additionally, the Llama 3.1 70B and 405B models performed quite well for summarizing academic AI research papers in this project.
Overall, this web app is a powerful tool that combines flexibility, accuracy, and user engagement to deliver high-quality summaries. Whether you're working with research papers, legal documents, or any other complex content, it provides a streamlined solution for generating reliable and concise summaries tailored to your needs. With robust tracking, analysis, and feedback through Weave, you can easily keep track of model responses and make informed decisions about which model to use for any given task.
In conclusion, large language models offer practical solutions for summarizing large volumes of text, making it easier to extract key information from complex documents. Techniques such as fine-tuning, in-context learning, prompt engineering, and chunking help these models produce summaries that are both concise and contextually accurate. The addition of self-reflection for factual accuracy improves the reliability of the output. I hope you enjoyed this tutorial! Also, feel free to check out the Github repo here.