AI scorers: Evaluating AI-generated text with ROUGE
This article explores the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metric, a powerful tool used for evaluating the quality of AI-generated text
AI's capability to produce coherent and meaningful text has revolutionized industries ranging from journalism to customer service, but creating this text is only half the challenge—ensuring its quality is just as vital. This is where ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, comes into play. ROUGE has become one of the most widely used metrics for evaluating text generated by AI systems, particularly in summarization and translation tasks.
ROUGE offers a systematic way to measure how closely machine-generated text aligns with human-written references, helping developers refine their models to meet expectations for clarity, coherence, and relevance. Whether you're an AI researcher or a practitioner, understanding ROUGE can unlock powerful insights into your models' performance.
If you're eager to dive in and see ROUGE in action, check out our interactive Colab notebook, where you can experiment with evaluating AI outputs using ROUGE metrics.
Otherwise, continue reading as we explore the fundamentals, applications, and best practices for leveraging ROUGE effectively.

Table of contents
- What is ROUGE?
- Understanding recall, precision, and F1-Score
- Tutorial: Using ROUGE with Weave Scorers
- Evaluating multiple models with ROUGE and Weave Evaluations
- Conclusion
What is ROUGE?
ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics used to evaluate the similarity between machine-generated and human-written text. It analyzes overlaps in elements like unigrams (single words), bigrams (word pairs), and longest common subsequences (LCS) to quantify text quality.
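To make these building blocks concrete, here is a minimal sketch in plain Python (no libraries, with toy sentences of our own choosing) that extracts the unigrams, bigrams, and longest common subsequence that ROUGE compares:

reference = "the cat sat on the mat".split()
candidate = "a cat sat on a mat".split()

# Unigrams and bigrams are just sliding windows over the tokens
unigram_overlap = set(reference) & set(candidate)
bigram_overlap = set(zip(reference, reference[1:])) & set(zip(candidate, candidate[1:]))
print(unigram_overlap)  # contains 'cat', 'sat', 'on', 'mat'
print(bigram_overlap)   # contains ('cat', 'sat') and ('sat', 'on')

# Longest common subsequence length via a small dynamic-programming table
def lcs_length(a, b):
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

print(lcs_length(reference, candidate))  # 4, for the subsequence "cat sat on ... mat"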
In practice, ROUGE evaluates the quality of AI-generated text in tasks like summarization or translation by determining if it captures the key points and details of the reference text. If the generated text misses critical information or includes irrelevant content, the ROUGE score reflects these gaps, making it a valuable tool for identifying areas of improvement in model performance.
The main variants of ROUGE include:
- ROUGE-1: Measures the overlap of unigrams (individual words) between the generated text and the reference text, emphasizing content selection.
- ROUGE-2: Evaluates the overlap of bigrams (pairs of consecutive words) between the generated text and the reference text, capturing fluency and coherence.
- ROUGE-L: Focuses on the longest common subsequence (LCS) of words, assessing structural alignment and sentence-level coherence.
These metrics provide a nuanced analysis of text quality, balancing considerations like relevance, readability, and fidelity. By breaking down text into different granularities, ROUGE enables developers to understand how well a model is performing and guides iterative improvements.
For example, in summarization tasks, ROUGE-1 ensures key content is included, ROUGE-2 checks for smooth phrasing, and ROUGE-L confirms logical sentence structure. Together, these metrics offer a comprehensive view of text alignment and quality, making ROUGE indispensable for text evaluation in natural language processing tasks.
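As a quick illustration, the snippet below scores a single candidate against a reference using the open-source rouge package (the same package installed later in the tutorial via pip install rouge); the recall, precision, and F1 fields it reports are explained in the next section:

from rouge import Rouge

candidate = "The cat sat on the mat."
reference = "A cat was sitting on the mat."

# get_scores returns recall (r), precision (p), and F1 (f) for ROUGE-1, ROUGE-2, and ROUGE-L
scores = Rouge().get_scores(candidate, reference)[0]
for variant, values in scores.items():
    print(f"{variant}: recall={values['r']:.3f}, precision={values['p']:.3f}, f1={values['f']:.3f}")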
Understanding recall, precision, and F1-Score
Before delving into the specific ROUGE metrics, it's helpful to understand the concepts of recall, precision, and their harmonic mean, the F1-score.
💡 These foundational metrics explain how ROUGE evaluates the quality of machine-generated text compared to human-written references.
Recall measures how much of the reference text’s information is captured by the generated text. It’s calculated as the number of overlapping n-grams divided by the total n-grams in the reference. High recall indicates comprehensive coverage of essential content.

Precision evaluates how much of the generated text aligns with the reference text. It’s calculated as the number of overlapping n-grams divided by the total n-grams in the generated text. High precision ensures the output avoids irrelevant or unnecessary information.

F1-Score combines recall and precision into a single balanced metric, calculated as their harmonic mean.

This prevents a model from scoring well by excelling in one aspect, such as recall, while performing poorly in the other, such as precision. By combining both metrics, the F1-score provides a comprehensive assessment of performance.
ROUGE uses the F1-score as its default metric because it balances the trade-off between recall and precision. This is especially useful in summarization tasks, where capturing all key points (recall) and avoiding verbosity or irrelevant details (precision) are equally important. For instance:
- High recall but low precision might result in an overly verbose summary.
- High precision but low recall might produce a concise summary that misses critical details.
By defaulting to the F1-score, ROUGE provides a single, holistic measure of relevance and completeness, making it a reliable choice for text evaluation.
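As a worked example of the formulas above (a simplified hand calculation over unique words, not the exact counting a ROUGE library performs), ROUGE-1 recall, precision, and F1 can be computed directly from the unigram overlap:

# Simplified, set-based ROUGE-1 calculation for illustration only
reference = "a cat was sitting on the mat".split()
candidate = "the cat sat on the mat".split()

overlap = set(reference) & set(candidate)            # {'cat', 'on', 'the', 'mat'}

recall = len(overlap) / len(set(reference))          # 4 / 7 ≈ 0.571: how much of the reference is covered
precision = len(overlap) / len(set(candidate))       # 4 / 5 = 0.800: how much of the candidate is relevant
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean ≈ 0.667

print(f"recall={recall:.3f}, precision={precision:.3f}, f1={f1:.3f}")

A verbose candidate would push precision down, while a terse one would push recall down; the harmonic mean keeps either failure mode from being hidden behind a strong score on the other metric.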



Like other implementations of ROUGE, the Weave library uses the F1-score version of each of these variants.
Tutorial: Using ROUGE with Weave Scorers
Evaluating the quality of AI-generated text is essential for developing robust models, especially for tasks like summarization and translation. While ROUGE provides a reliable metric for assessing text alignment, calculating and comparing scores across multiple predictions and references can become tedious.
This is where Weave, an advanced logging and evaluation platform, simplifies the process. By combining ROUGE metrics with Weave’s seamless integration, you can efficiently calculate scores, compare multiple models, and visualize results - all within an intuitive workflow.
In this tutorial, we’ll guide you through:
- Setting up a basic script to compute ROUGE scores.
- Logging results automatically with Weave.
- Comparing the performance of two models - GPT-4o and GPT-4o Mini - using the LongBench GovReport dataset.
Let’s start by installing the required libraries:
pip install litellm
pip install -qq rouge
git clone -b xtra-scorers https://github.com/wandb/weave.git && cd weave && pip install -qq -e .
This script uses the Weave library, which provides a streamlined way to compute ROUGE metrics and integrate them into workflows. The example showcases how to initialize the RougeScorer, prepare data, and calculate ROUGE scores for each prediction-ground truth pair.
import weave
from weave.scorers import RougeScorer

# Initialize Weave
weave.init("rouge-scorer-demo")

# Define the RougeScorer
scorer = RougeScorer(column_map={"output": "prediction", "ground_truth": "ground_truth"})

# Prepare your data
data = [
    {"prediction": "The cat sat on the mat.", "ground_truth": "A cat was sitting on the mat."},
    {"prediction": "The quick brown fox jumps over the lazy dog.", "ground_truth": "A fast brown fox jumped over a sleeping dog."},
    {"prediction": "The sun rises in the east.", "ground_truth": "Sunlight emerges from the eastern horizon."},
]

# Iterate through the data and compute ROUGE scores
for item in data:
    scores = scorer.score(item["prediction"], item["ground_truth"])
    print(f"Prediction: {item['prediction']}")
    print(f"Ground Truth: {item['ground_truth']}")
    print(f"ROUGE Scores: {scores}")
    print("-" * 50)
In the script, we first import the required modules and initialize Weave. The RougeScorer is then defined, mapping the expected output and ground_truth to the respective fields in the dataset. The data consists of pairs of predictions and their corresponding reference texts. Each prediction is evaluated against its ground truth using the scorer.score method.
Normally, using Weave requires adding the @weave.op decorator to log operations, but since the RougeScorer is natively integrated, logging happens automatically after calling weave.init("rouge-scorer-demo"). This integration captures all inputs, such as the prediction and ground truth, as well as outputs like the calculated ROUGE metrics, without requiring any additional setup.
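For your own functions that are not built-in scorers, the decorator is all that is needed. Here is a minimal sketch (the helper function below is purely illustrative, not part of Weave or the tutorial code):

import weave

weave.init("rouge-scorer-demo")

# Hypothetical helper: decorating a plain function turns its calls into logged Weave operations,
# capturing inputs and outputs automatically.
@weave.op()
def normalize_text(text: str) -> str:
    return text.lower().strip()

normalize_text("  The cat sat on the mat.  ")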
Evaluating multiple models with ROUGE and Weave Evaluations
We’ll now evaluate and compare the performance of GPT-4o and GPT-4o Mini using Weave’s RougeScorer. This demonstrates how to systematically assess the quality of text summaries generated by two models against a reference dataset of government reports.
We will use the LongBench GovReport dataset, a subset designed for summarization tasks. Each data entry contains a query (the context to be summarized) and a ground truth (the reference summary). The goal is to evaluate how well GPT-4o and GPT-4o Mini generate summaries aligned with these references.
To integrate seamlessly with Weave, both models are defined as Weave-compatible models with a predict method. This method generates summaries for given queries and utilizes litellm for API interactions. The results are logged and evaluated using Weave, making the process efficient and transparent.
The RougeScorer, configured with column_map to match the dataset's structure, calculates ROUGE metrics for each generated summary. We select the first 30 rows of the dataset for this demonstration. Here is the code for the evaluation:
import weave
import time
from litellm import acompletion
import asyncio
import nest_asyncio
from weave.scorers import RougeScorer
from weave import Evaluation

weave_client = weave.init("rouge-scorer")

dataset = weave.ref("weave:///c-metrics/rouge-scorer/object/longbench_gov_report_subset:qGNjItwJSEw1NF6xMXX2a0syHJfXVMjeYqwqVwWsdbs").get()

class GPT4oMini(weave.Model):
    model_name: str = "gpt-4o-mini"
    temp: float = 0.0
    max_tokens: int = 2048
    top_p: float = 1.0

    @weave.op()
    async def predict(self, query: str) -> str:
        response = await acompletion(
            model=self.model_name,
            api_key="your api key",
            messages=[
                {"role": "system", "content": "You are provided with government reports. Summarize the report in a few sentences but make sure to include all the important information."},
                {"role": "user", "content": query}
            ],
            temperature=self.temp,
            max_tokens=self.max_tokens,
            top_p=self.top_p
        )
        return response.choices[0].message.content

class GPT4o(weave.Model):
    model_name: str = "gpt-4o-2024-08-06"
    temp: float = 0.0
    max_tokens: int = 2048
    top_p: float = 1.0

    @weave.op()
    async def predict(self, query: str) -> str:
        time.sleep(2)
        response = await acompletion(
            model=self.model_name,
            api_key="your api key",
            messages=[
                {"role": "system", "content": "You are provided with government reports. Summarize the report in a few sentences but make sure to include all the important information."},
                {"role": "user", "content": query}
            ],
            temperature=self.temp,
            max_tokens=self.max_tokens,
            top_p=self.top_p
        )
        return response.choices[0].message.content

gpt4o = GPT4o()
gpt4omini = GPT4oMini()

nest_asyncio.apply()

scorer = RougeScorer(column_map={"output": "query", "ground_truth": "ground_truth"})

evaluation = Evaluation(
    dataset=dataset.rows[:30],
    scorers=[scorer],
)

asyncio.run(evaluation.evaluate(gpt4o))
asyncio.run(evaluation.evaluate(gpt4omini))
Each model generates summaries for input queries, which are then compared against ground truth summaries using ROUGE metrics provided by Weave's RougeScorer.
The scorer is configured to align dataset fields with the expected inputs (query as output and ground_truth as reference). The evaluation results reveal that GPT-4o Mini slightly outperforms GPT-4o across all ROUGE metrics, with higher scores in ROUGE-1, ROUGE-2, and ROUGE-L, indicating better content selection, fluency, and sentence-level structural alignment. Additionally, GPT-4o Mini exhibits significantly lower latency, making it highly efficient for real-time applications while maintaining quality.
Overall, Weave makes it easy to stand up an evaluation quickly and to analyze the results. The Weave Evaluations dashboard lets you visualize performance across the entire evaluation as well as drill into the results for each individual sample, as shown below:


Conclusion
In conclusion, ROUGE has proven to be a crucial tool for assessing the quality of machine-generated text, especially in tasks like summarization and translation. By providing clear and quantitative metrics such as recall, precision, and the F1-score, ROUGE helps developers improve AI models, ensuring the generated content meets human expectations for relevance, clarity, and coherence.
As AI continues to evolve, having reliable, automated methods to assess its outputs will be essential to meet growing demands for clarity, relevance, and coherence in generated content. Whether you're refining a summarization model or comparing performance across systems, tools like ROUGE and Weave will empower you to achieve high-quality results effortlessly. Start exploring today and stay ahead in the ever-evolving world of AI.