ROUGE Scorer

Created on November 29|Last edited on December 2

Definition

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics widely used to evaluate the quality of text generation tasks, particularly in text summarization. ROUGE measures the overlap between a candidate text (e.g., machine-generated summary) and reference text(s) (e.g., human-written summaries) in terms of n-grams, word sequences, or longest common subsequences.
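To make the overlap idea concrete, here is a minimal, illustrative sketch of ROUGE-1 (unigram) recall, precision, and F1 computed by hand. It is not the library implementation our scorer uses; it skips stemming and other normalization that real implementations apply.

```python
from collections import Counter

def rouge_1(candidate: str, reference: str) -> dict:
    """Illustrative ROUGE-1: clipped unigram overlap between candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped match count: each word is matched at most as often as it
    # appears in the other text (Counter intersection takes the min count).
    overlap = sum((cand & ref).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}

scores = rouge_1("the cat sat on the mat", "the cat lay on the mat")
```

Because recall divides by the reference length and precision by the candidate length, F1 balances the two; this is the flavor of value reported per metric below.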

Implementation Details:


You can use our RougeScorer like this:

import weave
from weave.scorers import RougeScorer

eval = weave.Evaluation(
    dataset=dataset,
    scorers=[RougeScorer()],
)

# .. run your eval
  • We are leveraging the rouge library, the most widely used Python implementation of ROUGE, which boasts "results on par with official release".
  • Our scorer returns these key metrics:
    • ROUGE-1: Measures the overlap of unigrams (single words) between the candidate and reference. It evaluates content selection.
    • ROUGE-2: Measures the overlap of bigrams (two consecutive words) between the candidate and reference. It evaluates fluency and coherence.
    • ROUGE-L: Focuses on the longest common subsequence (LCS) between the candidate and reference. It captures sentence-level structure and alignment.
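To illustrate the ROUGE-L idea, the sketch below computes the longest common subsequence (LCS) of two token lists with dynamic programming and derives an LCS-based F1. These are hypothetical helpers for explanation only, not the rouge library's internals.

```python
def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence, via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            # Extend the LCS on a match; otherwise carry the best so far.
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f1(candidate: str, reference: str) -> float:
    """LCS-based F1: recall over reference length, precision over candidate length."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    recall, precision = lcs / len(ref), lcs / len(cand)
    return 2 * precision * recall / (precision + recall)
```

Unlike ROUGE-2, the LCS does not require the matched words to be consecutive, only in the same order, which is why ROUGE-L captures sentence-level structure.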
Our implementation gives you sample-wise ROUGE values (F1 scores for ROUGE-1, ROUGE-2, and ROUGE-L). We also report a corpus-level value in the evaluation summary.
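One simple way to turn sample-wise values into a corpus-level summary is to average each metric across samples. The snippet below is a sketch of that idea using hypothetical per-sample scores; the actual aggregation Weave performs in its evaluation summary may differ.

```python
from statistics import mean

# Hypothetical per-sample F1 scores, as a sample-wise scorer might log them.
sample_scores = [
    {"rouge-1": 0.83, "rouge-2": 0.60, "rouge-l": 0.83},
    {"rouge-1": 0.50, "rouge-2": 0.25, "rouge-l": 0.45},
]

# Corpus-level summary: the mean of each metric over all samples.
corpus_summary = {
    key: mean(sample[key] for sample in sample_scores)
    for key in sample_scores[0]
}
```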
Figure 1: ROUGE metrics logged to our Weave eval dashboard using our RougeScorer.