ROUGE Scorer
Created on November 29|Last edited on December 2
Definition
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics widely used to evaluate the quality of text generation tasks, particularly in text summarization. ROUGE measures the overlap between a candidate text (e.g., machine-generated summary) and reference text(s) (e.g., human-written summaries) in terms of n-grams, word sequences, or longest common subsequences.
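The n-gram overlap idea can be sketched in a few lines of plain Python. This is a simplified illustration, not the official ROUGE implementation (which adds tokenization rules, optional stemming, and multi-reference handling):

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """Simplified ROUGE-N F1: clipped n-gram overlap between two token lists."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # each n-gram counted at most min(cand, ref) times
    if not cand or not ref:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# identical texts give a perfect score
print(rouge_n("the cat sat".split(), "the cat sat".split()))  # 1.0
```

Recall asks how much of the reference is covered by the candidate; precision asks how much of the candidate appears in the reference; F1 balances the two.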
Implementation Details:
You can use our RougeScorer like this:
import weave
from weave.scorers import RougeScorer

evaluation = weave.Evaluation(dataset=dataset, scorers=[RougeScorer()])
# .. run your eval
- We leverage the rouge library, the most widely used Python implementation of ROUGE, which boasts "results on par with official release".
- Our scorer returns these key metrics:
- ROUGE-1: Measures the overlap of unigrams (single words) between the candidate and reference. It evaluates content selection.
- ROUGE-2: Measures the overlap of bigrams (two consecutive words) between the candidate and reference. It evaluates fluency and coherence.
- ROUGE-L: Focuses on the longest common subsequence (LCS) between the candidate and reference. It captures sentence-level structure and alignment.
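For intuition on the last metric, ROUGE-L can be sketched with the textbook longest-common-subsequence dynamic program. This is a simplified illustration; the rouge library's actual implementation also handles tokenization and summary-level variants:

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of two token lists, via DP."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if xi == yj else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L F1 from LCS-based precision and recall."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)

# "the cat on the mat" is the LCS (length 5) of these two 6-token sentences
cand = "the cat sat on the mat".split()
ref = "the cat was on the mat".split()
print(rouge_l_f1(cand, ref))
```

Unlike ROUGE-2, the LCS does not require the matched words to be consecutive, so ROUGE-L rewards preserved word order even when the candidate paraphrases in between.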
Our implementation gives you sample-wise ROUGE values (F1 scores for ROUGE-1, ROUGE-2, and ROUGE-L). We also report a corpus-level value in the summary of the evaluation.
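One simple way to picture the corpus-level number is a mean over per-sample F1 scores. This is an assumption for illustration only; the scorer's actual summary aggregation may differ:

```python
def corpus_score(per_sample_f1):
    """Plain mean of per-sample F1 values as a corpus-level summary (illustrative)."""
    return sum(per_sample_f1) / len(per_sample_f1) if per_sample_f1 else 0.0

# three samples scored individually, then summarized
print(corpus_score([0.8, 0.6, 1.0]))  # 0.8
```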

Figure 1: ROUGE metrics logged to our Weave eval dashboard using our RougeScorer.