ROUGE Scorer

Created on November 29|Last edited on December 2

Definition

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics widely used to evaluate the quality of text generation tasks, particularly in text summarization. ROUGE measures the overlap between a candidate text (e.g., machine-generated summary) and reference text(s) (e.g., human-written summaries) in terms of n-grams, word sequences, or longest common subsequences.
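To make the overlap idea concrete, here is a minimal, illustrative sketch of ROUGE-1 (unigram) recall, precision, and F1 computed by hand. It is not the library implementation our scorer uses; it skips stemming and other normalization that real implementations apply.

```python
from collections import Counter

def rouge_1(candidate: str, reference: str) -> dict:
    """Illustrative ROUGE-1: clipped unigram overlap between candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped match count: each word is matched at most as often as it
    # appears in the other text (Counter intersection takes the min count).
    overlap = sum((cand & ref).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}

scores = rouge_1("the cat sat on the mat", "the cat lay on the mat")
```

Because recall divides by the reference length and precision by the candidate length, F1 balances the two; this is the flavor of value reported per metric below.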

Implementation Details:


You can use our RougeScorer like this:

import weave
from weave.scorers import RougeScorer

eval = weave.Evaluation(
    dataset=dataset,
    scorers=[RougeScorer()],
)

# .. run your eval
  • We are leveraging the rouge library, the most widely used Python implementation of ROUGE, which boasts "results on par with official release".
  • Our scorer returns these key metrics:
    • ROUGE-1: Measures the overlap of unigrams (single words) between the candidate and reference. It evaluates content selection.
    • ROUGE-2: Measures the overlap of bigrams (two consecutive words) between the candidate and reference. It evaluates fluency and coherence.
    • ROUGE-L: Focuses on the longest common subsequence (LCS) between the candidate and reference. It captures sentence-level structure and alignment.
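To illustrate the ROUGE-L idea, the sketch below computes the longest common subsequence (LCS) of two token lists with dynamic programming and derives an LCS-based F1. These are hypothetical helpers for explanation only, not the rouge library's internals.

```python
def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence, via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            # Extend the LCS on a match; otherwise carry the best so far.
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f1(candidate: str, reference: str) -> float:
    """LCS-based F1: recall over reference length, precision over candidate length."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    recall, precision = lcs / len(ref), lcs / len(cand)
    return 2 * precision * recall / (precision + recall)
```

Unlike ROUGE-2, the LCS does not require the matched words to be consecutive, only in the same order, which is why ROUGE-L captures sentence-level structure.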
Our implementation gives you sample-wise ROUGE values (F1 scores for ROUGE-1, ROUGE-2, and ROUGE-L). We also report a corpus-level value in the evaluation summary.
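One simple way to turn sample-wise values into a corpus-level summary is to average each metric across samples. The snippet below is a sketch of that idea using hypothetical per-sample scores; the actual aggregation Weave performs in its evaluation summary may differ.

```python
from statistics import mean

# Hypothetical per-sample F1 scores, as a sample-wise scorer might log them.
sample_scores = [
    {"rouge-1": 0.83, "rouge-2": 0.60, "rouge-l": 0.83},
    {"rouge-1": 0.50, "rouge-2": 0.25, "rouge-l": 0.45},
]

# Corpus-level summary: the mean of each metric over all samples.
corpus_summary = {
    key: mean(sample[key] for sample in sample_scores)
    for key in sample_scores[0]
}
```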
Figure 1: ROUGE metrics logged to our Weave eval dashboard using our RougeScorer.