BLEU Scorer
BLEUScorer is a custom scoring class built using the sacreBLEU library. It evaluates the quality of generated text by computing the BLEU (Bilingual Evaluation Understudy) score, a widely used metric for text generation tasks like machine translation.
This implementation supports both sentence-level BLEU (for evaluating individual predictions) and corpus-level BLEU (for overall system evaluation). The scorer provides flexibility through configurable parameters such as tokenization, smoothing, and case sensitivity.
Definition
BLEU evaluates the overlap of n-grams (subsequences of words) between the hypothesis and the references, combining two components (see the formula below):
- Precision: The fraction of n-grams in the hypothesis that appear in the references. By default, n-grams of order 1 through 4 are used.
- Brevity Penalty: A penalty to prevent short hypotheses from scoring disproportionately high.
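In the standard formulation from Papineni et al. (2002), these two components combine as

$$
\text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right),
\qquad
\text{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{1 - r/c} & \text{if } c \le r
\end{cases}
$$

where $p_n$ is the modified n-gram precision, $w_n = 1/N$ are uniform weights (with $N = 4$ by default), $c$ is the total hypothesis length, and $r$ is the effective reference length.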
If you are not familiar with BLEU, check out this excellent blog post that goes into the formulation and limitations of the metric.
The BLEU score ranges from 0 to 100, where higher is better.
Note that comparing BLEU scores across different corpora and languages is strongly discouraged. Even comparing BLEU scores for the same corpus but with different numbers of reference translations can be highly misleading.
Sentence-Level BLEU
- Evaluates the BLEU score for a single prediction against its references.
- Useful for debugging individual predictions or analyzing specific outputs.
- The average of sentence-level BLEU scores is not reliable: averaging the individual scores corresponds to macro-averaging, which treats each sentence equally regardless of length.
Corpus-Level BLEU
- Aggregates statistics across all hypotheses and references before calculating BLEU. This method reflects micro-averaging, as it accounts for the contribution of each sentence proportionally to its length.
- Provides a holistic view of model performance across the entire dataset.
- The original BLEU metric, as introduced by Papineni et al. (2002), employs corpus-level (micro-average) precision; the snippet below illustrates the difference using sacreBLEU directly.
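To make the distinction concrete, here is a small sketch that calls sacreBLEU directly (not the Weave scorer), with made-up hypotheses and references:

```python
import sacrebleu

hypotheses = [
    "the cat sat on the mat",
    "there is a small dog",
]
# One reference stream: references[0][i] is the reference for hypotheses[i].
references = [[
    "the cat is sitting on the mat",
    "there is a small dog in the garden",
]]

# Sentence-level BLEU: scored per hypothesis, then macro-averaged,
# so every sentence counts equally regardless of its length.
sentence_scores = [
    sacrebleu.sentence_bleu(hyp, [ref]).score
    for hyp, ref in zip(hypotheses, references[0])
]
print("average sentence-level BLEU:", sum(sentence_scores) / len(sentence_scores))

# Corpus-level BLEU: n-gram statistics are pooled across all sentences before
# the precisions are computed (micro-averaging), as in Papineni et al. (2002).
print("corpus-level BLEU:", sacrebleu.corpus_bleu(hypotheses, references).score)
```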
Usage
Open the Colab notebook to see the scorer in action, but here are a few details:
Usage:
```python
from weave import Evaluation
from weave.scorers import BLEUScorer

bleu_scorer = BLEUScorer()

evaluation = Evaluation(
    dataset=dataset,
    scorers=[bleu_scorer],
)
# ...run evaluation
```
Technical Details:
- We return sentence-level scores so that you can inspect prediction quality at a fine-grained level. As a summary, we also return the average of the sentence-level BLEU scores.

Figure 1: Sample-wise view of the evaluation on the TruthfulQA (generation) benchmark, where our scorer gives sentence-level BLEU scores.
- We return the corpus-level score as well. This is the reliable (original-implementation) BLEU score reported in most research papers. Because this score applies to the entire dataset, you will find it in the summary of the evaluation.

- We provide knobs so that you can configure the scorer. The defaults come from the sacreBLEU library and in most cases need not be changed, but if you know what you are doing, you have the option to override them (see the example after the table below).
Parameter | Type | Default | Description |
---|---|---|---|
`lowercase` | `bool` | `False` | If `True`, performs case-insensitive matching. |
`tokenize` | `Optional[str]` | `None` | Specifies the tokenizer to use. Defaults to sacreBLEU's built-in default tokenizer. |
`smooth_method` | `str` | `"exp"` | Smoothing method to handle n-grams with zero matches. Options: `'exp'`, `'none'`, `'floor'`, `'add-k'`. |
`smooth_value` | `Optional[float]` | `None` | Value used by smoothing methods like `'floor'` and `'add-k'`. |
`max_ngram_order` | `int` | `4` | The maximum n-gram order for the BLEU calculation. |
`effective_order` | `bool` | `True` | If `True`, adjusts for missing higher-order n-grams in short sentences. |
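For illustration, here is a hedged sketch of a configured scorer. The parameter names come from the table above; the specific values, including the "13a" tokenizer (sacreBLEU's usual default), are arbitrary choices for demonstration rather than recommendations:

```python
from weave.scorers import BLEUScorer

# Example configuration: every value below is illustrative, not a recommendation.
bleu_scorer = BLEUScorer(
    lowercase=True,         # case-insensitive matching
    tokenize="13a",         # explicitly pick sacreBLEU's standard "13a" tokenizer
    smooth_method="floor",  # assign a small count to n-grams with zero matches
    smooth_value=0.1,       # the value used by the "floor" smoothing method
    max_ngram_order=4,      # standard BLEU uses up to 4-grams
    effective_order=True,   # adjust the order for very short sentences
)
```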
Limitations
By now, as a community, we know the limitations of this metric well:
- String Matching Focus: BLEU evaluates string similarity, not true translation quality or semantic equivalence.
- Single Reference Limitation: Many tests use only one human reference translation, penalizing valid translations that differ in phrasing or word choice.
- Ignores Synonyms and Paraphrases: BLEU gives no credit for synonyms or paraphrases (e.g., “sofa” vs. “couch”), leading to lower scores for accurate translations; the snippet after this list illustrates this.
- Syntactically Incorrect Matches: Nonsensical outputs with correct n-grams in the wrong order can score high.
- Meaningless for Cross-Domain/Language Comparisons: Scores are highly specific to the test set and language pair, making cross-domain or cross-language comparisons invalid.
- Insensitive to Meaning: BLEU does not assess if the translation preserves meaning; it only measures token overlap.
- Unreliable Small Differences: Small score differences are hard to interpret and often not meaningful.
- Overfitting Risk: Systems trained on the test set can artificially inflate BLEU scores without real-world improvements.
- Limited by Reference Set: Using multiple human references improves scores but still cannot fully capture translation diversity.
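To make the string-matching limitation concrete, here is a small sacreBLEU sketch with made-up sentences: an exact match scores 100, while a meaning-preserving paraphrase that only swaps "sofa" for "couch" takes a substantial hit.

```python
import sacrebleu

reference = ["the sofa is comfortable and very large"]

# Identical to the reference: all n-gram precisions are 1, so BLEU is 100.
exact = sacrebleu.sentence_bleu("the sofa is comfortable and very large", reference)

# Same meaning, one word swapped: several n-grams no longer match,
# so the score drops substantially even though the translation is fine.
paraphrase = sacrebleu.sentence_bleu("the couch is comfortable and very large", reference)

print(f"exact match BLEU: {exact.score:.1f}")
print(f"paraphrase BLEU:  {paraphrase.score:.1f}")
```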
Closing remarks
You should consider using BLEU because it is cheap to compute once you have the required dataset, and it is reproducible as long as the references and the configuration remain the same. If your use case requires benchmarking against an older paper or a previously reported number, chances are high that the number is a BLEU score, given the metric's historical precedence.
And despite the limitations listed above, a genuinely improving system will generally be reflected in this metric.