
AI scorers: Evaluating AI-generated text with BLEU

This article breaks down BLEU, a key metric for evaluating machine-generated text, covering its mechanics, practical applications with Python and Weave, and its role in improving text generation systems.
BLEU, or Bilingual Evaluation Understudy, is a cornerstone metric for evaluating machine-generated text, particularly in tasks like machine translation and summarization. By comparing generated text to human-written references, BLEU provides a quantitative measure of similarity through n-gram precision and brevity.
While calculating BLEU effectively across datasets can be complex, tools like Weave simplify the process by automating evaluation workflows and providing actionable insights. In this article, we’ll break down the mechanics of BLEU, explore its practical applications, and demonstrate its use with Python and Weave.
Ready to jump into the code? Check out our interactive Colab notebook to experiment with BLEU scoring in a guided environment:

Or, you can continue reading to dive deeper into how BLEU works and how it fits into the landscape of text evaluation metrics.


What is a BLEU score?

A BLEU score quantifies the quality of AI-generated text by comparing it to one or more reference texts. It measures the degree of n-gram overlap, where n-grams are sequences of n words or tokens. By calculating precision across 1- to 4-grams, BLEU provides a detailed similarity analysis between the generated text and the reference(s).
For example, in the sentence "The cat sat on the mat," 1-grams are single words like "The" and "cat," while 2-grams are word pairs like "The cat" and "cat sat." BLEU scores how many n-grams in the generated text match those in the reference(s), capturing alignment at different granularities.
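To make this concrete, here is a minimal, framework-free sketch of how n-grams can be extracted from a sentence. The `ngrams` helper is purely illustrative and not part of any particular BLEU library:

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "The cat sat on the mat".split()
print(ngrams(tokens, 1))  # [('The',), ('cat',), ('sat',), ('on',), ('the',), ('mat',)]
print(ngrams(tokens, 2))  # [('The', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]
```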
In addition to precision, BLEU applies a brevity penalty to discourage overly short outputs. This ensures the generated text is sufficiently complete while remaining aligned with the reference. For instance, in machine translation tasks, a BLEU score reflects whether the generated text captures the essence and key details of the reference translation. If it omits essential elements or is overly verbose, the BLEU score indicates these shortcomings.

How is a BLEU score calculated?

A BLEU score measures the quality of AI-generated text by comparing it to one or more reference texts. It calculates how well the generated text aligns with the reference(s) by using two key components: n-gram precision, which measures word or phrase overlap, and a brevity penalty to discourage overly short outputs.
Here's how the BLEU score is calculated:

N-gram precision

BLEU starts by calculating how many n-grams in the hypothesis appear in the reference(s). If the hypothesis is "The cat sat on the mat" and the reference is "A cat was sitting on the mat", the matching 1-grams are "cat", "on", "the", and "mat". Precision is calculated as the number of matching n-grams divided by the total number of n-grams in the hypothesis.
This is repeated for 2-grams, 3-grams, and so on, up to a maximum order of 4-grams by default.
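As a rough illustration (not the exact tokenization or matching rules a production BLEU implementation uses), 1-gram precision for the example above can be computed like this:

```python
from collections import Counter

hypothesis = "The cat sat on the mat".split()
reference = "A cat was sitting on the mat".split()

hyp_counts = Counter(hypothesis)
ref_counts = Counter(reference)

# Each hypothesis unigram counts as a match at most as often as it
# appears in the reference (this cap is the "clipping" described below).
matches = sum(min(count, ref_counts[word]) for word, count in hyp_counts.items())
precision = matches / len(hypothesis)
print(f"1-gram precision: {matches}/{len(hypothesis)} = {precision:.2f}")  # 4/6 ≈ 0.67
```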

Clipping, brevity penalty, and geometric mean

BLEU applies adjustments to ensure fairness. Clipping caps the contribution of any n-gram to the precision score at the maximum count observed in the reference. If the hypothesis is "the the the the" and the reference is "the cat sat on the mat", the word "the" appears four times in the hypothesis but only twice in the reference, so its match count is clipped to two, giving a clipped 1-gram precision of 2/4 rather than 4/4.
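A quick way to see clipping in action, assuming simple whitespace tokenization:

```python
from collections import Counter

hypothesis = "the the the the".split()
reference = "the cat sat on the mat".split()

hyp_counts = Counter(hypothesis)  # {'the': 4}
ref_counts = Counter(reference)   # 'the' appears only twice in the reference

# Clip each hypothesis count at the count observed in the reference
clipped = sum(min(count, ref_counts[word]) for word, count in hyp_counts.items())
print(f"Clipped 1-gram precision: {clipped}/{len(hypothesis)}")  # 2/4
```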
The brevity penalty adjusts the score to discourage overly short hypotheses. If the hypothesis is shorter than the reference, a penalty reduces the score proportionally to the length difference. If the hypothesis length (c) is greater than or equal to the reference length (r), no penalty is applied.
The brevity penalty is calculated as:

$$
BP =
\begin{cases}
1 & \text{if } c \geq r \\
e^{\,1 - r/c} & \text{if } c < r
\end{cases}
$$

where $c$ is the length of the hypothesis and $r$ is the length of the reference.

Final BLEU Score

The final BLEU score is computed by combining the brevity penalty and the geometric mean of the clipped n-gram precision scores:

$$
\text{BLEU} = BP \cdot \exp\!\left( \sum_{n=1}^{N} w_n \log p_n \right)
$$

where $p_n$ is the clipped precision for n-grams of order $n$, $w_n$ is the corresponding weight (uniform by default, $w_n = 1/N$), and $N$ is the maximum n-gram order (4 by default).

The formula ensures that precision across all n-grams and hypothesis completeness contribute to the score, which ranges from 0 to 100. Higher scores indicate better quality, but direct comparisons across datasets or languages are discouraged due to contextual variability.
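To tie the pieces together, here is a small, self-contained sketch of the full calculation. It is a simplified illustration rather than the sacreBLEU implementation: it uses whitespace tokenization, a single reference, and no smoothing.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(hyp, ref, n):
    hyp_counts = Counter(ngrams(hyp, n))
    ref_counts = Counter(ngrams(ref, n))
    matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    total = max(len(hyp) - n + 1, 0)
    return matches / total if total > 0 else 0.0

def bleu(hyp, ref, max_order=4):
    precisions = [clipped_precision(hyp, ref, n) for n in range(1, max_order + 1)]
    if min(precisions) == 0:  # no smoothing in this sketch
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_order)
    c, r = len(hyp), len(ref)
    bp = 1.0 if c >= r else math.exp(1 - r / c)  # brevity penalty
    return 100 * bp * geo_mean

hyp = "The cat sat on the mat".split()
ref = "A cat was sitting on the mat".split()
print(f"BLEU (up to 4-grams): {bleu(hyp, ref):.2f}")               # 0.00: no 4-gram overlap and no smoothing
print(f"BLEU (up to 2-grams): {bleu(hyp, ref, max_order=2):.2f}")  # ≈ 43.71
```

The zero score in the 4-gram case shows why real implementations apply smoothing when higher-order n-grams have no matches.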

Corpus-level BLEU scoring vs. sentence-level

The formulas above describe sentence-level BLEU, which evaluates individual predictions against their reference texts. Sentence-level scores are useful for debugging and analyzing specific outputs, but simply averaging them across a dataset is less reliable because it ignores sentence length and corpus-level context.
Corpus-level BLEU, on the other hand, aggregates statistics across all hypotheses and references before computing the BLEU score. This approach provides a holistic evaluation, reflecting the overall performance of the model across the dataset.
When calculating corpus-level BLEU scores, instead of evaluating hypotheses individually, n-gram matches and counts are aggregated across all hypotheses and references in the dataset. Similarly, the brevity penalty is applied globally, based on the total length of all hypotheses. This global approach smooths over sentence-level variability, providing a comprehensive evaluation of model performance across the entire dataset.
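Weave's BLEUScorer builds on the sacreBLEU library (introduced in the tutorial below), and the distinction between the two modes is easy to see with sacreBLEU directly. Here is a minimal sketch, assuming the sacrebleu package is installed:

```python
import sacrebleu

hypotheses = [
    "The cat sat on the mat.",
    "The sun rises in the east.",
]
references = [
    "A cat was sitting on the mat.",
    "Sunlight emerges from the eastern horizon.",
]

# Sentence-level: one score per hypothesis/reference pair
for hyp, ref in zip(hypotheses, references):
    score = sacrebleu.sentence_bleu(hyp, [ref])
    print(f"Sentence BLEU: {score.score:.2f}")

# Corpus-level: n-gram counts and lengths are pooled before scoring
corpus = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"Corpus BLEU: {corpus.score:.2f}")
```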

Tutorial: Using BLEU with Weave Scorers

To see the BLEU score in action, we’ll set up a script using Weave’s BLEUScorer. This tool integrates with the sacreBLEU library to calculate both sentence-level and corpus-level BLEU scores. Below, we’ll walk through the steps to compute these scores and analyze the results.

Step 1: Installation

Start by installing the necessary libraries. Run the following commands:
git clone https://github.com/wandb/weave.git && cd weave && git fetch origin pull/3006/head:xtra-scorers && git checkout xtra-scorers && pip install -qq -e .
Also, to run the evaluation later on in this tutorial, export your OpenAI key:
export OPENAI_API_KEY='your api key'
If you are interested in experimenting with BLEU in a Google Colab, feel free to check out a notebook here.

Step 2: Setting up the BLEU scorer

Next we'll calculate sentence-level and corpus-level BLEU scores.
First, import the necessary libraries and initialize the BLEUScorer:
import weave
from weave.scorers import BLEUScorer

weave.init('bleu-scorer-demo')

# Initialize Weave BLEU scorer
weave_scorer = BLEUScorer()

# Define sample hypotheses and references
examples = [
    {"hypothesis": "The cat sat on the mat.", "references": ["A cat was sitting on the mat."]},
    {"hypothesis": "The quick brown fox jumps over the lazy dog.", "references": ["A fast brown fox jumped over a sleeping dog."]},
    {"hypothesis": "The sun rises in the east.", "references": ["Sunlight emerges from the eastern horizon."]},
]

# Prepare input for Weave's summarize function
weave_inputs = []

print("Sentence-Level BLEU:")
for example in examples:
    # Compute sentence-level BLEU for each example
    sentence_bleu_result = weave_scorer.score(
        ground_truths=example["references"],
        output=example["hypothesis"]
    )
    weave_inputs.append({
        "output_pred": example["hypothesis"],
        "output_refs": example["references"],
        "sentence_bleu": sentence_bleu_result["sentence_bleu"],  # Sentence-level score from Weave
    })

    # Print sentence-level BLEU results
    print(f"Hypothesis: {sentence_bleu_result['output_pred']}")
    print(f"References: {sentence_bleu_result['output_refs']}")
    print(f"BLEU Score: {sentence_bleu_result['sentence_bleu']}")
    print(f"Brevity Penalty: {sentence_bleu_result['sentence_bp']}")
    print("-" * 50)

# Calculate corpus-level BLEU using Weave
weave_corpus_score = weave_scorer.summarize(weave_inputs)

# Print corpus-level BLEU results
print("\nWeave Corpus-Level BLEU:")
print(f"BLEU Score: {weave_corpus_score['corpus_level']['bleu']}")
print(f"Brevity Penalty: {weave_corpus_score['corpus_level']['brevity_penalty']}")



Interpreting results

The script computes both sentence-level and corpus-level BLEU scores, providing insights into text quality at multiple granularities. Here are the main distinctions:
  • Sentence-level BLEU: Useful for analyzing specific outputs and debugging.
  • Corpus-level BLEU: Aggregates scores across all examples, offering a holistic view of model performance.
Here are some of the parameters that you can use to customize your BLEU Scorer:
  • lowercase (bool, default False): If True, performs case-insensitive matching.
  • tokenize (Optional[str], default None): Specifies the tokenizer to use. Defaults to SacreBLEU's built-in tokenizer.
  • smooth_method (str, default "exp"): Smoothing method for n-gram orders with zero matches. Options: 'exp', 'none', 'floor', 'add-k'.
  • smooth_value (Optional[float], default None): Value used by smoothing methods like 'floor' or 'add-k'.
  • max_ngram_order (int, default 4): The maximum n-gram order for the BLEU calculation.
  • effective_order (bool, default True): If True, adjusts for missing higher-order n-grams in short sentences.
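These parameters can be combined when constructing the scorer. The snippet below is an illustrative sketch that reuses the tutorial's BLEUScorer; the specific parameter values are only examples:

```python
import weave
from weave.scorers import BLEUScorer

weave.init("bleu-scorer-demo")

# A customized scorer: case-insensitive matching, bigram-only BLEU,
# and 'floor' smoothing for n-gram orders with zero matches.
custom_scorer = BLEUScorer(
    lowercase=True,
    max_ngram_order=2,
    smooth_method="floor",
    smooth_value=0.1,
)

result = custom_scorer.score(
    ground_truths=["A cat was sitting on the mat."],
    output="The cat sat on the mat.",
)
print(result["sentence_bleu"])
```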


Evaluating multiple models with BLEU and Weave Evaluations

Weave Evaluations also enables structured comparisons of multiple models, such as GPT-4o and GPT-4o Mini. By evaluating predictions against a shared reference dataset, you can identify performance differences between models.
Here’s a condensed overview of how to set up such an evaluation using Weave:
import weave
import time
from litellm import acompletion
import asyncio
import nest_asyncio
from weave.scorers import BLEUScorer
from weave import Evaluation

# Initialize Weave with BLEU Scorer
weave_client = weave.init("bleu-scorer-demo")

dataset = weave.ref(
    "weave:///c-metrics/rouge-scorer/object/longbench_gov_report_subset:qGNjItwJSEw1NF6xMXX2a0syHJfXVMjeYqwqVwWsdbs"
).get()


class GPT4oMini(weave.Model):
    model_name: str = "gpt-4o-mini"
    temp: float = 0.0
    max_tokens: int = 2048
    top_p: float = 1.0

    @weave.op()
    async def predict(self, query: str) -> str:
        response = await acompletion(
            model=self.model_name,
            messages=[
                {
                    "role": "system",
                    "content": "You are provided with government reports. Summarize the report in a few sentences but make sure to include all the important information."
                },
                {
                    "role": "user",
                    "content": query
                }
            ],
            temperature=self.temp,
            max_tokens=self.max_tokens,
            top_p=self.top_p
        )
        return response.choices[0].message.content


class GPT4o(weave.Model):
    model_name: str = "gpt-4o-2024-08-06"
    temp: float = 0.0
    max_tokens: int = 2048
    top_p: float = 1.0

    @weave.op()
    async def predict(self, query: str) -> str:
        time.sleep(2)
        response = await acompletion(
            model=self.model_name,
            messages=[
                {
                    "role": "system",
                    "content": "You are provided with government reports. Summarize the report in a few sentences but make sure to include all the important information."
                },
                {
                    "role": "user",
                    "content": query
                }
            ],
            temperature=self.temp,
            max_tokens=self.max_tokens,
            top_p=self.top_p
        )
        return response.choices[0].message.content


gpt4o = GPT4o()
gpt4omini = GPT4oMini()

nest_asyncio.apply()

# Initialize BLEU Scorer with column map
scorer = BLEUScorer(column_map={"output": "query", "ground_truths": "ground_truth"})

# Create the evaluation
evaluation = Evaluation(
    dataset=dataset.rows[:30],
    scorers=[scorer],
)

# Evaluate the models
asyncio.run(evaluation.evaluate(gpt4o))
asyncio.run(evaluation.evaluate(gpt4omini))

The models are tasked with summarizing reports into concise and informative sentences. Each model is defined with parameters such as temperature, maximum tokens, and top-p, and uses Weave’s @weave.op decorator to track model predictions. The BLEU Scorer is initialized with a column mapping to align the model outputs with their corresponding ground truths.
The evaluation uses a subset of the dataset, selecting the first 30 rows for comparison. Both models are evaluated independently, and their results are aggregated using Weave’s evaluation framework.
Here are the results from the evaluation:

The evaluation results, including sentence-level and corpus-level BLEU scores, are displayed on a Weave dashboard. These insights enable detailed comparisons of model performance and output quality.
At the corpus level, GPT-4o Mini achieves a BLEU score of 30.709 with a brevity penalty of 0.940, indicating strong alignment with the reference texts and relatively balanced output lengths. In contrast, GPT-4o has a lower corpus-level BLEU score of 19.704 and a brevity penalty of 0.590, suggesting shorter hypotheses and weaker alignment with the references.
Additionally, we can analyze the specific responses for each model on each example, using the comparisons view inside Weave:

This feature provides a detailed side-by-side comparison of the outputs generated by each model for every individual example in the dataset, along with their corresponding reference text. It allows us to see not just the BLEU scores for each prediction but also the qualitative differences in how each model approaches the task, such as variations in phrasing and inclusion of key details.
By drilling down into these comparisons, we can identify patterns in strengths and weaknesses for each model, gaining deeper insights into why one model might perform better overall or fail in specific scenarios. This level of granularity is helpful for debugging and improving model performance.

The limitations of BLEU

While the BLEU score is widely used, it has notable limitations that are well-recognized in the research community:
  • String Matching Over Semantic Understanding: BLEU relies on n-gram overlap to evaluate similarity, ignoring semantic equivalence. This means valid translations with different phrasing, such as "sofa" for "couch," are unfairly penalized. BLEU also rewards outputs that contain correct n-grams even when they are arranged in a nonsensical or incoherent order, leading to inflated scores for poor-quality text.
  • Limited Flexibility with References: BLEU’s performance is highly sensitive to the reference set. Most evaluations rely on a single human-written reference, which penalizes diverse but valid outputs. Even with multiple references, BLEU struggles to capture the full range of acceptable variations in phrasing or meaning.
  • Context-Specific and Dataset Bound: The BLEU score is heavily influenced by the specific test set and language pair, making comparisons across datasets or domains unreliable. It focuses on token overlap rather than meaning preservation, which limits its effectiveness for assessing true translation quality.
  • Interpretation Challenges: Small changes in BLEU scores can be difficult to interpret, as they may not reflect meaningful improvements. Additionally, models overfitted to a specific test set can achieve artificially high BLEU scores without delivering real-world performance gains.
These limitations emphasize the need to use BLEU in combination with other evaluation metrics to gain a comprehensive understanding of model performance. Metrics that incorporate semantic equivalence or human evaluations can complement BLEU, addressing its shortcomings while ensuring a more robust assessment of text quality.

Conclusion

Despite its limitations, the BLEU score remains a valuable tool for evaluating machine-generated text when applied appropriately. While its reliance on string matching and lack of semantic understanding are well-known, BLEU’s simplicity and ease of reproducibility make it a staple for benchmarking text generation systems.
Tools like Weave further enhance its practicality by simplifying evaluation workflows and delivering actionable insights. As AI continues to evolve, BLEU remains a reliable metric for measuring advancements in text quality and performance.

Iterate on AI agents and models faster. Try Weights & Biases today.