AI scorers: Evaluating AI-generated text with BLEU
This article breaks down BLEU, a key metric for evaluating machine-generated text, covering its mechanics, practical applications with Python and Weave, and its role in improving text generation systems.
BLEU, or Bilingual Evaluation Understudy, is a cornerstone metric for evaluating machine-generated text, particularly in tasks like machine translation and summarization. By comparing generated text to human-written references, BLEU provides a quantitative measure of similarity through n-gram precision and brevity.
While calculating BLEU effectively across datasets can be complex, tools like Weave simplify the process by automating evaluation workflows and providing actionable insights. In this article, we’ll break down the mechanics of BLEU, explore its practical applications, and demonstrate its use with Python and Weave.
Ready to jump into the code? Check out our interactive Colab notebook to experiment with BLEU scoring in a guided environment:
Or, you can continue reading to dive deeper into how BLEU works and how it fits into the landscape of text evaluation metrics.

Table of contents
- What is a BLEU score?
- How is a BLEU score calculated?
- N-gram precision
- Clipping, brevity penalty, and geometric mean
- Final BLEU Score
- Corpus-level BLEU scoring vs. sentence-level
- Tutorial: Using BLEU with Weave Scorers
- Step 1: Installation
- Step 2: Setting up the BLEU scorer
- Evaluating multiple models with BLEU and Weave Evaluations
- The limitations of BLEU
- Conclusion
What is a BLEU score?
A BLEU score quantifies the quality of AI-generated text by comparing it to one or more reference texts. It measures the degree of n-gram overlap, where n-grams are sequences of n words or tokens. By calculating precision across 1- to 4-grams, BLEU provides a detailed similarity analysis between the generated text and the reference(s).
For example, in the sentence "The cat sat on the mat," 1-grams are single words like "The" and "cat," while 2-grams are word pairs like "The cat" and "cat sat." BLEU scores how many n-grams in the generated text match those in the reference(s), capturing alignment at different granularities.
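To make the idea concrete, here is a minimal sketch in plain Python (no BLEU library required) that extracts the 1-grams and 2-grams of the example sentence; the ngrams helper is just for illustration:

def ngrams(tokens, n):
    # All contiguous sequences of n tokens
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "The cat sat on the mat".split()
print(ngrams(tokens, 1))  # [('The',), ('cat',), ('sat',), ('on',), ('the',), ('mat',)]
print(ngrams(tokens, 2))  # [('The', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]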
In addition to precision, BLEU applies a brevity penalty to discourage overly short outputs. This ensures the generated text is sufficiently complete while remaining aligned with the reference. For instance, in machine translation tasks, a BLEU score reflects whether the generated text captures the essence and key details of the reference translation. If it omits essential elements or is overly verbose, the BLEU score indicates these shortcomings.
How is a BLEU score calculated?
A BLEU score measures the quality of AI-generated text by comparing it to one or more reference texts. It calculates how well the generated text aligns with the reference(s) by using two key components: n-gram precision, which measures word or phrase overlap, and a brevity penalty to discourage overly short outputs.
Here's how the BLEU score is calculated:
N-gram precision
BLEU starts by calculating how many n-grams in the hypothesis appear in the reference(s). If the hypothesis is "The cat sat on the mat" and the reference is "A cat was sitting on the mat", the matching 1-grams are "cat", "on", "the", and "mat". Precision is calculated as the number of matching n-grams divided by the total number of n-grams in the hypothesis.
This is repeated for 2-grams, 3-grams, and so on, up to a maximum order of 4-grams by default.
Clipping, brevity penalty, and geometric mean
BLEU applies adjustments to ensure fairness. Clipping caps the count credited to each n-gram at the maximum number of times it appears in the reference. If the hypothesis is "the the the the" and the reference is "the cat sat on the mat", the word "the" is credited at most twice (its count in the reference), giving a clipped 1-gram precision of 2/4 rather than 4/4.
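To make the counting concrete, here is a minimal sketch of clipped (modified) 1-gram precision. It assumes simple whitespace tokenization and lowercasing; real implementations such as sacreBLEU tokenize more carefully:

from collections import Counter

def clipped_precision(hypothesis, reference, n=1):
    # Modified (clipped) n-gram precision: each hypothesis n-gram is credited
    # at most as many times as it appears in the reference.
    def ngram_counts(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    hyp, ref = ngram_counts(hypothesis), ngram_counts(reference)
    matches = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matches / sum(hyp.values())

print(clipped_precision("the the the the", "the cat sat on the mat"))               # 2/4 = 0.5
print(clipped_precision("The cat sat on the mat", "A cat was sitting on the mat"))  # 4/6 ≈ 0.67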
The brevity penalty adjusts the score to discourage overly short hypotheses. If the hypothesis is shorter than the reference, a penalty reduces the score proportionally to the length difference. If the hypothesis length (c) is greater than or equal to the reference length (r), no penalty is applied.
The brevity penalty is calculated as:

BP = 1, if c ≥ r
BP = exp(1 − r / c), if c < r
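As a quick worked example, a 6-token hypothesis scored against a 7-token reference receives a penalty of exp(1 − 7/6) ≈ 0.85. A minimal sketch:

import math

def brevity_penalty(c, r):
    # c = hypothesis length, r = reference length
    return 1.0 if c >= r else math.exp(1 - r / c)

print(brevity_penalty(6, 7))   # ≈ 0.846: hypothesis slightly shorter than the reference
print(brevity_penalty(10, 7))  # 1.0: no penalty when the hypothesis is at least as long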
Final BLEU Score
The final BLEU score is computed by combining the brevity penalty and the geometric mean of the clipped n-gram precision scores:

BLEU = BP · exp( Σ_{n=1}^{N} w_n · log p_n )

where p_n is the clipped precision for n-grams of order n, w_n = 1/N are uniform weights, and N = 4 by default.
The formula ensures that precision across all n-grams and hypothesis completeness contribute to the score, which ranges from 0 to 100. Higher scores indicate better quality, but direct comparisons across datasets or languages are discouraged due to contextual variability.
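Here is a minimal sketch of how the pieces combine, using hypothetical clipped precisions for 1- to 4-grams and a hypothetical brevity penalty (real implementations typically also smooth zero counts, as the scorer parameters later in this article show):

import math

def bleu(precisions, brevity_penalty):
    # Geometric mean of the clipped n-gram precisions (uniform weights), scaled by the brevity penalty
    if min(precisions) == 0:
        return 0.0  # without smoothing, a single zero precision zeroes the whole score
    log_avg = sum(math.log(p) for p in precisions) / len(precisions)
    return 100 * brevity_penalty * math.exp(log_avg)

print(bleu([0.67, 0.40, 0.25, 0.17], 0.9))  # ≈ 29.4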
Corpus-level BLEU scoring vs. sentence-level
The above formulas show how to calculate sentence-level BLEU scores, which evaluate individual predictions against their reference texts. While useful for debugging and analyzing specific outputs, averaging sentence-level BLEU scores across a dataset is less reliable, as it ignores sentence length and corpus-level context.
Corpus-level BLEU, on the other hand, aggregates statistics across the whole dataset before computing a single score: n-gram matches and counts are pooled over all hypotheses and references, and the brevity penalty is applied globally based on total hypothesis and reference lengths. This global approach smooths over sentence-level variability and reflects the model's overall performance across the entire dataset.
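The distinction is easy to see with the sacreBLEU library directly (the Weave scorer used later in this article builds on it). A minimal sketch, assuming sacrebleu is installed (pip install sacrebleu):

import sacrebleu

hypotheses = ["The cat sat on the mat.", "The sun rises in the east."]
references = [["A cat was sitting on the mat."], ["Sunlight emerges from the eastern horizon."]]

# Sentence-level: score each hypothesis against its own reference(s)
sentence_scores = [sacrebleu.sentence_bleu(h, r).score for h, r in zip(hypotheses, references)]
print(sentence_scores, sum(sentence_scores) / len(sentence_scores))

# Corpus-level: pool n-gram counts and lengths first, then compute one score.
# Note the transposed reference format: one list per reference set, aligned with the hypotheses.
print(sacrebleu.corpus_bleu(hypotheses, [[refs[0] for refs in references]]).score)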
Tutorial: Using BLEU with Weave Scorers
To see the BLEU score in action, we’ll set up a script using Weave’s BLEUScorer. This tool integrates with the sacreBLEU library to calculate both sentence-level and corpus-level BLEU scores. Below, we’ll walk through the steps to compute these scores and analyze the results.
Step 1: Installation
Start by installing the necessary libraries. Run the following commands:
git clone https://github.com/wandb/weave.git && cd weave && git fetch origin pull/3006/head:xtra-scorers && git checkout xtra-scorers && pip install -qq -e .
Also, to run the evaluation later in this tutorial, export your OpenAI API key:
export OPENAI_API_KEY='your api key'
If you are interested in experimenting with BLEU in a Google Colab, feel free to check out a notebook here.
Step 2: Setting up the BLEU scorer
Next, we'll calculate sentence-level and corpus-level BLEU scores.
First, import the necessary libraries and initialize the BLEUScorer:
import weave
weave.init('bleu-scorer-demo')

from weave.scorers import BLEUScorer

# Initialize Weave BLEU scorer
weave_scorer = BLEUScorer()

# Define sample hypotheses and references
examples = [
    {"hypothesis": "The cat sat on the mat.", "references": ["A cat was sitting on the mat."]},
    {"hypothesis": "The quick brown fox jumps over the lazy dog.", "references": ["A fast brown fox jumped over a sleeping dog."]},
    {"hypothesis": "The sun rises in the east.", "references": ["Sunlight emerges from the eastern horizon."]},
]

# Prepare input for Weave's summarize function
weave_inputs = []

print("Sentence-Level BLEU:")
for example in examples:
    # Compute sentence-level BLEU for each example
    sentence_bleu_result = weave_scorer.score(
        ground_truths=example["references"],
        output=example["hypothesis"],
    )
    weave_inputs.append({
        "output_pred": example["hypothesis"],
        "output_refs": example["references"],
        "sentence_bleu": sentence_bleu_result["sentence_bleu"],  # Use the sentence-level score from Weave
    })

    # Print sentence-level BLEU results
    print(f"Hypothesis: {sentence_bleu_result['output_pred']}")
    print(f"References: {sentence_bleu_result['output_refs']}")
    print(f"BLEU Score: {sentence_bleu_result['sentence_bleu']}")
    print(f"Brevity Penalty: {sentence_bleu_result['sentence_bp']}")
    print("-" * 50)

# Calculate corpus-level BLEU using Weave
weave_corpus_score = weave_scorer.summarize(weave_inputs)

# Print corpus-level BLEU results
print("\nWeave Corpus-Level BLEU:")
print(f"BLEU Score: {weave_corpus_score['corpus_level']['bleu']}")
print(f"Brevity Penalty: {weave_corpus_score['corpus_level']['brevity_penalty']}")
Interpreting results
The script computes both sentence-level and corpus-level BLEU scores, providing insights into text quality at multiple granularities. Here are the main distinctions:
- Sentence-level BLEU: Useful for analyzing specific outputs and debugging.
- Corpus-level BLEU: Aggregates scores across all examples, offering a holistic view of model performance.
Here are some of the parameters that you can use to customize your BLEU Scorer:
Parameter | Type | Default | Description
---|---|---|---
lowercase | bool | False | If True, performs case-insensitive matching.
tokenize | Optional[str] | None | Specifies the tokenizer to use. Defaults to sacreBLEU's built-in tokenizer.
smooth_method | str | "exp" | Smoothing method for n-grams with zero matches. Options: 'exp', 'floor', 'add-k', 'none'.
smooth_value | Optional[float] | None | Value used by the 'floor' and 'add-k' smoothing methods.
max_ngram_order | int | 4 | The maximum n-gram order used in the BLEU calculation.
effective_order | bool | True | If True, adjusts for missing higher-order n-grams in short sentences.
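For example, here is a hypothetical configuration that lowercases before matching, caps the n-gram order at 2, and uses add-k smoothing. The parameter names follow the table above, and the score call mirrors the earlier example:

from weave.scorers import BLEUScorer

custom_scorer = BLEUScorer(
    lowercase=True,         # case-insensitive matching
    max_ngram_order=2,      # BLEU-2 instead of the default BLEU-4
    smooth_method="add-k",  # add-k smoothing for n-grams with zero matches
    smooth_value=1.0,
)

result = custom_scorer.score(
    ground_truths=["A cat was sitting on the mat."],
    output="The cat sat on the mat.",
)
print(result["sentence_bleu"])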
Evaluating multiple models with BLEU and Weave Evaluations
Weave Evaluations also enables structured comparisons of multiple models, such as GPT-4o and GPT-4o Mini. By evaluating predictions against a shared reference dataset, you can identify performance differences between models.
Here’s a condensed overview of how to set up such an evaluation using Weave:
import asyncio
import time

import nest_asyncio
import weave
from litellm import acompletion
from weave import Evaluation
from weave.scorers import BLEUScorer

# Initialize Weave with BLEU Scorer
weave_client = weave.init("bleu-scorer-demo")

# Load the evaluation dataset
dataset = weave.ref("weave:///c-metrics/rouge-scorer/object/longbench_gov_report_subset:qGNjItwJSEw1NF6xMXX2a0syHJfXVMjeYqwqVwWsdbs").get()


class GPT4oMini(weave.Model):
    model_name: str = "gpt-4o-mini"
    temp: float = 0.0
    max_tokens: int = 2048
    top_p: float = 1.0

    @weave.op()
    async def predict(self, query: str) -> str:
        response = await acompletion(
            model=self.model_name,
            messages=[
                {"role": "system", "content": "You are provided with government reports. Summarize the report in a few sentences but make sure to include all the important information."},
                {"role": "user", "content": query},
            ],
            temperature=self.temp,
            max_tokens=self.max_tokens,
            top_p=self.top_p,
        )
        return response.choices[0].message.content


class GPT4o(weave.Model):
    model_name: str = "gpt-4o-2024-08-06"
    temp: float = 0.0
    max_tokens: int = 2048
    top_p: float = 1.0

    @weave.op()
    async def predict(self, query: str) -> str:
        time.sleep(2)  # pause briefly between requests
        response = await acompletion(
            model=self.model_name,
            messages=[
                {"role": "system", "content": "You are provided with government reports. Summarize the report in a few sentences but make sure to include all the important information."},
                {"role": "user", "content": query},
            ],
            temperature=self.temp,
            max_tokens=self.max_tokens,
            top_p=self.top_p,
        )
        return response.choices[0].message.content


gpt4o = GPT4o()
gpt4omini = GPT4oMini()

nest_asyncio.apply()

# Initialize BLEU Scorer with column map
scorer = BLEUScorer(column_map={"output": "query", "ground_truths": "ground_truth"})

# Create the evaluation
evaluation = Evaluation(
    dataset=dataset.rows[:30],
    scorers=[scorer],
)

# Evaluate the models
asyncio.run(evaluation.evaluate(gpt4o))
asyncio.run(evaluation.evaluate(gpt4omini))
The models are tasked with summarizing reports into concise and informative sentences. Each model is defined with parameters such as temperature, maximum tokens, and top-p, and uses Weave’s @weave.op decorator to track model predictions. The BLEU Scorer is initialized with a column mapping to align the model outputs with their corresponding ground truths.
The evaluation uses a subset of the dataset, selecting the first 30 rows for comparison. Both models are evaluated independently, and their results are aggregated using Weave’s evaluation framework.
Here are the results from the evaluation:

The evaluation results, including sentence-level and corpus-level BLEU scores, are displayed on a Weave dashboard. These insights enable detailed comparisons of model performance and output quality.
At the corpus level, GPT-4o Mini achieves a BLEU score of 30.709 with a brevity penalty of 0.940, indicating strong alignment with the reference texts and relatively balanced output lengths. In contrast, GPT-4o has a lower corpus-level BLEU score of 19.704 and a brevity penalty of 0.590, suggesting noticeably shorter hypotheses and weaker alignment.
Additionally, we can analyze the specific responses for each model on each example, using the comparisons view inside Weave:

This feature provides a detailed side-by-side comparison of the outputs generated by each model for every individual example in the dataset, along with their corresponding reference text. It allows us to see not just the BLEU scores for each prediction but also the qualitative differences in how each model approaches the task, such as variations in phrasing and inclusion of key details.
By drilling down into these comparisons, we can identify patterns in strengths and weaknesses for each model, gaining deeper insights into why one model might perform better overall or fail in specific scenarios. This level of granularity is helpful for debugging and improving model performance.
The limitations of BLEU
While the BLEU score is widely used, it has notable limitations that are well-recognized in the research community:
- String Matching Over Semantic Understanding: BLEU relies on n-gram overlap to evaluate similarity, ignoring semantic equivalence. This means valid translations with different phrasing, such as "sofa" for "couch," are unfairly penalized. BLEU also rewards outputs that contain correct n-grams even when they appear in a nonsensical or incoherent order, leading to inflated scores for poor-quality text.
- Limited Flexibility with References: BLEU’s performance is highly sensitive to the reference set. Most evaluations rely on a single human-written reference, which penalizes diverse but valid outputs. Even with multiple references, BLEU struggles to capture the full range of acceptable variations in phrasing or meaning.
- Context-Specific and Dataset Bound: The BLEU score is heavily influenced by the specific test set and language pair, making comparisons across datasets or domains unreliable. It focuses on token overlap rather than meaning preservation, which limits its effectiveness for assessing true translation quality.
- Interpretation Challenges: Small changes in BLEU scores can be difficult to interpret, as they may not reflect meaningful improvements. Additionally, models overfitted to a specific test set can achieve artificially high BLEU scores without delivering real-world performance gains.
These limitations emphasize the need to use BLEU in combination with other evaluation metrics to gain a comprehensive understanding of model performance. Metrics that incorporate semantic equivalence or human evaluations can complement BLEU, addressing its shortcomings while ensuring a more robust assessment of text quality.
Conclusion
Despite its limitations, the BLEU score remains a valuable tool for evaluating machine-generated text when applied appropriately. While its reliance on string matching and lack of semantic understanding are well-known, BLEU’s simplicity and ease of reproducibility make it a staple for benchmarking text generation systems.
Tools like Weave further enhance its practicality by simplifying evaluation workflows and delivering actionable insights. As AI continues to evolve, BLEU remains a reliable metric for measuring advancements in text quality and performance.