
LLM evaluation benchmarking: Beyond BLEU and ROUGE

Moving from word-counting to metrics that actually measure quality
BLEU and ROUGE scores look great on paper. They're fast to compute, easy to understand, and show up in practically every NLP research paper. The problem? They were designed twenty years ago for a completely different kind of language model.
Traditional metrics count word matches between your model's output and a reference answer. When machine translation systems produced rigid, predictable text, this worked reasonably well. But modern LLMs don't play by those rules. They paraphrase, elaborate, and produce answers that can mean exactly the same thing as the reference while sharing almost no words with it. In cases like this, BLEU looks at a perfect answer yet assigns it a near-zero score.
This article covers why traditional metrics break down with modern LLMs, what alternatives exist, and how to set up an evaluation pipeline that measures things users actually care about.

What are LLM evaluation metrics, and why do they matter?

At its core, every metric is just a function: put in some model output, get back a number. The number represents some aspect of quality, such as accuracy, fluency, factual correctness, or whatever else you're trying to measure.
The tricky part is that whichever metric you pick ends up shaping your entire optimization process. If your metric doesn't capture what matters, you'll build a model that scores well on tests but disappoints real users.
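To make that concrete, here's a minimal sketch of a metric as a function. The exact_match name and the whitespace normalization are just illustrative choices, not a standard API:
def exact_match(output: str, reference: str) -> float:
    """Toy metric: 1.0 if the normalized strings match, else 0.0."""
    def normalize(s: str) -> str:
        return " ".join(s.lower().strip().split())
    return 1.0 if normalize(output) == normalize(reference) else 0.0

# Put in some model output, get back a number.
print(exact_match("Paris", "paris"))                  # 1.0
print(exact_match("The capital is Paris.", "Paris"))  # 0.0 -- a correct answer scored as a failure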
Modern metrics generally fall into two camps:
  • Surface-level metrics (BLEU, ROUGE, METEOR, exact-match): Fast, deterministic, and easy to interpret, but blind to meaning. They compare the model's output against a reference text and count overlapping words and phrases; high overlap means a high score. Good for narrow tasks with template-like outputs.
  • Semantic metrics (BERTScore, BLEURT, GPTScore, embedding similarity): Use neural networks or LLMs to judge meaning rather than form. Slower, more expensive, and sometimes less reproducible, but much better at recognizing valid paraphrases, assessing coherence, and matching human judgments of quality.
The metrics you track directly influence how you fine-tune, which prompts you choose, and what you consider good enough to ship. If you optimize solely for BLEU, you'll get models that memorize reference phrasings but struggle with real user questions. If you track semantic similarity, groundedness, and task success, you'll build systems that work in practice.

Common benchmarks worth knowing

A benchmark is a standardized test suite: a fixed set of tasks, datasets, and metrics that everyone runs the same way. Instead of cherry-picking examples or inventing your own scoring rules, you use a shared test so results are comparable across models and labs.
Here are the benchmarks you'll see most often:
  • GLUE and SuperGLUE: Collections of classic NLP tasks like sentiment analysis, textual entailment, and question answering. Built for BERT-era models. They're well-studied but increasingly saturated; modern LLMs score near 100% on many GLUE tasks.
  • MMLU (Massive Multitask Language Understanding): 57 subjects spanning high school and college knowledge, from math and physics to law and history. Measures breadth of knowledge and reasoning. Typical GPT-4 class models score 80–90%.
  • BIG-bench: A huge collection of over 200 diverse tasks contributed by researchers, designed to probe edge cases, reasoning, and factual knowledge. Tasks range from arithmetic to social bias detection. Useful for finding blind spots.
  • HELM (Holistic Evaluation of Language Models): A framework that runs models across many scenarios (question answering, summarization, sentiment, toxicity) and reports accuracy, calibration, robustness, fairness, and efficiency. HELM doesn't just ask "how accurate is the model?" but "how does it fail, and who does it fail for?"
  • MT-Bench / AlpacaEval / Arena-Hard: Chat and instruction-following benchmarks that use pairwise LLM-as-a-judge comparisons instead of fixed labels. Closer to how humans evaluate conversational agents.
Unlike a single metric, these benchmarks give you a multi-dimensional view. For example, the model might excel at factual recall but struggle with commonsense reasoning, or perform well in English but poorly in other languages. That nuance is what helps you decide whether a model is ready for your use case.

Statistical metrics vs. model-based metrics

When you score model output, you have two broad options: count tokens or ask a model to judge meaning.

Statistical scorers

How they work: Compare output to reference(s) and compute overlap in words, n-grams, or longest common subsequences.
Examples: BLEU (translation), ROUGE (summarization), METEOR (adds stemming and synonyms), exact-match (short answers).
Pros: Fast, deterministic, transparent. No API calls, no extra models, same score every time.
Cons: Ignore semantics. Penalize valid paraphrases. Can't assess reasoning, factual correctness, or coherence.
When to use them: Narrow tasks with template outputs, regression testing to catch catastrophic changes, legacy baselines for comparison.
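To make "count overlapping words" concrete, here's a rough sketch of clipped unigram precision, the building block behind BLEU (the real metric adds higher-order n-grams, a brevity penalty, and smoothing):
from collections import Counter

def unigram_precision(output: str, reference: str) -> float:
    """Fraction of output tokens that also appear in the reference, with counts clipped."""
    out_counts = Counter(output.lower().split())
    ref_counts = Counter(reference.lower().split())
    matches = sum(min(count, ref_counts[token]) for token, count in out_counts.items())
    total = sum(out_counts.values())
    return matches / total if total else 0.0

print(unigram_precision("The test was successful", "The experiment succeeded"))  # 0.25 -- only "the" overlaps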

Model-based scorers

How they work: Use neural embeddings or an LLM to assess quality. BERTScore compares embeddings; BLEURT is trained to predict human judgments; GPTScore prompts an LLM to rate outputs.
Examples: BERTScore (embedding similarity), BLEURT (trained on human ratings), GPTScore, LLM-as-a-judge (prompt a strong model to grade answers).
Pros: Recognize paraphrases, measure semantic similarity, and align better with human preferences.
Cons: Slower, costs money (if calling APIs), less reproducible (model updates change scores), can inherit biases from the judge model.
When to use them: Open-ended generation, tasks with many valid answers, when human correlation matters more than speed.

The tradeoff

Statistical metrics give you a cheap, stable signal that's easy to debug. Model-based metrics give you a richer signal that actually reflects whether the output is good. In practice, you'll use both: BLEU/ROUGE for quick smoke tests, semantic metrics for deeper evaluation, and human review to calibrate everything.
| Dimension | Statistical scorers | Model-based scorers |
|---|---|---|
| Speed | Instant | Seconds per example |
| Cost | Free | API calls or GPU time |
| Reproducibility | Perfect | Changes with model updates |
| Semantic awareness | None | High |
| Human correlation | Low to moderate | Moderate to high |
| Best for | Regression tests, narrow tasks | Open-ended tasks, quality checks |


The limitations of the ROUGE and BLEU metrics

BLEU was designed in 2002 for machine translation; ROUGE followed shortly after for summarization. Both served their purpose when models were fragile and you wanted to keep outputs close to known-good references. But LLMs changed the rules.

Where BLEU and ROUGE fall short

  1. They ignore meaning: If your model says "The capital of France is Paris" and the reference says "Paris is the capital of France," BLEU will give a mediocre score because word order differs. A human sees two identical statements.
  2. They penalize paraphrases: Suppose the reference is "The experiment succeeded." Your model outputs "The test was successful." Zero word overlap, BLEU near zero, but the meaning is the same.
  3. They can't assess reasoning: For chain-of-thought prompts or multi-step solutions, BLEU only cares if the words match. It doesn't check whether the logic is sound or the steps are valid.
  4. They reward keyword stuffing: You can game BLEU by repeating words from the reference, even if the output is nonsense. Higher score, worse quality.
  5. Multiple valid answers break them: Many tasks have dozens of good responses. BLEU and ROUGE compare against one or a few references and penalize everything else, even if it's correct.

A concrete example

Let's understand this with a simple example:
Question: "Who wrote Pride and Prejudice?" Reference: "Jane Austen" Model A: "Jane Austen" Model B: "Pride and Prejudice was written by Jane Austen." Model C: "The author is Jane Austen."
BLEU and ROUGE rank them: A > C > B, because A is an exact match and B adds extra words. But all three are perfectly correct answers. If you optimize for BLEU, you'll push the model toward terse, reference-matching outputs that might feel unnatural to users.
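You can verify this with the same libraries used later in the tutorial (sacrebleu and rouge-score). Exact numbers depend on tokenization and smoothing, but the ordering typically comes out A above C above B:
import sacrebleu
from rouge_score import rouge_scorer

reference = "Jane Austen"
candidates = {
    "A": "Jane Austen",
    "B": "Pride and Prejudice was written by Jane Austen.",
    "C": "The author is Jane Austen.",
}

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
for name, answer in candidates.items():
    bleu = sacrebleu.sentence_bleu(answer, [reference], smooth_method="exp").score
    rouge_l = scorer.score(reference, answer)["rougeL"].fmeasure
    # All three answers are correct; only their wording differs
    print(f"Model {name}: BLEU={bleu:.1f}  ROUGE-L={rouge_l:.2f}")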

What to use instead—and when to use it

  • BERTScore: Compares outputs using BERT embeddings, so paraphrases score high.
  • BLEURT: Trained on human judgments of quality, aligns better with what people actually prefer.
  • Semantic similarity (embeddings): Measure cosine similarity between output and reference in a vector space.
  • LLM-as-a-judge: Prompt a strong model to rate outputs on a rubric (accuracy, clarity, groundedness).
These aren't perfect either, but they at least recognize when two different phrasings mean the same thing.
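As a quick sanity check, score the paraphrase pair from earlier with both BLEU and BERTScore. The exact BERTScore value depends on the underlying model, but it should land far above the near-zero BLEU:
import sacrebleu
from bert_score import score as bert_score_fn

reference = "The experiment succeeded."
output = "The test was successful."

bleu = sacrebleu.sentence_bleu(output, [reference], smooth_method="exp").score
_, _, f1 = bert_score_fn([output], [reference], lang="en", verbose=False)

print(f"BLEU: {bleu:.1f}")                   # near zero: almost no word overlap
print(f"BERTScore F1: {float(f1[0]):.2f}")   # high: the embeddings recognize the paraphrase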

Comprehensive evaluation through benchmarks

A single metric is just one lens. Benchmarks combine multiple metrics across many tasks to give you a fuller picture of what a model can and can't do.
Take HELM as an example. Instead of just reporting accuracy on question answering, it also measures:
  • Calibration: Does the model know when it's uncertain? (see the sketch below)
  • Robustness: How sensitive is it to prompt wording or example order?
  • Fairness: Does performance vary across demographic groups?
  • Efficiency: Token cost, latency, carbon footprint.
That multi-dimensional view reveals tradeoffs.
  • A model might have high accuracy but terrible calibration (overconfident on wrong answers).
  • Another might be fast but biased.
You can't see these patterns with BLEU alone.
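To make the calibration dimension concrete, here's a minimal sketch of expected calibration error (ECE), assuming you already have a confidence and a correctness flag for each answer. The binning scheme and function name are illustrative, not HELM's exact implementation:
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Average |accuracy - confidence| across confidence bins, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return float(ece)

# An overconfident model: very high confidence, mediocre accuracy -> large ECE
print(expected_calibration_error([0.95, 0.90, 0.99, 0.92], [1, 0, 1, 0]))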
BIG-bench takes a different approach: hundreds of contributed tasks, many intentionally weird or edge-case-heavy, designed to find blind spots. One task tests arithmetic with large numbers. Another checks whether the model understands social norms. Another probes linguistic structure. Running your model through BIG-bench shows you where it silently fails in ways you wouldn't have thought to test.
MMLU is simpler but still valuable: 57 subjects, multiple-choice questions, clear accuracy scores per domain. If your model scores 90% on physics but 60% on law, you know where to focus fine-tuning or retrieval.
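Scoring a multiple-choice benchmark like MMLU is simple in principle: compare the predicted letter to the gold letter and aggregate per subject. A rough sketch, with an illustrative record format rather than MMLU's actual files:
from collections import defaultdict

# Each record: the subject, the model's predicted choice, and the gold choice (illustrative format)
records = [
    {"subject": "physics", "pred": "B", "gold": "B"},
    {"subject": "physics", "pred": "C", "gold": "A"},
    {"subject": "law", "pred": "D", "gold": "D"},
]

totals, hits = defaultdict(int), defaultdict(int)
for r in records:
    totals[r["subject"]] += 1
    hits[r["subject"]] += int(r["pred"] == r["gold"])

for subject in totals:
    print(f"{subject}: {hits[subject] / totals[subject]:.0%}")  # per-domain accuracy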
The common thread: comprehensive benchmarks don't reduce the model to a single number. They give you a profile of strengths, weaknesses, and tradeoffs that helps you decide whether the model fits your use case.

Tutorial: Implementing LLM evaluation with Weights & Biases

This section walks through a hands-on example: evaluating model outputs with multiple metrics and comparing them visually in the W&B dashboard. We'll build a realistic evaluation setup with a diverse dataset, two different models, and multiple judgment criteria.

What we'll build

  • A Q&A dataset with 12 examples covering factual knowledge, explanations, comparisons, and calculations
  • Two model wrappers: GPT-4o-mini and GPT-3.5-turbo
  • Six metrics that measure different aspects of answer quality:
    • BLEU (statistical baseline)
    • ROUGE-L (statistical baseline)
    • BERTScore (semantic similarity)
    • Embedding similarity (cosine similarity in vector space)
    • LLM-as-a-judge for factual accuracy (GPT-4o-mini)
    • LLM-as-a-judge for helpfulness (GPT-3.5-turbo)
  • A Weave Evaluation that runs both models against every metric and logs the results
  • A leaderboard to compare the models side-by-side

Step 0: Install dependencies

Start with the required packages:
pip install weave openai bert-score rouge-score sacrebleu wandb python-dotenv
wandb login

Step 1: Initialize Weave

Weave manages the evaluation infrastructure and logging while the other packages will provide our scoring functions.
import weave
from openai import OpenAI
import os
from dotenv import load_dotenv
Next, load your environment variables so the OpenAI SDK has access to your API key. Keeping the key in a .env file avoids hard-coding secrets in your code. Then initialize your Weave project:
load_dotenv()

# Capture the Weave client for later use with leaderboards
weave_client = weave.init("llm-benchmarking-demo")
openai_client = OpenAI()

Step 2: Create a diverse dataset

We'll build a dataset that mixes different answer types: short facts, explanations, comparisons, and calculations. This variety helps us see how different metrics behave across question types.
from weave import Dataset

rows = [
    {
        "id": "1",
        "question": "What is the capital of Japan?",
        "reference": "Tokyo",
        "category": "factual"
    },
    {
        "id": "2",
        "question": "Who developed the theory of relativity?",
        "reference": "Albert Einstein",
        "category": "factual"
    },
    {
        "id": "3",
        "question": "Explain what photosynthesis is.",
        "reference": "Photosynthesis is the process by which plants use sunlight, water, and carbon dioxide to produce oxygen and energy in the form of sugar.",
        "category": "explanation"
    },
    {
        "id": "4",
        "question": "What is machine learning?",
        "reference": "Machine learning is a subset of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed.",
        "category": "explanation"
    },
    {
        "id": "5",
        "question": "Compare Python and JavaScript for web development.",
        "reference": "Python is primarily used for backend development with frameworks like Django and Flask, while JavaScript is essential for frontend development and can also handle backend with Node.js.",
        "category": "comparison"
    },
    {
        "id": "6",
        "question": "What's the difference between supervised and unsupervised learning?",
        "reference": "Supervised learning uses labeled training data to learn the mapping between inputs and outputs, while unsupervised learning finds patterns in unlabeled data without predefined categories.",
        "category": "comparison"
    },
    {
        "id": "7",
        "question": "Calculate 15% of 240.",
        "reference": "36",
        "category": "calculation"
    },
    {
        "id": "8",
        "question": "What is 7 multiplied by 13?",
        "reference": "91",
        "category": "calculation"
    },
    {
        "id": "9",
        "question": "What does HTTP stand for?",
        "reference": "Hypertext Transfer Protocol",
        "category": "factual"
    },
    {
        "id": "10",
        "question": "Describe the water cycle briefly.",
        "reference": "The water cycle involves evaporation of water from surfaces, condensation into clouds, precipitation as rain or snow, and collection back into bodies of water.",
        "category": "explanation"
    },
    {
        "id": "11",
        "question": "Compare RAM and ROM in computers.",
        "reference": "RAM is volatile memory used for temporary storage while programs run, whereas ROM is non-volatile memory that stores permanent instructions for the computer.",
        "category": "comparison"
    },
    {
        "id": "12",
        "question": "If a product costs $80 after a 20% discount, what was the original price?",
        "reference": "$100",
        "category": "calculation"
    }
]

dataset = Dataset(name="qa_benchmark_v2", rows=rows)
weave.publish(dataset)
Of course, a set of twelve questions is small for production use, but sufficient to demonstrate the evaluation flow and see patterns across question types.

Step 3: Define two models

We'll create two model classes: one using GPT-4o-mini and another using GPT-3.5-turbo. This lets us compare a newer, more capable model against an older, faster, cheaper one.
from weave import Model

class GPT4oMiniModel(Model):
    """GPT-4o-mini: More capable, better instruction following"""
    model_name: str = "gpt-4o-mini"

    @weave.op()
    def predict(self, question: str) -> str:
        messages = [
            {"role": "system", "content": "You are a helpful assistant. Answer questions accurately and concisely."},
            {"role": "user", "content": question}
        ]
        res = openai_client.chat.completions.create(
            model=self.model_name,
            messages=messages,
            temperature=0.0
        )
        return res.choices[0].message.content.strip()


class GPT35TurboModel(Model):
    """GPT-3.5-turbo: Faster, cheaper, less capable"""
    model_name: str = "gpt-3.5-turbo"

    @weave.op()
    def predict(self, question: str) -> str:
        messages = [
            {"role": "system", "content": "You are a helpful assistant. Answer questions accurately and concisely."},
            {"role": "user", "content": question}
        ]
        res = openai_client.chat.completions.create(
            model=self.model_name,
            messages=messages,
            temperature=0.0
        )
        return res.choices[0].message.content.strip()



Step 4: Implement metrics

Now we'll implement six different metrics that measure different aspects of answer quality.

4a) BLEU Score

The classic n-gram overlap measure, normalized to 0-1 range:
import sacrebleu

@weave.op()
def bleu_score(reference: str, output: str) -> dict:
    """
    Statistical metric: measures n-gram overlap between output and reference.
    Good for: detecting catastrophic failures, regression testing.
    Bad for: paraphrased but correct answers.
    """
    score = sacrebleu.sentence_bleu(
        output,
        [reference],
        smooth_method="exp"
    ).score
    return {"bleu": score / 100.0}

4b) ROUGE-L

Measures the longest common subsequence, more forgiving of word reordering:
from rouge_score import rouge_scorer

_rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

@weave.op()
def rouge_l_score(reference: str, output: str) -> dict:
    """
    Statistical metric: measures longest common subsequence.
    Good for: summarization tasks, keyword preservation.
    Bad for: restructured but semantically identical answers.
    """
    scores = _rouge.score(reference, output)
    return {"rouge_l": scores["rougeL"].fmeasure}

4c) BERTScore

Compares texts using BERT embeddings to capture semantic similarity:
from bert_score import score as bert_score_fn

@weave.op()
def bert_score(reference: str, output: str) -> dict:
    """
    Semantic metric: uses BERT embeddings to measure similarity.
    Good for: recognizing paraphrases, semantic equivalence.
    Bad for: factual correctness (can score high on fluent nonsense).
    """
    try:
        _, _, F1 = bert_score_fn(
            [output],
            [reference],
            lang="en",
            verbose=False
        )
        return {"bert_score": float(F1[0])}
    except Exception as e:
        print(f"BERTScore failed: {e}")
        return {"bert_score": None}
The try/except block handles occasional failures from PyTorch configuration issues. Better to skip one metric than crash the entire evaluation.

4d) Embedding similarity

Uses OpenAI's embedding model to compute cosine similarity:
import numpy as np

@weave.op()
def embedding_similarity(reference: str, output: str) -> dict:
    """
    Semantic metric: computes cosine similarity between embeddings.
    Good for: measuring semantic closeness in vector space.
    Bad for: doesn't check factual accuracy or logical correctness.
    """
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=[reference, output]
    )
    ref_embedding = np.array(response.data[0].embedding)
    out_embedding = np.array(response.data[1].embedding)
    cosine_sim = np.dot(ref_embedding, out_embedding) / (
        np.linalg.norm(ref_embedding) * np.linalg.norm(out_embedding)
    )
    return {"embedding_similarity": float(cosine_sim)}

4e) LLM-as-a-judge: Factual accuracy (GPT-4o-mini)

GPT-4o-mini evaluates factual correctness on a 1-5 scale:
ACCURACY_JUDGE_PROMPT = """You are evaluating the factual accuracy of an answer.

Question: {question}
Reference Answer: {reference}
Model Answer: {output}

Rate the factual accuracy on a scale of 1-5:
1 = Completely incorrect or contradicts the reference
2 = Mostly incorrect with minor correct elements
3 = Partially correct but missing key information
4 = Mostly correct with minor issues
5 = Completely accurate and equivalent to the reference

Consider:
- Are the core facts correct?
- Does it contradict the reference answer?
- Is critical information missing?

Respond with ONLY a single number (1-5)."""

@weave.op()
def accuracy_judge(question: str, reference: str, output: str) -> dict:
    """
    LLM-as-a-judge metric focusing on factual correctness.
    Uses GPT-4o-mini for cost-effective evaluation.
    """
    prompt = ACCURACY_JUDGE_PROMPT.format(
        question=question,
        reference=reference,
        output=output
    )
    res = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0
    )
    try:
        score = float(res.choices[0].message.content.strip())
        # Normalize to 0-1 range, cap at 0.9999 for consistent percentage display
        normalized_score = min((score - 1) / 4, 0.9999)
    except ValueError:
        normalized_score = None
    return {
        "accuracy_score": normalized_score,
        "accuracy_raw": score if normalized_score is not None else None
    }
Important note: the min(..., 0.9999) cap on the normalized score works around a display quirk in W&B's leaderboard: when a score equals exactly 1.0, it renders as "1.00" instead of "100%" like the other percentages. Capping at 0.9999 displays as "99.99%" and keeps the formatting consistent.

4f) LLM-as-a-judge: Helpfulness (GPT-3.5-turbo)

GPT-3.5-turbo rates how useful the answer would be to a user:
HELPFULNESS_JUDGE_PROMPT = """You are evaluating how helpful an answer is to a user.

Question: {question}
Model Answer: {output}

Rate the helpfulness on a scale of 1-5:
1 = Not helpful at all, confusing or wrong
2 = Minimally helpful, lacks important context
3 = Somewhat helpful, adequate but could be clearer
4 = Helpful, clear and addresses the question well
5 = Extremely helpful, clear, complete, and well-explained

Consider:
- Does it directly address the question?
- Is it clear and easy to understand?
- Does it provide enough context/explanation?
- Would this satisfy a user asking this question?

Respond with ONLY a single number (1-5)."""

@weave.op()
def helpfulness_judge(question: str, output: str) -> dict:
    """
    LLM-as-a-judge metric focusing on user helpfulness.
    Uses GPT-3.5-turbo for a different perspective and lower cost.
    """
    prompt = HELPFULNESS_JUDGE_PROMPT.format(
        question=question,
        output=output
    )
    res = openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0
    )
    try:
        score = float(res.choices[0].message.content.strip())
        normalized_score = (score - 1) / 4
    except ValueError:
        normalized_score = None
    return {
        "helpfulness_score": normalized_score,
        "helpfulness_raw": score if normalized_score is not None else None
    }
Using different models for the two judges provides some diversity in evaluation perspective and demonstrates that you can mix models based on cost considerations.

Step 5: Build the evaluation

Now, let's bundle everything into an evaluation object:
from weave import Evaluation

evaluation = Evaluation(
    dataset=dataset,
    scorers=[
        bleu_score,
        rouge_l_score,
        bert_score,
        embedding_similarity,
        accuracy_judge,
        helpfulness_judge
    ]
)

Step 6: Run evaluations and create a leaderboard

Next, we run the same evaluation on both models and create a leaderboard to compare them.
import asyncio
from weave.flow import leaderboard
from weave.trace.ref_util import get_ref

async def run_all():
    print("Running evaluations...")

    # Initialize models
    gpt4o_mini_model = GPT4oMiniModel()
    gpt35_turbo_model = GPT35TurboModel()

    # Run evaluations
    print("\nEvaluating GPT-4o-mini...")
    result1 = await evaluation.evaluate(gpt4o_mini_model)

    print("\nEvaluating GPT-3.5-turbo...")
    result2 = await evaluation.evaluate(gpt35_turbo_model)

    print("\nEvaluations completed!")

    # Create leaderboard AFTER evaluations are done
    print("\nCreating leaderboard...")
    leaderboard_spec = leaderboard.Leaderboard(
        name="LLM Benchmarking Leaderboard",
        description="Comparing GPT-4o-mini vs GPT-3.5-turbo on Q&A tasks across multiple metrics",
        columns=[
            leaderboard.LeaderboardColumn(
                evaluation_object_ref=get_ref(evaluation).uri(),
                scorer_name="bleu_score",
                summary_metric_path="bleu.mean",
            ),
            leaderboard.LeaderboardColumn(
                evaluation_object_ref=get_ref(evaluation).uri(),
                scorer_name="rouge_l_score",
                summary_metric_path="rouge_l.mean",
            ),
            leaderboard.LeaderboardColumn(
                evaluation_object_ref=get_ref(evaluation).uri(),
                scorer_name="bert_score",
                summary_metric_path="bert_score.mean",
            ),
            leaderboard.LeaderboardColumn(
                evaluation_object_ref=get_ref(evaluation).uri(),
                scorer_name="embedding_similarity",
                summary_metric_path="embedding_similarity.mean",
            ),
            leaderboard.LeaderboardColumn(
                evaluation_object_ref=get_ref(evaluation).uri(),
                scorer_name="accuracy_judge",
                summary_metric_path="accuracy_score.mean",
            ),
            leaderboard.LeaderboardColumn(
                evaluation_object_ref=get_ref(evaluation).uri(),
                scorer_name="helpfulness_judge",
                summary_metric_path="helpfulness_score.mean",
            ),
        ]
    )
    published_leaderboard = weave.publish(leaderboard_spec)
    print(f"Leaderboard published: {published_leaderboard}")

    # Retrieve and print results using weave_client
    results = leaderboard.get_leaderboard_results(leaderboard_spec, weave_client)
    print("\n=== Leaderboard Results ===")
    print(results)


asyncio.run(run_all())


One important detail: the leaderboard must be created after the evaluations run, since it references evaluation objects that need to exist first.

Step 7: Explore results in the W&B interface

After running the evaluations, open the Weave URL printed in your console. Let's walk through the three main views you'll use to understand your results.

Leaders view: Head-to-head model comparison

Navigate to the Leaders tab in the left sidebar. This is your primary view for comparing models at a glance.
Screenshot of the Leaders section of W&B web interface. Screenshot by Author.
What you'll see is a table with models as rows and metrics as columns:
| Model | BLEU | ROUGE-L | BERTScore | Embedding Sim | Accuracy | Helpfulness |
|---|---|---|---|---|---|---|
| GPT35TurboModel | 9.75% | 33.16% | 82.67% | 66.09% | 91.66% | 97.92% |
| GPT4oMiniModel | 4.61% | 19.97% | 82.83% | 61.43% | 99.99% | 95.83% |

As the table shows, the results tell an interesting story. GPT-4o-mini posts a near-perfect factual accuracy score (99.99%) but actually underperforms GPT-3.5-turbo on statistical metrics like BLEU and ROUGE-L. This is the BLEU paradox in action: GPT-4o-mini gives correct, well-reasoned answers that don't match the reference wording exactly.
Look at the semantic metrics (BERTScore around 82-83% for both). They're nearly identical, which makes sense, as both models understand the questions and produce semantically correct answers. The difference shows up in how they phrase things.
GPT-3.5-turbo wins on helpfulness (97.92% vs 95.83%). This suggests it tends toward more concise answers that the helpfulness judge prefers. Meanwhile, GPT-4o-mini's higher accuracy score (99.99%) indicates it captures the factual content more precisely, even if it uses more words to do it.

Trace view: Diving into individual runs

Navigate to the Traces tab to see the raw data behind these numbers. You'll see a list of all your evaluation runs with their aggregate metrics. Click on any trace to expand it. The detail panel shows you:
  • Runtime and tokens: How long the evaluation took, how many tokens it consumed, and how much it cost
  • Definition: Which dataset and model were used
  • Scores: The average for each metric across all 12 questions
Screenshot of the Traces section of W&B web interface. Screenshot by Author.
For instance, clicking on a GPT-3.5-turbo trace might show:
  • bleu: 0.098 (avg)
  • rouge_l: 0.332 (avg)
  • bert_score: 0.827 (avg)
  • embedding_similarity: 0.661 (avg)
  • accuracy_score: 0.917 (avg)
  • helpfulness_score: 0.979 (avg)
Screenshot of the Traces section of W&B web interface. Checking the results of an execution of the code. Screenshot by Author.
The raw scores (accuracy_raw, helpfulness_raw) show the 1-5 scale before normalization. An accuracy_raw of 4.667 means the judge gave mostly 5s with a few 4s; normalizing it with (4.667 - 1) / 4 gives roughly the 0.917 accuracy_score shown above. That's useful context when interpreting the normalized percentage.

Evals view: Comparing evaluation runs

The Evals tab gives you a different angle: it's organized by evaluation run rather than by trace.
Screenshot of the Evals section of W&B web interface. Screenshot by Author.
Here you can see all your runs side by side with their input/output pairs and per-metric scores. This view is especially useful for:
  • Spotting patterns across runs (do certain question types consistently score lower?)
  • Identifying outliers (which specific questions cause the biggest differences?)
  • Comparing model behavior on the same inputs
Click any row to expand the evaluation details. You'll see the model that was evaluated, the dataset version, and a breakdown of all scores. The accuracy_score column shows individual scores per question, so you can spot where GPT-4o-mini got perfect 5s (normalized to 0.9999) while GPT-3.5-turbo got 4s or the occasional 3.
One thing to watch for: some runs might show "N/A" for BERTScore if there were failures during computation. This can happen with PyTorch version conflicts. The evaluation continues anyway; that's why we wrapped BERTScore in a try/except block.

What the data actually tells you

Looking at our results, here's the practical takeaway:
  • For simple Q&A tasks, GPT-3.5-turbo delivers comparable quality at lower cost. Both models nail factual accuracy (91-99%), both produce semantically correct answers (82-83% BERTScore), and both satisfy the helpfulness judge (95-97%). The statistical metrics (BLEU 4-9%, ROUGE-L 19-33%) are low for both models, but that's expected: these models paraphrase and elaborate rather than matching reference text verbatim. That's not a bug; it's how modern LLMs work.
  • When to choose GPT-4o-mini: Tasks requiring precise factual recall, complex reasoning, or longer context windows.
  • When to choose GPT-3.5-turbo: Simple Q&A, high-volume applications where cost and latency matter, and situations where concise answers are preferred.
W&B makes this decision data-driven instead of guesswork. You're not choosing based on model hype, but on actual measured performance for your specific task.

Conclusion: Advancing LLM evaluation practices

BLEU and ROUGE served us well in the early days of NLP. They gave researchers a quick, reproducible way to compare machine translation and summarization systems. But LLMs broke the assumptions those metrics were built on. Modern models paraphrase, reason, synthesize, and generate in ways that make word-counting metrics almost useless.
The path forward combines three things:
  • Semantic metrics that recognize meaning over form (BERTScore, BLEURT, embedding similarity)
  • Comprehensive benchmarks that measure models across multiple dimensions (MMLU, HELM, BIG-bench)
  • LLM-as-a-judge setups that use strong models to grade outputs against rubrics aligned with what users actually care about
None of these are perfect. Semantic metrics can be slow and expensive. Benchmarks can become saturated or gamed. LLM judges can be biased or inconsistent. But they're all closer to measuring what matters than counting n-gram overlaps.
As you build LLM systems, treat evaluation as a living practice. Start with a small golden set, mix statistical and semantic metrics, run benchmarks to catch blind spots, and calibrate everything against human judgments. Use tools like Weights & Biases to compare models systematically, not anecdotally.
The goal isn't a single perfect score; it's a body of evidence you trust when you ship changes. That's how you move from "the BLEU score looks good" to "this model actually works for our users."