
AI guardrails: Relevance scorers

This article explores relevance scoring in AI, detailing tools, datasets, and methods for evaluating and refining how well model outputs align with input prompts and context.
AI models often struggle to produce responses that are accurate, contextually relevant, and aligned with user intent. Relevance scoring serves as a guardrail, ensuring that AI-generated responses remain on track, adhere to the input prompt, and maintain coherence with contextual information.
The Weave Relevance Scorer provides a structured way to evaluate, compare, and refine model outputs based on their semantic alignment, logical structure, and contextual appropriateness. By leveraging Weave, developers can systematically assess AI performance and identify areas for improvement in real-world applications.
This article explores how the Weave Relevance Scorer works, how it integrates with Weave’s evaluation framework, and how it compares to other models. We’ll also introduce relevance-focused datasets and walk through a hands-on implementation.
For those wanting to get their hands dirty right away, we've created an interactive Colab.

For those who want to understand the complexities and details, below we'll provide a deeper understanding of relevance in AI, along with the tools and code you’ll need to integrate these strategies into your projects.


What is relevance?

Relevance measures how well an AI-generated response directly answers a given prompt while maintaining contextual accuracy and coherence.
Relevance is evaluated across three key dimensions:
  • Semantic Alignment: Does the response directly address the prompt’s core intent?
  • Structural Coherence: Is the response logically organized and internally consistent?
  • Contextual Integration: Does the response appropriately incorporate the provided context?
To act as a guardrail for AI-generated outputs, Weave applies a Likert scale to measure relevance:
  • 5 (Perfectly Relevant): The response fully addresses all aspects of the prompt, integrates context seamlessly, and provides complete, precise, and concise information without unnecessary details.
  • 4 (Mostly Relevant): The response aligns well with the prompt and context, with minor gaps or slight digressions that do not significantly affect its utility.
  • 3 (Somewhat Relevant): The response addresses some aspects of the prompt but has noticeable gaps, occasional digressions, or unnecessary content that reduce its relevance.
  • 2 (Mostly Irrelevant): The response minimally addresses the prompt, with significant omissions, frequent digressions, or substantial irrelevant content.
  • 1 (Completely Irrelevant): The response entirely fails to address the prompt, is off-topic, or lacks any meaningful information.
By scoring relevance in this way, Weave establishes a guardrail that prevents AI models from producing responses that are misaligned, unclear, or contextually inappropriate.
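To make the rubric concrete, here is a minimal sketch (illustrative only, not part of Weave's API) of how the 1-5 scale and a relevance cutoff, such as the score ≥ 3 threshold used later in this article, might be encoded:

RELEVANCE_LABELS = {
    5: "Perfectly Relevant",
    4: "Mostly Relevant",
    3: "Somewhat Relevant",
    2: "Mostly Irrelevant",
    1: "Completely Irrelevant",
}

def is_relevant(score: int, threshold: int = 3) -> bool:
    """Treat any Likert score at or above the threshold as relevant."""
    return score >= threshold

print(RELEVANCE_LABELS[4], is_relevant(4))  # Mostly Relevant True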

Why relevance matters

Relevance is key to ensuring that AI-generated outputs address user queries in a meaningful and contextually appropriate way. It shapes the effectiveness of tasks like question answering, content summarization, and decision support by aligning responses with the input prompt and context.
Irrelevant or poorly aligned outputs can lead to confusion and/or dissatisfaction, undermining trust in the system. By applying guardrails like relevance scoring, it becomes possible to systematically improve model performance, ensuring outputs are clear and accurate.
Additionally, relevance scoring allows for the identification of specific areas where models excel or falter, providing actionable insights for targeted refinements. This process not only enhances the quality of AI outputs but also fosters reliability and consistency across diverse applications.

Academic work on relevance

To the best of our knowledge, only one academic work explicitly defines relevance in the way we do here: SummEval. This work collected and released human judgments, both expert and crowd-sourced, on the outputs of 16 models across 100 articles, rated along four dimensions including relevance, to advance research on evaluation metrics that correlate with human judgment.
However, SummEval is a benchmark dataset containing only 100 annotated examples of model-generated text-summary pairs. While valuable, its limited size constrains its utility for training robust models across diverse contexts. To address this limitation, we constructed a synthetic dataset spanning a broader range of examples. This dataset leverages automated techniques to annotate responses with relevance scores, guided by well-defined dimensions such as semantic alignment, structural coherence, and contextual integration.

Datasets for relevance

Datasets like HelpSteer2, UltraFeedback, and Natural Questions provide valuable training examples for relevance scorers, as they include attributes that align closely with the dimensions described above, enabling robust relevance evaluation.
In the case of the UltraFeedback dataset, relevance scoring uses key attributes such as helpfulness, instruction following, truthfulness, and honesty. These factors are averaged to calculate a relevance score, reflecting how well a response adheres to instructions, provides factual accuracy, and remains aligned with the query while maintaining coherent reasoning.
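A minimal sketch of that averaging, assuming the four attribute ratings have already been extracted as numbers (the field names below are illustrative rather than the dataset's exact schema):

def ultrafeedback_relevance(helpfulness: float, instruction_following: float, truthfulness: float, honesty: float) -> float:
    """Average the four UltraFeedback-style attribute ratings into a single relevance score."""
    return (helpfulness + instruction_following + truthfulness + honesty) / 4

print(ultrafeedback_relevance(4, 5, 4, 5))  # 4.5, on the same 1-5 scale as the inputs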
For the Natural Questions dataset, relevance scores are assigned on a scale from 1 to 5 based on specific response qualities. Short answers that directly address the query are labeled with a relevance score of 5 for their precision and alignment. Long answers, while detailed and informative, are more verbose and receive a relevance score of 3. Distilled versions of long answers, which extract concise sentences containing the short answer, are assigned a relevance score of 4. Candidate chunks overlapping significantly with the long answer but lacking the short answer are labeled with a relevance score of 2, while unrelated chunks are given a relevance score of 1. Cosine similarity helps rank candidate chunks by their alignment with the query, and contextual sampling ensures a diverse and representative dataset for training.
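The sketch below illustrates this labeling scheme and the similarity-based ranking. It is a simplified approximation (using TF-IDF cosine similarity and a crude word-overlap check) rather than the exact pipeline used to build the training data:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def label_chunk(chunk: str, short_answer: str, long_answer: str) -> int:
    """Assign a 1-5 relevance label to a candidate chunk following the rules above."""
    if chunk.strip() == short_answer.strip():
        return 5  # short answer: precise and directly aligned with the query
    if short_answer in chunk and len(chunk) < len(long_answer):
        return 4  # distilled long answer that still contains the short answer
    if chunk.strip() == long_answer.strip():
        return 3  # full long answer: informative but verbose
    overlap = len(set(chunk.split()) & set(long_answer.split())) / max(len(set(long_answer.split())), 1)
    return 2 if overlap > 0.5 else 1  # overlapping-but-incomplete vs. unrelated chunk

def rank_chunks(query: str, chunks: list[str]) -> list[tuple[str, float]]:
    """Rank candidate chunks by cosine similarity to the query."""
    vectors = TfidfVectorizer().fit_transform([query] + chunks)
    sims = cosine_similarity(vectors[0:1], vectors[1:]).flatten()
    return sorted(zip(chunks, sims), key=lambda pair: -pair[1])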
The HelpSteer2 dataset, originally developed for training reward models, can be effectively repurposed to compute relevance scores by leveraging its detailed attributes: helpfulness, correctness, coherence, and verbosity. Although relevance is not directly annotated in the dataset, these attributes align well with the three core dimensions of relevance - semantic alignment, structural coherence, and contextual integration.
Semantic alignment is captured through helpfulness and correctness, assessing whether the response directly addresses the user's query and provides accurate information. Structural coherence is reflected in the coherence attribute, which evaluates the logical flow and clarity of the response, and verbosity, which penalizes overly lengthy outputs to ensure conciseness. Contextual integration is supported by helpfulness and correctness, ensuring the response remains consistent with the given context and background information.
To compute a relevance score for the HelpSteer2 dataset, a weighted formula is applied to these attributes: helpfulness is assigned a high weight (e.g., 0.4) to emphasize its critical role in aligning with user queries and contexts, correctness is weighted at 0.3 to ensure factual accuracy, coherence contributes 0.2 to reward clarity and flow, and verbosity is given a negative weight (e.g., -0.1) to discourage unnecessary detail. The resulting score provides a comprehensive metric that captures how well a response meets the standards of relevance. This methodology turns HelpSteer2 into a versatile resource for training relevance models, enabling developers to align AI outputs more closely with user expectations.
HelpSteer2 Relevance = (0.4 × Helpfulness) + (0.3 × Correctness) + (0.2 × Coherence) − (0.1 × Verbosity)
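In code, the formula reduces to a one-liner; the full dataset-processing version appears in the evaluation script later in this article:

def helpsteer2_relevance(helpfulness: float, correctness: float, coherence: float, verbosity: float) -> float:
    """Weighted combination of HelpSteer2 attributes into a single relevance score."""
    return 0.4 * helpfulness + 0.3 * correctness + 0.2 * coherence - 0.1 * verbosity

print(helpsteer2_relevance(4, 4, 4, 2))  # 3.4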
The Weave Relevance Scorer was trained on a combination of these datasets to ensure a comprehensive understanding of relevance across diverse use cases. This training process allowed the model to learn from high-quality examples and capture nuanced aspects of relevance. To support the AI community, the trained model has been made publicly available on Hugging Face, providing a reliable tool for evaluating and refining AI systems.

How to use the Weave Relevance Scorer

The following script demonstrates how to evaluate a response’s relevance using Weave’s relevance scorer:
import asyncio

import weave
from weave.scorers import RelevanceScorer

weave.init("relevance-scorer")


# Initialize the relevance scorer
relevance_scorer = RelevanceScorer()

# Define input, context, and output
input_text = "Some input text to consider"
context_text = "Some context to consider" # Optional context
output_text = "Some output text to consider"

# Async function to compute relevance score
async def get_relevance_score():
    score = await relevance_scorer.score(input=input_text, context=context_text, output=output_text)
    print(score)

# Run the async function
asyncio.run(get_relevance_score())
The relevance scores generated by the relevance_scorer.score() function include a breakdown of how well a response aligns with its input and context, measured across dimensions such as semantic alignment, structural coherence, and contextual integration. These scores provide valuable insights into the quality of the model's outputs, enabling nuanced analysis and identification of areas where relevance criteria are met or where adjustments may be needed.
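Based on the fields this article reads later (score and relevant), a minimal sketch of inspecting the returned dictionary might look like the following; the exact payload may contain additional fields:

import asyncio
import weave
from weave.scorers import RelevanceScorer

weave.init("relevance-scorer")

async def inspect_relevance():
    scorer = RelevanceScorer()
    result = await scorer.score(
        input="What is the capital of France?",
        output="The capital of France is Paris.",
    )
    # The evaluation code later in this article reads these two fields
    print(f"Likert score: {result['score']}")   # 1-5 relevance rating
    print(f"Relevant: {result['relevant']}")    # True when the rating clears the relevance threshold

asyncio.run(inspect_relevance())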
After running the script, the results are automatically logged into Weave’s dashboard, ensuring continuous monitoring and refinement of model performance - a critical guardrail for AI-generated content.

Weave’s dashboard allows teams to explore and analyze the results in detail. By visualizing relevance scores and identifying patterns, developers can monitor trends, compare model outputs, and refine their systems to enhance performance.

Measuring relevance with Weave Evals

In this tutorial, we will demonstrate how to:
  • Load and preprocess datasets for relevance scoring
  • Apply Weave’s Relevance Scorer to assess model responses
  • Compare results across multiple models, including OpenAI’s GPT-4o
By benchmarking Weave’s Relevance Scorer against OpenAI’s GPT-4o and GPT-4o-Mini, we establish a strong evaluation framework - a key guardrail for ensuring AI-generated outputs remain relevant and accurate.
import asyncio
import json
import os
from typing import Any, Literal, Optional

import pandas as pd
import weave
from datasets import load_dataset
from litellm import acompletion
from pydantic import BaseModel, Field
from weave.scorers import RelevanceScorer

weave.init("relevance_eval")



def load_relevance_dataset():
    # Load the HelpSteer2 dataset
    dataset = load_dataset("nvidia/HelpSteer2", split="validation")
    # Convert to a Pandas DataFrame for easier manipulation
    df = pd.DataFrame(dataset)

    # Define the weights for relevance computation
    weights = {
        "helpfulness": 0.4,
        "correctness": 0.3,
        "coherence": 0.2,
        "verbosity": -0.1,  # Negative weight for verbosity
    }

    # Calculate relevance score using the weighted formula
    df["relevance"] = (
        weights["helpfulness"] * df["helpfulness"] +
        weights["correctness"] * df["correctness"] +
        weights["coherence"] * df["coherence"] +
        weights["verbosity"] * df["verbosity"]
    )

    # Determine binary labels (True for relevance >= 3.5, False otherwise)
    df["relevance_label"] = df["relevance"] >= 3.5

    # Separate into True and False groups
    true_group = df[df["relevance_label"] == True]
    false_group = df[df["relevance_label"] == False]

    # Determine the maximum number of samples per group, adjusted to the available data
    max_samples = min(len(true_group), len(false_group), 75)

    # Sample from each group
    balanced_true = true_group.sample(n=max_samples, random_state=42)
    balanced_false = false_group.sample(n=max_samples, random_state=42)

    # Combine the balanced samples and shuffle the resulting DataFrame
    balanced_df = pd.concat([balanced_true, balanced_false]).reset_index(drop=True)
    balanced_df = balanced_df.sample(frac=1, random_state=42).reset_index(drop=True)

    # Prepare the dataset
    dataset_prepared = [
        {"output": row["response"], "label": row["relevance_label"], "input": row["prompt"]}
        for _, row in balanced_df.iterrows()
    ]
    return dataset_prepared


RELEVANCE_SYSTEM_PROMPT = """You are an expert evaluator assessing the relevance of LLM-generated outputs relative to their input context.
Your goal is to provide a single relevance score and classification based on comprehensive analysis.
Relevance measures how effectively a generated output addresses its input context across three core dimensions:

1. **Semantic Alignment**
- How directly does the output address key input requirements?
- Does it maintain topical focus?
- Does it provide complete coverage of necessary information?
- Is unnecessary content avoided?

2. **Structural Coherence**
- Does the output flow logically and show internal consistency?
- Is the presentation of information clear and organized?
- Is there a good balance between completeness and conciseness?

3. **Contextual Integration**
- How well does the output use the provided context?
- Does the output align with the broader discourse?
- Is it consistent with background information?
- Does it fulfill task-specific requirements?

## Evaluation Process

1. Review all input context (instructions, prompts, documents, chat history)
2. Identify core requirements and purpose
3. Analyze the LLM output across all three dimensions
4. Assign a single relevance score (1-5):
- 5: Exceptional relevance across all dimensions
- 4: Strong relevance with minor gaps
- 3: Adequate relevance with some issues
- 2: Significant relevance issues
- 1: Major relevance problems
5. Classify as relevant (score ≥ 3.0) or not relevant (score < 3.0)

## Task-Specific Considerations

- **Summarization**: Focus on key information selection and density
- **Q&A**: Emphasize answer accuracy and completeness
- **Chat**: Consider conversation flow and context maintenance
- **RAG**: Evaluate retrieved information integration

## Output Format

Provide evaluation results in the following JSON format:

```json
{
"relevance": [score from 1-5],
"relevant": [true/false]
}
```
""".strip()

class RelevanceScore(BaseModel):
    """The level of relevance of a <completion> for a given <context>."""
    chain_of_thought: str = Field(..., description="The chain of thought that led to the prediction")
    score: int = Field(..., description="Score the relevance of the <completion> on a Likert scale of 1 to 5")
    relevance: Literal["Perfectly Relevant", "Mostly Relevant", "A Little Irrelevant", "Mostly Irrelevant", "Completely Irrelevant"] = Field(..., description="The level of relevance of the <completion>")
    relevant: bool = Field(..., description="Whether the <completion> is relevant or not; anything above 3 is relevant")




class OpenAIRelevanceScorer(weave.Model):
    system_prompt: str = RELEVANCE_SYSTEM_PROMPT
    model_id: str
    temperature: float = 0.7
    max_tokens: int = 4096

    def __init__(self, model_id: str, temperature: float = 0.7, max_tokens: int = 4096):
        """Initialize the OpenAIRelevanceScorer with a specific model ID."""
        super().__init__(model_id=model_id, temperature=temperature, max_tokens=max_tokens)
        self.model_id = model_id
        self.temperature = temperature
        self.max_tokens = max_tokens

    def _format_messages(
        self,
        prompt: str,
        completion: str,
        context: Optional[list[str]],
        chat_history: Optional[list[dict[str, str]]],
    ) -> list[dict[str, str]]:
        """Format the prompt for the model."""
        chat_history = chat_history if isinstance(chat_history, list) else []
        context = context if isinstance(context, list) else []

        # Wrap any retrieved documents in <documents> tags and prepend them to the prompt
        if context:
            context = "\n".join(context).strip()
            context = f"<documents>\n{context}\n</documents>"
        else:
            context = ""
        prompt = f"{context}\n\n{prompt}".strip()

        # Serialize the chat history plus the new user turn into a single <context> block
        messages = chat_history + [{"role": "user", "content": prompt}]
        messages = [f"<|msg_start|>{message['role']}\n{message['content']}<|msg_end|>" for message in messages]
        messages = "\n".join(messages)

        context = f"<context>{messages}</context>\n"
        completion = f"<completion>{completion}</completion>\n"
        context_and_completion = context + completion
        return [
            {"role": "system", "content": self.system_prompt},
            {"role": "user", "content": context_and_completion},
        ]

    async def run_inference(self, messages):
        api_key = os.getenv("OPENAI_API_KEY")  # Ensure your API key is set in this environment variable
        response = await acompletion(
            model=self.model_id,
            api_key=api_key,
            messages=messages,
            temperature=self.temperature,
            max_tokens=self.max_tokens,
        )
        return response["choices"][0]["message"]["content"]

    @weave.op
    async def predict(
        self,
        input: str,
        output: str,
        context: Optional[list[str]] = None,
        chat_history: Optional[list[dict[str, str]]] = None,
    ) -> bool:
        try:
            messages = self._format_messages(prompt=input, completion=output, context=context, chat_history=chat_history)
            response = await self.run_inference(messages)
            # Strip a Markdown code fence if the model wrapped its JSON output in one
            if response.startswith('```') and response.endswith('```'):
                response = response.split('\n', 1)[1].rsplit('\n', 1)[0]
            parsed_response = json.loads(response)  # Convert the string to a Python dictionary
            return bool(parsed_response["relevant"])
        except Exception as e:
            print(str(e))
            return False



# Define Weave Relevance Scorer Model
class WeaveRelevanceModel(weave.Model):
    @weave.op
    async def predict(self, input: str, output: str) -> bool:
        result = await scorer.score(input=input, output=output)
        # return result['relevant']
        return result['score'] >= 3  # multiple ways to determine relevance prediction here


class PrecisionRecallF1Scorer(weave.Scorer):
    """
    Custom scorer to calculate precision, recall, F1, and accuracy at the dataset level.
    """

    @weave.op
    def score(self, label: int, model_output: int) -> dict:
        """
        Compute True Positives, False Positives, False Negatives, and True Negatives for a single row.
        """
        tp = int(label == 1 and model_output == 1)  # True Positive
        fp = int(label == 0 and model_output == 1)  # False Positive
        fn = int(label == 1 and model_output == 0)  # False Negative
        tn = int(label == 0 and model_output == 0)  # True Negative

        return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}

    def summarize(self, score_rows: list) -> dict:
        """
        Summarize precision, recall, F1, and accuracy from the row-level scores.
        """
        # Aggregate true positives, false positives, false negatives, and true negatives
        total_tp = sum(row["tp"] for row in score_rows)
        total_fp = sum(row["fp"] for row in score_rows)
        total_fn = sum(row["fn"] for row in score_rows)
        total_tn = sum(row["tn"] for row in score_rows)

        # Calculate precision, recall, F1, and accuracy
        precision = total_tp / (total_tp + total_fp) if (total_tp + total_fp) > 0 else 0
        recall = total_tp / (total_tp + total_fn) if (total_tp + total_fn) > 0 else 0
        f1 = (
            2 * (precision * recall) / (precision + recall)
            if (precision + recall) > 0
            else 0
        )
        total = total_tp + total_fp + total_fn + total_tn
        accuracy = (total_tp + total_tn) / total if total > 0 else 0

        return {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "accuracy": accuracy,
        }



# Instantiate OpenAI Relevance Scorers
openai_scorer_4o = OpenAIRelevanceScorer(model_id="gpt-4o-2024-08-06", temperature=0.0, max_tokens=4096)
openai_scorer_4omini = OpenAIRelevanceScorer(model_id="gpt-4o-mini", temperature=0.0, max_tokens=2048)
scorer = RelevanceScorer()



# Define OpenAIScorer4o Model
class OpenAIScorer4o(weave.Model):
    """
    Weave model that wraps the OpenAIRelevanceScorer for gpt-4o.
    """
    @weave.op
    async def predict(self, input: str, output: str, context: Optional[list[str]] = None, chat_history: Optional[list[dict[str, str]]] = None) -> bool:
        """
        Use the instantiated gpt-4o scorer to predict relevance.
        """
        return await openai_scorer_4o.predict(input=input, output=output, context=context, chat_history=chat_history)


# Define OpenAIScorer4omini Model
class OpenAIScorer4omini(weave.Model):
    """
    Weave model that wraps the OpenAIRelevanceScorer for gpt-4o-mini.
    """
    @weave.op
    async def predict(self, input: str, output: str, context: Optional[list[str]] = None, chat_history: Optional[list[dict[str, str]]] = None) -> bool:
        """
        Use the instantiated gpt-4o-mini scorer to predict relevance.
        """
        return await openai_scorer_4omini.predict(input=input, output=output, context=context, chat_history=chat_history)

# Run the evaluations
async def run_evaluations():
    """Run evaluations for the scorers."""
    # Load dataset
    dataset = load_relevance_dataset()
    print("Dataset loaded...")

    # Initialize models
    models = {
        "WeaveRelevanceModel": WeaveRelevanceModel(),
        "OpenAIRelevanceScorer4oMini": OpenAIScorer4omini(),
        "OpenAIRelevanceScorer4o": OpenAIScorer4o(),
    }

    # Define evaluation scorers
    scorers = [PrecisionRecallF1Scorer()]

    # Run evaluations
    results = {}
    for model_name, model in models.items():
        print(f"\nEvaluating {model_name}...")
        evaluation = weave.Evaluation(
            dataset=dataset,
            scorers=scorers,
            name=model_name + " Eval",
        )
        results[model_name] = await evaluation.evaluate(model)

    # Print results
    for model_name, result in results.items():
        print(f"\nResults for {model_name}:")
        print(result)


if __name__ == "__main__":
    asyncio.run(run_evaluations())

This script demonstrates how to evaluate the relevance of AI-generated outputs using the Weave platform. The process begins by loading the HelpSteer2 dataset, a resource that provides detailed annotations to guide reward modeling for AI systems. While HelpSteer2 does not explicitly include a relevance annotation, relevance scores are computed by combining attributes such as helpfulness, correctness, coherence, and verbosity to provide a structured evaluation of response quality.

Benchmarking different relevance scoring models

This evaluation compares the Weave Relevance Scorer against OpenAI’s GPT-4o and GPT-4o-mini to benchmark their performance in assessing relevance. By analyzing the relevance scores and model outputs across these tools, we highlight key strengths, trade-offs, and areas for improvement, offering actionable insights into different scoring methodologies.
To ensure a balanced evaluation, the HelpSteer2 dataset is processed to:
  • Assign relevance scores using a weighted formula, where helpfulness and correctness are prioritized, coherence ensures clarity, and verbosity is penalized.
  • Label responses as relevant or not based on whether the computed score meets or exceeds 3.5.
  • Balance the dataset by selecting an equal number of examples from both relevant and not relevant groups, capping at 75 examples per category to avoid bias.
Once processed, this dataset is used to evaluate multiple relevance scoring models, including Weave’s scorer and OpenAI-based models such as GPT-4o and GPT-4o-mini.

Interpreting the results

After running the evaluation, results are processed by the custom PrecisionRecallF1Scorer (built on Weave's Scorer class), which aggregates row-level counts of true positives, false positives, false negatives, and true negatives into overall precision, recall, F1 score, and accuracy. These results are logged in the Weave dashboard, where developers can visualize model performance, identify patterns, and refine scoring methods to improve alignment with user expectations.

Evaluating the results & key insights


For this evaluation, the Weave Relevance Model demonstrated the highest precision at 0.556, outperforming the OpenAI 4o Relevance Scorer and OpenAI 4o Mini Relevance Scorer. This indicates that the Weave model is highly effective in minimizing false positives, making it a strong choice for high-accuracy applications where misclassified responses must be avoided.
However, OpenAI’s GPT-4o Mini Relevance Scorer achieved the highest recall (0.778), outperforming both the Weave Relevance Model (0.556) and OpenAI’s GPT-4o Relevance Scorer (0.593). This suggests that the Mini model was more effective in capturing relevant outputs, even if it sacrificed some precision. This result challenges the assumption that larger models always perform better, demonstrating that smaller models can sometimes achieve superior recall.
Interestingly, the GPT-4o Mini Relevance Scorer also achieved the highest F1 score (0.592), indicating it effectively balances precision and recall. In contrast:
  • Weave Relevance Model followed with an F1 score of 0.556
  • OpenAI GPT-4o Relevance Scorer lagged behind at 0.508
In terms of overall accuracy, the Weave Relevance Model performed the best, scoring 0.556, reflecting greater consistency in predictions. Comparatively:
  • GPT-4o-mini scored 0.463
  • GPT-4o Relevance Scorer trailed at 0.426

Key takeaways & practical implications

  • Weave Relevance Scorer excels in precision and accuracy, making it particularly suitable for scenarios requiring high reliability and fewer false positives.
  • GPT-4o Mini balances recall and F1 score effectively, making it advantageous when capturing as many relevant outputs as possible is a priority, even at the cost of some misclassifications.
  • Smaller models like GPT-4o Mini can outperform larger models in specific evaluation areas, highlighting the importance of task-specific benchmarking rather than assuming larger models always perform better.
Overall, the Weave Relevance Scorer offers an open-source, low-latency, and compute-efficient solution for relevance evaluation. Its accessibility makes it a practical choice for a wide range of applications, while the comparative analysis with OpenAI models provides valuable insights into optimizing relevance scoring strategies.

Weave comparisons view

In addition to the evaluation metrics, Weave’s comparisons view offers a detailed analysis of individual outputs generated by each model on specific examples from the dataset. This feature allows side-by-side examination of each model’s responses alongside the corresponding input text. By doing so, it highlights qualitative differences in how models approach the task, such as their ability to maintain semantic alignment, clarity, or inclusion of relevant context.

Exploring these comparisons can reveal patterns in model performance, uncovering strengths and weaknesses that may not be evident from aggregate metrics alone. This granular insight is particularly useful for diagnosing issues, understanding why certain models perform better on specific cases, and identifying areas for targeted improvement. With this functionality, Weave provides a comprehensive framework for evaluating, comparing, and refining models based on both quantitative and qualitative insights.

Conclusion

Relevance scoring is an essential tool in shaping how AI systems generate meaningful and contextually appropriate outputs. Throughout this article, we have seen how tools like the Weave Relevance Scorer and datasets such as HelpSteer2, UltraFeedback, and Natural Questions enable a nuanced evaluation of model responses. These tools help to ensure responses not only address the prompt but do so in a way that is accurate and clear.
The evaluation of models demonstrated the strengths and trade-offs inherent in different approaches. The Weave Relevance Scorer stood out for its precision and accuracy, while the OpenAI GPT-4o-mini Relevance Scorer excelled in recall and balance. Such comparisons reveal the importance of selecting the right tools based on specific needs. Moreover, the flexibility of relevance thresholds, such as the ≥ 3 threshold used here, highlights how scoring criteria can be adapted to match the goals of a given evaluation.
Ultimately, relevance scoring is not just about assessing performance but about improving how systems interact with users. It enables iterative refinements that align outputs more closely with expectations and context. As AI systems continue to play a central role in a variety of applications, relevance scoring will remain a foundational process in ensuring these systems are both effective and meaningful in their responses. I hope you enjoyed this article!
