AI guardrails: Relevance scorers
This article explores relevance scoring in AI, detailing tools, datasets, and methods for evaluating and refining how well model outputs align with input prompts and context.
AI models often struggle to produce responses that are accurate, contextually relevant, and aligned with user intent. Relevance scoring serves as a guardrail, ensuring that AI-generated responses remain on track, adhere to the input prompt, and maintain coherence with contextual information.
The Weave Relevance Scorer provides a structured way to evaluate, compare, and refine model outputs based on their semantic alignment, logical structure, and contextual appropriateness. By leveraging Weave, developers can systematically assess AI performance and identify areas for improvement in real-world applications.
This article explores how the Weave Relevance Scorer works, how it integrates with Weave’s evaluation framework, and how it compares to other models. We’ll also introduce relevance-focused datasets and walk through a hands-on implementation.
For those wanting to get their hands dirty right away, we've created an interactive Colab.
For those who want to understand the complexities and details, below we'll provide a deeper understanding of relevance in AI, along with the tools and code you’ll need to integrate these strategies into your projects.

Table of contents
What is relevance?
Why relevance matters
Academic work on relevance
Datasets for relevance
How to use the Weave Relevance Scorer
Measuring relevance with Weave Evals
Benchmarking different relevance scoring models
Interpreting the results
Weave comparisons view
Conclusion
What is relevance?
Relevance measures how well an AI-generated response directly answers a given prompt while maintaining contextual accuracy and coherence.
Relevance is evaluated across three key dimensions:
- Semantic Alignment: Does the response directly address the prompt’s core intent?
- Structural Coherence: Is the response logically organized and internally consistent?
- Contextual Integration: Does the response appropriately incorporate the provided context?
To act as a guardrail for AI-generated outputs, Weave applies a Likert Scale to measure relevance:
- 5 (Perfectly Relevant): The response fully addresses all aspects of the prompt, integrates context seamlessly, and provides complete, precise, and concise information without unnecessary details.
- 4 (Mostly Relevant): The response aligns well with the prompt and context, with minor gaps or slight digressions that do not significantly affect its utility.
- 3 (Somewhat Relevant): The response addresses some aspects of the prompt but has noticeable gaps, occasional digressions, or unnecessary content that reduce its relevance.
- 2 (Mostly Irrelevant): The response minimally addresses the prompt, with significant omissions, frequent digressions, or substantial irrelevant content.
- 1 (Completely Irrelevant): The response entirely fails to address the prompt, is off-topic, or lacks any meaningful information.
By scoring relevance in this way, Weave establishes a guardrail that prevents AI models from producing responses that are misaligned, unclear, or contextually inappropriate.
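To make this concrete, here is a minimal sketch (the dictionary and function names are ours, purely for illustration) that maps a 1-5 score from the scale above to its Likert label and a binary relevant/not-relevant decision, using the score ≥ 3 cutoff applied later in this article:

# A minimal, illustrative mapping (names are hypothetical, not part of Weave)
LIKERT_LABELS = {
    5: "Perfectly Relevant",
    4: "Mostly Relevant",
    3: "Somewhat Relevant",
    2: "Mostly Irrelevant",
    1: "Completely Irrelevant",
}

def interpret_relevance(score: int) -> dict:
    """Map a 1-5 relevance score to its Likert label and a binary guardrail decision."""
    return {"score": score, "label": LIKERT_LABELS[score], "relevant": score >= 3}

print(interpret_relevance(4))  # {'score': 4, 'label': 'Mostly Relevant', 'relevant': True}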
Why relevance matters
Relevance is key to ensuring that AI-generated outputs address user queries in a meaningful and contextually appropriate way. It shapes the effectiveness of tasks like question answering, content summarization, and decision support by aligning responses with the input prompt and context.
Irrelevant or poorly aligned outputs can lead to confusion and/or dissatisfaction, undermining trust in the system. By applying guardrails like relevance scoring, it becomes possible to systematically improve model performance, ensuring outputs are clear and accurate.
Additionally, relevance scoring allows for the identification of specific areas where models excel or falter, providing actionable insights for targeted refinements. This process not only enhances the quality of AI outputs but also fosters reliability and consistency across diverse applications.
Academic work on relevance
To the best of our knowledge, only one academic work explicitly defines relevance in the way we do here: SummEval. This work collected and released human judgments, both expert and crowd-sourced, on 16 model outputs across 100 articles over four dimensions, including relevance, to advance research in human-correlated evaluation metrics.
However, SummEval is a benchmark dataset covering only 100 source articles, each annotated for multiple LLM-generated summaries. While valuable, its limited size constrains its utility for training robust models across diverse contexts. To address this limitation, we constructed a synthetic dataset drawn from a broader range of examples, using automated techniques to annotate responses with relevance scores guided by the well-defined dimensions of semantic alignment, structural coherence, and contextual integration.
Datasets for relevance
Datasets like HelpSteer2, UltraFeedback, and Natural Questions provide valuable examples for training relevance scorers, as they include attributes closely aligned with these dimensions, enabling robust evaluation for relevance.
In the case of the UltraFeedback dataset, relevance scoring uses key attributes such as helpfulness, instruction following, truthfulness, and honesty. These factors are averaged to calculate a relevance score, reflecting how well a response adheres to instructions, provides factual accuracy, and remains aligned with the query while maintaining coherent reasoning.
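As a rough illustration of that averaging step, the sketch below computes a relevance score from the four attribute ratings; the field names are placeholders rather than the UltraFeedback dataset's exact schema:

# Illustrative only: average UltraFeedback-style aspect ratings into a relevance score.
# The attribute names below are placeholders, not the dataset's exact column names.
def ultrafeedback_relevance(ratings: dict) -> float:
    aspects = ["helpfulness", "instruction_following", "truthfulness", "honesty"]
    return sum(ratings[a] for a in aspects) / len(aspects)

example = {"helpfulness": 4, "instruction_following": 5, "truthfulness": 4, "honesty": 5}
print(ultrafeedback_relevance(example))  # 4.5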
For the Natural Questions dataset, relevance scores are assigned on a scale from 1 to 5 based on specific response qualities. Short answers that directly address the query are labeled with a relevance score of 5 for their precision and alignment. Long answers, while detailed and informative, are more verbose and receive a relevance score of 3. Distilled versions of long answers, which extract concise sentences containing the short answer, are assigned a relevance score of 4. Candidate chunks overlapping significantly with the long answer but lacking the short answer are labeled with a relevance score of 2, while unrelated chunks are given a relevance score of 1. Cosine similarity helps rank candidate chunks by their alignment with the query, and contextual sampling ensures a diverse and representative dataset for training.
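The chunk-ranking step can be sketched as follows; TF-IDF vectors stand in here for whatever embedding model is actually used, and the helper function is purely illustrative:

# Sketch: rank candidate chunks by cosine similarity to the query.
# TF-IDF is used purely for illustration; any embedding model could replace it.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_chunks(query: str, chunks: list[str]) -> list[tuple[str, float]]:
    """Return (chunk, similarity) pairs sorted by descending similarity to the query."""
    vectors = TfidfVectorizer().fit_transform([query] + chunks)
    similarities = cosine_similarity(vectors[0:1], vectors[1:]).flatten()
    return sorted(zip(chunks, similarities), key=lambda pair: pair[1], reverse=True)

ranked = rank_chunks(
    "who wrote pride and prejudice",
    ["Jane Austen wrote Pride and Prejudice in 1813.", "The novel is set in rural England."],
)
print(ranked)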
The HelpSteer2 dataset, originally developed for training reward models, can be effectively repurposed to compute relevance scores by leveraging its detailed attributes: helpfulness, correctness, coherence, and verbosity. Although relevance is not directly annotated in the dataset, these attributes align well with the three core dimensions of relevance - semantic alignment, structural coherence, and contextual integration.
Semantic alignment is captured through helpfulness and correctness, assessing whether the response directly addresses the user's query and provides accurate information. Structural coherence is reflected in the coherence attribute, which evaluates the logical flow and clarity of the response, and verbosity, which penalizes overly lengthy outputs to ensure conciseness. Contextual integration is supported by helpfulness and correctness, ensuring the response remains consistent with the given context and background information.
To compute a relevance score for the HelpSteer2 dataset, a weighted formula is applied to these attributes: helpfulness is assigned a high weight (0.4) to emphasize its critical role in aligning with user queries and contexts, correctness is weighted at 0.3 to ensure factual accuracy, coherence contributes 0.2 to reward clarity and flow, and verbosity is given a negative weight (-0.1) to discourage unnecessary detail. The resulting score provides a comprehensive metric that encapsulates how well a response meets the standards of relevance. This methodology transforms HelpSteer2 into a versatile tool for training relevance models, enabling developers to align AI outputs more closely with user expectations.
HelpSteer2 Relevance = (0.4 × Helpfulness) + (0.3 × Correctness) + (0.2 × Coherence) − (0.1 × Verbosity)
The Weave Relevance Scorer was trained on a combination of these datasets to ensure a comprehensive understanding of relevance across diverse use cases. This training process allowed the model to learn from high-quality examples and capture nuanced aspects of relevance. To support the AI community, the trained model has been made publicly available on Hugging Face, providing a reliable tool for evaluating and refining AI systems.
How to use the Weave Relevance Scorer
The following script demonstrates how to evaluate a response’s relevance using Weave’s relevance scorer:
import asyncio

import weave
from weave.scorers import RelevanceScorer

weave.init("relevance-scorer")

# Initialize the relevance scorer
relevance_scorer = RelevanceScorer()

# Define input, context, and output
input_text = "Some input text to consider"
context_text = "Some context to consider"  # Optional context
output_text = "Some output text to consider"

# Async function to compute relevance score
async def get_relevance_score():
    score = await relevance_scorer.score(input=input_text, context=context_text, output=output_text)
    print(score)

# Run the async function
asyncio.run(get_relevance_score())
The relevance scores generated by the relevance_scorer.score() function include a breakdown of how well a response aligns with its input and context, measured across dimensions such as semantic alignment, structural coherence, and contextual integration. These scores provide valuable insights into the quality of the model's outputs, enabling nuanced analysis and identification of areas where relevance criteria are met or where adjustments may be needed.
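For reference, the returned dictionary looks roughly like the example below; the exact keys can vary between Weave versions, but the evaluation script later in this article relies on the score and relevant fields:

# Approximate shape of the dictionary returned by relevance_scorer.score();
# exact keys can vary by Weave version. The evaluation script later in this
# article reads the "score" and "relevant" fields.
example_score = {
    "relevant": True,  # binary guardrail decision
    "score": 4,        # 1-5 Likert relevance rating
}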
After running the script, the results are automatically logged into Weave’s dashboard, ensuring continuous monitoring and refinement of model performance - a critical guardrail for AI-generated content.

Weave’s dashboard allows teams to explore and analyze the results in detail. By visualizing relevance scores and identifying patterns, developers can monitor trends, compare model outputs, and refine their systems to enhance performance.
Measuring relevance with Weave Evals
In this tutorial, we will demonstrate how to:
- Load and preprocess datasets for relevance scoring
- Apply Weave’s Relevance Scorer to assess model responses
- Compare results across multiple models, including OpenAI’s GPT-4o
By benchmarking Weave’s Relevance Scorer against OpenAI’s GPT-4o and GPT-4o-Mini, we establish a strong evaluation framework - a key guardrail for ensuring AI-generated outputs remain relevant and accurate.
import asyncio
import json
import os
from typing import Literal, Optional

import pandas as pd
from datasets import load_dataset
from litellm import acompletion
from pydantic import BaseModel, Field

import weave
from weave.scorers import RelevanceScorer

weave.init("relevance_eval")

def load_relevance_dataset():
    # Load the HelpSteer2 dataset
    dataset = load_dataset("nvidia/HelpSteer2", split="validation")

    # Convert to a Pandas DataFrame for easier manipulation
    df = pd.DataFrame(dataset)

    # Define the weights for relevance computation
    weights = {
        "helpfulness": 0.4,
        "correctness": 0.3,
        "coherence": 0.2,
        "verbosity": -0.1,  # Negative weight for verbosity
    }

    # Calculate relevance score using the weighted formula
    df["relevance"] = (
        weights["helpfulness"] * df["helpfulness"]
        + weights["correctness"] * df["correctness"]
        + weights["coherence"] * df["coherence"]
        + weights["verbosity"] * df["verbosity"]
    )

    # Determine binary labels (True for relevance >= 3.5, False otherwise)
    df["relevance_label"] = df["relevance"] >= 3.5

    # Separate into True and False groups
    true_group = df[df["relevance_label"] == True]
    false_group = df[df["relevance_label"] == False]

    # Determine the maximum number of samples per group
    max_samples = min(len(true_group), len(false_group), 75)  # Adjust dynamically based on available data

    # Sample from each group
    balanced_true = true_group.sample(n=max_samples, random_state=42)
    balanced_false = false_group.sample(n=max_samples, random_state=42)

    # Combine the balanced samples and shuffle the resulting DataFrame
    balanced_df = pd.concat([balanced_true, balanced_false]).reset_index(drop=True)
    balanced_df = balanced_df.sample(frac=1, random_state=42).reset_index(drop=True)

    # Prepare the dataset
    dataset_prepared = [
        {"output": row["response"], "label": row["relevance_label"], "input": row["prompt"]}
        for _, row in balanced_df.iterrows()
    ]
    return dataset_prepared

RELEVANCE_SYSTEM_PROMPT = """You are an expert evaluator assessing the relevance of LLM-generated outputs relative to their input context.
Your goal is to provide a single relevance score and classification based on comprehensive analysis.

Relevance measures how effectively a generated output addresses its input context across three core dimensions:

1. **Semantic Alignment**
- How directly does the output address key input requirements?
- Does it maintain topical focus?
- Does it provide complete coverage of necessary information?
- Is unnecessary content avoided?

2. **Structural Coherence**
- Does the output flow logically and show internal consistency?
- Is the presentation of information clear and organized?
- Is there a good balance between completeness and conciseness?

3. **Contextual Integration**
- How well does the output use the provided context?
- Does the output align with the broader discourse?
- Is it consistent with background information?
- Does it fulfill task-specific requirements?

## Evaluation Process
1. Review all input context (instructions, prompts, documents, chat history)
2. Identify core requirements and purpose
3. Analyze the LLM output across all three dimensions
4. Assign a single relevance score (1-5):
- 5: Exceptional relevance across all dimensions
- 4: Strong relevance with minor gaps
- 3: Adequate relevance with some issues
- 2: Significant relevance issues
- 1: Major relevance problems
5. Classify as relevant (score ≥ 3.0) or not relevant (score < 3.0)

## Task-Specific Considerations
- **Summarization**: Focus on key information selection and density
- **Q&A**: Emphasize answer accuracy and completeness
- **Chat**: Consider conversation flow and context maintenance
- **RAG**: Evaluate retrieved information integration

## Output Format
Provide evaluation results in the following JSON format:
```json
{
"relevance": [score from 1-5],
"relevant": [true/false]
}
```
""".strip()

class RelevanceScore(BaseModel):
    """The level of relevance of a <completion> for a given <context>."""
    chain_of_thought: str = Field(..., description="The chain of thought that led to the prediction")
    score: int = Field(..., description="Score the relevance of the <completion> on a likert scale of 1 to 5")
    relevance: Literal[
        "Perfectly Relevant", "Mostly Relevant", "A Little Irrelevant", "Mostly Irrelevant", "Completely Irrelevant"
    ] = Field(..., description="The level of relevance of the <completion>")
    relevant: bool = Field(..., description="Whether the <completion> is relevant or not, anything above 3 is relevant")

class OpenAIRelevanceScorer(weave.Model):
    system_prompt: str = RELEVANCE_SYSTEM_PROMPT
    model_id: str
    temperature: float = 0.7
    max_tokens: int = 4096

    def __init__(self, model_id: str, temperature: float = 0.7, max_tokens: int = 4096):
        """Initialize the OpenAIRelevanceScorer with a specific model ID."""
        super().__init__(model_id=model_id, temperature=temperature, max_tokens=max_tokens)
        self.model_id = model_id
        self.temperature = temperature
        self.max_tokens = max_tokens

    def _format_messages(
        self,
        prompt: str,
        completion: str,
        context: Optional[list[str]],
        chat_history: Optional[list[dict[str, str]]],
    ) -> list[dict[str, str]]:
        """Format the prompt for the model."""
        chat_history = chat_history if isinstance(chat_history, list) else []
        context = context if isinstance(context, list) else []
        if context:
            context = "\n".join(context).strip()
            context = f"<documents>\n{context}\n</documents>"
        else:
            context = ""
        prompt = f"{context}\n\n{prompt}".strip()
        messages = chat_history + [{"role": "user", "content": prompt}]
        messages = [f"<|msg_start|>{message['role']}\n{message['content']}<|msg_end|>" for message in messages]
        messages = "\n".join(messages)
        context = f"<context>{messages}</context>\n"
        completion = f"<completion>{completion}</completion>\n"
        context_and_completion = context + completion
        return [
            {"role": "system", "content": self.system_prompt},
            {"role": "user", "content": context_and_completion},
        ]

    async def run_inference(self, messages):
        api_key = os.getenv("OPENAI_API_KEY")  # Ensure your API key is set in the environment variable
        model_name = self.model_id
        response = await acompletion(
            model=model_name,
            api_key=api_key,
            messages=messages,
            temperature=self.temperature,
            max_tokens=self.max_tokens,
        )
        return response["choices"][0]["message"]["content"]

    @weave.op
    async def predict(
        self,
        input: str,
        output: str,
        context: Optional[list[str]] = None,
        chat_history: Optional[list[dict[str, str]]] = None,
    ) -> bool:
        try:
            messages = self._format_messages(prompt=input, completion=output, context=context, chat_history=chat_history)
            response = await self.run_inference(messages)
            if response.startswith('```') and response.endswith('```'):
                response = response.split('\n', 1)[1].rsplit('\n', 1)[0]
            parsed_response = json.loads(response)  # Convert the string to a Python dictionary
            return bool({k: v for k, v in parsed_response.items() if k != "chain_of_thought"}['relevant'])
        except Exception as e:
            print(str(e))
            return False

# Define Weave Relevance Scorer Model
class WeaveRelevanceModel(weave.Model):
    @weave.op
    async def predict(self, input: str, output: str) -> bool:
        result = await scorer.score(input=input, output=output)
        # return result['relevant']
        return result['score'] >= 3  # multiple ways to determine relevance prediction here

class PrecisionRecallF1Scorer(weave.Scorer):
    """Custom scorer to calculate precision, recall, F1, and accuracy at the dataset level."""

    @weave.op
    def score(self, label: int, model_output: int) -> dict:
        """Compute True Positives, False Positives, False Negatives, and True Negatives for a single row."""
        tp = int(label == 1 and model_output == 1)  # True Positive
        fp = int(label == 0 and model_output == 1)  # False Positive
        fn = int(label == 1 and model_output == 0)  # False Negative
        tn = int(label == 0 and model_output == 0)  # True Negative
        return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}

    def summarize(self, score_rows: list) -> dict:
        """Summarize precision, recall, F1, and accuracy from the row-level scores."""
        # Aggregate true positives, false positives, false negatives, and true negatives
        total_tp = sum(row["tp"] for row in score_rows)
        total_fp = sum(row["fp"] for row in score_rows)
        total_fn = sum(row["fn"] for row in score_rows)
        total_tn = sum(row["tn"] for row in score_rows)
        # Calculate precision, recall, F1, and accuracy
        precision = total_tp / (total_tp + total_fp) if (total_tp + total_fp) > 0 else 0
        recall = total_tp / (total_tp + total_fn) if (total_tp + total_fn) > 0 else 0
        f1 = (
            2 * (precision * recall) / (precision + recall)
            if (precision + recall) > 0
            else 0
        )
        accuracy = (
            (total_tp + total_tn) / (total_tp + total_fp + total_fn + total_tn)
            if (total_tp + total_fp + total_fn + total_tn) > 0
            else 0
        )
        return {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "accuracy": accuracy,
        }

# Instantiate OpenAI Relevance Scorers and the Weave Relevance Scorer
openai_scorer_4o = OpenAIRelevanceScorer(model_id="gpt-4o-2024-08-06", temperature=0.0, max_tokens=4096)
openai_scorer_4omini = OpenAIRelevanceScorer(model_id="gpt-4o-mini", temperature=0.0, max_tokens=2048)
scorer = RelevanceScorer()

# Define OpenAIScorer4o Model
class OpenAIScorer4o(weave.Model):
    """Weave model that wraps the OpenAIRelevanceScorer for gpt-4o."""

    @weave.op
    async def predict(self, input: str, output: str, context: Optional[list[str]] = None, chat_history: Optional[list[dict[str, str]]] = None) -> bool:
        """Use the instantiated gpt-4o scorer to predict relevance."""
        return await openai_scorer_4o.predict(input=input, output=output, context=context, chat_history=chat_history)

# Define OpenAIScorer4omini Model
class OpenAIScorer4omini(weave.Model):
    """Weave model that wraps the OpenAIRelevanceScorer for gpt-4o-mini."""

    @weave.op
    async def predict(self, input: str, output: str, context: Optional[list[str]] = None, chat_history: Optional[list[dict[str, str]]] = None) -> bool:
        """Use the instantiated gpt-4o-mini scorer to predict relevance."""
        return await openai_scorer_4omini.predict(input=input, output=output, context=context, chat_history=chat_history)

# Run the evaluations
async def run_evaluations():
    """Run evaluations for the scorers."""
    # Load dataset
    dataset = load_relevance_dataset()
    print("Dataset loaded...")

    # Initialize models
    models = {
        "WeaveRelevanceModel": WeaveRelevanceModel(),
        "OpenAIRelevanceScorer4oMini": OpenAIScorer4omini(),
        "OpenAIRelevanceScorer4o": OpenAIScorer4o(),
    }

    # Define evaluation scorers
    scorers = [PrecisionRecallF1Scorer()]

    # Run evaluations
    results = {}
    for model_name, model in models.items():
        print(f"\nEvaluating {model_name}...")
        evaluation = weave.Evaluation(dataset=dataset, scorers=scorers, name=model_name + " Eval")
        results[model_name] = await evaluation.evaluate(model)

    # Print results
    for model_name, result in results.items():
        print(f"\nResults for {model_name}:")
        print(result)

if __name__ == "__main__":
    asyncio.run(run_evaluations())
This script demonstrates how to evaluate the relevance of AI-generated outputs using the Weave platform. The process begins by loading the HelpSteer2 dataset, a resource that provides detailed annotations to guide reward modeling for AI systems. While HelpSteer2 does not explicitly include a relevance annotation, relevance scores are computed by combining attributes such as helpfulness, correctness, coherence, and verbosity to provide a structured evaluation of response quality.
Benchmarking different relevance scoring models
This evaluation compares the Weave Relevance Scorer against OpenAI’s GPT-4o and GPT-4o-mini to benchmark their performance in assessing relevance. By analyzing the relevance scores and model outputs across these tools, we highlight key strengths, trade-offs, and areas for improvement, offering actionable insights into different scoring methodologies.
To ensure a balanced evaluation, the HelpSteer2 dataset is processed to:
- Assign relevance scores using a weighted formula, where helpfulness and correctness are prioritized, coherence ensures clarity, and verbosity is penalized.
- Label responses as relevant or not based on whether the computed score meets or exceeds 3.5.
- Balance the dataset by selecting an equal number of examples from both relevant and not relevant groups, capping at 75 examples per category to avoid bias.
Once processed, this dataset is used to evaluate multiple relevance scoring models, including Weave’s scorer and OpenAI-based models such as GPT-4o and GPT-4o-mini.
Interpreting the results
After running the evaluation, results are processed by the custom PrecisionRecallF1Scorer (a weave.Scorer subclass defined above), which aggregates row-level counts - true positives, false positives, false negatives, and true negatives - into overall precision, recall, F1 score, and accuracy. These results are logged to the Weave dashboard, where developers can visualize model performance, identify patterns, and refine scoring methods to improve alignment with user expectations.
Evaluating the results & key insights

For this evaluation, the Weave Relevance Model demonstrated the highest precision at 0.556, outperforming the OpenAI 4o Relevance Scorer and OpenAI 4o Mini Relevance Scorer. This indicates that the Weave model is highly effective in minimizing false positives, making it a strong choice for high-accuracy applications where misclassified responses must be avoided.
However, OpenAI’s GPT-4o Mini Relevance Scorer achieved the highest recall (0.778), outperforming both the Weave Relevance Model (0.556) and OpenAI’s GPT-4o Relevance Scorer (0.593). This suggests that the Mini model was more effective in capturing relevant outputs, even if it sacrificed some precision. This result challenges the assumption that larger models always perform better, demonstrating that smaller models can sometimes achieve superior recall.
Interestingly, the GPT-4o Mini Relevance Scorer also achieved the highest F1 score (0.592), indicating it effectively balances precision and recall. In contrast:
- Weave Relevance Model followed with an F1 score of 0.556
- OpenAI GPT-4o Relevance Scorer lagged behind at 0.508
In terms of overall accuracy, the Weave Relevance Model performed the best, scoring 0.556, reflecting greater consistency in predictions. Comparatively:
- GPT-4o-mini scored 0.463
- GPT-4o Relevance Scorer trailed at 0.426
Key takeaways & practical implications
- Weave Relevance Scorer excels in precision and accuracy, making it particularly suitable for scenarios requiring high reliability and fewer false positives.
- GPT-4o Mini balances recall and F1 score effectively, making it advantageous when capturing as many relevant outputs as possible is a priority, even at the cost of some misclassifications.
- Smaller models like GPT-4o Mini can outperform larger models in specific evaluation areas, highlighting the importance of task-specific benchmarking rather than assuming larger models always perform better.
Overall, the Weave Relevance Scorer offers an open-source, low-latency, and compute-efficient solution for relevance evaluation. Its accessibility makes it a practical choice for a wide range of applications, while the comparative analysis with OpenAI models provides valuable insights into optimizing relevance scoring strategies.
Weave comparisons view
In addition to the evaluation metrics, Weave’s comparisons view offers a detailed analysis of individual outputs generated by each model on specific examples from the dataset. This feature allows side-by-side examination of each model’s responses alongside the corresponding input text. By doing so, it highlights qualitative differences in how models approach the task, such as their ability to maintain semantic alignment, clarity, or inclusion of relevant context.

Exploring these comparisons can reveal patterns in model performance, uncovering strengths and weaknesses that may not be evident from aggregate metrics alone. This granular insight is particularly useful for diagnosing issues, understanding why certain models perform better on specific cases, and identifying areas for targeted improvement. With this functionality, Weave provides a comprehensive framework for evaluating, comparing, and refining models based on both quantitative and qualitative insights.
Conclusion
Relevance scoring is an essential tool in shaping how AI systems generate meaningful and contextually appropriate outputs. Throughout this article, we have seen how tools like the Weave Relevance Scorer and datasets such as HelpSteer2, UltraFeedback, and Natural Questions enable a nuanced evaluation of model responses. These tools help to ensure responses not only address the prompt but do so in a way that is accurate and clear.
The evaluation of models demonstrated the strengths and trade-offs inherent in different approaches. The Weave Relevance Scorer stood out for its precision and accuracy, while the OpenAI GPT-4o-mini Relevance Scorer excelled in recall and balance. Such comparisons reveal the importance of selecting the right tools based on specific needs. Moreover, the flexibility of relevance thresholds, such as the ≥ 3 threshold used here, highlights how scoring criteria can be adapted to match the goals of a given evaluation.
Ultimately, relevance scoring is not just about assessing performance but about improving how systems interact with users. It enables iterative refinements that align outputs more closely with expectations and context. As AI systems continue to play a central role in a variety of applications, relevance scoring will remain a foundational process in ensuring these systems are both effective and meaningful in their responses. I hope you enjoyed this article!