
Context Relevance Scorer

This report details the creation of the Context Relevance Scorer.
Created on December 16 | Last edited on February 6

Definition

Evaluation

The W&B Context Relevance Scorer was evaluated against the RAGAS Context Precision Scorer.
The following screenshot contains the evaluation comparisons between three versions of the Wandb scorers, measured against the RAGAS scores and the truth labels.
  • The WandbScorer has a weighted F1-score of ~68%, compared to the OpenAIScorer's accuracy of ~20%.
  • However, the false positive rate of the WandbScorer is ~7%, compared to ~24% for the OpenAIScorer.
  • Although the model runs locally on a CPU, the average model latency of the WandbScorer is ~4s, compared to ~7.6s for the OpenAIScorer.
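To make the two headline metrics above concrete, a weighted F1-score and a false positive rate can be computed from binary relevance predictions as follows. This is a minimal plain-Python sketch of the metric definitions, not the actual evaluation harness:

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    # Per-class F1, weighted by each class's support in y_true
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if p == cls and t != cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += f1 * count / total
    return score

def false_positive_rate(y_true, y_pred):
    # FP / (FP + TN): irrelevant contexts incorrectly flagged as relevant
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if p == 0 and t == 0)
    return fp / (fp + tn) if fp + tn else 0.0
```

A low false positive rate matters here because wrongly marking irrelevant context as relevant lets bad retrievals slip into generation unflagged.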





Usage

The relevance scorer returns a pass boolean indicating whether the context is relevant to the input and response. For additional granularity, it also returns a score (the degree of relevance) and the detected relevant spans. When the scorer is initialised, the model weights are downloaded if they are not already on disk.

from weave.scorers import ContextRelevanceScorer

relevance_scorer = ContextRelevanceScorer()

question = "Where is the Eiffel Tower located?"
response = "The Eiffel Tower is located in Paris."
context = ["The Eiffel Tower is located in Paris."]

result = relevance_scorer.score(query=question, context=context, output=response)
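The exact shape of the returned object may differ by Weave version, but based on the fields described above (a pass flag, a relevance score, and detected spans), the result could be gated along these lines. The field names and threshold here are illustrative assumptions, not the confirmed API:

```python
# Hypothetical result shape mirroring the fields described above;
# the real scorer output may name these differently.
result = {
    "pass": True,
    "score": 0.92,                       # degree of relevance
    "spans": [{"start": 0, "end": 38}],  # detected relevant context spans
}

def is_relevant(result, threshold=0.5):
    # Combine the boolean verdict with the graded score for stricter gating
    return bool(result["pass"]) and result["score"] >= threshold
```

Gating on both fields lets callers tighten the threshold for high-stakes pipelines without retraining the scorer.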


Datasets



Training

The deberta-small-long-nli model was fine-tuned to produce the Context Relevance model.
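Since the scorer reports relevant spans as well as an overall verdict, one plausible way to frame the fine-tuning data is as token-level classification, tagging each context token as inside or outside a relevant span. The helper below is an illustrative sketch of that labelling step, not the actual training pipeline:

```python
def span_to_token_labels(tokens, relevant_spans):
    """Label each token 1 if its index falls in a relevant span, else 0.

    `relevant_spans` is a list of (start, end) token-index pairs,
    end-exclusive -- an assumed format for illustration.
    """
    labels = [0] * len(tokens)
    for start, end in relevant_spans:
        for i in range(start, min(end, len(tokens))):
            labels[i] = 1
    return labels

tokens = "The Eiffel Tower is located in Paris .".split()
labels = span_to_token_labels(tokens, [(6, 7)])  # "Paris" marked relevant
```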


Training Metrics

W&B run
We trained the model for 2 epochs on a combination of the above datasets. We have also made this model publicly available in the following Hugging Face repo.



Appendix

Table of predictions of the Wandb Context Relevance Scorer.