Context Relevance Scorer
This report details the creation and evaluation of the Context Relevance Scorer.
Created on December 16 | Last edited on February 6
Definition
Evaluation
The following screenshot shows the evaluation comparisons between three versions of the Wandb Scorers, measured against the RAGAS scores and the ground-truth labels.
- The WandbScorer has a weighted F1-score of ~68%, compared to the OpenAIScorer's accuracy of ~20%.
- However, the false positive rate of the WandbScorer is 7%, compared to ~24% for the OpenAIScorer.
- Although the model runs locally on a CPU, the average latency of the WandbScorer is ~4s, compared to ~7.6s for the OpenAIScorer.
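For reference, the two headline metrics above can be computed as follows. This is a minimal sketch in plain Python (not the evaluation harness used for the report), assuming binary relevant/irrelevant labels:

```python
def weighted_f1(y_true, y_pred):
    """Weighted F1: per-class F1 averaged, weighted by each class's support."""
    classes = sorted(set(y_true))
    total = len(y_true)
    score = 0.0
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        support = sum(1 for t in y_true if t == c)
        score += f1 * support / total
    return score

def false_positive_rate(y_true, y_pred, positive=1):
    """FPR = FP / (FP + TN): fraction of irrelevant contexts flagged as relevant."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return fp / (fp + tn) if fp + tn else 0.0
```

A lower false positive rate matters here because a relevance scorer that waves irrelevant context through defeats its purpose, even at similar F1.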


Usage
The relevance scorer returns a pass boolean indicating whether or not the context is relevant to the input and response. For additional granularity it also returns a score, which is the degree of relevance, along with the detected relevant spans. When the scorer is initialised, the model weights are downloaded if they are not already on disk.
from weave.scorers import ContextRelevanceScorer

relevance_scorer = ContextRelevanceScorer()

query = "Where is the Eiffel Tower located?"
output = "The Eiffel Tower is located in Paris."
context = ["The Eiffel Tower is located in Paris."]

result = relevance_scorer.score(query=query, context=context, output=output)
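The fields described above can then be consumed downstream. The sketch below works on a plain dict shaped like that description; the exact field names and span format are an assumption, not the scorer's documented schema:

```python
# Hypothetical scorer output, shaped like the fields described above:
# a boolean "pass" flag, a relevance score, and character spans into the context.
result = {
    "pass": True,
    "score": 0.92,
    "spans": [{"start": 0, "end": 37}],  # assumed: character offsets into the context
}

context = "The Eiffel Tower is located in Paris."

def extract_relevant(context, result):
    """Pull out the substrings the scorer flagged as relevant."""
    return [context[s["start"]:min(s["end"], len(context))] for s in result["spans"]]

# Gate downstream use of the context on the pass flag, keeping the
# score and spans for logging or for highlighting the relevant text.
if result["pass"]:
    relevant_chunks = extract_relevant(context, result)
```

Thresholding on the score instead of the boolean is also an option when you want to tune the pass/fail trade-off yourself.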
Datasets
Training
Training Metrics
We trained the model for 2 epochs on a combination of the datasets above. We have also made this model publicly available in the following Hugging Face repo.
Appendix
Table of predictions of the WandbContextRelevance Scorer.