
Context Relevance Scorer

This report details the creation of the Context Relevance Scorer.
Created on December 16 | Last edited on February 6

Definition

Evaluation

The W&B Context Relevance Scorer was evaluated against the RAGAS Context Precision Scorer.
The following screenshot contains the evaluation comparisons between three versions of the Wandb scorers, measured against the RAGAS scores and the truth labels.
  • The WandbScorer has a weighted F1-score of ~68%, compared to the OpenAIScorer's accuracy of ~20%.
  • However, the false positive rate of the WandbScorer is ~7%, compared to ~24% for the OpenAIScorer.
  • Although the model runs locally on a CPU, the average model latency of the WandbScorer is ~4s, compared to ~7.6s for the OpenAIScorer.
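To make the two headline metrics above concrete, a weighted F1-score and a false positive rate can be computed from binary relevance predictions as follows. This is a minimal plain-Python sketch of the metric definitions, not the actual evaluation harness:

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    # Per-class F1, weighted by each class's support in y_true
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if p == cls and t != cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += f1 * count / total
    return score

def false_positive_rate(y_true, y_pred):
    # FP / (FP + TN): irrelevant contexts incorrectly flagged as relevant
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if p == 0 and t == 0)
    return fp / (fp + tn) if fp + tn else 0.0
```

A low false positive rate matters here because wrongly marking irrelevant context as relevant lets bad retrievals slip into generation unflagged.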





Usage

The relevance scorer returns a pass boolean indicating whether the context is relevant to the input and response. For additional granularity, it also returns a score (the degree of relevance) and the detected relevant spans. When the scorer is initialised, the model weights are downloaded if they are not already on disk.

from weave.scorers import ContextRelevanceScorer

relevance_scorer = ContextRelevanceScorer()

question = "Where is the Eiffel Tower located?"
response = "The Eiffel Tower is located in Paris."
context = ["The Eiffel Tower is located in Paris."]

result = relevance_scorer.score(query=question, context=context, output=response)
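The exact shape of the returned object may differ by Weave version, but based on the fields described above (a pass flag, a relevance score, and detected spans), the result could be gated along these lines. The field names and threshold here are illustrative assumptions, not the confirmed API:

```python
# Hypothetical result shape mirroring the fields described above;
# the real scorer output may name these differently.
result = {
    "pass": True,
    "score": 0.92,                       # degree of relevance
    "spans": [{"start": 0, "end": 38}],  # detected relevant context spans
}

def is_relevant(result, threshold=0.5):
    # Combine the boolean verdict with the graded score for stricter gating
    return bool(result["pass"]) and result["score"] >= threshold
```

Gating on both fields lets callers tighten the threshold for high-stakes pipelines without retraining the scorer.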


Datasets



Training

The deberta-small-long-nli model was fine-tuned to produce the Context Relevance model.
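Since the scorer reports relevant spans as well as an overall verdict, one plausible way to frame the fine-tuning data is as token-level classification, tagging each context token as inside or outside a relevant span. The helper below is an illustrative sketch of that labelling step, not the actual training pipeline:

```python
def span_to_token_labels(tokens, relevant_spans):
    """Label each token 1 if its index falls in a relevant span, else 0.

    `relevant_spans` is a list of (start, end) token-index pairs,
    end-exclusive -- an assumed format for illustration.
    """
    labels = [0] * len(tokens)
    for start, end in relevant_spans:
        for i in range(start, min(end, len(tokens))):
            labels[i] = 1
    return labels

tokens = "The Eiffel Tower is located in Paris .".split()
labels = span_to_token_labels(tokens, [(6, 7)])  # "Paris" marked relevant
```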


Training Metrics

W&B run
We trained the model for 2 epochs on a combination of the above datasets. We have also made this model publicly available in the following Hugging Face repo.



Appendix

Table of predictions of the Wandb Context Relevance Scorer.