Coherence Scorer
This report details the creation of the Coherence Scorer
Definition
The degree to which the output is clear, easy to understand, and maintains a proper logical flow.
- The response is self-consistent in content and style of writing, and does not contradict itself.
- The response can be logically followed and understood by a human.
- The response does not contain redundant or repeated information (this matters especially for story generation, dialogue generation, and open-ended prompts/questions with no single correct answer).
Evaluation
Below is an evaluation comparison between OpenAI's GPT-4o and our custom W&B Coherence Scorer.
- The W&B Coherence Scorer has a Weighted F1-Score of 74%, just marginally below GPT-4o's 76%
- The W&B Coherence Scorer has a lower False Positive Rate than GPT-4o: 19% vs. 31%
- Recall for the W&B Coherence Scorer is very strong at 96% vs. 84% for GPT-4o, while its accuracy of 67% is somewhat lower than GPT-4o's 73%
- Despite running locally on a CPU, the W&B Coherence Scorer's average model latency is 0.2s, roughly 20x faster than GPT-4o's 4.3s
The test labels are also shown to establish a baseline True Positive and False Positive rate.
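For reference, here's a minimal sketch of how these metrics can be computed from a labelled test set with scikit-learn. The labels and predictions below are placeholders, not our actual evaluation data; 1 means "coherent", 0 means "incoherent".

import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, recall_score

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])  # placeholder ground-truth labels
y_pred = np.array([1, 1, 0, 1, 1, 0, 1, 0])  # placeholder scorer predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("weighted F1:        ", f1_score(y_true, y_pred, average="weighted"))
print("recall (coherent):  ", recall_score(y_true, y_pred))
print("accuracy:           ", accuracy_score(y_true, y_pred))
print("false positive rate:", fp / (fp + tn))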

Comparison of Predictions on Coherent Responses
Comparison of Predictions on Incoherent Responses
Usage
from weave.scorers import CoherenceScorer

coherence_scorer = CoherenceScorer()

input = "What is the capital of Antarctica?"
output = "but why not monkey up day"

result = coherence_scorer.score(input=input, output=output)
print(f"Output is coherent: {result['is_coherent']}")
print(result)

# Output is coherent: False
# {'is_coherent': False, 'coherence': 'Mostly Incoherent', 'coherence_score': 1, 'confidence': 0.32165247201919556}
Training
We trained the model for two epochs on a combination of the above datasets. Here are the model evaluation logs.
Here's an overview of the best model metrics from the training run:
"eval/accuracy": 0.69"eval/f1": 0.63"eval/loss": 0.78"eval/precision": 0.63
Model selection
We trained tasksource/deberta-small-long-nli, a DeBERTa-based model, to predict the level of coherence of a given input-response pair. We chose this model because:
- The model has 142M parameters, small enough to run on most modern CPUs with low latency
- The model has a context length of 1,680 tokens, enough to cover a large percentage of the prompt-response pairs we might encounter in LLM applications
- The model has already been trained on a variety of useful NLP tasks, including coherence classification
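The first two properties above can be sanity-checked with a short snippet using the standard transformers API; exact numbers may differ slightly from the figures quoted.

from transformers import AutoConfig, AutoModel

config = AutoConfig.from_pretrained("tasksource/deberta-small-long-nli")
model = AutoModel.from_pretrained("tasksource/deberta-small-long-nli")

# Count trainable parameters of the base encoder (~142M per the report).
n_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {n_params / 1e6:.0f}M")

# Maximum sequence length supported by the position embeddings.
print(f"max position embeddings: {config.max_position_embeddings}")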
Existing Work
Key Takeaways
Developing a metric and scorer for coherence is challenging, primarily for the following reasons:
- Subjectivity: Coherence is inherently subjective, and what is coherent to one reader may not be to another.
- Complexity of Language: The complexity and variability of natural language make it difficult to develop universal coherence metrics.
- Lack of Gold Standard: The absence of a single "gold standard" for coherence evaluation makes it challenging to establish clear benchmarks for success.
- Metric Discrepancies: Different coherence measures can sometimes negatively correlate with each other, indicating that they may favor different aspects of coherence.
- Context Dependency: The coherence of a text can be highly dependent on context, which is challenging to capture fully in automated metrics.
- Deeper Semantic Understanding: Current models may struggle with deeper semantic relationships and narrative structures that are easily grasped by humans.
Despite these challenges, we were able to build a small classification model capable of evaluating the coherence of a response for a given input. The model performs comparably to a GPT-4o-based scorer while running locally and considerably faster.
Future Work
Appendix: Scorer Implementation