Coherence Scorer
This report details the creation of the Coherence Scorer
Definition
The degree to which the output is clear, easy to understand, and maintains a proper logical flow.
- The response is self-consistent in content and style of writing, and does not contradict itself.
- The response can be logically followed and understood by a human.
- The response does not contain redundant or repeated information (this matters especially for story generation, dialogue generation, and open-ended prompts/questions with no single correct answer).
Evaluation
Below is an evaluation comparison between OpenAI's GPT-4o and our custom W&B Coherence Scorer.
- The W&B Coherence Scorer has a Weighted F1-Score of 74%, just marginally below GPT-4o's 76%
- The W&B Coherence Scorer has a lower False Positive Rate than GPT-4o: 19% vs. 31%
- Recall for the W&B Coherence Scorer is very strong at 96% vs. 84% for GPT-4o, while its accuracy of 67% is somewhat lower than GPT-4o's 73%
- Despite running locally on a CPU, the W&B Coherence Scorer's average model latency is 0.2s, roughly 20x faster than GPT-4o's 4.3s
The test labels are also shown to establish a baseline True Positive and False Positive rate.
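For reference, here's a minimal sketch of how these metrics can be computed from a labelled test set with scikit-learn. The labels and predictions below are placeholders, not our actual evaluation data; 1 means "coherent", 0 means "incoherent".

import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, recall_score

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])  # placeholder ground-truth labels
y_pred = np.array([1, 1, 0, 1, 1, 0, 1, 0])  # placeholder scorer predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("weighted F1:        ", f1_score(y_true, y_pred, average="weighted"))
print("recall (coherent):  ", recall_score(y_true, y_pred))
print("accuracy:           ", accuracy_score(y_true, y_pred))
print("false positive rate:", fp / (fp + tn))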

Comparison of Predictions on Coherent Responses
Comparison of Predictions on Incoherent Responses
Usage
from weave.scorers import CoherenceScorer

coherence_scorer = CoherenceScorer()

input = "What is the capital of Antarctica?"
output = "but why not monkey up day"

result = coherence_scorer.score(input=input, output=output)
print(f"Output is coherent: {result['is_coherent']}")
print(result)

# Output is coherent: False
# {'is_coherent': False, 'coherence': 'Mostly Incoherent', 'coherence_score': 1, 'confidence': 0.32165247201919556}
Training
We trained the model for two epochs on a combination of the above datasets. Here are the model evaluation logs.
Here's an overview of the best model metrics from the training run:
"eval/accuracy": 0.69"eval/f1": 0.63"eval/loss": 0.78"eval/precision": 0.63
Model selection
We trained tasksource/deberta-small-long-nli, a DeBERTa-based model, to predict the level of coherence of a given input-response pair. We chose this model because:
- The model has 142M parameters, small enough to run on most modern CPUs with low latency
- The model has a context length of 1,680 tokens, enough to cover a large percentage of the prompt-response pairs we might encounter in LLM applications
- The model has already been trained on a variety of useful NLP tasks, including coherence classification
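The first two properties above can be sanity-checked with a short snippet using the standard transformers API; exact numbers may differ slightly from the figures quoted.

from transformers import AutoConfig, AutoModel

config = AutoConfig.from_pretrained("tasksource/deberta-small-long-nli")
model = AutoModel.from_pretrained("tasksource/deberta-small-long-nli")

# Count trainable parameters of the base encoder (~142M per the report).
n_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {n_params / 1e6:.0f}M")

# Maximum sequence length supported by the position embeddings.
print(f"max position embeddings: {config.max_position_embeddings}")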
Existing Work
Key Takeaways
Developing a metric and scorer for coherence is challenging, primarily for the following reasons:
- Subjectivity: Coherence is inherently subjective, and what is coherent to one reader may not be to another.
- Complexity of Language: The complexity and variability of natural language make it difficult to develop universal coherence metrics.
- Lack of Gold Standard: The absence of a single "gold standard" for coherence evaluation makes it challenging to establish clear benchmarks for success.
- Metric Discrepancies: Different coherence measures can sometimes negatively correlate with each other, indicating that they may favor different aspects of coherence.
- Context Dependency: The coherence of a text can be highly dependent on context, which is challenging to capture fully in automated metrics.
- Deeper Semantic Understanding: Current models may struggle with deeper semantic relationships and narrative structures that are easily grasped by humans.
Despite these challenges, we were able to build a small classification model capable of evaluating the coherence of a response for a given input. The model performs comparably to a GPT-4o-based scorer while running locally and considerably faster.
Future Work
Appendix: Scorer Implementation