Coherence Scorer

This report details the creation of the Coherence Scorer.

Definition

The degree to which the output is clear, easy to understand, and maintains a logical flow.
  • The response is self-consistent in content and writing style, and does not contradict itself.
  • The response can be logically followed and understood by a human.
  • The response does not contain redundant or repeated information (relevant for story generation, dialogue generation, and open-ended prompts/questions with no clear right answer).

Evaluation

Below are the evaluation comparisons between OpenAI's GPT-4o and our custom W&B Coherence Scorer.
  • The W&B Coherence Scorer has a weighted F1-score of 74%, marginally below GPT-4o's 76%.
  • The W&B Coherence Scorer has a lower false positive rate than GPT-4o: 19% vs. 31%.
  • Recall for the W&B Coherence Scorer is very strong at 96% vs. 84% for GPT-4o, while its accuracy of 67% is somewhat lower than GPT-4o's 73%.
  • Although it runs locally on a CPU, the W&B Coherence Scorer's average model latency is 0.2s, roughly 20x faster than GPT-4o's 4.3s.
The test labels are also shown to establish baseline true positive and false positive rates.
A summary comparison of the different evaluation scorers is given below, along with a sketch of how these metrics are computed.
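For reference, here is a minimal sketch of how these metrics can be computed from raw binary predictions with scikit-learn. The label lists are placeholders (1 = coherent, 0 = incoherent), not our test data:

from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, recall_score

y_true = [1, 1, 0, 1, 0, 1, 0, 0]  # placeholder ground-truth labels
y_pred = [1, 1, 1, 1, 0, 1, 0, 1]  # placeholder scorer predictions

weighted_f1 = f1_score(y_true, y_pred, average="weighted")
recall = recall_score(y_true, y_pred)  # recall on the coherent (positive) class
accuracy = accuracy_score(y_true, y_pred)

# False positive rate: incoherent responses incorrectly labeled coherent.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
fpr = fp / (fp + tn)

print(f"weighted F1: {weighted_f1:.2f}, recall: {recall:.2f}, "
      f"accuracy: {accuracy:.2f}, FPR: {fpr:.2f}")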

Comparison of Predictions on Coherent Responses

Comparison of Predictions on Incoherent Responses

Usage

from weave.scorers import CoherenceScorer

coherence_scorer = CoherenceScorer()

# An intentionally incoherent response to a simple question.
input = "What is the capital of Antarctica?"
output = "but why not monkey up day"

result = coherence_scorer.score(input=input, output=output)

print(f"Output is coherent: {result['is_coherent']}")
print(result)

# Output is coherent: False
# {'is_coherent': False, 'coherence': 'Mostly Incoherent', 'coherence_score': 1, 'confidence': 0.32165247201919556}
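The scorer can also be plugged into a Weave evaluation loop. Here is a minimal sketch; the project name, toy dataset, and echo model are hypothetical stand-ins:

import asyncio

import weave
from weave.scorers import CoherenceScorer

weave.init("coherence-demo")  # hypothetical project name

# Toy dataset; the "input" key matches the scorer's expected argument name.
examples = [
    {"input": "What is the capital of France?"},
    {"input": "Explain photosynthesis in one sentence."},
]

@weave.op()
def model(input: str) -> str:
    # Placeholder model; in practice this would call your LLM.
    return f"You asked: {input}"

evaluation = weave.Evaluation(dataset=examples, scorers=[CoherenceScorer()])
asyncio.run(evaluation.evaluate(model))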

Training

We trained the model for two epochs on a combination of the above datasets. Here are the model evaluation logs:


Here's an overview of the best model metrics from the training run:
"eval/accuracy": 0.69
"eval/f1": 0.63
"eval/loss": 0.78
"eval/precision": 0.63
"eval/recall": 0.69

Model selection

We trained tasksource/deberta-small-long-nli, a DeBERTa-based model, to predict the level of coherence of a given input-response pair. We chose this model because:
  • The model has 142M parameters - small enough to run on most modern CPUs with low latency (a rough timing sketch follows this list).
  • The model has a context length of 1,680 tokens - enough to cover a large share of the prompt-response pairs we might encounter in LLM applications.
  • The model has already been pre-trained on a variety of useful NLP tasks, including coherence classification.
We have also made our custom model publicly available at the following Hugging Face repo.
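As a quick sanity check on the CPU-latency claim, the base checkpoint can be loaded and timed locally. A rough sketch (a single un-warmed run; real measurements should average over many inputs):

import time

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "tasksource/deberta-small-long-nli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()

# Encode an input-response pair, as in the Usage example above.
inputs = tokenizer("What is the capital of Antarctica?",
                   "but why not monkey up day", return_tensors="pt")

with torch.no_grad():
    start = time.perf_counter()
    logits = model(**inputs).logits
elapsed = time.perf_counter() - start
print(f"CPU latency: {elapsed:.3f}s, logits shape: {tuple(logits.shape)}")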

Existing Work

Key Takeaways

Developing a metric and scorer for coherence is challenging, primarily for the following reasons:
  1. Subjectivity: Coherence is inherently subjective, and what is coherent to one reader may not be to another.
  2. Complexity of Language: The complexity and variability of natural language make it difficult to develop universal coherence metrics.
  3. Lack of Gold Standard: The absence of a single "gold standard" for coherence evaluation makes it challenging to establish clear benchmarks for success.
  4. Metric Discrepancies: Different coherence measures can sometimes negatively correlate with each other, indicating that they may favor different aspects of coherence.
  5. Context Dependency: The coherence of a text can be highly dependent on context, which is challenging to capture fully in automated metrics.
  6. Deeper Semantic Understanding: Current models may struggle with the deeper semantic relationships and narrative structures that humans grasp easily.
Despite these challenges, we were able to build a small classification model capable of evaluating the coherence of a response for a given input. Its performance is comparable to a GPT-4o-based scorer while running locally and considerably faster.

Future Work

Appendix: Scorer Implementation