Using ChatGPT as an evaluation metric: Prompt selection and evaluation
Introduction
There are many statistical metrics to evaluate clustering, but they mainly measure the density of clusters in the embedding space.
To move from a quantitative evaluation to a qualitative one, one option is to put a human in the loop, but that is an expensive approach.
To avoid having a human check every clustered document, we can leverage the power of ChatGPT and have it simulate a human annotator.
Problem
The idea is to use ChatGPT to evaluate the output clusters.
Essentially, we would prompt ChatGPT to check whether the documents of each cluster really belong to it and to search for outliers. This gives us another metric besides the traditional ones (like the Silhouette score, which is far more quantitative) to evaluate our clustering and iterate on the problem until we find the best combination of algorithms to cluster our input documents.
But first, we need to check whether that prompt works as it should. So this is an evaluation of the intended ChatGPT evaluation metric on a sample of a dataset that I annotated manually.
Input data and parameters
- 5 clusters were selected and annotated; below is the ground-truth data, which lists the outliers of each cluster.
# Indices of the manually annotated outlier documents, one list per cluster
gt_intruders = [
    [],
    [5, 6, 15, 16, 17, 23, 26, 28, 32, 33, 39, 40, 43, 45, 51],
    [2, 3, 4, 7, 8, 10, 12, 13, 17, 19, 24, 26, 28, 31, 32, 33, 34, 35, 36, 37, 41, 42, 43, 44, 47],
    [1, 4, 8, 20, 21, 22, 23, 24, 25, 27, 30, 31, 32, 33, 34, 35, 38, 39, 40, 41],
    [7, 26],
]
- The prompt structure consists of two parts:
- Topic identification
- Outlier detection
- Parameters:
- Temperature
- LLM model
The task consists in finding the best combination of the two prompts (for topic identification and outlier detection) along with the optimal parameters (temperature and LLM model, in my case).
The objective is to maximize an F1 score, as in a classification problem.
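Concretely, for each cluster the outliers detected by ChatGPT can be compared with the ground-truth indices above to get precision, recall, and F1. Below is a minimal scoring sketch, assuming the model's answer has already been parsed into a list of document indices; the function name and the example prediction are illustrative, not taken from the actual runs.
def outlier_scores(gt, predicted):
    """Precision, recall and F1 for one cluster's outlier detection."""
    gt_set, pred_set = set(gt), set(predicted)
    true_positives = len(gt_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gt_set) if gt_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# e.g. scoring a hypothetical prediction against the second cluster's ground truth
precision, recall, f1 = outlier_scores(gt_intruders[1], [5, 6, 15, 16, 17, 23, 44])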
Experiments
I used W&B Sweeps to track my experiments and do prompt engineering.
Each experiment is a separate run with different prompts, a different temperature, and a different LLM model from OpenAI.
For each run, a recall, a precision, and an F1 score are calculated.
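As an illustration, such a sweep could be configured roughly as follows. This is only a sketch, not the exact configuration used here: the grid method, the project name, the list of prompt variants (beyond first_prompt_1 and second_prompt_4 discussed below), and the evaluate_clusters wrapper are all assumptions.
import wandb

# Illustrative sweep definition; values beyond the models and temperatures
# discussed in this report are assumptions.
sweep_config = {
    "method": "grid",
    "metric": {"name": "f1", "goal": "maximize"},
    "parameters": {
        "first_prompt": {"values": ["first_prompt_1", "first_prompt_2"]},
        "second_prompt": {"values": ["second_prompt_1", "second_prompt_4"]},
        "temperature": {"values": [0.0, 0.2, 0.4, 0.7]},
        "model": {"values": ["gpt-3.5-turbo", "gpt-3.5-turbo-16k"]},
    },
}

def evaluate_clusters():
    run = wandb.init()
    cfg = run.config
    # ...run the two-step prompt chain on each cluster with cfg.model, cfg.temperature,
    # cfg.first_prompt and cfg.second_prompt, then score against gt_intruders...
    precision, recall, f1 = 0.0, 0.0, 0.0  # placeholders for the computed scores
    wandb.log({"precision": precision, "recall": recall, "f1": f1})

sweep_id = wandb.sweep(sweep_config, project="chatgpt-cluster-eval")
wandb.agent(sweep_id, function=evaluate_clusters)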
Below is the parallel coordinates chart that tracks the different parameters and scores over my experiments.
I built different types of prompts, from zero-shot to few-shot, for the two parts of the chain (prompt chain step one: topic identification; step two: outlier detection) and combined them across different experiments.
The main takeaways are:
- Few-shot prompting performed better than one-shot or zero-shot prompting, but with limits: there is no need to go beyond 3 shots, as performance does not improve while the context size approaches ChatGPT's window limit.
- The model tends to give better results with low temperature values than with higher ones, as it follows the instructions better, but temperatures of 0.2 and 0.4 still gave better results than 0.
- I could not experiment with GPT-4, as this is my first billing month to reach $1 (the 66 experiments cost me $1.40). But gpt-3.5-turbo gave better results than gpt-3.5-turbo-16k.
- The combination that reached the best F1 score uses gpt-3.5-turbo prompted first with first_prompt_1, then with second_prompt_4 (both few-shot prompts), with a temperature of 0.2 for both calls; a sketch of this chain follows this list. See the parallel coordinates chart below.
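For reference, the two-step chain behind this best run could look roughly like the sketch below, written against the pre-1.0 openai Python client. The few-shot templates first_prompt_1 and second_prompt_4 are not reproduced here, so they are passed in as parameters, and the assumption that each template takes the cluster's documents through a {documents} placeholder (and the topic through {topic}) is mine.
import openai

def chat(prompt, model="gpt-3.5-turbo", temperature=0.2):
    """Single call to the ChatCompletion endpoint (openai<1.0 client)."""
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return response["choices"][0]["message"]["content"]

def evaluate_cluster(documents, topic_prompt, outlier_prompt):
    """Two-step chain: identify the cluster's topic, then list its outliers.

    topic_prompt / outlier_prompt stand for the few-shot templates
    (first_prompt_1 and second_prompt_4); their wording is not shown here.
    """
    numbered = "\n".join(f"{i}. {doc}" for i, doc in enumerate(documents))
    topic = chat(topic_prompt.format(documents=numbered))
    if topic.strip() == "Aucun":  # incoherent cluster: no topic, so no outlier search
        return topic, ""
    outliers = chat(outlier_prompt.format(topic=topic, documents=numbered))
    return topic, outliers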
The selected run
The data is a sample of news stories from France Télévisions, published between the 15th and 20th of May.
The best experiment achieves an F1 score of about 72%. Let's analyse this experiment in more depth. Below are the detailed results per cluster using the best combination of parameters:
gpt-3.5-turbo handled the first prompt perfectly, giving very precise topic titles for the 5 clusters.
The first cluster was placed among the data on purpose, to see how ChatGPT would handle a set of incoherent news stories from different topics. ChatGPT was prompted to output "Aucun" in that case, which it did. So for this cluster there are no outliers, as there is no topic, and the precision and recall scores are set to 0.
After clustering, I used the 7 most representative documents to identify each cluster. When annotating the 3rd and 4th clusters, I chose to name them "santé" (health in French) and "Tourisme" (tourism) respectively.
But these two clusters were strongly biased toward, respectively, unvaccinated medical staff (a topic highly discussed during that week of May) and the Ascension weekend (which happened to fall during that week). ChatGPT chose to identify that bias and detected more outliers than in my ground-truth annotation. This is actually more suitable for my use case, as I want to guide the clustering to be more precise and more helpful to journalists than just giving them general topics like tourism and health.
I find the recall score acceptable in my case.
There is still a problem with the precision on the football cluster, which consists of very short news stories like "match PSG-Rennes" that do not give enough context to work with. I am going to set this problem aside for now, finish the pipeline, and come back to it later.