Wandbot Evaluation
We followed the process below to evaluate Wandbot's performance:
- Generate a large dataset of simulated user questions, one for each code snippet in the W&B docs. Each snippet serves as the "ideal response", so we have both a question and an answer.
- Use the existing Wandbot code and algorithm to generate a response for each question.
- Automate evaluation with fuzzy matching and a GPT-3.5 model-based eval (a minimal sketch follows this list).
- Have our MLEs label a subset of questions and model answers. For 200+ questions, we scored the question (correct or not), the ideal answer (correct or not), and the model-generated answer (correct or not).
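As a rough illustration of the automated step, the sketch below combines a token-based fuzzy match with a GPT-3.5 grader called through the OpenAI chat API. The use of rapidfuzz, the prompt, and the function names are our assumptions for this sketch, not the actual Wandbot evaluation code.

```python
"""Sketch of the automated evaluation step (fuzzy match + GPT-3.5 grading).

Prompt and function names are illustrative, not the Wandbot implementation.
"""
from openai import OpenAI
from rapidfuzz import fuzz

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

GRADER_PROMPT = (
    "You are grading a documentation chatbot. Given a question, the ideal "
    "answer from the W&B docs, and the chatbot's answer, reply with a single "
    "word: CORRECT or INCORRECT."
)


def fuzzy_score(model_answer: str, ideal_answer: str) -> float:
    """Token-based fuzzy similarity between the two answers, in [0, 100]."""
    return fuzz.token_sort_ratio(model_answer, ideal_answer)


def gpt_grade(question: str, ideal_answer: str, model_answer: str) -> str:
    """Ask GPT-3.5 to judge the model answer against the ideal answer."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # deterministic verdicts make reruns more comparable
        messages=[
            {"role": "system", "content": GRADER_PROMPT},
            {
                "role": "user",
                "content": (
                    f"Question:\n{question}\n\n"
                    f"Ideal answer:\n{ideal_answer}\n\n"
                    f"Chatbot answer:\n{model_answer}"
                ),
            },
        ],
    )
    return response.choices[0].message.content.strip().upper()


def evaluate(question: str, ideal_answer: str, model_answer: str) -> dict:
    """Combine both automated signals into one record per question."""
    return {
        "question": question,
        "fuzzy_score": fuzzy_score(model_answer, ideal_answer),
        "gpt_verdict": gpt_grade(question, ideal_answer, model_answer),
    }
```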
Comparing Model-Based and Human Evaluation
The confusion matrices below compare the predicted score (from the GPT model) with the actual score (from a W&B MLE). The first shows results on the entire labeled dataset, the second narrows down to questions marked as correct by human labelers, and the third narrows down to examples where the human labeler also marked the "ideal answer" as correct.
[Confusion matrix panels — run: dauntless-frost-41]
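A minimal sketch of how these three panels could be produced and logged back to W&B is shown below, assuming the human and GPT labels have been exported to a flat table. The file name and column names (human_score, gpt_score, question_correct, ideal_correct) are hypothetical.

```python
"""Sketch: comparing GPT-based scores with human labels as confusion matrices."""
import pandas as pd
import wandb

# hypothetical export of the labeled results (0 = incorrect, 1 = correct)
results = pd.read_csv("labeled_results.csv")


def log_cm(df: pd.DataFrame, name: str) -> None:
    """Log a predicted-vs-actual confusion matrix panel to W&B."""
    wandb.log({
        name: wandb.plot.confusion_matrix(
            y_true=df["human_score"].tolist(),  # actual: W&B MLE label
            preds=df["gpt_score"].tolist(),     # predicted: GPT grader label
            class_names=["incorrect", "correct"],
        )
    })


with wandb.init(project="wandbot-eval", job_type="analysis"):
    # 1. entire labeled dataset
    log_cm(results, "cm_all")
    # 2. only questions the human labeler marked as correct
    log_cm(results[results["question_correct"] == 1], "cm_valid_questions")
    # 3. additionally require the "ideal answer" to be marked correct
    log_cm(
        results[(results["question_correct"] == 1) & (results["ideal_correct"] == 1)],
        "cm_valid_ideal_answers",
    )
```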
Error Analysis
The following table is filtered to examples with the most common error type: the model predicting the answer as incorrect while the human labeler assesses it as correct. Reviewing these, it appears that in many cases there are multiple ways of achieving an objective, and the one presented in the original ("ideal") answer may not be the only or the best one. Building a better dataset of ideal answers (potentially model-based answers assessed by human labelers as correct) might mitigate this issue.
[Error analysis table — run: dauntless-frost-41]
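The filter behind such a table could look like the sketch below, again assuming the same hypothetical labeled-results export and column names as above.

```python
"""Sketch: pulling out the most common disagreement for manual review."""
import pandas as pd
import wandb

results = pd.read_csv("labeled_results.csv")  # same hypothetical export as above

# most common error type: GPT grader says "incorrect", human says "correct"
disagreements = results[(results["gpt_score"] == 0) & (results["human_score"] == 1)]

with wandb.init(project="wandbot-eval", job_type="error-analysis"):
    wandb.log({
        "gpt_incorrect_human_correct": wandb.Table(
            dataframe=disagreements[["question", "ideal_answer", "model_answer"]]
        )
    })
```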