Last 5 Evaluations Analysis

Analysis of accuracy vs cost for the most recent wandbot evaluations

Created on June 10 | Last edited on June 10

Summary

Chart showing the accuracy vs cost trade-off for the 5 most recent wandbot evaluations.

Key Findings

- **Best performing**: wandbot_v1.3.3_test-v55-index with 91.02% accuracy at $6.03
- **Lowest cost**: intercom_eval_answers-1_trial at $0.38 (but only 11.22% accuracy)
- **Average accuracy**: 57.39% (mean of the five evaluations in the Data table)
- **Average cost**: $4.38
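
The headline figures can be recomputed directly from the five rows of the evaluation table. The snippet below is a minimal sketch of that arithmetic, not the code used to produce the report; the `EVALS` list and the `summarize` helper are hypothetical names introduced here for illustration, with values copied by hand from the table.

```python
from statistics import mean

# Rows copied by hand from the evaluation table: (name, accuracy %, cost USD).
EVALS = [
    ("wandbot_v1.3.3_test-v55-index", 91.02, 6.03),
    ("v1.3.2 PROD", 90.41, 6.02),
    ("wandbot_v1-3-2_o4-mini", 85.31, 7.61),
    ("intercom_eval_answers-1_trial", 11.22, 0.38),
    ("intercom_eval_answers-5_trial", 8.98, 1.88),
]


def summarize(evals):
    """Return the best run, the cheapest run, and the unweighted means."""
    best = max(evals, key=lambda row: row[1])      # highest accuracy
    cheapest = min(evals, key=lambda row: row[2])  # lowest cost
    return {
        "best": best[0],
        "cheapest": cheapest[0],
        "mean_accuracy": round(mean(row[1] for row in evals), 2),
        "mean_cost": round(mean(row[2] for row in evals), 2),
    }


summary = summarize(EVALS)
print(summary)
```

For these five rows the unweighted means come out to 57.39% accuracy and $4.38 cost.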

Data

| Evaluation | Date | Accuracy | Cost |
|------------|------|----------|------|
| wandbot_v1.3.3_test-v55-index | 2025-06-10 | 91.02% | $6.03 |
| v1.3.2 PROD | 2025-05-19 | 90.41% | $6.02 |
| wandbot_v1-3-2_o4-mini | 2025-04-17 | 85.31% | $7.61 |
| intercom_eval_answers-1_trial | 2025-05-20 | 11.22% | $0.38 |
| intercom_eval_answers-5_trial | 2025-05-20 | 8.98% | $1.88 |

Observations

The main wandbot runs (v1.3.3 test and v1.3.2 PROD) perform consistently, at just over 90% accuracy for about $6 each. The o4-mini run is both less accurate (85.31%) and more expensive ($7.61). The two intercom eval trials score far lower (11.22% and 8.98%), suggesting they may be testing different configurations or datasets.
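
One way to make the accuracy vs cost trade-off concrete is to check which runs are Pareto-optimal, meaning no other run is both at least as accurate and at least as cheap. This is a sketch of that check over the table values, offered as an illustration rather than as part of the original analysis; `Run`, `dominates`, and `pareto_front` are hypothetical names.

```python
from typing import NamedTuple


class Run(NamedTuple):
    name: str
    accuracy: float  # percent
    cost: float      # USD


# Values copied by hand from the evaluation table.
RUNS = [
    Run("wandbot_v1.3.3_test-v55-index", 91.02, 6.03),
    Run("v1.3.2 PROD", 90.41, 6.02),
    Run("wandbot_v1-3-2_o4-mini", 85.31, 7.61),
    Run("intercom_eval_answers-1_trial", 11.22, 0.38),
    Run("intercom_eval_answers-5_trial", 8.98, 1.88),
]


def dominates(a: Run, b: Run) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.cost <= b.cost
    strictly_better = a.accuracy > b.accuracy or a.cost < b.cost
    return no_worse and strictly_better


def pareto_front(runs: list[Run]) -> list[Run]:
    """Keep only the runs that no other run dominates."""
    return [r for r in runs if not any(dominates(o, r) for o in runs)]


front = [r.name for r in pareto_front(RUNS)]
print(front)
```

With these values the front contains wandbot_v1.3.3_test-v55-index, v1.3.2 PROD, and intercom_eval_answers-1_trial. The o4-mini run is dominated by v1.3.3 and the 5-trial intercom run by the 1-trial one. A run being on the front only means nothing beats it on both axes at once; it says nothing about whether 11.22% accuracy is acceptable.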