Top 3 Evaluations Analysis Report
Comprehensive analysis of the top-performing evaluations in the mcp-tests project, comparing correctness scores, costs, and performance metrics
Created on May 22|Last edited on May 22
*wandb-applied-ai-team/mcp-tests*
Executive Summary
This report analyzes the top 3 performing evaluations from the mcp-tests project, focusing on correctness scores, cost efficiency, and operational metrics. The analysis quantifies the accuracy, latency, and cost trade-offs between the three configurations and identifies where each is the better deployment choice.
Evaluation Overview
1. **wandbot_gpt-4o-2024-11-20** ⭐ *Best Overall Performance*
- **Correctness Score:** 76.5% (75/98 correct answers)
- **Total Samples:** 98
- **Success Rate:** 100%
- **Average Latency:** 94.1ms
- **Total Cost:** $5.39
- **Date:** March 12, 2025
2. **wandbot_less_query_enhancement** 🚀 *Best Latency*
- **Correctness Score:** 68.4% (67/98 correct answers)
- **Total Samples:** 98
- **Success Rate:** 99.0%
- **Average Latency:** 37.9ms (~60% faster than GPT-4o)
- **Total Cost:** $5.30
- **Date:** March 12, 2025
3. **wandbot-eval** 📊 *Largest Scale*
- **Correctness Score:** 68.4% (134/196 correct answers)
- **Total Samples:** 196 (2x larger dataset)
- **Success Rate:** 98.0%
- **Average Latency:** 372.1ms
- **Total Cost:** $10.64
- **Date:** February 27, 2025
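The headline figures above follow directly from the reported raw counts. A minimal Python sketch (dictionary keys and structure are illustrative, not from the W&B export) reproduces the correctness percentages and the per-sample cost for each run:

```python
# Raw counts and cost totals as reported for each evaluation above.
evals = {
    "wandbot_gpt-4o-2024-11-20": {"correct": 75, "samples": 98, "cost_usd": 5.39},
    "wandbot_less_query_enhancement": {"correct": 67, "samples": 98, "cost_usd": 5.30},
    "wandbot-eval": {"correct": 134, "samples": 196, "cost_usd": 10.64},
}

for name, e in evals.items():
    correctness = 100 * e["correct"] / e["samples"]  # percent correct
    per_sample = e["cost_usd"] / e["samples"]        # USD per query
    print(f"{name}: {correctness:.1f}% correct, ${per_sample:.4f}/sample")
```

Note that while wandbot-eval's total cost is roughly double the others, its per-sample cost (~$0.054/query) is in line with the two smaller runs.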
Key Performance Metrics
Correctness Analysis
The GPT-4o configuration achieved the highest correctness score at **76.5%**, representing an **8.1 percentage point improvement** over the other configurations. This translates to approximately **8 additional correct answers** out of every 100 queries.
Cost Efficiency
- **Most Cost-Effective:** Less Query Enhancement at $5.30
- **Best Value:** GPT-4o at $5.39 (only $0.09 more for 8.1 percentage points higher correctness)
- **Highest Cost:** Standard evaluation at $10.64 (roughly 2x the total cost, but a comparable per-sample cost of ~$0.054/query)
Latency Performance
- **Fastest:** Less Query Enhancement at 37.9ms average
- **Production-Ready:** GPT-4o at 94.1ms average
- **Slowest:** Standard evaluation at 372.1ms average
Cost vs Performance Analysis
The analysis reveals a clear **efficiency frontier** where GPT-4o provides the optimal balance:
- **8.1 percentage points higher correctness** than the alternatives
- **Only 1.7% higher cost** than the cheapest option
- **Acceptable latency** for most production use cases
API Reliability Insights
All evaluations demonstrated robust operational characteristics:
- **Query Enhancer LLM API:** 99-100% success rate across all evaluations
- **Chat API:** 98-100% success rate with minimal error rates
- **Embedding/Reranker APIs:** 100% success rate consistently
- **Error Handling:** Comprehensive error tracking and recovery
Token Usage Analysis
GPT-4o Configuration
- **Prompt Tokens:** 517,225
- **Completion Tokens:** 7,255
- **Total Requests:** 98
- **Average Tokens per Request:** 5,352
Less Query Enhancement
- **Prompt Tokens:** 508,141 (1.8% reduction)
- **Completion Tokens:** 7,335 (1.1% increase)
- **Total Requests:** 98
- **Average Tokens per Request:** 5,260
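The per-request averages follow from the reported token totals. A quick check (evaluation names shortened for readability; the dictionary layout is illustrative):

```python
# Token totals as reported; average tokens/request = (prompt + completion) / requests.
token_usage = {
    "gpt-4o": {"prompt": 517_225, "completion": 7_255, "requests": 98},
    "less_qe": {"prompt": 508_141, "completion": 7_335, "requests": 98},
}

for name, t in token_usage.items():
    avg = (t["prompt"] + t["completion"]) / t["requests"]
    print(f"{name}: {avg:.0f} avg tokens/request")
```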
Recommendations
For Production Deployment
1. **Primary Choice:** GPT-4o configuration
- Best accuracy-cost trade-off
- Reliable sub-100ms latency
- Proven 100% success rate
For Cost-Sensitive Applications
2. **Alternative:** Less Query Enhancement
- 60% latency reduction
- Minimal cost increase
- Acceptable 8.1-percentage-point correctness trade-off
For Large-Scale Evaluations
3. **Considerations:** Batch processing approach
- Standard evaluation shows linear cost scaling
- Consider parallel processing for time efficiency
- Monitor per-sample cost metrics
Technical Implementation Notes
Model Configuration Differences
- **GPT-4o:** Enhanced query processing with full feature set
- **Less QE:** Reduced query enhancement for speed optimization
- **Standard:** Full-featured baseline with larger evaluation dataset
Success Rate Factors
- Robust error handling across all configurations
- Consistent API reliability (>98% success)
- Effective fallback mechanisms for failed requests
Future Optimization Opportunities
1. **Hybrid Approach:** Use Less QE for time-sensitive queries, GPT-4o for accuracy-critical tasks
2. **Batch Optimization:** Implement request batching to reduce per-query overhead
3. **Caching Strategy:** Cache common query patterns to reduce API costs
4. **A/B Testing:** Continuous evaluation of new model configurations
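The hybrid approach in point 1 could be sketched as a simple router. Everything below is hypothetical: the function name, signature, and threshold are illustrative and not part of the evaluated systems; only the average latencies come from the runs above.

```python
# Observed average latencies from the evaluations above (milliseconds).
GPT4O_AVG_MS = 94.1
LESS_QE_AVG_MS = 37.9

def choose_config(latency_budget_ms: float, accuracy_critical: bool) -> str:
    """Route a request to an evaluation configuration (hypothetical router).

    Accuracy-critical work always goes to the GPT-4o configuration;
    otherwise, fall back to the faster Less QE variant only when the
    latency budget is too tight for GPT-4o's observed average.
    """
    if accuracy_critical:
        return "wandbot_gpt-4o-2024-11-20"
    if latency_budget_ms < GPT4O_AVG_MS:
        return "wandbot_less_query_enhancement"
    return "wandbot_gpt-4o-2024-11-20"
```

In practice the threshold would be tuned against tail latencies rather than averages, but the routing logic stays the same.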
Conclusion
The GPT-4o configuration emerges as the clear winner, providing the best overall value proposition with **76.5% correctness** at a competitive cost of **$5.39**. For applications requiring ultra-low latency, the Less Query Enhancement variant offers a viable alternative with acceptable performance trade-offs.
These insights enable data-driven decisions for model deployment strategies and highlight the importance of comprehensive evaluation frameworks in LLM application development.
---
*Report generated on May 22, 2025 | Data source: W&B Weave Evaluations*