Project: ayush-thakur/weave-mixeval
Evaluations
Score columns are MixEvalScorer outputs for AGIEval, ARC, and BBH.

| Trace | inputs.model | inputs.self | AGIEval true_count | AGIEval true_fraction | ARC true_count | ARC true_fraction | BBH true_count |
|---|---|---|---|---|---|---|---|
| llama-3.1-70B fe4c | Llama3p170bInstruct:v0 | Evaluation:v35 | 51 | 0.6145 | 3 | 0.75 | 11 |
| gpt-4o-mini 4f4d | GPT_4o_Mini:v0 | Evaluation:v34 | 44 | 0.5301 | 1 | 0.25 | 11 |
| Mistral Large 2 4e61 | Mistral_Large_2:v0 | Evaluation:v33 | 57 | 0.6747 | 2 | 0.5 | 11 |
| Claude 3.5 Sonnet 3d35 | Claude_3_5_Sonnet:v7 | Evaluation:v32 | 62 | 0.747 | 2 | 0.5 | 11 |
| llama-3.1-405B 5e0e | Llama405B_instruct:v4 | Evaluation:v24 | 54 | 0.653 | 2 | 0.5 | 11 |
| gpt-4o e269 | GPT_4o:v38 | Evaluation:v22 | 44 | 0.5843 | 2 | 0.5 | 8 |
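
The `true_count` and `true_fraction` columns read like aggregates of per-example boolean correctness flags. Below is a minimal sketch of such an aggregation, assuming each example yields True or False and that the fraction is the count divided by the number of scored examples. The function name `summarize_bools` and the sample flags are hypothetical and are not taken from this project's code.

```python
from typing import Iterable, Optional


def summarize_bools(flags: Iterable[Optional[bool]]) -> dict:
    """Aggregate per-example boolean scores into a count and a fraction.

    Assumption of this sketch: None entries (examples without a score)
    are skipped rather than counted as incorrect.
    """
    scored = [f for f in flags if f is not None]
    true_count = sum(1 for f in scored if f)
    # Guard against an empty split to avoid dividing by zero.
    true_fraction = true_count / len(scored) if scored else 0.0
    return {"true_count": true_count, "true_fraction": true_fraction}


# Hypothetical per-example results for a four-question split.
example_flags = [True, True, False, True]
print(summarize_bools(example_flags))
# → {'true_count': 3, 'true_fraction': 0.75}
```

Under this assumption, a model that answers 3 of 4 questions correctly gets a `true_count` of 3 and a `true_fraction` of 0.75.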