
Evaluation Comparison Report - test

Comparing evaluations
Created on February 7 | Last edited on February 7

[Bar chart: per-run metric comparison, showing first 10 bars; y-axis 0.00-0.40. Runs compared: resplendent-dragon-181, twinkling-paper-175, red-horse-174, alight-noodles-173, lambent-festival-172, flashing-lamp-171, abundant-rat-170, crimson-noodles-169, fortuitous-paper-168, fortuitous-chrysanthemum-167.]
meta

All 10 runs invoke lm-evaluation-harness with the same arguments, differing only in --tasks. The export repeats every value twice (once per run set), so duplicates are collapsed here, and each run's git commit and runtime are folded into one table:

--model hf --model_args pretrained=microsoft/phi-2,trust_remote_code=True --tasks <varies> --device cuda:0 --batch_size 4 --output_path output/phi-2-mmlu-arc --limit 2 --wandb_args project=lm-eval-harness-integration --log_samples

Run                          | --tasks      | commit  | runtime
resplendent-dragon-181       | ai2_arc      | 9279b05 | 25s
twinkling-paper-175          | ai2_arc      | 89503de | 9s
red-horse-174                | ai2_arc      | 89503de | 12s
alight-noodles-173           | mmlu,ai2_arc | 89503de | 2h 52m 19s
lambent-festival-172         | arc_fr       | 89503de | 4s
flashing-lamp-171            | ai2_arc      | 06b22f1 | 5s
abundant-rat-170             | ai2_arc      | 06b22f1 | 4s
crimson-noodles-169          | ai2_arc      | a809450 | 5s
fortuitous-paper-168         | ai2_arc      | a809450 | 4s
fortuitous-chrysanthemum-167 | ai2_arc      | a809450 | 6s

All commits are from https://github.com/ayulockin/lm-evaluation-harness (full SHAs: 9279b05e0639dbc43b2fa1c3c35a68e2b08216b9, 89503de1916d2c807c75e23241f4b450e22ed671, 06b22f17a1b85b0f9d076b5cf5b75e452be0ba1c, a8094500ec842cc467bd18f74c546495651cabbc).
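The flattened meta arguments above reconstruct to a single command line; a minimal sketch, assuming the lm_eval CLI entry point from lm-evaluation-harness is installed and a CUDA device is available:

```shell
# Reconstructed from the logged argv; differs per run only in --tasks
# (ai2_arc, mmlu,ai2_arc, or arc_fr).
lm_eval \
  --model hf \
  --model_args pretrained=microsoft/phi-2,trust_remote_code=True \
  --tasks ai2_arc \
  --device cuda:0 \
  --batch_size 4 \
  --output_path output/phi-2-mmlu-arc \
  --limit 2 \
  --wandb_args project=lm-eval-harness-integration \
  --log_samples
```

Note that --limit 2 caps each task at two examples, so these runs are smoke tests of the W&B integration rather than full evaluations.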
config
task_configs

arc_challenge — identical for every run whose --tasks includes ai2_arc; the arc_fr run (lambent-festival-172) has no arc_challenge entry and shows "-". The export drops the row labels, so the field names below are inferred from the values against the lm-evaluation-harness task-config schema; two unlabeled fields that render only as "-" are omitted:

  metadata: version 1
  dataset_name: ARC-Challenge
  dataset_path: allenai/ai2_arc
  doc_to_choice: {{choices.text}}
  doc_to_decontamination_query: Question: {{question}} Answer:
  doc_to_target: {{choices.label.index(answerKey)}}
  doc_to_text: Question: {{question}} Answer:
  group: ["ai2_arc"]
  metric_list: [{"metric":"acc","aggregation":"mean","higher_is_better":true},{"metric":"acc_norm","aggregation":"mean","higher_is_better":true}]
  output_type: multiple_choice
  repeats: 1
  should_decontaminate: true
  task: arc_challenge
  test_split: test
  training_split: train
  validation_split: validation

arc_easy — same pattern; the export is truncated after the first few fields:

  metadata: version 1
  dataset_name: ARC-Easy
  dataset_path: allenai/ai2_arc
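The arc_challenge templates above (doc_to_text, doc_to_choice, doc_to_target) can be illustrated in plain Python on a made-up ARC-style record; the question and choices below are hypothetical, not taken from the dataset:

```python
# One ARC-style document as the allenai/ai2_arc schema shapes it:
# parallel "text" and "label" lists under "choices", plus an "answerKey".
doc = {
    "question": "Which gas do plants absorb from the atmosphere?",
    "choices": {
        "text": ["oxygen", "carbon dioxide", "nitrogen", "hydrogen"],
        "label": ["A", "B", "C", "D"],
    },
    "answerKey": "B",
}

# doc_to_choice: {{choices.text}} -> the candidate answer strings
choices = doc["choices"]["text"]

# doc_to_text: "Question: {{question}} Answer:" -> the scoring prompt
prompt = f"Question: {doc['question']} Answer:"

# doc_to_target: {{choices.label.index(answerKey)}} -> index of the gold answer
target = doc["choices"]["label"].index(doc["answerKey"])

print(target)           # -> 1
print(choices[target])  # -> carbon dioxide
```

With output_type multiple_choice, the harness scores each string in choices as a continuation of prompt and checks whether the highest-scoring one is choices[target]; acc and acc_norm are then mean-aggregated over documents per the metric_list above.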
Run set: 215