agentboard
llm-agent-eval-gpt-4-all
Runs (1 of 1 visualized): gpt_azure_gpt-4
summary (4 panels)
scienceworld (6 panels)
Table: runs.summary["scienceworld/metrics"]
Metric Name      Metric Value (%)
Progress Rate    0.7551
Success Rate     0.4
Figure: scienceworld/metrics_comparison
Title: Scienceworld Metrics Compared to Baseline Models
Models: Current Run, gpt-35-turbo, text-davinci-003, llama2-70b
Metrics: Progress Rate (%), Success Rate (%), Grounding Accuracy (%)
Figure: scienceworld/task_reward_w.r.t_steps
Title: Average Progress Rate (%) w.r.t Steps for scienceworld Tasks
Axes: steps (x), score (y)
Legend (Model Name, Is Baseline): Current Run, False; gpt-35-turbo-16k, True; gpt-35-turbo, True; codellama-34b, True; claude2, True; lemur-70b, True; llama2-70b, True; text-davinci-003, True
Figure: scienceworld/progress_score_w.r.t_difficulty
Title: Scienceworld Progress Rate w.r.t Difficulty
Models: gpt-35-turbo-16k, llama2-70b, codellama-34b, text-davinci-003, claude2, gpt-35-turbo, lemur-70b, Current Run
Series: Progress Rate For Easy Examples (%), Progress Rate For Hard Examples (%)
Figure: scienceworld/success_rate_w.r.t_difficulty
Title: Scienceworld Success Rate w.r.t Difficulty
Models: gpt-35-turbo-16k, codellama-34b, llama2-70b, lemur-70b, text-davinci-003, claude2, gpt-35-turbo, Current Run
Series: Success Rate For Easy Examples (%), Success Rate For Hard Examples (%)
Table: runs.summary["scienceworld/predictions"]
Columns: id, is_done, env.difficulty, env.goal, env.task_name, reward, grounding_accuracy, reward_wrt_step, trajectory
babyai (6 panels)