
Using Wandb to Launch AgentBoard

This article shows how AgentBoard, integrated with Wandb visualization, aids analysis of LLM agents.




Enabling Wandb for AgentBoard

AgentBoard currently integrates Wandb support. You can simply turn on the --wandb switch in the evaluation arguments and name your Wandb project with --project_name. Note that if you want to add or remove baselines to compare against during a run, you just need to edit the /data/baseline_results directory. The directory contains log files of models in the same format as the log files in your log_path.
python agentboard/eval_main.py --cfg eval_configs/main_results_all_tasks.yaml \
--tasks alfworld \
--model lemur-70b \
--log_path results/lemur-70b \
--wandb \
--project_name evaluate-lemur-70b \
--baseline_dir data/baseline_results
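For instance, a small helper like the one below could register a finished run as a baseline by copying its log files into data/baseline_results. This is only a sketch: the helper name and the assumption that baselines are organized as one subdirectory per model are illustrative, not part of AgentBoard's CLI.

# register_baseline is a hypothetical helper, not part of AgentBoard.
# It copies a run's log files (same format as those under --log_path)
# into data/baseline_results so later runs pick the model up as a baseline.
import shutil
from pathlib import Path

def register_baseline(log_path: str, baseline_dir: str = "data/baseline_results") -> None:
    src = Path(log_path)
    dst = Path(baseline_dir) / src.name        # e.g. data/baseline_results/lemur-70b
    for f in src.rglob("*"):
        if f.is_file():
            target = dst / f.relative_to(src)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(f, target)

# Example: make the lemur-70b run above available as a baseline.
register_baseline("results/lemur-70b")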

Summary Panel: Which Agent is Good and What it is Good at


LLM agents are generalists, so it is crucial to analyze their performance under a diverse set of tasks. In our benchmark, we test LLM agents on 9 diverse tasks across 4 types of scenarios (web, tool-using, embodied, and game). In the summary panel, we first display the performances of agents: All Results displays the performance of the currently running model compared with baselines on all tasks. Agent Abilities displays the performance of models in terms of 6 dimensions of agentic abilities: memory, planning, world modeling, retrospection, grounding, and spatial navigation (please see the paper for a detailed introduction to each ability dimension).
Below is a summary of the Text-Davinci-003 model compared to GPT-3.5-Turbo and GPT-4. Text-Davinci-003 is better at tool-using and web tasks, though worse than GPT-3.5-Turbo on embodied and game tasks. GPT-4 outperforms all models by a large margin. In terms of abilities, GPT-4 and GPT-3.5-Turbo are more well-rounded and balanced across all dimensions, while Text-Davinci-003 is worse at world modeling and spatial navigation.
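As a rough illustration of how such an ability summary could be produced, the snippet below logs per-ability scores to Wandb as a bar chart. The project name, dimension scores, and chart key are placeholders; AgentBoard's own logging code may aggregate and plot these values differently.

# Illustrative only: the ability scores below are placeholder numbers,
# not real benchmark results.
import wandb

run = wandb.init(project="evaluate-lemur-70b")

ability_scores = {
    "memory": 0.40, "planning": 0.35, "world modeling": 0.30,
    "retrospection": 0.45, "grounding": 0.55, "spatial navigation": 0.25,
}

table = wandb.Table(
    data=[[name, score] for name, score in ability_scores.items()],
    columns=["ability", "score"],
)
run.log({"Agent Abilities": wandb.plot.bar(table, "ability", "score",
                                           title="Agent Abilities")})
run.finish()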




Panel for Each Task: Metrics + Analysis + Logging

The panel for each task consists of three subparts:
1. Metrics Panel: The metrics panel displays the test results in terms of three metrics: grounding accuracy, progress rate (reward score), and success rate. Success rate is the commonly used metric for measuring an agent's ability to finish a task. Progress rate is our featured metric: it awards partial credit for subgoals accomplished. This is particularly suitable for evaluating LLM agents, as LLMs tend to solve a problem step by step (Wei, Jason, et al.). Grounding accuracy measures a model's ability to generate valid actions. A rough sketch of how these metrics can be computed follows this list.
2. Analysis Panel: The analysis panel displays plots that help answer two questions: "What is the performance of my current model compared to baselines?" and "Why is the performance not satisfactory? What is the weakness of my current model on this task?" To address these questions, we compare metrics against baselines, show how reward changes with respect to steps, and break down metrics by task difficulty. Metrics Summary compares the metrics (success rate, reward score, grounding accuracy) of the current run with the baseline metrics. Progress Rate w.r.t Steps showcases how models perform under long-range interactions; our previous observation is that open-sourced models often fail to gain information after 6 steps and are thus limited in performing complex tasks. Metrics w.r.t Difficulty studies how models respond to different task difficulties, where the nature of the task is largely the same but examples are divided into Easy and Hard based on the number of subgoals.
3. Logging Panel: Forget all fancy visualizations. The best way to understand the behavior of an agent is to read its trajectory!
[Task/Predictions] We log details of examples in each task, as well as their metrics and trajectory run by models to a Wandb table. You can browse through different examples in the table.
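Below is a minimal sketch of how the three metrics above can be computed from an episode record. The Episode fields are assumptions made for illustration and do not mirror AgentBoard's internal data structures.

# A minimal sketch of the three metrics; the Episode structure is assumed,
# not AgentBoard's actual implementation.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Episode:
    subgoals_total: int                  # number of annotated subgoals
    subgoals_reached: int                # subgoals the agent accomplished
    actions_valid: List[bool] = field(default_factory=list)
    success: bool = False                # whether the task was finished

def progress_rate(ep: Episode) -> float:
    """Partial credit: fraction of subgoals accomplished."""
    return ep.subgoals_reached / max(ep.subgoals_total, 1)

def grounding_accuracy(ep: Episode) -> float:
    """Fraction of emitted actions the environment accepted as valid."""
    return sum(ep.actions_valid) / max(len(ep.actions_valid), 1)

def success_rate(episodes: List[Episode]) -> float:
    """Fraction of episodes finished successfully."""
    return sum(e.success for e in episodes) / max(len(episodes), 1)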



Detailed Understanding Through Logging and Agent Trajectory

In this section, we provide a tutorial on how to understand and debug runs with [Task/Predictions] tables.
The table contains three parts: (1) metrics of each example, including is_done, reward, and grounding accuracy. A plot for reward change is also given to help visualize when the agent makes progress in solving the problem. (2) environment details, including the name of the example, difficulty, goal, and id. (3) trajectory of the agent, including the action, observation, and reward score at each step. Note that the reward score is not visible to the agent, but the observation is given to the agent when prompting the next action. You can browse through the table to check different examples and enlarge each part for a detailed view.
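For reference, a [Task/Predictions]-style table can be assembled with wandb.Table roughly as follows. The column names mirror the description above, but the exact schema and the sample row are assumptions rather than AgentBoard's actual logging code.

# Rough sketch of logging a predictions table; column names and the sample
# row are illustrative, not AgentBoard's exact schema.
import json
import wandb

run = wandb.init(project="evaluate-lemur-70b")

columns = ["id", "example", "difficulty", "goal",
           "is_done", "reward", "grounding_acc", "trajectory"]
table = wandb.Table(columns=columns)

trajectory = [
    {"step": 1, "action": "go to desk 1", "observation": "You see a bowl.", "reward": 0.2},
    {"step": 2, "action": "take bowl 1", "observation": "You pick up the bowl.", "reward": 0.4},
]
table.add_data(0, "pick_and_place_example", "easy",
               "look at bowl under the desklamp",
               False, 0.4, 1.0, json.dumps(trajectory, indent=2))

run.log({"Task/Predictions": table})
run.finish()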
A bonus of using Wandb tables is that you can sort metric columns and discover certain patterns. For example, below is the log of GPT-3.5-Turbo for Scienceworld. After sorting is_done in descending order, we can see that only easy problems were solved, and most of them belong to the lifespan-related task in Scienceworld. Therefore, as a user, it would be reasonable to check the model's performance on other, more difficult tasks.



Reading the trajectory itself is also necessary for understanding agent behavior. For example, below is the log of GPT-3.5-Turbo for Alfworld. The first problem is not finished, so we can check its trajectory. We can see that the model has no problem finding the two objects, the bowl and the desklamp, but struggles to perform the further actions (open the lamp and look at the bowl under the lamp). Instead, it keeps searching around even though all the needed objects have been found. Therefore, it is possible that the model lacks a clear understanding of its current status, due to problems with memory.






Run Results for GPT-4, GPT-3.5-Turbo, DeepSeek-67b, Llama2-13b

To help AgentBoard users debug their agents and gain a better understanding of the tasks, we provide the Wandb panels for several models here.