Track and Tune Your Reinforcement Learning Models With Weights & Biases
In this article, we learn how to use various tools from Weights & Biases for the GridWorld reinforcement learning task, and we also show how to integrate the OpenAI Gym environment with W&B.
This article demonstrates how Weights & Biases can be used for the GridWorld reinforcement learning task, and shows how to integrate W&B with OpenAI Gym Environment, which makes testing reinforcement learning algorithms straightforward.
Table of Contents
- GridWorld
  - Environment
  - Agent
  - Training
  - Testing
- Finding Optimal Hyperparameters Using Sweeps
- OpenAI Gym Integration
  - Environment
  - Agent
  - Training
- Conclusion
GridWorld

GridWorld Environment Setup for this experiment
Run in Colab →
Environment
The GridWorld setup is simple. The environment is a grid containing walls, gold, and a bomb. The agent starts from a legal cell in the grid, and its objective is to find the gold. However, the agent has to be careful not to land on the bomb, since that carries a negative reward.
The agent can move around the grid with up, down, left, and right actions. In this experiment, the agent can start from any column in the 9th row. The reinforcement learning task is for the agent to learn a policy that navigates it to the gold while avoiding the bomb.
An episode is the sequence of states and actions taken by the agent from the START state until it reaches a TERMINAL state. The cells containing the bomb and the gold are the terminal states. The walls are marked in black; the agent cannot pass through them and has to go around them.
The agent receives a reward of -1 for every step it takes and for bumping into a wall. Reaching the TERMINAL cell containing the gold returns a reward of +10, whereas the bomb cell returns a reward of -10. The goal is to find the gold in the minimum number of steps.
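To make the setup concrete, below is a minimal sketch of such an environment, assuming the reward scheme above. The grid layout, class name, and method names are illustrative, not the exact code from the Colab notebook.

```python
import numpy as np

class GridWorld:
    """Minimal grid environment: -1 per step, +10 for the gold, -10 for the bomb."""

    def __init__(self, height=10, width=10):
        self.height, self.width = height, width
        self.gold = (0, 3)                      # illustrative terminal cell, reward +10
        self.bomb = (1, 3)                      # illustrative terminal cell, reward -10
        self.walls = {(2, 2), (2, 3), (2, 4)}   # impassable cells (drawn in black)
        self.actions = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}
        self.reset()

    def reset(self):
        # The agent starts from a random column in the last row.
        self.state = (self.height - 1, np.random.randint(self.width))
        return self.state

    def step(self, action):
        dr, dc = self.actions[action]
        r, c = self.state
        nr, nc = r + dr, c + dc
        # Bumping into a wall or the grid boundary leaves the agent in place.
        if not (0 <= nr < self.height and 0 <= nc < self.width) or (nr, nc) in self.walls:
            nr, nc = r, c
        self.state = (nr, nc)
        if self.state == self.gold:
            return self.state, 10, True     # reached the gold
        if self.state == self.bomb:
            return self.state, -10, True    # stepped on the bomb
        return self.state, -1, False        # ordinary step cost
```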
Agent
We deploy a simple Q-learning agent for this task. The agent updates its Q-table using the Bellman equation: the estimated value of taking an action from a particular state is updated by propagating back the reward received after taking that step. To address the exploration vs. exploitation trade-off during training, we implement a simple strategy of exploring with probability ε at every step. Thus, during training the agent selects a random action with probability ε. During testing, it greedily selects the best action, that is, the action leading to the state with the highest Q-value.
With reinforcement learning, it is always difficult to decide on the optimal set of training parameters. In the upcoming section, we shall see how we can use Weights & Biases Sweeps to find the optimal set of parameters with just a few extra lines of code. For now, we assume some default values for the parameters listed below:
- Learning rate α
- Probability of choosing a random action, ε
- Discount factor γ
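For concreteness, here is a minimal sketch of an ε-greedy Q-learning agent matching the description above. It is illustrative rather than the exact notebook code; the class and method names are assumptions, and no specific default hyperparameter values are implied.

```python
import numpy as np

class QAgent:
    """Tabular Q-learning agent with epsilon-greedy exploration."""

    def __init__(self, env, alpha, gamma, epsilon):
        self.env = env
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.actions = list(env.actions.keys())
        # Q-table: one entry per (grid cell, action) pair, initialized to zero.
        self.q = {(r, c): {a: 0.0 for a in self.actions}
                  for r in range(env.height) for c in range(env.width)}

    def choose_action(self, state, greedy=False):
        # Explore with probability epsilon during training; act greedily otherwise.
        if not greedy and np.random.rand() < self.epsilon:
            return np.random.choice(self.actions)
        values = self.q[state]
        return max(values, key=values.get)

    def update(self, state, action, reward, next_state):
        # Temporal-difference (Bellman) update of the Q-table.
        best_next = max(self.q[next_state].values())
        td_target = reward + self.gamma * best_next
        self.q[state][action] += self.alpha * (td_target - self.q[state][action])
```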
Training
For training the Q-Learning agent, we run 500 episodes of GridWorld with the agent starting from a random position in the last row.
To track the agent's progress, we log the cumulative reward it collects in each episode using the following command.
wandb.log({'reward': cumulative_reward, 'episode': trial})
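For context, a condensed version of the training loop around that call might look like the sketch below. It assumes the hypothetical GridWorld and QAgent sketches above, and the hyperparameter values shown are placeholders, not the report's actual defaults.

```python
import wandb

# Placeholder hyperparameters; the report's actual defaults are not shown here.
wandb.init(project="rl-example", config={"alpha": 0.1, "gamma": 0.9, "epsilon": 0.05})
cfg = wandb.config

env = GridWorld()
agent = QAgent(env, alpha=cfg.alpha, gamma=cfg.gamma, epsilon=cfg.epsilon)

for trial in range(500):
    state = env.reset()
    cumulative_reward = 0
    for _ in range(1000):  # step cap keeps episodes bounded
        action = agent.choose_action(state)
        next_state, reward, done = env.step(action)
        agent.update(state, action, reward, next_state)
        cumulative_reward += reward
        state = next_state
        if done:
            break
    # One point per episode on the reward curve.
    wandb.log({'reward': cumulative_reward, 'episode': trial})
```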
This allows us to effectively visualize the agent’s progress in real-time as shown below.
[W&B panel: cumulative reward per episode for a single training run]
Testing
In the testing phase, the agent chooses the best action greedily to find the gold in the minimum number of steps. We run 1000 test episodes and average the cumulative reward across them. This mean cumulative reward is the objective that we want to maximize.
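A minimal sketch of such an evaluation loop, reusing the hypothetical agent and environment from the earlier sketches; the metric name mean_reward_test matches the one used in the sweep configuration below.

```python
import numpy as np
import wandb

test_rewards = []
for _ in range(1000):
    state = env.reset()
    cumulative_reward = 0
    for _ in range(1000):  # step cap
        action = agent.choose_action(state, greedy=True)  # always exploit at test time
        state, reward, done = env.step(action)
        cumulative_reward += reward
        if done:
            break
    test_rewards.append(cumulative_reward)

# The sweep's optimization objective: mean cumulative reward over the test episodes.
wandb.log({'mean_reward_test': float(np.mean(test_rewards))})
```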
Finding Optimal Hyperparameters Using Sweeps
Reinforcement learning algorithms such as Q-learning are highly sensitive to the chosen parameter values. A good set of parameters helps the agent converge to a good policy and prevents divergence. In this section, we shall see how we can use Weights & Biases Sweeps to gain insights into how the parameter values impact the objective function.
To initialize a Sweep, we need to write a configuration file that tells the sweep agent which parameters to search over and which objective function to optimize them against. The sweep configuration file for our GridWorld task is shown below.
%%writefile sweep.yaml
project: "rl-example"
program: main.py
method: bayes
metric:
  name: mean_reward_test
  goal: maximize
parameters:
  alpha:
    min: 0.1
    max: 1
  gamma:
    min: 0.1
    max: 1
  epsilon:
    min: 0.01
    max: 0.1
The project variable specifies the W&B project in which the results of the Sweep are logged. The program variable specifies the file that will be run repeatedly with the parameters listed in the parameters section. We choose the Bayesian optimization method, which models the objective as a function of the parameters and then uses a Gaussian process to select parameter values with a high probability of improvement. For each parameter we specify a range of plausible values, and the sweep searches this space to maximize the mean cumulative reward.
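With the file written, the sweep is normally launched from the command line with wandb sweep sweep.yaml followed by wandb agent <sweep-id>. As an alternative, here is a hypothetical sketch of launching the same search from Python; it assumes the training and testing code from main.py is wrapped in a run_gridworld() helper, which is a stand-in rather than a real function from the notebook.

```python
import wandb

# Same search space and objective as sweep.yaml, expressed as a Python dict.
sweep_config = {
    "method": "bayes",
    "metric": {"name": "mean_reward_test", "goal": "maximize"},
    "parameters": {
        "alpha": {"min": 0.1, "max": 1.0},
        "gamma": {"min": 0.1, "max": 1.0},
        "epsilon": {"min": 0.01, "max": 0.1},
    },
}

def train():
    # Each call is one sweep trial; the sweep controller fills in the config values.
    with wandb.init() as run:
        cfg = run.config
        mean_reward = run_gridworld(alpha=cfg.alpha, gamma=cfg.gamma, epsilon=cfg.epsilon)
        run.log({"mean_reward_test": mean_reward})

sweep_id = wandb.sweep(sweep_config, project="rl-example")
wandb.agent(sweep_id, function=train, count=100)  # run 100 trials
```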
The results of the Sweep are visualized in the panel below. From these results, we can see the impact of each parameter on the objective function: the discount factor γ has the highest impact on the mean cumulative reward.
[W&B panel: sweep results across 99 runs]
OpenAI Gym Integration
The Gym library by OpenAI is a toolkit for developing reinforcement learning algorithms. It makes testing reinforcement learning algorithms really easy by providing numerous environments such as Atari games, Box2D, and MuJoCo.
Environment
In this section, we shall see how we can use Weights & Biases to track the agent's progress in an Atari environment. Specifically, for this experiment we use the Ms Pacman Deterministic environment; however, the code allows any of the Atari game environments to be specified.
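A minimal sketch of setting up the environment and a W&B run is shown below. The environment ID string is an assumption based on older Gym releases with the Atari extras installed (for example MsPacmanDeterministic-v4), and the reset/step interface shown is the older single-return-value Gym API.

```python
import gym
import wandb

# env_name is configurable, so any Atari environment ID can be swapped in.
wandb.init(project="rl-example", config={"env_name": "MsPacmanDeterministic-v4"})

env = gym.make(wandb.config.env_name)
state = env.reset()  # raw 210x160x3 pixel frame (older gym API)
print(env.action_space, env.observation_space)
```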
Agent
Since the state space of Atari games is large and the input consists of raw pixels, the tabular Q-learning approach cannot cover the entire state space. We therefore use a Deep Q-Network (DQN) to approximate the Q-function. The agent chooses a random action with probability ε, and we linearly anneal ε so that exploitation gradually takes over from exploration as training progresses.
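Below is a small sketch of an ε-greedy policy with linear annealing, as described above. The constants are illustrative, and the DQN itself along with the replay-buffer and training details is omitted.

```python
import numpy as np

class LinearEpsilonSchedule:
    """Linearly anneal epsilon from eps_start to eps_end over anneal_steps steps."""

    def __init__(self, eps_start=1.0, eps_end=0.1, anneal_steps=100_000):
        self.eps_start, self.eps_end, self.anneal_steps = eps_start, eps_end, anneal_steps

    def value(self, step):
        frac = min(step / self.anneal_steps, 1.0)
        return self.eps_start + frac * (self.eps_end - self.eps_start)

def select_action(q_values, epsilon, num_actions):
    # Explore with probability epsilon, otherwise take the greedy DQN action.
    if np.random.rand() < epsilon:
        return np.random.randint(num_actions)
    return int(np.argmax(q_values))
```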
Training
For training, we run about 80 episodes of playing Ms Pacman.
As in the previous example, we track the agent's total reward for each episode with a single line of code:
wandb.log({'score': score, 'episode': episode}, step=episode)
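In context, a condensed episode loop might look like the sketch below. Here dqn_agent and its predict/observe methods are hypothetical stand-ins for the actual DQN implementation, and the loop reuses the environment and epsilon schedule from the earlier sketches.

```python
import wandb

schedule = LinearEpsilonSchedule()
total_steps = 0

for episode in range(80):
    state, score, done = env.reset(), 0, False
    while not done:
        epsilon = schedule.value(total_steps)
        q_values = dqn_agent.predict(state)                  # hypothetical: Q-values from the DQN
        action = select_action(q_values, epsilon, env.action_space.n)
        next_state, reward, done, info = env.step(action)    # older gym API (4-tuple)
        dqn_agent.observe(state, action, reward, next_state, done)  # hypothetical: store + train step
        state, score, total_steps = next_state, score + reward, total_steps + 1
    # One point per episode on the score curve.
    wandb.log({'score': score, 'episode': episode}, step=episode)
```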
[W&B panel: score per episode for a single training run]
Conclusion
The code for this experiment is available on Colab. Shoutout to Michael Tinsley for the GridWorld code and to Daniel Grattola for the Gym code.
I would love to hear your thoughts on this report; all feedback is appreciated. You can reach me on Twitter at @YashKotadia1.