
Track and Tune Your Reinforcement Learning Models With Weights & Biases

In this article, we learn how to use various tools from Weights & Biases for the GridWorld reinforcement learning task, and we also show how to integrate the OpenAI Gym environment with W&B.
This article demonstrates how Weights & Biases can be used for the GridWorld reinforcement learning task, and shows how to integrate W&B with OpenAI Gym Environment, which makes testing reinforcement learning algorithms straightforward.

GridWorld

GridWorld Environment Setup for this experiment


Run in Colab →

Environment

The GridWorld setup is pretty simple. The environment is set up as a grid containing walls, gold, and a bomb. The agent starts from a legal cell in the grid, and the objective is to find the gold. However, the agent has to be careful not to land on the bomb, since that incurs a negative reward.
The agent can move around the grid with up, down, left, and right actions. In this experiment, the agent can start from any of the columns in the 9th row. The reinforcement learning task is for the agent to learn a policy that can help navigate the agent successfully towards the goal while avoiding the bomb.
Each episode is defined as the sequence of actions and states taken by the agent from the START state until it reaches a TERMINAL state. The grid cells containing the bomb and the gold are the terminal states. The walls are marked in black; the agent cannot pass through them and has to go around them.
The agent receives a reward of -1 for every step it takes, including bumping into a wall. Reaching the TERMINAL cell containing the gold returns a reward of 10, whereas the bomb cell returns a reward of -10. The goal is to reach the gold in the minimum number of steps.
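
To make this setup concrete, here is a minimal sketch of such an environment. The class, grid size, and the wall, gold, and bomb positions are illustrative assumptions rather than the exact code from the Colab notebook; only the reward scheme mirrors the description above.

import numpy as np

# Illustrative GridWorld sketch; the grid size and the wall, gold, and bomb
# positions are assumptions, not the layout used in the Colab notebook.
class GridWorld:
    def __init__(self, height=10, width=10):
        self.height, self.width = height, width
        self.gold = (0, 3)                      # terminal cell with reward +10 (assumed position)
        self.bomb = (1, 3)                      # terminal cell with reward -10 (assumed position)
        self.walls = [(2, 1), (2, 2), (2, 3)]   # impassable cells (assumed positions)

    def reset(self):
        # The agent starts from a random column in the last row.
        self.state = (self.height - 1, np.random.randint(self.width))
        return self.state

    def step(self, action):
        moves = {'UP': (-1, 0), 'DOWN': (1, 0), 'LEFT': (0, -1), 'RIGHT': (0, 1)}
        row, col = self.state
        dr, dc = moves[action]
        new_state = (row + dr, col + dc)
        # Bumping into a wall or the grid boundary leaves the agent where it was.
        if (new_state in self.walls
                or not (0 <= new_state[0] < self.height)
                or not (0 <= new_state[1] < self.width)):
            new_state = self.state
        self.state = new_state
        if new_state == self.gold:
            return new_state, 10, True          # finding the gold ends the episode
        if new_state == self.bomb:
            return new_state, -10, True         # landing on the bomb ends the episode
        return new_state, -1, False             # every other step (or wall bump) costs -1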

Agent

We deploy a simple Q-learning agent for this task. The agent updates its Q-table using the Bellman equation: the estimated value of taking an action from a particular state is updated by propagating the reward received after taking that step. To address the exploration vs. exploitation trade-off during training, we implement a simple strategy of exploring with probability ε at every step; that is, the agent selects a random action with probability ε during training. During testing, it greedily selects the best action, that is, the action leading to the state with the highest Q-value.
With reinforcement learning, it is always difficult to decide on the optimal set of training parameters. In the upcoming section, we shall see how we can use Weights & Biases Sweeps to find a good set of parameters with just a few extra lines of code. For now, we assume the default values stated below; a minimal sketch of the resulting update rule follows the list.
  • Learning rate α = 0.1
  • Probability of choosing a random action ε = 0.05
  • Discounting factor γ = 0.95
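
Below is a minimal sketch of the ε-greedy action selection and the Bellman update described above, using these default values. The function and variable names are illustrative assumptions (Q is assumed to be a mapping from state to an array of action values), not the actual notebook code.

import numpy as np

# Sketch of the ε-greedy policy and the Bellman update with the defaults above.
ALPHA, EPSILON, GAMMA = 0.1, 0.05, 0.95
ACTIONS = ['UP', 'DOWN', 'LEFT', 'RIGHT']

def choose_action(Q, state, epsilon=EPSILON):
    # Explore with probability epsilon, otherwise act greedily on the Q-table.
    if np.random.rand() < epsilon:
        return np.random.choice(ACTIONS)
    return ACTIONS[int(np.argmax(Q[state]))]

def update_q(Q, state, action, reward, next_state, alpha=ALPHA, gamma=GAMMA):
    # Bellman update: move Q(s, a) towards r + γ · max over a' of Q(s', a').
    a = ACTIONS.index(action)
    td_target = reward + gamma * np.max(Q[next_state])
    Q[state][a] += alpha * (td_target - Q[state][a])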

Training

For training the Q-Learning agent, we run 500 episodes of GridWorld with the agent starting from a random position in the last row.
To track the agent's progress, we log the cumulative reward collected in each episode using the following command.
wandb.log({'reward': cumulative_reward, 'episode': trial})
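This call typically sits at the end of each episode of the training loop. The following is a hedged sketch of that loop, reusing the illustrative environment and helpers from the earlier snippets; it is not the exact Colab code.

import wandb
from collections import defaultdict
import numpy as np

# Hypothetical training loop built around the helpers sketched earlier.
env = GridWorld()
Q = defaultdict(lambda: np.zeros(len(ACTIONS)))   # Q-table: one row of action values per state

wandb.init(project="rl-example")
for trial in range(500):
    state = env.reset()
    cumulative_reward, done = 0, False
    while not done:
        action = choose_action(Q, state)
        next_state, reward, done = env.step(action)
        update_q(Q, state, action, reward, next_state)
        cumulative_reward += reward
        state = next_state
    # One log call per episode is enough to chart the agent's progress in W&B.
    wandb.log({'reward': cumulative_reward, 'episode': trial})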
This allows us to effectively visualize the agent’s progress in real-time as shown below.



[Chart: cumulative reward per episode — Run set: 1]



Testing

In the testing phase, the agent chooses the best action greedily to find the gold in a minimum number of steps. We run 1000 episodes and find the average of cumulative rewards across all runs. Thus, the mean cumulative reward is our optimization objective that we want to maximize.
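
A sketch of this evaluation phase might look as follows, again reusing the illustrative environment and Q-table from above. The 200-step cap is an added safeguard and an assumption; the metric name mean_reward_test matches the sweep configuration shown later.

import numpy as np
import wandb

# Greedy evaluation sketch; the step cap guards against an untrained greedy
# policy that never reaches a terminal state.
test_rewards = []
for _ in range(1000):
    state = env.reset()
    cumulative_reward = 0
    for _ in range(200):
        action = choose_action(Q, state, epsilon=0.0)   # purely greedy at test time
        state, reward, done = env.step(action)
        cumulative_reward += reward
        if done:
            break
    test_rewards.append(cumulative_reward)

# The metric name matches the one optimized by the sweep configuration below.
wandb.log({'mean_reward_test': float(np.mean(test_rewards))})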

Finding Optimal Hyperparameters Using Sweeps

Reinforcement learning tasks such as Q-learning are highly dependent on the chosen parameter values. A well-chosen set of parameters can help the agent converge to a good policy as well as prevent divergence. In this section, we shall see how we can use Weights & Biases Sweeps to derive insights into the impact of the parameter values on the objective function.
To initialize a Sweep, we need to write a configuration file that tells the sweep agent which parameters to vary and which objective function to optimize them against. Following is the sweep configuration file for our GridWorld task.
%%writefile sweep.yaml
project: "rl-example"
program: main.py
method: bayes
metric:
  name: mean_reward_test
  goal: maximize
parameters:
  alpha:
    min: 0.1
    max: 1
  gamma:
    min: 0.1
    max: 1
  epsilon:
    min: 0.01
    max: 0.1
The project variable specifies the W&B project in which the results of the Sweep are logged. The program variable specifies the file that will be run repeatedly with the parameters listed in the parameters section. We use the Bayesian optimization method, which models the optimization objective as a function of the parameters and then uses a Gaussian process to select parameters with a high probability of improvement. Finally, we specify a range of plausible values for each parameter over which we want to maximize the mean cumulative reward.
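In practice, the sweep is typically launched from the command line with wandb sweep sweep.yaml followed by wandb agent <sweep_id>, and main.py reads the parameter values chosen for each run from wandb.config. The snippet below sketches that pattern; the variable names are assumptions about how main.py is structured.

import wandb

# Inside main.py, wandb.init() picks up the parameters chosen by the sweep
# agent for this run, and they become available on wandb.config.
run = wandb.init(project="rl-example")
alpha = wandb.config.alpha      # learning rate sampled by the sweep
gamma = wandb.config.gamma      # discounting factor sampled by the sweep
epsilon = wandb.config.epsilon  # exploration probability sampled by the sweep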
The results of the Sweep can be visualized below. Based on these results, we can see the impact of each parameter on the objective function: the discounting factor γ has the highest impact on the optimization objective, the mean cumulative reward.

[Sweep results — Run set: 99]


OpenAI Gym Integration

The Gym library by OpenAI is a toolkit for developing reinforcement learning algorithms. It makes testing reinforcement learning algorithms really easy by providing numerous environments such as Atari games, Box2D, and MuJoCo.

Environment

In this section, we shall see how we can use Weights & Biases to track the agent's progress in an Atari environment. Specifically, for this experiment, we shall be using the Ms Pacman Deterministic environment; however, the code allows you to specify any of the Atari game environments.
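
Creating the environment is a one-liner. The snippet below assumes the classic Gym API (where reset() returns only the observation) and the MsPacmanDeterministic-v4 environment id; the exact id and return values may differ depending on the Gym and Atari package versions installed.

import gym

# Assumes the classic Gym API, where reset() returns just the observation.
env = gym.make("MsPacmanDeterministic-v4")
state = env.reset()
print(env.action_space.n, state.shape)   # discrete action count and raw pixel observation shape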

Agent

Since the state space of Atari games is large and the input consists of raw pixels, the naive tabular Q-learning approach cannot encapsulate the entire state space. Thus, we use a Deep Q-Network (DQN) to approximate the Q(s, a) function. The agent chooses a random action with probability ε, and we use linear epsilon annealing to gradually shift from exploration to exploitation over the course of training.
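
A minimal sketch of linear epsilon annealing is shown below; the start value, end value, and annealing horizon are assumptions, not the values used in the notebook.

# Linear epsilon annealing: decay ε from EPS_START to EPS_END over ANNEAL_STEPS
# environment steps, then keep it fixed. The specific values are assumptions.
EPS_START, EPS_END, ANNEAL_STEPS = 1.0, 0.1, 100_000

def epsilon_at(step):
    fraction = min(step / ANNEAL_STEPS, 1.0)
    return EPS_START + fraction * (EPS_END - EPS_START)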

Training

For training, we run about 80 episodes of playing Ms Pacman.
As seen in the previous example, we track the agent's total reward for each episode using a single line of code as follows.
wandb.log({'score': score, 'episode': episode}, step=episode)
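This call sits at the end of each episode. The sketch below shows one way the surrounding loop could look, assuming the classic Gym step API and a hypothetical DQN agent object with act and train methods; those names are placeholders, not the actual notebook code.

import wandb

# Hypothetical per-episode loop; `agent` stands in for a DQN agent with `act`
# and `train` methods and a `total_steps` counter (illustrative names only).
wandb.init(project="rl-example")
for episode in range(80):
    state = env.reset()
    score, done = 0, False
    while not done:
        action = agent.act(state, epsilon_at(agent.total_steps))
        next_state, reward, done, info = env.step(action)
        agent.train(state, action, reward, next_state, done)
        score += reward
        state = next_state
    wandb.log({'score': score, 'episode': episode}, step=episode)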

[Chart: score per episode — Run set: 1]


Conclusion

The code for this experiment is available on Colab. Shoutout to Michael Tinsley for the Gridworld Code and to Daniel Grattola for Gym code.
I would love to hear your thoughts on this report. All feedback is appreciated. You can contact me on Twitter. My Twitter handle is @YashKotadia1.
