
A Gentle Introduction to OpenAI Gym

In this article, we'll give you an introduction to the OpenAI Gym library, its API, and its various environments, as well as show you how to create an environment of your own!
Reinforcement learning is the sub-field of machine learning in which an agent learns to act so as to maximize its cumulative future reward; the reinforcement happens through rewards.
In this piece, we're going to take a practical view of that learning: we'll get our hands dirty with a tool called OpenAI Gym.
OpenAI Gym (or Gym) is a toolkit for developing and testing reinforcement learning algorithms. It has a huge collection of in-built environments, all ready to be used off the shelf.
In this piece, we'll give you a refresher on the basics of Reinforcement Learning, the basic structure of Gym environments, common experiments using Gym, and how to build your very own Custom Environment!
Let’s dive in!

What is Reinforcement Learning?

The main premise behind reinforcement learning is as follows: an agent reinforced with rewards based on its interaction with an environment would learn to operate optimally in the environment to maximize the rewards. In principle, the agent learns through trial and error.
There is a much more in-depth, dedicated blog post on the basics of reinforcement learning, linked here.
This analogy is helpful for understanding. I have a [good] dog. Her name is Luna. I'm trying to teach her a few commands, such as "sit" or "roll over." When we started out and Luna didn't "sit" when asked, she didn't get a positive reward (a delicious treat). Over time, as she started associating treats with sitting or rolling over, she began displaying reinforcement learning: she's trying to maximize her reward by performing the desired action.
Agent-Environment Cycle; Image by Author
As we can observe from the example (and the illustration), Luna is an agent, and everything other than Luna is considered to be the environment. Luna performs an action, thereby changing the state, and receives an appropriate reward.
Any reinforcement learning task or problem can be boiled down to five key components: agent, environment, state, action, and reward.
Now that we understand the conceptual side of reinforcement learning, how can we implement it programmatically? Enter OpenAI Gym.

What is OpenAI Gym?

OpenAI Gym is a Pythonic API that provides simulated training environments to train and test reinforcement learning agents. It's become the industry standard API for reinforcement learning and is essentially a toolkit for training RL algorithms.
Before we get to a much more in-depth view of Gym, we need to understand the slightly convoluted history of Gym.

Who Maintains Gym?

On April 27th of 2016, OpenAI announced the public beta of Gym. Gym was developed to solve two main issues plaguing the field of reinforcement learning: the lack of benchmarks and the lack of standardization. (We'll deal with both in the "What Is the Need for Gym?" section below.)
Gym did, in fact, address these issues and soon became widely adopted by the community for creating and training agents in various environments. The main problem with Gym, however, was the lack of maintenance. OpenAI hadn't allocated substantial resources to its development since its inception, and by 2020 it simply wasn't being maintained.
In 2021, a non-profit organization called the Farama Foundation took over Gym. They introduced new features into Gym, renaming it Gymnasium. Farama seems to be a cool community with amazing projects such as PettingZoo (Gymnasium for MultiAgent environments), Minigrid (for grid world environments), and much more. Most reinforcement learning aficionados seem excited about the news.
Gymnasium is the newest version of Gym (canonically, it is version "0.27.0"). We won't be dealing with any of these latest versions, though. We will be using a library called Stable-Baselines3 (sb3), a collection of reliable implementations of RL algorithms; sb3 is only compatible with Gym v0.21.0.
For now, just know that you won't find official docs for Gym v0.21.0 (it was released in 2021), but almost all the Gym tutorials you'll see are based on this version. The best way to debug is to scour the GitHub repository.
The purpose here was to explain why we'll be using an older version of Gym, one that is still used by a large part of the community. Fortunately, the API (which we'll be digging into heavily) hasn't changed substantially: newer versions of Gym still follow the same basic structure and workings.
The next question is why we need Gym. The next section has the answer.

What Is the Need for Gym?

For quite a while, RL didn't have sufficient benchmarks or standardization. At least until Gym, that is. In every other field of machine learning, there are standard datasets that can be used as benchmarks to assess the performance of various models (think ImageNet or MNIST, for example).
Reinforcement learning, however, doesn't use datasets at all: learning happens through interaction with complex software environments like games. That introduces the first problem, a lack of benchmarks, and we can't compare RL algorithms if there are no benchmarks.
Another problem was the lack of standard APIs in the field. Every researcher created their very own environment with varying structures, making it difficult to compare different implementations of the same environment.
Gym effectively solved both of these issues. It introduced a standard API (structure) for every environment, thereby creating a collection of standard environments. That makes it possible to test various algorithms against the same set of standard environments, creating a benchmark.
That's enough setup. Let's dig into some code!

How Does OpenAI Gym Work?

The very first thing we need to do is install Gym in our local environments.
Gym is primarily supported on Mac and Linux, but there is a workaround for installing it on Windows.
You cannot run the following code in Colab: rendering the environment in Colab is a bit complicated, so we'll be avoiding it. You can download the IPython notebook linked here and run it locally.

Installation on Windows

On Windows, we'll need 2 prerequisites: Visual Studio Build Tools, and Miniconda.
Visit the Build Tools page and download the latest version of “Build Tools for Visual Studio.” Once downloaded, run the executable file and select “Desktop Development with C++”. This will install the necessary files for Build Tools to function.

Installing Build Tools for Visual Studio; Image by Author
Visit the Miniconda website and download the latest version of Miniconda based on your computer’s platform (Windows in this case). Run the executable file and select all default options.
Then, just install Jupyter Notebook and the Gym library:
python -m pip install jupyter --user
pip install gym==0.21

Installation on Mac/Linux

To install Gym on Mac/Linux, all we need to do is install the Gym library:
pip install gym==0.21
That's it. Seriously.
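To confirm that the right version was picked up, you can check it from a Python prompt (not required, just a quick sanity check; it should print 0.21.0):
import gym

# Print the installed Gym version; we want 0.21.0 for compatibility with sb3
print(gym.__version__)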
Now that we've installed Gym, let's try and understand the basic structure of a Gym environment.

Framing a Reinforcement Learning Problem

Remember that in our earlier "What Is Reinforcement Learning?" section we learned that any RL task can be represented using five key components: agent, environment, observation space, action space, and reward function.
This representation is achieved by defining the problem as a Markov Decision Process. To create an agent-environment interaction cycle we'll need these components. Here's how to get them:

1. Observation Space and Action Space

Our observation space determines the set of possible observations in a given environment, whereas our action space determines the set of possible actions an agent can take in the environment.
Gym offers various types of spaces to represent the observation space and action space. We will be dealing with 6 major ones:
  1. Box: Used to represent a continuous, bounded space. Example: all real values between -1 and +1, which can include something like 0.234512.
  2. Discrete: Used to represent a discrete, bounded space. Example: ten integer values, say 0 through 9; it cannot include decimal values.
  3. Dict: A dictionary of simpler spaces.
  4. Tuple: A tuple of simpler spaces.
  5. MultiBinary: Used to represent an n-dimensional binary space (n independent on/off flags).
  6. MultiDiscrete: We can think of this as multiple Discrete values combined into one space.
Depending on the needs of the task, we'll choose the action and observation space from among these.
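As a rough illustration (a minimal sketch, not code we'll reuse later; the bounds and shapes are arbitrary examples), here's how a few of these spaces can be constructed and sampled with gym.spaces:
import numpy as np
from gym import spaces

# A single continuous value bounded between -1 and +1
box = spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)

# Ten discrete choices, encoded as the integers 0 through 9
disc = spaces.Discrete(10)

# Eight independent binary (0/1) flags
multi_bin = spaces.MultiBinary(8)

# Three discrete variables with 5, 2, and 3 possible values respectively
multi_disc = spaces.MultiDiscrete([5, 2, 3])

# Every space supports sample(), which draws a random valid element
print(box.sample(), disc.sample(), multi_bin.sample(), multi_disc.sample())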

2. Reward Function

Every environment will have an inherent reward function that determines the reward for the state change. Based on the condition, the agent is rewarded with a positive, negative, or neutral reward.

3. Agent

A learning agent performs an action, which changes the state, and that state change is rewarded (positively, negatively, or neutrally). The agent has no direct control over the state change; it can only pick an action, which translates into probabilistic chances of ending up in different states.
Here's an example using the Frozen Lake environment from Gym. Our agent is an elf and our environment is the lake. It's frozen, so it's slippery. If our agent (a friendly elf) chooses to go left, he only actually moves left about a third of the time; the rest of the time he slips and slides off in a perpendicular direction instead. Here, the slipperiness determines where the agent ends up. In reinforcement learning, this is captured by the transition probability matrix.
Slipperiness of the environment determines the transition state; Image by Author
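As a quick aside (a minimal sketch, specific to the tabular toy-text environments rather than the general Gym API): FrozenLake exposes its transition probabilities through the unwrapped environment's P attribute, so you can inspect them directly:
import gym

env = gym.make("FrozenLake-v1", is_slippery=True)
# P[state][action] is a list of (probability, next_state, reward, done) tuples
print(env.unwrapped.P[0][0])
env.close()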
All we need to understand is the following: The agent takes a “step” to perform some action, and the environment determines the next state and the reward.
In Gym, it's done like this:
action = env.action_space.sample()
observation, reward, done, info = env.step(action)
The agent selects a random action from the set of available actions through action_space.sample(). It performs the action with the step() function. As the agent takes a step, a single timestep is passed and it ends up receiving 4 signals (in Gym v0.21): observation, reward, done, and info. A note about each:
  • observation is the new state of the environment after the action (the state change)
  • reward denotes the reward the agent receives for that state change
  • done indicates the termination status of the episode, i.e., whether the episode has finished or not
  • info offers additional diagnostic information, which can be useful for debugging and logging
In this example, we are choosing a random action. This wouldn't translate to any learning, per se.
This is where Stable-Baselines3 comes in. It has stable implementations of various RL algorithms, which can be used off-the-shelf in Gym environments.
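To give a sense of what "off the shelf" means, here's a minimal sketch of training a PPO agent with sb3 (the full, experiment-tracked version of this loop appears later in the article):
import gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")
# PPO with a simple multi-layer-perceptron policy; learn() runs the whole training loop
model = PPO("MlpPolicy", env, verbose=0)
model.learn(total_timesteps=10000)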

4. Environment

The environment in Gym is represented through the Env class. It offers all the major functionality we require from any Gym environment: reset, render, step, and close.
We've already encountered the step function. It applies the agent’s actions and determines the change in the state based on the environment’s transition probability. Let's talk about those other ones:
The reset function is called whenever a new episode begins. It sets the environment and agent back to their initial conditions, returns the starting observation, and sets the reward back to zero, so every episode starts from a clean slate.
The render function is responsible for graphically rendering the output. This is purely for humans to visually perceive agent training.
The close function makes sure the environment is fully terminated. It avoids runaway environments running in the background.

Putting it all together

If we combine all five components, we get the following. You can interact with any Gym environment using this structure:
import gym

env_name = ""  # fill in the environment name, e.g. "CartPole-v1"
env = gym.make(env_name)
observation = env.reset()

done = False
while not done:
    action = env.action_space.sample()
    observation, reward, done, info = env.step(action)
    env.render(mode="human")

env.close()
We create an environment using the gym.make() function, passing in the environment name as the argument. We reset() the environment because this is the beginning of an episode and we need the initial conditions. We set done = False; this flag tracks whether the episode has terminated, and the loop keeps running until it does.
Inside the loop, we select an action and pass the action to the step() function. We render this change in the state using the render() function.
After the episode is terminated, we close the environment using the close() function.
Phew, that was a lot to take in. The amazing thing is that every Gym environment can be "solved" using the boilerplate code above. We do have to add the learning capacity on top, but other than that, we'll follow the same structure throughout.
In the following section, we will test out our Boiler-Plate code with various environments. We will also add experiment-tracking functionality with W&B.

Common Experiments in RL using OpenAI Gym

Let's look at some common experiments in Gym. First, we need to import those libraries!

1. Importing Libraries

We'll be installing the following:
  1. stable-baselines3: a reliable set of implementations of various reinforcement learning algorithms
  2. wandb: for tracking experiment metrics and performance
  3. box2d-py: A 2D physics engine for running BipedalWalker-v3
  4. gym-super-mario-bros: Gym environment for simulating SuperMarioBros
  5. opencv-python: For image processing tasks.
!pip install 'stable-baselines3[extra]'
!pip install wandb
!pip install box2d-py
!pip install gym_super_mario_bros==7.3.0 nes_py
!pip install opencv-python
We will now import the packages along with specific functions. We'll deal with these methods as they come up in our code blocks.
import gym
import os
import wandb
import gym_super_mario_bros
from nes_py.wrappers import JoypadSpace
from gym.wrappers import GrayScaleObservation
from wandb.integration.sb3 import WandbCallback
from gym_super_mario_bros.actions import SIMPLE_MOVEMENT
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv,VecVideoRecorder, VecFrameStack
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor
Now, we can move on to creating and testing environments.

2. CartPole-v1

CartPole is the Hello World of reinforcement learning. It's a simple environment: a cart with a pole balanced on top of it. The agent must learn to adjust the cart's velocity, pushing it left or right, to keep the pole balanced and prevent it from falling.
Action Space: Discrete (2); Left or Right
Observation Space: Array with the shape (4). It contains 4 signals:
  1. Cart Position
  2. Cart Velocity
  3. Pole Angle
  4. Pole Angular Velocity
Reward Function: +1 for every timestep the pole stays upright.
Termination: The episode ends when the pole tilts too far, the cart leaves the track, or 500 timesteps elapse; the environment is considered solved at an average reward of 475.
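You can verify these spaces yourself with a quick check (not part of the training code below):
import gym

env = gym.make("CartPole-v1")
# Prints Discrete(2) for the actions and a 4-dimensional Box for the observations
print(env.action_space)
print(env.observation_space)
env.close()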
We'll set the configuration dictionary with the environment name, type of policy, and total time steps. We'll pass this config dictionary to our run which is initialized with wandb.init(). It will automatically log all the necessary metrics. We'll also pass in monitor_gym = True to auto-upload training videos.
config = {
    "policy_type": "MlpPolicy",
    "total_timesteps": 30000,
    "env_name": "CartPole-v1",
}
run = wandb.init(
    project="intro_to_gym",
    config=config,
    sync_tensorboard=True,
    monitor_gym=True,
    save_code=True,
)
Here, we create an environment and wrap the environment with VecVideoRecorder to enable recording the rendered images as video.
def make_env():
    env = gym.make(config["env_name"])
    env = Monitor(env)
    return env

env = DummyVecEnv([make_env])
env = VecVideoRecorder(env, f"videos/{run.id}", record_video_trigger=lambda x: x % 200 == 0, video_length=200)
Now, we can define the model with the PPO algorithm and begin training with model.learn():
model = PPO(config["policy_type"], env, verbose=1, tensorboard_log=f"runs/{run.id}")
model.learn(
    total_timesteps=config["total_timesteps"],
    callback=WandbCallback(
        gradient_save_freq=100,
        model_save_path=f"models/{run.id}",
        verbose=2,
    ),
)
We can save the trained model in the sub-directory Training/Saved Models:
PPO_path = os.path.join('Training', 'Saved Models', 'PPO_CartPole_30k')
model.save(PPO_path)
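If you want to reuse the trained agent later (in a separate script, say), you can load it back from the same path; a minimal sketch, assuming the file saved above exists:
# Reload the saved policy; passing env is only needed if you want to keep training or evaluating
model = PPO.load(PPO_path, env=env)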
We can evaluate the model's performance with the evaluate_policy function and then complete the run:
evaluate_policy(model, env, n_eval_episodes=10, render=True)
run.finish()

CartPole Trained using PPO algorithm; Image by Author


As we can observe from the video panel above, over time our cart learns to adjust its velocity to balance the pole. Success is ours!
This is the boilerplate code that we will be following for every experiment. Onto the Bipedal Walker!

3. BipedalWalker-v3

BipedalWalker is a four-joint walker robot that has to make its way across uneven terrain. The episode terminates if the robot's hull touches the ground.
Action Space: Box space with motor speed values ranging from -1 to +1 for each joint.
Observation Space: Box space with 24 values, including the hull angle, angular and linear velocities, joint positions and speeds, leg contact flags, and lidar rangefinder measurements.
Reward Function: The robot earns up to 300 points for reaching the far end of the terrain and receives -100 if it falls.
Termination: The episode terminates when the robot collects 300 points (within 1600 timesteps) or when the hull touches the ground.
We'll follow the same boilerplate code, only changing env_name and total_timesteps, and re-creating the environment with the new name:
env_name = "BipedalWalker-v3"
config = {
"policy_type": "MlpPolicy",
"total_timesteps": 250000,
"env_name": env_name,
}
run = wandb.init(
project="intro_to_gym",
config=config,
sync_tensorboard=True,
monitor_gym=True,
save_code=True,
)

model = PPO(config["policy_type"], env, verbose=1, tensorboard_log=f"runs/{run.id}")
model.learn(
total_timesteps=config["total_timesteps"],
callback=WandbCallback(
gradient_save_freq=100,
model_save_path=f"models/{run.id}",
verbose=2,
),
)

PPO_path = os.path.join('Training', 'Saved Models', 'PPO_BipedalWalker_250k')
model.save(PPO_path)

evaluate_policy(model, env, n_eval_episodes=10, render=True)
run.finish()

BipedalWalker trained using PPO algorithm; Image by Author


As we can observe from the video panel, our robot is only able to walk a short distance before falling flat. Poor guy.

4. SuperMarioBros

Yes, this is the classic NES game in which a plumber needs to overcome waves of usually rather cute enemies to save a kidnapped princess from Bowser, a sort of turtle dinosaur. But can we teach an AI to play this game?
Of course we can.
The library gym-super-mario-bros creates a Gym version of the Super Mario Game which can act as the learning environment. At this point, I want to give a huge shoutout to Nicholas Renotte. His tutorial on Mario RL is genuinely amazing. Most of the pre-processing techniques in this section are inspired by his video.
Let's get started. The environments in the gym_super_mario_bros library use the full NES action space, which includes 256 possible actions. If we train our model with such a large action space, we cannot expect meaningful convergence (i.e., our AI won't learn to play well). Therefore, we're going to choose a list of simple actions aptly named SIMPLE_MOVEMENT, which offers only seven actions. We use the JoypadSpace wrapper to allow our code to control Mario.
We convert the image to grayscale using the GrayScaleObservation wrapper, which reduces the dimensionality of the observations. We wrap the result in a dummy vectorized environment and stack four frames at a time, so each observation carries a short history of frames (and therefore motion information).
env = gym_super_mario_bros.make('SuperMarioBros-v0')
env = JoypadSpace(env, SIMPLE_MOVEMENT)
env = GrayScaleObservation(env, keep_dim=True)
env = Monitor(env)
env = DummyVecEnv([lambda: env])
env = VecFrameStack(env, 4, channels_order='last')
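To see what the agent will actually observe after all these wrappers, you can print the observation space (a quick check, not part of the training code); with grayscale frames and a stack of four, the last dimension should be 4:
# The stacked grayscale observation the CNN policy will receive,
# e.g. something like (240, 256, 4) for four stacked NES frames
print(env.observation_space.shape)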
Now we use the same boilerplate code, but change the policy to CnnPolicy, since we're dealing with images:
env_name = "SuperMarioBros-v0"
config = {
"policy_type": "CnnPolicy",
"total_timesteps": 25000,
"env_name": env_name,
}
run = wandb.init(
project="intro_to_gym",
config=config,
sync_tensorboard=True,
monitor_gym=True,
save_code=True,
)

env = VecVideoRecorder(env, f"videos/{run.id}", record_video_trigger=lambda x: x % 2000 == 0, video_length=200)
model = PPO(config["policy_type"], env, verbose=1, tensorboard_log=f"runs/{run.id}")
model.learn(
    total_timesteps=config["total_timesteps"],
    callback=WandbCallback(
        gradient_save_freq=10,
        model_save_path=f"models/{run.id}",
        verbose=2,
    ),
)

PPO_path = os.path.join('Training', 'Saved Models', 'PPO_SuperMario_25k')
model.save(PPO_path)

evaluate_policy(model, env, n_eval_episodes=10, render=True)
run.finish()


Our Mario did not learn anything useful, which is understandable since we trained it only for 25k timesteps. More training time would definitely increase its performance.
Normally, this would be the end of the blog post, but if you're like me, you'll want to go a step further and build your very own custom environment.
If we follow Gym's basic structure, we can easily create our own custom environment. That's what we're going to do next!

Building a Custom Environment with Gym

In the "How Does OpenAI Gym Work?" section, we saw the core methods every Gym environment exposes: reset, step, render, and close.
To create our custom environment, we'll need to implement reset, step, and render, along with an __init__ method.
The general structure of our custom environment should look like this:
class CustomEnv(gym.Env):
    def __init__(self):
        pass
    def step(self, action):
        pass
    def render(self):
        pass
    def reset(self):
        pass
Our custom environment should inherit from the `gym.Env` super-class. This ensures commonality and standardization.
One of my favorite movies of all time is Interstellar. In the movie, astronauts conserve energy and resources by hibernating in HyperSleep Pods: long metallic tubes with life-support systems and many layers of insulation. We'll try to create our very own AI HyperSleep Pod, which will automatically adjust the temperature in the pod to lengthen our sleep.
The optimal temperature for sleeping is between 60 and 67 degrees Fahrenheit. Our AI agent will learn on its own to maintain this temperature.
HyperSleep Pods re-imagined; Image by Author
Create a file and name it HyperSleepEnv.py. We have to follow the custom environment structure:
import random
import gym
from gym import spaces
import numpy as np

class HyperSleepEnv(gym.Env):
    def __init__(self):
        # Actions our pod can take: decrease temperature, stay the same, or increase temperature
        self.action_space = spaces.Discrete(3)
        # Temperature range the pod can observe
        self.observation_space = spaces.Box(low=np.array([50]), high=np.array([70]))
        # Set start temp
        self.state = 65 + random.randint(-5, 5)
        # Set sleep length (in timesteps)
        self.sleep_duration = 60
        # Set reward
        self.reward = 0

    def step(self, action):
        # Map action {0, 1, 2} to a temperature change of {-5, 0, +5}
        self.state += (action - 1) * 5
        # Reduce remaining sleep duration by 1
        self.sleep_duration -= 1
        # Calculate reward: +100 inside the optimal band, -10 outside it
        if self.state >= 60 and self.state <= 67:
            self.reward = 100
        else:
            self.reward = -10
        # Check if sleep duration is over
        if self.sleep_duration <= 0:
            done = True
        else:
            done = False
        # Apply random temperature noise
        self.state += random.randint(-5, 5)
        # Set placeholder for info
        info = {}
        # Return step information
        return self.state, self.reward, done, info

    def render(self):
        pass

    def reset(self):
        # Reset pod temperature
        self.state = np.array([60 + random.randint(-10, 10)]).astype(float)
        # Reset sleep duration
        self.sleep_duration = 60
        self.reward = 0
        return self.state
We begin by creating an __init__ method, which initializes our environment and state. The agent can choose from three possible actions: decrease the temperature by 5 degrees, keep it the same, or increase it by 5 degrees.
Our observation_space is the range of possible temperature values; we define it as a continuous value between 50 and 70. The starting temperature is chosen randomly in the range 60 to 70. We set sleep_duration to 60 timesteps and initialize the reward to zero.
In the step() function, we apply the action chosen by the model, which causes a change in state. For this transition, the agent receives a reward: +100 if the pod keeps the temperature between 60 and 67 degrees, and -10 otherwise. We also add random noise to the temperature.
We will not be implementing the render function as it is beyond the scope of this tutorial. Moving on, the reset() function will set everything back to initial conditions.
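Before handing the environment to sb3, it's worth sanity-checking it with a quick random rollout (a minimal sketch, not part of the training code below) to make sure reset and step behave as expected:
from HyperSleepEnv import HyperSleepEnv

env = HyperSleepEnv()
obs = env.reset()
done = False
total_reward = 0
while not done:
    # Take random actions until the 60-step episode ends
    obs, reward, done, info = env.step(env.action_space.sample())
    total_reward += reward
print("Episode finished with total reward:", total_reward)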
We can solve this environment using the PPO algorithm. Here is the code:
import gym
import os
import wandb
from HyperSleepEnv import HyperSleepEnv
from wandb.integration.sb3 import WandbCallback
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor

config = {
    "policy_type": "MlpPolicy",
    "total_timesteps": 25000,
    "env_name": "HyperSleep",
}
run = wandb.init(
    project="intro_to_gym",
    config=config,
    sync_tensorboard=True,
    monitor_gym=True,
    save_code=True,
)

env = HyperSleepEnv()
env = Monitor(env)
env = DummyVecEnv([lambda: env])

model = PPO(config["policy_type"], env, verbose=1, tensorboard_log=f"runs/{run.id}")
model.learn(
    total_timesteps=config["total_timesteps"],
    callback=WandbCallback(
        gradient_save_freq=10,
        model_save_path=f"models/{run.id}",
        verbose=2,
    ),
)

PPO_path = os.path.join('Training', 'Saved Models', 'PPO_HyperSleep_25k')
model.save(PPO_path)

evaluate_policy(model, env, n_eval_episodes=10)
run.finish()


As we can observe from the graph, the mean reward is increasing. With more training time, we could solve this environment.

Summary

In this blog post, we learned the basics of representing a Reinforcement Learning task with OpenAI Gym, we learned various methods and environments present in Gym, and we also learned how to use these environments and solve them using PPO. Finally, we created our very own custom environment, inspired by Interstellar, and solved it using the PPO algorithm.

References

  1. Nicholas Renotte's "Build an Mario AI Model with Python" video
  2. PyLessons tutorial on solving BipedalWalker.