A Gentle Introduction to Reinforcement Learning With An Example
This article provides a primer on reinforcement learning with an autonomous driving example with OpenAI Gym and Stable Baselines3 to tie it all together.
In this article, we'll be exploring reinforcement learning, what it is, and how to implement it using an autonomous driving example.
Before we dive in, let's look at what we'll be covering:
Table Of Contents
- What Is Reinforcement Learning?
- How Does Reinforcement Learning Work?
- The Intuition Behind Reinforcement Learning: A Mathematical Adventure
- Components of Reinforcement Learning Agent
- Reinforcement Learning Applications
- When To Use Reinforcement Learning
- Reinforcement Learning with Python
- Tracking Your Reinforcement Learning Models With Weights & Biases
- Summary
Let's start by answering the most fundamental question:
What Is Reinforcement Learning?
Reinforcement learning is a sub-field of machine learning in which an agent performs actions in an environment to maximize the cumulative future reward (the reward reinforces the learning). Reinforcement learning is the study of sequential decision-making and the optimization of those decisions.
The principle behind reinforcement learning is very simple: An agent is given a reward function (i.e. a reward for performing well) and a goal (a task to be rewarded), and over time, this agent learns to perform optimal actions to achieve maximum rewards.

Created using Stable Diffusion with the prompt: Nobel Prize ceremony, person giving a robot a prize, realistic.
There are a few key distinctions between other machine learning techniques like supervised and unsupervised learning:
- The first and foremost is the absence of a supervisor. In other words: the agent learns on its own through reward signals.
- There is also the distinction of delayed feedback: a reward is received only after performing actions.
- Finally, the agent and the environment interact with each other. The agent changes the environment and the environment changes the agent.
Additionally, there are a few terms of art with reinforcement learning that aren't particularly common in other forms of ML. We'll deal with those as they come up in context.
So let's dive in. We will start by defining a reinforcement learning problem, explore the components of a reinforcement learning task, learn the key mathematics behind Markov decision processes, and finally test our understanding by building a virtual self-driving car!
If you'd like to perform any of these experiments on your own, we'd also like to invite you to sign up for W&B. It's free to get started and takes just a few lines of code to try out.
Onto the RL:
How Does Reinforcement Learning Work?
As we saw earlier, the main premise behind reinforcement learning is using rewards to make our agent perform the desired action. Formally, it is called the Reward Hypothesis. Here's a definition from Richard Sutton and Andrew Barto's "Reinforcement Learning: An Introduction."
Reward Hypothesis: That all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called "reward").
The definition states that goals can be broken down as the maximization of the cumulative sum of rewards. This is inherent to most learning we ourselves encounter (as with most animals). When we are trying to teach a dog a new trick, we're unconsciously using the reinforcement framework.
In fact, let's spin this out a bit. Consider my dog Luna. She is a beagle pup who is frantic and has way too much energy. When I try to teach her:
- I don’t expect Luna to speak or understand any human language.
- I give her treats if she performs the correct action for my command. In other words, if I say ‘Sit’ and she sits, Luna gets a treat.
- I don’t give her anything if she performs no action or does something else. If I say ‘Roll’ and she sits down, I give her nothing.
At the beginning of the training, Luna will not understand or perform the necessary actions, but with enough training, she will learn to perform the optimal actions that maximize her treats.
Broadly, this is how reinforcement learning works. Let's define our dog training in an RL context, shall we?
- Luna (Dog) - Agent
- Me and the garden - Environment
- Treats - Reward
- State of Luna (Sitting, Rolling, Walking) - State
- Sitting, Rolling, and Walking - Actions

Reinforcement Learning with Luna, Image by Author
In our example, the Agent (Luna) performs an action, transitioning from one state to another (walking to sitting). After the transition, the agent receives a reward (treats). This agent-environment cycle repeats across all reinforcement learning tasks.
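To make that loop concrete, here is a minimal sketch of the agent-environment cycle using a made-up toy environment that mirrors the Luna example. The DogTrainingEnv class, its states, and its reward values are invented for illustration and are not part of any library.

```python
import random

# A hypothetical toy environment mirroring the Luna example (not a real library class).
class DogTrainingEnv:
    COMMANDS = ["sit", "roll"]
    ACTIONS = ["sit", "roll", "walk"]

    def reset(self):
        self.command = random.choice(self.COMMANDS)  # the owner's command acts as the state
        return self.command

    def step(self, action):
        reward = 1 if action == self.command else 0  # a treat only for the correct action
        next_state = random.choice(self.COMMANDS)    # a new command is issued
        self.command = next_state
        return next_state, reward

env = DogTrainingEnv()
state = env.reset()
total_reward = 0
for _ in range(10):                                  # the agent-environment cycle
    action = random.choice(DogTrainingEnv.ACTIONS)   # an untrained agent acts randomly
    state, reward = env.step(action)
    total_reward += reward
print("treats earned:", total_reward)
```

A trained agent would replace the random choice with a policy that picks the action matching the command, which is exactly what the reward signal encourages.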
Now, while the word "reward" usually has a positive connotation, in RL, we have negative rewards as well. Let's talk about the two major types of rewards in reinforcement learning.
Positive Reinforcement Learning
In this type of learning, the agent receives a positive reward for performing the desired action. In our example with Luna, when she rolls on the command ‘Roll’, she receives a treat (positive reward); in all other scenarios she receives no treats (zero reward).
Luna will remember the commands and desired actions for a long time because of the positive reinforcement. Positive reinforcement suits low-stakes situations (like training a dog) and game environments.
Negative Reinforcement Learning
In this type of learning, the agent receives a negative reward at all time steps for not performing the desired actions. This is called penalization. And no, we won't be penalizing Luna. We are good owners.
Instead, let's consider the case of autonomous driving. When we train an agent with only positive reinforcement on a task such as arriving at the destination, it may learn undesirable behaviors like speeding toward the goal to maximize rewards.
This would be dangerous to passengers and pedestrians; in the real world, it could even result in death. It is therefore necessary to penalize the agent for dangerous and unwanted actions, like driving down the wrong side of the road or ignoring signage.
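To see how positive and negative signals can combine in practice, here is a hypothetical reward function for a driving agent. The inputs (reached_goal, speed, wrong_side) and all the numeric values are assumptions chosen purely for illustration, not part of any real simulator.

```python
def driving_reward(reached_goal: bool, speed: float, wrong_side: bool,
                   speed_limit: float = 30.0) -> float:
    """Hypothetical reward: positive for progress, negative for unsafe behaviour."""
    reward = 0.0
    if reached_goal:
        reward += 100.0   # positive reinforcement for completing the task
    if speed > speed_limit:
        reward -= 10.0    # penalize speeding
    if wrong_side:
        reward -= 50.0    # penalize driving on the wrong side of the road
    reward -= 0.1         # small per-step penalty to discourage dawdling
    return reward

print(driving_reward(reached_goal=False, speed=45.0, wrong_side=False))  # -10.1
```

The shaping here is the point: the agent is still rewarded for reaching the goal, but unsafe shortcuts now cost more than they earn.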
The Intuition Behind Reinforcement Learning: A Mathematical Adventure
Before we continue, we need to understand some of the mathematics behind framing a reinforcement learning problem. For that purpose, I will be introducing one of my all-time favorite characters from the manga One Piece: Chopper.
Chopper is a reindeer who is also a doctor of the Straw Hat Pirates. Chopper is currently on Drum Island, which is a winter island filled with frozen lakes. Chopper is trying to get to the ship, to continue his epic voyage.
Buggy, one of the fiercest villains in the One Piece world, is out on the island searching for Chopper with his party of clowns.
Chopper needs to traverse the frozen lakes to get to the ship. If he lands on a block with clowns, he will be captured. If we frame this as a reinforcement learning problem, Chopper is the agent, and the frozen lake with clowns is the environment.
Chopper performs an action (walking straight, left, or right), then the environment guides the agent to the next state. Guiding implies that even if Chopper wanted to go straight, the slippery surface might make Chopper land on a different block. Whenever the agent transitions from a state, it receives a reward. In the case of Chopper, the reward is freedom (if he lands on blocks that don’t have unwanted clowns).

Chopper Escaping Clowns, Image by Author
History, State, and Action
For our agent to safely traverse the environment, the history (the sequence of observations, actions, and rewards seen so far) becomes a vital piece of information. Represented mathematically, it looks like this:
$$H_t = O_1, R_1, A_1, \ldots, A_{t-1}, O_t, R_t$$
Chopper decides his future course of action based on this history, and the environment guides Chopper (through observations) based on this history.
Now that we have defined the history, we run into the problem of irrelevant observations.
We cannot store the whole history; it quickly becomes too large and unwieldy to use. We need a way of storing only the relevant information. In our example, Chopper doesn’t need to know the location of every clown; all he needs is the location of the nearest clowns so that he can avoid them.
This relevant information is called the state. The state is a summary of the history that contains the most useful information, and in any given scenario that summary is the current state.
Mathematically, we need the current state to summarize the whole history:
$$\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \ldots, S_t]$$
Here, $S_{t+1}$ (the future state) depends only on the present state $S_t$ and not on the past states, because the current state summarizes all the previous states.
The property that emerges naturally here is called the Markov property.
Formally, the Markov property states that “the future is independent of the past given the present”.
Markov Decision Process
Our primary aim is to use this state (a sufficient statistic of the history) to make decisions. Here, some outcomes are under the control of the agent and some are random; it is essentially decision-making in stochastic environments.
These kinds of processes where the outcome is partly random and partly under the control of the decision maker are termed Markov Decision Processes (MDP). We formulate every reinforcement learning problem as an MDP.
A Markov Decision Process is represented as a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$:
- $\mathcal{S}$ is a (finite) set of states
- $\mathcal{A}$ is a finite set of actions
- $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}_{ss'}^{a} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$
- $\mathcal{R}$ is a reward function, $\mathcal{R}_{s}^{a} = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
- $\gamma$ is a discount factor, $\gamma \in [0, 1]$
We have already seen states, actions, and rewards. $\mathcal{P}$ is the state transition probability matrix: the probability of moving to the next state given the current state and action.
$\gamma$ is the discount factor, a value between 0 and 1. It can be thought of as the sightedness of the agent: if it is close to 0, the agent is myopic and prioritizes immediate rewards; if it is close to 1, the agent is far-sighted and values long-term rewards.
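To make the tuple concrete, here is a minimal sketch of a tiny MDP written out as plain Python data structures. The states, transition probabilities, and rewards describe an invented three-block slice of Chopper's frozen lake and are purely illustrative.

```python
import random

# A tiny, made-up MDP <S, A, P, R, gamma> for illustration.
states = ["start", "clown", "ship"]
actions = ["left", "right"]

# P[s][a] -> list of (next_state, probability); the ice is slippery, so outcomes are stochastic.
P = {
    "start": {"left":  [("clown", 0.8), ("start", 0.2)],
              "right": [("ship", 0.7), ("clown", 0.3)]},
}

# R[s][a] -> expected immediate reward.
R = {"start": {"left": -1.0, "right": +1.0}}

gamma = 0.9  # discount factor

def sample_transition(state, action):
    next_states, probs = zip(*P[state][action])
    next_state = random.choices(next_states, weights=probs)[0]
    return next_state, R[state][action]

print(sample_transition("start", "right"))
```

Note that the next state is sampled using only the current state and action, which is exactly the Markov property at work.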
Components of Reinforcement Learning Agent
Up until now, we were concerned with representing the reinforcement learning problem as a whole. Now we will think about the RL agent.
A reinforcement learning agent may include one or more of these three components (a small sketch of the first two follows the list):
- Policy ($\pi$): A mapping from states to actions. It formally describes how the agent should behave.
- Value function ($v_\pi$): The expected reward the agent receives if it follows a policy. It represents how “good” a certain state or state-action pair is, accounting for the discounted sum of future rewards.
- Model: The agent’s internal representation of the environment; it is how the agent thinks the environment will behave.
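As a rough, hypothetical sketch, a tabular agent's policy and value function can be as simple as two dictionaries; the states, actions, and numbers below are invented purely for illustration.

```python
# A hypothetical tabular agent for a handful of states (all values are made up).
policy = {                 # policy: maps each state to an action
    "start": "right",
    "near_clown": "left",
    "near_ship": "right",
}

value_function = {         # value function: expected discounted return from each state
    "start": 4.5,
    "near_clown": -2.0,
    "near_ship": 9.0,
}

state = "near_ship"
print(policy[state], value_function[state])   # what to do here, and how "good" this state is
```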
Taxonomy of Reinforcement Learning Algorithms
Based on these three components, we can roughly divide reinforcement learning methods into five main families, which are themselves further split:
- Value-Based Methods
- Policy-Based Methods
- Model-Based Methods
- Model-Free Methods
- Actor Critic

Taxonomy of Reinforcement Learning Agents, Source: David Silver's UCL Course
The problem with this classification is that the methods overlap, which can lead to confusion. I am going to stick with the classification of reinforcement learning algorithms into Model-Based and Model-Free.

Taxonomy of Reinforcement Learning Agents, Source: OpenAI Spinning Up
We will further explore these methods in detail in the upcoming blog posts.
Reinforcement Learning Applications
Personalized Recommendations
Recommendation systems have become an essential part of everyday life, from Amazon’s Suggested Products to Netflix’s Show Recommendations. These companies are using many complex algorithms and techniques to provide seamless recommendations.
One such framework is RecSim, a reinforcement learning framework developed by Google. It allows for the optimization of complex recommendation systems.
RecSim is used to uncover the latent (hidden) states of a data source (such as users). It also optimizes long-term click-through rate (CTR). It doesn’t fully solve the issue of combinatorial decision spaces (making a huge number of decisions, like show recommendations on Netflix), but it does provide better performance than vanilla reinforcement learning.
Robotics
Robotics involves complex machinery acting in unison to achieve some kind of goal. Framing a robotics problem is very similar to framing a reinforcement learning problem: there are states, actions, and goals.
Reinforcement learning is a great way to train a robot. However, robotic peripherals and parts are costly and damage-prone, so training a reinforcement learning agent on physical hardware risks damaging the robot. Simulation software and physics engines, like MuJoCo and Gazebo, solve these issues by providing virtual environments in which we can train our RL agents.
Robotics also typically involves continuous action spaces, which are difficult for standard value-based methods. QT-Opt, a variant of DQN designed for continuous actions, addresses this issue and is hence well suited to robotics problems.
Autonomous Driving
Autonomous driving is a growing field. Fully autonomous vehicles would revolutionize our navigation and safety. Some tasks of an autonomous vehicle include motion planning, trajectory optimization, and scenario-based learning policies.
Autonomous driving is also a good reinforcement learning problem: we can simulate the real world and let our agent train in it.
Amazon’s DeepRacer is an easy, beginner-friendly way to get started building your own autonomous vehicles. DeepRacer is a small racing car that is fully autonomous. It learns to traverse the track through reinforcement learning.
I have personally worked on this. A DeepRacer league took place in a college near my house and I took part in it. It was amazing to see reinforcement learning in action, with a very small learning curve.
In the competition, however, my car did not perform well. My friend did win the competition, so here are the pictures. :)

Autonomous Driving with AWS DeepRacer, my friend won the Championship, Image by Author
Finance and Trading
Finance and trading involve buying, holding, and selling stocks. Reinforcement learning agents can be trained to perform the same actions, and we can check the quality of their decisions by comparing their returns to those of the market. If they earn higher returns, they have “beaten” the market; if not, the market “beat” them.
We do have to keep in mind the classic reinforcement learning problem of exploration versus exploitation, which deals with the trade-off between short-term and long-term rewards.
Games
Game environments have an inherent reward signal and a clear goal, which makes them perfect for reinforcement learning.
Solving a retro Atari game or mastering the enormously complex game of Go is both interesting and difficult. DeepMind’s research and interest in solving these games have pushed the growth of reinforcement learning agents that are game-specific champions. AlphaGo beat Go world champion Lee Sedol with an impressive 4-1 result in 2016.
I personally feel that when reinforcement learning is taught with games, it is much more interesting and exciting.
When To Use Reinforcement Learning
Reinforcement learning, at the end of the day, is a tool. It is not necessarily great for every situation; in some situations, it achieves negative results. In high-stakes situations, reinforcement learning should be avoided: if our agent makes a mistake (which it is bound to do by design, since it learns through trial and error), the consequences could be dire, even deadly.
Reinforcement learning should be used in situations that present a clear reward signal and desired behavior. Most optimization problems can be thought of as reinforcement learning problems.
To illustrate, reinforcement learning is not suitable in the following cases:
- Generalization: Reinforcement learning is extremely goal-oriented; it cannot generalize well if new features are introduced.
- Low signal-to-noise ratio: If there are noisy features, our RL agent will fit the noise.
- Desirable actions cannot be predicted: If we used an RL agent for surgery, it might find an action that is undesirable and unpredictable.
- Long time horizons: RL struggles with decisions whose rewards arrive only far in the future.
Reinforcement Learning with Python
Now that we have some background, let's try our hand at some reinforcement learning.
The Environment
We are going to test out the `CarRacing-v0` environment. We will be using OpenAI Gym which offers a simple pythonic way of representing reinforcement learning problems. It also offers various environments out of the box, which reduces the time spent setting up the environment.
Our environment consists of a red race car (with powerful rear drive) and a track. The objective of the agent (car) is to maximize the cumulative rewards (points).
It is easier to learn with continuous control, which is the default. There are three continuous actions: steer, gas, and brake.
The environment uses negative reinforcement: it gives -0.1 every frame and +1000/N for every track tile visited, where N is the total number of tiles in the track. If our agent scores more than 900 points, we can consider the problem solved.

CarRacing-v0 environment, Image by Author
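To get a feel for the scoring described above, here is a quick back-of-the-envelope calculation. The tile count and episode length are assumed values chosen only for illustration.

```python
# Hypothetical episode: assume the track has 300 tiles and the agent
# visits every tile while taking 1000 frames to finish.
n_tiles = 300
frames = 1000

tile_reward = (1000 / n_tiles) * n_tiles   # +1000/N per tile, all N tiles visited -> +1000
frame_penalty = -0.1 * frames              # -0.1 per frame
print(tile_reward + frame_penalty)         # 900.0 -- right at the "solved" threshold
```

In other words, the agent is rewarded for covering the whole track and penalized for taking too long to do it.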
PPO
We will also be using Stable-Baselines3, a library of reliable implementations of reinforcement learning algorithms. It replaces hand-coding algorithms with simple import statements. We will be using the Proximal Policy Optimization (PPO) algorithm.
The main idea behind PPO is that the updated policy should not stray too far from the old policy. It uses clipping to avoid overly large updates.
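As a rough illustration of that clipping idea (a sketch, not Stable-Baselines3's internal code), the clipped surrogate objective can be written in a few lines of NumPy. The epsilon value and the toy ratios and advantages below are assumptions for demonstration.

```python
import numpy as np

def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """Clipped surrogate: mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return np.mean(np.minimum(unclipped, clipped))

# Toy numbers: a ratio of 1.5 with a positive advantage gets clipped back to 1.2.
ratios = np.array([1.5, 0.9])
advantages = np.array([1.0, -0.5])
print(ppo_clipped_objective(ratios, advantages))
```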
We will learn more about PPO in the “Policy Optimization in Model Free RL” blog post.
**Note:** You cannot run the following code in Colab. Rendering the environment in Colab is a bit complicated, so we will be avoiding it. You can download the notebook linked here and run it locally.
Step 1: Importing Libraries
We will install stable-baselines3, adding the [extra] suffix to download the optional dependencies and libraries stable-baselines3 needs to function. We will also install wandb, which we will use later for tracking our experiment metrics and performance.
```python
!pip install 'stable-baselines3[extra]'
!pip install wandb
```
```python
import gym
import os

from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.evaluation import evaluate_policy
```
Step 2: Load Environment
With gym.make we initialize a gym environment and pass it CarRacing-v0 argument to create the autonomous racing environment.
We vectorize our environment with DummyVecEnv. It takes a list of functions that each return an environment; using a lambda function, we can return our environment in a single expression.
```python
env = gym.make('CarRacing-v0')
env = DummyVecEnv([lambda: env])
```
Step 3: Train The RL Model
Before we proceed to the next step, we need to create a directory called Training. Inside this directory, create two more sub-directories called Logs and Saved Models. We will be saving our TensorBoard logs in Training/Logs.

Directory Structure, Image by Author
We are using the PPO algorithm: we initialize our model with PPO(), passing in the policy type and the environment.
We start the training process by calling the learn function. We train the initial model for 20,000 time steps.
```python
log_path = os.path.join('Training', 'Logs')

# Pass log_path as tensorboard_log so the TensorBoard logs end up in Training/Logs.
model = PPO('MlpPolicy', env, verbose=1, tensorboard_log=log_path)
model.learn(total_timesteps=20000)
```
Step 4: Evaluate the RL Model
We can use evaluate_policy to evaluate our model. We pass in render = True to visualize our environment and agent.
```python
evaluate_policy(model, env, n_eval_episodes=10, render=True)
```
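By default, evaluate_policy returns the mean and standard deviation of the episode reward, so you can also capture those values if you want to compare runs; the variable names below are just illustrative.

```python
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10, render=True)
print(f"mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")
```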
Tracking Your Reinforcement Learning Models With Weights & Biases
We need a comprehensive way to track our agent's performance, reproduce results, and visualize them. Weights & Biases is the perfect tool for the job. It provides seamless integration with Stable-Baselines3.
Step 1: Importing Libraries
```python
import gym
import os

import wandb
from wandb.integration.sb3 import WandbCallback

from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv, VecVideoRecorder
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor
```
Step 2: Initializing Our Run
All we need to do is create a config dictionary and initialize our run with wandb.init().
It will automatically log all the necessary metrics. We pass in sync_tensorboard=True to auto-upload Stable-Baselines3's TensorBoard metrics, and monitor_gym=True to auto-upload video footage of the agent interacting with the environment.
```python
config = {
    "policy_type": "MlpPolicy",
    "total_timesteps": 250000,
    "env_name": "CarRacing-v0",
}

run = wandb.init(
    project="intro_to_rl",
    config=config,
    sync_tensorboard=True,
    monitor_gym=True,
    save_code=True,
)
```
Step 3: Load Environment
The only additional step is VecVideoRecorder, which wraps a vectorized environment to record the rendered frames as videos. We start a recording every 2,000 time steps and record for 200 time steps.
```python
def make_env():
    env = gym.make(config["env_name"])
    env = Monitor(env)
    return env

env = DummyVecEnv([make_env])
env = VecVideoRecorder(
    env,
    f"videos/{run.id}",
    record_video_trigger=lambda x: x % 2000 == 0,
    video_length=200,
)
```
Step 4: Train RL Model
In the model.learn function, we pass in a WandbCallback, which automatically logs history data for all available metrics.
```python
model = PPO(config["policy_type"], env, verbose=1, tensorboard_log=f"runs/{run.id}")
model.learn(
    total_timesteps=config["total_timesteps"],
    callback=WandbCallback(
        gradient_save_freq=100,
        model_save_path=f"models/{run.id}",
        verbose=2,
    ),
)
```
Step 5: Saving RL Model
We can save our model using the model.save function.
```python
PPO_path = os.path.join('Training', 'Saved Models', 'PPO_Driving_model_250k')
model.save(PPO_path)
```
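If you want to reload the saved model later, for example to keep training it or to evaluate it in a fresh session, you can use PPO.load with the same path.

```python
# Reload the saved policy; passing env lets you continue training or evaluating it.
loaded_model = PPO.load(PPO_path, env=env)
```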
Step 6: Evaluate the RL Model
We can evaluate the model using evaluate_policy. We terminate the run by calling run.finish().
```python
evaluate_policy(model, env, n_eval_episodes=10, render=True)
run.finish()
```
Step 7: Visualization and Analysis
Once the run finishes, the logged metrics, the recorded videos of the agent, and the system stats all appear in the W&B dashboard, where you can visualize and compare runs.
Summary
In this article, we explored what reinforcement learning is and how it works: we formulated a reinforcement learning problem, looked at the types of reinforcement, covered the basic mathematics behind it, surveyed its applications and when (and when not) to use it, and finally topped it off by implementing reinforcement learning with Python and tracking our agent with Weights & Biases.
Hopefully, I made things easier for a beginner to get started with the wonderful world of reinforcement learning.
Recommended Reading
- David Silver’s Introduction to Reinforcement Learning
- “Reinforcement Learning: An Introduction” book by Richard Sutton and Andrew Barto.