Fundamentals of Reinforcement Learning with Example Code
This article covers everything about reinforcement learning, from Sequential Decision Making and the Markov Decision Process to Return, Value Functions, Bellman Equations, and Dynamic Programming.
This article is your friendly introduction to the fundamentals of reinforcement learning. We will cover everything from Sequential Decision Making and the Markov Decision Process to Return, Policies, Value Functions, Bellman Equations, and Dynamic Programming.
Here's what we'll be covering:
- What is Reinforcement Learning?
- Our Reinforcement Learning Example
- Intergalactic Amnesia
- The Reinforcement Learning Challenge
- Sequential Decision Making
- Markov Decision Process
- Return and Policy
- Value Functions and Bellman Equations
- Bellman Equations
- Optimality
- Bellman Optimality Equation
- Dynamic Programming
- 1. Policy Evaluation
- 2. Policy Iteration
- 3. Value Iteration
- Conclusion
- Summary
- Recommended Reading and Credits
What is Reinforcement Learning?
Reinforcement Learning (RL) is the sub-field of machine learning in which an agent learns to perform optimal actions through continuous interaction with the environment.
In the previous blog posts, we got a general introduction to Reinforcement Learning and OpenAI Gym. They covered a lot of important material, but today we're going to dig into the fundamentals.
Our Reinforcement Learning Example
To help illustrate reinforcement learning, we will be solving problems from the perspective of Jean-Luc Picard. Jean-Luc is the captain of the Federation Starship USS Enterprise, and our Lieutenant Commander Data, an android, seems to be facing a huge problem.
Grab some popcorn 🍿 and put on your seat belts; we are about to go on an Interactive Intergalactic Adventure!!
Take your time with the exercises and try solving them all. They all build on top of previously introduced concepts, so I highly recommend going in order. Have fun, and remember to go boldly where no one has gone before!
Intergalactic Amnesia
🚀Cpt. Picard: Captain’s Log, Stardate 41153.7. A couple of hours ago, our Constitution-Class Starship, USS Enterprise, came upon the planet “A2410”. The planet seemed to be inhabited by some life form, immune to the high volume of hyperionic radiation abundantly present on the planet.
Hyperionic radiation would be fatal to humans. Hence, we agreed to send Lieutenant Commander Data as our ambassador. Upon landing, Data was attacked by a primitive life form. We beamed him back immediately, but Data doesn’t remember anything other than his name. Data’s memory has been wiped clean. Our resident doctor, Beverly Crusher, is looking into it as we speak.
🚀Cpt. Picard: Captain’s log, supplemental. Dr. Crusher confirmed that Data is suffering from amnesia. It has never been observed before in androids, hence the lack of a better term. Fortunately, Data seems to understand commands and perform them when needed. If we can teach Data to perform basic actions, we can probably retrieve his memory.
The Reinforcement Learning Challenge
Data is suffering from Intergalactic Amnesia. Fortunately, Data can understand basic commands and can be taught to perform some actions or sequences of actions. We can program every move Data should make, but this defeats the purpose of retrieving his memory. We should teach Data some skills. This is the perfect opportunity for our tool: Reinforcement Learning.
Sequential Decision Making
We are going to simplify the problem and teach Data to navigate autonomously: he should reach the door (goal) to his quarters without hitting any objects along the way.

Autonomous Navigation; Image by Author
Data is equipped with sensors throughout his body which can simulate the human nervous system. When he hits an object, he doesn’t “technically” feel any pain, but he can observe something is wrong.
Using code, we have complete control over his reward system. We can train him to navigate autonomously by using this reward system. His interaction with the environment can be summed up as follows:
- Data performs an action (e.g., Left)
- The performed action results in a change of state (Data moves left)
- This transition in state is accompanied by a reward (+10 for reaching the goal, -10 for hitting debris)
- Data performs another action, and so on
This interaction is a repetition of sequential decision-making done under uncertainty. It can be represented as an agent-interaction cycle.

Data-Spaceship Interaction Cycle; Image by Author
Generally,
- Data - Agent
- Starship Quarters - Environment
- Pain (-10 points) and Goal (+10 points) - Reward
- State of the environment (Position of Data) - States
- Left, Right, Forward, Backward - Actions
An important observation would be that the agent has no say in the change of state. The environment decides the change in the state depending on the action performed by the agent.
A good analogy for this agent-environment interaction would be the game of chess. The agent (Data) and the environment (Starship) take turns, like the two players in a chess game. The first move is made by Data (e.g., the action "Left"). The environment then performs the necessary change in state depending on that action (if the floor is slippery, Data could end up to the right instead). Understanding this will be important later.
Markov Decision Process
The agent-environment interaction produces a long sequence of observations.
This information is useful in assisting us with decision-making.
Now another question arises: how do we store these sequences?
We can store them completely, which would require large storage and a longer time to traverse. Or, we can store only the current state, which should be sufficient.
The first option would correspond to collecting the whole sequence. Formally described as the History, it contains every observation from the past.
The second option would be to look only at the current state, assuming that the current state is a sufficient statistic of the whole history.
In our example, the current state would be the current position of Data. We can calculate our trajectory and distance purely based on this current state.
Hence the current state is a sufficient statistic of the whole history. We need not know what Data did 2000 timesteps ago. All we need to know is the current state of Data.
Mathematically,
$$P(S_{t+1} \mid S_t) = P(S_{t+1} \mid S_1, S_2, \dots, S_t)$$
The future state ($S_{t+1}$) is independent of the past given the present state ($S_t$). This is formally described as the Markov Property.
Now, we move on to using this state (relevant information) to make a decision. Here the chess analogy comes in handy.
As in the game of chess, the decision maker (agent) has some control over the decision-making. But an equal role is played by the environment, which acts in a stochastic way (fancy term for probabilistic) and imparts randomness. These processes are formally termed Markov Decision Processes.
Every Reinforcement Learning problem can be framed as an MDP.
An MDP is formally described as a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$:
- $\mathcal{S}$ is a set of states
- $\mathcal{A}$ is a set of actions
- $\mathcal{P}$ is the State Transition Probability Matrix
- $\mathcal{R}$ is the reward function
- $\gamma$ is the discount factor
We need to delve further into two main concepts.
Firstly, the State Transition Probability Matrix. It is a matrix (table) that contains the probability of transition from the current state to the next.
Data performs the action Left. The environment now decides the probability of landing left or right (state change).

State Transition Probability; Image by Author
Mathematically,
$$\mathcal{P}_{ss'}^{a} = P(S_{t+1} = s' \mid S_t = s, A_t = a)$$
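To make this concrete, here is a minimal sketch in Python of how such a transition table could be stored: a mapping from (state, action) pairs to distributions over successor states. The state names and probabilities below are purely illustrative assumptions, not taken from the environment we build later.

```python
# Hypothetical transition table: P[(state, action)] maps successor states to probabilities.
P = {
    ("corridor", "Left"):  {"left_tile": 0.8, "right_tile": 0.1, "corridor": 0.1},
    ("corridor", "Right"): {"right_tile": 0.8, "left_tile": 0.1, "corridor": 0.1},
}

# Probability of landing on the left tile after Data chooses "Left" in the corridor
print(P[("corridor", "Left")]["left_tile"])  # 0.8
```

Note that each row of the table (each distribution over successor states) sums to 1.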
Now, we can move on to the Discount Factor. It is much easier to grasp once we understand Return (next section), so we will come back to it shortly. :)
Now, you should have the expertise to frame any reinforcement learning problem into an MDP. Remember an MDP should contain 5 essential components: States, Actions, State Transition Probability Matrix, Reward Function, and Discount Factor.
Intergalactic Exercise No. 1!!!
Formulate the problem of teaching Data autonomous navigation as an MDP.
💡
Did you try it? Let’s compare it with one possible solution.
One Possible Solution:
MDP
- $\mathcal{S}$ is the set of states, i.e., the positions of Data in the room
- $\mathcal{A}$ is the set of actions, i.e., Left, Right, Forward, and Backward
- $\mathcal{P}$ is the State Transition Probability Matrix. Every action has an 80% probability of executing the intended action and a 10% probability each of slipping to one of the two perpendicular actions.
- $\mathcal{R}$ is the reward function: -10 points if Data collides with any object and +10 points for reaching the goal.
- $\gamma$ is the discount factor.
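To make the five components tangible, here is a minimal sketch of how this MDP could be written down in Python. The container class and the tiny 2x2 room are illustrative assumptions; the actual 4x4 grid implementation appears later in the article.

```python
from dataclasses import dataclass

@dataclass
class MDP:
    states: list        # S: positions of Data in the room
    actions: list       # A: Left, Right, Forward, Backward
    transitions: dict   # P: (state, action) -> {next_state: probability}
    rewards: dict       # R: (state, action, next_state) -> reward
    gamma: float        # discount factor

# Illustrative instantiation for a tiny 2x2 room (not the full 4x4 grid used later)
mdp = MDP(
    states=[(r, c) for r in range(2) for c in range(2)],
    actions=["Left", "Right", "Forward", "Backward"],
    transitions={},   # would hold the 80/10/10 split described above
    rewards={},       # -10 for debris, +10 for the goal
    gamma=0.99,
)
print(mdp.actions)
```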
🚀Cpt. Picard: Captain’s Log, Supplemental. We have formulated the problem of teaching Data into an MDP. We are making progress, albeit at a slower pace; we have to hurry.
Return and Policy
We have defined the problem; now, we need to solve this problem. How can we be sure that we have solved the problem?
One good way to be sure of the solution is to assign a metric that will keep a tab of our rewards. Analogous to the “Score” in any video game, it gives us real-time performance.
This “Score” would be better if it could predict the reward we are expected to get from the current state. This is formally described as the Return. The Return is the total discounted reward starting from time step $t$.
Mathematically,
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
Discount Factor is rearing its head again. Let us confront it.
The discount factor $\gamma$ is a numerical value between 0 and 1. It is analogous to a lens: the closer it is to 0, the more myopic the agent becomes, i.e., it concentrates only on short-term rewards; the closer it is to 1, the more it concentrates on long-term rewards.
We introduce discounting because of the delayed nature of expected rewards. We only receive a reward if we transition to a successor state. We can go for immediate reward or go for long-term reward.
Discounting is found in nature. One such example is diet; we can concentrate on short-term rewards and eat the sugary treat or focus on long-term rewards and eat vegetables.
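As a small worked example, here is a sketch of how the return could be computed from a list of rewards for a given discount factor. The reward sequence below is made up purely for illustration.

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ..."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

rewards = [0, 0, -10, 0, 10]              # an illustrative trajectory of rewards
print(discounted_return(rewards, 0.9))    # far-sighted: later rewards still count (~ -1.54)
print(discounted_return(rewards, 0.1))    # myopic: only the first few rewards matter (~ -0.1)
```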
Coming back to the problem, it would be amazing if we could have an assistant which can guide Data through the optimal actions. This assistant can be as simple as a lookup table, which shows the probability of taking an action for a given state.
This table is formally described as Policy.
Mathematically, a policy is a distribution over actions given states:
$$\pi(a \mid s) = P(A_t = a \mid S_t = s)$$
Our simple lookup table can locate the path to the door and guide us by showing the next action with the highest probability.

Policy Lookup Table; Image by Author
Policies vary in complexity. A policy can be as simple as “take Left in every state”, or it can be a complex, long-winded instruction set. Numerically, we can have many policies: $\pi_1, \pi_2, \dots, \pi_n$.
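For concreteness, here is a minimal sketch of such a lookup-table policy in Python, with hypothetical state names and probabilities:

```python
# pi(a | s): for each state, a probability distribution over actions
policy = {
    "start":    {"Left": 0.1, "Right": 0.7, "Forward": 0.1, "Backward": 0.1},
    "corridor": {"Left": 0.0, "Right": 0.0, "Forward": 0.9, "Backward": 0.1},
}

def act(state):
    # follow the policy greedily: pick the action with the highest probability
    return max(policy[state], key=policy[state].get)

print(act("start"))  # Right
```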
How do we distinguish such policies and arrive at the best policy? After all, we want to find the best possible way to reach the goal state.
Value Functions and Bellman Equations
The answer lies in comparing states or state-action pairs.
We can only compare two different things if we have some numerical value attached to them.
We can attach numerical values to states and compare two or more of them. This will help our agent in navigating successor states.
For example, if Data had a map that showed values distributed over states, he could continuously traverse to the successor state which has the highest value.

State Values; Image by Author
This numerical value given to a state is called “State-Value”. We can also do the same for a state-action pair and term it the “Action-Value”.
We could assign arbitrary values to the states, but that would not assist Data in navigating. Hence, we assign values to states based on the expected return we can receive starting from that state and following some policy.
Mathematically, we assign values to states with the help of the state-value function:
$$v_{\pi}(s) = \mathbb{E}_{\pi}\left[ G_t \mid S_t = s \right]$$
The state-value function $v_{\pi}(s)$ of an MDP is the expected return the agent can receive starting from state $s$ and following policy $\pi$.
We can do the same for state-action pairs. $q_{\pi}(s, a)$ is the action-value function: the expected return starting from state $s$, taking action $a$, and thereafter following policy $\pi$:
$$q_{\pi}(s, a) = \mathbb{E}_{\pi}\left[ G_t \mid S_t = s, A_t = a \right]$$
We can now substitute the complete equation for the return into the value-function equations.
For the state-value function,
$$v_{\pi}(s) = \mathbb{E}_{\pi}\left[ R_{t+1} + \gamma\, v_{\pi}(S_{t+1}) \mid S_t = s \right]$$
We can observe from this equation that the state-value function of the current state $s$ equals the expected immediate reward plus the discounted value of the successor state $S_{t+1}$.
Following the same substitution for the action-value function, we get
$$q_{\pi}(s, a) = \mathbb{E}_{\pi}\left[ R_{t+1} + \gamma\, q_{\pi}(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a \right]$$
In all of these equations, we use the term $\mathbb{E}$. It represents the expectation, which is a fancy term for the average value. Let’s do a quick intergalactic exercise.
Intergalactic Exercise No. 2!!!
Data is faced with a particularly sticky problem. He has a 30% chance of stepping left and receiving 20 points or a 70% chance of stepping right and receiving 10 points. What is the expected (average) reward?
💡
You can look at the following figure for a visual perspective.

Finding Average; Image by Author
We intuitively do the following math:
$$0.3 \times 20 + 0.7 \times 10 = 6 + 7 = 13$$
How did we do this?
We multiplied the probability of taking that particular action by the reward we receive from that particular state-action pair. This is essentially multiplying the policy $\pi$ with the action value of the state-action pair $q_{\pi}(s,a)$.
Hence,
$$v_{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, q_{\pi}(s, a)$$
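In code, this is just a weighted average. The numbers below come straight from the exercise, treating each action's one-step reward as its action value:

```python
pi = {"Left": 0.3, "Right": 0.7}   # policy: probability of picking each action
q = {"Left": 20, "Right": 10}      # action values (here, just the one-step rewards)

# v_pi(s) = sum over actions of pi(a|s) * q_pi(s, a)
v = sum(pi[a] * q[a] for a in pi)
print(v)  # 13.0
```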
Pictorially, it can be represented as follows

One Step Look Ahead from the current state; Image by Author
The empty dot represents state $s$, and the filled dots represent two possible actions and their action values $q_{\pi}(s, a)$. Think of this as the agent’s chance in the game of chess.
We can now use the action value as the starting point. It is now the environment’s chance. The environment provides a reward and transitions to a successor state based on the State Transition Probability Matrix ($\mathcal{P}$).
Mathematically,
$$q_{\pi}(s, a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^{a}\, v_{\pi}(s')$$

One Step Look Ahead from Chosen Action, Controlled by Environment; Image by Author
This is called a one-step look ahead. We look to the next successor state or state-action pair, determine the value and return to the current state. We step once, see the value, return, and calculate the current value.
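Here is a minimal sketch of this one-step look ahead in Python. The transition probabilities and value estimates are hypothetical, but the backup itself follows the equation above:

```python
gamma = 0.9
reward = 0                              # R_s^a: immediate reward for taking the action
P = {"s_left": 0.8, "s_stay": 0.2}      # P_ss'^a: where the environment may put us
v = {"s_left": 5.0, "s_stay": 1.0}      # current estimates of the successor state values

# q_pi(s, a) = R_s^a + gamma * sum_s' P_ss'^a * v_pi(s')
q = reward + gamma * sum(P[s_next] * v[s_next] for s_next in P)
print(q)   # 0 + 0.9 * (0.8*5.0 + 0.2*1.0) = 3.78
```

The calculateU function later in the article performs essentially this backup for our grid world.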
We can now do a two-step look ahead. I am going to skip the reasoning, and I assume you understood it. If not, hit me up in the comments section.

Two-Step Look Ahead from Current Position; Image by Author
From the two-step look ahead, we arrive at the following equations:
$$v_{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^{a}\, v_{\pi}(s') \right)$$
$$q_{\pi}(s, a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^{a} \sum_{a' \in \mathcal{A}} \pi(a' \mid s')\, q_{\pi}(s', a')$$
Wow, that was a ton of math. Let us combine all the major equations and put them under one sub-section.
Bellman Equations
1. $v_{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, q_{\pi}(s, a)$
2. $q_{\pi}(s, a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^{a}\, v_{\pi}(s')$
3. $v_{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^{a}\, v_{\pi}(s') \right)$
4. $q_{\pi}(s, a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^{a} \sum_{a' \in \mathcal{A}} \pi(a' \mid s')\, q_{\pi}(s', a')$
These equations are collectively called the Bellman Equations. The first two express each value function in terms of its counterpart, i.e., $v_{\pi}$ in terms of $q_{\pi}$ and vice versa. The last two describe each value function in terms of itself.
What is the use of these equations?
These equations introduce recursion into the mix. Recursion is when a bigger problem is solved by solving smaller instances of the same problem.
Optimality
So far, we have been calculating value functions for successor states and state-action pairs. The search would be much more efficient if we looked only for the value functions with the maximum values.
Then there would be an optimal state-value function, which has the maximum state value over all policies.
Mathematically,
$$v_*(s) = \max_{\pi} v_{\pi}(s)$$
Similarly, we can repeat it for action values:
$$q_*(s, a) = \max_{\pi} q_{\pi}(s, a)$$
An MDP is solved when we find this optimal value function. We can use this optimal value function to reach the terminal state and receive maximal rewards.
This introduces an interesting property: if we can differentiate and order different value functions, we can also order policies, thereby leading to the optimal policy $\pi_*$.
Now, we can apply optimality to all the Bellman Equations.
We take the one-step look ahead from state $s$ to its action values and choose the maximum action value, represented with the arc.

One Step Optimality; Image by Author
This results in,
$$v_*(s) = \max_{a} q_*(s, a)$$
$$q_*(s, a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^{a}\, v_*(s')$$
Note that the second equation has the same form as Bellman Equation 2 because this step is not under the control of the agent; the environment has complete control.
Similarly, we proceed to the last couple of equations.
Putting them all together,
Bellman Optimality Equation
1. $v_*(s) = \max_{a} q_*(s, a)$
2. $q_*(s, a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^{a}\, v_*(s')$
3. $v_*(s) = \max_{a} \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^{a}\, v_*(s') \right)$
4. $q_*(s, a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^{a} \max_{a'} q_*(s', a')$
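These equations are what the code later in the article implements: Value Iteration repeatedly applies equation 3, and extracting a greedy policy takes the argmax behind equation 1. As a tiny sketch with hypothetical transitions and value estimates, picking the greedy action could look like this:

```python
gamma = 0.9
# For each action: R_s^a and the successor distribution P_ss'^a (hypothetical numbers)
backups = {
    "Left":  (0, {"s1": 0.8, "s2": 0.2}),
    "Right": (0, {"s2": 0.8, "s1": 0.2}),
}
v_star = {"s1": 5.0, "s2": 1.0}   # current optimal-value estimates

def q_star(action):
    r, P = backups[action]
    return r + gamma * sum(p * v_star[s] for s, p in P.items())

best_action = max(backups, key=q_star)    # argmax over actions of q_*(s, a)
print(best_action, q_star(best_action))   # Left 3.78
```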
Until now, we have formulated MDP and evaluated value functions and policy. Now, we can get to solving the MDP.
🚀Cpt. Picard: Captain’s Log, Supplemental. We have assigned state-value to different states and are prepared to solve the MDP. We want our dear Lieutenant back.
Dynamic Programming
Dynamic Programming is a problem-solving methodology that splits a complex problem into recursive sub-problems and solves them. It stores these solutions and reuses them to solve the bigger problem. It is one way of solving an MDP.
It is well suited to Reinforcement Learning because of the following properties:
- Recursion: the Bellman Equations introduce recursion
- Storage: value functions store values and reuse them
There are three paradigms for Dynamic Programming:
- Policy Evaluation
- Policy Iteration
- Value Iteration
We have been dealing with theory for a long while. In this section, we will switch gears and implement the theory. We will learn a bit of theory and implement it in code. Let us get started.
1. Policy Evaluation
This is a prediction task: a policy is supplied, and we only evaluate the given policy.
We evaluate the policy by finding the value function of that policy. We start with a random guess, call it $v_0$, and iteratively apply the Bellman Expectation Equation to converge to the true value function $v_{\pi}$ of the given policy.
The value functions are updated synchronously: we update the current value of state $s$ using the previous iteration’s values of the successor states $s'$.
Mathematically,
$$v_{k+1}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^{a}\, v_k(s') \right)$$
This iterative process is stopped when the difference between iterations is minimal.
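Before we get to the full grid-world implementation below, here is a minimal sketch of iterative policy evaluation on a deliberately tiny, hypothetical two-state MDP; the states, rewards, and transitions are made up just to show the update loop.

```python
gamma = 0.9
states = ["A", "B"]
# Under the fixed policy being evaluated: reward and successor distribution for each state
dynamics = {
    "A": (0, {"A": 0.5, "B": 0.5}),   # R = 0, then 50/50 to A or B
    "B": (1, {"A": 1.0}),             # R = 1, then back to A
}

v = {s: 0.0 for s in states}          # v_0: start from an arbitrary guess
for k in range(1000):                 # iteratively apply the Bellman expectation backup
    v_next = {}
    for s in states:
        r, P = dynamics[s]
        v_next[s] = r + gamma * sum(p * v[s2] for s2, p in P.items())
    delta = max(abs(v_next[s] - v[s]) for s in states)
    v = v_next
    if delta < 1e-6:                  # stop when successive iterations barely change
        break

print(v)   # value estimates for the given policy
```

The policyEvaluation function in the next section performs the same synchronous sweep, just specialized to our 4x4 grid and logged to W&B.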
2. Policy Iteration
Policy Iteration involves alternating between policy evaluation and policy improvement.
Here is the algorithm:
- Randomly initialize the policy $\pi$
- Repeat until the difference between the current policy and the previous policy is minimal:
a. Evaluate the state-value function $v_{\pi}$ by policy evaluation.
b. Perform synchronous updates; for each state $s$, let
$$\pi(s) \leftarrow \arg\max_{a \in \mathcal{A}} \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^{a}\, v_{\pi}(s') \right)$$
Intergalactic Exercise No. 3!!!
Implement Policy Iteration in any programming language of your preference. It should initialize with a random policy, perform policy evaluation (calculating state value) and perform synchronous updates (until the difference is minimal).
💡
Try coming up with a structure and compare it with the code below.
2.1 Import Libraries
```python
import random
import wandb
```
We import random to initialize a random policy and import wandb to log the results and intermediate state values.
2.2 Arguments and Setup
```python
REWARD = 0
DISCOUNT = 0.99
MAX_ERROR = 10**(-3)
```
We set REWARD to 0: the environment provides zero reward for every intermediate state transition. We set DISCOUNT to 0.99, prioritizing long-term rewards. We also set MAX_ERROR to 10^-3; it is used as the convergence threshold when comparing two successive state-value functions.
```python
# Set up
NUM_ACTIONS = 4
ACTIONS = [(1, 0), (0, -1), (-1, 0), (0, 1)]  # Down, Left, Up, Right
NUM_ROW = 4
NUM_COL = 4
U = [[  0,   0,   0,   0],
     [-10,   0, -10, +10],
     [  0, -10,   0,   0],
     [  0,   0,   0, -10]]
policy = [[random.randint(0, 3) for j in range(NUM_COL)] for i in range(NUM_ROW)]  # random policy
```
There are four possible actions: Down, Left, Up, and Right. We then set up the initial state values for the grid: every position other than the debris and goal cells is set to 0 (debris cells hold -10, the goal holds +10). Finally, we create a random policy using random.randint.
2.3 Logging using W&B
```python
config = {
    "iteration_type": "Policy Iteration",
    "environment_name": "Intergalactic Amnesia"
}

run = wandb.init(
    project="fundamentals_rl",
    config=config,
    save_code=True,
)
```
We create a config dictionary that holds the iteration_type and environment_name keys. We set them to Policy Iteration and Intergalactic Amnesia. We initialize run and start the logging process.
2.4 Visualization
```python
# Visualization
def printEnvironment(arr, policy=False):
    res = ""
    for r in range(NUM_ROW):
        res += "|"
        for c in range(NUM_COL):
            if r == 1 and c == 0:
                val = "-10"
            elif r == 2 and c == 1:
                val = "-10"
            elif r == 1 and c == 3:
                val = "+10"
            elif r == 1 and c == 2:
                val = "-10"
            elif r == 3 and c == 3:
                val = "-10"
            else:
                val = ["Down", "Left", "Up", "Right"][arr[r][c]]
            res += " " + val[:5].ljust(5) + " |"  # format
        res += "\n"
    print(res)
```
We create our visualization function. For the cells containing debris and the goal, we print their reward values; for every other cell, we look up the policy and print the action it prescribes.
2.5 Calculating State Value Function
```python
# getU function
def getU(U, r, c, action):
    dr, dc = ACTIONS[action]
    newR, newC = r + dr, c + dc
    if newR < 0 or newC < 0 or newR >= NUM_ROW or newC >= NUM_COL:  # collision with boundary
        return U[r][c]
    else:
        return U[newR][newC]

# calculateU function
def calculateU(U, r, c, action):
    u = REWARD
    u += 0.1 * DISCOUNT * getU(U, r, c, (action - 1) % 4)
    u += 0.8 * DISCOUNT * getU(U, r, c, action)
    u += 0.1 * DISCOUNT * getU(U, r, c, (action + 1) % 4)
    return u
```
The getU function returns the state value of the cell Data would land in after taking an action from position (r, c). We look up the displacement (dr, dc) for the action; if the move would take Data outside the grid, we return the value of the current cell (he stays put), otherwise we return the value of the successor cell.
We use the calculateU function to account for the State Transition Probability, $\mathcal{P}$. Whenever Data performs an action, he has an 80% chance of landing where he intended and a 10% chance each of slipping to one of the two perpendicular directions.
2.6 Policy Evaluation
```python
# Evaluate the policy
def policyEvaluation(policy, U):
    while True:
        nextU = [[  0,   0,   0,   0],
                 [-10,   0, -10, +10],
                 [  0, -10,   0,   0],
                 [  0,   0,   0, -10]]
        error = 0
        for r in range(NUM_ROW):
            for c in range(NUM_COL):
                if (r, c) in ((1, 0), (2, 1), (1, 2), (1, 3), (3, 3)):  # skip debris and goal cells
                    continue
                nextU[r][c] = calculateU(U, r, c, policy[r][c])  # simplified Bellman update
                error = max(error, abs(nextU[r][c] - U[r][c]))
        U = nextU
        # log each cell's value as Position_1 .. Position_16 (bottom-left to top-right)
        wandb.log({f"Position_{(NUM_ROW - 1 - r) * NUM_COL + c + 1}": U[r][c]
                   for r in range(NUM_ROW) for c in range(NUM_COL)})
        if error < MAX_ERROR * (1 - DISCOUNT) / DISCOUNT:
            break
    return U
```
We create the policyEvaluation method to evaluate a policy. During every iteration, we reset the state values to the initial state-value grid. We make sure not to disturb the state values of Debris and Goals. Hence we use continue to skip those positions. We calculate the new state value and check the difference with the previous iteration. We make updates and log the state values.
Every position is split and logged individually. We return the current state-value list.
2.7 Policy Iteration
```python
def policyIteration(policy, U):
    print("During the policy iteration:\n")
    while True:
        U = policyEvaluation(policy, U)
        unchanged = True
        for r in range(NUM_ROW):
            for c in range(NUM_COL):
                if (r, c) in ((1, 0), (2, 1), (1, 2), (1, 3), (3, 3)):  # skip debris and goal cells
                    continue
                maxAction, maxU = None, -float("inf")
                for action in range(NUM_ACTIONS):
                    u = calculateU(U, r, c, action)
                    if u > maxU:
                        maxAction, maxU = action, u
                if maxU > calculateU(U, r, c, policy[r][c]):
                    policy[r][c] = maxAction  # the action that maximizes the utility
                    unchanged = False
        if unchanged:
            break
        printEnvironment(policy)
    return policy
```
We create the policyIteration method to iterate over policies: we evaluate the current policy and update it wherever an improvement is found. Here, we again skip the positions with debris and the goal.
The update is done by calculating the state value for every possible action and choosing the action with the maximum value.
2.8 Main
print("The initial random policy is:\n")printEnvironment(policy)# Policy iterationpolicy = policyIteration(policy, U)# Optimal policyprint("The optimal policy is:\n")printEnvironment(policy)
We first print the initial random policy, then perform policy iteration (printing the intermediate policies along the way), and finally print the optimal policy.
This is a throwback to the earlier illustration showing the state values for all positions. Those were not arbitrary values but actual calculated state values!
As we can observe from the logs, the value for each state (position) starts out arbitrary and slowly converges to its true value.
Now, we can move to Value Iteration.
3. Value Iteration
Value iteration is another way to arrive at optimal value functions and policy.
Here is the algorithm:
- For each state $s$, initialize $v_0(s)$ (e.g., to zero)
- Repeat until the difference between the current value function and the previous value function is minimal:
a. Perform synchronous updates; for each state $s$,
$$v_{k+1}(s) = \max_{a \in \mathcal{A}} \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^{a}\, v_k(s') \right)$$
Important Distinction: Policy Iteration iterates through policies and their corresponding value functions; Value Iteration, on the other hand, iterates only through value functions. If we stop Policy Iteration partway through, we still get a perfectly fine working policy. We cannot guarantee the same for Value Iteration: intermediate value functions may or may not correspond to any real policy.
Of the above paradigms, only Policy Iteration and Value Iteration can solve an MDP (Policy Evaluation is only a prediction task). Any Reinforcement Learning problem can be represented as an MDP, and our Dynamic Programming paradigms can solve it.
Intergalactic Exercise No. 4!
This is the final exercise. We have seen the Value Iteration algorithm and have practiced coding for Policy Iteration. Use all that knowledge to code a Value Iteration program.
💡
Hopefully, you tried coding!
Here is one possible solution. I am not going to explain in detail as I have already explained most of this code in the previous section.
```python
import wandb

# Arguments
REWARD = 0
DISCOUNT = 0.99
MAX_ERROR = 10**(-3)

# Set up
NUM_ACTIONS = 4
ACTIONS = [(1, 0), (0, -1), (-1, 0), (0, 1)]  # Down, Left, Up, Right
NUM_ROW = 4
NUM_COL = 4
U = [[  0,   0,   0,   0],
     [-10,   0, -10, +10],
     [  0, -10,   0,   0],
     [  0,   0,   0, -10]]

# Logging
config = {
    "iteration_type": "Value Iteration",
    "environment_name": "Intergalactic Amnesia"
}
run = wandb.init(
    project="fundamentals_rl",
    config=config,
    save_code=True,
)

# Visualization
def printEnvironment(arr, policy=False):
    res = ""
    for r in range(NUM_ROW):
        res += "|"
        for c in range(NUM_COL):
            if r == 1 and c == 0:
                val = "-10"
            elif r == 2 and c == 1:
                val = "-10"
            elif r == 1 and c == 3:
                val = "+10"
            elif r == 1 and c == 2:
                val = "-10"
            elif r == 3 and c == 3:
                val = "-10"
            else:
                val = ["Down", "Left", "Up", "Right"][int(arr[r][c])]
            res += " " + val[:5].ljust(5) + " |"  # format
        res += "\n"
    print(res)

# getU function
def getU(U, r, c, action):
    dr, dc = ACTIONS[action]
    newR, newC = r + dr, c + dc
    if newR < 0 or newC < 0 or newR >= NUM_ROW or newC >= NUM_COL:  # collide with the boundary or the wall
        return U[r][c]
    else:
        return U[newR][newC]

# CalculateU
def calculateU(U, r, c, action):
    u = REWARD
    u += 0.1 * DISCOUNT * getU(U, r, c, (action - 1) % 4)
    u += 0.8 * DISCOUNT * getU(U, r, c, action)
    u += 0.1 * DISCOUNT * getU(U, r, c, (action + 1) % 4)
    return u

def valueIteration(U):
    print("Value iteration:\n")
    while True:
        nextU = [[  0,   0,   0,   0],
                 [-10,   0, -10, +10],
                 [  0, -10,   0,   0],
                 [  0,   0,   0, -10]]
        error = 0
        for r in range(NUM_ROW):
            for c in range(NUM_COL):
                if (r, c) in ((1, 0), (2, 1), (1, 2), (1, 3), (3, 3)):  # skip debris and goal cells
                    continue
                nextU[r][c] = max([calculateU(U, r, c, action) for action in range(NUM_ACTIONS)])  # Bellman update
                error = max(error, abs(nextU[r][c] - U[r][c]))
        U = nextU
        # log each cell's value as Position_1 .. Position_16 (bottom-left to top-right)
        wandb.log({f"Position_{(NUM_ROW - 1 - r) * NUM_COL + c + 1}": U[r][c]
                   for r in range(NUM_ROW) for c in range(NUM_COL)})
        if error < MAX_ERROR * (1 - DISCOUNT) / DISCOUNT:
            break
    return U

# Optimal Policy from U
def getOptimalPolicy(U):
    policy = [[-1, -1, -1, -1] for i in range(NUM_ROW)]
    for r in range(NUM_ROW):
        for c in range(NUM_COL):
            if (r, c) in ((1, 0), (2, 1), (1, 2), (1, 3), (3, 3)):  # skip debris and goal cells
                continue
            # Choose the action with the maximum value
            maxAction, maxU = None, -float("inf")
            for action in range(NUM_ACTIONS):
                u = calculateU(U, r, c, action)
                if u > maxU:
                    maxAction, maxU = action, u
            policy[r][c] = maxAction
    return policy

print("The initial U is:\n")
printEnvironment(U)

# Value iteration
U = valueIteration(U)

# Optimal Policy
policy = getOptimalPolicy(U)
print("The optimal policy is:\n")
printEnvironment(policy, True)
```
The only new addition to the code is the valueIteration method. It performs the same Bellman update but takes the maximum over all actions instead of following a fixed policy, and it never converts the intermediate value functions into a policy. We keep iterating until we converge to the optimal state values.
This is very similar to Policy Iteration, but it typically converges faster because we never run a full policy evaluation to convergence in the intermediate stages.
Conclusion
One main problem with Dynamic Programming techniques, i.e., everything we have done so far, is their time to convergence. Time to convergence can be thought of as the total time it takes to solve an MDP, and DP techniques can take a long time to converge.
Dynamic Programming techniques sweep through every state in every iteration, so they do not scale well as the number of states grows.
There are algorithms out there that are much more efficient than Dynamic Programming and converge at a faster rate.
We will be seeing a few of these algorithms in the upcoming blog posts!!
🚀Cpt. Picard: Captain’s Log. Supplemental. We solved the MDP, and it miraculously brought back all the lost memory. Our Lieutenant Commander Data is back and fully functional. We can now continue our journey through the Alpha Quadrant!
Summary
In this blog post, we traveled with Captain Picard to understand the fundamentals of Reinforcement Learning. We covered Sequential Decision Making, Markov Decision Processes, Return, Policies, Value Functions, Bellman Equations, and Dynamic Programming.
Thank you for sticking with the piece, and thank you for taking your time and doing the exercises. Hopefully, I transferred everything I intended to.
Recommended Reading and Credits
- SparkShen02’s starter code for MDP - It went through a lot of modification, but the source remains the same.
- DALL E's rendition of Data from Star Trek: The Next Generation