Reinforcement Learning: Q-Learning
The PDF export does not render the graphs and gameplay recordings correctly. An online version of this report is available here - Report
Description
This project is about learning and implementing Deep Q-learning, and analyzing the effect of the discount factor and of replay-buffer (batch) training versus online training.
Q-Learning
In Q-learning we act with a policy that chooses actions according to the current utility estimates, and we try to learn a function Q(s, a) that predicts the utility (the expected discounted return) of taking action a in state s.
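Concretely, tabular Q-learning updates its estimate with the standard temporal-difference rule, where α is the learning rate and γ is the discount factor:

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]
$$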
We call it deep Q-learning because we learn this function as a neural network.
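As a rough sketch (assuming PyTorch, and hypothetical sizes of an 8-dimensional state and 4 discrete actions, as in a typical lander-style environment), the Q-function can be a small fully connected network:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per discrete action."""
    def __init__(self, state_dim=8, n_actions=4, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),  # Q(s, a) for every action a
        )

    def forward(self, state):
        return self.net(state)
```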
Algorithm
Online Deep Q-Learning

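A minimal sketch of the online variant, assuming the hypothetical `QNetwork` above; each transition is used for exactly one gradient step and then discarded:

```python
import torch
import torch.nn.functional as F

q_net = QNetwork()
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def online_q_step(state, action, reward, next_state, done):
    """One gradient step on a single transition (no replay buffer)."""
    state = torch.as_tensor(state, dtype=torch.float32)
    next_state = torch.as_tensor(next_state, dtype=torch.float32)
    q_pred = q_net(state)[action]
    with torch.no_grad():
        # Bootstrap target: r + gamma * max_a' Q(s', a'), zero after terminal states.
        q_target = reward + gamma * q_net(next_state).max() * (1.0 - float(done))
    loss = F.mse_loss(q_pred, q_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```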
Deep Q-Learning with Replay Buffer

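A corresponding sketch of the replay variant, reusing `q_net`, `optimizer`, and `gamma` from the sketch above (the buffer capacity and batch size are illustrative, not the exact values used here); transitions are stored and later sampled in random mini-batches, which decorrelates updates:

```python
import random
from collections import deque
import torch
import torch.nn.functional as F

buffer = deque(maxlen=50_000)  # holds (state, action, reward, next_state, done) tuples

def replay_q_step(batch_size=64):
    """One gradient step on a random mini-batch sampled from the buffer."""
    if len(buffer) < batch_size:
        return None
    batch = random.sample(buffer, batch_size)
    states, actions, rewards, next_states, dones = map(list, zip(*batch))
    states = torch.as_tensor(states, dtype=torch.float32)
    next_states = torch.as_tensor(next_states, dtype=torch.float32)
    actions = torch.as_tensor(actions, dtype=torch.int64)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    dones = torch.as_tensor(dones, dtype=torch.float32)

    # Q-values of the actions actually taken in the sampled transitions.
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = q_net(next_states).max(dim=1).values
        q_target = rewards + gamma * q_next * (1.0 - dones)
    loss = F.mse_loss(q_pred, q_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```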
Policy
I used an epsilon-greedy policy with an enhancement of epsilon decay: at the start the agent explores, and after a few iterations it becomes increasingly inclined to take greedy actions.
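A minimal sketch of this decaying epsilon-greedy policy, reusing the `q_net` sketch above (the decay rate and floor are illustrative values, not the exact ones used in the experiments):

```python
import random
import torch

epsilon, eps_min, eps_decay = 1.0, 0.01, 0.995

def select_action(state, n_actions=4):
    """Explore with probability epsilon, otherwise act greedily on Q-values."""
    global epsilon
    if random.random() < epsilon:
        action = random.randrange(n_actions)        # explore
    else:
        with torch.no_grad():
            q_values = q_net(torch.as_tensor(state, dtype=torch.float32))
        action = int(q_values.argmax())             # exploit
    epsilon = max(eps_min, epsilon * eps_decay)     # decay toward the floor
    return action
```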
Results
Analysis
We take a running average of the rewards over the last 50 episodes to measure model performance.
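For reference, this smoothing can be computed with a simple moving average (a sketch assuming the per-episode rewards live in a list called `episode_rewards`):

```python
import numpy as np

def running_average(episode_rewards, window=50):
    """Mean reward over each window of `window` consecutive episodes."""
    rewards = np.asarray(episode_rewards, dtype=float)
    return np.convolve(rewards, np.ones(window) / window, mode="valid")
```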
- With a lower value of gamma, we were not able to get high rewards, which is intuitive for our environment: the large reward (for the final goal) arrives at the end of the episode, when our rocket lands properly. The short calculation after this list makes the effect concrete.
- With online learning we were able to achieve good performance, but in the last few iterations our model's performance started to decrease. Our epsilon decay provided a good amount of exploration at the start, and our state space has low dimensionality, which I think is why the online version also converged.
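To see why a low gamma hides the landing reward, consider (as an illustrative case) a terminal reward $r$ received $T = 100$ steps in the future: it contributes $\gamma^{T} r$ to the value of the initial state. With $\gamma = 0.99$ that factor is $0.99^{100} \approx 0.37$, while with $\gamma = 0.5$ it is $0.5^{100} \approx 8 \times 10^{-31}$, so a low gamma makes the end-of-episode reward essentially invisible to the agent.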
Graphs