
Reinforcement Learning: Q learning

In the PDF export, the graphs and gameplay recordings are not displayed correctly. An online version of this report can be found here - Report

Description

This project is about learning and implementing deep Q-learning and analyzing the effect of the discount factor and of replay-buffer (batch) training versus online training.


Q learning

In Q-learning we use a policy that chooses actions according to our utility estimates, and we try to learn a function that predicts this utility for a given state and action.
We call it deep Q-learning because we learn this function as a neural network.
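The function approximator can be a small multilayer perceptron that maps a state vector to one Q-value per discrete action. Below is a minimal sketch, assuming PyTorch; the layer sizes and the `state_dim`/`n_actions` names are illustrative, not necessarily the exact architecture used in this project.

```python
# A minimal Q-network sketch, assuming PyTorch. Sizes are illustrative.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        # Maps a state vector to one Q-value per discrete action.
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)
```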

Algorithm

Online Deep Q learning
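In the online variant, the network is updated immediately on every transition as it is experienced, with no storage of past experience. Below is a hedged sketch of a single update step, assuming the `QNetwork` above sits inside a Gym-style interaction loop; `gamma` and the squared-error loss are illustrative choices.

```python
# A sketch of one online deep Q-learning update on a single transition.
# Assumes a q_net (e.g. the QNetwork above) and a torch optimizer.
import torch
import torch.nn.functional as F

def online_update(q_net, optimizer, s, a, r, s_next, done, gamma=0.99):
    s = torch.as_tensor(s, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)
    # TD target: r + gamma * max_a' Q(s', a'); no bootstrap at episode end.
    with torch.no_grad():
        target = r + (1.0 - float(done)) * gamma * q_net(s_next).max()
    prediction = q_net(s)[a]  # Q(s, a) for the action actually taken
    loss = F.mse_loss(prediction, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```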



Deep Q learning with Replay Buffer
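In the replay-buffer variant, transitions are first stored and the network is trained on random minibatches drawn from the buffer, which decorrelates consecutive samples and reuses experience. A minimal sketch under the same assumptions as above; the buffer and batch sizes are illustrative.

```python
# A sketch of a deep Q-learning update from a replay buffer.
import random
from collections import deque
import torch
import torch.nn.functional as F

buffer = deque(maxlen=50_000)  # holds (s, a, r, s_next, done) tuples

def replay_update(q_net, optimizer, batch_size=64, gamma=0.99):
    if len(buffer) < batch_size:
        return  # wait until enough experience has been collected
    batch = random.sample(buffer, batch_size)
    s, a, r, s_next, done = map(list, zip(*batch))
    s = torch.as_tensor(s, dtype=torch.float32)
    a = torch.as_tensor(a, dtype=torch.int64)
    r = torch.as_tensor(r, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)
    done = torch.as_tensor(done, dtype=torch.float32)
    # Batched TD targets, masking the bootstrap term on terminal steps.
    with torch.no_grad():
        target = r + (1.0 - done) * gamma * q_net(s_next).max(dim=1).values
    prediction = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(prediction, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```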



Policy

I used an epsilon-greedy policy with an epsilon-decay enhancement: at the start the agent explores freely, and after a few iterations it becomes increasingly inclined to take greedy actions.
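A minimal sketch of this policy, assuming the `QNetwork` above; the decay rate and the epsilon floor are illustrative values, not necessarily the ones used in this project.

```python
# Epsilon-greedy action selection with multiplicative epsilon decay.
import random
import torch

epsilon, eps_min, eps_decay = 1.0, 0.01, 0.995  # illustrative values

def select_action(q_net, state, n_actions):
    global epsilon
    if random.random() < epsilon:
        action = random.randrange(n_actions)   # explore: random action
    else:
        with torch.no_grad():
            q_values = q_net(torch.as_tensor(state, dtype=torch.float32))
        action = int(q_values.argmax())        # exploit: greedy action
    epsilon = max(eps_min, epsilon * eps_decay)  # decay toward the floor
    return action
```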

Results

Analysis

We take a running average of the last 50 episodes' rewards to measure model performance.
  • With a lower value of gamma, we were not able to reach high rewards, which is intuitive given our environment: the large reward (for the final goal) only arrives at the end of the episode, when the rocket lands properly, so a small discount factor shrinks its influence on earlier decisions (see the short illustration after this list).
  • With online learning we were able to achieve good performance, but in the last few iterations the model's performance started to decrease. The epsilon decay provided a good amount of exploration at the start, and our state space has low dimensionality, which I think is why the online version also converged.
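To make the gamma point concrete, here is a small illustration with hypothetical numbers (not taken from the experiments): a terminal reward R received T steps in the future contributes gamma**T * R to the value of the current state, so a low gamma makes the landing reward nearly invisible to early decisions.

```python
# How the discount factor scales a delayed terminal reward.
R, T = 100.0, 200  # hypothetical terminal reward and episode length
for gamma in (0.5, 0.9, 0.99):
    print(f"gamma={gamma}: discounted contribution = {gamma**T * R:.3g}")
# gamma=0.99 keeps roughly 13% of the reward; gamma=0.5 wipes it out.
```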

Graphs