
Experiment Journal: Landing a Rocket

Getting less and less confused
Created on February 23 | Last edited on June 6
This is my experiment journal for the Starship landing project. Each section is an experiment that follows the same structure: motivations/assumptions/current understanding/hypothesis -> experiment runs -> gained insight.


1. Did I Mess up Observation Augmentations?

2. Scaling Rewards is Confusing

3. How Useful Are the HistoryWrapper and a Big Batch Size?

A breakthrough during this project was using a HistoryWrapper. A HistoryWrapper allows the agent to see the k past observations (and actions). In robotics it can be used to cope with control delay.
Here our system has a lot of inertia/delay (a big object moving in a physics sim) and it seems like the HistoryWrapper helps deal with that. My intuition is that observing multiple timesteps makes it easier to match actions to their consequences: because there is lag, if you only observe one step it is hard to understand what the action did.
But then, our batch size is 1024 and, in my understanding, the state space is quite big. Maybe 1024 was too small a subset of states to learn anything, and maybe a big batch size has a similar effect to the HistoryWrapper? Let's compare.
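For reference, here is a minimal sketch of what I mean by a HistoryWrapper, assuming a flat Box observation space and the Gymnasium API (the goal-dict observations used later with HER would need extra plumbing). This is an illustration, not the exact wrapper used in these runs:

```python
import numpy as np
import gymnasium as gym


class HistoryWrapper(gym.Wrapper):
    """Stack the last k observations (and actions) into one flat observation."""

    def __init__(self, env, k=2):
        super().__init__(env)
        self.k = k
        obs_dim = env.observation_space.shape[0]   # assumes a flat Box obs space
        act_dim = env.action_space.shape[0]
        self._obs_hist = np.zeros((k, obs_dim), dtype=np.float32)
        self._act_hist = np.zeros((k, act_dim), dtype=np.float32)
        low = np.concatenate([np.tile(env.observation_space.low, k),
                              np.tile(env.action_space.low, k)])
        high = np.concatenate([np.tile(env.observation_space.high, k),
                               np.tile(env.action_space.high, k)])
        self.observation_space = gym.spaces.Box(low, high, dtype=np.float32)

    def _stacked(self):
        return np.concatenate([self._obs_hist.ravel(), self._act_hist.ravel()])

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self._obs_hist[:] = obs        # fill the history with the first observation
        self._act_hist[:] = 0.0
        return self._stacked(), info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        # shift the history by one step and append the newest obs/action
        self._obs_hist = np.roll(self._obs_hist, shift=-1, axis=0)
        self._act_hist = np.roll(self._act_hist, shift=-1, axis=0)
        self._obs_hist[-1] = obs
        self._act_hist[-1] = action
        return self._stacked(), reward, terminated, truncated, info
```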

High Batch Size No History (blue) vs History Low Batch Size (run set, 3 runs)


Gained Insight: HistoryWrapper ✅ Big Batch Size 🤷

The HistoryWrapper helps a lot; a big batch size doesn't seem to help that much.
After properly re-scaling the reward (cf. 2. Scaling Rewards is Confusing) it looks like the agent manages to learn even without the HistoryWrapper, contrary to what I thought earlier (cf. 4. How Long Should the History be?).



4. How Long Should the History be?

A longer history has a cost: it increases the input layer size and therefore makes the neural net slower to train. A history that's too long might also decrease performance. Let's try to find the best history length.

Run set (4 runs)


Gained Insight: 2 looks good

Looks like it is not that important after all! The effect was way more pronounced when the reward was badly shaped.
It still looks like observing two steps helps (maybe it lets the agent compute the accelerations?).
Experiment Idea
It could be interesting to compare: (a) adding the accelerations to the observation, (b) a history wrapper without accelerations, and (c) positions only (no speeds or accelerations) plus a history wrapper. I'm curious whether the agent "understands" how to deduce speeds and accelerations. Can it figure out some sort of Kalman filter on its own? And can we then compare that to a real Kalman filter?
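If I ever run that comparison, the "add accelerations to the observation" variant could look roughly like this; the assumption that the last entries of the observation are velocities, the n_vel count and the dt value are all placeholders:

```python
import numpy as np
import gymnasium as gym


class AccelObservationWrapper(gym.Wrapper):
    """Append finite-difference accelerations (velocity deltas / dt) to the observation.

    Assumes the last n_vel entries of the flat observation are velocities and
    that the simulation timestep dt is known.
    """

    def __init__(self, env, n_vel=3, dt=1 / 60):
        super().__init__(env)
        self.n_vel = n_vel
        self.dt = dt
        self._prev_vel = None
        low = np.concatenate([env.observation_space.low, np.full(n_vel, -np.inf)])
        high = np.concatenate([env.observation_space.high, np.full(n_vel, np.inf)])
        self.observation_space = gym.spaces.Box(low.astype(np.float32),
                                                high.astype(np.float32))

    def _augment(self, obs):
        vel = np.asarray(obs[-self.n_vel:], dtype=np.float64)
        if self._prev_vel is None:
            accel = np.zeros(self.n_vel)          # no history yet on the first step
        else:
            accel = (vel - self._prev_vel) / self.dt
        self._prev_vel = vel
        return np.concatenate([obs, accel]).astype(np.float32)

    def reset(self, **kwargs):
        self._prev_vel = None
        obs, info = self.env.reset(**kwargs)
        return self._augment(obs), info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        return self._augment(obs), reward, terminated, truncated, info
```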



5. Does Size Matter?

6. Tuning the Distance Reward - X

The astronauts keep dying in failed landings; we must do better. We can't afford a success rate < 1.
I want to try tuning the distance reward to see if I can get better results.
The idea is to lower the power (exponent) on the distance reward so that the penalty for being far away matters less. What we care about is that the rocket is as close to the goal as possible at the end of the episode; we don't care that much about what happens before.
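As a sketch, "lowering the power" would mean something like this (the exact form of the distance reward and the 0.5 exponent are placeholders, not the actual reward used in the runs):

```python
import numpy as np


def distance_reward(position, goal, power=0.5, scale=1.0):
    """Distance penalty with a tunable exponent.

    power=1.0 is a plain distance penalty; power<1 flattens the penalty far from
    the goal, so the agent is mostly judged on how close it ends up.
    """
    distance = np.linalg.norm(np.asarray(position) - np.asarray(goal))
    return -scale * distance ** power
```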

Run set (3 runs)


7. Is Noise Helping? Random Initial State, Goal and Sim Parameters

I want to further randomize the initial state in the hope that it will help the agent better grasp the rocket dynamics.
Maybe I should also randomize the physics sim!
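A rough sketch of what randomizing the sim could look like, assuming the randomized quantities are plain attributes on the env (gravity, engine_power and wind are placeholder names, not the real sim's API):

```python
import numpy as np
import gymnasium as gym


class DomainRandomizationWrapper(gym.Wrapper):
    """Resample sim parameters at every reset.

    The attribute names and ranges below are placeholders for whatever the
    rocket sim actually exposes.
    """

    def __init__(self, env, seed=None):
        super().__init__(env)
        self.rng = np.random.default_rng(seed)

    def reset(self, **kwargs):
        self.env.unwrapped.gravity = self.rng.uniform(9.0, 10.6)
        self.env.unwrapped.engine_power = self.rng.uniform(0.9, 1.1)
        self.env.unwrapped.wind = self.rng.normal(0.0, 2.0)
        return self.env.reset(**kwargs)
```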

Run set (3 runs)

Some initial conditions make the problem impossible to solve (e.g. spawning too close to the horizontal limits with a high horizontal velocity: the rocket instantly goes out of bounds). This makes the comparison unfair.
One fix is to evaluate the agent on an env with fixed goal, initial state and simulation parameters. That would make the comparison fairer.
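Assuming a Stable-Baselines3 setup, a fixed eval env could simply be evaluated periodically with an EvalCallback; make_rocket_env, its randomize flag, and the model variable are placeholders here:

```python
from stable_baselines3.common.callbacks import EvalCallback

# hypothetical factory: same rocket env, but with randomization switched off
eval_env = make_rocket_env(randomize=False)

eval_callback = EvalCallback(
    eval_env,
    n_eval_episodes=10,
    eval_freq=10_000,
    deterministic=True,
)
model.learn(total_timesteps=1_000_000, callback=eval_callback)
```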

Run set (4 runs)


Gained Insight:

The success rate looks unstable. It is also slightly weird that it is binary: either working or not. That's because the eval policy is deterministic; we need a random goal to get a more nuanced/accurate view of how the agent is performing.
That said, agents trained on random initial state, goal and sim parameters seem able to solve the task.

Training Stability: Higher Batch Size?

Looking at 7. Is Noise Helping? Random Initial State, Goal and Sim Parameters, the training is somewhat unstable. Let's try drastically increasing the batch size to see if it helps.

Run set (3 runs)


Gained Insight: None

It looks like it's even more unstable?

8. Can I Make a Sparse Reward Work?

Sparse reward = good, because I know what success looks like but not how to get there.
Reward shaping = difficult, and might end up with a suboptimal policy.
Let's try the simplest version of the problem first (fixed initial state, goal and constants): hopefully that'll make the sparse reward slightly less sparse? But sparsity shouldn't be an issue because of HER.
Compare with eager-valley-281 to see how it fares against a somewhat shaped reward.
pleasant-bee-285
Would also need to try something less arbitrary; the -1/120 value is weird.
Should also try penalizing the action, since firing the engines burns kerosene.
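For reference, the sparse, HER-compatible compute_reward I have in mind is roughly this (the 0.5 success tolerance is a placeholder):

```python
import numpy as np


def compute_reward(achieved_goal, desired_goal, info, tolerance=0.5):
    """Sparse reward: 0 on success, -1 otherwise (the convention HER expects).

    Works on single goals or on batches (HER relabels whole batches at once).
    """
    distance = np.linalg.norm(np.asarray(achieved_goal) - np.asarray(desired_goal), axis=-1)
    return -(distance > tolerance).astype(np.float32)
```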

Run set (2 runs)


HER k

Success rate unstable, maybe k is too high?
Comparing k = 2 vs 4 vs 5 vs 8.

Run set (5 runs)

For now k=2 looks a bit better.
Need to try k=0.
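To be explicit about what k is: the number of relabeled (virtual) goals sampled per real transition. Assuming a Stable-Baselines3 setup, it maps to n_sampled_goal (env here stands for the goal-conditioned rocket env with a Dict observation space):

```python
from stable_baselines3 import SAC, HerReplayBuffer

model = SAC(
    "MultiInputPolicy",
    env,                                  # goal-conditioned rocket env (Dict obs space)
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(
        n_sampled_goal=2,                 # the "k" being compared above
        goal_selection_strategy="future",
    ),
)
model.learn(total_timesteps=1_000_000)
```

n_sampled_goal=0 would be the k=0 baseline, i.e. no relabeling at all.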



Eval Episode Length / Why Does the Eval Episode Clock Out?

Weirdly, eval episodes seem to need more than 500 timesteps, while training episodes need fewer than 400. I want to try a max length of 1000 for both train and eval, just to see.
Might also need to try train 500 vs eval 1000.
I guess that's when the rocket learns to hover. Try the same but with a 2x bigger penalty on each step.
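Assuming the standard Gymnasium TimeLimit wrapper is what enforces the episode length, the train-vs-eval length experiments are just this (make_rocket_env is a placeholder):

```python
from gymnasium.wrappers import TimeLimit

# same max length of 1000 for train and eval, "just to see"
train_env = TimeLimit(make_rocket_env(), max_episode_steps=1000)
eval_env = TimeLimit(make_rocket_env(), max_episode_steps=1000)

# variant to compare next: shorter training episodes, longer eval episodes
# train_env = TimeLimit(make_rocket_env(), max_episode_steps=500)
# eval_env = TimeLimit(make_rocket_env(), max_episode_steps=1000)
```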

Run set (3 runs)

skilled lion looks a bit more stable.



Fine Tuning?

Training a first agent with a shaped reward, then trying to improve it by further training it with a sparse reward.
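A sketch of that fine-tuning loop, assuming Stable-Baselines3-style save/load; make_rocket_env and its reward argument are placeholders:

```python
from stable_baselines3 import SAC

# phase 1: train with the shaped (dense) reward
model = SAC("MultiInputPolicy", make_rocket_env(reward="shaped"))
model.learn(total_timesteps=1_000_000)
model.save("shaped_agent")

# phase 2: reload the weights and keep training on the sparse-reward env
model = SAC.load("shaped_agent", env=make_rocket_env(reward="sparse"))
model.learn(total_timesteps=1_000_000, reset_num_timesteps=False)
```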

Run set (3 runs)



Continuity cost! Overwrite compute_reward in a wrapper.
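A rough sketch of that wrapper. The weight, the info key and the single-dict compute_reward handling are assumptions; a batched HER replay buffer would pass arrays of infos and need element-wise handling:

```python
import numpy as np
import gymnasium as gym


class ContinuityCostWrapper(gym.Wrapper):
    """Penalize changes between consecutive actions. The 0.1 weight is a placeholder."""

    def __init__(self, env, weight=0.1):
        super().__init__(env)
        self.weight = weight
        self._prev_action = None

    def reset(self, **kwargs):
        self._prev_action = None
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        delta = 0.0
        if self._prev_action is not None:
            delta = float(np.sum(np.square(np.asarray(action) - self._prev_action)))
        info["action_delta"] = delta            # stash so compute_reward can see it
        self._prev_action = np.asarray(action)
        return obs, reward - self.weight * delta, terminated, truncated, info

    def compute_reward(self, achieved_goal, desired_goal, info):
        # keep HER relabeling consistent: recomputed rewards get the same penalty
        base = self.env.compute_reward(achieved_goal, desired_goal, info)
        if isinstance(info, dict):
            return base - self.weight * info.get("action_delta", 0.0)
        return base  # batched infos (lists/arrays) would need element-wise handling
```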


add a colorful sweep thingy