journal: Making It Work in Sim
Experiment journal
Created on October 25 | Last edited on November 2
Because of technical challenges, I did not visualize simulation results until now (10/22/21). Turns out that even though the reward is pretty high, the agent wasn't always performing the desired task.
[video] Agent exploiting my reward 😠
The visualization can look weird because it's a rotary pendulum visualized as a cartpole pendulum.
💡 Also, I did some quick iterations and did not track every relevant metric. Furthermore, some simulation parameters did not match the robot (as I didn't know how to calculate/find them).
I refactored the code to log every relevant detail and found the missing simulation parameters. Let's get this simulation to work.
Here we'll try to understand:
- The impact of the action limiter and the theta limit
- The impact of the sampling/control frequencies. What happens if they are the same? Is it tremendously better to take 2 sim steps per action? (sketch below)
- Work out a working reward: Quanser reward or simple reward?
Once working in sim:
- transfer the weights and continue training
- try to transfer the weights and also the replay buffer
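To make the sim-steps-per-action question concrete, here's a minimal sketch of a wrapper I could use (it assumes the wrapped env advances the physics by exactly one sim step per step() call; this is an illustration, not the actual project code):

```python
import gym


class SimStepsPerAction(gym.Wrapper):
    """Repeat each policy action for `n_sim_steps` env steps.

    Assumes the wrapped env advances the physics by exactly one
    simulation step per `step()` call, so n_sim_steps=2 means the
    simulation runs at twice the control frequency.
    """

    def __init__(self, env, n_sim_steps=2):
        super().__init__(env)
        self.n_sim_steps = n_sim_steps

    def step(self, action):
        total_reward = 0.0
        for _ in range(self.n_sim_steps):
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        return obs, total_reward, done, info
```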
Experiment 0: Let's try my robot params vs Quanser's params
(keeping my script's default hyperparameters)
Quanser Sim params (left) vs My params (right)
It looks like the Quanser simulation parameters make it easier to balance the robot.
Next experiments:
- try to transfer weights and buffer and train with my sim params and see how this goes; or maybe randomize sim params?? (rough sketch below)
- try to feed the n previous steps to the policy?
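If I go the randomization route, a rough sketch could be a wrapper that perturbs the sim parameters at every reset. The attribute names and values below are hypothetical placeholders, not necessarily what my sim uses:

```python
import numpy as np
import gym


class RandomizeSimParams(gym.Wrapper):
    """Perturb selected simulation parameters at every reset.

    `base_params` maps attribute names on the underlying sim to their
    nominal values; the names here are hypothetical placeholders.
    """

    def __init__(self, env, base_params, rel_range=0.1):
        super().__init__(env)
        self.base_params = base_params
        self.rel_range = rel_range  # +/- 10% by default

    def reset(self, **kwargs):
        # sample a fresh multiplicative perturbation for each parameter
        for name, nominal in self.base_params.items():
            scale = 1.0 + np.random.uniform(-self.rel_range, self.rel_range)
            setattr(self.env.unwrapped, name, nominal * scale)
        return self.env.reset(**kwargs)


# usage (placeholder parameter names and values):
# env = RandomizeSimParams(env, {"Lp": 0.10, "mp": 0.02})
```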
Experiment 1: Control frequency
On my computer, this can run at around 200 Hz, which is fast enough to stabilize the pendulum. If it runs much slower, it will likely be impossible to stabilize upright, since the pendulum dynamics are fairly fast.
The max control frequency I've tried so far is 100 Hz; let's try 200 Hz: https://wandb.ai/armandpl/furuta/runs/12rb1lbp
100 Hz vs 200 Hz
Not sure about the results. The reward is scaled by dt, so it might be normal that the return is smaller? Might have needed to make the episode length twice as long for these to match (quick check below).
- try to make the episode length twice as long so that the return scales match
- let's try 300 Hz and a sampling frequency of 600 Hz?
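Quick back-of-the-envelope check on the dt scaling (assuming the per-step reward is multiplied by dt, as noted above):

```python
# Back-of-the-envelope check of how dt scaling affects the return,
# assuming per-step reward = r * dt (as in the runs above).
r = 1.0  # hypothetical instantaneous reward while balanced

for freq, n_steps in [(100, 3000), (200, 3000), (200, 6000)]:
    dt = 1.0 / freq
    ret = sum(r * dt for _ in range(n_steps))  # undiscounted return
    print(f"{freq} Hz, {n_steps} steps -> return {ret:.1f}, duration {n_steps * dt:.0f} s")

# 100 Hz / 3000 steps and 200 Hz / 6000 steps both cover 30 s and give
# the same return; 200 Hz / 3000 steps only covers 15 s, so its return
# is half as large.
```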
Experiment 2: try copying hyperparams from this GitHub repo.
Stealing Hyperparameters 😈
Experiments 3, 4, 5, 6
After 2M steps it's still not working as expected. Where do we go from here?
3: Keep the weights and replay buffer and remove the Action Limiter.
- the action limiter makes the action 0 near the motor angle limits (sketched below). Looking at the visualization, the cart jitters near the limits.
- Removing the action limiter might incentivize it to stay near the center, as the episode will be terminated each time it gets close to the limit.
- 2lrvg16g didn't work; also it seems like it only goes to the left???
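For reference, the action limiter behaves roughly like this (a sketch of the idea, not the exact implementation; the margin and scaling are guesses):

```python
import numpy as np


def limit_action(action, motor_angle, angle_limit, margin=0.1):
    """Sketch of an action limiter: fade the action to 0 as the motor
    arm approaches its angle limit (margin in rad is hypothetical)."""
    dist_to_limit = angle_limit - abs(motor_angle)
    scale = np.clip(dist_to_limit / margin, 0.0, 1.0)
    return action * scale
```

Without it, the episode simply terminates when the motor angle exceeds the limit, which is the incentive mentioned above.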
4: Keep the weights and change the reward function.
- the current reward function is the convoluted Quanser one that takes into account speed and action. A much simpler one is to reward the pendulum angle only (sketched after this list).
- can't keep the replay buffer, as the reward won't be scaled the same
- 25k random steps, 25k current policy steps, then start learning
- iw5li62v didn't work
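A sketch of what "reward the pendulum angle only" could look like (assuming alpha = 0 means upright; the exact normalization is a guess, not necessarily what run iw5li62v used):

```python
import numpy as np


def simple_reward(alpha):
    """Angle-only reward: 1 when upright, 0 when hanging down.

    Assumes alpha is the pendulum angle with alpha = 0 at the upright
    position (a convention assumed for this sketch).
    """
    # wrap to [-pi, pi], then map |alpha| -> [0, 1]
    alpha = (alpha + np.pi) % (2 * np.pi) - np.pi
    return 1.0 - abs(alpha) / np.pi
```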
5: Keep the weights and revisit the frequency experiment
- can't keep the replay buffer, as the reward is scaled by dt
- 25k random steps, 25k current policy steps, then start learning
- make the frequency 250 Hz and the max steps per episode 3000 * 2.5 = 7500 (2.5× the steps so the episode covers the same simulated time as 3000 steps at 100 Hz). Then compare with run 12rb1lbp.
- 1cs8ubkb didn't work
6: Keep the weights and make the pendulum arm much much longer
- intuition: balancing a broom is much simpler than balancing a pencil (quick calculation after this list). Let's see if I can modify my physical robot to simplify the problem.
- discard the replay buffer, 25k random steps, 25k current policy steps then start learning
- xadg15rj
- making it less long and making the rotary arm 3 cm longer
- Training for ages and then a bigger batch size seemed to work? d6hy5z6c
- next up:
- training w/ bigger batch size from scratch
- training w/ bigger batch size from scratch and w/ my sim params
- training w/ bigger batch size from scratch and w/ my sim params and without speed limitation
- and compare those three
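Quick sanity check on the broom-vs-pencil intuition: for a uniform rod pivoting at one end, the time scale of falling grows with sqrt(L), so a longer pendulum falls more slowly and gives the controller more time to react (lengths below are illustrative, not my robot's):

```python
import numpy as np

g = 9.81  # m/s^2

# For a uniform rod pivoting about one end, linearized around the
# inverted position, theta'' = (3 g / (2 L)) * theta, so the fall
# time constant is ~ 1 / sqrt(3 g / (2 L)). Lengths are illustrative.
for L in [0.1, 0.2, 0.4]:
    omega = np.sqrt(3 * g / (2 * L))
    print(f"L = {L:.2f} m -> time constant ~ {1 / omega * 1000:.0f} ms")
```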
Implemented Antonin Raffin's tricks (rough sketches below):
- continuity cost
- gSDE
- HistoryWrapper
- let's run a sweep to evaluate these tricks
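Rough sketches of two of these, as I understand them (my own approximations, not Antonin Raffin's exact implementations): the continuity cost penalizes large changes between consecutive actions, and the history wrapper feeds the last few observations to the policy. gSDE itself should just be a flag (use_sde=True) if I'm using stable-baselines3's SAC.

```python
from collections import deque

import numpy as np
import gym


class ContinuityCost(gym.Wrapper):
    """Penalize the squared change between consecutive actions (weight is a guess)."""

    def __init__(self, env, weight=0.1):
        super().__init__(env)
        self.weight = weight
        self.last_action = None

    def reset(self, **kwargs):
        self.last_action = None
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        if self.last_action is not None:
            reward -= self.weight * float(np.sum((action - self.last_action) ** 2))
        self.last_action = np.array(action)
        return obs, reward, done, info


class HistoryWrapper(gym.Wrapper):
    """Concatenate the last `n` observations into the policy input.

    (observation_space would also need to be resized for a real run.)
    """

    def __init__(self, env, n=2):
        super().__init__(env)
        self.n = n
        self.history = deque(maxlen=n)

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        self.history.clear()
        for _ in range(self.n):
            self.history.append(obs)
        return np.concatenate(self.history)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.history.append(obs)
        return np.concatenate(self.history), reward, done, info
```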