journal: Making It Work in Sim
Experiment journal
Created on October 25 | Last edited on November 2
Because of technical challenges, I did not visualize simulation results until now (10/22/21). Turns out that even though the reward is pretty high, the agent wasn't always performing the desired task.
[video] Agent exploiting my reward 😠
The visualization can look weird because it's a rotary pendulum visualized as a cartpole pendulum.
💡 Also, I did some quick iterations and did not track every relevant metric. Furthermore, some simulation parameters did not match the robot (as I didn't know how to calculate/find them).
I refactored the code to log every relevant detail and found the missing simulation parameters. Let's get this simulation to work.
Here we'll try to understand:
- The impact of the action limiter and the theta limit
- The impact of the sampling/control frequencies. What happens if they are the same? Is it tremendously better to take 2 sim steps per action? (sketch below)
- Work out a working reward: Quanser reward or simple reward?
Once working in sim:
- transfer the weights and continue training
- try to transfer the weights and also the replay buffer
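To make the sim-steps-per-action question concrete, here's a minimal sketch of a wrapper I could use (it assumes the wrapped env advances the physics by exactly one sim step per step() call; this is an illustration, not the actual project code):

```python
import gym


class SimStepsPerAction(gym.Wrapper):
    """Repeat each policy action for `n_sim_steps` env steps.

    Assumes the wrapped env advances the physics by exactly one
    simulation step per `step()` call, so n_sim_steps=2 means the
    simulation runs at twice the control frequency.
    """

    def __init__(self, env, n_sim_steps=2):
        super().__init__(env)
        self.n_sim_steps = n_sim_steps

    def step(self, action):
        total_reward = 0.0
        for _ in range(self.n_sim_steps):
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        return obs, total_reward, done, info
```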
Experiment 0: Let's try my robot params vs Quanser's params
(keeping my script's default hyperparameters)
Quanser Sim params (left) vs My params (right)
It looks like the Quanser simulation parameters make it easier to balance the robot.
Next experiments:
- try to transfer weights and buffer and train with my sim params and see how this goes; or maybe randomize sim params?? (rough sketch below)
- try to feed the n previous steps to the policy?
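If I go the randomization route, a rough sketch could be a wrapper that perturbs the sim parameters at every reset. The attribute names and values below are hypothetical placeholders, not necessarily what my sim uses:

```python
import numpy as np
import gym


class RandomizeSimParams(gym.Wrapper):
    """Perturb selected simulation parameters at every reset.

    `base_params` maps attribute names on the underlying sim to their
    nominal values; the names here are hypothetical placeholders.
    """

    def __init__(self, env, base_params, rel_range=0.1):
        super().__init__(env)
        self.base_params = base_params
        self.rel_range = rel_range  # +/- 10% by default

    def reset(self, **kwargs):
        # sample a fresh multiplicative perturbation for each parameter
        for name, nominal in self.base_params.items():
            scale = 1.0 + np.random.uniform(-self.rel_range, self.rel_range)
            setattr(self.env.unwrapped, name, nominal * scale)
        return self.env.reset(**kwargs)


# usage (placeholder parameter names and values):
# env = RandomizeSimParams(env, {"Lp": 0.10, "mp": 0.02})
```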
Experiment 1: Control frequency
On my computer, this can run at around 200 Hz, which is fast enough to stabilize the pendulum. If it runs much slower, it will likely be impossible to stabilize upright, since the pendulum dynamics are fairly fast.
The max control frequency I've tried so far is 100 Hz; let's try 200 Hz: https://wandb.ai/armandpl/furuta/runs/12rb1lbp
100 Hz vs 200 Hz
Not sure about the results. The reward is scaled by dt, so it might be normal that the return is smaller? Might have needed to make the episode length twice as long for these to match (quick check below).
- try to make the episode length twice as long so that the return scales match
- let's try 300 Hz and a sampling frequency of 600 Hz?
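Quick back-of-the-envelope check on the dt scaling (assuming the per-step reward is multiplied by dt, as noted above):

```python
# Back-of-the-envelope check of how dt scaling affects the return,
# assuming per-step reward = r * dt (as in the runs above).
r = 1.0  # hypothetical instantaneous reward while balanced

for freq, n_steps in [(100, 3000), (200, 3000), (200, 6000)]:
    dt = 1.0 / freq
    ret = sum(r * dt for _ in range(n_steps))  # undiscounted return
    print(f"{freq} Hz, {n_steps} steps -> return {ret:.1f}, duration {n_steps * dt:.0f} s")

# 100 Hz / 3000 steps and 200 Hz / 6000 steps both cover 30 s and give
# the same return; 200 Hz / 3000 steps only covers 15 s, so its return
# is half as large.
```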
Experiment 2: try copying hyperparams from this GitHub repo.
Stealing Hyperparameters 😈
Experiments 3, 4, 5, 6
After 2M steps it's still not working as expected. Where do we go from here?
3: Keep the weights and replay buffer and remove the Action Limiter.
- the action limiter makes the action 0 near the motor angle limits (sketched below). Looking at the visualization, the cart jitters near the limits.
- Removing the action limiter might incentivize it to stay near the center, as the episode will be terminated each time it gets close to the limit.
- 2lrvg16g didn't work; also it seems like it only goes to the left???
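For reference, the action limiter behaves roughly like this (a sketch of the idea, not the exact implementation; the margin and scaling are guesses):

```python
import numpy as np


def limit_action(action, motor_angle, angle_limit, margin=0.1):
    """Sketch of an action limiter: fade the action to 0 as the motor
    arm approaches its angle limit (margin in rad is hypothetical)."""
    dist_to_limit = angle_limit - abs(motor_angle)
    scale = np.clip(dist_to_limit / margin, 0.0, 1.0)
    return action * scale
```

Without it, the episode simply terminates when the motor angle exceeds the limit, which is the incentive mentioned above.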
4: Keep the weights and change the reward function.
- the current reward function is the convoluted Quanser one that takes into account speed and action. A much simpler one is to reward the pendulum angle only (sketched after this list).
- can't keep the replay buffer, as the reward won't be scaled the same
- 25k random steps, 25k current policy steps, then start learning
- iw5li62v didn't work
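A sketch of what "reward the pendulum angle only" could look like (assuming alpha = 0 means upright; the exact normalization is a guess, not necessarily what run iw5li62v used):

```python
import numpy as np


def simple_reward(alpha):
    """Angle-only reward: 1 when upright, 0 when hanging down.

    Assumes alpha is the pendulum angle with alpha = 0 at the upright
    position (a convention assumed for this sketch).
    """
    # wrap to [-pi, pi], then map |alpha| -> [0, 1]
    alpha = (alpha + np.pi) % (2 * np.pi) - np.pi
    return 1.0 - abs(alpha) / np.pi
```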
5: Keep the weights and revisit the frequency experiment
- can't keep the replay buffer, as the reward is scaled by dt
- 25k random steps, 25k current policy steps, then start learning
- make the frequency 250 Hz and the max steps per episode 3000 * 2.5 = 7500 (2.5× the steps so the episode covers the same simulated time as 3000 steps at 100 Hz). Then compare with run 12rb1lbp.
- 1cs8ubkb didn't work
6: Keep the weights and make the pendulum arm much much longer
- intuition: balancing a broom is much simpler than balancing a pencil (quick calculation after this list). Let's see if I can modify my physical robot to simplify the problem.
- discard the replay buffer, 25k random steps, 25k current policy steps then start learning
- xadg15rj
- making it less long and making the rotary arm 3 cm longer
- Training for ages and then a bigger batch size seemed to work? d6hy5z6c
- next up:
- training w/ bigger batch size from scratch
- training w/ bigger batch size from scratch and w/ my sim params
- training w/ bigger batch size from scratch and w/ my sim params and without speed limitation
- and compare those three
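Quick sanity check on the broom-vs-pencil intuition: for a uniform rod pivoting at one end, the time scale of falling grows with sqrt(L), so a longer pendulum falls more slowly and gives the controller more time to react (lengths below are illustrative, not my robot's):

```python
import numpy as np

g = 9.81  # m/s^2

# For a uniform rod pivoting about one end, linearized around the
# inverted position, theta'' = (3 g / (2 L)) * theta, so the fall
# time constant is ~ 1 / sqrt(3 g / (2 L)). Lengths are illustrative.
for L in [0.1, 0.2, 0.4]:
    omega = np.sqrt(3 * g / (2 * L))
    print(f"L = {L:.2f} m -> time constant ~ {1 / omega * 1000:.0f} ms")
```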
Implemented Antonin Raffin's tricks (rough sketches below):
- continuity cost
- gSDE
- HistoryWrapper
- let's run a sweep to evaluate these tricks
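Rough sketches of two of these, as I understand them (my own approximations, not Antonin Raffin's exact implementations): the continuity cost penalizes large changes between consecutive actions, and the history wrapper feeds the last few observations to the policy. gSDE itself should just be a flag (use_sde=True) if I'm using stable-baselines3's SAC.

```python
from collections import deque

import numpy as np
import gym


class ContinuityCost(gym.Wrapper):
    """Penalize the squared change between consecutive actions (weight is a guess)."""

    def __init__(self, env, weight=0.1):
        super().__init__(env)
        self.weight = weight
        self.last_action = None

    def reset(self, **kwargs):
        self.last_action = None
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        if self.last_action is not None:
            reward -= self.weight * float(np.sum((action - self.last_action) ** 2))
        self.last_action = np.array(action)
        return obs, reward, done, info


class HistoryWrapper(gym.Wrapper):
    """Concatenate the last `n` observations into the policy input.

    (observation_space would also need to be resized for a real run.)
    """

    def __init__(self, env, n=2):
        super().__init__(env)
        self.n = n
        self.history = deque(maxlen=n)

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        self.history.clear()
        for _ in range(self.n):
            self.history.append(obs)
        return np.concatenate(self.history)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.history.append(obs)
        return np.concatenate(self.history), reward, done, info
```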