Temperature and LR for LogSoftMax Reward
This report considers various values of temperature and learning rate for LogSoftMax Reward.
Created on March 24 | Last edited on March 24
The reward looks as follows.
Let's first consider a fixed subset of the expert's trajectories as negatives.
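As a rough illustration of how temperature and the set of negatives interact in a log-softmax style reward, here is a minimal sketch. It is an assumption, not the formula used in these runs: the function name `log_softmax_reward`, the scalar scores `pos_score` and `neg_scores`, and the way temperature divides the scores are all hypothetical and chosen only for the example.

```python
import math


def log_softmax_reward(pos_score, neg_scores, temperature=1.0):
    """Log-probability of the positive item under a softmax over
    the positive and all negatives (hypothetical formulation).

    pos_score:   score of the agent's sample (assumed scalar)
    neg_scores:  scores of the negatives, e.g. expert or replay-buffer samples
    temperature: divides all scores; smaller values sharpen the softmax
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return logits[0] - log_z


# The reward is never positive, and a lower temperature sharpens the contrast.
r_warm = log_softmax_reward(1.0, [0.0], temperature=1.0)
r_cold = log_softmax_reward(1.0, [0.0], temperature=0.1)
```

Under this assumed form, adding more negatives can only lower the reward for a fixed positive score, which is one reason the number of negatives (10 vs. 100) and the temperature are natural knobs to sweep together.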
Negatives from expert trajectories
Run set: 54 runs
10 Negatives from Replay Buffer with Temperature
Run set: 35 runs
100 Negatives from Replay Buffer with Temperature
Run set: 20 runs
100 Negatives from Replay Buffer with Different Policy Learning Rates
Run set: 40 runs
100 Negatives from Replay Buffer with Temperature and Different Policy Learning Rates
Run set: 71 runs
Best Runs
Run set: 51 runs