
Temperature and LR for LogSoftMax Reward

This report considers various values of the temperature and policy learning rate for the LogSoftMax reward.
The reward looks as follows.
$$r_i = \log \frac{\exp\left(\frac{\phi_1(s_i)^T \phi_2(s_{i+1})}{\tau}\right)}{\sum_{s \in \text{Negatives}} \exp\left(\frac{\phi_1(s_i)^T \phi_2(s)}{\tau}\right)}$$
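A minimal sketch of how this reward could be computed, assuming a PyTorch setup with the two embedding networks from the formula (function name and shapes here are hypothetical, not taken from the actual training code):

```python
import torch

def log_softmax_reward(phi1, phi2, s_i, s_next, negatives, tau):
    """Contrastive LogSoftMax reward for one transition (s_i, s_next).

    phi1, phi2 : the two embedding networks from the formula above
    s_i, s_next: current and next state, tensors of shape (state_dim,)
    negatives  : negative states, tensor of shape (n_neg, state_dim)
    tau        : temperature (1e-1 or 1e-2 in the sweeps below)
    """
    anchor = phi1(s_i)          # phi_1(s_i), shape (embed_dim,)
    positive = phi2(s_next)     # phi_2(s_{i+1}), shape (embed_dim,)
    neg_emb = phi2(negatives)   # phi_2(s) for each negative, shape (n_neg, embed_dim)

    pos_logit = anchor @ positive / tau   # scalar similarity with the positive
    neg_logits = neg_emb @ anchor / tau   # shape (n_neg,), similarities with negatives

    # r_i = positive logit minus log-sum-exp over the negative logits
    return pos_logit - torch.logsumexp(neg_logits, dim=0)
```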

Let's first consider a fixed subset of the expert trajectories as negatives.
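As a minimal sketch of this choice (the function and buffer layout are hypothetical), the negatives can be a one-time draw from the expert states; the later sections instead resample them from the agent's replay buffer:

```python
import numpy as np

def sample_expert_negatives(expert_states, n_neg, seed=0):
    """Pick a fixed subset of expert states to use as contrastive negatives.

    expert_states : states from the expert trajectories, shape (n_expert, state_dim)
    n_neg         : number of negatives (10 / 100 / 1000 / 10000 in the run groups below)
    """
    rng = np.random.default_rng(seed)
    # Assumes the expert buffer holds at least n_neg states; draw without replacement.
    idx = rng.choice(len(expert_states), size=n_neg, replace=False)
    return expert_states[idx]
```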

Negatives from Expert Trajectories


[Plot: values in the -200 to -100 range vs. Step (0–400) for the run groups:
MountainCar_LogSoftMax_ExpertER_1000_neg_temp_1e-1_pi_lr_1e-3,
MountainCar_LogSoftMax_ExpertER_10000_neg_temp_1e-2,
MountainCar_LogSoftMax_ExpertER_10000_neg_temp_1e-1,
MountainCar_LogSoftMax_ExpertER_1000_neg_temp_1e-1,
MountainCar_LogSoftMax_ExpertER_100_neg_temp_1e-1,
MountainCar_LogSoftMax_ExpertER_100_neg_temp_1e-2,
MountainCar_LogSoftMax_ExpertER_10_neg.
Run set: 54 runs]


10 Negatives from Replay Buffer with Temperature


[Run set: 35 runs]


100 Negatives from Replay Buffer with Temperature


[Run set: 20 runs]


100 Negatives from Replay Buffer with Different Policy Learning Rates


[Run set: 40 runs]


100 Negatives from Replay Buffer with Temperature and Different Policy Learning Rates


[Run set: 71 runs]


Best Runs


[Run set: 51 runs]