
Imitation Learning on CartPole

Agents trained on CartPole with different rewards.
Created on March 15 | Last edited on March 15

DSSM trained without negatives


[Plot: group metrics computed from the first 10 groups. x-axis: Step (0 to 400); y-axis: unlabeled (-200 to -100). Legend: ten MountainCar_LogSoftMax_* groups with the ER variants ResetER_100, ResetExpertER_100, ExpertER_rs_1000, ExpertER_rs_100, and ER_100, neg_temp from 1e-1 to 5e-1, and pi_lr from 7e-4 to 3e-3.]

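The CartPole agent is insensitive to the form of the reward as long as the reward is positive: an episode ends as soon as the pole falls, so any positive per-step reward encourages the agent to keep the episode going. To demonstrate this, a set of experiments was conducted.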
The first agent optimized $r_i = -\log \left(1 - \frac{\exp\left(\frac{\phi_1(s_{i})^T\phi_2(s_{i+1})}{\tau}\right)}{\sum_{s \in Negatives} \exp\left(\frac{\phi_1(s_{i})^T\phi_2(s)}{\tau}\right)} \right)$, which is equivalent to maximizing the probability of the expert's actions (denoted as NegProbLogCosine).
The second agent optimized $r_i = -\log \left(\frac{\exp\left(\frac{\phi_1(s_{i})^T\phi_2(s_{i+1})}{\tau}\right)}{\sum_{s \in Negatives} \exp\left(\frac{\phi_1(s_{i})^T\phi_2(s)}{\tau}\right)} \right)$, which is equivalent to minimizing the probability of the expert's actions (denoted as NegLogCosine).
As can be seen, both agents perform similarly, even though their rewards push the agent in opposite directions.
The third agent optimized the same reward as the second one but with a larger number of negatives and performed better.
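
The two rewards above can be sketched in a few lines of plain Python. This is only an illustration, not the code used for the runs: the helper names (`transition_ratio`, `neg_prob_log_cosine`, `neg_log_cosine`) and the toy embeddings are hypothetical, and the sketch assumes that the set of negatives also contains the true next state, so that the ratio stays strictly between 0 and 1.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def transition_ratio(phi1_s, phi2_next, phi2_negatives, tau):
    """exp(phi1(s_i)^T phi2(s_{i+1}) / tau) divided by the sum over negatives.

    Assumption: phi2_negatives includes phi2_next, so the result is in (0, 1).
    """
    pos = math.exp(dot(phi1_s, phi2_next) / tau)
    denom = sum(math.exp(dot(phi1_s, n) / tau) for n in phi2_negatives)
    return pos / denom


def neg_prob_log_cosine(p):
    """First agent's reward: -log(1 - p). Grows as p grows."""
    return -math.log(1.0 - p)


def neg_log_cosine(p):
    """Second agent's reward: -log(p). Grows as p shrinks."""
    return -math.log(p)


# Toy unit-norm embeddings (hypothetical values, for illustration only).
phi1_s = [1.0, 0.0]
phi2_next = [0.8, 0.6]
phi2_negatives = [phi2_next, [0.0, 1.0], [-1.0, 0.0]]

p = transition_ratio(phi1_s, phi2_next, phi2_negatives, tau=0.3)
r_first = neg_prob_log_cosine(p)
r_second = neg_log_cosine(p)
print(p, r_first, r_second)
```

Under this assumption both rewards are positive for every transition, although one increases with the ratio and the other decreases with it.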


This performance boost is due solely to the scale of the reward: adding negatives enlarges the denominator, which lowers the ratio and raises its negative logarithm. It has nothing to do with the underlying mechanics, which makes CartPole uninformative for further experiments.
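
The effect of the number of negatives on the reward scale can be checked with a small sketch. The function name `neg_log_cosine_reward`, the similarity values, and the temperature are hypothetical choices for illustration; the sketch also assumes that the positive term is part of the denominator and that every negative has the same similarity to the current state.

```python
import math


def neg_log_cosine_reward(pos_sim, neg_sims, tau):
    """-log of exp(pos_sim / tau) over the sum of exponentiated similarities.

    Assumption: the denominator contains the positive term plus all negatives.
    """
    pos = math.exp(pos_sim / tau)
    denom = pos + sum(math.exp(s / tau) for s in neg_sims)
    return -math.log(pos / denom)


# Same positive similarity and temperature, growing number of negatives.
counts = (10, 100, 1000)
rewards = [neg_log_cosine_reward(0.8, [0.0] * k, tau=0.3) for k in counts]

for k, r in zip(counts, rewards):
    print(k, r)
```

The reward for the same transition grows with the number of negatives even though nothing about the transition itself has changed.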