Imitation Learning on CartPole

Agents trained on CartPole with different rewards.
Created on March 15 | Last edited on March 15
﻿
[Plot: "DSSM trained without negatives". y-axis: AverageEnvEpRet (ticks from -200 to -100); x-axis: Step (0 to 400). Group metrics computed from the first 10 groups:]
group: MountainCar_LogSoftMax_ResetER_100_neg_temp_3e-1_pi_lr_1e-3
group: MountainCar_LogSoftMax_ResetExpertER_100_neg_temp_3e-1_pi_lr_1e-3
group: MountainCar_LogSoftMax_ExpertER_rs_1000_neg_temp_3e-1_pi_lr_1e-3
group: MountainCar_LogSoftMax_ExpertER_rs_100_neg_temp_3e-1_pi_lr_1e-3
group: MountainCar_LogSoftMax_ExpertER_rs_100_neg_temp_1e-1_pi_lr_1e-3
group: MountainCar_LogSoftMax_ER_100_neg_temp_5e-1_pi_lr_7e-4
group: MountainCar_LogSoftMax_ER_100_neg_temp_3e-1_pi_lr_7e-4
group: MountainCar_LogSoftMax_ER_100_neg_temp_5e-1_pi_lr_1e-3
group: MountainCar_LogSoftMax_ER_100_neg_temp_3e-1_pi_lr_1e-3
group: MountainCar_LogSoftMax_ER_100_neg_temp_1e-1_pi_lr_3e-3
Run set: 300
﻿
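On CartPole, the agent is insensitive to the exact reward as long as the reward is positive: an episode ends when the pole falls or the cart leaves the track, so any positive per-step reward favors staying alive longer. A set of experiments was conducted to demonstrate this.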
The first agent optimized $r_i = -\log \left(1 - \frac{\exp\left(\frac{\phi_1(s_{i})^T\phi_2(s_{i+1})}{\tau}\right)}{\sum_{s \in Negatives} \exp\left(\frac{\phi_1(s_{i})^T\phi_2(s)}{\tau}\right)} \right)$, which is equivalent to maximizing the probability of expert's actions (denoted as NegProbLogCosine).
The second agent optimized $r_i = -\log \left(\frac{\exp\left(\frac{\phi_1(s_{i})^T\phi_2(s_{i+1})}{\tau}\right)}{\sum_{s \in Negatives} \exp\left(\frac{\phi_1(s_{i})^T\phi_2(s)}{\tau}\right)} \right)$, which is equivalent to minimizing the probability of expert's actions (denoted as NegLogCosine).
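The two rewards can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code used for these runs: the function names are hypothetical, the scores stand for the dot products $\phi_1(s_i)^T\phi_2(s)$, and the candidate set is assumed to contain the true next state so that the ratio stays strictly between 0 and 1 (otherwise the first reward is undefined).

```python
import math


def transition_ratio(pos_score, candidate_scores, tau):
    """exp(pos/tau) / sum_s exp(score_s/tau), with the max subtracted for stability.

    Assumption: candidate_scores already contains pos_score plus at least
    one other score, so the result lies strictly between 0 and 1.
    """
    logits = [s / tau for s in candidate_scores]
    shift = max(logits + [pos_score / tau])
    numerator = math.exp(pos_score / tau - shift)
    denominator = sum(math.exp(l - shift) for l in logits)
    return numerator / denominator


def neg_prob_log_reward(pos_score, candidate_scores, tau):
    """First agent's reward: -log(1 - ratio). Grows as the ratio grows."""
    p = transition_ratio(pos_score, candidate_scores, tau)
    return -math.log1p(-p)


def neg_log_reward(pos_score, candidate_scores, tau):
    """Second agent's reward: -log(ratio). Grows as the ratio shrinks."""
    p = transition_ratio(pos_score, candidate_scores, tau)
    return -math.log(p)


if __name__ == "__main__":
    # Hypothetical similarity scores; the first entry is the true next state.
    scores = [0.8, 0.1, -0.3]
    print(neg_prob_log_reward(scores[0], scores, tau=0.3))
    print(neg_log_reward(scores[0], scores, tau=0.3))
```

Under the assumption above both functions return strictly positive values, even though they push the ratio in opposite directions.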
As can be seen, both agents perform similarly.
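One possible reading of why agents with opposite objectives behave alike: whenever the ratio inside the logarithm lies strictly between 0 and 1, both rewards are strictly positive, and a CartPole episode ends as soon as the pole falls or the cart leaves the track. The sketch below illustrates the consequence with hypothetical constant per-step rewards. The helper names, the discount of 0.99, and the episode lengths are assumptions chosen for illustration, not values taken from these experiments.

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted sum of per-step rewards for one episode."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def prefers_longer_episode(step_reward, short_len=20, long_len=200, gamma=0.99):
    """True if the longer episode has the higher return for this constant reward."""
    short = discounted_return([step_reward] * short_len, gamma)
    long_ = discounted_return([step_reward] * long_len, gamma)
    return long_ > short


if __name__ == "__main__":
    for step_reward in (0.1, 1.0, 5.0, -1.0):
        print(step_reward, prefers_longer_episode(step_reward))
```

Every positive value prefers the longer episode, whatever its magnitude; only the negative value reverses the preference.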
The third agent optimized the same reward as the second one but with a larger number of negatives and performed better. 
﻿
Run set: 290
﻿
This performance boost is due solely to the scale of the reward and has nothing to do with the underlying mechanics, which shows that CartPole is uninformative for further experiments.
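The dependence of the reward scale on the number of negatives can be checked with a small sketch. It assumes that the denominator also counts the true next state and that scores are cosine-like values in [-1, 1]; the function names, the random scores, and the sizes 10, 100, and 1000 are hypothetical, while the temperature 0.3 mirrors the `temp_3e-1` setting in the group names.

```python
import math
import random


def neg_log_reward(pos_score, negative_scores, tau):
    """-log of the softmax weight of the positive among all candidates.

    Assumption: the positive itself is included in the denominator.
    """
    logits = [pos_score / tau] + [s / tau for s in negative_scores]
    shift = max(logits)
    log_denominator = shift + math.log(sum(math.exp(l - shift) for l in logits))
    return log_denominator - logits[0]


def reward_by_num_negatives(sizes=(10, 100, 1000), tau=0.3, seed=0):
    """Reward for one fixed positive as nested sets of negatives grow."""
    rng = random.Random(seed)
    negatives = [rng.uniform(-1.0, 1.0) for _ in range(max(sizes))]
    pos_score = 0.5  # hypothetical similarity of the true next state
    return {n: neg_log_reward(pos_score, negatives[:n], tau) for n in sizes}


if __name__ == "__main__":
    for n, reward in reward_by_num_negatives().items():
        print(n, round(reward, 3))
```

Each added negative enlarges the denominator, so the reward can only grow. In the extreme case where all N candidates score equally, the reward is exactly log N, roughly 4.6 for 100 candidates and 6.9 for 1000.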
﻿