
Agents trained with Imitation Learning Reward

Note that the DSSMs which provide the IL reward were trained on datasets containing negatives for all states, so the following experiments serve more as a proof of concept. Moreover, only MountainCar and Acrobot are considered, since the agent learns to solve CartPole as long as the reward is positive.
Created on March 12 | Last edited on March 24

Mountain Car

DSSM trained with negatives


[Plot: agent performance vs. training step for run groups MountainCar_Triplets_LogSoftmax_ER_10_neg, MountainCar_Triplets_LogSoftmax_ExpertER_10_neg, MountainCar_Triplets_LogCosRew, MountainCar_Triplets_LogSoftmax_ExpertER_100_neg, MountainCar_Triplets_LogSoftmax_ER_100_neg]

Training the agents with a large number of negatives (sampled from the expert dataset or the replay buffer) is cumbersome due to the scale of the reward: the softmax probability assigned to the observed transition becomes almost negligible, making it hard for the agent to differentiate between actions.
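As a rough illustration of the scale issue, here is a minimal sketch of a LogSoftMax-style reward over one positive next state and N sampled negatives. The DSSM encoders φ1, φ2 and the way negatives are sampled are assumptions here; the names are illustrative rather than the code used in these runs.

```python
import torch
import torch.nn.functional as F

def log_softmax_reward(phi1_s, phi2_next, phi2_negs):
    """Sketch of a LogSoftMax IL reward.

    phi1_s:     embedding of the current state s_i,       shape (d,)
    phi2_next:  embedding of the observed next state,     shape (d,)
    phi2_negs:  embeddings of N sampled negative states,  shape (N, d)
    """
    pos = (phi1_s @ phi2_next).view(1)        # similarity to the true transition
    neg = phi2_negs @ phi1_s                  # similarities to the negatives
    logits = torch.cat([pos, neg])            # positive sits at index 0
    return F.log_softmax(logits, dim=0)[0]    # log-probability of the true transition

# When the scores are hard to separate, the reward sits near log(1 / (N + 1)):
# roughly -2.4 for 10 negatives and -4.6 for 100, so its magnitude grows with N
# while the differences between actions stay comparatively small.
```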
Moreover, the agent trained with $r_i = \log\left(\frac{\phi_1(s_i)^T \phi_2(s_{i+1}) + 1}{2}\right)$ fails to reach the expert's quality. That said, this agent converges to a suboptimal solution every single time, whereas for the agent trained with the LogSoftMax reward with 100 negatives there are several seeds for which a solution was never found.
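For comparison, a sketch of the LogCosine reward from the formula above, again assuming unit-norm DSSM embeddings; the epsilon is an illustrative guard against log(0).

```python
import torch

def log_cos_reward(phi1_s, phi2_next, eps=1e-6):
    """Sketch of r_i = log((phi1(s_i)^T phi2(s_{i+1}) + 1) / 2)."""
    cos = torch.clamp(phi1_s @ phi2_next, -1.0, 1.0)   # in [-1, 1] for unit vectors
    return torch.log((cos + 1.0) / 2.0 + eps)          # at most 0, independent of any negatives
```

Its scale does not depend on a set of negatives, which is consistent with the more stable, if suboptimal, behaviour described above.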
The imitation learning reward and the aforementioned scale problem can be seen in the plot below.


[Plot: imitation learning reward during training]


DSSM trained without negatives



[Plot: agent performance vs. training step, DSSM trained without negatives]

As can be seen above, the SoftMax reward with negatives always manages to reach expert performance, but for both 10 and 20 negatives there is one seed for which the agent does not recover the correct policy (a closer look shows that in both cases the agent solves the task for the first time right at the end of training). Despite that, it is clear that using a fixed subset of the expert dataset is not viable in this case. A possible solution might be to apply temperature to the softmax, as sketched below.
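A sketch of that temperature idea, assuming the same LogSoftMax-style reward as above; the temperature value is purely illustrative and not taken from these experiments.

```python
import torch
import torch.nn.functional as F

def log_softmax_reward_t(phi1_s, phi2_next, phi2_negs, tau=0.5):
    """LogSoftMax reward with temperature tau applied to the similarity scores."""
    logits = torch.cat([(phi1_s @ phi2_next).view(1), phi2_negs @ phi1_s]) / tau
    return F.log_softmax(logits, dim=0)[0]

# tau < 1 sharpens the softmax, so a well-scored positive keeps a probability
# (and hence a reward) closer to 0 even with many negatives; tau > 1 flattens it.
```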

Acrobot

DSSM trained with negatives



[Plot: agent performance vs. training step, Acrobot, DSSM trained with negatives]

Experiments with Acrobot reinforce the conclusions made above: a large number of negatives interferes with the training of the agent and results in poor quality. The plot below shows that the imitation learning reward has a higher variance in that case.


[Plot: imitation learning reward during training]


DSSM trained without negatives


[Plot: agent performance vs. training step, Acrobot, DSSM trained without negatives]

Due to the simplicity of the environment, the agent with the LogCosine reward trains a little faster but converges to the same quality as the agent trained with the LogSoftMax reward. This indicates that the LogSoftMax reward is the preferred choice for the imitation learning task, as it generalizes well across both environments.