
Agents trained with Imitation Learning Reward

Note that the DSSMs which provide the IL reward were trained on datasets containing negatives for all states, so the following experiments serve more as a proof of concept. Moreover, only MountainCar and Acrobot are considered, since the agent learns to solve CartPole as long as the reward is positive.
Created on March 12 | Last edited on March 24

Mountain Car

DSSM trained with negatives


[Plot: agent performance vs. training step for run groups MountainCar_Triplets_LogSoftmax_ER_10_neg, MountainCar_Triplets_LogSoftmax_ExpertER_10_neg, MountainCar_Triplets_LogCosRew, MountainCar_Triplets_LogSoftmax_ExpertER_100_neg, MountainCar_Triplets_LogSoftmax_ER_100_neg]

Training the agents with a large number of negatives (sampled from the expert dataset or the replay buffer) is cumbersome due to the scale of the reward: the softmax probability assigned to the observed transition becomes almost negligible, making it hard for the agent to differentiate between actions.
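As a rough illustration of the scale issue, here is a minimal sketch of a LogSoftMax-style reward over one positive next state and N sampled negatives. The DSSM encoders φ1, φ2 and the way negatives are sampled are assumptions here; the names are illustrative rather than the code used in these runs.

```python
import torch
import torch.nn.functional as F

def log_softmax_reward(phi1_s, phi2_next, phi2_negs):
    """Sketch of a LogSoftMax IL reward.

    phi1_s:     embedding of the current state s_i,       shape (d,)
    phi2_next:  embedding of the observed next state,     shape (d,)
    phi2_negs:  embeddings of N sampled negative states,  shape (N, d)
    """
    pos = (phi1_s @ phi2_next).view(1)        # similarity to the true transition
    neg = phi2_negs @ phi1_s                  # similarities to the negatives
    logits = torch.cat([pos, neg])            # positive sits at index 0
    return F.log_softmax(logits, dim=0)[0]    # log-probability of the true transition

# When the scores are hard to separate, the reward sits near log(1 / (N + 1)):
# roughly -2.4 for 10 negatives and -4.6 for 100, so its magnitude grows with N
# while the differences between actions stay comparatively small.
```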
Moreover, the agent trained with $r_i = \log\left(\frac{\phi_1(s_i)^T \phi_2(s_{i+1}) + 1}{2}\right)$ fails to reach the expert's quality. That said, this agent converges to a suboptimal solution every single time, whereas for the agent trained with the LogSoftMax reward with 100 negatives there are several seeds for which a solution was never found.
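For comparison, a sketch of the LogCosine reward from the formula above, again assuming unit-norm DSSM embeddings; the epsilon is an illustrative guard against log(0).

```python
import torch

def log_cos_reward(phi1_s, phi2_next, eps=1e-6):
    """Sketch of r_i = log((phi1(s_i)^T phi2(s_{i+1}) + 1) / 2)."""
    cos = torch.clamp(phi1_s @ phi2_next, -1.0, 1.0)   # in [-1, 1] for unit vectors
    return torch.log((cos + 1.0) / 2.0 + eps)          # at most 0, independent of any negatives
```

Its scale does not depend on a set of negatives, which is consistent with the more stable, if suboptimal, behaviour described above.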
The imitation learning reward and the aforementioned scale problem can be seen in the plot below.


[Plot: imitation learning reward during training]


DSSM trained without negatives



[Plot: agent performance vs. training step, DSSM trained without negatives]

As can be seen above, the SoftMax reward with negatives always manages to reach expert performance, but for both 10 and 20 negatives there is one seed for which the agent does not recover the correct policy (a closer look shows that in both cases the agent solves the task for the first time right at the end of training). Despite that, it is clear that using a fixed subset of the expert dataset is not viable in this case. A possible solution might be to apply temperature to the softmax, as sketched below.
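A sketch of that temperature idea, assuming the same LogSoftMax-style reward as above; the temperature value is purely illustrative and not taken from these experiments.

```python
import torch
import torch.nn.functional as F

def log_softmax_reward_t(phi1_s, phi2_next, phi2_negs, tau=0.5):
    """LogSoftMax reward with temperature tau applied to the similarity scores."""
    logits = torch.cat([(phi1_s @ phi2_next).view(1), phi2_negs @ phi1_s]) / tau
    return F.log_softmax(logits, dim=0)[0]

# tau < 1 sharpens the softmax, so a well-scored positive keeps a probability
# (and hence a reward) closer to 0 even with many negatives; tau > 1 flattens it.
```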

Acrobot

DSSM trained with negatives



[Plot: agent performance vs. training step, Acrobot, DSSM trained with negatives]

Experiments with Acrobot reinforce the conclusions made above: a large number of negatives interferes with the training of the agent and results in poor quality. The plot below shows that the imitation learning reward has a higher variance in that case.


[Plot: imitation learning reward during training]


DSSM trained without negatives


[Plot: agent performance vs. training step, Acrobot, DSSM trained without negatives]

Due to the simplicity of the environment, the agent with the LogCosine reward trains a little faster but converges to the same quality as the agent trained with the LogSoftMax reward. This indicates that the LogSoftMax reward is the preferred choice for the imitation learning task, as it generalizes well across both environments.