Agents trained with Imitation Learning Reward
Note that the DSSMs which provide the IL reward were trained on datasets containing negatives for all states. Therefore, the following experiments serve more as a proof of concept. Moreover, only MountainCar and Acrobot are considered, as the agent learns to solve CartPole as long as the reward is positive.
Mountain Car
DSSM trained with negatives
[Plot: run set of 43 runs]
Training the agents with a large number of negatives (sampled from the expert dataset or the replay buffer) is cumbersome due to the scale of the reward. The probability of the expert pair becomes almost negligible, making it hard for the agent to differentiate between actions.
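To make the scale issue concrete, below is a minimal sketch, assuming the reward is the log-probability of the expert (state, action) pair under a softmax over its DSSM similarity and the similarities of the sampled negatives; the similarity values and the exact reward definition used in these runs may differ.

```python
import torch
import torch.nn.functional as F

def log_softmax_reward(pos_sim: torch.Tensor, neg_sims: torch.Tensor) -> torch.Tensor:
    """Hypothetical LogSoftMax IL reward: log-probability of the positive
    similarity against the similarities of the sampled negatives."""
    logits = torch.cat([pos_sim.unsqueeze(0), neg_sims])  # positive first
    return F.log_softmax(logits, dim=0)[0]                # reward = log p(positive)

# Illustration of the scale problem: with more negatives the log-probability
# of the positive shrinks towards -log(1 + n_negatives), so per-action
# differences in reward become tiny relative to the reward's magnitude.
pos = torch.tensor(0.9)                    # assumed similarity of the positive
for n in (10, 100, 1000):
    neg = torch.full((n,), 0.5)            # assumed similarity of the negatives
    print(n, log_softmax_reward(pos, neg).item())
```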
Moreover, the agent trained this way fails to reach the expert's quality. Despite that, such an agent converges to a suboptimal solution every single time, whereas for the agent trained with the LogSoftMax reward with 100 negatives there are a number of seeds where no solution was found.
The Imitation Learning reward and the aforementioned scale problem can be seen in the plot below.
[Plot: run set of 43 runs]
DSSM trained without negatives
[Plot: run set of 39 runs]
As can be seen above, the SoftMax reward with negatives always manages to achieve expert performance, but for both 10 and 20 negatives there is one seed for which the agent does not recover the correct policy (a closer look shows that in both cases the agent solves the task for the first time right at the end of training). Despite that, it is clear that using a fixed subset of the expert dataset is not viable in this case. A possible solution might be to apply a temperature to the softmax.
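A hedged sketch of how such a temperature could be introduced, assuming the same hypothetical reward formulation as above; tau is an illustrative parameter, not something used in these runs.

```python
import torch
import torch.nn.functional as F

def tempered_log_softmax_reward(pos_sim: torch.Tensor,
                                neg_sims: torch.Tensor,
                                tau: float = 0.1) -> torch.Tensor:
    """Hypothetical temperature-scaled variant of the LogSoftMax reward.
    A small tau sharpens the distribution, so the positive keeps a
    non-negligible probability even against many negatives."""
    logits = torch.cat([pos_sim.unsqueeze(0), neg_sims]) / tau
    return F.log_softmax(logits, dim=0)[0]
```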
Acrobot
DSSM trained with negatives
[Plot: run set of 40 runs]
Experiments with Acrobot reinforce the conclusions made above. A large number of negatives interferes with the training of the agent, resulting in poor quality. The plot below shows that the imitation learning reward has higher variance in that case.
[Plot: run set of 40 runs]
DSSM trained without negatives
[Plot: run set of 50 runs]
Due to the simplicity of the environment, the agent with the LogCosine reward trains a little faster but converges to the same quality as the agent trained with the LogSoftMax reward, which indicates that the LogSoftMax reward is the preferred choice for the Imitation Learning task, as it generalizes well across both environments.
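For reference, a minimal sketch of what a LogCosine-style reward could look like, assuming it is simply the log of a rescaled cosine similarity between the agent's and the expert's DSSM embeddings; the exact formulation used in the runs may differ.

```python
import torch
import torch.nn.functional as F

def log_cosine_reward(agent_emb: torch.Tensor,
                      expert_emb: torch.Tensor,
                      eps: float = 1e-6) -> torch.Tensor:
    """Hypothetical LogCosine reward: log of the cosine similarity between
    the agent's and the expert's embeddings, rescaled from [-1, 1] to
    (0, 1] so the logarithm is defined. No negatives are needed."""
    cos = F.cosine_similarity(agent_emb, expert_emb, dim=-1)
    return torch.log((cos + 1.0) / 2.0 + eps)
```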