Imitation Learning on CartPole
Agents trained on CartPole with different rewards.
Created on March 15 | Last edited on March 15
[Chart: DSSM trained without negatives — run set, 300 runs]
The agent on CartPole does not care about the exact reward as long as it is positive. To demonstrate this, a set of experiments was conducted.
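A one-line way to see why any positive per-step reward works on CartPole: the episode reward is accrued for every step the pole stays up, so with a constant positive reward the return-to-go at each step is proportional to the remaining episode length, and a policy-gradient agent is still pushed toward longer balancing. A minimal sketch (the helper name and constants are illustrative, not from the report):

```python
import math


def returns_to_go(rewards, gamma=1.0):
    """Compute the return G_t = sum_{k>=t} gamma^(k-t) * r_k for each step t."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]


# With a constant per-step reward c > 0 and no discounting, the return at
# step t is c * (T - t): longer episodes always yield strictly larger
# returns, so the gradient signal favors balancing regardless of c's value.
print(returns_to_go([1.0] * 5))  # [5.0, 4.0, 3.0, 2.0, 1.0]
print(returns_to_go([0.3] * 5))  # same ordering, just rescaled by 0.3
```

This is why the scale (and even the exact shape) of a positive reward matters little here: only the ordering of returns between shorter and longer episodes drives the policy update.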
The first agent optimized a reward equivalent to maximizing the probability of the expert's actions (denoted NegProbLogCosine).
The second agent optimized a reward equivalent to minimizing the probability of the expert's actions (denoted NegLogCosine).
As can be seen, both agents perform similarly.
The third agent optimized the same reward as the second but with a larger number of negatives, and it performed better.
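For concreteness, the two reward variants can be sketched under one assumption: a DSSM-style policy scores actions by the cosine similarity between a state embedding and each action embedding, and the probability of an action is a softmax over those similarities. The function names and the exact softmax form below are assumptions for illustration, not the report's definitions:

```python
import numpy as np


def action_probs(state_emb, action_embs):
    """Softmax over cosine similarities between the state and each action."""
    sims = action_embs @ state_emb / (
        np.linalg.norm(action_embs, axis=1) * np.linalg.norm(state_emb) + 1e-8
    )
    exp = np.exp(sims - sims.max())  # subtract max for numerical stability
    return exp / exp.sum()


def reward_maximize_expert(state_emb, action_embs, expert_action):
    """First variant: log-probability of the expert's action (maximized)."""
    return np.log(action_probs(state_emb, action_embs)[expert_action])


def reward_minimize_expert(state_emb, action_embs, expert_action):
    """Second variant: the negation, i.e. minimizing the expert's probability."""
    return -np.log(action_probs(state_emb, action_embs)[expert_action])
```

Under this reading the two rewards are exact negations of each other, which makes the observation that both agents perform similarly (and that only scale, via more negatives in the softmax, changes the outcome) easy to state.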
[Chart: run set, 290 runs]
This performance boost is due solely to the scale of the reward and has nothing to do with the underlying mechanics, which shows that CartPole is not a useful benchmark for further experiments.