
Train policy (negative KL divergence)

Created on June 23 | Last edited on June 23
This actually reproduces the negative KL divergence issue!

[Charts: three panels comparing the "Ours" and "OAI" runs, with x-axes Time (minutes), global_step, and Step.]