[Offline-to-Online] SPOT
Created on March 27 | Last edited on June 9
Results are averaged over 4 seeds. For each dataset, we plot the D4RL normalized score. Offline pretraining runs for 1M updates, followed by 1M updates of online fine-tuning.
AntMaze reference scores are from the paper "Supported Policy Optimization for Offline Reinforcement Learning".
Adroit reference scores are not available.
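As a rough sketch of how the numbers below can be read: the D4RL normalized score rescales a raw return so that 0 corresponds to a random policy and 100 to an expert policy. The regret computation here is an assumption, since this report does not define it; the sketch takes regret to be the mean shortfall from a perfect normalized score of 100 across online evaluations, clipped to [0, 1]. All function names and input values are hypothetical and for illustration only.

```python
def normalized_score(raw_return, random_return, expert_return):
    """D4RL-style normalization: 0 = random policy, 100 = expert policy."""
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)


def mean_over_seeds(scores):
    """Average one metric across independent seeds (the report uses 4)."""
    return sum(scores) / len(scores)


def regret(online_scores):
    """Mean shortfall from a normalized score of 100 during online tuning.

    ASSUMPTION: this definition is not stated in the report. Scores are
    divided by 100 and clipped to [0, 1] before averaging.
    """
    shortfalls = [1.0 - min(max(s / 100.0, 0.0), 1.0) for s in online_scores]
    return sum(shortfalls) / len(shortfalls)


if __name__ == "__main__":
    # Hypothetical per-seed normalized scores, not taken from the report.
    per_seed = [90.0, 92.0, 91.0, 91.0]
    print("mean over seeds:", mean_over_seeds(per_seed))

    # Hypothetical normalized scores logged at successive online evaluations.
    online_curve = [0.0, 50.0, 100.0]
    print("regret:", regret(online_curve))
```

Under this assumed definition, a run that stays near a normalized score of 0 throughout online tuning has regret close to 1, and a run that reaches 100 almost immediately has regret close to 0.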
AntMaze
umaze-v2
Reference score: 93.2 -> 99.2, regret: NaN
Our: 91.0 -> 99.5, regret: 0.01
umaze-diverse-v2
Reference score: 41.6 -> 96.0, regret: NaN
Our: 36.2 -> 95.0, regret: 0.21
medium-play-v2
Reference score: 75.2 -> 97.4, regret: NaN
Our: 67.2 -> 97.2, regret: 0.05
medium-diverse-v2
Reference score: 73.0 -> 96.2, regret: NaN
Our: 73.7 -> 94.5, regret: 0.05
large-play-v2
Reference score: 40.8 -> 89.4, regret: NaN
Our: 31.5 -> 87.0, regret: 0.29
large-diverse-v2
Reference score: 44.0 -> 90.8, regret: NaN
Our: 17.5 -> 81.0, regret: 0.23
Adroit
Pen
Cloned
Reference: NaN
Our: 6.1 -> 43.6, regret: 0.58
Door
Cloned
Reference: NaN
Our: -0.2 -> 0.0, regret: 0.99
Relocate
Cloned
Reference: NaN
Our: -0.2 -> -0.1, regret: 1.0
Hammer
Cloned
Reference: NaN
Our: 3.9 -> 3.7, regret: 0.97