[Offline-to-Online] IQL
Results are averaged over 4 seeds. For each dataset we plot the D4RL normalized score. Offline pretraining runs for 1M gradient updates, followed by online fine-tuning for another 1M updates.
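For readers less familiar with IQL, the sketch below shows the three losses behind each of those gradient updates, following the standard formulation from the IQL paper (expectile value regression, one-step TD Q-regression, and advantage-weighted policy extraction). This is a minimal illustration, not the code behind these runs; `q_net`, `v_net`, `policy_log_prob`, the batch layout, and the hyperparameter values are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def expectile_loss(diff: torch.Tensor, tau: float = 0.9) -> torch.Tensor:
    # Asymmetric L2: errors where Q > V are weighted by tau, the rest
    # by (1 - tau), so V tracks an upper expectile of Q over actions.
    weight = torch.where(diff > 0,
                         torch.full_like(diff, tau),
                         torch.full_like(diff, 1.0 - tau))
    return (weight * diff.pow(2)).mean()

def iql_losses(q_net, v_net, policy_log_prob, batch,
               gamma: float = 0.99, beta: float = 3.0, tau: float = 0.9):
    s, a, r, s_next, done = batch
    # Value loss: expectile regression of V(s) toward Q(s, a).
    q = q_net(s, a).detach()
    v_loss = expectile_loss(q - v_net(s), tau)
    # Q loss: TD regression toward r + gamma * V(s').
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * v_net(s_next)
    q_loss = F.mse_loss(q_net(s, a), target)
    # Policy loss: advantage-weighted regression; exponentiated
    # advantages are clipped for numerical stability.
    with torch.no_grad():
        adv_weight = torch.exp(beta * (q - v_net(s))).clamp(max=100.0)
    pi_loss = -(adv_weight * policy_log_prob(s, a)).mean()
    return v_loss, q_loss, pi_loss
```

The same losses are used in both phases; only the data source changes, from the static offline dataset to a replay buffer that grows with online rollouts.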
AntMaze reference scores are taken from Supported Policy Optimization for Offline Reinforcement Learning (SPOT).
Reference regrets are taken from Cal-QL: Calibrated Offline RL Pre-Training for Efficient Online Fine-Tuning.
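For concreteness, both reported quantities can be computed roughly as below. The normalized score uses d4rl's real `get_normalized_score` helper; the regret function is our paraphrase of the Cal-QL metric (the policy's average shortfall from the maximal normalized score across online-phase evaluations), not Cal-QL's code.

```python
import gym
import numpy as np
import d4rl  # noqa: F401  (registers AntMaze/Adroit envs and their reference scores)

# D4RL-normalized score: 100 * (return - random) / (expert - random).
# For AntMaze the raw return is the success rate, so 0.9 maps to ~90.
env = gym.make("antmaze-umaze-v2")
print(env.get_normalized_score(0.9) * 100.0)

def online_regret(normalized_scores):
    # Mean shortfall from the maximal normalized score (1.0) across
    # online-phase evaluations: low regret = fast, stable fine-tuning.
    scores = np.asarray(normalized_scores, dtype=np.float64)
    return float(np.mean(1.0 - scores))

print(online_regret([0.2, 0.6, 0.9, 0.95, 0.97]))  # ~0.28
```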
AntMaze
umaze-v2
Reference score: 85.4 -> 96.2, regret: n/a
Ours: 77.0 -> 96.5, regret: 0.07
umaze-diverse-v2
Reference score: 70.8 -> 62.2, regret: n/a
Ours: 59.5 -> 78.0, regret: 63.7
medium-play-v2
Reference score: 68.6 -> 89.8, regret: 0.10
Ours: 71.7 -> 89.7, regret: 0.09
medium-diverse-v2
Reference score: 73.4 -> 90.2, regret: 0.09
Ours: 64.2 -> 92.2, regret: 0.10
large-play-v2
Reference score: 40.0 -> 78.6, regret: 0.52
Ours: 38.5 -> 64.5, regret: 0.33
large-diverse-v2
Reference score: 40.4 -> 73.4, regret: 0.46
Ours: 26.7 -> 64.2, regret: 0.41
Adroit
Pen
Cloned
Reference: n/a
Ours: 83.7 -> 102.0, regret: 0.36
Door
Cloned
Reference: n/a
Ours: 1.1 -> 20.3, regret: 0.83
Relocate
Cloned
Reference: n/a
Ours: 0.0 -> 0.3, regret: 0.99
Hammer
Cloned
Reference: n/a
Ours: 1.3 -> 57.2, regret: n/a