[Offline-to-Online] AWAC
Created on March 16 | Last edited on June 6
Results are averaged over 4 seeds. For each dataset we plot the D4RL normalized score. Offline pretraining runs for 1M updates, followed by online fine-tuning for another 1M updates. Each entry below reads: score after offline pretraining -> score after online fine-tuning.
AntMaze reference scores are taken from "Offline Reinforcement Learning with Implicit Q-Learning". Note that the paper reports results on the v0 datasets, whereas the runs here use v2.
Reference scores are not available for the Adroit tasks.
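For readers unfamiliar with the metrics, the sketch below shows how numbers of this kind could be computed. The normalized score follows the standard D4RL convention, 100 * (return - random) / (expert - random), which the d4rl package exposes per environment as `env.get_normalized_score`. The `regret` function is an assumption: this report does not define regret, so the sketch uses one common choice, the mean over online evaluations of 1 minus the normalized score scaled to [0, 1]. All function names and the sample numbers are hypothetical and are not taken from the runs above.

```python
import numpy as np


def normalized_score(raw_return, random_return, expert_return):
    """D4RL-style normalized score: 0 = random policy, 100 = expert policy."""
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)


def mean_over_seeds(scores_per_seed):
    """Average evaluation curves across seeds.

    scores_per_seed has shape (n_seeds, n_evaluations).
    """
    return np.asarray(scores_per_seed, dtype=float).mean(axis=0)


def regret(online_normalized_scores):
    """ASSUMED definition: mean of (1 - score / 100) over online evaluations.

    The report does not state its regret formula; this is one common choice.
    """
    scores = np.asarray(online_normalized_scores, dtype=float) / 100.0
    return float(np.mean(1.0 - scores))


if __name__ == "__main__":
    # Hypothetical online evaluation curves for 4 seeds (normalized scores).
    curves = [
        [50.0, 80.0, 100.0],
        [60.0, 90.0, 100.0],
        [40.0, 70.0, 100.0],
        [50.0, 80.0, 100.0],
    ]
    averaged = mean_over_seeds(curves)
    print("seed-averaged curve:", averaged)
    print("regret:", regret(averaged))
```

Under this assumed definition, a policy that scores 100 at every online evaluation has regret 0, and one that stays at 0 throughout has regret 1.0.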
AntMaze
umaze-v2
Reference score: 56.7 -> 59.0, regret: NaN
Our: 52.7 -> 98.7, regret: 0.04
umaze-diverse-v2
Reference score: 49.3 -> 49.0, regret: NaN
Our: 61.2 -> 0.0, regret: 0.88
medium-play-v2
Reference score: 0.0 -> 0.0, regret: NaN
Our: 0.0 -> 0.0, regret: 1.0
medium-diverse-v2
Reference score: 0.7 -> 0.3, regret: NaN
Our: 0.0 -> 0.0, regret: 0
large-play-v2
Reference score: 0.0 -> 0.0, regret: NaN
Our: 0.0 -> 0.0, regret: 1.0
large-diverse-v2
Reference score: 1.0 -> 0.0, regret: NaN
Our: 0.0 -> 0.0, regret: 1.0
Adroit
Pen
Cloned
Reference: NaN
Our: 88.6 -> 86.8, regret: 0.46
Door
Cloned
Reference: NaN
Our: 0.9 -> 0.0, regret: 0.99
Relocate
Cloned
Reference: NaN
Our: 0.0 -> 0.0, regret: 1.0
Hammer
Cloned
Reference: NaN
Our: 1.8 -> 0.2, regret: 0.99