[Offline-to-Online] AWAC

Created on March 16|Last edited on June 6
Results are averaged over 4 seeds. For each dataset we plot the D4RL normalized score. Offline pretraining runs for 1M updates, followed by online fine-tuning for another 1M updates.
AntMaze reference scores are taken from "Offline Reinforcement Learning with Implicit Q-Learning". Note that the paper reports results on the v0 datasets, whereas the runs here use v2.
Reference scores are not available for the Adroit tasks.
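
As a rough illustration of the quantities reported below, here is a minimal sketch of how a D4RL-style normalized score and a seed average can be computed. The helper names and the example reference returns are hypothetical. The `mean_regret` helper assumes regret is the average of (1 - success) over online episodes; this report does not state its exact definition, so treat that part as an assumption.

```python
import numpy as np


def normalized_score(raw_return, random_return, expert_return):
    """D4RL-style normalization: 0 matches a random policy, 100 matches an expert."""
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)


def mean_over_seeds(per_seed_scores):
    """Average a metric over independent seeds (4 seeds in this report)."""
    return float(np.mean(per_seed_scores))


def mean_regret(online_successes):
    """Assumed definition (not stated in the report): mean of (1 - success)
    over the episodes collected during online fine-tuning."""
    successes = np.asarray(online_successes, dtype=float)
    return float(np.mean(1.0 - successes))


if __name__ == "__main__":
    # Illustrative reference returns only; real values come from the benchmark.
    per_seed_raw = [0.5, 1.0, 0.0, 0.5]
    per_seed_norm = [normalized_score(r, 0.0, 1.0) for r in per_seed_raw]
    print("normalized score (mean over seeds):", mean_over_seeds(per_seed_norm))
    print("regret:", mean_regret([1, 1, 0, 1]))
```

Under this assumed definition, a regret of 1.0 corresponds to no successful online episodes, and a regret near 0 to nearly all episodes succeeding.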

AntMaze

umaze-v2

Reference score: 56.7 -> 59.0, regret: NaN
Our: 52.7 -> 98.7, regret: 0.04

[Plots: two panels with training Step on the x-axis]


umaze-diverse-v2

Reference score: 49.3 -> 49.0, regret: NaN
Our: 61.2 -> 0.0, regret: 0.88



medium-play-v2

Reference score: 0.0 -> 0.0, regret: NaN
Our: 0.0 -> 0.0, regret: 1.0



medium-diverse-v2

Reference score: 0.7 -> 0.3, regret: NaN
Our: 0.0 -> 0.0, regret: 0



large-play-v2

Reference score: 0.0 -> 0.0, regret: NaN
Our: 0.0 -> 0.0, regret: 1.0



large-diverse-v2

Reference score: 1.0 -> 0.0, regret: NaN
Our: 0.0 -> 0.0, regret: 1.0



Adroit

Pen

Cloned

Reference: NaN
Our: 88.6 -> 86.8, regret: 0.46



Door

Cloned

Reference: NaN
Our: 0.9 -> 0.0, regret: 0.99




Relocate

Cloned

Reference: NaN
Our: 0.0 -> 0.0, regret: 1.0




Hammer

Cloned

Reference: NaN
Our: 1.8 -> 0.2, regret: 0.99
