
[Offline-to-Online] CQL

Results are averaged over 4 seeds. For each dataset we plot the D4RL normalized score. Offline pretraining runs for 1M updates, followed by online fine-tuning for another 1M updates.
Adroit reference scores are taken from Cal-QL: Calibrated Offline RL Pre-Training for Efficient Online Fine-Tuning. Note that we use a different version of the dataset than Cal-QL and do not apply any positive-segment subsampling tricks.
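
For context, here is a minimal sketch of how we interpret the two quantities reported below: the D4RL normalized score and the regret. The regret computation is in the spirit of Cal-QL's cumulative regret (the average shortfall from the maximum normalized score across evaluations during online tuning); the function names and the logging cadence in the example are our own illustration, not the exact evaluation code behind these numbers.

```python
import numpy as np

# Hypothetical helpers; the report does not include its evaluation code.
def d4rl_normalized_score(episode_return, random_score, expert_score):
    """Standard D4RL normalization: 0 corresponds to a random policy,
    100 to the reference (expert) policy."""
    return 100.0 * (episode_return - random_score) / (expert_score - random_score)

def online_regret(eval_normalized_scores):
    """Average shortfall from the best normalized score (100) over the
    evaluations logged during the 1M online fine-tuning updates."""
    scores = np.clip(np.asarray(eval_normalized_scores, dtype=np.float64), 0.0, 100.0)
    return float(np.mean(1.0 - scores / 100.0))

# Example: normalized scores evaluated periodically during online tuning.
scores_during_online = [94.0, 96.0, 98.5, 99.5]
print(online_regret(scores_during_online))  # 0.03
```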

AntMaze

umaze-v2

Reference score: not reported, regret: not reported
Ours: 94.0 -> 99.5, regret: 0.02



umaze-diverse-v2

Reference score: not reported, regret: not reported
Ours: 9.5 -> 99.0, regret: 0.08



medium-play-v2

Reference score: 63 -> 99, regret: 0.08
Ours: 59 -> 97.7, regret: 0.08



medium-diverse-v2

Reference score: 68 -> 98, regret: 0.06
Ours: 63.5 -> 97.2, regret: 0.07



large-play-v2

Reference score: 29 -> 65, regret: 0.43
Ours: 21.5 -> 88.2, regret: 0.21



large-diverse-v2

Reference score: 27 -> 84, regret: 0.39
Ours: 35.5 -> 91.7, regret: 0.2



Adroit

pen-cloned-v1

Reference score: not reported, regret: not reported
Ours: -2.8 -> -1.2, regret: 0.97



door-cloned-v1

Reference score: not reported, regret: not reported
Ours: -0.3 -> -0.3, regret: 1.0



hammer-cloned-v1

Reference score: not reported, regret: not reported
Ours: 0.4 -> 2.8, regret: 0.99



relocate-cloned-v1

Reference score: not reported, regret: not reported
Ours: -0.3 -> -0.3, regret: 1.0
