[Offline-to-Online] Cal-QL

Created on June 3 | Last edited on July 7
Results are averaged over 4 seeds. For each dataset we plot the D4RL normalized score. Offline pretraining runs for 1M updates, followed by online fine-tuning for another 1M updates. Scores are written as "after offline pretraining -> after online fine-tuning".
Adroit reference scores are from "Cal-QL: Calibrated Offline RL Pre-Training for Efficient Online Fine-Tuning". Note that we use a different version of the dataset than Cal-QL does, and we do not apply any positive-segment subsampling tricks.
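
For readers unfamiliar with the metrics, here is a minimal sketch of how the numbers below could be computed. The D4RL normalization formula is standard (0 corresponds to a random policy, 100 to an expert policy). The regret definition is an assumption, since this report does not state it: we take regret to be the mean of (1 - success) over online training episodes. All function names and example values are hypothetical and are not taken from the actual runs.

```python
import numpy as np


def d4rl_normalized_score(raw_return, random_return, expert_return):
    """D4RL normalization: 0 = random policy, 100 = expert policy."""
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)


def mean_over_seeds(per_seed_scores):
    """Average evaluation curves across seeds (axis 0 indexes the seed)."""
    return np.mean(np.asarray(per_seed_scores, dtype=float), axis=0)


def online_regret(train_successes):
    """ASSUMPTION: regret = mean of (1 - success) over online training episodes.

    `train_successes` holds one 0/1 success flag per online episode.
    """
    successes = np.asarray(train_successes, dtype=float)
    return float(np.mean(1.0 - successes))


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(d4rl_normalized_score(raw_return=50.0, random_return=0.0, expert_return=100.0))
    print(mean_over_seeds([[0.0, 100.0], [100.0, 100.0]]))
    print(online_regret([1, 1, 0, 1]))
```

Under this definition, a regret near 0 means the agent succeeded in almost every online episode, and a regret near 1 means it almost never did.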

AntMaze

umaze-v2

Reference score: NaN, regret: NaN
Ours: 76.7 -> 99.7, regret: 0.01


umaze-diverse-v2

Reference score: NaN, regret: NaN
Ours: 32.0 -> 98.5, regret: 0.04


medium-play-v2

Reference score: 54 -> 98, regret: 0.07
Ours: 71.7 -> 98.7, regret: 0.04


medium-diverse-v2

Reference score: 73 -> 98, regret: 0.06
Ours: 62.0 -> 98.2, regret: 0.03


large-play-v2

Reference score: 28 -> 90, regret: 0.27
Ours: 31.7 -> 97.2, regret: 0.12


large-diverse-v2

Reference score: 32 -> 94, regret: 0.21
Ours: 44.0 -> 91.5, regret: 0.13


Adroit

pen-cloned-v1

Reference score: NaN, regret: NaN
Ours: -2.7 -> -2.7, regret: 0.97


door-cloned-v1

Reference score: NaN, regret: NaN
Ours: -0.3 -> -0.3, regret: 1.0


hammer-cloned-v1

Reference score: NaN, regret: NaN
Ours: 0.2 -> 0.1, regret: 0.99


relocate-cloned-v1

Reference score: NaN, regret: NaN
Ours: -0.3 -> -0.3, regret: 0.99
