[Offline-to-Online] Cal-QL
Created on June 3 | Last edited on July 7
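Results are averaged over 4 seeds. For each dataset we plot the D4RL normalized score. Offline pretraining runs for 1M updates, followed by 1M updates of online fine-tuning. Each entry below reports the score after offline pretraining -> the score after online fine-tuning, together with the regret.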
AntMaze reference scores are from Cal-QL: Calibrated Offline RL Pre-Training for Efficient Online Fine-Tuning.
Adroit reference scores are from the same paper. Note that we use a different version of the dataset than Cal-QL does, and we do not apply any positive-segment subsampling hacks.
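To make the reported quantities concrete, here is a minimal sketch of how a D4RL-style normalized score, a seed average, and a regret value could be computed. The normalization follows the standard D4RL convention (0 for a random policy, 100 for an expert policy). The regret definition used here (mean shortfall from a perfect success rate across online fine-tuning episodes) is an assumption about how this report computes it, and the function names are hypothetical, not taken from the training code.

```python
from statistics import mean


def normalized_score(raw_return, random_return, expert_return):
    """D4RL-style normalization: 0 = random policy, 100 = expert policy."""
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)


def mean_over_seeds(scores_per_seed):
    """Average a metric over independent seeds (this report uses 4)."""
    return mean(scores_per_seed)


def regret(online_successes):
    """Assumed definition: average of (1 - success) over online episodes.

    `online_successes` holds one value in [0, 1] per online episode,
    e.g. 1.0 if the goal was reached and 0.0 otherwise.
    """
    return mean([1.0 - s for s in online_successes])


if __name__ == "__main__":
    # Illustrative numbers only; they are not results from this report.
    print(normalized_score(raw_return=50.0, random_return=0.0, expert_return=100.0))
    print(mean_over_seeds([60.0, 70.0, 80.0, 90.0]))
    print(regret([1.0, 1.0, 0.0, 1.0]))
```

Under this assumed definition, a regret near 0 means the agent succeeded in almost every online episode, while a regret near 1 means it almost never succeeded.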
AntMaze
umaze-v2
Reference score: NaN, regret: NaN
Our: 76.7 -> 99.7, regret: 0.01
umaze-diverse-v2
Reference score: NaN, regret: NaN
Our: 32.0 -> 98.5, regret: 0.04
medium-play-v2
Reference score: 54 -> 98, regret: 0.07
Our: 71.7 -> 98.7, regret: 0.04
medium-diverse-v2
Reference score: 73 -> 98, regret: 0.06
Our: 62.0 -> 98.2, regret: 0.03
large-play-v2
Reference score: 28 -> 90, regret: 0.27
Our: 31.7 -> 97.2, regret: 0.12
large-diverse-v2
Reference score: 32 -> 94, regret: 0.21
Our: 44.0 -> 91.5, regret: 0.13
Adroit
pen-cloned-v1
Reference score: NaN, regret: NaN
Our: -2.7 -> -2.7, regret: 0.97
door-cloned-v1
Reference score: NaN, regret: NaN
Our: -0.3 -> -0.3, regret: 1.0
hammer-cloned-v1
Reference score: NaN, regret: NaN
Our: 0.2 -> 0.1, regret: 0.99
relocate-cloned-v1
Reference score: NaN, regret: NaN
Our: -0.3 -> -0.3, regret: 0.99