[Offline-to-Online] CQL
Created on June 3 | Last edited on June 3
Results are averaged over 4 seeds. For each dataset we plot the D4RL normalized score. Offline pretraining lasts for 1M updates, followed by online fine-tuning for another 1M updates.
AntMaze and Adroit reference scores are from Cal-QL: Calibrated Offline RL Pre-Training for Efficient Online Fine-Tuning. Note that for Adroit we use a different version of the dataset from Cal-QL, and we do not apply its positive-segment subsampling trick.
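For clarity on the two metrics reported below, here is a minimal sketch. It assumes the standard D4RL normalization formula and a Cal-QL-style cumulative regret (one minus the average success rate over online fine-tuning); the function names and the example numbers are illustrative, not taken from our training code.

```python
def normalized_score(raw_return, random_score, expert_score):
    """D4RL normalized score: 0 ~ random policy, 100 ~ expert policy.

    Equivalent to d4rl.get_normalized_score(...) * 100 when the
    per-environment random/expert constants are used.
    """
    return 100.0 * (raw_return - random_score) / (expert_score - random_score)


def online_regret(success_rates):
    """Cumulative regret over online tuning: 1 - mean success rate.

    `success_rates` is the per-evaluation success rate (in [0, 1])
    logged during the 1M online updates.
    """
    return 1.0 - sum(success_rates) / len(success_rates)


# Illustrative example: a run that improves steadily during fine-tuning.
rates = [0.6, 0.7, 0.8, 0.9, 1.0]
print(round(online_regret(rates), 2))  # regret of 0.2
```

A run that is near-expert from the start of online tuning therefore gets regret close to 0, while a run that never succeeds (as on some Adroit cloned datasets below) gets regret close to 1.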
AntMaze
umaze-v2
Reference score: NaN, regret: NaN
Ours: 94.0 -> 99.5, regret: 0.02
umaze-diverse-v2
Reference score: NaN, regret: NaN
Ours: 9.5 -> 99.0, regret: 0.08
medium-play-v2
Reference score: 63 -> 99, regret: 0.08
Ours: 59.0 -> 97.7, regret: 0.08
medium-diverse-v2
Reference score: 68 -> 98, regret: 0.06
Ours: 63.5 -> 97.2, regret: 0.07
large-play-v2
Reference score: 29 -> 65, regret: 0.43
Ours: 21.5 -> 88.2, regret: 0.21
large-diverse-v2
Reference score: 27 -> 84, regret: 0.39
Ours: 35.5 -> 91.7, regret: 0.20
Adroit
pen-cloned-v1
Reference score: NaN, regret: NaN
Ours: -2.8 -> -1.2, regret: 0.97
door-cloned-v1
Reference score: NaN, regret: NaN
Ours: -0.3 -> -0.3, regret: 1.00
hammer-cloned-v1
Reference score: NaN, regret: NaN
Ours: 0.4 -> 2.8, regret: 0.99
relocate-cloned-v1
Reference score: NaN, regret: NaN
Ours: -0.3 -> -0.3, regret: 1.00