In this report, we study the benefits of Qdagger [1]. We implement Qdagger following the paper's instructions and compare it to baselines without Qdagger. From [1], assuming the student policy π(⋅∣s)=softmax(Q(s,⋅)/τ), we define the new loss as
LQDagger(D)=LTD+λtEs∼D[∑aπT(a∣s)logπ(a∣s)]
Distillation loss (offline < 500k steps)
Distillation loss (offline < 500k steps)
videos
This run didn't log media for key "videos", step 7251, index 0. Docs →
This run didn't log media for key "videos", step 7199, index 0. Docs →
This run didn't log media for key "videos", step 7162, index 0. Docs →
This run didn't log media for key "videos", step 7389, index 0. Docs →
This run didn't log media for key "videos", step 7497, index 0. Docs →
This run didn't log media for key "videos", step 7452, index 0. Docs →
This run didn't log media for key "videos", step 5852, index 0. Docs →
This run didn't log media for key "videos", step 5901, index 0. Docs →
This run didn't log media for key "videos", step 5846, index 0. Docs →
This run didn't log media for key "videos", step 5771, index 0. Docs →
This run didn't log media for key "videos", step 5813, index 0. Docs →
This run didn't log media for key "videos", step 5777, index 0. Docs →