
Atari: CleanRL's Qdagger

Created on June 7 | Last edited on December 17
In this report, we study the benefits of Qdagger [1]. We implement Qdagger following the paper's instructions and compare it to baselines without Qdagger. Following [1], assuming the student policy $\pi(\cdot|s)=\mathrm{softmax}(Q(s,\cdot)/\tau)$, we define the new loss as
$$\mathcal{L}_{\mathrm{QDagger}}(\mathcal{D}) = \mathcal{L}_{TD} + \lambda_t\,\mathbb{E}_{s\sim\mathcal{D}}\left[\sum_a \pi_T(a|s)\log \pi(a|s)\right]$$
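The distillation term above can be sketched as follows. This is a minimal numpy illustration, not CleanRL's actual implementation (which operates on PyTorch/JAX tensors inside the training loop); the function name and signature are hypothetical. In practice the term is computed as the cross-entropy between the teacher and student policies, so that minimizing the loss pulls the student toward the teacher.

```python
import numpy as np

def qdagger_loss(student_q, teacher_q, td_loss, lam, tau=1.0):
    """Sketch of the QDagger loss: TD loss plus a policy-distillation term.

    student_q, teacher_q: arrays of shape (batch, num_actions).
    lam: the distillation coefficient lambda_t, decayed over training in [1].
    """
    def log_softmax(q):
        z = q - q.max(axis=-1, keepdims=True)  # subtract max for stability
        return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

    teacher_pi = np.exp(log_softmax(teacher_q / tau))  # pi_T(a|s)
    student_log_pi = log_softmax(student_q / tau)      # log pi(a|s)
    # Cross-entropy between teacher and student policies, averaged over the
    # batch; minimizing it matches the student's policy to the teacher's.
    distill = -(teacher_pi * student_log_pi).sum(axis=-1).mean()
    return td_loss + lam * distill
```

When the student's Q-values already match the teacher's, the distillation term reduces to the entropy of the teacher's policy, so the extra loss cannot be driven to zero unless the teacher is deterministic.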



[Figure: Episodic return over 10M steps, comparing CleanRL's qdagger_dqn_atari_impalacnn.py, qdagger_dqn_atari_jax_impalacnn.py, dqn_atari.py, and dqn_atari_jax.py (3 runs each).]


