SAC-without-AirSim-steps
Learning to perform several OpenAI Gym and MuJoCo control tasks (goal scores in parentheses): Pendulum-v0 (-200), CartPole-v1 (500), LunarLanderContinuous-v2 (200), Hopper-v2 (3500), and HalfCheetah-v2 (4800). Soft Actor-Critic (SAC) is trained on each task's reward function (the CartPole-v1 run uses DQN instead, since SAC requires a continuous action space). The agents reach the goal scores, as verified in the "mean episode reward (RL)" chart. The trained SAC models are then used to generate expert demonstrations of each task, and GAIL is trained on these demonstrations to imitate the tasks. A minimal sketch of this pipeline follows below.
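The sketch below assumes the stable-baselines (2.x) implementations of SAC and GAIL, which the policy names in the run table (CustomSACPolicy, MlpPolicy) suggest; the environment, timestep budgets, and file names are illustrative, not the exact settings of these runs.

```python
import gym
from stable_baselines import SAC, GAIL
from stable_baselines.gail import ExpertDataset, generate_expert_traj

# 1) Train SAC on the task's reward (Pendulum-v0 used as an example).
sac_model = SAC("MlpPolicy", "Pendulum-v0", verbose=1)
sac_model.learn(total_timesteps=50000)  # illustrative budget, not the runs' value
sac_model.save("sac_pendulum")

# 2) Roll out the trained SAC policy to record expert demonstrations.
#    Saves observations/actions/rewards to expert_pendulum.npz.
generate_expert_traj(sac_model, "expert_pendulum", n_timesteps=0, n_episodes=10)

# 3) Train GAIL to imitate the demonstrations (the task reward is not used).
dataset = ExpertDataset(expert_path="expert_pendulum.npz", traj_limitation=-1, verbose=1)
gail_model = GAIL("MlpPolicy", "Pendulum-v0", dataset, verbose=1)
gail_model.learn(total_timesteps=100000)
gail_model.save("gail_pendulum")
```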
Section 1
Run configurations and runtimes:

| env | algo | policy | runtime | learning_rate | batch_size | buffer_size | ent_coef | learning_starts | gamma |
|---|---|---|---|---|---|---|---|---|---|
| HalfCheetah-v2 | sac | CustomSACPolicy | 7h 27m 59s | 0.0003 | 256 | 1000000 | auto | 10000 | 0.99 |
| Hopper-v2 | sac | CustomSACPolicy | 7h 6m 25s | lin_3e-4 | 256 | 1000000 | 0.01 | 1000 | - |
| LunarLanderContinuous-v2 | sac | MlpPolicy | 2h 18m 37s | - | 256 | - | - | 1000 | - |
| CartPole-v1 | dqn | CustomDQNPolicy | 21m 5s | 0.001 | - | 50000 | - | - | - |
| Pendulum-v0 | sac | MlpPolicy | 15m 35s | - | - | - | - | 1000 | - |
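For context, the HalfCheetah-v2 row maps onto the stable-baselines SAC constructor roughly as sketched below. This is an assumption-laden illustration: the runs were presumably launched via rl-baselines-zoo, and its CustomSACPolicy (a wider MLP) is replaced here by the stock MlpPolicy; the timestep budget is not taken from the table.

```python
from stable_baselines import SAC

# Sketch: the HalfCheetah-v2 run's config expressed as a direct SAC call.
# "MlpPolicy" stands in for the zoo's CustomSACPolicy.
model = SAC(
    "MlpPolicy",
    "HalfCheetah-v2",           # requires mujoco-py
    batch_size=256,
    buffer_size=1000000,
    ent_coef="auto",            # automatic entropy-coefficient tuning
    learning_rate=0.0003,
    learning_starts=10000,
    gamma=0.99,
    verbose=1,
)
model.learn(total_timesteps=int(2e6))  # illustrative; the actual step budget is not shown in the table
model.save("sac_halfcheetah")
```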