Complete Pong Experiments (A3C)
Created on September 19 | Last edited on September 26
Initial: Learned Blocker + Modified Reward (varying N, alpha, beta)
Training runs for N + K = 3,000,000 steps. The first N steps are the Human Oversight Phase, during which the Blocker is periodically trained. For the remaining K steps, the Learned Blocker (a CNN classifier) takes the place of the "Human" Expert Blocker.
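The two-phase schedule above can be sketched as follows. This is an illustrative sketch only: the function names and the label-collection detail are assumptions, not the actual implementation.

```python
def human_oversight_active(step: int, n_oversight: int) -> bool:
    """True during the first N steps, when the human/expert blocker is
    queried and its decisions serve as training labels for the CNN
    blocker classifier."""
    return step < n_oversight


def training_schedule(total_steps: int, n_oversight: int):
    """Yield which blocker is in control at each step of the N + K run.
    Hypothetical sketch: the real loop would interleave A3C updates and
    periodic retraining of the CNN blocker during the first phase."""
    for step in range(total_steps):
        if human_oversight_active(step, n_oversight):
            yield step, "human_blocker"    # Human Oversight Phase (N steps)
        else:
            yield step, "learned_blocker"  # CNN classifier takes over (K steps)
```

For example, with `n_oversight=50_000`, steps 0 through 49,999 use the expert blocker and every later step uses the learned one.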
- a3c baseline: 5 seeds of a baseline A3C policy trained on Pong, with no action replacement, reward modification, or blocker. Serves as a general upper bound for cumulative catastrophes.
- n = 50k; n = 20k: N, the human oversight phase, is 50,000 and 20,000 steps respectively. Below, the 50k run trains for a maximum of 5,000,000 steps and the 20k run for a maximum of 3,000,000 steps.
- 09/20 Update: These are preliminary results; as such, the max training steps vary significantly between runs, newer runs log more variables to W&B, and multiple seeds have not yet been run. Still, the charts below should give an initial sense of the overall goal: for larger N (the blocker is trained for longer), cumulative catastrophes should be lower; for larger N, the policy will also likely take longer to converge to a "good" reward because (hopefully) it is more strictly avoiding the catastrophe zone. A more comprehensive set of experiments is currently training, and the immediate next step is a hyper-parameter sweep over N, alpha (coefficient on the entropy of the blocker), and beta (coefficient on the probability of the blocker being wrong).
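As a rough illustration of the reward modification being swept, here is a hedged sketch. The exact penalty terms are assumptions based only on the description above: alpha scales the blocker's predictive entropy, and beta scales an estimate of the probability the blocker is wrong, proxied here by min(p, 1 - p).

```python
import math


def modified_reward(env_reward: float, block_prob: float,
                    alpha: float, beta: float) -> float:
    """Sketch of the modified reward. `block_prob` is the learned
    blocker's predicted probability that the action should be blocked.
    Both penalty terms below are assumptions, not the exact form used
    in the experiments."""
    p = min(max(block_prob, 1e-6), 1.0 - 1e-6)  # clamp for log stability
    # Binary entropy of the blocker's prediction (uncertainty penalty).
    entropy = -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))
    # Proxy for the probability the blocker's hard decision is wrong.
    p_wrong = min(p, 1.0 - p)
    return env_reward - alpha * entropy - beta * p_wrong
```

With alpha = beta = 0 this reduces to the unmodified environment reward; larger coefficients penalize the policy for entering states where the blocker is uncertain or likely mistaken.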
[Chart — runs: Pong Baseline (5), Ours 50k (5), Ours 20k (5), HIRL 50k (5), HIRL 20k (5), Perfect Blocker (5)]
Initial: Learned Blocker + Modified Reward (50K)
[Chart — runs: Pong Baseline (5), Ours 50k (5), Ours 20k (0), HIRL 50k (5), HIRL 20k (0), Perfect Blocker (5)]
Initial: Learned Blocker + Modified Reward (20K)
[Chart — runs: Pong Baseline (5), Ours 50k (0), Ours 20k (5), HIRL 50k (0), HIRL 20k (5), Perfect Blocker (5)]