Complete Pong Experiments (A3C)
Created on September 19 | Last edited on September 26
Initial: Learned Blocker + Modified Reward (varying N, alpha, beta)
Training runs for N + K = 3,000,000 steps. The first N steps are the Human Oversight Phase, during which the Blocker is periodically trained. For the remaining K steps, the Learned Blocker (a CNN classifier) takes the place of the "Human" Expert Blocker.
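The two-phase schedule above can be sketched as follows. This is an illustrative sketch only: the function names and the label-collection detail are assumptions, not the actual implementation.

```python
def human_oversight_active(step: int, n_oversight: int) -> bool:
    """True during the first N steps, when the human/expert blocker is
    queried and its decisions serve as training labels for the CNN
    blocker classifier."""
    return step < n_oversight


def training_schedule(total_steps: int, n_oversight: int):
    """Yield which blocker is in control at each step of the N + K run.
    Hypothetical sketch: the real loop would interleave A3C updates and
    periodic retraining of the CNN blocker during the first phase."""
    for step in range(total_steps):
        if human_oversight_active(step, n_oversight):
            yield step, "human_blocker"    # Human Oversight Phase (N steps)
        else:
            yield step, "learned_blocker"  # CNN classifier takes over (K steps)
```

For example, with `n_oversight=50_000`, steps 0 through 49,999 use the expert blocker and every later step uses the learned one.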
- a3c baseline: 5 seeds of a baseline A3C policy trained on Pong, with no action replacement, reward modification, or blocker. Serves as a general upper bound for cumulative catastrophes.
- n = 50k; n = 20k: N, the human oversight phase, is 50,000 and 20,000 steps respectively. Below, the 50k run trains for a maximum of 5,000,000 steps and the 20k run for a maximum of 3,000,000 steps.
- 09/20 Update: These are preliminary results; as such, the max training steps vary significantly between runs, newer runs log more variables to W&B, and multiple seeds have not yet been run. Still, the charts below should give an initial sense of the overall goal: for larger N (the blocker is trained for longer), cumulative catastrophes should be lower; for larger N, the policy will also likely take longer to converge to a "good" reward because (hopefully) it is more strictly avoiding the catastrophe zone. A more comprehensive set of experiments is currently training, and the immediate next step is a hyper-parameter sweep over N, alpha (coefficient on the entropy of the blocker), and beta (coefficient on the probability of the blocker being wrong).
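As a rough illustration of the reward modification being swept, here is a hedged sketch. The exact penalty terms are assumptions based only on the description above: alpha scales the blocker's predictive entropy, and beta scales an estimate of the probability the blocker is wrong, proxied here by min(p, 1 - p).

```python
import math


def modified_reward(env_reward: float, block_prob: float,
                    alpha: float, beta: float) -> float:
    """Sketch of the modified reward. `block_prob` is the learned
    blocker's predicted probability that the action should be blocked.
    Both penalty terms below are assumptions, not the exact form used
    in the experiments."""
    p = min(max(block_prob, 1e-6), 1.0 - 1e-6)  # clamp for log stability
    # Binary entropy of the blocker's prediction (uncertainty penalty).
    entropy = -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))
    # Proxy for the probability the blocker's hard decision is wrong.
    p_wrong = min(p, 1.0 - p)
    return env_reward - alpha * entropy - beta * p_wrong
```

With alpha = beta = 0 this reduces to the unmodified environment reward; larger coefficients penalize the policy for entering states where the blocker is uncertain or likely mistaken.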
[Chart — runs: Pong Baseline (5), Ours 50k (5), Ours 20k (5), HIRL 50k (5), HIRL 20k (5), Perfect Blocker (5)]
Initial: Learned Blocker + Modified Reward (50K)
[Chart — runs: Pong Baseline (5), Ours 50k (5), Ours 20k (0), HIRL 50k (5), HIRL 20k (0), Perfect Blocker (5)]
Initial: Learned Blocker + Modified Reward (20K)
[Chart — runs: Pong Baseline (5), Ours 50k (0), Ours 20k (5), HIRL 50k (0), HIRL 20k (5), Perfect Blocker (5)]