Deep Reinforcement Learning for Natural but Adversarial Behavior

How robust are deep RL agents trained with self-play?

Adversarial examples are a well-known problem in image classification, and deep reinforcement learning policies are similarly vulnerable to adversarial manipulation of their observations. In general, an attacker cannot explicitly modify another agent's observations, but in a shared multi-agent environment it can choose its own actions specifically to create observations for the other agent(s) that look natural yet are adversarial. This is precisely what the Adversarial Policies project by Adam Gleave et al. demonstrates by construction in simulated zero-sum games between two humanoid robots with basic proprioception (e.g. two wrestlers, or a kicker and a goalie, built on MuJoCo environments).
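
To make the setup concrete, here is a rough sketch of the two-player rollout loop this attack relies on. The environment ID, the tuple-based multi-agent interface, and the random placeholder actions below only loosely follow the gym_compete conventions; treat them as illustrative assumptions rather than the project's exact API.

    import gym
    import gym_compete  # registers the two-player MuJoCo environments (if installed)

    # Two humanoid wrestlers in a zero-sum match; the environment ID is illustrative.
    env = gym.make("sumo-humans-v0")
    observations = env.reset()  # one observation per agent

    done = False
    while not done:
        # Agent 0 plays the adversary, agent 1 the frozen victim; random samples
        # stand in for both policies here. The adversary never edits the victim's
        # observation directly -- it only shapes what the victim sees (opponent
        # pose, contacts, ...) through its own actions in the shared simulation.
        actions = tuple(space.sample() for space in env.action_space.spaces)
        observations, rewards, dones, infos = env.step(actions)
        done = any(dones) if isinstance(dones, (list, tuple)) else bool(dones)

The attack surface is therefore just the adversary's own action stream: everything the victim observes remains a physically plausible state of the shared environment.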


Evaluation videos: Sumo Humans, with Adversarial Policy (second row)

Training adversarial policies: 20M Timesteps

Reproducing the baseline training metrics from the paper

To train the adversarial policies referenced in the paper, I set up the W&B TensorFlow integration with sync_tensorboard=True and run the training with:

python -m aprl.train with env_name=multicomp/[ENV_NAME]-v0 paper

where [ENV_NAME] is one of SumoAnts, SumoHumans, KickAndDefend, or YouShallNotPass.

This lets me log and compare the full training curves of the models presented in the paper and easily explore how hyperparameter changes might affect my results. Note that the full 20M timesteps of training may not be done by the time you see this report :)
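
For reference, the logging side of this setup boils down to a single wandb.init call before training starts; the project name and config values below are placeholders rather than the exact ones used for these runs.

    import wandb

    # With sync_tensorboard=True, TensorBoard event files written by the training
    # code are mirrored to the W&B run as regular metrics.
    wandb.init(
        project="adversarial-policies",  # placeholder project name
        sync_tensorboard=True,
        config={
            "env_name": "multicomp/SumoAnts-v0",
            "total_timesteps": 20_000_000,
        },
    )

    # ...then launch training (e.g. the aprl.train command above); its TensorBoard
    # summaries appear on the run without any additional logging calls.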

Plot key metrics, compare interactively

Below, you can see that the adversarial policy converges further and appears more effective in the higher-dimensional, more complex SumoHumans environment (blue) than in the lower-dimensional, simpler SumoAnts environment (orange). The adversarial policy also appears more effective in the goal-blocking scenario (KickAndDefend, red) than in the line-guarding scenario (YouShallNotPass, purple).

Using the tabs and checkboxes, you can toggle each baseline model on and off for easier comparison. For example, you could compare the two Sumo versions on their own, see more detail in the policy entropy curves by turning off SumoAnts (orange), or read the "Fraction of wins" chart in the bottom right more easily by selecting just one baseline.
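
If you prefer to script this comparison rather than use the interactive panels, the W&B public API can pull the same run histories; the entity/project path and the metric key below are placeholders.

    import wandb

    api = wandb.Api()
    runs = api.runs("my-entity/adversarial-policies")  # placeholder entity/project

    for run in runs:
        # run.history returns logged metrics as a pandas DataFrame; the key name
        # here is a placeholder for whatever the training code actually logs
        # (e.g. the win-fraction metric shown in the charts).
        history = run.history(keys=["fraction_wins"])
        if not history.empty:
            print(run.name, float(history["fraction_wins"].iloc[-1]))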


Evaluation videos: Kick and Defend, with Adversarial Policy (second row)

Evaluation videos: YouShallNotPass, Random (left) vs. Adversarial (right)