Adversarial examples are a known problem in image classification. Deep reinforcement learning policies are similarly vulnerable to adversarial manipulation of their observations. In general, an attacker cannot explicitly modify another agent's observations, but in a shared multi-agent environment an attacker might be able to choose actions specifically to induce observations in the other agent(s) that look natural yet are adversarial. This is precisely what the Adversarial Policies project by Adam Gleave et al. demonstrates by construction in simulated zero-sum games between two humanoid robots with basic proprioception (e.g. two wrestlers, or a kicker and a goalie, based on MuJoCo environments).
To train the adversarial policies referenced in the paper, I set up the W&B TensorFlow integration with
sync_tensorboard=True and run the training with
python -m aprl.train with env_name=multicomp/[ENV NAME]-v0 paper
where [ENV NAME] is one of the environments from the paper, e.g. SumoAnts.
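To launch several of these runs in a loop, it can help to build the command string programmatically. The sketch below is a hypothetical helper (not part of the aprl codebase); it just fills in the env_name pattern shown above:

```python
def make_train_cmd(env_short_name, version=0):
    """Assemble the aprl training command for a given multicomp environment.

    Hypothetical convenience helper: the command layout mirrors the one
    shown above; env_short_name is e.g. "SumoAnts" or "KickAndDefend".
    """
    env_name = "multicomp/{}-v{}".format(env_short_name, version)
    return "python -m aprl.train with env_name={} paper".format(env_name)

print(make_train_cmd("SumoAnts"))
# -> python -m aprl.train with env_name=multicomp/SumoAnts-v0 paper
```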
This lets me log and compare the full training curves of the models presented in the paper and easily explore how hyperparameter changes might affect my results. Note that the full 20M timesteps of training may not be done by the time you see this report :)
Below, you can see that the adversarial policy converges more reliably and may be more effective in the higher-dimensional, more complex SumoHumans environment (blue) compared to the lower-dimensional, simpler SumoAnts environment (orange). It also appears that the adversarial policy is more effective in the goal-blocking scenario (KickAndDefend, red) than in the line-guarding scenario (YouShallNotPass, purple).
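The "Fraction of wins" comparison above boils down to tallying episode outcomes for the adversary. A toy sketch (not the project's code; the function name and outcome labels are illustrative) of that computation:

```python
from collections import Counter

def outcome_fractions(outcomes):
    """Fraction of wins/losses/ties for the adversary over a batch of episodes.

    `outcomes` is a list of per-episode labels like "win", "loss", "tie".
    """
    counts = Counter(outcomes)
    n = len(outcomes)
    return {k: counts[k] / n for k in ("win", "loss", "tie")}

print(outcome_fractions(["win", "loss", "win", "tie"]))
# -> {'win': 0.5, 'loss': 0.25, 'tie': 0.25}
```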
Using the tabs and checkboxes, you can toggle each baseline model on and off for easier comparison. For example, you could compare the two Sumo versions alone, see more detail in the policy entropy curves by turning off SumoAnts (orange), and read the "Fraction of wins" chart in the bottom right most easily by selecting just one baseline.