This report presents our latest work *Action Guidance: Getting the Best of Sparse Rewards and Shaped Rewards for Real-time Strategy Games* in an interactive manner.

Co-author: Santiago Ontañón

Quick intro

Training agents using Reinforcement Learning in games with sparse rewards is a challenging problem, since large amounts of exploration are required in order to receive a single reward. To tackle this problem, a common approach is to use reward shaping to help exploration. However, an important drawback of reward shaping is that agents sometimes learn to optimize the shaped reward instead of the true objective.

In this report, we present a novel technique that we call action guidance, which successfully trains agents to eventually optimize the true objective in games with sparse rewards without losing the sample efficiency that comes with reward shaping. We evaluate our approach in a simplified real-time strategy (RTS) game simulator called μRTS.

Rough idea

It works by constructing a main agent that learns only from the sparse reward function $R_{\mathcal{M}}$ and some auxiliary agents that learn from the shaped reward functions $R_{\mathcal{A}_1}, R_{\mathcal{A}_2}, \dots, R_{\mathcal{A}_n}$. During the initial stage of training, the main agent has a high probability of taking action guidance from the auxiliary agents; that is, the main agent will likely execute actions sampled from the auxiliary agents.

The main agent and the auxiliary agents are updated via off-policy policy gradient. As training goes on, the main agent becomes more independent and executes more actions sampled from its own policy. The auxiliary agents learn from shaped rewards and therefore make training sample-efficient, while the main agent learns from the sparse reward, ensuring that the agent will eventually optimize the true objective. Action guidance thus gets the best of both worlds.
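
To make this concrete, below is a minimal Python sketch of the two pieces this paragraph describes: choosing the behavior policy and letting each agent score the shared trajectory with its own reward function. The types, function names, and the simple (state, action, events) trajectory format are illustrative assumptions, not the paper's implementation.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Illustrative types: a trajectory step stores the state, the action taken,
# and a dict of game events from which each agent can compute its own reward.
Step = Tuple[object, int, dict]
RewardFn = Callable[[dict], float]


def select_behavior_policy(main_policy, aux_policies: Sequence, epsilon: float):
    """With probability epsilon, take action guidance from a random auxiliary
    agent; otherwise act from the main agent's own policy."""
    if random.random() < epsilon:
        return random.choice(list(aux_policies))
    return main_policy


def relabel_rewards(trajectory: List[Step], reward_fn: RewardFn) -> List[float]:
    """Each agent reads the same trajectory but scores it with its own reward
    function: sparse for the main agent, shaped for the auxiliary agents."""
    return [reward_fn(events) for _, _, events in trajectory]
```

Each agent would then run the clipped off-policy update described in the theory section below on the trajectory it just re-labeled.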

Demo: learn to produce combat units (sparse rewards)

  • Naturally, the sparse reward is +1 for producing combat units (blue or yellow circles in the videos below)
  • The task is difficult because
    • The action space is large
    • The agent needs to learn to harvest resources, build a barracks, and finally produce combat units
  • To help with the sparse reward, we provide a shaped reward: +1 for harvesting resources or building a barracks, and +7 for producing combat units (a toy sketch of both reward functions follows this list)
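
As a toy illustration of the two reward signals above, here is a sketch that scores a per-step `events` dictionary; the dictionary keys are hypothetical and not part of the μRTS API.

```python
def sparse_reward(events: dict) -> float:
    """+1 only for producing a combat unit (the true objective)."""
    return 1.0 * events.get("combat_units_produced", 0)


def shaped_reward(events: dict) -> float:
    """+1 for harvesting a resource or building a barracks,
    +7 for producing a combat unit."""
    return (1.0 * events.get("resources_harvested", 0)
            + 1.0 * events.get("barracks_built", 0)
            + 7.0 * events.get("combat_units_produced", 0))
```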

Below are selected videos of agents trained with the shaped reward, action guidance, and the sparse reward, along with the corresponding charts of the number of combat units produced per episode over the course of training.

Feel free to click on the title of an individual video to check out every metric of that experiment.




[Panel: run set of 3 (videos and combat-unit production charts)]


Theory on Action Guidance

Specifically, let us define $\mathcal{M}$ as the MDP that the main agent learns from and $\mathcal{A}=\{\mathcal{A}_1, \mathcal{A}_2, \dots, \mathcal{A}_k\}$ as a set of auxiliary MDPs that the auxiliary agents learn from. In our constructions, $\mathcal{M}$ and $\mathcal{A}$ share the same state, observation, and action space. However, the reward function for $\mathcal{M}$ is $R_{\mathcal{M}}$, which is the sparse reward function, and the reward functions for $\mathcal{A}$ are $R_{\mathcal{A}_1}, \dots, R_{\mathcal{A}_k}$, which are the shaped reward functions. For each of these MDPs $\mathcal{E} \in \mathcal{S}=\{\mathcal{M}\} \cup \mathcal{A}$, let us initialize a policy $\pi_{\theta_\mathcal{E}}$ parameterized by parameters $\theta_\mathcal{E}$, respectively. Furthermore, let us use $\pi_\mathcal{S} = \{\pi_{\theta_\mathcal{E}} \mid \mathcal{E} \in \mathcal{S}\}$ to denote the set of these initialized policies.

At each timestep $t$, let us use some exploration strategy $S$ that selects a policy $\pi_b \in \pi_\mathcal{S}$ to sample an action $a_t$ given $s_t$. At the end of the episode, each policy $\pi_{\theta_\mathcal{E}} \in \pi_\mathcal{S}$ can be updated via its off-policy policy gradient (Degris, White, and Sutton 2012; Levine et al. 2020):

$$\mathbb{E}_{\tau\sim\pi_{\theta_b}}\left[\left(\prod_{t=0}^{T-1} \frac{\pi_{\theta_\mathcal{E}}\left(a_t \mid s_t\right)}{\pi_{\theta_b}\left(a_t \mid s_t\right)}\right) \sum_{t=0}^{T-1} \nabla_{\theta} \log \pi_{\theta_\mathcal{E}}\left(a_t \mid s_t\right) A(\tau, V, t) \right]$$

When $\pi_{\theta_\mathcal{E}}=\pi_{\theta_b}$, the gradient above reduces to the on-policy policy gradient update for $\pi_{\theta_\mathcal{E}}$; otherwise, it is an off-policy policy gradient update for $\pi_{\theta_\mathcal{E}}$.
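
As a concrete reading of this estimator, the following PyTorch-style sketch builds a surrogate loss whose gradient matches the expression above for a single trajectory; the tensor names and shapes are our own assumptions.

```python
import torch


def off_policy_pg_loss(logp_target: torch.Tensor,
                       logp_behavior: torch.Tensor,
                       advantages: torch.Tensor) -> torch.Tensor:
    """Unclipped off-policy policy-gradient surrogate for one trajectory.

    logp_target:   log pi_{theta_E}(a_t | s_t), shape (T,), requires grad
    logp_behavior: log pi_{theta_b}(a_t | s_t), shape (T,), detached
    advantages:    A(tau, V, t),                shape (T,), detached
    """
    # Product over t of the importance sampling ratios; detached so that it
    # acts as a constant weight, matching the gradient expression above.
    is_weight = torch.exp((logp_target - logp_behavior).sum()).detach()
    # Negate so that minimizing this loss ascends the policy objective.
    return -is_weight * (logp_target * advantages).sum()
```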

Practical Objective and Exploration Strategy

The gradient in the equation above is unbiased, but its product of importance sampling ratios $\left(\prod_{t=0}^{T-1} \frac{\pi_{\theta}\left(a_t \mid s_t\right)}{\pi_{\theta_b}\left(a_t \mid s_t\right)}\right)$ is known to cause high variance (Wang et al. 2016). In practice, we clip the gradient the same way as Proximal Policy Optimization (PPO) (Schulman et al. 2017):

$$\mathbb{E}_{\tau\sim\pi_{\theta_b}}\left[\sum_{t=0}^{T-1} \nabla_{\theta} \min\left(\rho_t(\theta)\, A(\tau, V, t),\ \operatorname{clip}\left(\rho_t(\theta), \varepsilon\right) A(\tau, V, t)\right)\right]$$

$$\rho_t(\theta) = \frac{\pi_{\theta}\left(a_t \mid s_t\right)}{\pi_{\theta_b}\left(a_t \mid s_t\right)}, \qquad \operatorname{clip}\left(\rho_t(\theta), \varepsilon\right) = \begin{cases} 1-\varepsilon & \text{if } \rho_t(\theta) < 1-\varepsilon\\ 1+\varepsilon & \text{if } \rho_t(\theta) > 1+\varepsilon\\ \rho_t(\theta) & \text{otherwise} \end{cases}$$
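
Here is a minimal PyTorch-style sketch of that clipped surrogate, assuming per-step log-probabilities under the target and behavior policies; the tensor names and the default $\varepsilon$ are illustrative.

```python
import torch


def clipped_pg_loss(logp_target: torch.Tensor,
                    logp_behavior: torch.Tensor,
                    advantages: torch.Tensor,
                    eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped surrogate that replaces the full importance-sampling
    product with per-step ratios rho_t clipped to [1 - eps, 1 + eps]."""
    rho = torch.exp(logp_target - logp_behavior)               # rho_t(theta)
    unclipped = rho * advantages
    clipped = torch.clamp(rho, 1.0 - eps, 1.0 + eps) * advantages
    # Elementwise minimum, summed over the trajectory, negated so that
    # gradient descent maximizes the clipped objective.
    return -torch.min(unclipped, clipped).sum()
```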

In addition, we use $\epsilon$-greedy as the exploration strategy $S$ for determining the behavior policy. That is, at each timestep $t$, the behavior policy is selected to be $\pi_{\theta_\mathcal{M}}$ with probability $1-\epsilon$ and $\pi_{\theta_\mathcal{D}}$ for some $\mathcal{D}\in \mathcal{A}$ with probability $\epsilon$. Additionally, $\epsilon$ is held constant at 95% for some number of time steps (e.g., 800,000), which we refer to as the shift period (the time it takes to start shifting focus away from the auxiliary agents); it then decays linearly to $\epsilon_{end}$ over some number of time steps (e.g., 1,000,000), which we refer to as the adaptation period (the time it takes for the main agent to fully adapt and become more independent).
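
A small sketch of this schedule is shown below; the defaults mirror the example numbers above, and the final value $\epsilon_{end}$ (shown here as 0) is a hyperparameter the text leaves open.

```python
def guidance_epsilon(step: int,
                     shift_steps: int = 800_000,
                     adaptation_steps: int = 1_000_000,
                     eps_start: float = 0.95,
                     eps_end: float = 0.0) -> float:
    """Probability of taking action guidance from an auxiliary agent.

    Held constant at eps_start during the shift period, then decayed
    linearly to eps_end over the adaptation period, and kept at eps_end
    afterwards.
    """
    if step < shift_steps:
        return eps_start
    progress = min(1.0, (step - shift_steps) / adaptation_steps)
    return eps_start + progress * (eps_end - eps_start)
```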

Positive Learning Optimization (PLO)

During our initial experiments, we found that the main agent sometimes did not learn useful policies. Our hypothesis is that the main agent was updated with too many trajectories containing zero reward. Doing a large number of updates on these zero-reward trajectories causes the policy to converge prematurely, which manifests as low entropy in the action probability distribution.

To mitigate this issue, we use a simple code-level optimization called Positive Learning Optimization (PLO). It works by skipping the gradient update for $\pi_{\theta_\mathcal{E}} \in \pi_\mathcal{S}$ if the current trajectory contains no reward according to $R_{\mathcal{E}}$. Intuitively, PLO makes sure that the main agent learns from meaningful experience that is associated with positive rewards.
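
Below is a sketch of the PLO gate, assuming each agent has already re-labeled the shared trajectory with its own reward function $R_{\mathcal{E}}$; reading "contains no reward" as all rewards being zero is our interpretation.

```python
def plo_allows_update(rewards) -> bool:
    """Positive Learning Optimization: allow the gradient update for an agent
    only if its own reward function assigned a nonzero reward somewhere in
    the current trajectory; otherwise the update is skipped."""
    return any(r != 0.0 for r in rewards)


# Example: a main-agent trajectory with no sparse reward is skipped.
assert plo_allows_update([0.0, 0.0, 0.0]) is False
assert plo_allows_update([0.0, 0.0, 1.0]) is True
```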

As shown in the following figure, action guidance without PLO (pink line), which failed to learn the task, eventually reaches lower entropy at around the 500,000 time-step mark compared to action guidance with PLO (grey line).




[Panel: run set of 17 (action probability entropy curves)]


Evaluation Environment




[Panels: evaluation charts over run sets of 40, 40, 3, 3, 20, and 40 runs]


Conclusion

In this report, we present a novel technique called *action guidance* that successfully trains the agent to eventually optimize over sparse rewards without losing the sample efficiency that comes with reward shaping, effectively getting the best of both worlds. Our experiments with DefeatRandomEnemy in particular show it is possible to train a main agent on the full game of μRTS using only the match outcome reward, which suggests action guidance could serve as a promising alternative to the training paradigm of AlphaStar, which uses supervised learning on human replay data to bootstrap an agent.

As part of our future work, we would like to scale up the approach to defeat stronger opponents.

All Learning Curves




[Panels: learning curves for run sets of 90 runs]