
Action Guidance: Getting the Best of Sparse Rewards and Shaped Rewards for Real-Time Strategy Games

This is a report illustrating our latest work *Action Guidance: Getting the Best of Sparse Rewards and Shaped Rewards for Real-time Strategy Games* in an interactive manner.
Co-author: Santiago Ontañón

Introduction

Training agents using Reinforcement Learning in games with sparse rewards is a challenging problem since large amounts of exploration are required in order to receive a single reward. To tackle this problem, a common approach is to use reward shaping to help exploration. However, an important drawback of reward shaping is that agents sometimes learn to optimize the shaped reward instead of the true objective.
In this report, we present a novel technique that we call action guidance that successfully trains agents to eventually optimize the true objective in games with sparse rewards yet does not lose the sample efficiency that comes with reward shaping. We evaluate our approach in a simplified real-time strategy (RTS) game simulator called $\mu$RTS.

Rough Idea

It works by constructing a main agent that only learns from the sparse reward function $R_{\mathcal{M}}$ and some auxiliary agents that learn from the shaped reward functions $R_{\mathcal{A}_1}, R_{\mathcal{A}_2}, \dots, R_{\mathcal{A}_n}$. During the initial stage of training, the main agent has a high probability of taking action guidance from the auxiliary agents, that is, the main agent will likely execute actions sampled from the auxiliary agents.
The main agent and auxiliary agents are updated via the off-policy policy gradient. As training goes on, the main agent becomes more independent and executes more actions sampled from its own policy. The auxiliary agents learn from shaped rewards and therefore make training sample-efficient, while the main agent learns from the sparse rewards and therefore ensures that the agents will eventually optimize over the true objective. Action guidance, therefore, gets the best of both worlds.
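To make the idea concrete, here is a minimal, hypothetical sketch of a rollout under action guidance; names such as `main_agent`, `aux_agents`, `sample_action`, and the `shaped_rewards` info key are our assumptions, not the released implementation:

```python
import random

def action_guidance_rollout(env, main_agent, aux_agents, guidance_prob):
    """Collect one episode. With probability `guidance_prob`, each action is
    sampled from a (randomly chosen) auxiliary agent; otherwise it comes from
    the main agent. All agents are later updated from the same trajectory via
    off-policy policy gradients."""
    obs, done, trajectory = env.reset(), False, []
    while not done:
        if random.random() < guidance_prob:
            behavior = random.choice(aux_agents)   # auxiliary agent provides action guidance
        else:
            behavior = main_agent                  # main agent acts on its own
        action = behavior.sample_action(obs)
        next_obs, sparse_reward, done, info = env.step(action)
        # store the transition once; each agent later relabels it with its own reward
        trajectory.append((obs, action, sparse_reward, info.get("shaped_rewards")))
        obs = next_obs
    return trajectory
```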

Demo: Learn to Produce Combat Units (Sparse Rewards)

  • Naturally, the sparse reward is +1 for each combat unit produced (blue or yellow circles in the videos below)
  • The task is difficult because
    • the action space is large, and
    • the agent needs to learn to harvest resources, build a barracks, and finally produce combat units.
  • To help with the sparse reward, we shape the reward, providing +1 for building a barracks or harvesting resources and +7 for producing a combat unit
Below are selected videos of agents trained with shaped reward, action guidance, and sparse reward, along with the corresponding charts of the number of combat units produced per episode over the course of training.
Feel free to click on the title of each individual video to check out every metric of that experiment.



[Interactive W&B panel: run set of 3 runs]


Theory on Action Guidance

Specifically, let us define $\mathcal{M}$ as the MDP that the main agent learns from and $\mathcal{A}=\{\mathcal{A}_1, \mathcal{A}_2, ..., \mathcal{A}_k\}$ as a set of auxiliary MDPs that the auxiliary agents learn from. In our constructions, $\mathcal{M}$ and $\mathcal{A}$ share the same state, observation, and action space. However, the reward function for $\mathcal{M}$ is $R_{\mathcal{M}}$, which is the sparse reward function, and the reward functions for $\mathcal{A}$ are $R_{\mathcal{A}_1}, ..., R_{\mathcal{A}_k}$, which are the shaped reward functions. For each of these MDPs $\mathcal{E} \in \mathcal{S}=\{\mathcal{M}\} \cup \mathcal{A}$ above, let us initialize a policy $\pi_{\theta_\mathcal{E}}$ parameterized by parameters $\theta_\mathcal{E}$, respectively. Furthermore, let us use $\pi_\mathcal{S} = \{\pi_{\theta_\mathcal{E}} \mid \mathcal{E} \in \mathcal{S}\}$ to denote the set of these initialized policies.
At each timestep $t$, let us use some exploration strategy $S$ that selects a policy $\pi_b \in \pi_\mathcal{S}$ to sample an action $a_t$ given $s_t$. At the end of the episode, each policy $\pi_{\theta_\mathcal{E}} \in \pi_\mathcal{S}$ can be updated via its off-policy policy gradient (Degris, White, and Sutton 2012; Levine et al. 2020):
$$\mathbb{E}_{\tau\sim\pi_{\theta_b}}\left[\left(\prod_{t=0}^{T-1} \frac{\pi_{\theta_\mathcal{E}}\left(a_t | s_t\right)}{\pi_{\theta_b}\left(a_t | s_t\right)}\right) \sum_{t=0}^{T-1} \nabla_{\theta} \log \pi_{\theta_\mathcal{E}}\left(a_t | s_t\right) A(\tau, V, t) \right]$$

When $\pi_{\theta_\mathcal{E}}=\pi_{\theta_b}$, the gradient in the equation above is the on-policy policy gradient update for $\pi_{\theta_\mathcal{E}}$. Otherwise, it is an off-policy policy gradient update for $\pi_{\theta_\mathcal{E}}$.
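To make the update concrete, here is a minimal PyTorch-style sketch of a surrogate loss whose automatic-differentiation gradient matches the expression above; the variable names and the assumption that per-step log-probabilities and advantage estimates are already available are ours, not the paper's implementation:

```python
import torch

def off_policy_pg_surrogate(logp_e, logp_b, advantages):
    """Loss whose gradient matches the off-policy policy gradient above.

    logp_e:     log pi_{theta_E}(a_t|s_t) for the agent being updated (requires grad)
    logp_b:     log pi_{theta_b}(a_t|s_t) under the behavior policy (constant)
    advantages: A(tau, V, t), one value per timestep
    """
    # full-trajectory importance weight, treated as a constant w.r.t. theta_E
    rho = torch.exp((logp_e.detach() - logp_b).sum())
    # negative sign: minimizing this surrogate ascends the policy gradient
    return -rho * (logp_e * advantages).sum()
```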

Practical Objective and Exploration Strategy

The gradient in the equation above is unbiased, but its product of importance sampling ratios $\left(\prod_{t=0}^{T-1} \frac{\pi_{\theta}\left(a_t | s_t\right)}{\pi_{\theta_b}\left(a_t | s_t\right)}\right)$ is known to cause high variance (Wang et al. 2016). In practice, we clip the gradient in the same way as Proximal Policy Optimization (PPO) (Schulman et al. 2017):
$$\mathbb{E}_{\tau\sim\pi_{\theta_b}}\left[\sum_{t=0}^{T-1}\left[\nabla_{\theta} \min \left( \rho_t(\theta) A(\tau, V, t), \operatorname{clip}\left(\rho_t(\theta), \varepsilon \right) A(\tau, V, t) \right) \right] \right]$$

$$\rho_t(\theta) = \frac{\pi_{\theta}\left(a_t | s_t\right)}{\pi_{\theta_b}\left(a_t | s_t\right)}, \quad \operatorname{clip}\left(\rho_t(\theta), \varepsilon\right) = \begin{cases} 1-\varepsilon &\text{if } \rho_t(\theta) < 1-\varepsilon\\ 1+\varepsilon &\text{if } \rho_t(\theta) > 1+\varepsilon \\ \rho_t(\theta) &\text{otherwise} \end{cases}$$
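A hedged sketch of the corresponding clipped surrogate, again in PyTorch style; the clip coefficient value and variable names are illustrative assumptions rather than the paper's exact implementation:

```python
import torch

def clipped_pg_surrogate(logp_e, logp_b, advantages, eps=0.2):
    """PPO-style clipped surrogate with per-timestep ratios against the
    behavior policy, replacing the full importance-weight product."""
    ratio = torch.exp(logp_e - logp_b)                       # rho_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).sum()
```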

In addition, we use $\epsilon$-greedy as the exploration strategy $S$ for determining the behavior policy. That is, at each timestep $t$, the behavior policy is selected to be $\pi_{\theta_\mathcal{M}}$ with probability $1-\epsilon$ and $\pi_{\theta_\mathcal{D}}$ for some $\mathcal{D}\in \mathcal{A}$ with probability $\epsilon$. Additionally, $\epsilon$ is set to a constant $95\%$ at the start for some number of time steps (e.g. 800,000), which we refer to as the shift period (the time it takes to start shifting focus away from the auxiliary agents); it then decays linearly to $\epsilon_{end}$ over some number of time steps (e.g. 1,000,000), which we refer to as the adaptation period (the time it takes for the main agent to fully adapt and become more independent).
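As an illustration only, a small helper computing this schedule; the function name is ours and the defaults mirror the example numbers above, not the released code:

```python
def guidance_epsilon(step, shift=800_000, adaptation=1_000_000,
                     eps_start=0.95, eps_end=0.0):
    """Probability of choosing an auxiliary agent as the behavior policy.

    Held at eps_start during the shift period, then decayed linearly to
    eps_end over the adaptation period."""
    if step < shift:
        return eps_start
    if step < shift + adaptation:
        frac = (step - shift) / adaptation
        return eps_start + frac * (eps_end - eps_start)
    return eps_end
```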

Positive Learning Optimization (PLO)

During our initial experiments, we found that the main agent sometimes did not learn useful policies. Our hypothesis is that this was because the main agent was updated with too many trajectories containing zero reward. Performing a large number of updates on these zero-reward trajectories causes the policy to converge prematurely, which manifests as low entropy in the action probability distribution.
To mitigate this issue, we use a simple code-level optimization called Positive Learning Optimization (PLO). It works by skipping the gradient update for $\pi_{\theta_\mathcal{E}} \in \pi_\mathcal{S}$ if the current trajectory contains no reward according to $R_{\mathcal{E}}$. Intuitively, PLO makes sure that the main agent learns from meaningful experience that is associated with positive rewards.
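A minimal sketch of how PLO could be implemented, assuming the per-trajectory rewards under $R_{\mathcal{E}}$ are available; the names below are ours:

```python
def plo_keep(rewards_under_R_E):
    """Positive Learning Optimization (sketch): perform a gradient update on a
    trajectory only if it contains at least one non-zero reward under the
    agent's own reward function R_E; otherwise skip the update."""
    return any(r != 0 for r in rewards_under_R_E)

# hypothetical usage inside the update loop:
# if plo_keep(rewards_under_R_E):
#     agent.update(trajectory)
```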
As shown in the following figure, action guidance without PLO (pink line), which failed to learn the task, eventually reaches lower entropy at around 500,000 time steps than action guidance with PLO (grey line).



[Interactive W&B panel: run set of 17 runs]


Evaluation Environment

We use $\mu$RTS (https://github.com/santiontanon/microrts) as our testbed, a minimalistic RTS game that maintains the core features that make RTS games challenging from an AI point of view: simultaneous and durative actions, large branching factors, and real-time decision making. To interface with $\mu$RTS, we use gym-microrts (https://github.com/vwxyzjn/gym-microrts) to conduct our experiments.

Observation Space

Given a map of size $h\times w$, the observation is a tensor of shape $(h, w, n_f)$, where $n_f$ is the number of feature planes, which have binary values. The observation space used in this paper uses 27 feature planes, as shown in the following table. A feature plane can be thought of as a concatenation of multiple one-hot encoded features. As an example, if there is a worker with hit points equal to 1, not carrying any resources, owned by Player 1, and currently not executing any action, then the one-hot encoded features will look like the following:
$$[0,1,0,0,0],\ [1,0,0,0,0],\ [1,0,0],\ [0,0,0,0,1,0,0,0],\ [1,0,0,0,0,0]$$

The 27 values of the feature planes at that worker's position in the map will thus be:
$$[0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0]$$
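For illustration, a small NumPy sketch that reproduces the encoding above; the plane sizes match the 27 features described, but the index conventions (e.g. which unit-type index means "worker") are our assumptions:

```python
import numpy as np

# Assumed plane sizes summing to the 27 features described above:
# hit points (5), resources carried (5), owner (3), unit type (8), current action (6)
PLANE_SIZES = [5, 5, 3, 8, 6]

def encode_cell(hit_points, resources, owner, unit_type, action):
    """One-hot encode a single map cell into its 27 binary feature values."""
    indices = [hit_points, resources, owner, unit_type, action]
    planes = [np.eye(n, dtype=np.int8)[i] for n, i in zip(PLANE_SIZES, indices)]
    return np.concatenate(planes)

# Worker from the example: hp=1, no resources, owned by Player 1, idle
# (owner index 0 and unit-type index 4 for "worker" are illustrative choices)
print(encode_cell(hit_points=1, resources=0, owner=0, unit_type=4, action=0))
```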


Action Space

Given a map of size $h\times w$, the action is an 8-dimensional vector of discrete values, as specified in the following table. The first component of the action vector represents the unit in the map to issue an action to, the second is the action type, and the remaining components represent the different parameters that different action types can take.
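As a sketch, the action space can be expressed as a `gym` `MultiDiscrete` space; the component sizes below are illustrative rather than copied from the gym-microrts source:

```python
from gym import spaces

h, w = 16, 16  # example map size
# 8 discrete components: [source unit, action type, move dir, harvest dir,
# return dir, produce dir, produce unit type, attack target].
action_space = spaces.MultiDiscrete([h * w, 6, 4, 4, 4, 4, 7, h * w])

print(action_space.sample())  # e.g. [ 37   4   2   0   1   3   5 120]
```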





[Interactive W&B panel: run set of 40 runs]


Task Description

Here is a list of tasks in which we examine the performance of action guidance.
  • LearnToAttack:
    • $R_{\mathcal{M}}$: The agent will get a +1 reward for each valid attack action it issues. This is a sparse reward because it requires the agent to learn to attack the enemy base and one stationary enemy worker at the other side of the map, which requires a long sequence of actions with no intermediate rewards.
    • $R_{\mathcal{A}_1}$: In addition to $R_{\mathcal{M}}$, the agent will get as shaped reward the difference between the previous and current Euclidean distance between the enemy base and the agent's closest unit.
  • ProduceCombatUnits:
    • $R_{\mathcal{M}}$: The agent will get a +1 reward for each combat unit produced. This is a more challenging task because the agent needs to learn to 1) harvest resources, 2) build a barracks, 3) produce combat units once enough resources are gathered, and 4) move produced combat units out of the way so that they do not block the production of new combat units.
    • $R_{\mathcal{A}_1}$: The agent will get +1 for constructing every building (e.g. a barracks), +1 for harvesting resources, and +7 for each combat unit it produces.
  • DefeatRandomEnemy:
    • $R_{\mathcal{M}}$: The agent will get the match outcome as the reward (-1 on a loss, 0 on a draw, and +1 on a win). This is the most difficult task we examined because the agent is subject to the full complexity of the game, being required to make both macro-decisions (e.g. deciding the high-level strategies to win the game) and micro-decisions (e.g. deciding which enemy units to attack).
    • $R_{\mathcal{A}_1}$: The agent will get +5 for winning, +1 for harvesting one resource, +1 for producing one worker, +0.2 for constructing every building, +1 for each valid attack action it issues, +7 for each combat unit it produces, and $+0.2 d$, where $d$ is the difference between the previous and current Euclidean distance between the enemy base and the agent's closest unit.
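One hedged way to view these shaped reward functions is as weight vectors over per-step event counts; the event names below are ours, while the weights mirror the descriptions above:

```python
# Shaped reward R_A1 for each task as weights over per-step event counts
# (event names are ours; values mirror the task descriptions above).
SHAPED_REWARD_WEIGHTS = {
    "LearnToAttack":      {"attack": 1.0, "closer_to_enemy_base": 1.0},
    "ProduceCombatUnits": {"build_building": 1.0, "harvest": 1.0,
                           "produce_combat_unit": 7.0},
    "DefeatRandomEnemy":  {"win": 5.0, "harvest": 1.0, "produce_worker": 1.0,
                           "build_building": 0.2, "attack": 1.0,
                           "produce_combat_unit": 7.0,
                           "closer_to_enemy_base": 0.2},
}

def shaped_reward(task, event_counts):
    """Dot product of per-step event counts with the task's reward weights."""
    weights = SHAPED_REWARD_WEIGHTS[task]
    return sum(weights[k] * event_counts.get(k, 0) for k in weights)
```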

Agent Setup

We used PPO (Schulman et al. 2017) as the base DRL algorithm to incorporate action guidance. We compared the following strategies:
  • Shaped reward: the agent is trained with PPO on $R_{\mathcal{A}_1}$ for each task.
  • Sparse reward: the agent is trained with PPO on $R_{\mathcal{M}}$ for each task.
  • Action guidance - long adaptation: the agent is trained with PPO + action guidance with $\mathit{shift}=2{,}000{,}000$ time steps, $\mathit{adaptation}=7{,}000{,}000$ time steps, and $\epsilon_{end}=0.0$.
  • Action guidance - short adaptation: the agent is trained with PPO + action guidance with $\mathit{shift}=800{,}000$ time steps, $\mathit{adaptation}=1{,}000{,}000$ time steps, and $\epsilon_{end}=0.0$.
  • Action guidance - mixed policy: the agent is trained with PPO + action guidance with $\mathit{shift}=2{,}000{,}000$ time steps, $\mathit{adaptation}=2{,}000{,}000$ time steps, and $\epsilon_{end}=0.5$. We call this agent mixed policy because it will eventually have a 50% chance to sample actions from the main agent and a 50% chance to sample actions from the auxiliary agent; effectively, a mixed agent makes decisions jointly.
Lastly, we also toggle the PLO option for the action guidance - long adaptation, action guidance - short adaptation, action guidance - mixed policy, and sparse reward training strategies for a preliminary ablation study.
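For reference, the three action guidance variants differ only in their $\epsilon$ schedule; summarized here as a small, hypothetical config dict (field names are ours):

```python
# Hyperparameters of the action guidance variants compared above
# (time steps; eps_end is the final probability of sampling auxiliary agents).
AG_VARIANTS = {
    "long adaptation":  dict(shift=2_000_000, adaptation=7_000_000, eps_end=0.0),
    "short adaptation": dict(shift=800_000,   adaptation=1_000_000, eps_end=0.0),
    "mixed policy":     dict(shift=2_000_000, adaptation=2_000_000, eps_end=0.5),
}
```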

Experimental Results





[Interactive W&B panel: run set of 40 runs]


Action Guidance Optimizes the Sparse Reward

This is perhaps the most important contribution of our paper. Action guidance optimizes the main agent over the true objective, rather than optimizing shaped rewards. Specifically:
  • In the ProduceCombatUnits task, as shown at the beginning of this post, the agent trained with shaped reward would only start producing combat units once all the resources had been harvested. In contrast, the agents trained with action guidance - short adaptation would harvest resources and produce combat units concurrently. Even though they gain similar shaped rewards, the latter behavior matches the common pattern observed in professional RTS game players and is obviously more desirable because, should the enemy attack early, the agents will have enough combat units to defend.
  • In the DefeatRandomEnemy task, we find the performance of action guidance to be far superior. Below is a typical example: the agents trained with shaped rewards learn a variety of behaviors due to the brittleness of reward shaping; some learn to do a worker rush while others learn to focus heavily on harvesting resources and producing units. The agents trained with action guidance - long adaptation, however, almost always learn to do a worker rush, which is an efficient way to win against a random enemy.



[Interactive W&B panels: run sets of 3, 3, 20, and 40 runs]


Conclusion

In this paper, we present a novel technique called *action guidance* that successfully trains the agent to eventually optimize over sparse rewards yet does not lose the sample efficiency that comes with reward shaping, effectively getting the best of both worlds. Our experiments with DefeatRandomEnemy in particular show that it is possible to train a main agent on the full game of $\mu$RTS using only the match outcome reward, which suggests action guidance could serve as a promising alternative to the training paradigm of AlphaStar, which uses supervised learning with human replay data to bootstrap an agent.
As part of our future work, we would like to scale up the approach to defeat stronger opponents.

All Learning Curves




[Interactive W&B panels: three run sets of 90 runs each]