
Action Guidance: Getting the Best of Sparse Rewards and Shaped Rewards for Real-Time Strategy Games

This is a report illustrating our latest work *Action Guidance: Getting the Best of Sparse Rewards and Shaped Rewards for Real-time Strategy Games* in an interactive manner.
Co-author: Santiago Ontañón

Introduction

Training agents using Reinforcement Learning in games with sparse rewards is a challenging problem since large amounts of exploration are required in order to receive a single reward. To tackle this problem, a common approach is to use reward shaping to help exploration. However, an important drawback of reward shaping is that agents sometimes learn to optimize the shaped reward instead of the true objective.
In this report, we present a novel technique that we call action guidance that successfully trains agents to eventually optimize the true objective in games with sparse rewards yet does not lose the sample efficiency that comes with reward shaping. We evaluate our approach in a simplified real-time strategy (RTS) game simulator called $\mu$RTS.

Rough Idea

It works by constructing a main agent that only learns from the sparse reward function $R_{\mathcal{M}}$ and some auxiliary agents that learn from the shaped reward functions $R_{\mathcal{A}_1}, R_{\mathcal{A}_2}, \dots, R_{\mathcal{A}_n}$. During the initial stage of training, the main agent has a high probability of taking action guidance from the auxiliary agents, that is, the main agent will likely execute actions sampled from the auxiliary agents.
The main agent and auxiliary agents are updated via the off-policy policy gradient. As training goes on, the main agent becomes more independent and executes more actions sampled from its own policy. The auxiliary agents learn from shaped rewards and therefore make training sample-efficient, while the main agent learns from the sparse rewards and therefore ensures that the agents will eventually optimize over the true objective. Action guidance, therefore, gets the best of both worlds.
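To make the idea concrete, here is a minimal, hypothetical sketch of a rollout under action guidance; names such as `main_agent`, `aux_agents`, `sample_action`, and the `shaped_rewards` info key are our assumptions, not the released implementation:

```python
import random

def action_guidance_rollout(env, main_agent, aux_agents, guidance_prob):
    """Collect one episode. With probability `guidance_prob`, each action is
    sampled from a (randomly chosen) auxiliary agent; otherwise it comes from
    the main agent. All agents are later updated from the same trajectory via
    off-policy policy gradients."""
    obs, done, trajectory = env.reset(), False, []
    while not done:
        if random.random() < guidance_prob:
            behavior = random.choice(aux_agents)   # auxiliary agent provides action guidance
        else:
            behavior = main_agent                  # main agent acts on its own
        action = behavior.sample_action(obs)
        next_obs, sparse_reward, done, info = env.step(action)
        # store the transition once; each agent later relabels it with its own reward
        trajectory.append((obs, action, sparse_reward, info.get("shaped_rewards")))
        obs = next_obs
    return trajectory
```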

Demo: Learn to Produce Combat Units (Sparse Rewards)

  • Naturally, the sparse reward is +1 for each combat unit produced (blue or yellow circles in the videos below)
  • The task is difficult because
    • the action space is large, and
    • the agent needs to learn to harvest resources, build a barracks, and finally produce combat units.
  • To help with the sparse reward, we shape the reward, providing +1 for building a barracks or harvesting resources and +7 for producing a combat unit
Below are selected videos of agents trained with shaped reward, action guidance, and sparse reward, along with the corresponding charts of the number of combat units produced per episode over the course of training.
Feel free to click on the title of each individual video to check out every metric of that experiment.



[Interactive W&B panel: run set of 3 runs]


Theory on Action Guidance

Specifically, let us define $\mathcal{M}$ as the MDP that the main agent learns from and $\mathcal{A}=\{\mathcal{A}_1, \mathcal{A}_2, ..., \mathcal{A}_k\}$ as a set of auxiliary MDPs that the auxiliary agents learn from. In our constructions, $\mathcal{M}$ and $\mathcal{A}$ share the same state, observation, and action space. However, the reward function for $\mathcal{M}$ is $R_{\mathcal{M}}$, which is the sparse reward function, and the reward functions for $\mathcal{A}$ are $R_{\mathcal{A}_1}, ..., R_{\mathcal{A}_k}$, which are the shaped reward functions. For each of these MDPs $\mathcal{E} \in \mathcal{S}=\{\mathcal{M}\} \cup \mathcal{A}$ above, let us initialize a policy $\pi_{\theta_\mathcal{E}}$ parameterized by parameters $\theta_\mathcal{E}$, respectively. Furthermore, let us use $\pi_\mathcal{S} = \{\pi_{\theta_\mathcal{E}} \mid \mathcal{E} \in \mathcal{S}\}$ to denote the set of these initialized policies.
At each timestep $t$, let us use some exploration strategy $S$ that selects a policy $\pi_b \in \pi_\mathcal{S}$ to sample an action $a_t$ given $s_t$. At the end of the episode, each policy $\pi_{\theta_\mathcal{E}} \in \pi_\mathcal{S}$ can be updated via its off-policy policy gradient (Degris, White, and Sutton 2012; Levine et al. 2020):
$$\mathbb{E}_{\tau\sim\pi_{\theta_b}}\left[\left(\prod_{t=0}^{T-1} \frac{\pi_{\theta_\mathcal{E}}\left(a_t | s_t\right)}{\pi_{\theta_b}\left(a_t | s_t\right)}\right) \sum_{t=0}^{T-1} \nabla_{\theta} \log \pi_{\theta_\mathcal{E}}\left(a_t | s_t\right) A(\tau, V, t) \right]$$

When $\pi_{\theta_\mathcal{E}}=\pi_{\theta_b}$, the gradient in the equation above is the on-policy policy gradient update for $\pi_{\theta_\mathcal{E}}$. Otherwise, it is an off-policy policy gradient update for $\pi_{\theta_\mathcal{E}}$.
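To make the update concrete, here is a minimal PyTorch-style sketch of a surrogate loss whose automatic-differentiation gradient matches the expression above; the variable names and the assumption that per-step log-probabilities and advantage estimates are already available are ours, not the paper's implementation:

```python
import torch

def off_policy_pg_surrogate(logp_e, logp_b, advantages):
    """Loss whose gradient matches the off-policy policy gradient above.

    logp_e:     log pi_{theta_E}(a_t|s_t) for the agent being updated (requires grad)
    logp_b:     log pi_{theta_b}(a_t|s_t) under the behavior policy (constant)
    advantages: A(tau, V, t), one value per timestep
    """
    # full-trajectory importance weight, treated as a constant w.r.t. theta_E
    rho = torch.exp((logp_e.detach() - logp_b).sum())
    # negative sign: minimizing this surrogate ascends the policy gradient
    return -rho * (logp_e * advantages).sum()
```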

Practical Objective and Exploration Strategy

The gradient in the equation above is unbiased, but its product of importance sampling ratios $\left(\prod_{t=0}^{T-1} \frac{\pi_{\theta}\left(a_t | s_t\right)}{\pi_{\theta_b}\left(a_t | s_t\right)}\right)$ is known to cause high variance (Wang et al. 2016). In practice, we clip the gradient in the same way as Proximal Policy Optimization (PPO) (Schulman et al. 2017):
$$\mathbb{E}_{\tau\sim\pi_{\theta_b}}\left[\sum_{t=0}^{T-1}\left[\nabla_{\theta} \min \left( \rho_t(\theta) A(\tau, V, t), \operatorname{clip}\left(\rho_t(\theta), \varepsilon \right) A(\tau, V, t) \right) \right] \right]$$

$$\rho_t(\theta) = \frac{\pi_{\theta}\left(a_t | s_t\right)}{\pi_{\theta_b}\left(a_t | s_t\right)}, \quad \operatorname{clip}\left(\rho_t(\theta), \varepsilon\right) = \begin{cases} 1-\varepsilon &\text{if } \rho_t(\theta) < 1-\varepsilon\\ 1+\varepsilon &\text{if } \rho_t(\theta) > 1+\varepsilon \\ \rho_t(\theta) &\text{otherwise} \end{cases}$$
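A hedged sketch of the corresponding clipped surrogate, again in PyTorch style; the clip coefficient value and variable names are illustrative assumptions rather than the paper's exact implementation:

```python
import torch

def clipped_pg_surrogate(logp_e, logp_b, advantages, eps=0.2):
    """PPO-style clipped surrogate with per-timestep ratios against the
    behavior policy, replacing the full importance-weight product."""
    ratio = torch.exp(logp_e - logp_b)                       # rho_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).sum()
```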

In addition, we use $\epsilon$-greedy as the exploration strategy $S$ for determining the behavior policy. That is, at each timestep $t$, the behavior policy is selected to be $\pi_{\theta_\mathcal{M}}$ with probability $1-\epsilon$ and $\pi_{\theta_\mathcal{D}}$ for some $\mathcal{D}\in \mathcal{A}$ with probability $\epsilon$. Additionally, $\epsilon$ is set to a constant $95\%$ at the start for some number of time steps (e.g. 800,000), which we refer to as the shift period (the time it takes to start shifting focus away from the auxiliary agents); it then decays linearly to $\epsilon_{end}$ over some number of time steps (e.g. 1,000,000), which we refer to as the adaptation period (the time it takes for the main agent to fully adapt and become more independent).
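As an illustration only, a small helper computing this schedule; the function name is ours and the defaults mirror the example numbers above, not the released code:

```python
def guidance_epsilon(step, shift=800_000, adaptation=1_000_000,
                     eps_start=0.95, eps_end=0.0):
    """Probability of choosing an auxiliary agent as the behavior policy.

    Held at eps_start during the shift period, then decayed linearly to
    eps_end over the adaptation period."""
    if step < shift:
        return eps_start
    if step < shift + adaptation:
        frac = (step - shift) / adaptation
        return eps_start + frac * (eps_end - eps_start)
    return eps_end
```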

Positive Learning Optimization (PLO)

During our initial experiments, we found that the main agent sometimes did not learn useful policies. Our hypothesis is that this was because the main agent was updated with too many trajectories containing zero reward. Performing a large number of updates on these zero-reward trajectories causes the policy to converge prematurely, which manifests as low entropy in the action probability distribution.
To mitigate this issue, we use a simple code-level optimization called Positive Learning Optimization (PLO). It works by skipping the gradient update for $\pi_{\theta_\mathcal{E}} \in \pi_\mathcal{S}$ if the current trajectory contains no reward according to $R_{\mathcal{E}}$. Intuitively, PLO makes sure that the main agent learns from meaningful experience that is associated with positive rewards.
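A minimal sketch of how PLO could be implemented, assuming the per-trajectory rewards under $R_{\mathcal{E}}$ are available; the names below are ours:

```python
def plo_keep(rewards_under_R_E):
    """Positive Learning Optimization (sketch): perform a gradient update on a
    trajectory only if it contains at least one non-zero reward under the
    agent's own reward function R_E; otherwise skip the update."""
    return any(r != 0 for r in rewards_under_R_E)

# hypothetical usage inside the update loop:
# if plo_keep(rewards_under_R_E):
#     agent.update(trajectory)
```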
As shown in the following figure, action guidance without PLO (pink line), which failed to learn the task, eventually reaches lower entropy at around 500,000 time steps than action guidance with PLO (grey line).



[Interactive W&B panel: run set of 17 runs]


Evaluation Environment

We use $\mu$RTS (https://github.com/santiontanon/microrts) as our testbed, a minimalistic RTS game that maintains the core features that make RTS games challenging from an AI point of view: simultaneous and durative actions, large branching factors, and real-time decision making. To interface with $\mu$RTS, we use gym-microrts (https://github.com/vwxyzjn/gym-microrts) to conduct our experiments.

Observation Space

Given a map of size $h\times w$, the observation is a tensor of shape $(h, w, n_f)$, where $n_f$ is the number of feature planes, which have binary values. The observation space used in this paper uses 27 feature planes, as shown in the following table. A feature plane can be thought of as a concatenation of multiple one-hot encoded features. As an example, if there is a worker with hit points equal to 1, not carrying any resources, owned by Player 1, and currently not executing any action, then the one-hot encoded features will look like the following:
$$[0,1,0,0,0],\ [1,0,0,0,0],\ [1,0,0],\ [0,0,0,0,1,0,0,0],\ [1,0,0,0,0,0]$$

The 27 values of the feature planes at that worker's position in the map will thus be:
$$[0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0]$$
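For illustration, a small NumPy sketch that reproduces the encoding above; the plane sizes match the 27 features described, but the index conventions (e.g. which unit-type index means "worker") are our assumptions:

```python
import numpy as np

# Assumed plane sizes summing to the 27 features described above:
# hit points (5), resources carried (5), owner (3), unit type (8), current action (6)
PLANE_SIZES = [5, 5, 3, 8, 6]

def encode_cell(hit_points, resources, owner, unit_type, action):
    """One-hot encode a single map cell into its 27 binary feature values."""
    indices = [hit_points, resources, owner, unit_type, action]
    planes = [np.eye(n, dtype=np.int8)[i] for n, i in zip(PLANE_SIZES, indices)]
    return np.concatenate(planes)

# Worker from the example: hp=1, no resources, owned by Player 1, idle
# (owner index 0 and unit-type index 4 for "worker" are illustrative choices)
print(encode_cell(hit_points=1, resources=0, owner=0, unit_type=4, action=0))
```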


Action Space

Given a map of size $h\times w$, the action is an 8-dimensional vector of discrete values, as specified in the following table. The first component of the action vector represents the unit in the map to issue an action to, the second is the action type, and the remaining components represent the different parameters that different action types can take.
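As a sketch, the action space can be expressed as a `gym` `MultiDiscrete` space; the component sizes below are illustrative rather than copied from the gym-microrts source:

```python
from gym import spaces

h, w = 16, 16  # example map size
# 8 discrete components: [source unit, action type, move dir, harvest dir,
# return dir, produce dir, produce unit type, attack target].
action_space = spaces.MultiDiscrete([h * w, 6, 4, 4, 4, 4, 7, h * w])

print(action_space.sample())  # e.g. [ 37   4   2   0   1   3   5 120]
```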





[Interactive W&B panel: run set of 40 runs]


Task Description

Here is a list of tasks in which we examine the performance of action guidance.
  • LearnToAttack:
    • $R_{\mathcal{M}}$: The agent will get a +1 reward for each valid attack action it issues. This is a sparse reward because it requires the agent to learn to attack the enemy base and one stationary enemy worker at the other side of the map, which requires a long sequence of actions with no intermediate rewards.
    • $R_{\mathcal{A}_1}$: In addition to $R_{\mathcal{M}}$, the agent will get as shaped reward the difference between the previous and current Euclidean distance between the enemy base and the agent's closest unit.
  • ProduceCombatUnits:
    • $R_{\mathcal{M}}$: The agent will get a +1 reward for each combat unit produced. This is a more challenging task because the agent needs to learn to 1) harvest resources, 2) build a barracks, 3) produce combat units once enough resources are gathered, and 4) move produced combat units out of the way so that they do not block the production of new combat units.
    • $R_{\mathcal{A}_1}$: The agent will get +1 for constructing every building (e.g. a barracks), +1 for harvesting resources, and +7 for each combat unit it produces.
  • DefeatRandomEnemy:
    • $R_{\mathcal{M}}$: The agent will get the match outcome as the reward (-1 on a loss, 0 on a draw, and +1 on a win). This is the most difficult task we examined because the agent is subject to the full complexity of the game, being required to make both macro-decisions (e.g. deciding the high-level strategies to win the game) and micro-decisions (e.g. deciding which enemy units to attack).
    • $R_{\mathcal{A}_1}$: The agent will get +5 for winning, +1 for harvesting one resource, +1 for producing one worker, +0.2 for constructing every building, +1 for each valid attack action it issues, +7 for each combat unit it produces, and $+0.2 d$, where $d$ is the difference between the previous and current Euclidean distance between the enemy base and the agent's closest unit.
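One hedged way to view these shaped reward functions is as weight vectors over per-step event counts; the event names below are ours, while the weights mirror the descriptions above:

```python
# Shaped reward R_A1 for each task as weights over per-step event counts
# (event names are ours; values mirror the task descriptions above).
SHAPED_REWARD_WEIGHTS = {
    "LearnToAttack":      {"attack": 1.0, "closer_to_enemy_base": 1.0},
    "ProduceCombatUnits": {"build_building": 1.0, "harvest": 1.0,
                           "produce_combat_unit": 7.0},
    "DefeatRandomEnemy":  {"win": 5.0, "harvest": 1.0, "produce_worker": 1.0,
                           "build_building": 0.2, "attack": 1.0,
                           "produce_combat_unit": 7.0,
                           "closer_to_enemy_base": 0.2},
}

def shaped_reward(task, event_counts):
    """Dot product of per-step event counts with the task's reward weights."""
    weights = SHAPED_REWARD_WEIGHTS[task]
    return sum(weights[k] * event_counts.get(k, 0) for k in weights)
```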

Agent Setup

We used PPO (Schulman et al. 2017) as the base DRL algorithm to incorporate action guidance. We compared the following strategies:
  • Shaped reward: the agent is trained with PPO on $R_{\mathcal{A}_1}$ for each task.
  • Sparse reward: the agent is trained with PPO on $R_{\mathcal{M}}$ for each task.
  • Action guidance - long adaptation: the agent is trained with PPO + action guidance with $\mathit{shift}=2{,}000{,}000$ time steps, $\mathit{adaptation}=7{,}000{,}000$ time steps, and $\epsilon_{end}=0.0$.
  • Action guidance - short adaptation: the agent is trained with PPO + action guidance with $\mathit{shift}=800{,}000$ time steps, $\mathit{adaptation}=1{,}000{,}000$ time steps, and $\epsilon_{end}=0.0$.
  • Action guidance - mixed policy: the agent is trained with PPO + action guidance with $\mathit{shift}=2{,}000{,}000$ time steps, $\mathit{adaptation}=2{,}000{,}000$ time steps, and $\epsilon_{end}=0.5$. We call this agent mixed policy because it will eventually have a 50% chance to sample actions from the main agent and a 50% chance to sample actions from the auxiliary agent; effectively, a mixed agent makes decisions jointly.
Lastly, we also toggle the PLO option for the action guidance - long adaptation, action guidance - short adaptation, action guidance - mixed policy, and sparse reward training strategies for a preliminary ablation study.
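For reference, the three action guidance variants differ only in their $\epsilon$ schedule; summarized here as a small, hypothetical config dict (field names are ours):

```python
# Hyperparameters of the action guidance variants compared above
# (time steps; eps_end is the final probability of sampling auxiliary agents).
AG_VARIANTS = {
    "long adaptation":  dict(shift=2_000_000, adaptation=7_000_000, eps_end=0.0),
    "short adaptation": dict(shift=800_000,   adaptation=1_000_000, eps_end=0.0),
    "mixed policy":     dict(shift=2_000_000, adaptation=2_000_000, eps_end=0.5),
}
```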

Experimental Results





[Interactive W&B panel: run set of 40 runs]


Action Guidance Optimizes the Sparse Reward

This is perhaps the most important contribution of our paper. Action guidance optimizes the main agent over the true objective, rather than optimizing shaped rewards. Specifically:
  • In the ProduceCombatUnits task, as shown at the beginning of this post, the agent trained with shaped reward would only start producing combat units once all the resources had been harvested. In contrast, the agents trained with action guidance - short adaptation would harvest resources and produce combat units concurrently. Even though they gain similar shaped rewards, the latter behavior matches the common pattern observed in professional RTS game players and is obviously more desirable because, should the enemy attack early, the agents will have enough combat units to defend.
  • In the DefeatRandomEnemy task, we find the performance of action guidance to be far superior. Below is a typical example: the agents trained with shaped rewards learn a variety of behaviors due to the brittleness of reward shaping; some learn to do a worker rush while others learn to focus heavily on harvesting resources and producing units. The agents trained with action guidance - long adaptation, however, almost always learn to do a worker rush, which is an efficient way to win against a random enemy.



[Interactive W&B panels: run sets of 3, 3, 20, and 40 runs]


Conclusion

In this paper, we present a novel technique called *action guidance* that successfully trains the agent to eventually optimize over sparse rewards yet does not lose the sample efficiency that comes with reward shaping, effectively getting the best of both worlds. Our experiments with DefeatRandomEnemy in particular show that it is possible to train a main agent on the full game of $\mu$RTS using only the match outcome reward, which suggests action guidance could serve as a promising alternative to the training paradigm of AlphaStar, which uses supervised learning with human replay data to bootstrap an agent.
As part of our future work, we would like to scale up the approach to defeat stronger opponents.

All Learning Curves




[Interactive W&B panels: three run sets of 90 runs each]