This report presents our latest work *Action Guidance: Getting the Best of Sparse Rewards and Shaped Rewards for Real-time Strategy Games* in an interactive manner.

Co-author: Santiago Ontañón

Quick intro

Training agents using Reinforcement Learning in games with sparse rewards is a challenging problem, since large amounts of exploration are required in order to receive a single reward. To tackle this problem, a common approach is to use reward shaping to help exploration. However, an important drawback of reward shaping is that agents sometimes learn to optimize the shaped reward instead of the true objective.

In this report, we present a novel technique that we call action guidance, which successfully trains agents to eventually optimize the true objective in games with sparse rewards without losing the sample efficiency that comes with reward shaping. We evaluate our approach in a simplified real-time strategy (RTS) game simulator called μRTS.

Rough idea

It works by constructing a main agent that learns only from the sparse reward function $R_{\mathcal{M}}$ and some auxiliary agents that learn from the shaped reward functions $R_{\mathcal{A}_1}, R_{\mathcal{A}_2}, \dots, R_{\mathcal{A}_n}$. During the initial stage of training, the main agent has a high probability of taking action guidance from the auxiliary agents; that is, the main agent will likely execute actions sampled from the auxiliary agents.

The main agent and the auxiliary agents are updated via off-policy policy gradient. As training goes on, the main agent becomes more independent and executes more actions sampled from its own policy. The auxiliary agents learn from shaped rewards and therefore make training sample-efficient, while the main agent learns from the sparse reward, ensuring that the agent will eventually optimize the true objective. Action guidance thus gets the best of both worlds.
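
To make this concrete, below is a minimal Python sketch of the two pieces this paragraph describes: choosing the behavior policy and letting each agent score the shared trajectory with its own reward function. The types, function names, and the simple (state, action, events) trajectory format are illustrative assumptions, not the paper's implementation.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Illustrative types: a trajectory step stores the state, the action taken,
# and a dict of game events from which each agent can compute its own reward.
Step = Tuple[object, int, dict]
RewardFn = Callable[[dict], float]


def select_behavior_policy(main_policy, aux_policies: Sequence, epsilon: float):
    """With probability epsilon, take action guidance from a random auxiliary
    agent; otherwise act from the main agent's own policy."""
    if random.random() < epsilon:
        return random.choice(list(aux_policies))
    return main_policy


def relabel_rewards(trajectory: List[Step], reward_fn: RewardFn) -> List[float]:
    """Each agent reads the same trajectory but scores it with its own reward
    function: sparse for the main agent, shaped for the auxiliary agents."""
    return [reward_fn(events) for _, _, events in trajectory]
```

Each agent would then run the clipped off-policy update described in the theory section below on the trajectory it just re-labeled.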

Demo: learn to produce combat units (sparse rewards)

  • Naturally, the sparse reward is +1 for producing combat units (blue or yellow circles in the videos below)
  • The task is difficult because
    • The action space is large
    • The agent needs to learn to harvest resources, build a barracks, and finally produce combat units
  • To help with the sparse reward, we provide a shaped reward: +1 for harvesting resources or building a barracks, and +7 for producing combat units (a toy sketch of both reward functions follows this list)
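
As a toy illustration of the two reward signals above, here is a sketch that scores a per-step `events` dictionary; the dictionary keys are hypothetical and not part of the μRTS API.

```python
def sparse_reward(events: dict) -> float:
    """+1 only for producing a combat unit (the true objective)."""
    return 1.0 * events.get("combat_units_produced", 0)


def shaped_reward(events: dict) -> float:
    """+1 for harvesting a resource or building a barracks,
    +7 for producing a combat unit."""
    return (1.0 * events.get("resources_harvested", 0)
            + 1.0 * events.get("barracks_built", 0)
            + 7.0 * events.get("combat_units_produced", 0))
```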

Below are selected videos of agents trained with the shaped reward, action guidance, and the sparse reward, along with the corresponding charts of the number of combat units produced per episode over the course of training.

Feel free to click on the title of an individual video to check out every metric of that experiment.




[Panel: run set of 3 (videos and combat-unit production charts)]


Theory on Action Guidance

Specifically, let us define $\mathcal{M}$ as the MDP that the main agent learns from and $\mathcal{A}=\{\mathcal{A}_1, \mathcal{A}_2, \dots, \mathcal{A}_k\}$ as a set of auxiliary MDPs that the auxiliary agents learn from. In our constructions, $\mathcal{M}$ and $\mathcal{A}$ share the same state, observation, and action space. However, the reward function for $\mathcal{M}$ is $R_{\mathcal{M}}$, which is the sparse reward function, and the reward functions for $\mathcal{A}$ are $R_{\mathcal{A}_1}, \dots, R_{\mathcal{A}_k}$, which are the shaped reward functions. For each of these MDPs $\mathcal{E} \in \mathcal{S}=\{\mathcal{M}\} \cup \mathcal{A}$, let us initialize a policy $\pi_{\theta_\mathcal{E}}$ parameterized by parameters $\theta_\mathcal{E}$, respectively. Furthermore, let us use $\pi_\mathcal{S} = \{\pi_{\theta_\mathcal{E}} \mid \mathcal{E} \in \mathcal{S}\}$ to denote the set of these initialized policies.

At each timestep $t$, let us use some exploration strategy $S$ that selects a policy $\pi_b \in \pi_\mathcal{S}$ to sample an action $a_t$ given $s_t$. At the end of the episode, each policy $\pi_{\theta_\mathcal{E}} \in \pi_\mathcal{S}$ can be updated via its off-policy policy gradient (Degris, White, and Sutton 2012; Levine et al. 2020):

$$\mathbb{E}_{\tau\sim\pi_{\theta_b}}\left[\left(\prod_{t=0}^{T-1} \frac{\pi_{\theta_\mathcal{E}}\left(a_t \mid s_t\right)}{\pi_{\theta_b}\left(a_t \mid s_t\right)}\right) \sum_{t=0}^{T-1} \nabla_{\theta} \log \pi_{\theta_\mathcal{E}}\left(a_t \mid s_t\right) A(\tau, V, t) \right]$$

When $\pi_{\theta_\mathcal{E}}=\pi_{\theta_b}$, the gradient above reduces to the on-policy policy gradient update for $\pi_{\theta_\mathcal{E}}$; otherwise, it is an off-policy policy gradient update for $\pi_{\theta_\mathcal{E}}$.
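
As a concrete reading of this estimator, the following PyTorch-style sketch builds a surrogate loss whose gradient matches the expression above for a single trajectory; the tensor names and shapes are our own assumptions.

```python
import torch


def off_policy_pg_loss(logp_target: torch.Tensor,
                       logp_behavior: torch.Tensor,
                       advantages: torch.Tensor) -> torch.Tensor:
    """Unclipped off-policy policy-gradient surrogate for one trajectory.

    logp_target:   log pi_{theta_E}(a_t | s_t), shape (T,), requires grad
    logp_behavior: log pi_{theta_b}(a_t | s_t), shape (T,), detached
    advantages:    A(tau, V, t),                shape (T,), detached
    """
    # Product over t of the importance sampling ratios; detached so that it
    # acts as a constant weight, matching the gradient expression above.
    is_weight = torch.exp((logp_target - logp_behavior).sum()).detach()
    # Negate so that minimizing this loss ascends the policy objective.
    return -is_weight * (logp_target * advantages).sum()
```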

Practical Objective and Exploration Strategy

The gradient in the equation above is unbiased, but its product of importance sampling ratios $\left(\prod_{t=0}^{T-1} \frac{\pi_{\theta}\left(a_t \mid s_t\right)}{\pi_{\theta_b}\left(a_t \mid s_t\right)}\right)$ is known to cause high variance (Wang et al. 2016). In practice, we clip the gradient the same way as Proximal Policy Optimization (PPO) (Schulman et al. 2017):

$$\mathbb{E}_{\tau\sim\pi_{\theta_b}}\left[\sum_{t=0}^{T-1} \nabla_{\theta} \min\left(\rho_t(\theta)\, A(\tau, V, t),\ \operatorname{clip}\left(\rho_t(\theta), \varepsilon\right) A(\tau, V, t)\right)\right]$$

$$\rho_t(\theta) = \frac{\pi_{\theta}\left(a_t \mid s_t\right)}{\pi_{\theta_b}\left(a_t \mid s_t\right)}, \qquad \operatorname{clip}\left(\rho_t(\theta), \varepsilon\right) = \begin{cases} 1-\varepsilon & \text{if } \rho_t(\theta) < 1-\varepsilon\\ 1+\varepsilon & \text{if } \rho_t(\theta) > 1+\varepsilon\\ \rho_t(\theta) & \text{otherwise} \end{cases}$$
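
Here is a minimal PyTorch-style sketch of that clipped surrogate, assuming per-step log-probabilities under the target and behavior policies; the tensor names and the default $\varepsilon$ are illustrative.

```python
import torch


def clipped_pg_loss(logp_target: torch.Tensor,
                    logp_behavior: torch.Tensor,
                    advantages: torch.Tensor,
                    eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped surrogate that replaces the full importance-sampling
    product with per-step ratios rho_t clipped to [1 - eps, 1 + eps]."""
    rho = torch.exp(logp_target - logp_behavior)               # rho_t(theta)
    unclipped = rho * advantages
    clipped = torch.clamp(rho, 1.0 - eps, 1.0 + eps) * advantages
    # Elementwise minimum, summed over the trajectory, negated so that
    # gradient descent maximizes the clipped objective.
    return -torch.min(unclipped, clipped).sum()
```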

In addition, we use $\epsilon$-greedy as the exploration strategy $S$ for determining the behavior policy. That is, at each timestep $t$, the behavior policy is selected to be $\pi_{\theta_\mathcal{M}}$ with probability $1-\epsilon$ and $\pi_{\theta_\mathcal{D}}$ for some $\mathcal{D}\in \mathcal{A}$ with probability $\epsilon$. Additionally, $\epsilon$ is held constant at 95% for some number of time steps (e.g., 800,000), which we refer to as the shift period (the time it takes to start shifting focus away from the auxiliary agents); it then decays linearly to $\epsilon_{end}$ over some number of time steps (e.g., 1,000,000), which we refer to as the adaptation period (the time it takes for the main agent to fully adapt and become more independent).
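
A small sketch of this schedule is shown below; the defaults mirror the example numbers above, and the final value $\epsilon_{end}$ (shown here as 0) is a hyperparameter the text leaves open.

```python
def guidance_epsilon(step: int,
                     shift_steps: int = 800_000,
                     adaptation_steps: int = 1_000_000,
                     eps_start: float = 0.95,
                     eps_end: float = 0.0) -> float:
    """Probability of taking action guidance from an auxiliary agent.

    Held constant at eps_start during the shift period, then decayed
    linearly to eps_end over the adaptation period, and kept at eps_end
    afterwards.
    """
    if step < shift_steps:
        return eps_start
    progress = min(1.0, (step - shift_steps) / adaptation_steps)
    return eps_start + progress * (eps_end - eps_start)
```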

Positive Learning Optimization (PLO)

During our initial experiments, we found that the main agent sometimes did not learn useful policies. Our hypothesis is that the main agent was updated with too many trajectories containing zero reward. Doing a large number of updates on these zero-reward trajectories causes the policy to converge prematurely, which manifests as low entropy in the action probability distribution.

To mitigate this issue, we use a simple code-level optimization called Positive Learning Optimization (PLO). It works by skipping the gradient update for $\pi_{\theta_\mathcal{E}} \in \pi_\mathcal{S}$ if the current trajectory contains no reward according to $R_{\mathcal{E}}$. Intuitively, PLO makes sure that the main agent learns from meaningful experience that is associated with positive rewards.
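
Below is a sketch of the PLO gate, assuming each agent has already re-labeled the shared trajectory with its own reward function $R_{\mathcal{E}}$; reading "contains no reward" as all rewards being zero is our interpretation.

```python
def plo_allows_update(rewards) -> bool:
    """Positive Learning Optimization: allow the gradient update for an agent
    only if its own reward function assigned a nonzero reward somewhere in
    the current trajectory; otherwise the update is skipped."""
    return any(r != 0.0 for r in rewards)


# Example: a main-agent trajectory with no sparse reward is skipped.
assert plo_allows_update([0.0, 0.0, 0.0]) is False
assert plo_allows_update([0.0, 0.0, 1.0]) is True
```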

As shown in the following figure, action guidance without PLO (pink line), which failed to learn the task, eventually reaches lower entropy at around the 500,000 time-step mark compared to action guidance with PLO (grey line).




[Panel: run set of 17 (action probability entropy curves)]


Evaluation Environment




[Panels: evaluation charts over run sets of 40, 40, 3, 3, 20, and 40 runs]


Conclusion

In this report, we present a novel technique called *action guidance* that successfully trains the agent to eventually optimize over sparse rewards without losing the sample efficiency that comes with reward shaping, effectively getting the best of both worlds. Our experiments with DefeatRandomEnemy in particular show it is possible to train a main agent on the full game of μRTS using only the match outcome reward, which suggests action guidance could serve as a promising alternative to the training paradigm of AlphaStar, which uses supervised learning on human replay data to bootstrap an agent.

As part of our future work, we would like to scale up the approach to defeat stronger opponents.

All Learning Curves




[Panels: learning curves for run sets of 90 runs]