# Gym-μRTS: Toward Affordable Deep Reinforcement Learning Research in Real-Time Strategy Games

Train agents to play an RTS game with commodity machines (one GPU, three vCPU, 16GB RAM). Made by Costa Huang using Weights & Biases
Authors: Shengyi Huang, Santiago Ontañón, Chris Bamford, Lukasz Grela.

## Overview

In recent years, researchers have achieved great success in applying Deep Reinforcement Learning (DRL) algorithms to Real-time Strategy (RTS) games. Most notably, DeepMind trained a grandmaster-level AI called AlphaStar with DRL for the popular RTS game StarCraft II (Vinyals et al. 2019). AlphaStar demonstrates impressive strategy and game control, presenting many human-like behaviors, and is able to defeat professional players consistently. Given that most previously designed bots failed to perform well in the full game against humans (Ontañón et al. 2013), AlphaStar clearly represents a significant milestone in the field.
While this accomplishment is impressive, it comes with a high computational cost. In particular, AlphaStar (Vinyals et al. 2019) and even further attempts by other teams to lower the computational cost (Han et al. 2020) still require thousands of CPUs and TPUs to train the agents for an extended period of time, which is outside of the computational budget of most researchers.
This paper makes two main contributions to address this issue:
• We introduce Gym-μRTS (pronounced "gym-micro-RTS") as a faster-to-run RL environment for full-game RTS research
• We present a collection of techniques to scale DRL to play full-game μRTS as well as detailed ablation studies to demonstrate their empirical importance
Our best-trained bot can defeat every μRTS bot we tested from the past μRTS competitions (when working in a single-map setting), resulting in a state-of-the-art DRL agent while only taking about 60 hours of training using a single machine (one GPU, three vCPUs, 16GB RAM).

## Background

Real-time Strategy (RTS) games are complex adversarial domains, typically simulating battles between a large number of combat units, that pose a significant challenge to both human and artificial intelligence (Buro 2003). Designing AI techniques for RTS games is challenging due to a variety of reasons:
1. Players need to issue actions in real-time, leaving little computational budget.
2. The action space grows combinatorially with the number of units in the game.
3. The rewards are very sparse (win/loss at the end of the game).
4. Generalizing against a diverse set of opponents and maps is difficult.
5. Stochasticity of game mechanics and partial observability (these last two are not considered in this paper).
StarCraft I & II are very popular RTS games and, among other games, have attracted much research attention. Past work in this area includes reinforcement learning (Jaidee and Muñoz-Avila 2012), case-based reasoning (Weber and Mateas 2009; Ontañón et al. 2010), and game tree search (Balla and Fern 2009; Churchill et al. 2021; Justesen et al. 2014; Ontañón 2017), among many other techniques designed to tackle different sub-problems in the game, such as micromanagement or build-order generation. In the full-game setting, however, most techniques had limited success in creating viable agents that play competitively against professional StarCraft players until recently. In particular, DeepMind introduced AlphaStar (Vinyals et al. 2019), an agent trained with DRL and self-play that set a new state of the art for StarCraft II bots, defeating professional players in the full game. In Dota 2, a popular collaborative multiplayer online game that shares many challenges with StarCraft, OpenAI Five (Berner et al. 2019) was able to achieve super-human performance. Although these two systems achieve great performance, they come with large computational costs: AlphaStar used 3072 TPU cores and 50,400 preemptible CPU cores for a duration of 44 days (Vinyals et al. 2019). This makes it difficult for those with fewer computational resources to do full-game RTS research using DRL.
There are usually three ways to circumvent these computational costs. The first is to focus on sub-problems such as combat scenarios (Samvelyan et al. 2019). The second is to reduce the full-game complexity by either considering hierarchical action spaces or incorporating scripted actions (Sun et al. 2018; Lee et al. 2018). The third is to use alternative game simulators that run faster, such as Mini-RTS (Tian et al. 2017), Deep RTS (Andersen et al. 2018), and CodeCraft (Winter 2021).
We present Gym-μRTS as an alternative that can be used for full-game RTS research with the full action space while using affordable computational resources.

## Gym-μRTS

Gym-μRTS is a reinforcement learning interface for the RTS game simulator μRTS. Despite having a simplified implementation, μRTS captures the core challenges of RTS games, such as a combinatorial action space, real-time decision making, and (optionally) partial observability and stochasticity. Gym-μRTS's observation space provides a series of feature maps similar to PySC2 (the StarCraft II Learning Environment). Its action space design, however, is more low-level due to its lack of AI-assisted actions. In this section, we introduce the technical details of both spaces.

### Observation Space

The observation design is straightforward. Given a map of size h\times w, the observation is a tensor of shape (h, w, n_f), where n_f is the number of feature planes, which have binary values. The observation space used in this paper uses 27 feature planes, as shown in Table 1. The feature planes result from the concatenation of multiple one-hot encoded features. As an example, if there is a worker with hit points equal to 1, not carrying any resources, owned by Player 1, and currently not executing any actions, then the one-hot encoded features will look like the following (see Table 1):
[0,1,0,0,0], [1,0,0,0,0], [1,0,0], [0,0,0,0,1,0,0,0], [1,0,0,0,0,0]
Each feature plane contains one value for each coordinate in the map. The values of the 27 feature planes at that worker's position in the map will thus be:
[0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0]
Table 1: Observation features and action components. a_r = 7 is the maximum attack range.
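The worked example above can be sketched in a few lines of Python. The plane sizes below are read off the five feature groups of Table 1; the concrete feature indices (e.g. unit type "worker" at index 4) are assumptions chosen to match the example vector, not part of the Gym-μRTS API:

```python
import numpy as np

# Plane sizes for the five one-hot feature groups (hit points,
# resources carried, owner, unit type, current action) -> 27 planes.
PLANE_SIZES = [5, 5, 3, 8, 6]

def one_hot(index, size):
    """Return a one-hot vector of the given size."""
    v = np.zeros(size, dtype=np.float32)
    v[index] = 1.0
    return v

def encode_cell(feature_indices):
    """Concatenate the one-hot encodings of one cell's features."""
    return np.concatenate(
        [one_hot(i, s) for i, s in zip(feature_indices, PLANE_SIZES)]
    )

# The worker from the example: 1 HP, no resources, owned by Player 1,
# unit type "worker" (index 4 here, an assumption), no current action.
cell = encode_cell([1, 0, 0, 4, 0])
# -> the 27-value vector shown above
```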

### Action Space

Compared to traditional reinforcement learning environments, the design of the action space of RTS games is more difficult because, depending on the game state, there is a different number of units to control, and each unit might have a different number of action types and parameters available. This poses a challenge for directly applying off-the-shelf DRL algorithms such as PPO, which generally assume a fixed output size for the actions. Early work on RL in RTS games simply learned policies for individual units, rather than having the policy control all the units at once (Jaidee and Muñoz-Avila 2012).
To address this issue, we decompose the action space into two parts:
• the unit action space, which describes the possible actions for a particular unit
• the player action space, which describes the possible actions of a player, which usually involve the unit actions for all the units that the player owns.
In the unit action space, given a map of size h\times w, the unit action is an 8-dimensional vector of discrete values as specified in Table 1. The first component of the unit action vector identifies the unit in the map to issue the command to, the second is the unit action type, and the rest of the components represent the parameters that the different unit action types can take. Depending on which unit action type is selected, the game engine uses the corresponding parameters to execute the action. As an example, if the RL agent issues a "move south" unit action to the worker at x=3, y=2 in a 16\times16 map, the unit action will be encoded in the following way:
[3 + 2*16, 1, 2, 0, 0, 0, 0, 0]
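A minimal sketch of this encoding (the helper function is illustrative, not part of Gym-μRTS; parameter slot indices follow the 8-dimensional vector above):

```python
# Flatten a unit action into the 8-dimensional vector: source unit
# (x + y * width), action type, then the six parameter slots.
# Unused parameter slots are left at 0.
def encode_unit_action(x, y, map_width, action_type, params):
    action = [x + y * map_width, action_type] + [0] * 6
    for slot, value in params.items():  # slot: index into the vector
        action[slot] = value
    return action

# "Move south": action type 1, move parameter (slot 2) = 2 (south),
# for the worker at x=3, y=2 on a 16x16 map.
a = encode_unit_action(3, 2, 16, action_type=1, params={2: 2})
# -> [35, 1, 2, 0, 0, 0, 0, 0]
```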
In the player action space, we compare two ways to issue player actions to a variable number of units at each frame: Unit Action Simulation (UAS) and Gridnet (Han et al. 2019). Their mechanisms are best illustrated through an example, as shown in Figures 2 and 3, where the player owns two workers and a base in a 4\times5 map. The RL agent sees three units awaiting commands.
• UAS calls the RL policy iteratively. At each step, the policy chooses a unit and issues an action to it. We then compute a "simulated game state" where that action has been issued (and any potential rewards collected). Once all three units have been issued actions, the simulated game states are discarded, and the three actions are collected and sent to the actual game environment.
• Gridnet (Han et al. 2019) works differently. The RL agent issues actions to each cell in this map in one single step, that is, issues in total 4\cdot5=20 unit actions. The environment executes the three valid actions (all actions for cells with no player-owned units are ignored).
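The two mechanisms can be contrasted in a short sketch. This is not the real Gym-μRTS API; `policy` and `simulate` are stand-ins for the policy network and the engine's forward model:

```python
def simulate(state, action):
    """Stand-in for the engine's forward model (hypothetical)."""
    return state + [action]

def uas_player_action(policy, state, num_units):
    """UAS: one policy call per unit; each call sees a simulated
    state that already reflects the previously chosen actions."""
    sim, actions = list(state), []
    for _ in range(num_units):
        a = policy(sim)
        actions.append(a)
        sim = simulate(sim, a)  # discarded once all units are served
    return actions              # sent to the real environment together

def gridnet_player_action(policy, state, h, w):
    """Gridnet: one policy call outputs an action for every cell;
    the engine ignores cells without a player-owned unit."""
    return policy(state, h * w)  # h*w unit actions in one forward pass
```

Note the trade-off visible in the sketch: UAS makes one inference per unit while Gridnet makes one inference per frame regardless of unit count.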

### The Action Spaces of Gym-μRTS and PySC2

Although Gym-μRTS is heavily inspired by and shares many similarities with PySC2, the StarCraft II Learning Environment (Vinyals et al. 2017), their action space designs are considerably different. Specifically, PySC2 has designed its action space to mimic the human interface, while Gym-μRTS has a lower-level action space that requires actions to be issued for each individual unit. This distinction is interesting from a research standpoint because it makes certain challenges easier for an AI agent and others more difficult.
A demo of the "human interface" action space of PySC2. This gif is not directly related to the example below, but it illustrates the way actions are issued.
Consider the canonical task of harvesting resources and returning them to the base. In PySC2, the RL agent needs to issue two actions at two timesteps: 1) select an area that has workers, and 2) move the selected workers towards a coordinate that has resources. The workers will then continue harvesting resources until otherwise instructed. Note that this sequence of actions is assisted by AI algorithms such as path-finding.
After the workers harvest the resources, the engine automatically determines the closest base for returning the resources, and repeats these actions so that resources are harvested continuously. So the challenge for the RL agent is to learn to select the correct area and move to the correct coordinates. In Gym-μRTS, however, the RL agent can only issue primitive actions to the workers, such as "move north one cell" or "harvest the resource one cell to the north". It therefore needs to issue actions to control units at all times, having to learn these AI-assisted decisions from scratch. One benefit of Gym-μRTS's approach is that we can study the agent's ability to learn granular sub-problems without the human-interface limitations. (Notice, however, that μRTS offers both the low-level interface and a PySC2-style interface with AI-assisted actions; for Gym-μRTS, we only expose the former.)
The benefit of PySC2's approach is that it makes imitation learning from human datasets easier, and the resulting agent allows a fairer comparison when evaluated against humans, since the AI and the human are mostly playing the same game. That being said, the human interface can be an artificial limitation for an AI system. In particular, the human interface is constructed to accommodate human limitations: human eyes have limited range, so camera locations are designed to help capture larger maps, and humans have limited physical mobility, so hotkeys are set to help control a group of units with one mouse click. Machines don't have these limitations and can observe the entire map and issue actions to all units individually.

### Reward Function

For our experiments, we use a shaped reward function to train the agents, which gives the agent +10 for winning, 0 for drawing, -10 for losing, +1 for harvesting one resource, +1 for producing one worker, +0.2 for constructing a building, +1 for each valid attack action it issues, and +4 for each combat unit it produces. The reward is given at the frame at which the event is initiated (e.g., an attack takes 5 game frames to finish, but the attack reward is given at the first frame). For reporting purposes, we also keep track of the sparse reward, which is +1 for winning, 0 for drawing, and -1 for losing.
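The shaped reward can be summarized as a weighted sum over per-frame events. The weights below mirror the text; the event names and the event-counting mechanism are hypothetical stand-ins for what the environment would report:

```python
# Event weights from the shaped reward function described above.
SHAPED_WEIGHTS = {
    "win": 10.0, "draw": 0.0, "loss": -10.0,
    "harvest": 1.0, "produce_worker": 1.0, "build": 0.2,
    "valid_attack": 1.0, "produce_combat_unit": 4.0,
}

def shaped_reward(events):
    """Sum the weighted event counts observed at the current frame."""
    return sum(SHAPED_WEIGHTS[e] * n for e, n in events.items())

# A frame where the agent harvested once and issued one valid attack:
r = shaped_reward({"harvest": 1, "valid_attack": 1})
# -> 2.0
```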

## Scaling DRL to Gym-μRTS

We use PPO (Schulman et al. 2017) as the training algorithm for all experiments in this paper. In addition to PPO's core algorithm, many implementation details and empirical settings also have a huge impact on the algorithm's performance (Engstrom, Ilyas, et al. 2020; Andrychowicz et al. 2020).
We start with a PPO implementation that matches the implementation details and benchmarked performance of openai/baselines (see The 32 Implementation Details of Proximal Policy Optimization (PPO) Algorithm for details), and use it along with the architecture from Mnih et al. (2015) (denoted Nature-CNN) as the baseline. We train the RL agents using UAS and Gridnet against CoacAI, the 2020 μRTS competition winner, on the standard 16x16basesWorkers map, where the RL agents always spawn from the top-left position and episodes end after 2000 game ticks. We then incrementally include augmentations for both UAS and Gridnet and compare their relative performance.
We run each ablation with 4 random seeds. Then, we select the best-performing seeds according to the sparse reward function and evaluate these agents against a pool of 11 bots with various strategies that have participated in previous μRTS competitions (other competition bots are not included due to either staleness or difficulty of setup) and 2 baseline bots which are mainly used for testing.
Table 2: The previous μRTS competition bots.
All μRTS bots are configured to use their μRTS competition parameters and setups. The name, category, and best result of these bots are listed in Table 2. The evaluation involves playing 100 games against each bot in the pool, with a maximum of 4000 game ticks per game, and we report the cumulative win rate, the model size, and the total run time in Figure 3.
Figure 3: Ablation study for UAS and Gridnet

Let us now describe the different augmentations we added on top of PPO.

### Action Composition

After having solved the problem of issuing actions to a variable number of units (via either UAS or Gridnet), the next problem is that even the action space of a single unit is too large. Specifically, to issue a single action a_t in μRTS using UAS, according to Table 1, we have to select a source unit, an action type, and the corresponding action parameters. So in total, there are hw\times6\times4\times4\times4\times4\times6\times a_r^2 = 9216\,hw\,a_r^2 possible discrete actions (many of them invalid), which is huge even for small maps (about 50 million in the map size we use in this paper).
To address this problem, we use action composition, where we consider an action as composed of some smaller independent discrete actions. Namely, a_t is composed of a set of smaller actions:
a_{t}^{\text{Source Unit}}, a_{t}^{\text{Action Type}}, a_{t}^{\text{Move Parameter}}, a_{t}^{\text{Harvest Parameter}}, a_{t}^{\text{Return Parameter}}, a_{t}^{\text{Produce Direction Parameter}}, a_{t}^{\text{Produce Type Parameter}}, a_{t}^{\text{Relative Attack Position}}

The policy gradient is then updated in the following way (ignoring PPO's clipping for simplicity):
\sum_{t=0}^{T-1}\nabla_{\theta}\log\pi_{\theta}(a_t|s_t)G_t = \sum_{t=0}^{T-1}\nabla_{\theta} \left( \sum_{a^{d}_{t}\in D} \log\pi_{\theta}(a^{d}_{t}|s_t) \right)G_t = \sum_{t=0}^{T-1}\nabla_{\theta} \log \left( \prod_{a^{d}_{t}\in D} \pi_{\theta}(a^{d}_{t}|s_t) \right)G_t
Implementation-wise, for each action component, the policy outputs logits of the corresponding shape, which we refer to as action component logits. Each action component a^{d}_{t} is sampled from a softmax distribution parameterized by these logits. In this way, the algorithm has to generate hw+6+4+4+4+4+6+a_r^2 = hw + 28 + a_r^2 logits, significantly fewer than 9216\,hw\,a_r^2 (hundreds of logits versus tens of millions of discrete actions).
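A minimal numpy sketch of action composition: each component is sampled from its own softmax, and the joint log-probability is the sum of the per-component log-probabilities, matching the decomposition of \log\pi_\theta(a_t|s_t) above. The component sizes assume a 16\times16 map and a_r = 7; a real implementation would take the logits from the policy network:

```python
import numpy as np

# Component sizes: source unit (hw), action type, 4 move/harvest/
# return/produce-direction params, produce type, attack position.
COMPONENT_SIZES = [16 * 16, 6, 4, 4, 4, 4, 6, 7 * 7]

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def sample_composed_action(logits, rng):
    """Sample one discrete value per component from a flat logits
    vector of length sum(COMPONENT_SIZES); return the composed
    action and its joint log-probability."""
    action, log_prob, start = [], 0.0, 0
    for size in COMPONENT_SIZES:
        p = softmax(logits[start:start + size])
        a = rng.choice(size, p=p)
        action.append(int(a))
        log_prob += np.log(p[a])  # sum of component log-probs
        start += size
    return action, log_prob

rng = np.random.default_rng(0)
action, logp = sample_composed_action(np.zeros(sum(COMPONENT_SIZES)), rng)
```

With all-zero (uniform) logits, the joint log-probability reduces to minus the sum of the log component sizes, which is an easy sanity check.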

### Invalid Action Masking

The next most important augmentation in our experiments is invalid action masking, which "masks out" invalid actions from the action space (by exploiting the fact that we know the rules of the game), significantly reducing it. This is used in PySC2 (Vinyals et al. 2017), OpenAI Five (Berner et al. 2019), and a number of related works with large action spaces (Samvelyan et al. 2019).
Figure 4: Neural network architectures for Gridnet and UAS. The green boxes are (conditional) inputs from the environments, blue boxes are neural networks, red boxes are outputs, and purple boxes are sampled outputs.
In the interest of the ablation study, we also conduct experiments that provide masking on the action types but not the action parameters, which is closer to PySC2's setting. As shown in Figure 3, having only a partial mask has little impact, whereas having the full mask considerably improves performance. Although the action spaces of Gym-μRTS and PySC2 are quite different, as discussed above, masking all invalid actions maximally reduces the action space, hence simplifying the learning task. We therefore believe that PySC2 agents could receive a performance boost from providing masks on function arguments as well.
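A common way to implement invalid action masking, sketched below: before sampling, the logits of invalid actions are replaced with a large negative value, so their softmax probability becomes (numerically) zero and no gradient flows to them through the sampled action:

```python
import numpy as np

def masked_softmax(logits, mask):
    """mask[i] is True when action i is valid in the current state."""
    masked = np.where(mask, logits, -1e8)  # suppress invalid actions
    masked = masked - masked.max()         # numerical stability
    e = np.exp(masked)
    return e / e.sum()

logits = np.array([1.0, 2.0, 3.0, 4.0])
mask = np.array([True, False, True, False])  # actions 1 and 3 invalid
p = masked_softmax(logits, mask)
# probabilities of the invalid actions are ~0
```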

### Other augmentations

This section details other additional augmentations that contribute to the agents' performance, but not as much as the previous two (which are essential for having an agent that even starts learning to play the full game).

#### Diverse Opponents

The baseline setting is to train the agents against CoacAI. However, this lacks diversity of experience, and when evaluating, we frequently see the agents being defeated by AIs as simple as WorkerRush. To help alleviate this problem, we train the agents against a diverse set of built-in bots. Since we train with 24 parallel environments for PPO, we set 18 of these environments to have CoacAI as the opponent, 2 to have RandomBiasedAI, 2 to have WorkerRush, and 2 to have LightRush. Per Figure 3, we see a rather significant performance boost for Gridnet, whereas for UAS the performance boost is milder.
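The opponent assignment above is just a fixed mapping from environment index to bot; a sketch (the bot names are from the text, while any environment-construction API around this would be hypothetical):

```python
# 18 CoacAI + 2 RandomBiasedAI + 2 WorkerRush + 2 LightRush
# across the 24 parallel PPO environments.
OPPONENT_POOL = (
    ["CoacAI"] * 18
    + ["RandomBiasedAI"] * 2
    + ["WorkerRush"] * 2
    + ["LightRush"] * 2
)

def opponent_for_env(env_index):
    """Opponent bot for parallel environment env_index (0-23)."""
    return OPPONENT_POOL[env_index]
```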

#### Nature-CNN vs Impala-CNN vs Encoder-Decoder

To seek better neural network architectures, we experimented with the use of residual blocks (He et al. 2015) (denoted IMPALA-CNN), which have been shown to improve agents' performance in several domains such as DMLab (Espeholt et al. 2018).
Additionally, Han et al. (2019) also experimented with an encoder-decoder network in Gridnet, so we also conducted experiments using this architecture. Per the ablation study in Figure 3, we see that IMPALA-CNN helps the performance of UAS, whereas the encoder-decoder benefits Gridnet.

## Discussions

### Establishing a SOTA in Gym-μRTS

According to Figure 3, our best agent is trained with PPO + CoacAI + invalid action masking + diverse opponents + IMPALA-CNN, reaching a cumulative win rate of 91%. Additionally, the panel below shows the specific match results and videos of the agent competing against the bots in the pool, showing that this agent can outperform all other bots in the pool.
Note that in the μRTS competition settings the players can start in two different locations of the map, whereas our agent always starts from the top left. Nevertheless, due to the symmetric nature of the map, we can address this issue by "rotating" the map when needed so that both starting locations look the same to our agent. Our agent therefore establishes the state of the art for μRTS on the 16x16basesWorkers map. Note that generalizing to a variety of maps (including asymmetric ones) under μRTS competition settings is part of our future work.
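The "rotation" trick on the observation side can be sketched in one numpy call: when the agent spawns in the bottom-right, rotate the (h, w, n_f) observation tensor 180 degrees so the map looks as if the agent started from the top left. (A full implementation would also have to map actions back: flip direction parameters and re-index positions; only the observation side is shown here.)

```python
import numpy as np

def rotate_observation(obs):
    """Rotate an (h, w, n_f) observation 180 degrees in the map axes."""
    return np.rot90(obs, k=2, axes=(0, 1))

obs = np.zeros((16, 16, 27), dtype=np.float32)
obs[15, 15, 0] = 1.0  # a unit in the bottom-right corner
rotated = rotate_observation(obs)
# the unit now appears at the top-left corner of the rotated view
```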
As part of our future work, we would like to include agents like Droplet in our training process. However, search-based bots like Droplet significantly decrease the speed of training.

### Hardware Usage and Training Time

Most of our experiments are conducted using 3 vCPUs, 1 GPU, and 16GB RAM. According to Figure 3, the experiments take anywhere from 37 hours to 117 hours, where our SOTA agent takes 63 hours.

### Model size vs performance

Gridnet models have more parameters compared to the UAS models. This is because Gridnet predicts the action type and parameter logits for every cell in the map.
We did not find a strong correlation between a model's size (in number of trainable parameters) and the performance of the agents. As shown in Figure 3, techniques such as invalid action masking or different neural network architectures matter more to performance than the sheer number of trainable parameters in our experiments.

### Variance w.r.t. Shaped and Sparse Rewards

In almost all experiments conducted in this paper, we observe that the RL agents are able to optimize against the shaped rewards well, showing little variance across different random seeds; however, this is not the case with respect to the sparse reward (win/loss). We report the sum of shaped rewards and of sparse rewards in an episode as the shaped return and sparse return, respectively, in the following panels, where we usually see little difference in the shaped return while the sparse (win/loss) return can be drastically different. This is a common drawback of reward shaping: agents sometimes overfit to the shaped rewards instead of the sparse rewards.

### UAS vs Gridnet

The following panels show a typical result where Gridnet is able to get a much higher shaped return, but it receives a relatively similar sparse return to UAS.
Depending on the implementation, Gridnet agents usually have many more trainable parameters. Also, when the player owns a relatively small number of units, it is faster to step the environment using UAS because Gridnet has to predict an action for every cell in the map; however, when the player owns a large number of units, Gridnet's mechanism becomes faster because UAS has to do more simulated steps and thus more inferences.

### The Amount of Human Knowledge Injected

In our best-trained agents, there are usually three sources of injected human knowledge: 1) the reward function, 2) invalid action masking, and 3) the use of human-designed bots such as CoacAI. In comparison, AlphaStar uses 1) human replays, 2) the related use of the statistic $z$ and the supervised KL divergence (Vinyals et al. 2019), and 3) invalid action masking.

## Conclusions and Future Work

We present a new efficient library, Gym-μRTS, which allows DRL research to be conducted in the complex RTS environment μRTS. Through Gym-μRTS, we conducted ablation studies on techniques such as action composition, invalid action masking, diversified training opponents, and novel neural network architectures, providing insights into their importance for scaling agents to play the full game of μRTS.
Our agents can be trained on a single CPU+GPU machine within 2-4 days, a reasonable hardware and time budget available to many researchers outside of large research labs.

## Bonus Section 1: Reproduce our SOTA Agent

The source code of all experiments is available on GitHub. To reproduce our SOTA agent, which is PPO + invalid action masking + diverse opponents + IMPALA-CNN, run the following commands:

```shell
git clone https://github.com/vwxyzjn/gym-microrts-paper
cd gym-microrts-paper
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
# if you have wandb installed and have logged in, you could do
python ppo_diverse_impala.py --capture-video --prod-mode
# otherwise the code can also be run locally
python ppo_diverse_impala.py --capture-video
```
Note that, according to our experiments shown in the panel below, the sparse return is subject to more variance. This means you might have to run a couple of experiments to reproduce our SOTA agent:
```shell
python ppo_diverse_impala.py --capture-video --prod-mode --seed 1
python ppo_diverse_impala.py --capture-video --prod-mode --seed 2
python ppo_diverse_impala.py --capture-video --prod-mode --seed 3
python ppo_diverse_impala.py --capture-video --prod-mode --seed 4
```

## Bonus Section 2: Selfplay

We have also tried some selfplay experiments; selfplay is a crucial component in recent work such as AlphaStar (Vinyals et al. 2019). If the agents issue actions via Gridnet, selfplay can be implemented naturally with PPO's parallel environments. That is, assuming there are 2 parallel environments, we can spawn 1 game under the hood, return player 1's and player 2's observations as the first and second parallel environments, respectively, and apply the two players' actions accordingly.
However, note that the agents in the selfplay experiments learn to handle both starting locations of the map, which is a different setting. For a fair comparison with the other experiments in the main text, those experiments would also need to be configured with randomized starting locations. Nevertheless, it is fun to see the RL agent fight against itself: