Procgen and Learning to Generalize in RL
Initial results on OpenAI's Procgen environments using CleanRL's implementations
How can reinforcement learning agents adapt well to new and unseen scenarios? This is a pressing question in current research, as RL agents often fail to perform well outside of their training environments. Enter OpenAI’s Procgen, a set of 16 games where every episode encountered by an agent is a unique level created through procedural content generation. Though many of the games in Procgen are inspired by classic arcade games, the benchmark was designed to increase the diversity of scenarios agents see during training, and thus gives a much better measure of different algorithms’ ability to generalize.
In this report, we present the initial results of CleanRL’s implementations of PPO and PPG on the games in the Procgen benchmark. Training runs for 100 million steps, and we note that these results use Procgen’s “hard” distribution rather than the “easy” distribution, which requires less compute. This follows the original papers introducing the Procgen benchmark and the Phasic Policy Gradient (PPG) algorithm.
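For context, here is a minimal sketch of how a hard-distribution Procgen environment can be constructed with the procgen package. The game name, number of parallel environments, and the unbounded level count (num_levels=0) are illustrative choices rather than the exact settings used in these runs.

from procgen import ProcgenEnv

# Vectorized environment on the "hard" distribution; num_levels=0 samples from
# the full, unbounded set of procedurally generated levels.
venv = ProcgenEnv(
    num_envs=64,              # illustrative; CleanRL's scripts expose this as a flag
    env_name="starpilot",     # any of the 16 Procgen games
    num_levels=0,
    start_level=0,
    distribution_mode="hard",
)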
You may notice that the video recordings of these games look somewhat blurry and pixelated. This is by design in our CleanRL implementations. Procgen environments can be stepped very quickly compared with many RL environments, but the video-capture approach originally proposed in the code slows training down considerably. To use CleanRL’s implementations for Procgen, or to reproduce these experiments with video capture, you will need to make a small modification to your installation of Procgen. Open the env.py file in your Procgen installation, scroll to the bottom, and change the end of the file to the following:
class ToBaselinesVecEnv(gym3.ToBaselinesVecEnv):
    metadata = {
        'render.modes': ['human', 'rgb_array'],
        'video.frames_per_second': 24,
    }

    def render(self, mode="human"):
        # Expose an rgb_array render mode so standard video-recording wrappers work.
        info = self.env.get_info()[0]
        _, ob, _ = self.env.observe()
        if mode == "rgb_array":
            if "rgb" in info:
                return info["rgb"]
            else:
                # Fall back to the raw 64x64 observation when no render frame is available.
                return ob['rgb'][0]


def ProcgenEnv(num_envs, env_name, **kwargs):
    return ToBaselinesVecEnv(ProcgenGym3Env(num=num_envs, env_name=env_name, **kwargs))
Alternatively, you can reference or directly download this modified env.py from my fork of Procgen on GitHub: https://github.com/bragajj/procgen/blob/master/procgen/env.py
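With the patch in place, render(mode="rgb_array") returns a frame that standard video-recording wrappers can consume. Here is a minimal sketch for checking this by hand; the game choice and settings are arbitrary:

from procgen import ProcgenEnv

# Requires the modified env.py above; render(mode="rgb_array") falls back to the
# raw 64x64 observation when no dedicated render frame is available.
venv = ProcgenEnv(num_envs=1, env_name="chaser", distribution_mode="hard")
frame = venv.render(mode="rgb_array")
print(frame.shape)  # e.g. (64, 64, 3)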
Of all the games in the Procgen benchmark, “Chaser” has become by far my favorite. At first, this game looked like a simple clone of Ms. Pacman with procedurally generated levels. But the consequences of the procedural content generation result in a game I believe to be far superior and more interesting than the original that inspired it. After watching just a few games of RL agents playing Chaser, it became obvious how much I have been conditioned to play Ms. Pacman in a certain way.
Firstly, in “Chaser” the stars on the map are the pieces the player can collect to make their enemies vulnerable, just as the ghosts in Ms. Pacman become blue and destroyable upon the consumption of certain dots. However, these stars can spawn anywhere on the map, and it's possible to see levels where two stars spawn directly next to each other in the middle of the map. This means a player must either commit to using both at once and complete the rest of the level while being chased by enemies, or use one star and backtrack to leave the other intact. This sort of gameplay is unthinkable in the traditional Ms. Pacman game, but here it is exactly the kind of scenario that makes “Chaser” all the more interesting. Further, in Ms. Pacman the enemies that have been destroyed respawn at the center of the map before they return to chase the title character. In “Chaser” there is no fixed center of the map, and because each maze is generated differently, the enemies respawn at random positions. With both of these simple yet challenging changes to the game that inspired it, “Chaser” becomes a much more hectic and unpredictable game to watch. I encourage you to watch videos of this game, as you may find, like I did, that typical strategy in Ms. Pacman is largely a consequence of its level designs.
Finally, the results on Starpilot presented here also offer a look at our future work: replacing the Nature CNN with the IMPALA CNN to improve the implementations. The significant performance increase offered by the IMPALA CNN is the reason it has been favored in many of the academic papers reporting results on Procgen, which makes it a strong candidate for our team's future experiments. With games like “Dodgeball” and “Heist” proving especially difficult for PPO and PPG, it will be interesting to see how much further this modification can improve performance.
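For reference, below is a minimal PyTorch sketch of an IMPALA-style encoder: three convolutional sequences with 16, 32, and 32 channels, each consisting of a conv + max-pool downsample followed by two residual blocks, feeding a 256-unit fully connected layer. The class names, hidden size, and the assumption of 64x64 RGB Procgen frames in NCHW layout are illustrative choices rather than the exact configuration our future experiments will use.

import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    # One IMPALA-style residual block: two 3x3 convs with ReLU pre-activations.
    def __init__(self, channels):
        super().__init__()
        self.conv0 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        h = self.conv0(torch.relu(x))
        h = self.conv1(torch.relu(h))
        return x + h


class ConvSequence(nn.Module):
    # Conv + max-pool downsample (stride 2) followed by two residual blocks.
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.res0 = ResidualBlock(out_channels)
        self.res1 = ResidualBlock(out_channels)

    def forward(self, x):
        x = self.pool(self.conv(x))
        return self.res1(self.res0(x))


class ImpalaCNN(nn.Module):
    # Three conv sequences (16, 32, 32 channels) and a 256-unit linear head.
    def __init__(self, in_channels=3, hidden_dim=256):
        super().__init__()
        self.stem = nn.Sequential(
            ConvSequence(in_channels, 16),
            ConvSequence(16, 32),
            ConvSequence(32, 32),
        )
        # 64x64 Procgen frames shrink to 8x8 after three stride-2 pools.
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.ReLU(),
            nn.Linear(32 * 8 * 8, hidden_dim),
            nn.ReLU(),
        )

    def forward(self, x):
        # x: (batch, 3, 64, 64) frames in NCHW layout, scaled from [0, 255].
        return self.head(self.stem(x / 255.0))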
[Charts: per-game results with their associated run sets. Starpilot (8 runs), Coinrun (3 runs), Dodgeball (2 runs), Bigfish (4 runs), Climber (2 runs), Jumper (4 runs), Bossfight (4 runs), plus two additional panels of 8 runs each.]