Facebook AI Research’s Tim & Heinrich on democratizing reinforcement learning research

Since reinforcement learning requires hefty compute resources, it can be tough to keep up without a serious budget of your own. Find out how the team at Facebook AI Research (FAIR) is looking to increase access and level the playing field with the help of NetHack, an archaic rogue-like video game from the late 80s.
Angelica Pan

Listen on these platforms

Apple Podcasts Spotify Google YouTube Soundcloud

Guest bios

Tim Rocktäschel is a Research Scientist at Facebook AI Research (FAIR) London and a Lecturer in the Department of Computer Science at University College London (UCL). At UCL, he is a member of the UCL Centre for Artificial Intelligence and the UCL Natural Language Processing group. Prior to that, he was a Postdoctoral Researcher in the Whiteson Research Lab, a Stipendiary Lecturer in Computer Science at Hertford College, and a Junior Research Fellow in Computer Science at Jesus College, at the University of Oxford.
Follow Tim on Twitter
Heinrich Kuttler is an AI and machine learning researcher at Facebook AI Research (FAIR) and before that was a research engineer and team lead at DeepMind.
Follow Heinrich on Twitter
and on LinkedIn

Show Notes

References

The NetHack Learning Environment
Reinforcement learning, intrinsic motivation
Knowledge transfer

Topics covered

0:00 A lack of reproducibility in RL
1:05 What is NetHack and how did the idea come to be?
5:46 RL in Go vs NetHack
11:04 Performance of vanilla agents, and what do you optimize for?
18:36 Transferring domain knowledge, source diving
22:27 Human vs machine intrinsic learning
28:19 ICLR paper on exploration and RL strategies
35:48 The future of reinforcement learning
43:18 Going from supervised to reinforcement learning
45:07 Reproducibility in RL
50:05 Most underrated aspect of ML, and biggest challenges

Transcript

Note: Transcriptions are provided by a third-party service, and may contain some inaccuracies. Please submit any corrections to angelica@wandb.com. Thank you!
Tim:
What we see right now in the field is that there's lots of interesting reinforcement learning results that come out of industry labs that have a lot of computational resources. And that makes it basically impossible for anyone outside, specifically in academia, to reproduce these results. And that was exactly the kind of motivation behind the environment: it's really complex, but at the same time it should be affordable for grad students and master's students and whatnot to actually do experiments.
Lukas:
You're listening to Gradient Dissent, a show where we learn about making machine learning models work in the real world. I'm your host, Lukas Biewald. Tim is a research scientist at Facebook AI Research and a lecturer at University College London. Heinrich is a research engineer at Facebook AI Research and previously worked at DeepMind. Together, they built the NetHack Learning Environment, which is a super exciting project to make it easier for people to build and experiment with reinforcement learning algorithms. It also operates in a game called NetHack that I've played for the last three decades, so I'm especially excited to talk to these guys.
I had been thinking for a while, wondering how well reinforcement learning would work on the game NetHack, and then I came across your project, where you're actually making an environment where people can try different algorithms in NetHack. So maybe you could start by kind of telling me how you came to this idea as a learning environment, and maybe describe what the game NetHack is.
Heinrich:
NetHack is this really old game that grew out of an even older game in, I think, the '80s or thereabouts. It's as old as Unix, basically. And it's this text-based game, Dungeons & Dragons style. The objective is to go down a dungeon and retrieve a certain item and then go back up and win, and that kind of undersells it, because the actual fun is in interacting with all the monsters and picking up objects, and then there's lots of in-game jokes and it's also generally quite a hard game.
So I've been playing NetHack since I was quite young, I think about 12, when there was a DOS box where someone had installed NetHack and I didn't really understand what to do. And then later on, when I had the internet and there were these so-called spoilers, and I started being able to look at the actual source code of NetHack, I started being able to do more in the game.
Heinrich:
It's really easy to die in NetHack; you don't have a second chance. You can save the game, but then it exits, and when you go back in it picks up where you were. And it's a really hard and really fun game with a still-active community, and it's text-based, but still pretty complex.
Lukas:
I feel like what's notable to me about NetHack is, I've probably played it more than maybe any other game and yet I'm still kind of surprised by things that happen in it. I still find myself looking up what's going on. It seems people will even come up with interesting ideas about how to use the objects in the game, and the game will actually have supported these sort of one-in-a-million-chance occurrences. So it seems incredibly deep; I don't even actually know how deep it goes, despite how simple it looks at first.
Tim:
I fully agree. So the reason why we believe this is an interesting challenge for reinforcement learning is exactly that kind of depth. As Heinrich mentioned, from the looks of it, it's a quite simple game in that it's terminal-based; everything is rendered as these ASCII characters in a terminal. But in fact, it's so deep in terms of the number of items and the number of monsters that you have to learn to deal with that there's always new things to discover.
And on top of that it's procedurally generated; that means every time you enter the game, every time you enter the dungeon, it will be generated in front of you and it will look different from any other episode that you have been playing before. So that gives it also a lot of, I guess, replayability. And it's much closer, I guess, in spirit to more modern games like Minecraft, where also every time you play Minecraft, the world is generated.
And that poses very unique challenges to reinforcement learning because so far, well, for a long time we've been mostly using games like Atari games to test the limits of reinforcement learning agents. And that has been going on for a while and it has been good.
But at some point I think people started to realize that in Atari, when you, let's say, play Breakout or even Montezuma's Revenge, which is one of the hardest games in the Arcade Learning Environment, every time you play that game it's the same. I mean, you can basically memorize sequences of actions through the game that lead you to win the game. And that's exactly what approaches like Go-Explore by Uber AI have exploited to win the game.
So I think it started roughly two, three years ago, when people started to look into these procedurally generated games. I mean, Minecraft is one example, but it's very expensive to render and expensive to simulate. But also, I guess the Obstacle Tower Challenge by Unity is another example of such a procedurally generated environment for reinforcement learning, and then more recently, OpenAI's Procgen Benchmark is another example.
So people are looking more and more for test beds where reinforcement learning agents really have to learn to generalize to novel situations, novel observations. And we believe NetHack is a perfect example of that because it's at the same time really fast to run and really deep, much deeper than many of the 3D games that you could play right now.
Lukas:
I guess I had never thought about this, but I'm also a huge fan of the game Go. And that game also feels deep, but its depth seems to come from a lot of interactions with a small number of rules; whereas I imagine the NetHack code base just having this massive nest of case statements, it's almost like the complexity is intrinsic to it.
I mean, both Go and NetHack, I think, are kind of deep in the sense that it's kind of hard for people to do well. But what is it about reinforcement learning algorithms that worked really well for Go and struggled to do basic things in the NetHack world?
Tim:
So Go is a really interesting case because the depth and the complexity of Go comes from the fact that you're playing against another player. So we should, first of all, state that NetHack is a single player game; you play against the game, you're not playing against another human. And obviously, if you have a very strong human you play against then that's a really hard game.
But what makes this work for reinforcement learning is the fact that Go, as you mentioned, has very simple rules. So it's very clear for a specific action what the next state will look like, and that allows you to basically exploit planning mechanisms that let you plan ahead and think through what happens if somebody plays a specific move and then I play a specific move, what will happen.
And it's still really hard because there's this humongous observation space in Go already, because you have this 19 by 19 board and then on every tile there could be a white stone, a black stone, or no stone. But in NetHack, it's fundamentally different in that the transition dynamics that govern how a state evolves from time step t to the next one are extremely complex.
First of all, it's partially observable. You don't see what's on the entire map; there might be a monster around the corner, but it's not visible to you. On top of that, it's stochastic. So every time you hit a monster, just to give an example, there's a dice roll in the back that determines really how much damage you incur.
And on top of that, there's so many possibilities in terms of what could actually be on the tiles. So there's hundreds, as I mentioned, hundreds of monsters, hundreds of items, each of them coming with all kinds of specific attributes and specific mechanisms that you have to learn about in order to do well. So it's really, really hard to plan ahead. It's also really hard, all the time, to even learn about all of these mechanisms; whereas in Go you can write down the rules easily in a program and you can simulate.
Heinrich:
I think there's another aspect of this comparison with Go and with MCTS-like algorithms. The NetHack community actually has done lots of crazy things outside of research and published papers; there's a few people in the NetHack community doing that, for instance. There's this alt.org website where you have officially recorded games.
And what you could do in the previous version of NetHack is that you would have your local NetHack on your own machine. And you would try out a few things, and whatever you liked best you would do in the actual online running game, where you basically have this perfect simulator, which is NetHack itself. Tim was saying, how could that work? It's not deterministic, it's stochastic.
So the way people did that is that they had a map from all starting positions of the game, with your inventory and so on, to the seed of the RNG. They pre-computed this with a few days and hours of compute, and then they could look up the seed: see a new game, look at your inventory, and that's enough entropy to tell you what seed you are in. Then you know with which seed to initialize your local version, and then you can actually beat the game in no time, because you have the perfect simulator.
And then the NetHack DevTeam produced a new version of NetHack that makes this impossible, where you can no longer manipulate the RNG state by walking against walls or whatever, the way that these people did it. But it's comparable in a way to how you would do it if you were just playing MCTS NetHack; you save the game and you try out what's happening and then you go back to the position where you really were. And you could probably beat NetHack that way pretty easily, but you'd really only beat NetHack, you wouldn't learn anything about reinforcement learning at large.
And it's also really clear that for the community, and also for us, that would be considered cheating. I mean, really you should be developing agents that can, given a fresh game of NetHack, solve the game.
Lukas:
That's funny. I think I would be impressed... Yes, if you could see the random number generator and forecast ahead it would be much easier, but it still seems a little bit tricky. I feel like there's a fair amount of long range planning that you need to do. I've actually never won the game, so I don't even know. But I feel like even if I could see ahead, it might be hard for me to beat the game.
Heinrich:
It's still going to be super hard for a learning-from-scratch reinforcement learning algorithm. But what these guys did is that you basically can get infinite wishes; that's a thing in NetHack. In certain situations there's a wish, and then you can wish for any object and you get it. And if you can force the RNG to always give you a wish, you can get infinite amounts of wishes and you can always make this mini...
When I played NetHack when I was very young, I did this thing called save scumming, which you're not supposed to do; where you save the game and then you copy the save file, and then when you die, you go back to that point in time. And what you do with that, from a scientific perspective, is you force a really unlikely trajectory.
All the games where you died and you didn't like it, you threw them out, and you go into this more and more unlikely space, and at some point you really dodged all the bullets, but the game would just kill you a thousand times if you tried to repeat it. And I think this is what's likely to happen when you can force your RNG to be in a specific state: you produce these extremely unlikely trajectories of the game.
Lukas:
When you take the sort of basic reinforcement learning algorithm from Go, or just sort of a vanilla reinforcement learning algorithm, and then you train it on NetHack, what happens? What does the character do?
Tim:
That's exactly the thing that we wanted to see. I mean, first of all, you couldn't use MCTS from Go just because you don't have that environment transition model; you don't know what happens at the next time step given the current time step and an action-
Lukas:
Actually, wait. Sorry. I need to step even back one step further. What are you actually even optimized for? I mean, in Go it's so clear that you're trying to win, but I don't think that makes sense here.
Heinrich:
That's a great question.
Tim:
It's a really great question. So ideally, we want to have agents that can win NetHack. And the way to win NetHack is to ascend to demigod by offering the Amulet of Yendor to your in-game deity. But the problem is that that's a really sparse reward, right?
Lukas:
Yeah.
Tim:
It's like you have to solve it before you get any reward, so that doesn't work. Then there's lots of techniques right now for providing agents with intrinsic motivation. I mean, that's what basically keeps people like you and me playing NetHack even though we haven't won NetHack yet; we're just curious about finding new quirks and new interesting situations in NetHack.
But what we basically did is we have a reinforcement learning agent that is trying to optimize for in-game score, and that comes with all kinds of caveats actually, because you can try to maximize the in-game score by doing all kinds of things that are unrelated to actually winning the game. So for instance, you get score for killing monsters, you get score for descending deeper down into the dungeon, but that really doesn't help you to understand that at some point you have to go back up again, just to give an example.
Also, when people are really good, meaning when they already know how to play NetHack really well and they have solved NetHack, they start to give themselves all kinds of interesting challenges; and one is actually to solve NetHack while minimizing the score. So you can also do that. So it's not really a very good reward function, in a sense, towards the goal of solving NetHack. But I think it's still a really good proxy for now in order to compare how well different models or different agents do.
So I think for now we're happy with that kind of setup, because we are still in a very early stage, or the community I guess as a whole is in a very early stage, when it comes to making progress on NetHack. But I think eventually at some point we'll have to refine that a bit, and the winning condition is actually winning the game.
Lukas:
Got it. So you're optimizing for score?
Heinrich:
You can also optimize for gold or dungeon depth or these things, but typically you do try to optimize for score.
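To make that concrete: the open-source NetHack Learning Environment exposes these reward choices as separate Gym tasks. The snippet below is a minimal sketch assuming the `nle` package, its score-based task id (`NetHackScore-v0`), and the pre-0.26 Gym step/reset API; exact task names may vary across versions.

```python
# Minimal random-agent rollout on NLE's score-based task (a sketch; assumes
# `pip install nle` and the older Gym API, and that the task id exists).
import gym
import nle  # noqa: F401  # importing nle registers the NetHack tasks with Gym

env = gym.make("NetHackScore-v0")  # reward here is the change in in-game score
obs = env.reset()
done, total_reward, steps = False, 0.0, 0
while not done and steps < 1000:
    action = env.action_space.sample()          # random policy, just to exercise the env
    obs, reward, done, info = env.step(action)
    total_reward += reward
    steps += 1
print(f"episode ended after {steps} steps with return {total_reward}")
```

Swapping in a gold- or depth-based variant, where available, only changes the environment id.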
Lukas:
Okay. So what happens when you put a vanilla agent in there?
Tim:
So what happens is quite interesting. So first of all, we thought when we started this project that a vanilla agent wouldn't really be doing anything in NetHack, it's just so complicated. Just learning to navigate from the first dungeon level to the next dungeon level is already hard, because there are all kinds of situations where you are in a room where there might not be any doors and you have to search along the walls to find a secret door, which is actually quite tricky to learn. Then you might find doors but they might be locked and you don't have any key around, so you have to actually kick in the door to even make it to the next dungeon level.
And we thought this is really hard for reinforcement learning agents to learn, because there's no reward attached to kicking in the door. Actually, it turns out that if you kick a wall, you hurt yourself and you might die, so that actually gives you negative reward or at least terminates the episode. But what turns out, and this is really interesting, is that if you train in these procedurally generated environments, what happens is that occasionally there's an instance of this whole problem generated that is really simple.
The staircase down might be just in the room next to you and the corridor might already be visible. So from your starting position, you might already see where the staircase down is. So your agent, even when just randomly exploring, might just bump into that staircase down and go downstairs and get a reward. So this is fascinating because it means with these procedurally generated environments, if you train for quite a number of episodes, there will be episodes generated that are quite simple, where the agent actually can learn to acquire certain skills to then also make progress on the harder ones. So this is one thing that we saw.
So our agents right now, just by optimizing for score, average at a score of I think 750-ish roughly, which is not bad if you are new to NetHack. So if I take a random computer scientist in the lab and ask them to learn about NetHack and play NetHack, I think it takes them a good fair amount of time to reach 750 on average as a score.
I think the maximum score we've seen so far is maybe something like 4,000 or 5,000. They descend down to dungeon level five or six on average. But we also see individual agents sometimes, luckily, going down even to dungeon level 15. And we see agents killing a lot of monsters on the way, because that gives them immediate reward. We see them passing by landmarks like the Oracle or even Minetown.
Tim:
So that was actually quite surprising to us, that the vanilla approach can already make quite steady progress on NetHack. So that's quite encouraging, I think, for then building up all the extensions and more sophisticated models.
Lukas:
Well, that sounds like a basic model then; you don't have to tweak it at all to get it to that level?
Tim:
Yeah, it's a very straightforward model. I mean, the only thing that we do is that we have basically a convolutional network that encodes the entire dungeon level that's visible so far. We have another convolutional network that's centered on a seven-by-seven crop around the agent; so that gives it basically some inductive bias that the things that are close to the agent are more important than, let's say, things that are very far apart.
We have another feature representation based on the agent's statistics. And then all of that is mapped down to a lower-dimensional representation that's fed into a recurrent policy parametrized by an LSTM, and then you get the action distribution out of that. So it's really nothing fancy at this point.
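As a rough illustration of that architecture (one CNN over the visible dungeon, one over an egocentric crop, an MLP over the agent's stats, fused and fed to an LSTM policy), here is a hedged PyTorch sketch. All layer sizes, the glyph vocabulary size, and the stats dimension are illustrative assumptions, not the released baseline.

```python
import torch
import torch.nn as nn

def make_cnn(in_channels):
    # Small conv stack pooled to a fixed 4x4 grid so map and crop sizes don't matter.
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, 3, padding=1), nn.ELU(),
        nn.Conv2d(32, 32, 3, padding=1), nn.ELU(),
        nn.AdaptiveAvgPool2d(4), nn.Flatten(),
    )

class NetHackBaselineSketch(nn.Module):
    """Toy version of the baseline described above; all sizes are illustrative."""

    def __init__(self, num_actions, glyph_embed=32, stats_dim=25, hidden=256):
        super().__init__()
        self.glyphs = nn.Embedding(6000, glyph_embed)   # ~number of NetHack glyphs (assumed)
        self.map_cnn = make_cnn(glyph_embed)            # whole visible dungeon level
        self.crop_cnn = make_cnn(glyph_embed)           # 7x7 egocentric crop around the agent
        self.stats_mlp = nn.Sequential(nn.Linear(stats_dim, 64), nn.ELU())
        self.fuse = nn.Linear(2 * 32 * 16 + 64, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.policy = nn.Linear(hidden, num_actions)    # action distribution logits
        self.baseline = nn.Linear(hidden, 1)            # value estimate for the critic

    def forward(self, dungeon, crop, stats, lstm_state=None):
        # dungeon: [B, H, W] glyph ids, crop: [B, 7, 7] glyph ids, stats: [B, stats_dim]
        d = self.glyphs(dungeon).permute(0, 3, 1, 2)
        c = self.glyphs(crop).permute(0, 3, 1, 2)
        x = torch.cat([self.map_cnn(d), self.crop_cnn(c), self.stats_mlp(stats)], dim=-1)
        x = torch.relu(self.fuse(x)).unsqueeze(1)       # add a time dimension of length 1
        x, lstm_state = self.lstm(x, lstm_state)
        x = x.squeeze(1)
        return self.policy(x), self.baseline(x), lstm_state
```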
Heinrich:
Maybe we should have mentioned that it also does some bad things. If you optimize for score, for instance, it quickly notices that it has this pet with it in the beginning. And if the pet kills an enemy or a monster, then you don't get the score. So what it learns, at some point of training, is it starts killing its own pet, which is really bad.
Lukas:
Oh no, that's so bad.
Heinrich:
But it will do that. And the interesting thing is, it starts playing random games, then it starts killing the pet along the training. But then if you train for longer, it stops killing the pet, because it notices that killing the pet actually makes the in-game NetHack deity mad at you, and bad things happen. So it will stop doing this after a while. It's kind of an interesting behavior.
Lukas:
That's really interesting.
Heinrich:
Also, I think we should mention that, from what Tim said just now, if you know the game of NetHack you'll notice that we don't actually use all the inputs yet. So NetHack has this status bar and the stats of your strength and so on, it has the dungeon. But it also has this message line and it has these little in-game windows that can pop up, like your inventory can pop up and other things can pop up. And it's actually a research challenge how to make use of all of this.
And the other question is, what's the action space? A human can just press capital S in NetHack and save the game and exit, and we don't actually want our agents to be able to do that. So you cannot give it access to the full keyboard as it were, and typically what we do is we restrict the action set.
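A common way to implement that kind of restriction is a standard Gym action wrapper that maps a small discrete action set onto the environment's full one. The sketch below is a hypothetical illustration (the index set is made up), not NLE's own built-in mechanism.

```python
import gym

class RestrictedActions(gym.ActionWrapper):
    """Expose only a hand-picked subset of the environment's full action set.

    The indices passed in are placeholders for whatever the underlying
    environment uses for the moves you want to allow (compass moves,
    search, kick, eat, ...); this is an illustration, not NLE's own API.
    """

    def __init__(self, env, allowed_indices):
        super().__init__(env)
        self.allowed = list(allowed_indices)
        self.action_space = gym.spaces.Discrete(len(self.allowed))

    def action(self, act):
        # Translate the agent's reduced action id into the full action id.
        return self.allowed[act]

# Hypothetical usage: keep only the first nine actions of the wrapped environment.
# env = RestrictedActions(gym.make("NetHackScore-v0"), allowed_indices=range(9))
```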
Lukas:
Will the agent know its own inventory? Could it pick up some food and eat it later?
Heinrich:
Kind of, yes, but we don't have a full solution for that yet, because we would need to feed in the inventory as a constant observation and we don't do that presently. It is hard to exclude the agent from doing this because different keys on the keyboard mean different things in different situations of the game. And in some situations, if you enable the eat action then you can eat some stuff, but maybe only with those keys that are already enabled for the game. But it gets a little bit technical right now. Our agent can eat certain things in the inventory if it has the right letter, but not other things, for instance.
Tim:
Also, I think maybe it's worth emphasizing that right now we've been spending most of our time just building the NetHack Learning Environment, where you actually do have, if you want, access to the inventory observation. You can, if you want, use the entire keyboard as your action space. So that's out there for everybody to use if they want to, and we hope that lots of researchers pick this up and come up with all kinds of interesting solutions that make progress on NetHack.
And then on top of that we have this really basic agent implementation that we mentioned here; we'll release that as well so that people can piggyback on it. But obviously, there are a lot of open research questions of how to make best use of all these observations that come from different modalities, as well as how to really deal with this really large action space.
One thing that I find super exciting is the fact that we as humans have all kinds of prior knowledge. When you play NetHack, although you've never heard about that game, and you bump into a door and you have, let's say, 170 actions that you could apply, like trying to drink the door or trying to sit on the door, you just don't do that. You won't even try this out; you know, I can try to open this maybe if I have a key, or if I don't, well, there's also this kick action, so maybe let me try to kick in the door.
So this fact that we as humans are so amazing at using our prior knowledge, our world knowledge, our common sense knowledge to then really efficiently explore in these environments is absolutely fascinating to me. And that's why I also really like NetHack as a test bed for artificial intelligence, because I think ultimately we should have agents that are capable of transferring such domain knowledge from other sources to then be really efficient in these hard simulated environments.
Heinrich:
There's a concept in the NetHack community called source diving, where you look at the source code of NetHack and try to figure out how the game dynamics work. And ideally our agents should be able to do that. Our agents should look at the source code and be able to figure out how this game will behave generally in certain actions and then just do the right thing. That would be the perfect research agenda for NetHack.
Tim:
I feel like on top of that, there's this really amazing community-created natural language resource, which is the NetHack Wiki. So almost everybody I know of who learned to play NetHack learned by also looking up things on the NetHack Wiki. As you mentioned, you started playing NetHack when you didn't have any internet connection, so you couldn't look at any of these kinds of spoilers. That makes it almost impossible, I think, to make progress on NetHack.
And even with the NetHack Wiki, it's really hard. So people sometimes play NetHack for 20 years before they first win the game. But this kind of resource is amazing. It's 3,000 wiki pages explaining how certain entities, items, monsters work. And one direction that is really exciting to me, and that's not really very different from what Heiner just described of directly looking at the source code: what if we had agents capable of encoding information from the NetHack Wiki and using that to, for instance, explore more efficiently or avoid certain really stupid deaths? And yeah, just generally using that prior domain knowledge to be much more sample efficient and generalize better.
Lukas:
It's funny, actually, I think it's kind of a different game. In prep for this interview, I started playing NetHack a little bit again, and I kind of couldn't believe that I tolerated this game without the internet. It's just such a frustrating game with so little guidance. And I was reading your paper on reinforcement learning where you're talking about building a system to optimize for learning... I forget how you put it, but sort of optimizing for modifying the state space of the algorithm. And then I was thinking of my daughter, who's clearly doing that.
So she's nine months old and I've just been watching her a lot, and she clearly explores her environment in a way that's just totally focused on whatever is novel. And there's no question that she's completely wired for that: if I show her a new toy, she loses it, and anything that seems to defy her belief about the laws of physics blows her mind. So clearly she's doing that. And then I was wondering if maybe myself as a child, I was kind of more willing to do, or kind of more enjoyed, the exploratory months necessary for figuring out NetHack.
Tim:
Yeah. That's a perfect remark. In fact, some of the research that we're doing is really centered around how we can design agents that are intrinsically motivated to learn in an environment. Because again, in NetHack, any reward function that we come up with is not going to be great. The actual thing we want to optimize for is solving the game, and there's just not any reward function, I think, that can really guide an agent step by step towards that.
And I have two daughters and in fact, my youngest daughter as well, at some point was playing with a toy kitchen and she was just opening and closing the door until at some point she had even squeezed her finger in the door. She was crying. It was clearly something really bad. She was actually in pain. She was crying for a minute and then she was continuing closing and opening the door until it became boring.
So there's this fact that we as humans are just setting ourselves goals when we are in an environment. We get bored and then we think, "Oh, what happens if I try this or that?" And then we see, can we actually control this? Are we empowered to have control over what we want to do? Are we able to actually predict what's going to happen next? And if not, then maybe that's really interesting, or maybe it's noise, maybe it's just the environment being completely stochastic and there's just nothing I can control.
So how do we design agents that can do this as well? I think that's a question that's super exciting to me, specifically in the context of NetHack, because it has this stochasticity. It has this humongous, I guess, internal mechanism that governs the state transitions. So I think this will lead to lots of quite interesting research.
Heinrich:
In a sense NetHack is really a hard case there. There's almost no human who plays NetHack unspoiled. I mean, typically people don't have a good reason to do that, because they need to find out about NetHack first. But for the few people who really were in the situation to try NetHack without any spoilers, it takes decades. You die so many deaths and you don't even know what to do. You don't even know what the exact goal of the game is. The game kind of tells you, if you read enough Oracles, and there's also a thing called rumors in the game where you can read up on what you're supposed to do, but there are also wrong ones. And if you're unlucky, you get the wrong ones that mislead you.
So there's almost no way to find out how to even beat the game, let alone get around all the obstacles, if you don't spoil yourself. And we would like our computers to do that. But I want to mention another thing Tim was saying: there's no reward that leads you to beating the game. That is true. But what there is, is recorded games in the NetHack community. We could just look at what humans do and try to imitate this. Have all of us play NetHack, which we do in our lab a lot, and then try to train an agent that predicts human actions and then go from there. That might be one option.
Lukas:
I was going to ask you about that actually, because I remember the first version of the successful Go algorithm was trained on expert games. Have you tried to train an algorithm that way? I mean, I guess even an amateur NetHack player would probably... you could imagine that helps the algorithm learn some strategy, right?
Heinrich:
So we're definitely thinking about doing that. The problem is getting the data; just getting a few games isn't enough for the methods that we have. We need enormous amounts of data, and there's no easy way to produce it unless we pay someone to play NetHack all day, and even then they'd have to play for a long time. Now interestingly, the NetHack community actually does have recorded games, at alt.org for instance, but unfortunately they basically only record the outcome of the game, like a video stream of what the game shows; they don't record the actions that were put in by the players. And that's a research question by itself, how to make use of this kind of data. But yeah, it's certainly something that we are thinking about.
Lukas:
Has anyone built kind of a rule-based system that can beat NetHack? That seems like something someone would try at some point.
Heinrich:
People have tried, but I don't think they were super successful. I think there was one system that ascended in maybe 10% of cases, or maybe Tim can add the details on that.
Tim:
Yeah. So if I remember vaguely, there are cases of hard-coded bots that ascended prior versions of NetHack, where as far as I remember they used certain exploits in the game. There's something called pudding farming where you can, I think, get a lot of items or whatnot, and then it makes the game much easier. But these exploits are not in there anymore in the most current versions of NetHack. So all of these bots that were handcrafted some time ago won't work right now.
Also, I think, ideally you want to have systems that are able to ascend, meaning win the game, with all kinds of character combinations. I mean, you have different roles in NetHack, races and gender and whatnot. So these bots, as far as I remember, were always quite specialized for one specific role in NetHack. But ideally we want to have agents, similar to humans, that can in fact win the game with all kinds of starting conditions.
Lukas:
So could you maybe describe the paper that I sort of alluded to in a little more detail? I think it's the ICML paper on exploration and reinforcement learning strategies, and then maybe sort of say what the results were.
Tim:
I guess you were referring to the ICLR paper.
Lukas:
Oh, ICLR sorry.
Tim:
Yeah, no worries. So first of all, that was a paper that was not done on NetHack. That was at a time when the NetHack Learning Environment didn't exist yet. This is a paper done by Roberta Raileanu. She's a Ph.D. student at New York University and she was interning with us at Facebook AI Research in London. And she has done a really good job at investigating the current limits of these intrinsic motivation mechanisms for reinforcement learning.
So maybe just to give a bit more context, one really open challenge in reinforcement learning is how you learn in environments where the reward that you get from the environment is extremely sparse. Reinforcement learning works amazingly if you get a very dense reward function. That means at many steps in the episode, you actually get a reward from the environment. But if your reward only comes at the very end and your episode is quite long, then it's really hard to learn from that.
So what people have been doing in the past is developing all kinds of mechanisms that provide the agent with reward that's not given by the environment, but that is basically given to the agent intrinsically. And one such thing could be how well the agent is predicting the next time step given the current state and action, so you could use that.
If your agent makes a big prediction error in terms of, given the current state and the next action, what the next state is going to be, then we reward the agent. The problem with that is that there's this noisy TV problem, where in your environment there's some source of stochasticity, let's say a television that just shows white noise.
So every prediction that you make as an agent will be wrong, because you can't predict what's going to be on the next screen. So you just reward the agent continuously for that. And that means that the noisy TV becomes an attractor for the agent. So the agent will just stand in front of the noisy TV all day without actually exploring the environment.
So what Roberta was doing is building on top of work that calculates an intrinsic reward based on a forward model, trying to predict the next state, but also, given the representation of the next state and the representation of the current state, trying to predict the action that led to that next state. So that's an inverse model.
And what she basically figured out is how we can make sure that the agent's internal representation of the state is only encoding what the agent can actually control in the environment. So if there's a noisy TV and the agent over time learns that its actions don't have any effect on the noisy TV, then it would just ignore that source of stochasticity with regard to providing intrinsic motivation to the agent.
And that led, at the time, to state-of-the-art results on quite hard exploration problems in MiniGrid, again a grid world a bit like NetHack, just many orders of magnitude simpler, but still really hard for contemporary reinforcement learning approaches. So that was that paper.
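Very roughly, that idea can be sketched as follows. This is a simplified illustration of an impact-driven intrinsic reward with forward and inverse models, under assumed shapes and losses; the paper's actual formulation adds further details (for example, discounting by episodic visitation counts).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntrinsicReward(nn.Module):
    """Simplified sketch of an impact-driven intrinsic reward.

    An encoder phi is trained only through a forward model (predict phi(s')
    from phi(s) and a) and an inverse model (predict a from phi(s), phi(s')),
    so it learns to represent what the agent can control. The intrinsic
    reward is the change in that representation; uncontrollable noise
    (the "noisy TV") ends up ignored because it never helps either model.
    """

    def __init__(self, obs_dim, num_actions, embed_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                     nn.Linear(256, embed_dim))
        self.forward_model = nn.Sequential(
            nn.Linear(embed_dim + num_actions, 256), nn.ReLU(),
            nn.Linear(256, embed_dim))
        self.inverse_model = nn.Sequential(
            nn.Linear(2 * embed_dim, 256), nn.ReLU(),
            nn.Linear(256, num_actions))
        self.num_actions = num_actions

    def forward(self, obs, next_obs, action):
        # obs, next_obs: [B, obs_dim] floats; action: [B] long action indices
        phi, phi_next = self.encoder(obs), self.encoder(next_obs)
        a_onehot = F.one_hot(action, self.num_actions).float()
        pred_next = self.forward_model(torch.cat([phi, a_onehot], dim=-1))
        pred_action = self.inverse_model(torch.cat([phi, phi_next], dim=-1))
        forward_loss = F.mse_loss(pred_next, phi_next.detach())
        inverse_loss = F.cross_entropy(pred_action, action)
        # Intrinsic reward: how much the controllable representation changed.
        intrinsic = (phi_next - phi).norm(dim=-1).detach()
        return intrinsic, forward_loss + inverse_loss
```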
Lukas:
I'm not super familiar with the literature, so let me see if I understood it. Maybe I'll channel the audience here. So it sounds like there's sort of a standard pattern of trying to actually go to parts of the environment where you can't predict what will happen next. And I thought you were going to say it's wanting to optimize for being able to predict the next step, but that's showing my supervised learning bias, where you would probably want to optimize for good predictions; but you're actually kind of trying to go to places where you can't predict the next step, which makes sense because more learning would happen.
Tim:
Yeah. So I mean, I hope I get this right, because it has been some time ago, but basically you should be rewarding yourself if you find novel mechanisms in the environment that you can control. But you shouldn't be rewarding yourself for novel observations in the environment that you can't control. Because if there's a noisy TV, you shouldn't be caring about that, otherwise you'll be standing in front of that TV for eternity.
But yeah, you're right, there's always this kind of tension between training the agent to get better at doing whatever it's doing in the environment, so that will also lead to better forward predictions, but at the same time also rewarding the agent whenever it encounters a mechanism that it can control and that also leads to novel observations.
Now, the problem is that another common approach is to actually count how often the agent observes a specific state. And that has been doing really well, for instance, in these Atari games where every time you play the game it's the same, but in procedurally generated games like NetHack, that won't work. It's just so incredibly unlikely that you will ever see the same state twice that counting them doesn't make any sense.
Heinrich:
So basically, if you have a noisy TV and you can change the channel, we don't really know what to do with it yet. And honestly, that's how humans behave as well. So I think we're pretty close to AGI there.
Lukas:
Yeah. No, I mean, it's funny. You alluded to this a little bit in the paper, but I was thinking some of the most joy I've felt in NetHack... I really remember when I realized that you have a throw option, and mostly you use that to throw weapons and the game kind of guides you in that way, but you can actually throw food at animals and turn them into pets. And it's this incredible joy of realizing this surprising thing that you can do.
Lukas:
Clearly there's a reward function, at least in my brain, for kind of discovering something new. And in your paper, you kind of alluded to some of this coming from education literature, or early psychology literature. Did you look at any of that when you were thinking about this?
Tim:
I mean, we have a paper together with Josh Tenenbaum, who is, I think, really leading in that area at MIT. I have to say, I'm not very familiar with that literature; I mean, that's the honest answer to that. But the thing that you just mentioned, that you know you can throw not just weapons but, as a human, you can also throw food around, you can throw basically anything around...
And then realizing that actually in NetHack, the developers of NetHack thought of everything, that you can actually throw food around, that was this revelation to me. I mean, I have to say, I'm not an expert NetHack player, and in our entire team, Heiner is the only one who actually ascended in NetHack.
So I had this revelation the other day when I was playing NetHack. I was always encountering graves. And I was like, "Okay, you go over this grave and you get some interesting message that's engraved on the stone. Okay, fine. But what do you actually do with graves?" I mean, there didn't seem to be any use to it.
And then the other day, I thought at some point... there are actually pick-axes in NetHack, so what if I dig up whatever's lying in that grave? And there's actually something in that grave. I mean, there's definitely a corpse, but there might also be items in there. Again, like for you, it was so interesting for me to see that my kind of prior knowledge about the world also applied within NetHack, although it's this terminal-based game. So that's, again, why I believe NetHack is such an amazing resource for artificial intelligence research.
Lukas:
Okay. So we've probably driven away anyone with any kind of practical mindset, but this is supposed to be for people practicing machine learning for real-world applications. I mean, where do you think reinforcement learning goes? I feel like the knock on it right now is maybe that it's really just for these kinds of toy environments, like Atari games and Dota and NetHack. Is it being used for things that we would experience now, or is it on a path to being useful for things? Where do you think the applications are?
Tim:
Yeah. So first of all, I think it's not necessarily fair to say that the kind of research that's done in simulated environments is not done with real-world applications in mind. So it's very funny in that NetHack is this really old game, so it feels like a step back from more 3D, visually appealing games like Dota. But in fact, as we discussed now, NetHack has a lot of the properties of how you would also try to solve tasks in the real world.
If I try to fix my car engine and I have no idea how to do this, maybe I can look up information on Wikipedia. I mean, probably it's not going to work, but we are so good at using world knowledge, common sense knowledge, and also acquiring specific domain knowledge for solving tasks in the real world.
So in some sense, I feel like NetHack is even a step forward towards actually making progress on real-world tasks with reinforcement learning. Also, given the fact that it's procedurally generated, every time the observation will look different, similarly to the real world. Again, a count-based approach won't really help you that much, because the world will look different tomorrow. And at the same time, I think there are also more and more applications of reinforcement learning in the real world.
So for instance, we published a workshop paper on using reinforcement learning for learning to control internet traffic. So there are these handcrafted heuristics that people have been developing for decades, TCP protocols and whatnot, that govern congestion control and congestion window approaches and how I can maximize my throughput in a network; how can I make sure that I can send as many packets as possible without losing too many packets because of congestion from the other participants in the network.
And we are developing approaches that allow us to train reinforcement learning agents to automatically learn a good policy in terms of how many packets to send out per second so that they maximize that number. So there are definitely more and more applications of reinforcement learning in the real world. Also, advertising is, I think, an example. And so I think we'll see much more of that in the future.
Heinrich:
Yeah. I think computer systems, operating systems and so on, have all kinds of built-in heuristics that are often good, but perhaps not optimal. And reinforcement learning is one way to try to optimize these things. If you look at the Linux kernel... by the way, looking at the NetHack source code is a great gateway drug to becoming a kernel developer, it's basically a mini Unix in there.
But if you look at the Linux kernel, there's all kinds of heuristics and constants and wait times and so on. And potentially you could not just hard-code these things, but learn them on the fly. Of course, you have a complex system if you do that, and you may not want to do this at all times, but it's certainly an option and I think this is where the world is going.
I want to make one more comment about NetHack. We compared NetHack to Go early on, but I think the comparison I like more is StarCraft. So StarCraft II has famously been a challenge, and of course it's a multiplayer game, so it's different in that sense. But many of the challenges that StarCraft has are also in NetHack: a big observation space, complex environment dynamics, a big action space, and all these things that are technically hard.
But on top of that, to actually solve StarCraft you basically use up the energy of a small town. And to play NetHack, it's really cheap and you can do this in your university lab, on your home computer and so on. So that's one of the sales pitches for NetHack. As a reinforcement learning environment, NetHack is simple where it counts but hard where you want it to be. So it's fast, but hard.
And often we have it the other way around. Often, there are reinforcement learning environments that are complex but easy. So everything is 3D and finely rendered, but the actual policy you need to execute is left, left, right and you're done.
Lukas:
I guess, what makes for a hard reinforcement learning challenge? It seems to me that having to sort of save some state to use a lot later seems to be challenging. It sounds like you do have a good intuition for what games would be easy for reinforcement learning and what games would be hard.
Tim:
So the thing that you just mentioned, that's one: long-range dependencies. How do you memorize, or how do you remember, that maybe on the first level of NetHack you dropped a certain item that you need much later or whatnot? And actually, NetHack has these very long-range dependencies. Normal play of NetHack, if you succeed, is maybe on average 50,000 to 100,000 steps. There are expert players who can solve NetHack in 20,000 steps, but that's still an order of magnitude longer than, for instance, a normal game of StarCraft II, which goes on for 15 minutes but has only a few actions per second, so I think the average is around 2,000 steps. So long-range dependencies is one.
Then there's the question of exploration. How easy is it for the reinforcement learning agent to discover what it can do in the environment? How it can control things in the environment? How often does it bump into reward? Another question, I guess, is: do you have all the information that you need given in the environment itself in order to do well, or do you have to have a really strong prior based on, as I mentioned, common sense knowledge, world knowledge or domain-specific knowledge?
If you have a very large action space, that's really problematic for current approaches, but we as humans do well because we prune away lots of that action space. Can you easily plan ahead? Is your environment fully observable, or is it only partially observable so that you have to actually infer what's going on in the hidden parts of the environment? These things make games or environments hard or easy for reinforcement learning.
Lukas:
It's funny. As you were talking, I mean, did you guys notice how I tried to kind of steer this towards general topics but I wasn't able to? But since we're back in this NetHack topic, have you thought about my other favorite game, Kerbal Space Program? Are you fans at all? Have you played this game?
Tim:
I mean, I've seen that on stream, I haven't played it myself; but I think that's a really interesting example. Again, as I mentioned, I haven't played this, I've only watched the trailer. But the fact that we as humans can build mental models of what should work and what shouldn't work and then test them, I mean, that's I guess what you do in that game.
You have an idea of what might work out, in terms of a rocket that can fly, you build it, then you see it fails, and then you make modifications. Again, you plan in your head what kind of modifications you want to make, you make them, and then you see again. This kind of way of experimenting in an environment, I think, would probably be quite interesting as a reinforcement learning challenge. That said, I haven't played it myself. And I'm pretty sure current approaches would struggle a lot.
Lukas:
Can I ask you, is there anything, just practically, that changes when you're trying to train reinforcement learning algorithms, if you're used to more supervised learning algorithms? What's different about that kind of training?
Heinrich:
I think there are some engineering challenges to reinforcement learning. Basically, in reinforcement learning you can make it look like supervised learning, but the data comes from... you generate the data yourself. As opposed to just reading photos from disk, you generate the data yourself.
And this is actually what modern reinforcement learning systems like, say, IMPALA or various others do; they have this part of the system that produces the data and then a part of the system that learns on the data, and there are all kinds of engineering challenges around this: asynchronous processes, data communication and so on. But apart from that we use PyTorch, we use standard tools.
You have to have the compute. Typically, the games run on the CPU, so you have to have more CPUs, and the machine learning code runs on accelerators like GPUs. But once you have that in place, it looks pretty familiar. The models look familiar. They input a picture, or a game observation, and output probabilities of certain actions.
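The overall shape of that actor/learner split can be sketched like this. It's a toy illustration with a random policy and CartPole standing in for the game, using threads and the pre-0.26 Gym API rather than the processes, shared memory, and GPU batching a real system such as IMPALA or TorchBeast would use.

```python
import queue
import threading
import gym

# Toy actor/learner split: actors step their own copy of the environment and
# push rollouts onto a queue; a single learner consumes them.
ROLLOUT_LEN = 32
trajectory_queue = queue.Queue(maxsize=64)

def actor(actor_id: int, num_rollouts: int = 10):
    env = gym.make("CartPole-v1")                # each actor owns its own environment copy
    obs = env.reset()
    for _ in range(num_rollouts):
        rollout = []
        for _ in range(ROLLOUT_LEN):
            action = env.action_space.sample()   # in a real system: query the shared policy
            next_obs, reward, done, info = env.step(action)
            rollout.append((obs, action, reward, done))
            obs = env.reset() if done else next_obs
        trajectory_queue.put((actor_id, rollout))

def learner(num_batches: int = 40):
    for _ in range(num_batches):
        actor_id, rollout = trajectory_queue.get()   # in a real system: batch on the GPU
        total = sum(r for _, _, r, _ in rollout)     # and run a policy-gradient update here
        print(f"actor {actor_id}: rollout return {total:.1f}")

threads = [threading.Thread(target=actor, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
learner()
for t in threads:
    t.join()
```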
Tim:
So there's one additional thing I would want to mention, and that also relates, I think, to Weights & Biases, and it's that in reinforcement learning, generally your results have much higher variance. So you can train an agent once and then you train it another time and the results might actually look quite different. So you have to be careful in terms of how reliable your results are when you only train based on one run, basically.
That makes it interesting in terms of how you should plot these results in publications. I mean, ideally, you should be repeating your experiments multiple times, and you want to plot maybe the mean of the different runs, and you also want to indicate the variance to some extent. But I think in publications, we've seen all kinds of tricks of how people make results look better than they actually are.
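In practice that can be as simple as running several seeds and plotting the mean with a shaded band. Here is a minimal matplotlib sketch with placeholder curves; in a real experiment these arrays would come from your logged runs.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder learning curves: one array of mean episode return per seed.
curves = np.stack([np.cumsum(np.random.randn(200)) + 50 * np.log1p(np.arange(200))
                   for _ in range(5)])          # 5 seeds x 200 evaluation points

steps = np.arange(curves.shape[1])
mean, std = curves.mean(axis=0), curves.std(axis=0)

plt.plot(steps, mean, label="mean over 5 seeds")
plt.fill_between(steps, mean - std, mean + std, alpha=0.3, label="+/- 1 std")
plt.xlabel("training iterations")
plt.ylabel("episode return")
plt.legend()
plt.show()
```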
Lukas:
I mean, how do you even think about reproducibility of reinforcement learning results, if they're inherently stochastic?
Tim:
I think it's fine as long as you make sure you train with different initializations of your model multiple times. And then it really comes down to a question of how expensive it is to run the experiments. What we see right now in the field is that there's lots of interesting reinforcement learning results that come out of industry labs that have a lot of computational resources, and that makes it basically impossible for anyone outside, specifically in academia, to reproduce these results. Again, sorry to mention NetHack, but that was exactly the kind of motivation behind the environment: it's really complex, but at the same time it should be affordable for grad students, master's students, and whatnot, to actually do experiments with.
Lukas:
I hadn't thought about that, that's such a great point. But still, actually, you do need quite a lot of resources to even do NetHack. You were saying you built some kind of system, or you're using some kind of system, to train in parallel, right?
Heinrich:
Yeah. But you can run this on a single box with, say, two GPUs; you'll just wait a little bit longer. For NetHack, we don't currently use hundreds of GPUs in parallel; we could do that, but we just haven't invested the engineering hours to do that properly. But you can actually run this at home. I mean, you could even run this on your MacBook if you wanted to wait long enough and make life a little bit hard. It depends on what kind of neural networks you would apply to NetHack, but this is actually something you can do at home.
Tim:
And actually, I mean, you can even do this really well with just one GPU. Our implementation of our agents is based on TorchBeast, which again is based on IMPALA, and we have two versions of that; one is training based on one GPU, and we have one that's training using two GPUs. I mean, just with one GPU, you can do experiments, you can write papers on NetHack with one GPU; I'm quite certain of that.
Lukas:
Cool. And it's basically just playing the game over and over and over and then updating the model?
Heinrich:
Yeah. We have this line in our paper where we mention how many agents died in the process, and it's a large number. Probably, by now it has played far more games than the rest of mankind combined.
Lukas:
Have they really not found a flaw in NetHack that they can exploit? It's kind of amazing to me that there's not some tricky way that you can live forever or something.
Heinrich:
Well, our agents actually haven't explored that large a part of the game yet. We are really at the beginning of the research here. People have tried what are called tool-assisted speed runs with NetHack and have found exploits, some of those ones that Tim mentioned, pudding farming and so on. But the NetHack DevTeam kind of keeps track of that and removes these things one by one from the game. So NetHack by now is pretty resilient against these kinds of exploits.
Lukas:
Are you in communication with the NetHack DevTeam?
Heinrich:
We did reach out to them at some point and they were very kind.
Lukas:
That's great.
Tim:
So NetHack has been under development for over 30 years; as I kind of mentioned, there's been a lot of effort in removing all of these kinds of exploits.
Lukas:
Right. Okay. Sorry, just a really in-the-weeds question. Does the agent have a preference? You can kind of go down the normal levels where you meet the Oracle and stuff, or you can go down to Minetown. Does the agent kind of learn that one path is safer than the other? I always kind of wonder which I should go to first.
Tim:
It's a great question. That's exactly the kind of high-level planning that our agents right now are not capable of, so it's basically by chance. So sometimes they just follow the main dungeon to the Oracle, they even get to the Big Room or even further down and then at some point die, or they go into Minetown and die at some point. We haven't really seen agents being strategic about first making some progress in the main dungeon and then going back up to the fork to then go down to Minetown to get items and whatnot. But I mean, that's really one of the next milestones, I think, that we should get to.
Heinrich:
It's also because our agents have a really hard time remembering things long-term. To a first-order approximation, our agents optimize for the current situation without any regard for the past. So going down to Minetown: if you go down any staircase and you happen to enter the Gnomish Mines, which is a special dungeon branch in NetHack, the logical thing for you to do is to kill the monsters in the vicinity, not to go back up to where you were, where you already killed things. So if you optimize for really short-term things, that's how you end up playing, and that's what our agents do. That said, we have seen our agents go back upstairs, and we're not quite sure if this is just random chance or if this is something where it got incentivized to not play certain levels, but that's where we are.
Lukas:
All right. Well, I'm really excited to play with it, and I'm even more excited to play with your NetHack dev environment; I really want to give it a run myself. I always end with these two questions, and I kind of wonder how they'll work in this format. But do you have any kind of underrated aspects of reinforcement learning or machine learning that you think people should pay more attention to than they are right now?
Heinrich:
Have we mentioned really fast environments? [laughs]
Tim:
I think on top of that, in my view, people should be looking more into causality. I mean, it's something that I'm not very familiar with, but I think in terms of making progress as a community, we should be looking more into causal models, because essentially that's also what you are learning when you're playing NetHack over and over again. At some point you have some mental causal model in mind: "If I do this, then that happens," or at least with some probability something happens. And I think that's the only reasonable way we can go forward in terms of agents that really can systematically generalize to novel situations; you have to have that kind of abstract mental model in mind that you use for planning and for exploration and so on.
Heinrich:
One thing that bugs me a bit about research in machine learning at large is that we make these artificial distinctions between "this is engineering" and "this is research." If you want to fly to the moon, is that research or is it engineering? It's kind of both. And I think it's especially true in reinforcement learning, where the breakthroughs that we saw recently came to a large extent from engineering breakthroughs.
Lukas:
I totally agree with that. And that's actually a good segue into the last question that we always ask, which we usually frame as: what's the biggest challenge of machine learning in the real world? But I think maybe for you two, I'd be curious, what are the surprising engineering challenges of making reinforcement learning work that you wouldn't necessarily know as a grad student doing your first toy reinforcement learning project?
Heinrich:
I think, I mean, maybe we should make this clear: what we do when we're training reinforcement learning agents in modern approaches is we have dozens or hundreds of copies of the game running simultaneously, played by the same agent, and then something needs to ingest all of this information. So I'm not sure if people are aware that this is how it is.
People used to think of it like, this is the world and this is my agent, and my agent connects to the world and there's only one world, obviously. But things are just so much faster if you have a batch of worlds and you interact with a batch of experience. Although that is kind of bad news for all the comparisons to how humans learn and how real biological systems work.
Tim:
I think on top of that, I would encourage people to really look at what these agents, or what generally your machine learning model, is actually doing on the data. I think it's quite easy to try to chase some leaderboard numbers or try to chase better scores on NetHack without actually understanding what your agent is capable of or not capable of, and how that informs your modeling choices, modeling decisions, and generally your research or engineering work going forward.
Lukas:
And so one final question, mainly for Heinrich, I think. For someone like me who has been playing NetHack for almost three decades and never ascended, do you have any tips on how to improve my NetHack skills?
Heinrich:
I think there's one point in NetHack where you ask a special shopkeeper in Minetown and it tells you to slow down and think about it. You have as much time as you want for any action, since NetHack is turn-based. So I think this is the best approach: think clearly. But it's really not human. You see this bad dragon and you want to run away from it, but there's no need for speed in that sense in NetHack. Just thinking clearly about every step is the best approach.
Lukas:
Yet so hard to do.
Heinrich:
And read the spoilers.
Lukas:
Awesome. Thank you so much guys, that was super fun.
Heinrich:
Thank you.
Tim:
Likewise. Thank you so much for the invitation.