Why Experiment Tracking is Crucial to OpenAI

Carey Phelps, Product Lead

I had the opportunity to sit down with Peter Welinder from the robotics team at OpenAI. Here are some highlights from our conversation.

What are you working on?

I work on the robotics team at OpenAI. We try to build learning-based robots that can eventually do anything that a human should be able to do, although at this point they can do almost nothing that humans can do. I work on everything from figuring out the right algorithms to power these robots to building the equivalent of vision or sensory systems.

What’s the most challenging problem?

Iterating on physical robots is really hard because they break down. You get one trial and then some tendon in the robot springs and we have to repair it. Getting an accurate model of the real physical robot is also really hard, so to alleviate that we do a lot of our work in simulation. Since we’re trying to build something that really works, we always need to go back to the real robot, and that’s one of the hardest challenges.

It’s surprising to me that it is actually possible to train algorithms for learning based robots completely in simulation, and then we can just run them on a robot. We run an algorithm that has never seen the real world, and it controls the robot to do really really complicated tasks.

For example we want to have a robotic hand rotate an object around in various orientations. It's kind of hard to do that if you've never interacted with the real world, and that's essentially what our robots can do now. I frankly didn’t believe it was possible a year and a half ago.

How would you describe your work to someone nontechnical?

We’re trying to build a robot brain that could work with any robotic incarnation, whether it be a four legged robot or just a robotic hand sitting on a table. We’re building a very general robotic brain that can sense the world almost as we do, and is able to do any tasks that humans want to do.

It’s kind of programming computers such that they learn from the real world, or in our case a lot in simulator worlds. Just like children, there’s a learning process and you won’t get everything right on the first try. We’re programming robots to have this more human-like learning based behavior.

What inspired you to work on this problem?

If we build general purpose robots that can do anything humans can do, then there are endless positive things you can do. It would free up a lot of time for everybody to focus on what they really like doing. Everybody could work fewer hours because the things we have today could be a lot cheaper just because robots are doing them. Building general purpose robots would have an enormous positive impact on the world. I want to be part of figuring out how to do that.

What are some of the day to day challenges?

We are a team of people working together, and we need to divide up the work to try out new algorithms and iterate on them. There are a lot of problems around getting a shared state of our current progress. We need to be able to look back in history to see what worked and what didn't a few weeks ago and keep on being sure we improve what we're working on.

Working on a team on a research problem requires that you do a lot of communication and you have a lot of records of what you're doing. That's a big challenge in what we're working on.

How do you figure out if the model is improving?

We collect datasets all the time, and then we have benchmarks. We need to check that we're continuously improving against existing datasets and we're not regressing while we train on new datasets. Being able to do these kinds of comparisons continuously as we train new models— that's the way we make sure that we are making progress.
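As a rough illustration of this kind of continuous comparison, here is a minimal sketch. It is not OpenAI's actual tooling: the dataset names, error values, and the `tolerance` parameter are all hypothetical, and it assumes each model is summarized by one error rate per benchmark dataset.

```python
def find_regressions(baseline, candidate, tolerance=0.01):
    """Return the datasets where the candidate is worse than the baseline.

    Both arguments map a dataset name to an error rate (lower is better).
    A dataset counts as regressed when the candidate's error exceeds the
    baseline's by more than `tolerance`. Datasets that only one of the two
    models was evaluated on are skipped.
    """
    regressions = {}
    for dataset, base_err in baseline.items():
        new_err = candidate.get(dataset)
        if new_err is None:
            continue  # candidate was not evaluated on this dataset
        if new_err > base_err + tolerance:
            regressions[dataset] = (base_err, new_err)
    return regressions


# Hypothetical numbers: the new model improves on one benchmark,
# regresses on another, and adds a dataset the baseline never saw.
baseline = {"blocks_v1": 0.12, "blocks_v2": 0.20}
candidate = {"blocks_v1": 0.11, "blocks_v2": 0.26, "pens_v1": 0.30}
print(find_regressions(baseline, candidate))
```

Running the same check each time a new model is trained would flag a drop on an existing dataset even while results on new datasets improve.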

How do you visualize training?

There's a couple of reasons why you want to see a model through the duration of training. If it's performing really badly, you want to save your money and stop early. The systems we train are very big and they cost a lot of money to run, so it's very nice to be able to save money there. But also, the way a model kind of progresses in training can tell you something— does it flatten out, how fast does it train in the beginning— it can tell you about the algorithms you're using and the hyperparameters you're trying to tune. It's pretty common practice to look at those curves.
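The "does it flatten out" question can be turned into a simple automatic check. The sketch below is one common heuristic, not something described in the interview: the function name, `patience`, and `min_delta` are assumptions chosen for illustration.

```python
def should_stop_early(losses, patience=3, min_delta=1e-3):
    """Return True when the loss curve has flattened out.

    `losses` is the list of loss values logged so far, oldest first.
    The run is considered stalled when the best loss in the last
    `patience` steps is not at least `min_delta` better than the best
    loss seen before those steps.
    """
    if len(losses) <= patience:
        return False  # not enough history to judge yet
    best_before = min(losses[:-patience])
    best_recent = min(losses[-patience:])
    return best_recent > best_before - min_delta


improving = [1.0, 0.8, 0.6, 0.5, 0.4]
flat = [1.0, 0.5, 0.5, 0.5, 0.5]
print(should_stop_early(improving))  # still improving, keep training
print(should_stop_early(flat))       # plateaued, candidate for stopping
```

A check like this only covers the cost-saving case. Reading the shape of the curve to learn about algorithms and hyperparameters still needs a person looking at the plot.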

I use Weights & Biases in a team context, where it's very useful to be able to see all the experiments your colleagues are running. You don't want to replicate work that's already been done, and you also want to see the current state of affairs: how well are you doing against your baseline benchmarks? It's also really good to be able to dive deeper, in a very visual way, into aspects of training, such as how well a particular run did on the metrics you care about. Having easy access to that is super important.

How do you use Weights & Biases in your team?

We use the dashboard to see what's going on right now, which models are being trained, and the progress of everybody in the team. It's also a way of keeping our shared history of all the things we've done. We're able to easily go back and compare against all experiments or download a model we've trained a couple of weeks back and deploy it on some new dataset. It's kind of a shared log book for the team.

We use Weights & Biases with continuous integration a lot. It's extremely important to see that your model doesn't regress. When we have 10 to 20 people working with our code base, at any point someone could commit a change and break something. Weights & Biases has been a very simple way for us to have peace of mind. Things are going down and to the right in terms of error rates, and they don't suddenly blow up. The worst thing that can happen is that you find out after a few weeks that you have a regression and then you have two weeks of commits to go through and figure out what went wrong. You easily lose a week or two of work.
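When a regression does slip through, the search through two weeks of commits can at least be shortened with a binary search, the same idea behind `git bisect`. The sketch below is illustrative only: the commit names, error rates, and the 0.2 threshold are hypothetical, and it assumes the first commit is good, the last is bad, and the regression persists once it has been introduced.

```python
def first_bad_commit(commits, is_good):
    """Binary-search for the first commit at which `is_good` returns False.

    `commits` is ordered oldest to newest. Assumes commits[0] is good,
    commits[-1] is bad, and every commit after the first bad one is bad.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(commits[mid]):
            lo = mid
        else:
            hi = mid
    return commits[hi]


# Hypothetical error rates per commit; the regression lands at "c5".
error_rate = {"c0": 0.10, "c1": 0.10, "c2": 0.09, "c3": 0.09,
              "c4": 0.08, "c5": 0.31, "c6": 0.30, "c7": 0.32}
commits = list(error_rate)
print(first_bad_commit(commits, lambda c: error_rate[c] < 0.2))
```

In practice `is_good` would mean retraining or re-evaluating a model at that commit, which is expensive. That cost is why catching the regression at commit time is so much cheaper than searching for it afterwards.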

Before we started using Weights & Biases, everyone had their own little setup for getting results. Some people were using TensorBoard, and some were using their own homebrew visualization tool. Everything was very fragile. If I wanted to share results, the best I could usually hope for was a screenshot of my graph pasted in an email. Whenever someone needed something more, more often than not I’d have to rerun the experiment. So now we have a central place to have all of that information, and it's very easy for anybody to access that transparently and compare against each other's results.

My colleague Lillian, for example, can take whatever she has trained and compare that with what I trained, create a quick report, and I can download the model she trained. I don't have to ask her where it is— I can go in and look at other metrics very easily since I have all the raw data. It's reduced a lot of the overhead in communication to make us focused on what matters.

Comparing results in general is much faster when you have all the data in one place. We do this a lot in our workflows, comparing against old baselines and so on. We can keep old runs available and compare against them over and over again.

How do you use the Weights & Biases system metrics?

The main thing is making sure we fully utilize our machines and save money. It’s kind of the worst thing to see that we used like 10 percent of a GPU or 10 percent of a CPU, because it means that we could have run 10 times as many experiments. It's been an easy and transparent way to see our utilization and figure out where bottlenecks are so we can fix them and really optimize our system.
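The arithmetic behind that "10 times as many experiments" estimate is simple enough to sketch. The function below is a hypothetical helper, not part of any Weights & Biases API, and the sample values are made up. It assumes utilization is sampled as a percentage and that experiments could be packed onto the same hardware until utilization reaches 100 percent.

```python
def utilization_headroom(samples):
    """Return (mean utilization in percent, potential throughput multiplier).

    `samples` is a list of utilization readings between 0 and 100.
    The multiplier is the idealized factor by which more work could have
    run on the same hardware if it had been kept fully busy.
    """
    mean = sum(samples) / len(samples)
    if mean <= 0:
        raise ValueError("mean utilization must be positive")
    return mean, 100.0 / mean


# Hypothetical GPU utilization readings from one training run.
mean, factor = utilization_headroom([8, 12, 10])
print(f"mean utilization {mean:.0f}%, up to {factor:.0f}x more experiments")
```

The multiplier is an upper bound. Real bottlenecks such as data loading or memory limits usually mean the achievable gain is smaller.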

What is your vision for the project?

One of the things we've been working on in the project is getting a robotic hand to manipulate real objects, so you can put a block in the hand and we can rotate it to any orientation. This is a problem that had eluded the robotics community for decades. It might sound simple for a human, but making a robot do this is very hard. You have to trust me on that one. But it would be really cool if a robotic hand could do more things than rotate objects. Can we get robotic hands to do more of what humans can do, maybe even write letters and things like that? It would be amazing. So that's the direction we want to move towards, and as part of that we need to build general purpose systems that can understand their environment. The robot not only needs to understand how to control its hand, but also to understand, "What is it that somebody put in my hand?" So we're trying to build these general purpose systems.

How do we make progress towards this kind of more general purpose robotic system? It's mostly thinking, “What is the next hardest thing we can do?” You don't want to go from a robot being able to manipulate an object to saying, “Okay, now go build a rocket for me.” That's pretty hard. So we want to find something slightly simpler. Okay, so now you rotated a block, let’s see if you can rotate a pen, or maybe you can rotate a cup that I put in the hand. What if you put a robotic hand on an arm, then what could it do? You want to continuously make the tasks harder and the system more capable.

We’re still in the really early stages of what can be done in artificial intelligence and deep learning. There’s quite a lot we can still get out of them, many problems we still haven’t applied them to. There is a big opportunity in machine learning in general— you know we humans are learning based machines, that’s kind of the pursuit of where we want to go. The limit is really being able to do anything that humans can do. And I think while we're still far away from that, we will make steady progress towards that over the next few years or decades.
