Vladlen Koltun — The Power of Simulation and Abstraction
From legged locomotion to drones and autonomous driving, Vladlen explains how simulation and abstraction help us understand embodied intelligence.
Vladlen Koltun is the Chief Scientist for Intelligent Systems at Intel, where he leads an international lab of researchers working in machine learning, robotics, computer vision, computational science, and related areas.
0:00 Sneak peek and intro
1:20 "Intelligent Systems" vs "AI"
3:02 Legged locomotion
9:26 The power of simulation
14:32 Privileged learning
18:19 Drone acrobatics
20:19 Using abstraction to transfer simulations to reality
25:35 Sample Factory for reinforcement learning
34:30 What inspired CARLA and what keeps it going
41:43 The challenges of and for robotics
Note: Transcriptions are provided by a third-party service, and may contain some inaccuracies. Please submit any corrections to email@example.com. Thank you!
I wanted to understand how we train intelligent agents that have this kind of embodied intelligence that you see in us and other animals. Where we can walk through an environment gracefully, deliberately, we can get to where we want to go, we can engage with the environment, if we need to rearrange it, we rearrange it. We clearly act spatially intelligently, and by intelligently in an embodied way. And this seems very important to me. And I want to understand it, because I think this underlies other kinds of intelligence as well.
You're listening to Gradient Dissent, a show about machine learning in the real world. And I'm your host, Lukas Biewald.
Vladlen Koltun is the Chief Scientist for Intelligent Systems at Intel, where he runs a lab of researchers working on computer vision, robotics, and mapping simulations to reality. Today, we're going to talk about drones, four legged robots and a whole bunch of cool stuff.
All right, Vladlen, thanks so much for talking with us. I saw your title, it's somewhat evocative. It's the chief scientist for Intelligent Systems at Intel. Can you say a little bit about what the scope of that is? It sounds intriguing.
Yeah, I prefer the term Intelligent Systems to AI. AI is a very loaded term with a very long history, a lot of baggage. As you may remember, the term fell out of favor for a very long time, because AI over-promised and under-delivered in the 80s and 90s. And when I became active in the field, when I really learned quite a bit about AI, the term AI was not used by many of the most serious people in the field. People avoided the term artificial intelligence, people identified primarily as machine learning researchers. And that persisted into, I'd say, the mid 2010s actually. It's only very recently that the term AI became respectable again, and serious researchers on a large scale have started to identify themselves as artificial intelligence researchers.
I somehow find that term Intelligent Systems broader. First of all, because it doesn't have the word artificial. So if we're interested in Intelligent Systems, we clearly are interested in artificial intelligent systems, but also natural intelligent systems. We want to understand the nature of intelligence, we are concerned with intelligence, understanding it and producing it and using it in systems that we create. It's a more neutral term with less baggage, I like it. I don't mind AI, but somehow I'm more predisposed to Intelligent Systems.
Cool. I love it. And I always try to take the perspective of these as someone who knows about machine learning or Intelligent Systems, who maybe isn't an expert in your field, which will be super easy in this interview, because I know very little about robotics and a lot of the stuff that you've been working on. But I am very intrigued by it. And I think anyone kind of understands how cool this stuff is. So I'd love to ask you about some of the papers that I was looking at. I mean, one that kind of just stuck out to my, well, to myself now, but also my younger self is just like, unbelievably cool, was the paper that you wrote in quadruped locomotion, where you have like a walking robot, navigating terrain.
And I think what was maybe most evocative about it was, you say that you basically train them completely in simulation. And so then it's sort of like zero shot learning in new terrain. And I guess, could you say for someone like me actually, who's not an expert in the field, kind of, what's like hard about this? Like, just in general, and then kind of what did your paper offer that was sort of new to this challenge?
Yeah, legged locomotion is very hard, because you need to coordinate the actuation of many actuators. And there is one very visceral way to understand how hard it is. Which is to control an animated character with simple legs, where you need to actuate their different joints or their different muscles with different keys on the keyboard. And there are games like this, and you can try doing this even with just four joints. So try actuating four joints yourself, and it's basically impossible. It's just brutally brutally hard. It's this delicate dance where at the same time, in synchrony, different muscles need to fire just right and one is firing more and more strongly and the other needs to subside and this needs to be coordinated, this is a very precise trajectory in a very high dimensional space.
This is hard to learn, and if you look at human toddlers learning it, it takes them a good couple of years to learn it. This is even for human intelligence, which is awesome. And I use the term awesome here in this original meaning, I don't mean awesome, like a really good cup of coffee. I mean awesome, right? Even for this level of intelligence, it takes a couple of years of experience to get a hang of legged locomotion. So this is very, very hard, and we want our systems to discover this, to master this delicate dance, that as adult humans, we basically take for granted.
And you can look at basically the most successful, I would say attempt so far, which is Boston Dynamics. Which is a group of incredibly smart, incredibly dedicated, insightful engineers who're some of the best in the world at this, a large group, and it took them 30 years. It took them 30 years to really get it, to really design and tune legged locomotion controllers that are very robust. We did this in, depending how you count, I would say about two, three years, primarily with two graduate students. Now, these are amazing graduate students, these are really extraordinary graduate students. But still, the fact that we could do this in two, three years speaks to the power of the approach.
And the approach is essentially taking the system through a tremendous amount of experience in simulation, and having it do all the trying and falling in simulation. And then the key question after that is what happens when you learn in simulation and put the controller on the real robot, in reality, will it work? And there are a few ideas that make it work, and a few pleasant surprises where it worked better than we expected. One key idea that was introduced in our previous paper, the Science Robotics paper that we published a couple of years ago, is to empirically characterize the actuators that are used on the real robot. So you basically measure, you do system identification: you measure the dynamics model of each actuator empirically, by just perturbing the robot, actuating the actuator and just seeing what happens, seeing how the system responds.
And that means that you don't need to model complex motors with their delays and the electromechanical phenomena that happen in the actuators, you don't need to model that analytically. You can just fit a little neural network, a little function approximator, to what you see. Then you take this empirical actuator model into your simulated legged system, then you have the legged system walk around on simulated terrain. That's where the pleasant surprise comes, which is that we didn't have to model all the possible behaviors of simulated terrains and all the types of simulated terrains in simulation. We didn't have to model vegetation, we didn't have to model gravel, we didn't have to model crumbling, we didn't have to model snow and ice. Just with a few simple types of terrains, and aggressively randomized geometry of these terrains, we could teach the controller to be incredibly robust.
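Editor's sketch: the idea of fitting an empirical actuator model from logged commands and responses can be illustrated roughly as follows. This is not the code from the paper. It swaps the "little neural network" for a small linear model over recent commands, and the stand-in motor (`hidden_actuator`, with a made-up gain and delay), the function names, and all parameter values are assumptions chosen for illustration.

```python
import random

def hidden_actuator(commands, gain=0.8, delay=2):
    """Hypothetical stand-in for the real motor: a delayed, scaled copy of the command."""
    return [gain * commands[t - delay] if t >= delay else 0.0
            for t in range(len(commands))]

def fit_actuator_model(commands, responses, history=4, lr=0.05, epochs=50):
    """Fit weights w so that response[t] ~ sum_i w[i] * command[t - i], using SGD."""
    w = [0.0] * history
    for _ in range(epochs):
        for t in range(history - 1, len(commands)):
            window = [commands[t - i] for i in range(history)]
            error = sum(wi * xi for wi, xi in zip(w, window)) - responses[t]
            w = [wi - lr * error * xi for wi, xi in zip(w, window)]
    return w

# "Perturb the robot": send random commands and log what the actuator does.
random.seed(0)
commands = [random.uniform(-1.0, 1.0) for _ in range(500)]
responses = hidden_actuator(commands)

# The fitted model recovers the delay and gain without an analytic motor model.
weights = fit_actuator_model(commands, responses)
```

In this toy setting the fitted weights concentrate on the command from two steps earlier, so the delay and gain are learned from data rather than derived by hand. The fitted model would then stand in for the motor inside the simulator.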
And the amazing thing that we discovered, which is maybe the most interesting outcome of this work is that in the real world, the controller was robust to things it never really explicitly saw in simulation. Snow, vegetation, running water, soft yielding compliant terrain, sand, things that would be excruciatingly hard to model. Turns out, we didn't need to model them at all.
That's so cool. I guess we've talked to a whole bunch of people that work on different types of simulated data often just for the cost savings, right? Of being able to generate infinite amounts of data. And it seems like, if I could summarize what they seem to say, it's that you often benefit from still like a little bit of real world data in addition to the simulated data. But it sounds like in this case you didn't actually need it. Did it literally work like the first time you tried it or were there some tweaks that you had to make to the simulation to actually get it to bridge the gap between simulation and reality?
It worked shockingly well. And what helped a lot is that Joonho just kept going. And I love working with young researchers, young engineers, young scientists, because they do things that would seem crazy to me. And if you ask me to predict, I would say that's not going to work. But fortunately, often, they don't ask me and they just try things. And so we would just watch Joonho try things out, and things kept working. So the fact that you don't need to model these very complex physical behaviors, in the terrain, in the environment, this is an empirical finding, we basically discovered this, because Joonho tried it, and it worked. And then he kept doing it, and it kept working and it kept working remarkably well. So somehow, it was very good that he didn't ask me and others, "Is this a good idea? Should I try this?"
It seems like there's these obvious extensions that would be amazingly useful, like if you tried to do bipedal locomotion, and then making the robot, it's like usefully engaging with its world. Where does this line of inquiry get stuck? It seems so promising.
We're definitely pushing this along a number of avenues. I'm very interested in bipeds. And we do have a project with bipeds. We're also continuing to work with quadrupeds, we have multiple projects with quadrupeds and we're far from done with quadrupeds. There's definitely more, there's more to go. And then you mentioned interaction, you mentioned engaging with the world. And this is also a very interesting frontier and we have projects like this as well. So ultimately, you want not to just navigate through the world, you also want to interact with it more deliberately. Not just be robust and not fall and get to where you want to go. But after you get to where you want to go, you actually want to do something, maybe take something somewhere else or manipulate the environment in some way.
What physics simulator did you use? Is this something you built? Or did you use off-the-shelf?
This is a custom physics simulator built by Jemin Hwangbo, who led the first stage of that project. That's why I said by the way that it took three years, because I'm including that previous iteration that was done by Jemin that laid a lot of the groundwork, and a lot of the systems infrastructure we ended up using. So Jemin basically built a physics simulator from scratch, to be incredibly, incredibly efficient. So it's very easy for the simulation times to get out of hand. And if you're not careful, you start looking at training times on the order of a week or more.
And I've seen this happen when people just code in Python and take off-the-shelf components, they get hit with so much overhead and so much communication. And then I tell them that they can get one or two or three orders of magnitude if they do it themselves and sometimes it's really necessary. And so the debug, our debug cycle was a couple of hours in this project, so that helped.
That's incredible. And that seems like such an undertaking to validate a physics simulator from scratch. Was it somehow constrained to make it a more tractable problem?
So I think what helped is that Jemin did not build a physics simulator for this project. It's not that he started this project, and then he said, "I need to pause the research for about a year to build a custom high performance physics simulator, and then I'll get to do what I want to do."
He built it up during his PhD, during many prior publications, and it's a hobby project just like every self-respecting computer graphics student has a custom rendering engine that they're maintaining. So in this area, a number of people have custom physics engines that they're maintaining just because they're frustrated with anything they get off the shelf, because it's not custom enough, it doesn't provide the interfaces they want, it doesn't provide the customizability that they want.
One of the things you've mentioned in the paper, or one of the papers, was using privileged learning as a learning strategy. Just something I hadn't heard of. Could you describe what that is?
Yeah. It's an incredibly powerful approach that we've been using in multiple projects. And it splits the training process into two stages. In the first stage you train a sensory motor agent that has access to privileged information. That's usually the ground truth state of the agent, for example, where it is, exactly what its configuration is. So for example, for an autonomous car, it would be absolutely precise ground truth position in the world down to the millimeter. And also the ground truth configuration of the environment, everything that matters in the environment. The geometric layout of the environment, the positions of the other participants, the other agents in the environment and maybe even how they're moving and where they're going and why.
So you get this God's eye view into the world, the ground truth configuration of everything. And this is actually a much easier learning problem, you basically don't need to learn to perceive the world through incomplete and noisy sensors, you just need to learn to act. So the teacher, this first agent, we call it the teacher, the privileged teacher, it just learns to act. Then you get this agent, this teacher, that always knows what to do, it always knows how to act very, very effectively. And then this teacher trains the student that has no access to privileged information. The student operates only on real sensors that you would have access to in the real world, noisy, incomplete sensors, maybe cameras, IMU, only onboard sensors, only onboard computation.
But the student can always query the teacher and ask "What would you do? What is the right thing to do? What would you do in this configuration? What would you do in this configuration?" So the learning problem is, again, easier, because the student just needs to learn to perceive the environment. It essentially has a supervised learning problem now, because in any configuration it finds itself, the teacher can tell it, here is the right thing to do, here is the right thing to do. Okay?
So the sensory motor learning problem is split into two. First, learning to act without perception being hard. And second, learning to perceive without action being hard. Turns out, that's much easier than just learning the two together in a bundle.
That's really interesting. So in the way you did the second part of the training, let me make sure I got this. This second model with the realistic inputs, is it trying to match what the teacher would have done?
But it doesn't actually try to figure out an intermediate true representation of the world. It's just kind of matching the teacher, does it somehow try to actually do that mapping from noisy sensors to real world state?
Right. It doesn't need to reconstruct the real world state. So there are different architectures we can imagine with different intermediate representations. But the simplest instantiation of this approach is that you just have a network that maps sensory input to action and then this network is just trained in a supervised fashion by the actions that the teacher produces.
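Editor's sketch: the two-stage teacher and student setup described above, in its simplest form, might look like the toy below. It is an illustration under heavy assumptions, not the method from the papers. The teacher here is hand-written rather than trained, the "world" is a single number, and `teacher_policy`, `sensor`, and `train_student` are hypothetical names.

```python
import random

def teacher_policy(true_state):
    """Privileged teacher: sees the ground-truth offset from the goal and steers toward it."""
    return -1.0 * true_state

def sensor(true_state, rng):
    """What the student sees: a rescaled, shifted, noisy reading of the true state."""
    return 2.0 * true_state + 1.0 + rng.gauss(0.0, 0.05)

def train_student(num_samples=2000, seed=0):
    """Fit action = w * reading + b to the teacher's actions by least squares."""
    rng = random.Random(seed)
    states = [rng.uniform(-1.0, 1.0) for _ in range(num_samples)]
    readings = [sensor(x, rng) for x in states]
    targets = [teacher_policy(x) for x in states]  # "What would you do?"
    mean_r = sum(readings) / num_samples
    mean_a = sum(targets) / num_samples
    cov = sum((r - mean_r) * (a - mean_a) for r, a in zip(readings, targets))
    var = sum((r - mean_r) ** 2 for r in readings)
    w = cov / var
    b = mean_a - w * mean_r
    return w, b

# The student never sees the true state, only sensor readings and teacher actions.
w, b = train_student()
```

The student ends up with a direct mapping from sensor reading to action (roughly `w = -0.5`, `b = 0.5` in this toy) without ever reconstructing the true state explicitly, which mirrors the "network that maps sensory input to action" described above.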
I see, cool.
So I'm really just cherry picking your papers that just seem kind of awesome to me. But I was also pretty impressed by your paper, where you taught drones to do like, crazy acrobatics. Do you know what I'm talking about?
So you talk about the simulation in that one, and it seemed like it must be really hard to simulate what actually happens to a drone as it like kind of flies in crazy ways. I mean, I'm not sure, but it seems so stochastic to me just like watching a drone. It's so hard to control a drone I was actually wondering if that... It seems like it must have been a real simulation challenge to actually make that work. Also, we should put a link to the videos because they're super cool.
Yeah, yeah. This was an amazing project driven again by amazing students from University of Zurich, Antonio Loquercio and so on. First we benefited from some infrastructure that the quadrotor community has, which is they have good quadrotor simulators, they have good models for the dynamics of quadrotors. We also benefited from some luck, which is that, not everything that can happen to a quadrotor needs to be simulated to get a good quadrotor control. So for example, we did not simulate aerodynamic effects, which are very hard to simulate. So if a quadrotor goes close to a wall, it then gets aerodynamic push back. It gets really, really hairy. But we did not simulate that and turns out we didn't need to.
Because, the neural network makes decisions moment to moment, moment to moment. And if it gets a bit off track, if it's thrown around, no problem, in the very next moment, it adjusts to the state that it finds itself in. So this is closed loop control. If it was open loop control, well, it would have failed.
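Editor's sketch: the difference between closed loop and open loop control under an unmodeled disturbance can be seen in a one-dimensional toy. The dynamics, gain, and gust values below are assumptions made for illustration, and they have nothing to do with real quadrotor dynamics.

```python
def simulate(closed_loop, steps=30, gain=0.5, gust_step=5, gust=0.5):
    """Drive a 1D state toward zero; a single unmodeled gust hits at gust_step."""
    x = 1.0       # actual state
    x_plan = 1.0  # state assumed by the precomputed open-loop plan
    for t in range(steps):
        # Closed loop reacts to the actual state; open loop replays the plan.
        u = -gain * (x if closed_loop else x_plan)
        disturbance = gust if t == gust_step else 0.0
        x = x + u + disturbance
        x_plan = x_plan - gain * x_plan  # the plan never sees the gust
    return x

closed_final = simulate(closed_loop=True)
open_final = simulate(closed_loop=False)
```

In this toy the closed loop controller ends essentially at zero because it corrects in the steps after the gust, while the open loop plan ends about 0.5 away from the target, since nothing ever compensates for the push.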
I see. Interesting. Were there any other details that you had to get right to make that work? I mean, I'm really impressed the way you're... It seems like you're sort of effortlessly able to jump from simulation to reality. And everyone else that I talk to is like, this is like the most impossible step. But it's something about these domains or something you're doing seems to work really effectively for you?
Yeah, yeah. So we're getting a hang of this. And there are a few key ideas that have served us well. One key idea is abstraction. So abstraction is really, really key. The more abstract the representation that a sensor or a sensory modality produces, the easier it is to transfer from simulation to reality.
So what do you mean by abstract? Can you give me an example about abstract versus not abstract?
Yeah. Let's look at three points on the abstraction spectrum. Point number one, a regular camera, like the camera that is pointing at you now and the camera that is pointing at me now, point number one. Point number two, a depth map coming out of a stereo camera. So we have a stereo camera, it's a real sensor, it really exists, produces a depth map. Let's look at that depth map. Point number three, sparse feature tracks that a feature extractor like SIFT would produce. So just very salient points in the image and just a few points that are being tracked through time, so you're getting just a set of moving dots.
So the depth map is more abstract than the color image. Why is that? Because there are degrees of variability that would affect the color image, that the depth map is invariant to. The color of that rack behind you would massively affect the color image, but would not affect the depth map. Is it sunny? Is it dark? Are you now at night with your environment lit by lamps?
All of that affects the color image and it's brutally hard to simulate. And it's brutally hard to nail the appearance so that the simulated appearance matches the statistics of the real appearance. Because we're just not very good at modeling the reflectance of real objects. We're not good at dealing with translucency, refraction, we're still not so great at simulating light transport.
So all these things that determine the appearance of the color image, very, very hard to simulate. The depth map is invariant to all of that, it gives you primarily a reading of the geometric layout of the environment. So, if you have a policy that operates on depth maps, it will transfer much more easily from simulation to reality, because things that we are not good at simulating, like the actual appearance of objects, they don't affect the depth map. And then if you take something even more abstract, let's say you run a feature extractor, a sparse feature tracker through time, the video will just be a collection of points, like a moving dot, a moving point display.
It actually still gives you a lot of information about the content of the environment. But now it's invariant to much more, it's invariant also to geometric details and quite a lot of the content of the environment. So maybe you don't even have to get the geometry of the environment and the detailed content of the environment right, either. So now that's even more abstract.
And that last representation is the representation that we used in the deep drone acrobatics project. So the drone, even though it has a camera, and it could look at the color image, it deliberately doesn't. It deliberately abstracts away all the appearance and the geometric detail and just operates on sparse feature tracks. And turns out that we could train that policy with that sensory input in very simple simulated environments, and it would just work out of the box in the real world.
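Editor's sketch: the invariance argument for the first two points on the abstraction spectrum can be made concrete with a tiny made-up scene. The "renderers" below are deliberately trivial assumptions (color is albedo times light intensity, depth is read off directly), and every value is hypothetical.

```python
def render_color(albedo, light):
    """Toy color image: surface albedo scaled by light intensity."""
    return [[a * light for a in row] for row in albedo]

def render_depth(depth):
    """Toy depth map: depends only on geometry, not on albedo or lighting."""
    return [row[:] for row in depth]

def mean_abs_diff(img_a, img_b):
    """Average per-pixel absolute difference between two images."""
    diffs = [abs(a - b) for row_a, row_b in zip(img_a, img_b)
             for a, b in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)

# Same geometry in simulation and reality, different surface colors and lighting.
depth = [[2.0, 2.0, 5.0], [2.0, 2.0, 5.0]]
sim_albedo = [[0.9, 0.9, 0.2], [0.9, 0.9, 0.2]]
real_albedo = [[0.3, 0.4, 0.6], [0.35, 0.45, 0.7]]

color_gap = mean_abs_diff(render_color(sim_albedo, 1.0), render_color(real_albedo, 0.4))
depth_gap = mean_abs_diff(render_depth(depth), render_depth(depth))
```

Here `color_gap` is large while `depth_gap` is zero: a policy reading the depth map receives the same input in both worlds, even though the simulated appearance is badly wrong.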
Well, it's so interesting, it makes me wonder, I mean, people that we've talked to talked about sort of end to end learning with like autonomous vehicles versus pieces. And I guess I've never considered that if you kind of break it up more or have like more intermediate representations, it might make simulation easier transferring from simulation to the real world. But that actually makes total sense.
Yeah. So I think for example, the output of a LIDAR is easier to simulate than the original environment that gave rise to that output. So if you look at the output of a LIDAR, it's a pretty sparse point set. If you train a policy that operates on the sparse point set, maybe you don't need a very detailed super high fidelity model of the environment, certainly maybe not of its appearance, because you don't really see that appearance reflected much in the LIDAR reading.
I guess I also wanted to ask you about another piece of work that you did that was intriguing, which is this Sample Factory paper, where you have kind of a setup to train things much faster.
And I have to confess I kind of struggled to understand what you were doing. So I would love just kind of a high level explanation. I mean, like, maybe, I'm not a reinforcement learning expert at all. So maybe kind of like set up what the problem is, and kind of what your contribution is that made these things run so much faster.
Yeah. So, our goal is to see how far we can push the throughput of sensory motor learning systems in simulation. And we're particularly interested in sensory motor learning in immersive three dimensional environments. I'm personally a bit less jazzed by environments such as board games, or even Atari, because it's still quite far from the real world.
Although you have done a fair amount of work on it, haven't you?
Right. So we've done some, but what really excites me deeply is the training systems that work in immersive 3D environments, because that, to me, is the big prize. If we do that really, really well, that brings us closer to deploying systems in the physical world. The physical world is three dimensional, the physical world is immersive, perceived from a first person view, onboard sensing and computation by animals, including humans. And these are the kinds of systems that I would love to be able to create. So that's where we tried to go in our simulated environments. And these simulated environments tend to be, if you're not careful, they're pretty computationally intensive. And if you just use, again, if you use out of the box systems, you will notice a pattern here.
If you just use tools out of the box, and have some high level Python scripting on top of existing tools, you'll basically have a simulation environment that runs at 30 frames per second, maybe 60 frames per second. You're roughly collecting experience in something that corresponds to real time. Now, as we mentioned, it takes a human toddler a couple of years of experience to learn to walk. And a human toddler is a much better learner, a much more effective learner than any system we have right now. So two years is a bit slow if you ask me for a debug cycle. I don't want to have a debug cycle of two years. And in fact, what we need to do is take this amount of experience, and then multiply it by several orders of magnitude, because the models that we're training are much more data hungry, and they're much poorer learners than the human toddler.
So then basically, we're looking at compressing maybe centuries of experience until we get better at learning algorithms and the models we design. But with the current models and algorithms, the challenge is to compress perhaps centuries of experience into overnight, an overnight training run, which is a reasonably comfortable debug cycle. You launch a run, you go home, you come back in the morning, you have experimental results. That basically means that you need to operate, you need to collect experience and use it for learning on the order of hundreds of thousands of frames per second, millions of frames per second. And this is where we're driving.
So in this paper, we demonstrate a system architecture that trains agents that act, collect experience, and learn in these 3D immersive environments on the order of 100,000 frames per second on a single machine, single server.
And the key was basically a bottom up from scratch, from first principles, system design with a lot of specialization. So we have processes that just collect experience, agents just run nonstop collecting experience. We have other processes that just learn and update the neural network weights. So it's not that you have an agent that goes out, collects experience, then does some gradient descent steps, updates its weights, goes back into the environment, collects some more experience with better weights, and so on and so forth.
Everything happens in parallel, everybody is busy all the time. And the resources are utilized very, very close to 100% utilization. Everything is connected through high bandwidth memory, everything is on the same node, so there is no message passing. Because if you look at these rates of operation, if you're operating at hundreds of thousands of frames per second, message passing is too slow.
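Editor's sketch: the arithmetic behind "centuries of experience overnight" can be worked through in a few lines. The 30 frames per second figure comes from the conversation above. The 12-hour overnight window and the choice of exactly one century are assumptions made to get concrete numbers.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # average year length, including leap years

def required_fps(years_of_experience, real_time_fps, training_hours):
    """Frames per second needed to replay that much experience within the training window."""
    total_frames = years_of_experience * SECONDS_PER_YEAR * real_time_fps
    return total_frames / (training_hours * 3600)

# A toddler's two years of experience, squeezed into one 12-hour run.
toddler_overnight = required_fps(2, 30, 12)

# One century of experience, squeezed into the same 12-hour run.
century_overnight = required_fps(100, 30, 12)
```

Under these assumptions, two years of experience needs roughly 44,000 frames per second and one century needs roughly 2.2 million frames per second, which is consistent with the "hundreds of thousands" to "millions" range mentioned above.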
The fastest message passing protocol you can find is too slow, the message passing becomes the bottleneck in the system. So what happens is that these processes just read and write from shared memory. They just all access the same memory buffers. When the new neural network weights are ready they're written into the memory buffer, when a new agent is ready to go out collect experience, it just reads the latest weights from the memory buffer.
And there is a cute idea that we borrowed from computer graphics, which is double buffering. And double buffering is one of the very, very first things I learned in computer graphics as a teenager, we wrote the assembly code. And basically lesson one in computer graphics, how do you even display the image? Double buffering is part of lesson one.
The idea is that there are two buffers. The display points to the front buffer and that's what's being displayed, that's the active buffer. In the meantime, the logic of your code is updating the back buffer with the image of the next frame. When the back buffer is ready, you just swap pointers. So the display starts pointing to the back buffer, that becomes the primary one. And then the logic of your code is operating on what used to be the front buffer. So the back buffer becomes the front buffer, the front buffer becomes the back buffer, you keep going.
We introduced this idea into reinforcement learning, again, to just keep everybody busy all the time. So the learning processes work on a buffer and then write out the new weights and the experience collectors have their own buffer that they're writing out sensory data into. And then they swap buffers, there's no delay, and they just keep going.
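Editor's sketch: the double buffering idea can be shown as a minimal, single-threaded toy. It is not the Sample Factory implementation. The class and method names are hypothetical, and the shared memory and synchronization between separate learner and collector processes described above are deliberately left out.

```python
class DoubleBuffer:
    """Two equally sized buffers: readers use the front, the writer fills the back."""

    def __init__(self, size):
        self.front = [0.0] * size  # what readers currently see
        self.back = [0.0] * size   # what the writer is preparing

    def read(self):
        """Return a copy of the currently active (front) contents."""
        return list(self.front)

    def write_back(self, values):
        """Fill the back buffer without disturbing readers of the front buffer."""
        self.back[:] = values

    def swap(self):
        """Swap references only; no data is copied."""
        self.front, self.back = self.back, self.front

# Example with network weights: collectors read, the learner writes.
weights = DoubleBuffer(size=3)
seen_before_update = weights.read()   # collectors see the initial weights
weights.write_back([0.1, 0.2, 0.3])   # learner prepares new weights
seen_during_update = weights.read()   # collectors still see the old weights
weights.swap()                        # new weights become active
seen_after_swap = weights.read()
```

The point of the design is that the swap is just an exchange of references, so neither side waits for a copy. In a real multi-process system the swap would also need to be coordinated so that no reader is mid-read when it happens.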
Interesting. Would it be possible to scale this up if there were multiple machines and there was a delay in the message passing?
So the distributed setting is more complex, we have avoided it so far. If you are connected over a high speed fabric, then it should be possible.
We've deliberately maybe handicapped ourselves still, even in a follow up project that we have now that was accepted to ICLR. We limited ourselves to a single node, because we felt that we will learn useful things if we just constrain ourselves to a single node and ask how far can we push single node performance. And in this latest paper that was just accepted to ICLR, we basically showed that with a single node, if we again take this holistic end-to-end from first principles system design philosophy, we can match results that previously were obtained on an absolutely massive industrial scale cluster.
Yeah, I mean, your learning speed is so fast to me, it seems faster than actually what I would expect from like supervised learning, where you're literally just pulling the images off your hard drive. Am I wrong about that or?
Oh, yeah. So in the latest work it's basically the forward pass through the ConvNet is one of the big bottlenecks. It's no longer the simulation, we can simulate so fast, we can simulate the environment so fast, it's no longer the bottleneck. It's actually like routine processing. Like even just doing the forward pass in the ConvNet.
So I guess like one more project that you worked on that I was kind of captivated by, I kind of want to ask about, because I think a lot of people that watch these interviews would be interested in it too is CARLA right, which is like kind of an environment for learning autonomous vehicle stuff. Can you maybe describe it? And what inspired you to make it?
Yeah, CARLA is a simulator for autonomous driving. And it's grown into an extensive open source simulation platform for autonomous driving that's now widely used, both in industry and in research.
And I can answer your question about inspiration I think in two parts. There is what originally inspired us to create CARLA and then there is what keeps it going. And so what originally inspired us is actually basic scientific interest in sensory motor learning and sensory motor control. I wanted to understand how we train intelligent agents that have this kind of embodied intelligence that you see in us and other animals. Where we can walk through an environment gracefully, deliberately, we can get to where we want to go, we can engage with the environment, if we need to rearrange it, we rearrange. We clearly act spatially intelligently and by intelligently, in an embodied fashion.
And this seems very core to me and I want to understand it, because I think this underlies other kinds of intelligence as well. And I think it's important for us on our way to AI, to use a loaded term, I think it's very important for us to understand this aspect of intelligence. It seems very core to me the kinds of internal representations that we maintain, and how we maintain them as we move through immersive three dimensional environments.
So I wanted to study this, I wanted to study this in a reproducible fashion. I wanted good tooling, I wanted good environments in which this can be studied. And we looked around, and when we started this work, there just weren't very good, very satisfactory environments for us.
In some early projects, we ended up using the game Doom, which is a first person shooter that I used to play as a teenager, and I still have a warm spot for. And we used Doom and we used it to good effect and in fact, we still use it in projects. And we used it in the Sample Factory paper as well. I mean, Sample Factory is another paper that is based on Doom, essentially on derivatives of John Carmack's old code, which tells you something about the guy, right? So if people still use your code, 25 years later, you did something good. You did something right.
But Doom, if you just look at it, somehow is less than ideal, right? I mean, you walk around in a dungeon and you engage in assertive diplomacy of the kind that maybe we don't want to always look at and we don't want our graduate students to always be confronted with. I mean, there's a lot of blood and gore and somehow it wasn't designed for AI, it was designed for the entertainment of, I guess, primarily teenage boys. So we wanted something a bit more modern, and that connects more directly to the kinds of applications that we have in mind, to useful productive behaviors that we want our intelligent systems to learn. And autonomous driving was clearly one such behavior.
And I held the view at the time that I still hold, that autonomous driving is a long term problem, it's a long term game. It wasn't about to be solved, as people were saying when we were creating CARLA, and I still don't think that it's about to be solved. I think it's a long term effort. So we created a simulation platform where the task is autonomous driving.
And as an embodied artificial intelligence task, as an embodied artificial intelligence domain, I think it's a great domain. You have a complex environment, you need to navigate through it, you need to perceive the environment, to make decisions in real time. The decisions really matter, if you get something wrong, it's really bad. So the stakes are high, but you're in simulation.
So that was the original motivation, it was basic scientific interest in intelligence and how to develop intelligence. And then the platform became very widely used. People wanted it, people wanted it for the engineering task of autonomous driving, and people kept asking for more and more and more and more features, more and more functionality. Other large institutions like actual automotive companies started providing funding for this platform to be maintained and developed because they wanted it. And we put together a team. The team is ably led by German Ros, one of the original developers of CARLA, who is now leading an extensive international team that is really primarily devoted to the autonomous driving domain and supporting the autonomous driving domain through CARLA.
That's so cool. I feel like maybe one criticism of academia, I don't know if it's fair or not, is that it has trouble with incentives to make tools like this that are really reusable.
Did you feel pressure to write papers, instead of building a robust simulating tool, that would be useful for lots of other people?
Well, I maintain a portfolio approach where I think it's okay for one thrust of my research and one thrust of my lab to not yield a publication for a long time. Because other thrusts just very naturally end up publishing more. So it balances out, it balances out.
I personally don't see publication as a product or as a goal. I see publication as a symptom, publication is a symptom of having something to say. So publications come out, they come out at a healthy rate, just because we end up discovering useful things that we want to share with people. And I personally find it very gratifying to work on a project for a long time, and do something substantial, maybe then publish it. And if people use our work, and it's useful to them, that is its own reward to me. So even if there is no publication, if people find our work useful, I love it, I find it very, very gratifying.
Mm-hmm (affirmative). Yeah, I can totally relate to that.
Can I ask you a more open ended question since we're kind of getting to the end of this? I guess, I wonder when I look at ML applications, I guess, broadly defined ML, the one that is kind of mysterious to me is robotics, right? Like, I feel like I see ML like working all over the place. It's just so easy to find... Like suddenly, my camera can search semantically. But then, I feel like the thing that I can do, that computers mostly can't do, is kind of pick up an arbitrary object and move it somewhere. And it seems like you've been really successful, getting these things to work to some degree.
But I guess I always wonder like, what is so hard about robotics? And is this... Do you think there'll be like a moment where something starts working and we see ML robot applications all over the place? Or is this always going to remain like a huge challenge?
I don't think it will always remain a huge challenge. I don't think there is magic here. The problem is qualitatively different from your perception problems, such as computer vision, and being able to tell your camera, "Where is Lukas?" And the camera will find Lukas. The problem is qualitatively different, but I don't think the problem is insurmountable. And I think we're making good progress.
So the challenge is that to learn to act, you need to actually act. To act, you need to act in an environment, you need to act in a living environment. If you act in a physical environment, you have a problem, because the physical environment runs in real time. So you're potentially looking at the kinds of debug cycles that we mentioned with a human toddler, where something takes a couple of years to learn. And in these couple of years, I mean, that toddler is also an incredibly robust system, right? The toddler can fall no problem, right?
So during this time you run out of battery power, you fall, you break things, you need a physical space in which all of this happens. And then if you're designing the outer learning algorithms, you need to do this in parallel on many, many, many, many variations. You need many, many, many, many slightly different toddlers to see which one learns better. And it's very, very hard to make progress in this regime. So I think we need to identify the essential skills, the underlying skills, that... And I think many of these can be understood and modeled in essentially our equivalent of model systems.
So if you look at neuroscience, for example, much of what we know about the nervous system was not discovered in humans, in the human nervous system. It was discovered in model systems such as squids. So a squid is pretty different from a human. But it shares some essential aspects when it comes to the operation of the nervous system.
And it's easier to work with, for very many reasons. Squids are just easier to work with than humans. Nobody says that if we understand squid intelligence, we will understand everything about human intelligence, and how to write novels and compose music. But we will understand many essential things that advance the field forward. I believe we can also understand the essence of embodied intelligence, without worrying about, let's say, how to grasp slippery pebbles, and how to pour coffee from a particular type of container.
Maybe we don't need to simulate all these complexities of the physical world. We need to identify the essential features that really bring out the essence of the problem, the essential aspects of spatial intelligence, and then study these in convenient model systems.
That's what we try to do with a lot of our work. And I think we can actually make progress, make progress enough to bootstrap physical systems that are basically intelligent enough to survive, and not cause a lot of damage when they're deployed in the physical world. And then we can actually deploy them in the physical world and start tackling some of these last millimeter problems such as how to grasp a slippery glass, that kind of thing.
That's so interesting. Is it really last millimeter? Because I feel like something just like, I mean, you would know better than me, but just like the way fabric hangs, or the way, like, liquids spill, I understand that those are incredibly hard to simulate with any kind of accuracy as we would recognize it. You think that that's like, actually in the details, and the more important thing is like- Well, what is the more important thing then? To know how to simulate quickly? Or where's the productive axis to improve?
Well, one problem that I think a lot about that seems pretty key is the problem of internal representations of spatial environments that you need to maintain. So suppose you want to find your keys, okay, you're in an apartment, you don't remember where you left your keys, you want to find your keys. Okay? So you need to move through the apartment, and you need to maintain some representation of it. Or you're in a new restaurant, and you want to find the bathroom. You've never been there before, you want to find the bathroom. I've done this experiment, many times, you always find the bathroom and you don't even need to ask people, right? How do you do that? What is that?
So I think these questions, these behaviors, tap into actually an important, what to me feels like an essential aspect of embodied intelligence, an essential aspect of spatial intelligence. And I think if we figure that out, we will be on our way, we will not be done, but we will be on our way.
Then there are the very detailed aspects. One of my favorite challenges, long term challenges for robotics, is Steve Wozniak's challenge. Which is that a robot needs to be able to go into a new house that it's never been in before and make a coffee. So that I think will not be solved with just the skill that I mentioned to you. That does rely on some of these last millimeter problems, sort of the detailed actuation, also reasoning about the functionality of objects. And I think we're actually far. I don't think it's going to happen next year. I think we're quite far, but it's a very exciting journey.
Awesome. I love it. Thanks so much for your time. That was a lot of fun.
Thank you so much, Lukas.