Adrien Gaidon is the Head of Machine Learning Research at the Toyota Research Institute (TRI). His research focuses on scaling up ML for robot autonomy, spanning Scene and Behavior Understanding, Simulation for Deep Learning, 3D Computer Vision, and Self-Supervised Learning.
0:00 Sneak peek, intro
0:48 Guitars and other favorite tools
3:55 Why is PyTorch so popular?
11:40 Autonomous vehicle research in the long term
15:10 Game-changing academic advances
20:53 The challenges of bringing autonomous vehicles to market
26:05 Perception and prediction
35:01 Fleet learning and meta learning
41:20 The human aspects of machine learning
44:25 The scalability bottleneck
Note: Transcriptions may contain some inaccuracies. Please submit any corrections to email@example.com. Thank you!
ML is everywhere, but it's like starting from perception, and more and more moving to prediction. And I think where the cutting edge really is, is in the planning and control side. So how do you bridge these gaps from pixels to the steering? So ML is everywhere, of course, I'm biased.
You're listening to Gradient Dissent, a show about machine learning in the real world. And I'm your host Lukas Biewald. Adrien is exactly the kind of person we imagined having on this podcast when we started it. He's the head of research at TRI, Toyota's research arm. And he's been a long time user, maybe one of the very first users of Weights & Biases. And every time I talk to him, he has interesting ideas on the field of machine learning, and the tools necessary to make it really work in production. This is going to be a really interesting conversation.
All right, I have a whole bunch of questions, but I thought I'd start with a little bit of an oddball one only for you. I always use this metaphor, building the Weights & Biases tools, that I hope our users love our tools in the same way that a guitar player loves their guitars. And I know you are a guitar player, do you have a favorite one that you own and play?
I don't know, I need to deactivate my Zoom background, but this is a Road Worn Fender Strat.
I love the Road Worn because first, I don't mind damaging it any more. And second, it has a really nice feel. I think like tools, that's more general than just guitars, but they grow on you. It's almost like a lot of musicians give names to their guitars, like Eric Clapton famously, et cetera. And I think like really good tools, they become part of you, and you develop a relationship with them. It's the case for cars, it's the case for guitars, virtual tools it's kind of interesting. Some tools definitely become part of you, I haven't named a W&B report after my daughter or something like this yet, but who knows?
Well, besides W&B, what are your favorite tools that you use in your day-to-day job building machine learning models?
If you're talking as a manager, that's not the same as if you're talking as a scientist.
Answer both, please. I would love to have both.
All right, I won't mention maybe the ones that everybody knows and loves. I like Todoist a lot as a manager, that's a great way to manage your tasks and stuff like this. I've been a longtime user of Todoist, and I really, really like it. Tried a lot of different ways to manage to-dos, et cetera, and keep track of those karma points and whatever things like gamification. This one, I think is a pretty nice recommendation I can give to everybody that has a lot of tasks and wants to stay on top of them. As a manager, just one is good.
And what do you like about it? As you said, Todoist is the app. We should put a link to that.
Yeah, in one word.
I have to say I use WorkFlowy, and I'm like super attached to it myself. I'm curious, what do you like about Todoist? What's the...
It's very simple. I think tools in this complicated world where you have many things to do have to be dead simple, and good synchronization across devices is super important because you switch from one to the other, et cetera.
Nice. Well, this show isn't for managers, it's for the scientists. Tell us about, as a scientist, what tools you love.
For the scientists, I mean, Jupyter Notebooks, super like, obviously, right? I said I won't mention the ones that everybody uses, but this one I will still mention. Otherwise, I mean, PyTorch is just awesome. As a manager and now senior manager, I get less and less time to do technical stuff, as it should be, right? I focus on empowering my teams, et cetera, but I still have this itch, and sometimes I do a lot of code reviews, and I'm like, "Oh, yeah, I want to try this thing."
PyTorch, even as a senior manager that doesn't do like 50% or even 30% of the day coding, I still get back to it very, very quickly, because it's just so simple, very few abstractions, very little like, vocabulary. It's not a DSL, right? It's NumPy, it's NumPy on steroids, and that's just so easy to use.
Interesting, can you say anything else about PyTorch? I'm always kind of curious, because PyTorch just seems to have these like passionate fans or user base. The people that use other frameworks, they use them a lot and they seem to like them, but somehow the people that use PyTorch seem just, like incredible advocates. Do you have any sense of why that is?
I've been working in computer vision since 2007. And so basically in 2012, I finished my PhD and then I moved to research in industry at Xerox Research. And then what was interesting was that was just the time, I was big into kernel methods. Everything had to be convex, and learning theory, Vapnik, super-clean. And then 2012, Krizhevsky, non-convexity, not a problem. All these kinds of things. And Caffe was very big, became very, very big, especially 2013 with like Ross Girshick, and like Berkeley doing amazing work there, Yangqing Jia, et cetera, et cetera. All these kinds of tools were really born there, but Caffe is a C++ library, fairly easy to reproduce things, but fairly hard to do your own fork and do something very different, especially in the learning algorithm. Not just changing architectures, et cetera, that's the easy part of deep learning.
But changing the task you're working on, or changing the overall learning algorithm is more complicated. And I maintained an internal fork of Caffe and we did some papers, et cetera. But then there was the alternative of Theano, which was like let's say, an early days pioneer, and I will leave it at that. A great library, but not necessarily the most user-friendly one. And then TensorFlow came, and it's a huge hype train, right? Of like Google, everybody wanting to work for Google, like TensorFlow, TensorFlow.
So of course I jumped on the bandwagon too, and then Lua Torch was the only kind of Asterix a little bit, like the little village of resistance to the Roman Empire. But I never really liked Lua, I was always a big Python fan. And so when PyTorch came out, the nice clean design of Torch with Python, that basically became a no-brainer. And everybody that did Python in the PyData kind of sphere, the SciPy sphere, that was familiar with NumPy, immediately became familiar with PyTorch.
And that was the genius, right? No training, no onboarding, you know NumPy, you can use PyTorch. I'm psyched for JAX, it's kind of interesting because now Google kind of realized this, that the DSL, graph-based, is a very complicated ecosystem, very complete ecosystem. So really nice for production setup, but for researchers that are a bit more on the crazy side of things, I wish I had the time to play with JAX basically. I just looked at things and it sounds amazing. And I think maybe there's going to be more diversity in PyTorch-like tools.
It's so interesting, I think when TensorFlow came out, I thought, "Oh, people would just want to use the same tool, and everyone's just going to kind of switch to this." It's been kind of surprising to see the passionate advocates of PyTorch, at least by our internal metrics, it seems like it's getting more popular. Do you have any sense about what it is about the design that makes it feel so satisfying?
Right. There is a really, really great paper at NeurIPS last year, if I remember correctly. I think it's already cited 1500 times, which is huge, right? For the paper to be cited more than a thousand times is a big, big deal. In less than one year, it just shows you how popular it actually is. It's by the PyTorch authors like Soumith Chintala, et cetera, et cetera, all these great people.
And they described their design principles in that paper, I can recommend to your listeners to check it out. It's very accessible, it's a NeurIPS paper, so it might scare people away, like "that's math", but it's not, it's a really, really good paper to read. I won't summarize that paper, but the design principles are really, really good. And for me, the great UX is basically directly the result of them. It's a user experience, right?
It's just, you can't force people in this age of open source and of free tools that are widely available and also widely known, right? You have to live under a rock to not know PyTorch exists, right? Then the best user experience wins. It's just as simple as that. And PyTorch has just so few abstractions, I think it's like maybe four abstractions total that you have to know, that are PyTorch-specific, right?
And again, the rest is just very, very generic, very powerful, has nice workflows, there's PyTorch Lightning that tries to simplify those workflows. Maybe Keras-style, high-level APIs, but just the base level one is just, you go from idea to experiments really quickly. So that would be my why.
I love the idea of user experience of a library or like a deep learning framework. It's like you normally think of user experiences like a website, but the developer user experience is so important. I totally agree.
And it's because basically just like coding, right? It's becoming democratized. There's a huge thing about no code, which is all about that. But code is still like, people are going to code for a long time. It's like people say, "Oh, no code. People will stop coding soon." No, same thing as like self-driving cars, they're going to happen, but it doesn't mean that people are going to stop driving soon. There's kind of a lot of good things that can happen if you simplify the user experience for what used to be called power users. But the '70s era is done, where only the most hardcore geeks code, everybody codes now. I mean, a lot of people code.
And I guess you're more than a researcher, right? I mean, you've been working on autonomous vehicles at Toyota, trying to deploy them for quite a long time, right? I think some people might worry that PyTorch isn't easy to put into production, but you have one of the biggest challenges of productionizing your systems. How have you thought about that? Does PyTorch work for you in production?
TRI, Toyota Research Institute, where I work, was created like in 2016, roughly. And so we haven't worked that long on it, compared to, let's say, Waymo et cetera, that really started in 2009. But what's fun is that we kind of started with PyTorch almost from the start. At first, the first year, we were really working with TensorFlow. Mostly for the reasons that you're describing, like putting things in production, et cetera.
But we found out that iterating was actually a bit painful, and because the decision was within our power as researchers, we kind of switched to PyTorch fairly quickly. So that was one of the decisions I made early on that we're really happy with, the downside to it is when you are on the bleeding edge, you have blood all over your fingers, you know you cut yourself, right? And especially on the production side.
In the early days, what it meant is that when you deploy something like in Python, or something like not glorious. I don't want to go into the details, because it's a bit like... But then the ecosystem progressed, right? And now especially in, I would say in the last year or so PyTorch has really been growing and it's focused on productionizing. It turned out to be a really good bet.
We did it from a research perspective, for velocity of iteration, because, I mean, our stance is that autonomous driving still has a lot of research problems to be solved, right? A lot of research. So you want to optimize for the bottleneck, right? It's something very well known in production systems, this theory of constraints. You look at your workflow, where's the bottleneck, and optimizing the rest doesn't really matter because it's still the bottleneck that governs the speed at which you iterate.
We found that experimenting was the bottleneck, and now like production is not a bottleneck anymore because there's great tools. Like ONNX, we're using ONNX, we're using TensorRT as part of our tool chain to deploy models that are efficiently running on GPUs, et cetera. There's even more recent projects like TRTorch, which enables you to go directly from PyTorch to TensorRT. And there's many more far beyond Nvidia hardware, there's exciting cross-compilation tools, things like the TVM stack, et cetera, et cetera.
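A minimal sketch of the bottleneck argument above, in Python. The stage names and rates are made up for illustration and do not come from the conversation; the only point is that in a serial workflow the slowest stage sets the pace.

```python
# Hypothetical stages of an ML workflow and how many model iterations
# each can handle per week. The names and numbers are illustrative only.
stages = {
    "data_collection": 20.0,
    "experimentation": 3.0,   # the slow stage in this made-up example
    "deployment": 10.0,
}


def throughput(stage_rates):
    """End-to-end rate of a serial pipeline: the slowest stage sets the pace."""
    return min(stage_rates.values())


def bottleneck(stage_rates):
    """Name of the stage that limits the whole pipeline."""
    return min(stage_rates, key=stage_rates.get)


baseline = throughput(stages)

# Doubling the speed of a stage that is not the bottleneck changes nothing.
faster_deploy = dict(stages, deployment=stages["deployment"] * 2)

# Doubling the speed of the bottleneck stage doubles end-to-end throughput.
faster_experiments = dict(stages, experimentation=stages["experimentation"] * 2)

print(bottleneck(stages))
print(baseline, throughput(faster_deploy), throughput(faster_experiments))
```

In this toy setup, speeding up deployment leaves the end-to-end rate unchanged, while speeding up experimentation raises it, which is the reasoning given for optimizing iteration speed first.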
Production wise, I think it's such a big deal to deploy models that if the second top framework or the top two frameworks don't have good solutions for that, they're doomed to fail. So they understood this a long time ago. And it's good now.
And just so you're saying today that you actually can do it. You can get it into production, it's not a problem.
Oh, yeah. I mean, we could do it before, it's just not necessarily very nice production engineering, but now there are tools to do this in a really state-of-the-art way. Not just by researcher standards, but by proper engineering standards.
Right, right. I feel a little reticent to ask you this question, cause probably everyone asks you this, this is what my parents ask me, but in your view, since you're at the front lines, what is the state of self-driving cars like? I think everyone talks about it, yet I can't get into a car, and tell it to drive me somewhere and have it do it. On the other hand, I live in San Francisco and I see these cars driving around autonomously all the time. What's going on?
That's a good question, right? That's a standard question, that's a question everybody should ask themselves every six months or so. And the question is for how long, that I don't know, I can't predict the future, but I think that's one thing that attracted me when I came to TRI was, I was just surprised how much people thought it was solved, right?
Back in 2016, when I really started working on autonomous driving, as a researcher working in computer vision and machine learning, I was like, I'm excited about a lot of exciting problems, how do we deal with the fact that labeling is expensive? So we want to optimize label efficiency, maybe even go self-supervised, and these kinds of things. And it was just starting at this period, or using simulation, right?
One of the big things I've done is leveraging simulation. And I was like, "Wow, there's so many open research challenges, it's so cool." As a researcher, I have a huge playground and a huge societal motivation to actually solve, there's 1.35 million traffic fatalities every year on the road. I was like, "This is a huge societal problem, it's super important to solve that, because this 1.35 million is just crazy."
But the reason that it's so high is because it's so hard. And so it's a super hard problem, super important, and there's so many research problems. As a researcher, super excited. Move to Bay Area, everybody's like it's in six months. In six months I got this, everybody from this little startup, to the big companies, to OEMs, everybody was coming up with dates. 2018, we got this, but in 2016, 18, 19, 20, you name it.
Go back to 2016 and listen to any announcements, et cetera. You will see everybody promised everything every time. And it's to get VC money and everything like this, I know it's Bay Area, how we get funding. But the stance of Gill Pratt, our CEO, who is a former DARPA Director, he was an MIT professor and everything. He is very, very smart and an excellent roboticist.
And he had always a deep appreciation for the problems. And he was at the labs and all kinds of things. And so it's always been like it's much harder than people think, it's going to take much longer than people think. And therefore, if you're serious about it, you should be committing long-term resources, and treat it as a research problem. We're a research institute. Research is our middle name, like John Leonard, a famous robotics professor, one of our VPs, always says that.
It's going to take a while, it's going to take a while and people are now coming to this realization, because in spite of all the hype and everything, when the results are not there at the given time, well you have to face the facts, right? And so now what we're seeing is we're seeing a consolidation in the field. People that are really committed to this problem, long-term, they're willing to sink in the money, the time, et cetera, and maybe open their minds a little bit to, "Hey, it's research."
We for instance, need like strong partnerships with academia, which is why we work a lot with Stanford, MIT, and University of Michigan. We don't know all the answers, so we got to work with people who, they also don't know the answers, but we can take the scientific approach to try to figure them out. Versus just say, "It's solved, we just need to throw 100 code monkeys, or 1,000 code monkeys, or 10,000 code monkeys at it, and it's going to work." I think that's not the case. And even the engineers at these companies are actually doing a fair amount of research. Even in the engineering-heavy companies, I think so.
I was telling a Slack community that I was going to interview you and asking them if they had any questions they wanted to ask. And I thought one of the really good ones was, it's a little bit general, but you're kind of alluding to it is, what are the big academic advances coming, that'll kind of change the game for self-driving cars? And you seem like the perfect person to have a perspective on this.
One thing that I'm particularly excited about, and that I've been doing some work on is differentiable rendering. There's this huge ambitious vision, I think the academic professor that I think embodies this the best is probably Josh Tenenbaum at MIT. He's a really, really amazing professor, if you don't know about him, just check out his research. And him and his students, and Jiajun Wu who is now a professor at Stanford.
We're actually discussing with them and they have super cool ideas around this vision as inverse graphics program. And I think that's really the right way to frame the problem. Alan Yuille, another really interesting professor, was basically calling this analysis by synthesis. So the idea is, what you do with deep learning right now, which is fully supervised, is you're just learning a function that says, "Here's an image." You say jump, I say, "How high?"
It's like, here's an image, cat, dog. Just say cat or dog. Cat, wrong, that's a dog. And you do that thousands and thousands of times, right? It's not unlike how we teach, like how I was teaching my daughter colors. Like red, yellow, no, it's red, blue, no it's red. And you do this kind of thing. Then it kind of exponentially takes off and they become much smarter in their learning. But this initial phase of learning, which is rote memorization, kind of, this is how deep learning works. The problem with that is interpretability, data costs, lots of problems around that.
And so for vision, what's interesting is the world has structure, right? And there's physics, like Newton existed. There's physics of, there's gravity, there's physics of light. There's a lot of inductive biases that you can leverage, you can take basically just physics and physical laws and then try to bake it into your learning approach. And differentiable rendering or inverse graphics is one way to do it.
Basically, it's just take your sensor, you're trying to deconstruct the world, and resynthesize it. And that way you can compare in a self-supervised way what you reconstructed from what you observed. And the benefit of that is that you get systems that generalize much better, that can be trained on arbitrary amounts of raw data, don't need labels. And they also have some interpretability to them, they have some structure, right? Because they're deconstructing the world and following some structure, et cetera.
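A toy sketch of the analysis-by-synthesis loop described above, in Python. Everything here is an illustrative assumption rather than anything used at TRI: the "world" is a single blob position, the "renderer" is a hypothetical 1D Gaussian, and the fit is plain gradient descent on the reconstruction error, with no labels involved.

```python
import math

SIGMA = 2.0      # width of the blob (assumed known)
NUM_PIXELS = 16  # size of the toy 1D "image"


def render(center):
    """Toy differentiable renderer: a 1D Gaussian blob centred at `center`."""
    return [math.exp(-((i - center) ** 2) / (2 * SIGMA ** 2))
            for i in range(NUM_PIXELS)]


def loss_and_grad(center, observed):
    """Squared reconstruction error and its derivative w.r.t. `center`."""
    rendered = render(center)
    loss = 0.0
    grad = 0.0
    for i, (r, o) in enumerate(zip(rendered, observed)):
        loss += (r - o) ** 2
        # d(render_i)/d(center) = r * (i - center) / SIGMA**2
        grad += 2 * (r - o) * r * (i - center) / SIGMA ** 2
    return loss, grad


def fit_center(observed, init=6.0, lr=0.1, steps=300):
    """Recover the blob position by gradient descent on the rendering error."""
    center = init
    for _ in range(steps):
        _, grad = loss_and_grad(center, observed)
        center -= lr * grad
    return center


observed = render(9.0)           # the "sensor" reading; no label is used
estimate = fit_center(observed)
print(round(estimate, 3))
```

The supervision signal is only the difference between what was re-rendered and what was observed. Real differentiable renderers handle geometry, lighting, and cameras, but the loop has the same shape.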
Differentiable rendering is a big, big one for me, vision as inverse graphics is a big one, and there's many others. Self-supervised learning in general is something I'm very excited about, and it goes beyond just differentiable rendering. There's many other ways to leverage self-supervision, especially time when you look at video, like the temporal dynamics, contrastive learning is a super hot topic right now.
And there's interesting work, I think from Max Welling's lab, called Contrastive Structured World Models that I think is a cool paper, not really super applicable right now, but I think pure and exciting ideas and I would just leave it at that. Vision as inverse graphics, self-supervised learning, I'm super stoked about that.
I hadn't heard of contrastive learning before. Can you describe that briefly? You did such a good job with that with the other topic.
All right. Well, I mean, overall in a simple way, I would say that contrastive learning, there's a really cool paper that I can recommend everybody to read, which is the paper from the godfather of deep learning, Geoff Hinton, it's called SimCLR, Sim-C-L-R. And it explains a little bit in... It got state-of-the-art results, basically there's two big approaches in contrastive learning that work really well. SimCLR and MoCo from FAIR, Kaiming He, another super impressive researcher.
And the basic idea is, it's some form of metric learning if you want. You basically want to learn a representation that verifies some ordering property, or some distance property. A traditional way would be, here's an example, here's one that is close to it, and here's one that is far from it. And what you want, is you want to learn the properties of your representations, such that this is true, and in a very simple way.
And in general, it's related to metric learning in a general way, but the cool thing is that for instance, in this CSWM paper, the Contrastive Structured World Models paper that I was mentioning, you can look at it as temporal dynamics, things that are close in time should be close in representation in feature space, and things that are far should be further away.
It's not always true, and actually we have an ongoing work with Stanford, a paper called CoCon, co-operative contrastive learning, where the idea is, in some cases in videos, things repeat themselves. And so you want to basically leverage multiview relationships, such that you know that the same thing in multiple views should also be close. It's not just contrastive learning, but also cooperative.
But it's an exploding field, there's so much work on that. The cool thing about it, SimCLR, et cetera, it was shown that you can replace pre-training on a large labeled dataset like ImageNet by just doing unsupervised pre-training with a contrastive loss.
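The "here's an example, one that is close, one that is far" description above can be sketched as a triplet margin loss, shown here in plain Python. This is the simplest instance of the idea and an assumption for illustration: SimCLR and MoCo use different contrastive losses, and the frame embeddings and the margin below are hypothetical values.

```python
import math


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the positive is closer to the anchor than the negative by `margin`."""
    return max(0.0, distance(anchor, positive) - distance(anchor, negative) + margin)


# Hypothetical 2D embeddings of three video frames.
frame_t = [0.0, 0.0]         # anchor frame
frame_t_plus_1 = [0.1, 0.0]  # adjacent in time: treated as the "close" example
frame_far = [3.0, 4.0]       # far away in time: treated as the "far" example

good = triplet_loss(frame_t, frame_t_plus_1, frame_far)
# Swapping the roles violates the ordering and is penalised.
bad = triplet_loss(frame_t, frame_far, frame_t_plus_1)
print(good, bad)
```

Training pushes the representation toward embeddings where the loss is zero, so frames that are close in time end up close in feature space without any labels.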
Wow, super cool.
And in practice, it's a big deal, because for instance, we can't use ImageNet to deploy products. If you're wondering like, "Oh, I can just easily take an ImageNet pre-trained model, get a few labels, few shots, transfer, and use it for production." You can't really do that, unless you have a license, a commercial license or something like this. Being able to do unsupervised pre-training, which was one of the early days, early inspirations of deep learning, with restricted Boltzmann machines and whatever, you want to do unsupervised pre-training with a lot of data for a lot of time. And then very quickly fine-tune in a few-shot setting, like a few labels. And it seems like we're there now.
Very cool. All right, switching gears a little bit, I just want to make sure I ask you this question, because you were telling me that you listened to our interview with Anantha, who's a VP of engineering at Lyft, and I think he brings maybe a different company's perspective, and maybe also a different... He kind of came up through engineering and thinks of himself as an engineer. And I was kind of wondering how your answers for the same questions about taking autonomous vehicles to market would differ from what he said.
One thing that I take from what he said was, he talked a lot about the organizational aspect. I think that was really interesting because when you think about engineering and you think about the problem like self-driving cars, it's not a one man or 10-men team, right? Or women. It's not 10 people effort. The challenge is it requires a lot of people and a coordination of a lot of people, also it's a robotics problem that is pretty wide in the skill set that it requires.
You have people from hardware, we have amazing hardware people at TRI, which kind of always impresses me, because I can't use a soldering iron, even if you put a gun to my head, but these guys, they are magicians. We have really good hardware people, you have cloud people, you have all kinds of different skills. And one thing that I remember was in the podcast was, ML is a skill, right?
ML is a skill that is to be shared with everybody, and so that's why it's kind of diffused in the company to be successful at deploying this. I think that's a really good point, I agree. I would add something to it, which is because I lead a machine learning team, right? There is such a thing, so even though it's a skill and everybody should have it, I actually lead a team called machine learning, right? Machine learning research, and so if it's a skill and it's diffuse, why have a team that's like this?
And we iterated through a couple of models of "we're kind of experts", and then teams can basically request projects where we help. So we kind of like embed in other teams, but that was not necessarily super successful, we basically got back to "we do our own projects", and we try to then seed some kind of more crazy ML projects that other teams can then carry forward. In terms of bringing it to market, for me, this is the organizational challenge.
I know it's kind of maybe not a typical answer, but I think because he insisted on that, I think this is really good to... There's something called Conway's law, which is that an organization that produces software tends to produce software that's structured like the organization. Typically in self-driving cars, you have a perception team, you have a prediction team, you have a planning team, so you have a perception module, you have a prediction module, and then a planning module, right?
And then you have the whole kind of challenges as a manager, which I discovered, which is like siloing, communication across teams, all these kinds of things. And as an ML person, that's leading an ML team, what I found difficult is that, in ML you want the holy grail for self-driving cars is that they improve with experience. And I think that's one of the biggest misconceptions that people have about learning.
If you chat with like people like your grandma or whatever, about learning, and you explain to them the high-level concept, what they immediately think is that the robot learns after deployment, right? You kind of like, your self-driving car might be dumb when you buy it, but it's going to become smarter because you're going to teach it. And that's what machine learning is. And that's not at all what it is, right? That's not at all how it works, right?
There's a duty cycle, there's an operator... You retrieve data, you look at data, you label it, you test it, and then you deploy it. And this can take a long time, right? On some huge time scale this might be true, but on the short time scale, it's absolutely not true. The iteration speed is the key, and the challenges with this organization around perception, prediction, planning makes it very difficult to have the whole system optimized really quickly from use.
And so I think that's the major bottleneck for me as a machine learning person, which is, if driving from demonstrations like user experience and things like this, how can we make every system improve as quickly as possible? And this is this idea that we're very big on at TRI called fleet learning, right? Which we care about not just for cars, but for home robots in general. It's like, we have millions of years of evolution plus decades of parental education, and a machine like a car doesn't have that leisure, right?
Nobody would buy a Toyota if they had to say, "All right, I buy it at six months old" and then I have to tolerate all kinds of distractions like we were talking about just before recording. And no way people would buy a car like that, or a robot like that, right? That destroys half the home and then say, "Oh, it's okay, it's learning." We got to speed things up, right? So the learning has to be much more accelerated for machines than it is for humans, and the only way to do that is parallelism. And so fleet learning is something we're very, very big on for that purpose. Fleet learning and end-to-end system level optimization and the right organization to match behind, I would say are the three big bottlenecks to deploy any robotic system.
Interesting, I guess I kind of think of machine learning as primarily helping with perception. Am I wrong on that? Do you view machine learning as something that goes everywhere in the-
Yeah, both. Yes, you're right that today perception is the main application for machine learning, at least in robotics. The reason for it is because there's just no way around it. ImageNet competition is kind of funny, one of my mentors and one of the people I admire the most is called Florent Perronnin. And he was the head of the computer vision lab at Xerox Research.
And he won the ImageNet challenge before deep learning. And in the year of deep learning, people say, "Oh, deep learning halved the error rate." Well, they halved the error rate of Florent, who every year was improving it by 2% extra. There's some kind of inevitability to it, and again, Florent became really good at this, in the lab, and we all got into deep learning because again, we faced the evidence as scientists.
It's inevitable because it works so much better, and also because there's no other way. You cannot engineer a world model, because you do this, and then you say like, "Oh, these are the labels I need, these are the features I need, and all these kinds of things", and then the world constantly changes, the world is non-stationary. Then you have like scooters, you have literally humans flying at 30 miles per hour on the streets. And you're like, "Wait, what? Is that a pedestrian? Is that a motorcycle? Is that a bird? Is that Superman? What the hell?"
And so it's inevitable and it works so much better. For perception, it's no brainer, even the most hardcore feature engineering, passionate people or people that believe there's an equation for everything. Nobody I know argues that this is the wrong approach to perception. But it's not the solution either, it's not like a slam dunk either because we need to go beyond that.
I would say robust perception is not solved, some form of perception when you know everything, et cetera. And you don't care for these nine nines of reliability, right? You can get really, really far, but uncertainty modeling, handling like false positives and all these kinds of things, that's a really hard problem. That's why in machine learning, every abstraction is leaky. Going back to PyTorch, that's why I like minimizing abstractions, because any abstraction is leaky.
And the problem with the modular robotic stacks, like perception, prediction, planning, is that you're making abstractions, you're making APIs. And the contracts you're making, if you think microservices type of things, they're all statistical in nature.
You're kind of saying, I'm going to give you something that I'm calling a red traffic light, and I'm confident that 99% of the time I'm right. What happens during this 1%? You're on your own, right? And it's unavoidable, right? Because no system will ever be perfect, and you shouldn't require a robot to be perfect. It needs to be better than a human, but it doesn't need to be perfect, otherwise you will never ship. How do you robustly handle uncertainty, and how does it propagate through each layer, and how do you think statistically versus logically or symbolically?
And that's becoming harder and harder as you move from perception, to prediction, to planning, because then planning is actually reasoning, right? It's search, it's reasoning, it's a higher-order cognitive function in a sense. And manipulating just feature vectors, like esoteric feature vectors, it's not really how it works.
These neuro-symbolic systems are the best of both worlds. Marco Pavone is an awesome Stanford professor who's doing cool research on that. How do you combine deep learning with more logical forms of reasoning? That's something we're also looking at at TRI a little bit.
How does it work today? Do you actually send more information? I feel like other people that I've talked to have talked about not just sending the output of the perception algorithm, but maybe even some of the activations of the parts of the neural network before the output. But then I wonder, what do you do with that downstream in a sort of logical system? How does TRI handle that?
Right. Actually that's not the approach we're taking, because you're right. It's kind of you're just pushing the thing under the rug. It's like hot potato game, it's like, "I don't have to solve this problem, there you go". And typically it doesn't really work well across teams is like, "I don't know if this going to work, but that's your job now." My personal holy grail is like building an end-to-end differentiable, but modular system.
It's still like engineering what you know, but learning what you don't. And so what it means is that you still have a perception module. It still outputs some concepts like, "Oh, this is a person." People exist, roads exist, we know this, right? The problem is that we're unsure whether our inference about them is right. Here, my boss Wolfram Burgard, who is one of the legends of robotics because he wrote this book called Probabilistic Robotics with Sebastian Thrun and Dieter Fox, and created this whole movement, one thing we discuss very often with Wolfram is, there shouldn't be an argmax, right?
If you have an argmax in the middle, somewhere upstream, you are basically destroying uncertainty, right? You're just forgetting any uncertainty you have. And what's really interesting is that this is coming from a theoretical perspective, but also, again, from an organizational perspective: if you are the planning team and I give you something from perception, and I tell you this is a red traffic light, and I'm wrong, you're going to be saying, "Hey, we crashed, it's your fault, you're wrong, fix it." And I'm like, "Well, but I cannot always be right, and you will be la, la, la, right?"
This is not how it works. The things you pass, every data structure that you pass, every piece of information that you pass is a distribution. It's probabilistic in nature. I know I'll sound like a Bayesian crazy guy, but I'm not a Bayesian guy, but from just a principled approach, you aren't certain about everything, right? That's a good principle in life too, you shouldn't be too confident in everything. So you pass the uncertainties.
Very concretely, your object detector, you try as much as possible to not argmax, let's say over the logits, to say like, "Oh, this is a person I'm sure." You're passing the full probability scores. And then you have to handle it downstream. You have to have a model that doesn't say, "If person, do this", right? That breaks any kind of rule based system you would have downstream.
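To make the "no argmax in the middle" point concrete, here is a minimal sketch, not TRI's actual system. The class names, actions, and cost table are all hypothetical values chosen for illustration. It compares a downstream rule that acts on a single hard label with one that acts on the full probability vector.

```python
import numpy as np

# Hypothetical labels, actions, and costs, chosen only for illustration.
CLASSES = ["green_light", "red_light"]
ACTIONS = ["go", "stop"]
# COST[a, c]: cost of taking action a when the true class is c.
COST = np.array([
    [0.0, 100.0],  # go: free on green, very costly on red
    [1.0, 0.0],    # stop: small delay on green, free on red
])

def act_on_argmax(probs):
    """Collapse the distribution to one label, then apply a hard rule."""
    label = CLASSES[int(np.argmax(probs))]
    return "stop" if label == "red_light" else "go"

def act_on_distribution(probs):
    """Keep the full distribution and pick the lowest expected-cost action."""
    expected_cost = COST @ np.asarray(probs, dtype=float)
    return ACTIONS[int(np.argmin(expected_cost))]

# The detector is 90% sure the light is green, 10% red.
probs = np.array([0.9, 0.1])
print(act_on_argmax(probs))        # the 10% chance of red is thrown away
print(act_on_distribution(probs))  # the 10% chance of red still matters
```

With these made-up costs, the hard-label rule drives through while the expected-cost rule stops, because the residual 10% probability of a red light is still visible to the downstream decision.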
You have to digest uncertainty. We have a recent paper at IROS where we showed that, for instance, you can pass probabilistic perception outputs into imitation learning, like behavior cloning. That work is with ETH [Zurich], with Andreas Buehler, an intern of mine. It's going to be published soon. It's about passing uncertainty and leveraging uncertainty in the representation for downstream applications.
We also have very cool research with Stanford, with Boris Ivanovic and Haruki Nishimura, two wonderful PhD students at Stanford working with Marco Pavone and Mac Schwager, and it's interesting, it's people in robotics and aeronautics, et cetera. And they're really, really good at thinking about safety and these constraints. And so here the idea was, Boris made a paper called Trajectron, Trajectron++, which takes in tracks of objects and can output multiple possible future trajectories.
And that's great, you can predict the future on that. But the problem with that is that it's very difficult to leverage in a planner. Now we can say, "Oh, I could go left, I could go right, I'm not sure". And then the planner is, "how do I decide", right? And if you're too conservative, right? If you mind safety and you're too conservative, then what happens is that everything is possible, therefore you have the frozen car problem, right?
It's like, I don't know what to do, everything is possible, therefore I will not move. So then you have a self-driving car, but it stays in the garage, right? Not great. So with Haruki and Boris, we basically did a system where we modified some of the controls. So it's like, you have to have very deep knowledge about control, and people like Mac Schwager and Marco Pavone are really super, super smart about this.
And this is called risk-sensitive control, where, basically, what you can do is leverage these different samples from the trajectories, and reason in terms of control of "how do I minimize my risk?". How do I optimize my objective, like I want to drive, I want to go there, right? But at the same time, I want to avoid collisions. And so a really interesting thing is that there's a simple mathematical trick called the entropic risk.
And I can refer to the same thing published at IROS, and you can find this on my website, where you can basically change the objective function. So it's almost just a change of mathematical formulation of the optimization problem, of how to plan and you can have a very interpretable high level variable that's called the risk sensitivity to say, "If you're risk sensitive, you can go there. If you're risk neutral, you can go there. If you're risk-seeking, you can go there."
And then the problem becomes, how do you address this? And we have follow-up work on that too. ML is everywhere, but it's starting from perception and more and more moving to prediction. And I think where the cutting edge really is, is in the planning and control side, because we also did some work with Felipe Codevilla and other folks that are now at Mila, Yoshua Bengio's lab, where we looked at behavior cloning. So just learning a policy from pixels to steering, but that is still far from a very well-engineered stack in a domain that you know. How do you bridge these gaps? ML is everywhere, of course I'm biased.
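The entropic risk mentioned above has a standard textbook form: for a random cost J and a risk-sensitivity parameter θ, it is (1/θ) · log E[exp(θ J)], which tends to the plain mean as θ goes to 0. The following is a minimal sketch of that formula applied to sampled costs, not the formulation from the paper. The two plans and their cost samples are hypothetical.

```python
import numpy as np

def entropic_risk(costs, theta):
    """Entropic risk (1/theta) * log(mean(exp(theta * cost))) over samples.

    theta > 0 penalizes rare high-cost outcomes (risk-averse),
    theta < 0 discounts them (risk-seeking), theta == 0 is the plain mean.
    """
    costs = np.asarray(costs, dtype=float)
    if abs(theta) < 1e-12:
        return float(costs.mean())
    z = theta * costs
    m = z.max()  # log-sum-exp shift for numerical stability
    return float((m + np.log(np.mean(np.exp(z - m)))) / theta)

def choose_plan(plans, theta):
    """Pick the plan whose sampled costs have the lowest entropic risk."""
    return min(plans, key=lambda name: entropic_risk(plans[name], theta))

# Hypothetical costs of two plans under four sampled futures of other agents.
plans = {
    "wait":  [1.0, 1.0, 1.0, 1.0],  # always a small delay
    "merge": [0.0, 0.0, 0.0, 3.6],  # usually free, occasionally bad
}
for theta in (1.0, 0.0, -1.0):
    print(theta, choose_plan(plans, theta))
```

In this toy setup the single parameter θ changes the decision: the risk-averse setting prefers to wait, while the risk-neutral and risk-seeking settings prefer to merge, which is the kind of interpretable knob described in the conversation.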
Right, right. And I guess your team sits outside of any of these teams. It's sort of like going back to Conway's law. Your goal is to put ML, I guess, in every component of the-
Exactly, exactly. Yeah, right. So we're trying to find applications wherever possible. And again, the holy grail is that we want to improve the end-to-end system of experience.
And going back to what you said about fleet learning. How does that work? I mean, what actually can you learn from the fleet, and how real time is that?
Right. That's a good question, an insightful question. You could argue that today, fleet learning exists, but it's the disappointing version of it, which is just data haul back, put in a data lake, ETL, label, all these kinds of things. Basically anybody that kind of does data science does fleet learning in a sense. For robots, the whole kind of spiel, it's like what Steve Jobs was saying, is that a 10X quantitative improvement is a qualitative improvement, right?
And so if you really improve the cycle, the iteration speed, that's where you're going to get to true fleet learning in the proper sense. One way is to just make that same process just faster by just optimizing it. So reach all of your data faster and iterate faster on it, and redeploy the models faster, right? And so that would be, for instance, looking again at this theory of constraints, or theory of lean, look at the bottleneck, the bottleneck is labeling. So that's why we work on self-supervised learning. Being able to do faster and faster fleet learning, in this sense, would be just "Get more out of self supervision", so that you can iterate quicker, and update the model with less labels. That's the big direction we're doing.
Another one is to start to look more towards the holy grail of lifelong and continual learning, where you have things like federated learning, and these kinds of things, where what you share is not the data, but, let's say, the gradients with respect to a local objective that you computed, for instance, right?
That has some benefits also in terms of privacy, in terms of communication, in terms of many, many things. We can do lifelong distributed learning in this way and federated learning, and these approaches exist. Ultimately, beyond just sharing data or just sharing gradients, you would want to share more useful kind of concepts, the distillation of what you learned, right?
Because here you're never distributing the learning, you're distributing what enables a centralized learner to then share it back, right? But what you want is to distill what you learn, share what you learn, right? And so that's what ultimate fleet learning has to become. It's not completely clear exactly what's the best way to do that, but there's a lot of exciting research on it. There's super cool research on meta learning, for instance, from Chelsea Finn at Stanford, Sergey Levine at Berkeley, that's exciting.
One of the exciting research problems is how do you do very efficient fleet learning, where what you share is the distilled experience of each robot individually, so that they learn as fast as possible.
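The gradient-sharing idea described above (each robot shares gradients of a local objective instead of its raw data) can be sketched as follows. This is a deliberately simplified synchronous gradient-averaging loop on a toy linear model, not a specific federated learning algorithm or library. The fleet, data, and learning rate are all made up for illustration.

```python
import numpy as np

def local_gradient(w, X, y):
    """Gradient of mean squared error for a linear model on one robot's data.

    Only this gradient leaves the robot; X and y stay local.
    """
    residual = X @ w - y
    return 2.0 * X.T @ residual / len(y)

def federated_round(w, fleet, lr=0.1):
    """Central step: average the robots' gradients and update the shared model."""
    grads = [local_gradient(w, X, y) for X, y in fleet]
    return w - lr * np.mean(grads, axis=0)

# Hypothetical fleet of 5 robots, each with 50 private (noiseless) samples.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
fleet = []
for _ in range(5):
    X = rng.normal(size=(50, 2))
    fleet.append((X, X @ true_w))

w = np.zeros(2)
for _ in range(200):
    w = federated_round(w, fleet)
print(w)  # approaches true_w without any robot sharing its raw data
```

Note that the learning here is still centralized, which is exactly the limitation raised in the conversation: the robots only supply what a central learner needs, rather than sharing a distilled version of what each one has learned.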
Makes sense. And I guess, it's a little bit of a diversion, but I just wanted to touch on the meta learning a little bit, because you're one of the people that really pushed us to build this Ray Tune integration that we've actually launched recently. We're really excited about, I'm curious if you could say a little bit about how you think about hyperparameter search, and just sort of like the optimization in general. When do you do it, what value does it bring you, what strategies do you use?
Right. I'm a big fan of Hyperband. Simply because of this idea: first, the formalism is nice, it's formalizing it as an online learning problem, bandit style. The second thing is, I think the best way to be efficient is to reuse computations, and that's exactly what they're doing. And because everything we're doing is SGD, right? Stochastic Gradient Descent, it's an iterative process.
Having this idea of continuing optimization of the best solutions, and selecting things like this, I think that leverages some unique aspects of the optimization. In the early days, I was using Hyperopt and I liked it a lot. And this notion of, you model your parameters as random variables, your hyperparameters as random variables you don't know. And so what you do is you sample from them and then you fit the distribution.
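The "keep training the best candidates and reuse what was already computed" idea can be illustrated with successive halving, the subroutine that Hyperband runs with several different budgets. The sketch below shows only that subroutine, not full Hyperband, and it is not the Ray Tune or Hyperopt API. The toy objective (gradient descent on x², with the learning rate as the hyperparameter) is an assumption for illustration.

```python
def successive_halving(configs, train_step, evaluate, min_steps=1, eta=3):
    """Train all configs briefly, keep the best 1/eta, and give them more steps.

    Each survivor resumes from its saved state, so earlier compute is reused.
    """
    survivors = [{"config": c, "state": None, "steps": 0} for c in configs]
    budget = min_steps
    while len(survivors) > 1:
        for s in survivors:
            for _ in range(budget):
                s["state"] = train_step(s["config"], s["state"])
                s["steps"] += 1
        survivors.sort(key=lambda s: evaluate(s["state"]))  # lower loss first
        survivors = survivors[:max(1, len(survivors) // eta)]
        budget *= eta
    return survivors[0]

# Toy problem: minimize x**2 by gradient descent; the hyperparameter is the
# learning rate and the "model state" is just the current value of x.
def train_step(lr, x):
    x = 5.0 if x is None else x
    return x - lr * 2.0 * x

def evaluate(x):
    return x * x

learning_rates = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 0.45, 0.9, 1.2]
best = successive_halving(learning_rates, train_step, evaluate)
print(best["config"], best["steps"])
```

Here nine candidates get one step each, the top three get three more steps, and only the winner has been trained for four steps in total, which is much cheaper than training all nine to completion.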
When do we use it? Well, that's a hard question, actually. That's a really, really hard question, because when you're in production, you use it whenever you're about to deploy the model, you have some good confidence that this model is working well. And now you want to squeeze a little more out of it and you have a clean protocol, right? Train, eval, test set split, or you do something a bit more sophisticated than just a good split.
But already a good split is, it can go- When you build your own datasets, if you build a large validation set, a large test set, and it's diverse, and you're good at building those data sets, then that brings you a long way. In production, I think the answer is easier, but in research it's hard because in research you're like, "Is it not working as well as I intended because I have a bug?" Because it's typically more fresh code, right?
Production is like, the code has seen more pairs of eyes and more iterations, but in research it's more fresh. So is there a bug? My advisor, Cordelia Schmid, actually taught me this during my PhD: whenever there is something funky, you ask yourself five times whether there's a bug. And it's funny because I discovered that there's this exercise in safety, in critical analysis, called the five whys.
It's asking why five times when there's a fault or there's a problem. You think "five times, easy"; by the third time, you're like, "oh, this is hard." Is it a bug or is it just a hyperparameter search problem? And so typically, hyperparameter search is kind of a hammer or a bazooka. You don't want to use it to kill a fly, right? I tend to try to delay the use of it until I believe that's the source of the issue. So that would be my answer.
It's interesting. We've gotten very strong reactions, both ways I think, in the interviews that we've done. I mean, some people are like, "Oh, hyperparameter search keeps you from actually learning the underlying structure of what you're doing." Other people are like, "Why would you spend any time figuring out what hyperparameters you want? Just let the machine do the work." So it's interesting.
I think there's these two phases, right? You start to develop the intuition and you experiment with it, because anyway, hyperparameter search is not magic. You have to decide the ranges of your hyperparameters, right? Or even if you use Bayesian hyperparameter optimization, you have to decide even the probability distribution. Is it log-normal? Is it this kind of thing, and the bounds?
And it's typically an iterative process, right? You do a first guess, you realize, "Oh, my optimum is on the edge of the grid search, on my edge of my grid search. Okay, I need to extend it", right? And so you iterate on that. It's still not, in my experience, maybe more on the research side, it's not like, "Oh, yeah, and now hyperparameter search, done." It's not like this.
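The "my optimum is on the edge of the grid, so I need to extend it" check described above is easy to automate. The helper names and the factor-of-ten extension below are assumptions for illustration, not part of any particular tuning library.

```python
def best_on_edge(grid, scores):
    """Return (best_value, on_edge) for a 1-D sweep where lower score is better."""
    best_index = min(range(len(grid)), key=lambda i: scores[i])
    best_value = grid[best_index]
    on_edge = best_value in (min(grid), max(grid))
    return best_value, on_edge

def extend_log_grid(grid, factor=10.0):
    """Add one more point on each side of a log-spaced grid."""
    return sorted(set(grid) | {min(grid) / factor, max(grid) * factor})

# Hypothetical sweep: validation loss keeps improving as the value grows,
# so the best point sits on the upper boundary of the range.
grid = [1e-4, 1e-3, 1e-2]
scores = [0.9, 0.5, 0.3]
best_value, on_edge = best_on_edge(grid, scores)
if on_edge:
    grid = extend_log_grid(grid)
print(best_value, on_edge, grid)
```

A boundary optimum only says the range was too narrow; the extended sweep still has to be run and inspected, which is why the process stays iterative.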
Well, all right, cool. We're running out of time, and I always like to end with two questions for consistency, and maybe we can do some analysis on the answers one day. But the penultimate question that we always ask is, what is an underrated ML topic that you think people aren't talking about as much as they should for how valuable that might be?
I would say the things that we mentioned earlier was like how little machine and how little learning there actually is in machine learning. I was having a nice chat with Shivon Zilis and she was saying, "I'm interested in machine not learning." And I liked how she phrased it. I think more seriously there's bigger concern around ethical use of AI and all these kinds of things, right?
And there's this, always the saying of because you could, doesn't mean you should, right? And so the question in machine learning is very much...that I think is interesting is "what should not be learned". And so there are certain applications that should not be pursued, things like classifying, like justice legal appeals. We've seen all kinds of horror things, especially computer vision, sadly, there's a very easy way to...
There's just applications you shouldn't do. And I think as a leading researcher on these questions, she has done a great tutorial, I think with Emily Denton, at CVPR this year, which I can recommend to your readers for thinking about this, or the socio-technical challenges there. But in practice also, and more like once you work on a good problem, like saving lives, or self-driving cars, or helping people age in place with home robots, and you know you should do these kinds of applications, then the question becomes, "which part of my system do I design? Which part of my system do I learn?"
And this is tricky because you need some design to generalize, because if you try to learn everything, like the example, James Kuffner, former CTO of TRI, and the head and CEO of TRI-AD, he was basically telling me, "you don't want to drive in front of a school a thousand times, and risk bumping into kids, to understand that the limit is 25 miles per hour", right?
There are certain things you need to engineer, but there's certain things you need to learn. Where do you set that line is very difficult, and the second answer would be labeling. If you listen to anybody that's talking about self-driving cars or everything like this and AGI, and all these kinds of things, never ever did they mention labeling. And if you know what's happening behind the scenes in the industry is that this is what is going on, this is what learning is.
That's what I said, there's very little machine in machine learning because it's human labor, a tremendous amount of human labor, like hundreds or thousands of labelers today, right? You don't have self-driving cars, and yet we have thousands of labelers for a single application that are just clicking on pixels. And you know about this, right? You must do this. My take on this is that you must label for testing purposes, but as much as possible, I would like to avoid labeling for training purposes, because the training has to be continuous. So that would be labeling: it's just surprising how few people are talking about it in industry, the costs, and the scalability issues that go with it.
Interesting. All right, well, I could ask questions forever about that, but I want to ask my final question and wrap this up. What is in your experience, the hardest part about taking models, and getting them deployed in production and use, where are the big bottlenecks?
Right, right. Scalability is such an obvious answer, right? In research land, we have the idea, we have this prototype, we use a standard data set, or even build our own data set, to prove an idea. And then you first have a human element to it, which is you have to convince people to run with that idea, and that should be enough, right? You should have a nice W&B report, and they go, "Yes, awesome, we're going to put 20 or 30 or 100 engineers on it", just based on that.
There's some amount of convincing people, like to go from research to production. You have to be compelling, enough compelling evidence. That's more on the researchers, and our bottleneck is to convince people. And part of it is the scalability, right? People might say "I understand the idea, I can see that this works, like your evidence that you present, but I don't think it's going to work in this scenario, or I don't think it's going to be cost-effective, or I don't think it's going to be easy to just scale up computationally speaking" or something like this. Worrying about scalability is something we really do.
So one of the things that I did earlier this year is that, we kind of split the efforts between, in the ML team, between research and engineering. And Sudeep Pillai, is now the head of ML engineering at TRI. And he's really driven to this, whenever I talked to him, he's like, really that's his drive, is that, how do we think about scalability? And he's doing really cool work with his team on semi-supervised learning, and these kind of ideas to try to scale things up so that we can go from research ideas that are maybe not scalable to more well-formed transferable research idea that is shown to work at scale already at this stage, or to prototype stage. And then that can maybe be transferred.
What does it mean for an algorithm to not be scalable?
For instance, with labeling, you say, "Yeah, I can get to great performance, I can get to 80% with a data set this size." And then you look at the performance improvements with the training set size, and then you realize that the improvements are logarithmic, and this is not good, right? And you say, "Oh, it improves with data, that's great." But then your cost is going to be, all right, you need a billion-dollar dataset. You don't really want to do that, and a billion dollars of compute, right?
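A quick back-of-the-envelope calculation shows why logarithmic improvement is bad news for labeling budgets. The sketch below assumes accuracy follows a + b · ln(n) in the number of labeled examples n, fits that curve through two measurements, and extrapolates. All numbers are hypothetical, and real learning curves may follow a different shape.

```python
import math

def fit_log_curve(n1, acc1, n2, acc2):
    """Fit acc = a + b * ln(n) through two (dataset size, accuracy) points."""
    b = (acc2 - acc1) / (math.log(n2) - math.log(n1))
    a = acc1 - b * math.log(n1)
    return a, b

def labels_needed(a, b, target_acc):
    """Invert the fitted curve: dataset size at which it reaches target_acc."""
    return math.exp((target_acc - a) / b)

# Hypothetical measurements: 80% with 100k labels, 83% with 200k labels,
# i.e. each doubling of the labeled set buys about 3 points.
a, b = fit_log_curve(100_000, 0.80, 200_000, 0.83)
needed = labels_needed(a, b, 0.95)
print(round(needed))  # roughly 3.2 million: five doublings for 15 more points
```

Under this assumed curve, every fixed gain in accuracy costs a constant multiple of the existing dataset, so labeling cost grows exponentially with the target accuracy.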
This is a bit like the OpenAI story, which is if you have infinite money, infinite labelers, infinite time, infinite compute, how far can you go? And it's super interesting to see these guys, it's amazing what they do in terms of pushing the boundaries on that. But for some applications in the real world, that's not really reasonable. You have to kind of think about how you scale, and we're not trying to solve everything at once. Really for us, the compute is something we kind of ignore for now, and we're saying, "Let's assume you have infinite compute, but you don't have infinite labeling budget, right?"
I was going to say, we see OpenAI's usage, and your usage, and your usage is not small.
I'll take it as a compliment. No, we're well-funded and we're utilizing it well. So computation, right? Again, I said, not enough machine in machine learning, but that's what we're trying to do. We're trying to make machines work for us, right? It's kind of funny how people were saying, "It used to be that humans play video games, and then they do the work, but now it's kind of like we work so that machines can play video games." We make algorithms that can play Atari, while we don't play Atari games anymore. So that's weird.
All right. Well, thanks so much, Adrien. It's so much fun to talk to you.
Likewise, it was a pleasure.