Drago Anguelov — Robustness, Safety, and Scalability at Waymo
Drago discusses current trends in autonomous driving technology, the challenges of simulation and scalability, and why it's important to find rare examples.

About this episode
Drago Anguelov is a Distinguished Scientist and Head of Research at Waymo, an autonomous driving technology company and subsidiary of Alphabet Inc.
We begin by discussing Drago's work on the original Inception architecture, which won the 2014 ImageNet challenge and introduced the Inception module. Then, we explore milestones and current trends in autonomous driving, from Waymo's release of the Open Dataset to the trade-offs between modular and end-to-end systems.
Drago also shares his thoughts on finding rare examples, and the challenges of creating scalable and robust systems.
Links
- "SPG: Unsupervised Domain Adaptation for 3D Object Detection via Semantic Point Generation", Qiangeng Xu et al. (2021)
- "GradTail: Learning Long-Tailed Data Using Gradient-based Sample Weighting", Zhao Chen et al. (2022)
Transcript
Note: Transcriptions are provided by a third-party service, and may contain some inaccuracies. Please submit any corrections to angelica@wandb.com. Thank you!
Intro
Drago:
I think that simulator has this huge scaling promise. You take any scenario you saw, you release the agents and you release yourself, and you can try all kinds of stuff and they can try all kinds of stuff.
Lukas:
You're listening to Gradient Dissent, a show about machine learning in the real world. I'm your host, Lukas Biewald.
Today, I'm talking with an old friend, Drago Anguelov, who is currently a Distinguished Scientist and Head of Research at Waymo. He's been working on image models for at least the past 20 years and was another student of Daphne Koller, who was also on this podcast. This is a super fun conversation. I hope you enjoy it as much as I did.
The story behind the Inception architecture
Lukas:
My first question is something I hadn't realized, which is that you were one of the authors of the original Inception architecture.
Drago:
That's right.
Lukas:
I should know but I somehow missed that. Can you tell me how that came about and what you were thinking about at the time?
Drago:
Actually, the story goes back even before.
At Google, I worked on Street View for a bit, which was related to autonomous driving and one of the areas where, actually, computer vision used to work pretty well. There were two: face detection and license plate blurring. That worked pretty well at the time.
The other thing that worked really well is 3D reconstruction from cameras, from LiDAR. That's what I used to work on. Bundle adjustment and so on.
Lukas:
Were those deep learning models at the time?
Drago:
None of them were deep learning.
Actually, we had one person in 2008 or '09. He came from Microsoft. I think his name was Ahmed or Abdul. He used deep nets to essentially detect and blur license plates. Everyone was very unhappy that he used deep nets, because it was his own code base and no one else was doing anything like it. Of course, you could modernize and upgrade it by doing support vector machines.
Lukas:
Right.
Drago:
Eventually, people tried to modernize and upgrade the neural net things with support vector machines, and they didn't quite succeed. I think they regressed a bit, but everyone used technology they understood.
I didn't work exactly on that problem, but I think at the time, that's how we used to do it. That was in 2009, maybe in '10, right? And after working in this field, I decided that maybe I should do something more adventurous in my career and join a team in Los Angeles that essentially was called Google Goggles. It was not the glasses. It was a little app that did computer vision, and we used to use it for experimental computer vision tasks.
There, we started experimenting with different applications of learning and deep learning to computer vision. How can we recognize these objects in these pictures?
At the time... I was a tech lead manager of a small team. There were four of us. Half of us did graphical models, deformable parts models. You may be familiar — when I was a student of Daphne Koller, we did a lot of those. Then the other half of us were experimenting with deep learning. That was Christian Szegedy and Dumitru Erhan.
In those early days, the deep learning model at Google was something called QuocNet, which is a big, non-convolutional neural network that Quoc, a student of Andrew Ng, brought.
For a while, we were trouncing it with deformable parts models, and I was working with an intern who later became a Googler. We had the best deformable parts type detector. We even collaborated with Professor Deva Ramanan. He was also in the LA area at the time. We had the collaboration, built something nice.
That's where Christian Szegedy came in. He was actually on my team. For a while, the deformable parts were beating the deep nets. But then eventually, AlexNet came in and then all of a sudden, no custom solution could beat the deep nets, and so we switched to this. But we were early on this already. We had people that had been doing this for a while.
So two interesting things happened. We started optimizing the architectures because, actually, at Google, that was the easiest thing to do. The system for training them was called DistBelief. It was pretty unwieldy, and so you couldn't be too smart. The easiest thing to do is to just tweak the architecture.
We're tweaking it and Christian, one day, comes to me and says, "Hey, Drago. I have this idea. It's a Hebbian-inspired idea. I'm going to train this new architecture." I was like, "Oh, Christian. Very nice." I mean, we had been playing too. I had some versions of the architecture that were 1% or 2% better or something.
"What part did you change?" It's like, "I'm going to change everything."
I was like, "Oh, that's a great engineering approach. Aren't you worried that...who knows what will happen?"
"No, I have a good intuition. I'm already training some. It's doing great."
A bit later, he's like, "Look at this thing!" It beat anything we've ever seen. And that's when we decided to also do the ImageNet challenge. We had this and some detector work as well. SSD came out of it. That's also a very strong contribution by Christian.
But he was bold and he decided to just try more ambitious things. And in these early competitions, people still tried to do a lot of smart things in the old style. They tried to embed known algorithms in the networks instead of making the networks better. And we, for good or bad, were in an environment where the easiest thing we could do was make the networks better.
That's what we did, and I think that really helped early on.
Lukas:
What was the intuition that he had to...what was the tweak that really made a difference?
Drago:
I mean, I think there is...if you remember the Inception architecture, in each module there were several paths. One path was doing 1x1 convolution, mostly just adding depth processing. Then there were 3x3 and 5x5 convolutions, and those were expanding the receptive fields, right?
Then you had a separate channel for each, which kept the model still tractable, so it's not the full number of inputs times number of outputs. You had something like a block-diagonal — not quite — structure, because you had three channels, and then you, again, condensed the information from those.
That was the idea for a block. It's a nice, compact way to still add a lot of rich structure in depth. That was, I think, very powerful.
I think if you ask Christian, he'll give you a whole other story of why he came up with this model. I'm not sure I'm the best person to channel it. You should invite him to...I mean, he did a lot of these early visionary things. But we all worked on it together, and that's how I was part of it.
The other thing we...actually, again, Christian was very involved. We discovered...this was 2013 or so. Deep nets, we used a lot for classification, but not many had used them for detection. Hartmut Neven, who was our director, came and said, "Hey, Christian. I have this idea." I think some of it came from Christian. "Let's make a better detector. We just backpropagate the signal through the network and see which parts of the image caused it to fire that it's a cat. If you do this, you will find where the cat was, because the network will highlight the cat for you."
We're like, "Oh, that's a cool way to do an object detector. We don't need to...we can mostly use a classifier. Let's try it."
We tried a few versions, but Christian tried it and it's like, "It doesn't work." And it's like, "Why doesn't it work?"
"Well, the image doesn't change. Now, it says 'It's not a cat.' It's whatever, 'giraffe.'" I mean, you name it. We're like, "That's strange." He debugged for a long time.
I mean, it's also that at the time, the system was complicated. The code we had written was not easy to debug. Maybe two months, he debugged, including trying on MNIST. On MNIST you could do the same.
Then eventually we realized, "Okay, something is happening here. There are these adversarial examples. You can just flip the label without much visible...any changes in the image." We started off trying to discover a detector, and then ultimately, he ended up with the paper. They bundled several discoveries in that paper, but by far, the primary one was the adversarial examples.
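For readers who want to see the structure Drago describes, here is a minimal PyTorch sketch of an Inception-style block: parallel 1x1, 3x3, and 5x5 paths with 1x1 bottlenecks, concatenated at the end. The channel sizes are illustrative, not the exact GoogLeNet configuration (which also includes a pooling path).

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Illustrative Inception-style module; channel sizes are assumptions."""
    def __init__(self, in_ch):
        super().__init__()
        # 1x1 path: adds depth processing without growing the receptive field.
        self.p1 = nn.Conv2d(in_ch, 64, kernel_size=1)
        # 3x3 and 5x5 paths expand the receptive field; the 1x1 bottleneck
        # first reduces channels, so the cost is not full inputs x outputs.
        self.p3 = nn.Sequential(
            nn.Conv2d(in_ch, 48, kernel_size=1),
            nn.Conv2d(48, 64, kernel_size=3, padding=1),
        )
        self.p5 = nn.Sequential(
            nn.Conv2d(in_ch, 16, kernel_size=1),
            nn.Conv2d(16, 32, kernel_size=5, padding=2),
        )

    def forward(self, x):
        # Each path sees the same input; condensing by concatenation gives
        # the roughly block-diagonal structure described above.
        return torch.cat([self.p1(x), self.p3(x), self.p5(x)], dim=1)

block = InceptionBlock(in_ch=192)
print(block(torch.randn(1, 192, 28, 28)).shape)  # torch.Size([1, 160, 28, 28])
```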
Lukas:
That must have been an exciting moment. Did it feel like these image tasks were getting better much faster at that time or did it feel like a gradual change?
Drago:
I mean, it was a very exciting time, right? When a whole new field opens in front of you. Let's try to do computer vision with deep nets, and most of it hasn't been done. And most people are not doing it either, right?
I mean, there were a lot of developments at that time. Every few months, something pretty major happened. And strangely enough, this continues. If you'd caught a bunch of people in 2015, '16, and said, "Okay. What's left in computer vision, in 2D computer vision? How much more should we do?"
We're like, "We're pretty good already." I mean, that's why I went to self-driving. I was like, "Okay, 2D computer vision on images is pretty good now, in 2015. Let's do cars. That's a whole other game."
But early on, there were a lot of big developments. Batch normalization came out — Sergey Ioffe and, again, Christian Szegedy were involved. That's down the line. I mean, in Google Brain, people did a lot of really cool things.
It was just like one after the other; there was a group of people. That was also a time when a lot of academics came to Google to do deep learning, right? Later, a lot of them went back to academia — they realized they can still do it there.
For a while, it was like, "We just need to do it in the big companies." There was a bit of this. At least that was my exposure to it. Maybe people have different interpretations. It went back and forth. Now with what people call foundation models and the big transformer language models, people say, "Maybe we should be in the industry again," right?
But there was a time when people could go back to academia and not feel too deprived.
Lukas:
I think there have been four versions of Inception, right? Are people still working on improving these architectures or does it feel like we've squeezed out all the improvements from that?
Drago:
I mean, it's moved on.
There is actually a guy on the Waymo research team called Mingxing Tan, who worked with Quoc Le, of the famous QuocNet I described. That was not convolutional. Hey Quoc, I don't mean anything bad. These folks are doing great work.
I think Christian just moved away from trying to improve the architecture. So, there was Inception, and then there were... I think Francois Chollet — who did Keras, and who was also briefly on the team I led at Google, that was back in '15 — had Xception, another variant he simplified.
Lukas:
Yeah, that's in the Keras library for sure.
Drago:
So, he developed that. Afterwards, people moved to the large transformer models, right? XCiT, and Swin Transformers, and Google. Mingxing Tan, some of his work. There's a model called CoAtNet.
In our times, top-1 on ImageNet, we would get maybe 70% accuracy. Of course, top-5 is a lot better. People used to score at top-5. Now, people can get, I think, 90% top-1, with a lot of pre-training on large datasets and other things.
But this CoAtNet is a hybrid convnet-transformer, and it's dramatically bigger than the models we used to train, and it's pre-trained on a lot more stuff, potentially. People have pushed what's possible on ImageNet a lot further. Now, I'm not sure how much further you can push it, given the inherent limitations of the dataset.
Lukas:
Right.
Drago:
But people are very good at ImageNet with different technology than what we used to do.
Lukas:
What's the inter-annotator agreement on ImageNet? How well do humans do on ImageNet?
Drago:
I have no idea. I mean, in the old days, Andrej Karpathy did a test where he tried to label the test set after training himself. And I think the models were competitive with him. By now, I think they blow humans out of the water.
Speaking of Andrej Karpathy, actually, it's a small world. In 2012, when we were doing deformable models and deep nets — actually, I was going there with the story and then we went in other directions — I had the chance to pick either Andrej as an intern to do deep learning, or a guy called Xiangxin Zhu from Deva Ramanan's lab to do deformable models. And I picked Xiangxin Zhu, right?
So, I never got to work with Andrej. Maybe to my peril, but yes.
Lukas:
I remember I interviewed with you to be your research assistant as a master's student and you chose Jimmy Pang, who is a very talented guy.
Drago:
Oh, my. He's a very talented guy. I'm sorry. Don't hold it against me.
Lukas:
No, he's a good choice. I can't hold that one against you.
Drago:
Hopefully, it worked out for all of us.
Lukas:
It worked out for everyone.
Trends and milestones in autonomous vehicles
Lukas:
I'm really curious about...you've been in autonomous vehicles for quite a while. From the outside, it feels like autonomous vehicles are steadily improving. Sorta feels inevitable to me, I guess? But it's so hard to tell when I'll really be able to just purchase an autonomous vehicle and ride it.
I'm really curious about your thoughts on where things are going. But have there been major breakthroughs in autonomous vehicles in the last 10 years that you've been working on them, or has it been really an iterative process?
Drago:
It's an interesting domain because...I wasn't at Waymo early on, but people that were at Waymo were very proud of the demos they could do even 10 or 12 years ago, right? Waymo is 13 years old; we've worked on the problem a long time.
As everyone also understands, it's the very interactive, rare cases that you need to be robust to — and all the possible failures that you need to be robust to — that make it so hard. A lot of these improvements are not so easily perceptible.
In the early times, when you sit in the vehicle, it feels pretty good, but you sometimes need dramatic improvements under the hood to make sure that it's really pretty good and comparable to humans. I mean, humans ultimately are pretty good at driving, all things considered. Especially when they pay attention, right? Which, of course, is one big advantage of autonomous vehicles. They always pay attention.
I would say that over the last 10 years...and I'm happy to be part of the process. Not just in computer vision, but here even more, obviously, I think the entire technology is being rethought from the ground up. I think machine learning constantly takes more prominent roles, and the types of machine learning and the models we build continue improving at a fast pace.
So, there is a lot of capability. And I think you can see that, for a while, maybe there were no notable launches, even though you would hear about the space. Now, people are starting to launch things, right? I mean, Waymo launched the first fully driverless service in Phoenix in 2020. In public, I think we've driven over half a million miles in autonomous mode. In San Francisco, we started driverless operations. That's another big milestone.
I mean, we're building a stack that can handle cars, trucks, highways, cities — but it's still one driver. These are deployments we're having. We announced we will launch downtown Phoenix. I was in the car actually in San Francisco, in driverless operation, maybe 10 days ago.
It's awesome, right? I think when you start seeing milestones like this, they're meaningful. Now, the truly meaningful milestone is when you release it at large scope and scale, right? You want to do it in a thoughtful manner, make sure that you're confident when you put these things out there, that they interact well with people and are safe for everybody.
Lukas:
When you say that autonomous vehicles have been really redesigned and machine learning takes a more prominent role, could you give me a flavor of what the trends are like? Are things moving to a more end-to-end system where the inputs are like a camera and the output is which way to turn the steering wheel? Or are things becoming more componentized, where each piece is responsible for something?
What are the big trends over the last 10 years?
Drago:
The main trend, I would say...and I've seen this as the leader of the research team at Waymo, which does pretty much primarily ML, almost exclusively, right? We started applying it to perception, and then prediction and understanding behavior, and then planning, and then in simulation, right?
I think it's permeated every aspect of the system. Onboard, offboard. There's machine learning in all these components. Major models — meaning they're not just small features, they're core parts of the system.
On a macro level, that's a change that's definitely happened at Waymo. I think when people started early on in 2009, there was a famous Sebastian Thrun book, "Probabilistic Robotics". There, you have the LiDAR. You can create all these segments out of the LiDAR, then you can reason about the segments you can build.
Initially, people — without the deep learning models — would build a very modular stack, with very many modules. Each does a little something. You put them all together. It's a significant engineering challenge.
The trend has been larger and larger neural nets, right? Doing more and more, potentially going from neural nets with narrow scope to neural nets with wider scope. Maybe from neural nets for one task to neural nets that do multiple tasks. The trend is for the modules — with the help of machine learning models — to grow larger and fewer.
Now, there is an interesting question, and this is an area of exciting research, not only in the industry. Some companies espouse a fully end-to-end learned approach. There is no clarity on whether a fully end-to-end learned system is actually better.
In life, when you build these things, there is often a trade-off between different extremes, right? Each of these things has its pluses and its minuses, and you want somehow to take advantage of the pluses but not be stung too much by the minuses.
We were maybe too far toward the end of a too-modular system — too many small pieces written by engineers. Whether the answer is several large modules or a single end-to-end thing, I think is an open question. This is an area where, as a research team, we're still exploring the repercussions of these things.
The industry is exploring, because people have different visions for some of these things, right? But there are some serious trade-offs to doing everything end-to-end. Not in academia, by the way. If you take an academic dataset and you train more end-to-end at the small scope, you will probably do better. But that does not mean you build a better system in the production setting.
Lukas:
Why is that? What are the pitfalls?
Drago:
I think ultimately, in an academic setting, you look at a lot more average metrics and things, right? And the dataset is small. Clearly, if you build something that incorporates everything and co-trains, it will probably do better, especially if you optimize on it.
In the production setting, you're looking to be robust to the very rare cases. You look at speed of iteration and the ability for people to fix your model if there are issues, right? You look at stability. Like, understanding there are issues, being able to dig in. Simulation requirements, too.
If you have a fully end-to-end model, your simulation has to be end-to-end, and you need to simulate all the sensors as needed and so on. That's maybe a lot more expensive than some intermediate representation that may be simpler to simulate. Maybe it does not pass all the information the model may want to pass, but at the same time, you get other benefits. Maybe you can train closed-loop a lot faster. That, now, also can help you.
There are very interesting trade-offs in this space.
Lukas:
I think one thing that I've really noticed from my vantage point of selling tooling to autonomous vehicles is how many customers there are for labeled data and Weights & Biases stuff.Do you think it's a wasted effort that so many different smart teams are tackling the same problem? Do you feel like there's a diversity of approaches that's interesting? Does Waymo have a specific point of view that is different than Zoox or other places that you know of?
Drago:
I think if you start looking at the stack, first of all, you don't really know. I don't know exactly what the stacks of the other companies are, right? They're proprietary.
I think there is an interesting search space where you are saying, "I'm going to design this system. It will have these APIs. These are the intermediate outputs. This is how I build my tooling. This is how I understand how the system is doing. This is how I iterate on each of them. That's why this representation is beneficial for onboard perception, for onboard performance, for example in simulation." Right?
You take all this into consideration. It's a very wide search space. I think every company ends up with somewhat different APIs, design choices, and trade-offs — how modular versus not, and how much machine learning they put in versus not.
This is very understudied because, ultimately, it's hard. I think in research, it's a lot easier to study every problem in isolation. You can say, "Okay, let's do 3D flow prediction." We have some state-of-the-art 3D flow prediction, or monocular depth that others do. I think when you start combining them, there's actually a lot of variability possible, and people's stacks end up quite different in the end. Even if, at a high level, you can say they're put together in a somewhat similar way.
The challenges of scalability and simulation
Lukas:
Interesting. Do you feel like there are still deep problems to solve between now and everyone riding in autonomous vehicles?
Drago:
Scalability is always a problem, right? Safely, cheaply scaling to...I always think about what system I need to build to scale to a dozen cities, cheaply.
Lukas:
But wait, I want to understand that. Because with a normal piece of software, you wouldn't only deploy it in San Francisco and Phoenix. If it's safe in one city, it would be safe in every location. What makes it hard to go to LA and have the same thing deployed, or go to Boston and deploy?
Why does it have to be one city at a time? Usually, software goes everywhere at once.
Drago:
I think we're at the point where we're building software that should deploy to most places and be pretty good at it. Maybe historically, with the probabilistic robotics stuff, that wasn't quite the case, right?
Still, you need to validate that and be sure. I think there are a lot of local particularities in every location that you should make sure you can handle. There's some strange roundabout in this place, and there is some eight-way intersection in this other place, and maybe here in Pittsburgh, they do the Pittsburgh left that you need to understand, right?
Someone still needs to go out, collect data from these places, potentially tune the models on these places, and potentially then do the safety analysis to convince yourself you actually should deploy. And that is work. That's why you don't just build it once and say, "Okay, let's just drive and see what happens." I mean, you can do that, but if you actually remove the driver, I'm not sure how responsible that is.
Lukas:
But I mean, Google has actually mapped every city on the planet, it feels like. Shouldn't it be possible to send some cars and collect data? What...
Drago:
I mean, we are sending cars and collecting data, right? So we are growing our scope.
Now, we have Chandler and Phoenix. Now San Francisco, we announced. Downtown Phoenix, we announced that we will deploy starting this year. We're collecting data — there have been public postings — in New York, say, in the winter, and in Los Angeles. And of course, we now have trucks collecting highway data for trucks and behavior around trucks, which is important, I think.
There are somewhat different behaviors around trucks and different issues — like seeing around the trailer, for example — that you don't have with a car, and you have a somewhat different sensor configuration. We're broadening, right?
And I agree with you. If you look at every city as one deployment and one piece of software, and you just develop it and launch it in that city, that doesn't scale, right?
The way we think of it, we're building one driver, if possible. This driver should be able to handle all these environments, including, as much as possible, cars and trucks. Even though there will be some small differences, the core of all the pieces is similar in the nature of what they want to do.
Then you iterate, and when you're comfortable that you have enough data about safety and passing the bar, you launch.
Lukas:
Do you have an opinion on LiDAR versus vision-only approaches?
Drago:
By the way, maybe one more topic on the previous question.
If you ask what the big open questions are, I think one of the interesting topics is — and this is a scaling factor for you — you want more machine learning in the planner, if possible. And you want a realistic simulation environment where you can just replay full-system scenarios and, without too much human involvement, determine whether you're improving or not improving.
For us, the big challenge is that it's a very complex endeavor. It's not like someone gave you the perfect simulator for autonomous vehicles. You need to build one. And ideally, you build one from the sensor data and the data you collected. So that's like real-to-sim.
And now, by what metrics do you build the simulator?
You need to establish metrics for the simulator that constitute acceptable simulation, and for our simulation, a lot of it is about the behavior of agents. It's not just how something looks, even though we like that work, too. We've done NeRF and the 3D reconstruction and all kinds of things.
But ultimately, behavior is one of the main things you need to solve, so you need realistic behavior in the simulator. Then when you have that, you have the other metric, which says, "What does it mean to drive well in the simulator, in the world?"
You need both. You need to build both things. The further you go, the easier it is to improve these pieces, because the less you need humans in the loop, right? We can still improve them. You don't need a perfectly realistic simulator to improve your driving. It's just that it requires more human judgment, right?
But there is a process to... I mean, in these areas, a lot is still possible, right? And we hopefully will show more interesting work this year. We submitted a couple of papers in the space that people may find interesting.
Lukas:
Cool. That might even be applicable to things outside of autonomous vehicles, right? It seems like sim-to-real type of stuff is necessary for any kind of robotics application.
Drago:
Yeah, real-to-sim in some sense. I mean, the specific instantiation is maybe a little different, but I would say that one of the nice properties of AV is that it is a complete robotics problem. Maybe it's a specific kind but a lot of the things you need to solve for other robotics problems are, at least in some shape, covered.There will be hopefully a lot of positive spillover from our domain to others.
Why LiDAR and mapping are useful
Lukas:
A couple of practical questions that everyone on the team wanted me to ask you, if you could talk about it. Do you have an opinion on LiDAR and more complicated sensors versus vision-only approaches? Do you think LiDAR will always be needed to make things safe?
Drago:
I think ultimately, it's a question of...I don't know if LiDAR will always be needed but I think it's really great. And I know it's not very expensive, right?
Lukas:
Right.
Drago:
I think it even makes your computer vision much better, and it makes your simulation much better, which then immediately also results in better driving. It's a fantastic sensor that you can just have for now. So, why not have it, right?
I think there is this convergence happening, in some sense. LiDAR is becoming more like cameras. It's higher and higher res. Maybe it can even do passive lighting, so then it is a camera also, while being a LiDAR. And it's cheaper and cheaper with the current technologies.
On the other hand, obviously, our 3D perception from cameras — even compared to two years ago — is dramatically better, right? I really like having the LiDAR. To me, it's a safety feature. It's a lot safer being in the car with LiDAR than not. Maybe it's theoretically possible to just do it with cameras, and maybe it will play out, but do you want to risk it, and why? I mean, it's easy to remove LiDAR. No one is stopping you, right?
Lukas:
Sure.
Drago:
It's not like we don't have state-of-the-art camera approaches. It's easy to remove. Maps too, right?
Lukas:
Yeah. That was my next question. I mean, how critical do you think the mapping is? Because that doesn't seem scalable necessarily, right?
Drago:
It's pretty scalable. I mean, you can do mapping with machine learning in some sense, if you design it properly.
Generally, maps are a prior. They tell you about an environment, especially an environment you drove a lot in. What to expect? What is behind this occlusion, right? When you look at this intersection, what does this thing really tell you to do versus not? Or what to expect around that corner? If you can have some of this information, why not use it, right? I mean, it's safer.
Lukas:
Right.
Drago:
Now, should you trust the map as-is and require that it is correct? That's not scalable. If you say, "The map is given to me, I need to maintain it as true, otherwise I can't drive," you cannot deploy autonomous driving at scale. You don't have a business.
I mean, people do construction, put out cones. They change the traffic lights, they repaint things on the highways where the trucks drive, they reroute lanes. You need to deal with this; otherwise you don't have a business ultimately, in the end.
You can't trust maps blindly, but why not have a prior? I mean, we drive a city — and even to do the safety case, or just to collect data to understand what people do — why not have a map prior?
Lukas:
Do you think it helps enough that there will be one winner in autonomous vehicles that everyone uses, then it gets better because it collects the map data? Is it that much of an advantage?
Drago:
Which, the map data? I think generally there are scaling benefits in autonomous driving. I think a lot of the scaling benefits accrue when you use large machine learning models, right? You see the extreme case with GPT-3 and the big language models.
In our days, when we studied with Daphne Koller, we learned that there is a bias-variance trade-off. You want Occam's razor; you want to penalize models that have high expressivity, and you will get the best generalization, right? The simplest model that explains your data is great. Probably better than some fancy overfitting model.
Now, all of this is on its head. You say, "I want to train a huge model that is much, much bigger than anything, on tons of data, that may be the same as mine or different. And that model will generalize better for me." Right?
Lukas:
Right.
Drago:
Now, in AVs, what does this mean? We have all this data. Waymo has more data than the vast majority of companies, across different platforms. I mean, we've driven for 13 years, 20 million miles in autonomous mode, right? We have whatever, 20 billion miles in simulation. Simulation is also data.
Now, we have cars and trucks. All of these things — if you take the large machine learning point of view — make the models better, because you have more data. It's more diverse data. It captures...we try to see everything that you could see. If you do your job well, these models will actually generalize better. Having cars helps you do well on trucks. I have all this great car data, right?
You add it to the models for the truck, and it helps a lot. And car data is a lot cheaper to collect than truck data, too. And maybe a lot more diverse. I mean, often on the highway, being a truck, you drive fairly conservatively, and fairly few things happen on the highway. But it's a multiplier for you in the multi-platform setting, right?
Our domain is friendly to this, I think.
Waymo Via and autonomous trucking
Lukas:
Right. Could you talk a little bit about why Waymo is investing in trucks? It seems to me like a different enough domain — like more different than a different city — that I could imagine...my first thought would be, "Well, you'd probably get the cities working first with a car and then switch to a truck," but it must not be. Could you talk about that?
Drago:
I think there are some differences between the two, but ultimately most of the pieces are similar enough that you can share. You can share roughly the same modular design. You can share roughly similar types of models. You can share roughly the same types of simulation environments. You can cross-benefit by cross-pollinating the data between the two domains.
For example, to understand how others behave, you can just collect data on how people behave around cars. It will generalize to trucks, right? There are some unique problems with trucks that do have to be solved. One of them is that you need to see a lot further for a truck, partly because a fully loaded truck takes a while to stop. Also, if you want to change lanes with a truck, sometimes you need to create gaps, right?
And it takes longer to create gaps and do it without cutting people off unnecessarily than for a car. So you need to anticipate a lot sooner, and you need to see around the trailer, or be smarter about how you infer what's behind your trailer.
There are a few of these problems, and that's why you have a bit different sensor configuration. But if you look at the core pieces, a lot of the other logic — like which modules you would put together, what to put in each module — is very similar. All the infrastructure is similar.
Now, trucking is a very big use case, right? It's a big market. So it makes sense from that standpoint. There is enough cross-pollination and commonality — more so than differences, I would say.
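To put rough numbers on the see-further point — these figures are my illustration, not from the episode — braking distance scales as d = v²/(2a). At a highway speed of v ≈ 29 m/s (65 mph), a car sustaining roughly a ≈ 6.5 m/s² of deceleration needs about 29²/(2·6.5) ≈ 65 m to stop, while a fully loaded truck closer to a ≈ 3.5 m/s² needs about 29²/(2·3.5) ≈ 120 m — before adding any perception or reaction latency. Hence the truck's perception range has to be substantially longer.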
Robustness and unsupervised domain adaptation
Lukas:
Another question I wanted to ask...maybe you get this all the time, but such a common adversarial example is slightly modifying street signs to make a system think it's a different sign. Is that a toy thing that doesn't really come up and doesn't really cross your mind as a major problem, or is that something that you actually really worry about trying to create autonomous vehicles?
Drago:
In our case, we have three different sensors, right? I don't think you can fool three different sensors nearly as easily and independently. Furthermore, we have redundancy between the sensors, right? When you want-
Lukas:
Right.
Drago:
Part of the beauty of having active sensors is that one of them can fail and the others can still fairly independently detect things for you, right? From that standpoint, a hybrid stack with multiple different sensors is more robust. That's one.
Second, I think generally, these adversarial problems fall in the bucket of robustness. And in some sense, unsupervised domain adaptation. You want to generalize to similar situations. And in research, we have studied these topics.
We have methods currently that we've investigated that help with transferring from one domain to another. There is a paper called SPG that we put up. That's an interesting take on essentially adding more structure to a prediction task, a detection task, to make it more robust to new conditions. Like, you train in sunny weather, then you want it to work in rainy weather.
It turns out that instead of just regressing 3D boxes, if you first have an intermediate task that regularizes — it predicts your point clouds and fills them in — and then from this canonicalized, regularized point cloud, you predict your box, you get a lot of robustness.
We did it with unsupervised domain adaptation in mind. By the way, in the Waymo Open Dataset, we released some data for people to study this. We can talk more about the Open Dataset later. But we did it with this in mind, and then we realized, "Oh, this method is actually number one on the KITTI leaderboard" — for hard detection cases. That was maybe a year ago.
That's because when you do well — and KITTI is a small dataset — there are rare examples. When you add robustness through domain adaptation and you do it well, it just happens to do well on these examples, on more of these hard examples.
So these are techniques that we're exploring. We have, at this point, significant experience with adversarial techniques. There's actually a large space of them. There is a challenge: many techniques make you more robust to adversarial cases but really hurt your performance in nominal cases. The challenge is to find robustness methods to train your models such that you don't regress on the common cases. If anything, you get better, and you get more robust to the adversarial attacks. There are such methods.
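As one illustration of that last point — this is a generic technique, not Waymo's method — adversarial training commonly mixes a clean loss with a loss on perturbed inputs, so the model gains robustness without fully regressing on nominal examples. A minimal FGSM-style sketch in PyTorch; the function name and hyperparameters are assumptions:

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(model, x, y, optimizer, eps=0.03, clean_weight=0.7):
    """One hypothetical training step mixing clean and FGSM-perturbed losses."""
    # FGSM: find the input perturbation direction that increases the loss.
    x_req = x.clone().detach().requires_grad_(True)
    loss_for_grad = F.cross_entropy(model(x_req), y)
    grad = torch.autograd.grad(loss_for_grad, x_req)[0]
    x_adv = (x_req + eps * grad.sign()).detach()

    # Mix clean and adversarial losses so nominal performance doesn't regress.
    optimizer.zero_grad()
    loss = clean_weight * F.cross_entropy(model(x), y) + \
           (1.0 - clean_weight) * F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return float(loss)
```

The `clean_weight` knob is exactly the trade-off Drago describes: push it down and you buy adversarial robustness at the cost of the common cases; the better methods aim to avoid paying that cost at all.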
Why Waymo released the Waymo Open Dataset
Lukas:
That's a good segue into something I want to ask you about, which is the Open Dataset that you've been releasing. Could you maybe, first of all, talk about what they are? But I would love to hear the motivation and what's been surprising in the reaction from the community after putting them out.
Drago:
I would say that when I joined Waymo in 2018, we started the research team, which is an applied research team internally. Most of our work is primarily focused on improving Waymo's systems with machine learning. We do publish too, right? A good amount, but not all our work. We're not just made for academia.
We wanted to engage the community better. Then the question is, "Well, how do I collaborate with you?" or "How do we encourage you to work on certain problems?" At the time — especially when we started planning the dataset — there was the KITTI dataset, which by modern measures — it was done, I think, in 2010, 2012 — was tiny.
Then we thought, "Okay. The best way to encourage people to solve problems relevant to our setup — which is a lot more data — and the problems we're interested in, is to start releasing data that people can just push the state-of-the-art with. That's what the community does not have."
We released what I believe is still one of the largest and richest datasets, and we are actively making it better and better. If you checked it out two years ago, come back and see the kinds of things we have now, and we will continue releasing interesting data. We have 3D bounding boxes over time, 2D bounding boxes over time. Now, we have 3D semantic segmentation. We have 2D and 3D pose key points for people in busy scenes — a type of data of which there is very little in other datasets you can see in the wild.
We have a bunch of interesting challenges. One of the interesting things is...we released the perception dataset and we picked 2,000 run sequences, which in its time was quite a lot, right? So, 2,000 20-second sequences — compared to anything else, it's a humongous amount of data. Then we started trying to do the behavior prediction task with it.
If you do this, you realize that for behavior prediction, you need yet an order of magnitude more data. Why? Because, say, a scene of 20 seconds has 200 objects, and you're maybe at 10 Hz in our dataset, right? That's tens of thousands of instances that you observed over these 20 seconds. And maybe you will see one interesting interaction, or none, in the whole sequence.
From this standpoint, then I'm like, "Okay, what is a reasonable size of data for behavior understanding and understanding interesting interactions?" And we came up with, "Okay. If we had 2,000 perception sequences, you want 100,000 behavior sequences."
Then of course, the question is, "Okay. If you release the sensor data for all of this, how are people even going to download it?" So we did some very interesting things. We released vectorized data of the environment, produced from our sensors by genuinely novel systems we have. It's a system called auto labeling, which I think is pretty key for the autonomous driving space. In hindsight, after you observe the whole scene, you can try to recreate, as perfectly as possible, everything that happened, right?
We have novel work on this. It was published maybe a year or two ago. With this work, we actually made our dataset. It's still probably the state of the art of what you can do with these models. It's very clean data, of a kind that was never done, so you can study aspects you could not before.
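For concreteness, the scale Drago sketches works out as: 20 s × 10 Hz × ~200 objects ≈ 40,000 object observations per sequence — tens of thousands of instances, yet typically at most one interesting interaction among them, which is why behavior prediction needs an order of magnitude more sequences than perception.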
Lukas:
Have people engaged with it in ways that were unexpected? Has it been useful to you?
Drago:
People come up with very powerful models, which is part of the appeal. You have people from industry, from academia, even kids from high school in some cases, like in one of our challenges — which is really impressive to see, just the broad, worldwide reach.
What's interesting is, we release it with some problems in mind, and we try to suggest problems. The way we try to suggest problems is...we've been running challenges for three years straight, with prizes. So we say, "Here's a problem. Here's a metric we believe is suitable for this problem. Here's the leaderboard where you can submit. If you do well, you can win and come to our workshop." This year, we also have a workshop at CVPR, one of the two premier computer vision conferences. You get to present.
People participate, and every year we expand the set of challenges that we have. This year, we have three completely new challenges. Some are really unique and have not been run before. Say, future occupancy prediction, both for occluded and non-occluded agents, with flow. There are few such challenges.
We have one on, "From five cameras over time, can you reconstruct the 3D boxes accurately?" There are variants of this for a single camera, but for multiple cameras over time with rolling shutter — which is the real setup on a car — we worked out some very interesting metrics and a setup that has not been done before. That is very core.
A lot of people do appreciate, I think, the releases. And the more we release, the more different research people can do, because now they can study how all of these enrich each other, and how the perception and motion datasets have a certain compatibility — you can reason about how to combine some of these, and it gives you a lot of opportunity.
But the last point is, people started solving problems we hadn't thought of with this data, or doing different research, including ourselves. For example, you can use our data to train NeRF models. I mean, you have all this rich data from all over the place. You could do that. Or you can train 3D reconstruction models, right? You can do shape completion models. There are a lot of things you can do when you have such rich data — we release two sensors, camera and LiDAR. If you have camera and LiDAR in interesting environments, you can do a ton.
Lukas:
Cool. Is it a challenge to convince the business that it's a useful thing to release this stuff? Are there objections, like IP that might leak out, or even possible privacy issues?
Drago:
There were objections. I think ultimately...I'm thankful, and Waymo is a great place to have a research team. I think it's a great collaborative environment with people that really appreciate the value we can bring, especially in an open-ended field.
I think you can really balance the concerns, right? I don't think that releasing the Open Dataset gives such a huge leg up to the competition, because we released some data for people to study problems in this space, right? I think ultimately, it's really helpful to everyone, but it's not defining. I think there are a lot more positives for everybody than worries for Waymo, and by releasing it, we hopefully struck a good balance.
It has been a lot of work. Ultimately, we want to release data at a quality that befits the Waymo brand. That means that we need to handle, say, blurring all the faces and license plates well. We need to make sure that the annotations are very high quality, which they are. We really paid a lot of attention, and we ran models to keep mining for potential errors in our 2D and 3D annotations. I think they're very high quality. So hopefully, people can benefit from that.
The domain gap between simulation and the real world
Lukas:
We always end with two open-ended questions that I'd love to try before you go. What do you think is an understudied part of machine learning, or something that you would want to look into if you had more time?
Drago:
I would say that I'm perfectly happy doing the problems we have because, ultimately, most machine learning problems are represented in our domain.
I would say a few. One of the fascinating areas that we're looking at is...AV really stresses that you want robust systems, right? And we touched on this. So what does that mean, right? This means many things, and it depends on which systems.
One of them is you want to build inductive bias and structure in the... If you think of the whole thing as one big architecture, you want to build the right structure so it generalizes, right? This means pick the right APIs, pick the right designs and representations.
There is a certain flow in our models, which I think has now become a lot more popular in the whole ML community. You go from perspective view with tens of millions of points, scans, you name it. Then you create a Euclidean space, maybe in top-down view, with ultimately...with objects, with relations, with polylines or structure. In that one, models generalize a lot better, so you want to do more of this. That's one.
The other one — which we touched on very briefly — is that when you train these systems and make them robust, you need to be able to detect the rare examples. Why do you want to detect them, and when? If you detect the rare examples, you can, of course, bias your training set and metrics to make sure you do well on them, right?
When you drive...if you know you don't know, it's already a huge help, because machine learning models, you can think of them as very performant when you trust them. If you don't trust them, you can fall back to something a lot more cautious and safe.
You just need to know when. There are a lot of techniques you can study to do this. We can talk about finding rare examples if we get to it; we have a whole bunch of research on this.
There is another one that I find fascinating, and we touched on it. This is the domain gap between simulation and the real world. How and what should the simulation be, such that you can train the best possible autonomous vehicle stack? How do I build it from the data I collected? What are the metrics for the simulator that it should optimize, as realism, and then how do you put planning agents in it, right? I think that is a fascinating-
Lukas:
-can you give me some examples of results in that? I'm not familiar with that work.
Drago:
There are several things you can do. There are several aspects of realism. You can think of it...when you put your vehicle in the simulator, you want to produce inputs to the vehicle that are similar, or highly similar, to what you see in the real world, right? Then the outcomes in the simulator are pertinent.
What are the inputs to your vehicle? It's sensor data, and it's the behavior of other agents in the simulator. These are the two main axes. Some kind of sensor data, perception realism. Maybe you use some intermediate representation that's a lot cheaper than simulating every pixel, but you need something.
Now, you need agents to behave realistically, meaning they react to you. I mean, agents need to react to you, right? If you do something different, the simulator needs to cause an effect. There's a reaction, and it needs to be reasonable, right?
How reasonable does it need to be? It varies. As you know, there is strong work on randomization in other domains. If you want to train a more robust model, you can even try somewhat unreasonable things. As long as there are enough of them, you can build a more robust model.
In our domain, you also ideally want the simulator to be a good measure of risk. And that's a higher requirement. Then you need a higher bar for what realistic means, because it needs to be somehow correlated with the real rates.
Lukas:
But how would you even know that though if it's reacting to what the agent does? How do you quantify how good your simulation is? The agent might do something that you never saw in the real world. How could you even know if the simulation is realistic?
Drago:
There are two measures in which you can measure realism of agents, we think. And we've presented them in past talks.
One of them is a Turing test of sorts. You look at the scene and ask, "Could this agent have done this? Is it likely, or completely impossible?" That's one. That's a proof of existence that it's realistic. Then you have distributional realism. Which is, let's say, how often someone will cut in front of you, or what the braking profile is, or how long someone takes to pay attention to you, right?
That is the type of useful distributional realism that you can enforce, and this makes sure that agents behave, at least on a distributional level, similar to what you observe. And we've observed a ton of behavior, right? So we have enough data to know roughly what the distribution of these things is.
One of the challenges is agents acting in a continuous space. It's practically an infinite distribution. But you can take slices of it that are meaningful and enforce that those are matching, right? There are certain designs there that you need to build in.
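As a toy illustration of checking one such "slice" — a sketch under assumed data, not Waymo's metric — you can compare the distribution of a single behavior statistic (say, the time gap drivers leave when cutting in) between logged driving and simulated agents:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def slice_realism(real_values, sim_values, bins=20):
    """Jensen-Shannon distance between histograms of one behavior statistic."""
    lo = min(real_values.min(), sim_values.min())
    hi = max(real_values.max(), sim_values.max())
    p, _ = np.histogram(real_values, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(sim_values, bins=bins, range=(lo, hi), density=True)
    return jensenshannon(p, q)  # 0 means the two distributions match

# Hypothetical data: cut-in time gaps (seconds) from logs and from sim agents.
real = np.random.gamma(shape=4.0, scale=0.5, size=10_000)
sim = np.random.gamma(shape=3.5, scale=0.6, size=10_000)
print(f"JS distance for the cut-in gap slice: {slice_realism(real, sim):.3f}")
```

Enforcing that many such meaningful slices match is one practical way to pin down an otherwise infinite distribution over continuous behaviors.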
Lukas:
I would imagine there's parts of the distribution that you might care about but it would be dangerous to even do in the real world. But you might really care what happens if you slam on the brakes or make a hard turn.
Drago:
You can play out any future you want, in theory, if you build it right.
I think that simulator has this huge scaling promise. You take any scenario you saw, you release the agents and you release yourself, and you can try all kinds of stuff and they can try all kinds of stuff. And you can learn from that, right? It multiplies your data. If you have good models for the agents, now you have a 100X multiplier on everything. That's fascinating.
Maybe if you can score roughly how likely each future is, then you even have a likelihood estimate, right? You can sample adversarially and bias yourself towards the interesting cases. Have someone try to cut in front of you when you're driving. Most of the time they don't, and maybe 1% of the time they will. Then you see what happens if they try.
There are different ways to build it, but you have the opportunity — if you do it right — to really dramatically increase, say, the cases of collisions you can replay. Because we don't see that many collisions, thank God, even when we drive a lot. But I can make you a lot of them in the simulator, and some will be more realistic than others, and it makes a nice arena to study these things safely, of course.
It's best if that's where you study them, right?
Finding rare examples
Lukas:
Right. Do you have more thoughts on finding unusual examples? I mean, active learning has been around for a long time. It's something that I think most companies use when they want to actually deploy something to production. Are you talking about active learning or something more complicated here?
Drago:
I will say a few things. Some of them are papers of ours, some observations.
Ultimately, our domain is ripe for finding the rare examples. It's one of the main tasks you need to do, right? I mean, most of the time you drive, it should be boring, and you need to find...and we collect a ton of data, which is great. The setting is almost: you have some proxy for infinite unlabeled data, and you have some labeling budget. You can label yourself some data — as you know, you ran a labeling company. Now, how do you benefit the most from this data you just collected, right?
Most of the examples are in there somewhere, if you can find them. That's the first observation, right? That's one. Now, if you can find them, you can data augment a lot out of them. That's a good way to go, right? We have papers on how to perturb them in different ways. You can do this for cameras. You can do it for LiDAR. You can even machine learn how to best perturb them to get the best results, with that work, right?
I'll get to ways to find rare examples in a second.
There is a long-tail learning literature, and a lot of it was driven in academia by datasets such as ImageNet or... I don't know, is it the birds dataset? There is one. We used to do it-
Lukas:
Small dataset, yeah.
Drago:
We used to do it at Google — when I worked on Google Goggles — breeds of dogs, and types of birds, and types of food, and all kinds of things. Typically, the literature in long-tail is driven by these rich semantic datasets where you have some very rare thing, like a rare breed of bird or a rare breed of plant, and then you need to detect it with five examples, right?
But that is a world in which everything was named. Just this name was rare, and you just had five examples. Let's maybe learn to do the most with them. That's one way.
Now, in our world, it's a little different. In autonomous driving, you don't want to name every type of plant or even every type of dog. You have fairly broad categories. Take the category of vehicle to an extreme. There are all kinds of vehicles in the category of vehicle, and 80% of it will be boring sedans.
Then down the line, you can have all kinds of strange configurations of things people do, right? A cement mixer with a trailer or something, or trams on the... I mean, you can have anything. Now, in this big bucket, what is a rare example? You don't want to name it.
And rare is not the same as hard. That's another key property.
I'll give you an intuition. I think people sometimes say, "Oh, we're going to train an ensemble, and where the ensemble disagrees, we'll just label." That's very standard. What's the problem there? Well, the ensemble finds hard examples. The models disagree; it's not easy to tell what it is. That doesn't mean it's, first of all, actually rare, or second, beneficial to label.
You can see this intuition maybe most easily in LiDAR perception. You do LiDAR perception. If you do ensemble mining — we've studied this; actually, a great guy in Waymo Research called Max studied this — you get cars in parking lots far away. They have five LiDAR points. The models clearly disagree on what the bounding box should be. But that's not a very useful example. It's not like if you mined more examples with five LiDAR points, you'd get much better on them, right?
Lukas:
Right.
Drago:
You need some mechanism to tell rare from hard.
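To make the distinction concrete, here is a toy sketch — an assumed setup, not Waymo's pipeline. Ensemble disagreement surfaces hard examples, while distance in an embedding space is one simple proxy for rare ones; the far-away five-point cars would score high on the first but not necessarily on the second:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def disagreement_scores(ensemble_probs):
    """ensemble_probs: (n_models, n_examples, n_classes) softmax outputs.
    High variance across models = a *hard* example."""
    return ensemble_probs.var(axis=0).mean(axis=-1)

def rarity_scores(embeddings, k=10):
    """Mean distance to the k nearest neighbors in feature space:
    examples far from everything seen before score as *rare*."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)
    dists, _ = nn.kneighbors(embeddings)
    return dists[:, 1:].mean(axis=1)  # column 0 is the self-distance

# Hypothetical mining: prefer rare examples, not merely hard ones.
# hard = disagreement_scores(probs); rare = rarity_scores(feats)
# to_label = np.argsort(rare)[::-1][:budget]
```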
Lukas:
What did you do? What's the intuition?
Drago:
You can build, for example, a model that estimates — given features of the examples — a distribution, and check which things are actually rare versus ones we've seen a lot. We have a paper on this. It's still unpublished, so I will not say more, but it hopefully will be soon.
Obviously, that's one way to think about it, but I just thought it's an interesting distinction.
There is another work we did, which is called GradTail; that one is published, and it's a bit of the same idea. It's like, "Let's define long-tail as uncertainty related to the model or the task you're doing." It's not so much that some class is long-tail somewhere. It's epistemic uncertainty related to the model you're training.
What does that mean, really? Again, this comes to say...actually, Zhao Chen, who is the primary author of that paper, has this reasoning: "Some rare kind of apple is really important and relevant when you try to tell apart the types of apples. But if you try to tell apples versus oranges, that's maybe not even the relevant case."
So we have a definition of long-tail, which is if an example — when you train on it — has a gradient that is orthogonal to or different from the mean gradient for the class.
Lukas:
Sorry. How do you define a gradient for a label?
Drago:
I mean, you can backprop the gradient for an example. There's some layer at which you can check, on average, what examples from that class-
Lukas:
I see.
Drago:
-pass as a gradient at that point. If you have different gradients...actually, it can be either orthogonal or negative. You can argue which one is which, but if the gradient is different enough, it's ultimately a long-tail example.

Sometimes you find rare examples that semantics does not give you. For example, an example may have a common class like "fridge", but the fridge is open, or it's seen from some strange point of view. That's a rare example, even though the class is common. Or you can predict depth, or regress something: the rare examples can be the depths of points far away that are close to something occluded.

You can't even name these things, but that doesn't mean the long tail doesn't exist beyond named semantic concepts. We've explored this a bit because it's relevant to our domain.
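Here is a rough sketch of that idea, simplified down to a linear classifier with a hand-derived gradient. This is an illustration of the definition in the GradTail paper, not the authors' code, and all names are made up: each example's gradient is compared against the mean gradient of its class.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def example_gradient(W, x, y):
    """Cross-entropy gradient w.r.t. W for one example of a linear
    classifier: dL/dW = (softmax(Wx) - onehot(y)) x^T, flattened."""
    p = softmax(W @ x)
    p[y] -= 1.0
    return np.outer(p, x).ravel()

def gradtail_scores(W, X, Y, num_classes):
    """Cosine similarity between each example's gradient and the mean
    gradient of its class. Scores near zero (orthogonal) or negative
    mark candidate long-tail examples, even when the class is common."""
    grads = np.stack([example_gradient(W, x, y) for x, y in zip(X, Y)])
    scores = np.zeros(len(X))
    for c in range(num_classes):
        idx = np.flatnonzero(Y == c)
        mean_g = grads[idx].mean(axis=0)
        for i in idx:
            g = grads[i]
            denom = np.linalg.norm(g) * np.linalg.norm(mean_g) + 1e-12
            scores[i] = g @ mean_g / denom
    return scores
```

In this toy version, an open fridge photographed from a strange angle would pull the weights in a different direction than typical "fridge" examples, so its cosine score would be low even though its label is common.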
Lukas:
Is the intuition that these change your model the most?
Drago:
The rare examples will improve the model the most, because the model can actually learn them.
Lukas:
Right.
Drago:
If you feed them in. The hard ones you can feed in too, but the model will waste capacity trying to solve something that's hard to solve in the first place. When mining, that distinction matters.

Now, of course, there is a whole meta point, which is: what is an example you should mine? If you have a fully integrated eval of the whole system, you can try to introspect into the system. And at that point, the example you should mine is the one that causes you trouble downstream, right? That's the optimal world in which to mine. The problem is that now it's complicated, because it couples your whole system and evaluation, right?

If your evaluation is perfect, you should do it. If not, then some of these simpler, more modular approaches give you a lot of the benefit with a much simpler setup.
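Drago doesn't describe the unpublished density model, but a common stand-in for "rare versus seen a lot" is a density estimate over learned features. The generic, purely illustrative sketch below scores rarity by average k-nearest-neighbor distance in an embedding space; unlike ensemble disagreement, this separates sparsely covered examples from merely ambiguous ones.

```python
import numpy as np

def knn_rarity(embeddings, k=10):
    """Rarity score per example: mean distance to its k nearest
    neighbors in feature space. Sparse neighborhoods (large distances)
    suggest rare examples; dense ones suggest well-covered cases.
    Brute force O(N^2) for clarity; use an ANN library at scale."""
    d2 = ((embeddings[:, None, :] - embeddings[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)  # exclude each point from its own neighbors
    nearest = np.sort(np.sqrt(d2), axis=1)[:, :k]
    return nearest.mean(axis=1)

# Mine the rarest slice of an unlabeled pool for the labeling budget.
emb = np.random.default_rng(1).normal(size=(500, 32))
to_label = np.argsort(-knn_rarity(emb))[:25]
```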
The challenges of production requirements
Lukas:
Cool. All right. I want to make sure I get my last question in, which is basically, what's the hardest part of making this stuff work in production? When you think about — from soup to nuts — making autonomous vehicles work, what's the step that's most unexpectedly challenging?
Drago:
Which question is this? Is this from research to production what's hard, or what is hard about release?
Lukas:
Research to production, exactly. Like you have something working in research, or something working in the Open Dataset challenge, or a thing that works in a Kaggle competition, and now you need to make the car go. Where does this break down the most?
Drago:
I mean, I'll tell you. Actually, my first experience with this was at Street View, maybe 2007. I learned how to automatically calibrate the camera on the car to the IMU and the GPS system. Every once in a while, they were miscalibrated, and the panoramas were all crooked in strange ways, right? You don't want to fix it manually because there's so much data, so I came up with a system that did it, and I had great results in maybe two months, say.

Then it's like, "All right, let's ship it." Then you start...I ran it on a lot more data, and I saw all kinds of issues. Someone put a bag over the camera; in other cases the car was stopped for too long, so you can't do structure from motion. You find a bunch of these. That was maybe three more months.

Then you run it and it's like, "I'm there." Then you run it again and, of course, in a large enough dataset, everything that can go wrong does go wrong. Then you find a whole set of yet more rare cases that you need to worry about. So, three more months on those, right?

That taught me: from something that works well enough on the demo case (which is typically a paper) to something actually working, there is still a big chasm. I think a lot of it comes from additional requirements.

For a paper, you have academic metrics. They're usually permissive; they're usually fairly averaged metrics over something, right? The only constraint is, "Okay, there are these one or two metrics you picked. Let's show the main ones and show it works well."

Then you go to the production folks and they say, "But we want this model to produce three more things, and this rare case that it doesn't work on, it should work on. And furthermore, it needs to run three times faster, and you need to build it into this system with these constraints." And you're like, "Oh, great." Now my work may have tripled or quintupled. And then all the downstream models have to work with it, too. It can break them now; maybe it produces new signals. Now I need to work to fix those, right?

That's the usual story of how you get a research model into production. You need to persist; there's a stage two or three. The issue is, even if one or two people are enough to get the demo result, they may not be enough to push the thing through production. We're lucky that we have a lot of great collaborations with the production team, so we just do it together. You need a lot more people, ultimately, to do this, right?

That's also why, in research, we are an applied research team. We are not there to try every possible thing and learn something. We're trying to ideally guess the right things to try, show that they work, and then spend sizable effort if needed to build infrastructure, integration, often even frameworks, jointly with the teams, such that many people can now accomplish it successfully.

That's why the team is not too small either. When you have an applied aspiration and a shipping aspiration, you usually need larger teams. You need fewer, larger efforts, because that's what's conducive to actually landing things beyond papers.
Outro
Lukas:
Awesome. Thanks so much, Drago. This was really fun.
Drago:
Great talking to you. Likewise, thanks for having me.
Lukas:
If you're enjoying these interviews and you want to learn more, please click on the link to the show notes in the description where you can find links to all the papers that are mentioned, supplemental material, and a transcription that we work really hard to produce. So, check it out.