An interview with TRI

Carey Phelps, Product Lead

We sat down with Adrien Gaidon to talk about his work as a Machine Learning Lead at Toyota Research Institute.

What are you working on?

I lead the machine learning team here at Toyota Research Institute (TRI). We work on the long-term challenges of autonomous driving. We're especially focused on large-scale engineering challenges, because Toyota is a global brand with a hundred million cars on the road today. The question is, how can you have machine learning models in cars that are robust to conditions in Michigan, conditions here in California, or conditions in Japan?

It's really hard for machine learning right now to provide any guarantees, statistical or otherwise, on how reliable it's going to be. When you put it in a safety-critical system, it really has to work. So we're working on these exciting challenges: how do we learn from a lot of data and, at the same time, how can we make it safe enough that we can put it in cars and save lives instead of endangering them?

What is the most challenging thing you're working on?

Most of machine learning today is supervised learning, at least the type that we trust to put in products and robots that interact with users. Training machine learning models without providing labels, which is called unsupervised learning, is really challenging. One thing the team is working on in particular is self-supervised learning: how can we leverage an intrinsic property of the signal? How can we leverage something that is true about the world and use it as supervision to train a deep neural net using video and all the data coming from cars? For instance, the physics of light is universal, and the rules of geometry are also universal. If we can use geometric properties to constrain the learning of our system, for instance to predict depth from images, then it's really exciting to see that you can learn models without labels. So when you feed in more and more data, your model can improve at a high rate, because the data doesn't have to be observed and annotated by humans to tell the model that there is a car here, it's at that distance, et cetera, which is really difficult to do at scale.

How would you describe your work to someone non-technical?

I often say that I'm trying to teach computer vision, so I'm trying to teach computers how to see. If I want to get more involved, I say I'm trying to teach computers how to learn how to see. Which is a bit meta, but the idea is: how do you take a machine and a camera and have the machine able to say that this pixel, this little bit of red, green and blue, is actually a car, and that little bit of red, green and blue is road? Understanding the scene is really paramount to autonomous driving. You can't drive if you don't understand what's around you, so imbuing computers with perception is really important. That's how I would describe it to the layman: teaching computers how to see.

What inspired you to work on this?

I was doing a lot of computer vision work before, and it's a lot of work because it's a really hard problem. There's something called Moravec's paradox: certain things like chess, backgammon or Go are things that we consider to be the epitome of human intelligence, but these are actually easy for machines. Perception and, even more, prediction are far from being solved, although we humans are really good at them, and I'm really bad at Go. Moravec's paradox is basically saying that the things that are easy for humans are easy because they result from the huge evolutionary process that made us who we are, with very sophisticated eyes that are much better than any camera and really good brains that are able to process all this. This is the reason why what's easy for us is potentially really hard for machines. I'm really excited about making machines as good as humans at perception, and for self-driving cars in particular, it has to work. The thing is, you can claim results, write papers and do competitions, and it's awesome and we're doing that, but putting it on the road is a whole other game. I have a four-year-old daughter and I live in the Bay Area. I see all these cars driving around, and when I have to cross the street with my daughter I'm thinking about this and living this, right? How can I make sure it becomes safe enough to put on the road? How do you make this work for real with real people in the real world? That's what gets me excited about the problem: mathematical machine learning but also the real world impact. I really love both aspects.

Why is it hard to know if your models are getting better?

That is a fairly deep question, pun intended. The reason it's hard is first because it's statistical. Good enough is a value judgment that has to be quantified on data in machine learning. We build huge data sets and then split them into training sets, validation sets, and test sets. You have to be really careful with the evaluation protocol so that you don't have statistical leakage and derive wrong, overly optimistic conclusions because you have seen data you shouldn't have seen when you trained the models. Like all sciences, machine learning is a very experimental science, so you have to be careful about the experimental protocol. You have to define the metrics clearly. A robotic system or a self-driving car is extremely hard to test on public roads, for instance, because the safety standards are very high, but at the same time you want continuous deployment and rapid iteration. One way is to decompose the system and evaluate its modules. But then how do you define metrics of quality and performance for these modules such that if, let's say, your object detector becomes really good, your overall system becomes really good?

Then there are more mundane reasons why it's hard, which is that we do a lot of experiments. We use the scientific method: we formulate a hypothesis, we design experiments to try to falsify that hypothesis or acquire evidence that supports it, and then we launch those experiments as quickly as possible on as many machines as possible in the cloud. Then we get a lot of results, and we have to analyze them and derive conclusions from them. So visualizing all the results, being able to discuss them as a team, and designing the next set of experiments or deciding to proceed with deploying the model, that's really hard.
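To make the leakage concern above concrete, here is a minimal sketch, not TRI's actual pipeline. It assumes the data arrives as (drive_id, frame_name) pairs, a hypothetical schema, and that frames from one recorded drive are similar enough that they should never be spread across training, validation, and test sets. Hashing the drive identifier keeps the assignment deterministic across runs; the 10% validation and test shares are arbitrary illustrative values.

```python
import hashlib


def assign_split(group_id: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Map a whole group (here: one recorded drive) to a single split."""
    # Hash the group id so the assignment is stable across runs and machines.
    digest = hashlib.sha256(group_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"


def split_frames(frames):
    """frames: iterable of (drive_id, frame_name) pairs (hypothetical schema)."""
    splits = {"train": [], "val": [], "test": []}
    for drive_id, frame_name in frames:
        # Every frame follows its drive, so no drive straddles two splits.
        splits[assign_split(drive_id)].append((drive_id, frame_name))
    return splits


def drives_in(split):
    """Set of drive ids that appear in one split."""
    return {drive_id for drive_id, _ in split}


# Synthetic example: 50 drives with 20 frames each.
frames = [
    (f"drive_{d:03d}", f"frame_{i:04d}.png")
    for d in range(50)
    for i in range(20)
]
splits = split_frames(frames)

# Leakage check: no drive may contribute frames to more than one split.
assert not drives_in(splits["train"]) & drives_in(splits["val"])
assert not drives_in(splits["train"]) & drives_in(splits["test"])
assert not drives_in(splits["val"]) & drives_in(splits["test"])
```

Splitting at the level of individual frames instead would likely place near-identical neighboring frames on both sides of the train/test boundary, which is one common way to end up with the overly optimistic conclusions described above.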

Why is it important to look at models over the duration of training?

It's important to have some form of live monitoring because we iterate really quickly on a lot of ideas and there are a lot of bugs. Machine learning is a little bit in the age of alchemy: a lot of things work, and sometimes it's hard to know for sure why. That's why you need a lot of introspection. We make a hypothesis that if the model is learning, then this part of the gradient magnitude should go up or should go down. Having this level of introspection enables you to see if you have a bug, which might not be a bug in your code but a bug in your neural network design or in the data set pre-processing, where something happened that doesn't correspond to your hypothesis. So bugs in machine learning can just be that the code is correct but it doesn't do what you think it's doing, and the model can still learn and can still train. That's a bit of the alchemy part. Having a lot of introspection is extremely important to understand whether the system is behaving in the way you anticipated. We're very good at post hoc rationalization.

That's why it matters to have a hypothesis up front and to design monitoring systems to see whether this hypothesis holds or not. Richard Feynman said that religion is a culture of faith and science is a culture of doubt. Having a lot of monitoring and a lot of metrics enables you to have this healthy scientific doubt that relies on evidence and data.
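The idea of stating a hypothesis about gradient magnitude up front and then monitoring it can be sketched on a toy problem. This is an illustration only, not the team's tooling: it fits a single weight by gradient descent on a mean squared error, records the gradient magnitude at every step, and checks the hypothesis that the magnitude shrinks over training. The function names, the window size, and the learning rates are all assumptions chosen for the example.

```python
def train_and_monitor(xs, ys, lr=0.01, steps=50):
    """Fit y ~ w * x by gradient descent and record |gradient| at each step."""
    w = 0.0
    grad_norms = []
    for _ in range(steps):
        # Gradient of the mean squared error (1/n) * sum((w*x - y)^2) w.r.t. w.
        grad = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        grad_norms.append(abs(grad))
        w -= lr * grad
    return w, grad_norms


def hypothesis_holds(grad_norms, window=5):
    """Hypothesis fixed before training: late gradients are smaller than early ones."""
    first = sum(grad_norms[:window]) / window
    last = sum(grad_norms[-window:]) / window
    return last < first


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0 * x for x in xs]  # the true weight is 3

# A reasonable learning rate: the monitored gradient magnitude decays.
w, grad_norms = train_and_monitor(xs, ys, lr=0.01)
print("healthy run, hypothesis holds:", hypothesis_holds(grad_norms))

# Same code, too large a learning rate: the code is "correct" but the run
# diverges, and the monitor flags that the hypothesis is violated.
_, bad_norms = train_and_monitor(xs, ys, lr=0.2)
print("diverging run, hypothesis holds:", hypothesis_holds(bad_norms))
```

In a real training loop the same per-step values would be sent to whatever experiment tracker the team uses rather than kept in a list; the point of the sketch is only that the check is written down before looking at the results.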

Why is it important to have a dashboard?

A dashboard is extremely important for communication of results. Science is not done in isolation. There's no such thing as stealth science; science exists in the public eye. That's why we're putting papers on arXiv, submitting papers to conferences, and giving talks about what we're doing. You have to confront your ideas with the rest of the world, and it starts internally. It starts with dashboards that you have internally to discuss, showcase, and debate scientifically, using facts, amongst a team. Is that really working? Is that doing what we think it's doing? What is the best method?

Some of the work we did was using semantic segmentation, instance segmentation, and depth information. The work I mentioned around predicting 3D bounding boxes from a single image uses other work that we did around predicting depth from monocular video input. Collaboration is extremely important because the problems we're tackling in autonomous driving are so complex and the expectation on robustness is so high that we have to leverage all our strengths combined. That literally materializes in the form of: I trust the results of that model because I've seen on the benchmark, on the dashboard, that it's the best. So I'll pick that model and integrate it into my own model to do my own experiments. If it doesn't work, you go back to the dashboard and you go back to the discussions. It facilitates communication, scientific debate, and collaboration.

Why is Weights and Biases useful to you?

My team is maybe two thirds research scientists and a third machine learning engineers. As you can imagine, people had different tools that they liked from their PhDs or past jobs. They all come in with their experience and they'd say X is the best tool, Y is the best tool: TensorBoard, or Visdom, or a custom Matplotlib library that they made during their PhD years. If you don't have a common language, a lingua franca that enables people to share results, discuss them, and share code, then collaboration becomes really difficult. Weights and Biases helped us a lot because at the time there was nothing that really addressed all our different needs. As the product grew, it has for us subsumed and generalized the different products we were using, so we've been transitioning people little by little and now the whole team is using it.

We're still using maybe TensorBoard for some one-off experiments on the side, or maybe a Jupyter notebook with some custom Matplotlib for, again, something small. But for all the large-scale, serious experiments and research projects, the scalability of Weights and Biases, the customizability, and the fact that we can really build reports and live monitoring, not just frozen reports of a conclusion, are all awesome features that have vastly improved our productivity by giving us this common lingua franca for experimental visualization and logging.

How would you describe Weights and Biases to a potential user?

From our perspective, Weights and Biases is an awesome experiment management platform in the sense that it has a really great user experience. It's very easy to log whatever metrics you want to log. We log a lot of images, not just numerical metrics and time series. A lot of our experiments are in computer vision, so we are making visual predictions and we want to visualize those predictions.

We want to contrast a prediction with the ground truth, so we have fairly involved visualizations of our results. Weights and Biases has the flexibility and the scalability, which are extremely important when you have these massive amounts of visualizations coming from these massive amounts of experiments and you want to contrast and compare them, aggregate them, and make reports. It's great. As a manager, I really appreciate it when people tell me this works and I have physical evidence, mathematical evidence, that this is the case.

How do you use Weights and Biases in a team context?

We have different projects that have different dashboards related to them. People also run off on their own to do their own experiments. A lot of the time that's in TensorBoard, or it can be in Weights and Biases, but when we are getting closer to a deadline or closer to transferring a model, we try to make all the final experiments go to the same dashboard so that we can change one thing at a time. One thing that's difficult about machine learning is that there are so many hyperparameters, so many ways that you can modify the system, that it's really important to be able to contrast experiments by changing one thing at a time. These dashboards enable us to see these ablation studies and ask: what if I change this, what happens? What if I change that, what happens? When we want to run an ablation study, it goes into a dashboard where the information is localized, so we can reason about it and keep this focus instead of drowning in the massive noise.
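The "change one thing at a time" discipline of an ablation study can be sketched as a small helper that starts from a baseline configuration and emits variants differing from it in exactly one setting. This is a hedged illustration, not the team's code; the hyperparameter names and values (learning_rate, batch_size, use_depth) are hypothetical.

```python
def ablation_configs(baseline, alternatives):
    """Yield (name, config) pairs that each differ from the baseline in one key."""
    yield "baseline", dict(baseline)
    for key, values in alternatives.items():
        for value in values:
            if value == baseline[key]:
                continue  # identical to the baseline, nothing to ablate
            config = dict(baseline)  # copy so the baseline is never mutated
            config[key] = value
            yield f"{key}={value}", config


# Hypothetical baseline and the alternative values to try for each setting.
baseline = {"learning_rate": 1e-3, "batch_size": 32, "use_depth": True}
alternatives = {
    "learning_rate": [1e-4, 1e-2],
    "batch_size": [64],
    "use_depth": [False],
}

runs = list(ablation_configs(baseline, alternatives))
for name, config in runs:
    print(name, config)
```

Each generated run name records the single change it makes, so when all runs report to one shared dashboard, any difference in a metric can be attributed to that one change relative to the baseline.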

How would you compare your workflow before and after Weights and Biases?

Much more collaborative. That's the main thing for me, and that's the reason why, as the manager of the team, I really pushed people to start using it, and now people are really happy about it. When you scratch your own itch in your corner, you don't initially see the benefits and you don't reap the benefits. It's an effort to spend a bit more time integrating with the tools of the team and contributing to them, but when you make that effort it pays off: by yourself you may go faster, but together we go further.

How many hours has Weights and Biases saved you?

I know it's not a very scientific measure, but a lot. Writing these intermediate metrics and intermediate visualizations to make sure you understand the inner workings of your system takes a lot of time, and there's a lot of redundancy across projects and potentially across people. Weights and Biases has been great because unifying this has made it easier: there's just one tool to understand, learn, and master, and it has enabled us to reuse a lot of boilerplate across projects. And that's just the development time.

Then there's the whole experiment analysis time, and there it's also a lot. It's hard to compare across dashboards that are stale, or across a logging server for instance, or using Visdom, which is a great tool in its flexibility but doesn't give you easy comparison of experiments or traceability over long time scales. Sometimes you have to redo experiments just to confirm the results or remember them.

In the scale of your project, what have you achieved so far?

My team is fairly young, roughly a year and a half old. For almost two years what we did is we developed a large-scale deep learning cloud infrastructure on AWS using PyTorch and a lot of custom-built deep learning tools. We could have started doing experiments with MNIST or small data, but I really pushed the team to set up the infrastructure. A big infrastructure enables you to run very large-scale experiments, which means either experiments on a lot of data or many experiments. We're quite proud of our infrastructure, and the first benefit is that we've been iterating really quickly thanks to it. The second benefit is that these past six months we've been extremely productive on the research front, submitting five papers to the top conferences in machine learning and computer vision. In my previous experience, a paper with me as lead author could take six months, and here in six months we did five. I was really impressed with the productivity of the team and the creativity that was unleashed thanks to our efforts in building out this infrastructure. I'm really excited to show more about this next year. I think 2019 is going to be a year where we have a lot of cool stuff to show.