Pete Warden — Practical Applications of TinyML

Pete discusses machine learning for embedded devices, from running neural nets on a Raspberry Pi to wake words and industrial monitoring.
Angelica Pan

About this episode

Pete is the Technical Lead of the TensorFlow Micro team, which works on deep learning for mobile and embedded devices.
Lukas and Pete talk about hacking a Raspberry Pi to run AlexNet, the power and size constraints of embedded devices, and techniques to reduce model size. Pete also explains real world applications of TensorFlow Lite Micro and shares what it's been like to work on TensorFlow from the beginning.

Timestamps

0:00 Intro
1:23 Hacking a Raspberry Pi to run neural nets
13:50 Model and hardware architectures
18:56 Training a magic wand
21:47 Raspberry Pi vs Arduino
27:51 Reducing model size
33:29 Training on the edge
39:47 What it's like to work on TensorFlow
47:45 Improving datasets and model deployment
53:05 Outro

Transcript

Note: Transcriptions are provided by a third-party service, and may contain some inaccuracies. Please submit any corrections to angelica@wandb.com. Thank you!

Intro

Pete:
The teams I've seen be really successful at deploying ML products have had people who, formally or informally, have taken on responsibility for the whole thing, and have the people who are writing the inner loops of the assembly sitting next to the people who are creating the models.
Lukas:
You're listening to Gradient Dissent, a show about machine learning in the real world. And I'm your host Lukas Biewald. This is a conversation with Pete Warden, well-known hacker and blogger. Among many things that he's done in his life, he started a company Jetpac, which was a very early mobile machine learning app company that was bought by Google in 2014. He's also been a tech lead and staff engineer on the TensorFlow team since then. So he's been at TensorFlow since the very beginning. He's written a book about taking ML models and making them work on embedded devices, everything from an Arduino to a Raspberry Pi. And it's something that I'm really passionate about. So we really get into it and the technical details. I think you'll really enjoy this interview. Quick disclaimer for this conversation: We had a few glitches in the audio, which are entirely my fault. I've been traveling with my family to Big Sur, which is a lot of fun, but I didn't bring all my podcasting gear, as you can probably see. If anything's inaudible, please check the transcription, which is provided in the notes.

Hacking a Raspberry Pi to run neural nets

Lukas:
All right, Pete, I have a lot of questions for you, but since this is my show, I'm going to start with the question that I would want to ask if I was listening. Tell me again about the time that you hacked a Raspberry Pi to train neural nets with a GPU.
Pete:
Oh God. Yeah, that was really fun. So back when the Raspberry Pi first came out, it had a GPU in it, but it wasn't a GPU you could use to do anything useful with, unless you want to draw things. But who wants to just draw things with a GPU? But there was some reverse engineering that had been happening and some crazy sort of engineers out there on the hardware side who'd actually managed to get a manual describing how to use the...how to program the Raspberry Pi GPU at a low level. And this had been driving me crazy ever since I'd been at Apple years ago, because I was always able to use GLSL and all of these comparatively high level languages to program GPUs. But I was always trying to get them to do things that the designers hadn't intended. Like when I was at Apple, I was trying to get them to do image processing rather than just doing straightforward graphics. And I never — You may hear a dog in the background. That is our new puppy, Nutmeg — but I always wanted to be able to program them. I knew that there was an assembler level that I could program them at, if I only had access. I spent five years at Apple trying to persuade ATI and NVIDIA to give me access. And I actually managed to persuade them, but then the driver people at Apple were like, "No, don't give him access because then we'll have to support the crazy things he's doing." So when the Raspberry Pi came along-
Lukas:
Was this Raspberry Pi 1 or 2 or 3?
Pete:
This was back in the Raspberry Pi 1 days. So it was not long after it had first come out and they actually gave you the data sheet for the GPU, which described the instruction format for programming all of these weird little hardware blocks that were inside the GPU. There really wasn't anything like an assembler. There wasn't...basically anything that you would expect to be able to use. All you had was the raw, like, "Hey, these are the machine code instructions." And especially back in those days, in Raspberry Pi 1 days, there weren't even any SIMD instructions, really, on the Raspberry Pi because it was using an ARMv6.
Lukas:
What is a SIMD instruction?
Pete:
Oh, sorry. Single Instruction, Multiple Data. So if you're familiar with x86, it's things like SSE or AVX. It's basically a way of saying, "Hey, I've got an array of 32 numbers. Multiply them all", and specifying that in one instruction versus having a loop that goes through 32 instructions and does them one at a time. It's a really nice way of speeding up anything that's doing a lot of number crunching, whether it's graphics or whether it's, in our case, machine learning. I really wanted to do some cool image recognition stuff. Back when AlexNet was all the rage, I wanted to get AlexNet running in less than 30 seconds a frame on this Raspberry Pi. The ARMv6 really was...it was like, I think it was just like Broadcom had some dumpster full of these chips they couldn't sell because they were so old. This is not official. I have no idea if this is true, but it feels true. And so they were like, "Oh sure, use them for this, whatever, this Raspberry Pi thing that we're thinking about." They were so old that it was actually really hard to even find compiler support. They didn't have, especially, these kinds of modern optimizations that you would expect to have. But I knew that this GPU could potentially do what I wanted. So I spent some time on the data sheet. There were a bunch of...a handful of people had done some open source hacking on this stuff, so I was able to kind of fork some of their projects. Funnily enough, some of the Raspberry Pi founders were actually very interested in this too. I ended up kind of hacking away and managed to figure out how to do this sort of matrix multiplication. One of the people who was really into this was actually Eben Upton, the founder of Raspberry Pi. So he was actually one of the few people who actually replied on the forums when I was sending out distress signals when I was getting stuck on stuff. So anyway, yeah, I ended up being able to use the GPU to do this matrix multiplication so I could actually run AlexNet and recognize a cat or a dog in 2 seconds rather than 30 seconds. It was some of the most fun I've had in years because it really was just like trying to string things together with sticky tape and chicken wire. Yeah, I had a blast.
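
To make the SIMD idea concrete, here's a minimal NumPy sketch of the same contrast: an explicit element-by-element loop versus expressing the whole multiply as one operation (NumPy dispatches the latter to vectorized SSE/AVX/NEON code where available). This is just an illustration, not the hand-written GPU assembly from the story.

```python
import numpy as np

a = np.arange(32, dtype=np.float32)
b = np.full(32, 2.0, dtype=np.float32)

# Scalar approach: one multiply per loop iteration, 32 separate operations.
out_scalar = np.empty_like(a)
for i in range(32):
    out_scalar[i] = a[i] * b[i]

# SIMD-style approach: express the whole multiply as a single operation.
# Under the hood this runs through vectorized kernels, the same idea as one
# instruction operating on many data lanes at once.
out_vector = a * b

assert np.allclose(out_scalar, out_vector)
```
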
Lukas:
How does it even work? You're writing assembly and running it on a GPU. What environment are you writing this in?
Pete:
So I was pretty much using a text editor. There were a couple of different people who had done some work on assembly projects. None of them really worked, or they didn't work for what I needed. So I ended up sort of hacking them up together. So I would then feed the text into the assembler, which would produce the raw kind of command streams. Then I had to figure out the right memory addresses to write to from the Raspberry Pi CPU to upload this program. And then that program would be sitting there in the, I think there was something like, some ridiculously small number of instructions I could run, like 64 instructions in there or something, or 128. The program would be sitting there on all of these, I think there were four or eight cores. I would then have to kick them off. I'd have to feed in the memory from the...and it was, I mean, honestly it was like, in terms of software engineering, it was a disaster. But it worked.
Lukas:
Well. What kind of debugging messages do you get? I mean, I'm thinking back to college and writing this. I remember the computer would just crash I think when there was invalid...
Pete:
Well, I was actually writing out to a pixel, so I could tell by the pixel color how far through the program that it had actually got. Which...I'm color blind, so that didn't help. But yeah, it was really getting...it was getting down and dirty. It was the sort of thing where you can just lose yourself for a few weeks in some really obscure technical problems.
Lukas:
I mean, having worked on projects kind of like that, how did you maintain hope that the project would finish in a way that it would work? I think that might be the hardest thing for me to work on something like that.
Pete:
Well, at the time I was working on a startup and this seemed a much more tractable problem than all of the other things I was dealing with at the startup. So it, in a lot of ways it was just, it was procrastination on dealing with worse problems.
Lukas:
Great answer.
Pete:
Yeah.
Lukas:
What was the reason that the Raspberry Pi included this GPU that they wouldn't actually let you directly access? Was this for streaming video or something?
Pete:
Yeah, it really was designed for, I think, early 2000 set top boxes and things. You were going to be able to draw a few triangles but you weren't going to be able to run any...it wasn't designed to run any shaders or anything on it. So, GLSL and things like that weren't even considered for it at that time. I think there's been some work on that since, I think, maybe with some more modern versions and GPUs. But back in the Raspberry Pi 1 days it's just like, you can draw some triangles and that's...
Lukas:
Have you been following the Raspberry Pi since? Do you have thoughts on the 4 and did they talk to you about what to include there maybe?
Pete:
No, no. I think they knew better because I'm not exactly an average user. I mean, as a sort of a general developer, it's fantastic, because the Raspberry Pi 4 is this beast of a machine with multi-threading and it's got those SIMD instructions I talked about. There's, I think, support for GLSL and all these modern OpenGL things in the GPU. But as kind of a hacker I'm like, "Oh..."
Lukas:
Well, it's funny because I think I met you when I was trying to get TensorFlow to run on the Raspberry Pi 3, which is literally just trying to compile it and link in the proper libraries. I remember completely getting stuck (I'm ashamed to tell you that) and reaching out on the forums and being like, "Wow, the tech support from TensorFlow is unbelievably good, it's answering my questions."
Pete:
Well, I think you ended up...you found my email address as well. I think you dropped me an email and again I think you caught me in the middle of procrastinating on something that I was supposed to be doing. And I was like, "Oh wow, this is way more fun. Let me spend some time on this." But no, I mean, you shouldn't underestimate that TensorFlow has so many dependencies. Which is pretty normal for a Python sort of cloud server sort of project, because they're essentially kind of free in that environment. You just do like a "pip install" or something and it will just work. But as soon as you're moving over to something that's not the vanilla sort of x86 Linux environment that it's expecting, you suddenly sort of have to pay the price of trying to figure out all of these..."Where did this come from?"

Model and hardware architectures

Lukas:
Right. Right. So I guess one question that comes to mind for me, and I don't know if you feel like it's a fair question for you to answer, but I'd love your thoughts on it: it seems like everyone, except for people at Google, trains their models on NVIDIA GPUs. I'm told that's because of the CUDA library and cuDNN, which essentially give you a low-level language for writing ML components and then compiling them onto the NVIDIA chip. But if Pete Warden can just directly write code to do matrix multiplication on a chip that's not even trying to publish its docs and let anyone do this, where's the disconnect? Why don't we see more kinds of chips being used? Why doesn't TensorFlow work better on top of more different kinds of architectures? I know that was one of the... I think that was one of the original design goals of TensorFlow, but we haven't seen maybe the explosion of different GPU architectures that I think we might've been expecting back in 2016, 2017.
Pete:
Yeah. I can't speak so directly to the TensorFlow experience, but I can say more generally what I've seen happening, speaking personally: it's the damn researchers. They keep coming up with new techniques and better ways of training models. What generally tends to happen is it follows the same model that Alex Krizhevsky and his colleagues originally did with AlexNet, where the thing that blew me away when I first started getting into deep learning was...Alex had made his code available and he had not only been working on the high-level model creation side, he'd also been really hacking on the CUDA kernels to run on the GPU to get stuff running fast enough. It was this really interesting...having to kind of understand all these high-level concepts, these cutting-edge concepts of machine learning, while also doing this inner-loop kind of assembly, essentially...not quite down to that level, but like intrinsics, really thinking about every cycle. What has tended to happen is that as new techniques have come in, the researchers tend to just — to run their own experiments — have to write things that run as fast as possible. So they've had to learn how to...the default for this is CUDA, so you end up with new techniques coming in as a CUDA implementation. Usually there's a C++ CPU implementation that may or may not be particularly optimized and then there's definitely a CUDA implementation. Then for the techniques that catch on, the rest of the world has to figure out how to take what's often great code for its purpose, but written by researchers for research purposes, and port it to different systems with different precisions. There's this whole hidden amount of work that people have to do to take all of these emerging techniques and get them running across all architectures. I think that's true across the whole ecosystem. It's one of the reasons that, for experimenting — if you're in the Raspberry Pi sort of form factor, but you can afford to be burning 10 watts of power — I really love grabbing a Jetson or a Jetson Nano or something, because then you've got essentially the same GPU that you'd be running in a desktop machine, just on a much smaller form factor.
Lukas:
Totally. Yeah. It makes me a little sad that the Raspberry Pi doesn't have an NVIDIA chip on it.
Pete:
The heat sink alone would be...
Lukas:
One thing I noticed...your book on embedded ML is excellent. Actually, I was in a different interview — we should pull that clip — with Pete Skomoroch, and we both had your book at our desks, so we had both been reading it. I don't know if you know him but-
Pete:
Yeah, I'm a good...yeah, Pete's awesome. He's been doing some amazing stuff too. He's another person who occasionally catches me when I'm procrastinating and I'm able to offer some advice and vice versa.
Lukas:
We should have a neighborhood...
Pete:
Yeah. Procrastination, hacking procrastination list.

Training a magic wand

Lukas:
It seems pretty obvious that you do some interesting projects in your house or for personal stuff. I was wondering if you could talk about any of your own personal ML hack projects.
Pete:
Oh, that's a really...I'm obsessed with actually trying to get a magic wand working well.
Lukas:
Tell me more.
Pete:
One of the things I get to see is...these applications that are being produced by industry professionals for things like Android phones, smartphones in general. The gesture recognition using accelerometers just works really well on these phones, because people are able to get it working really well in the commercial realm. But I haven't seen that many examples of it actually working well as open source. Even the example that we ship with TensorFlow Lite Micro is not good enough. It's a proof of concept, but it doesn't work nearly as well as I want. So I have been...that's been one of my main projects I keep coming back to is, "Okay, how can I actually do a Zorro sign or something holding — I've got the little Arduino on my desk here — and do that and have it recognize..." I want to be able to do that to the TV screen and have it change channels or something. What I've really wanted to be able to do — we actually released some of this stuff as part of Google I/O, so I'll share a link. Maybe you can put it in the description afterwards — but my end goal, because these things actually have Bluetooth, I want it to be able to emulate a keyboard or a mouse or gamepad controller and actually be able to customize it so that you can — or a MIDI keyboard even as well — and actually customize it so you can do some kind of gesture and then have it...you do a "Z" and it presses the Z key or something on your virtual keyboard, and that does something interesting with whatever you've got it connected up to. So, that isn't quite working yet. But if I...hopefully I get some tough enough problems in my main job that I'll procrastinate and spend some more time on that.

Raspberry Pi vs Arduino

Lukas:
Man, I hope for that too. For people that maybe aren't experts in embedded computing systems, could you describe the difference between a Raspberry Pi and an Arduino? And then the different challenges in getting ML to run on a Raspberry Pi versus an Arduino?
Pete:
At a top level, the biggest difference is the amount of memory. This Arduino Nano 33 BLE Sense is...I think it has 256K of RAM and either 512K or something like that of flash, kind of read-only memory. It's this really, really small environment you actually have to run in, and it means you don't have a lot of things that you would expect to have through an operating system, like files or printf. You're really having to look at every single byte. The printf function itself...in a lot of implementations it will actually take about 25 kilobytes of code size just having printf, because printf is essentially this big switch statement of, "Oh, have you got a percent F? Oh, here's how you print a float value," and there are hundreds of these modifiers and things you'd never even think of, for printing anything you can imagine, and all that code has to get put in if you actually have printf in the system. All of these devices that we're aiming at, they often have only a couple of hundred kilobytes of space to write your programs in. You may be sensing a theme here: I love to fit...take modern stuff and fit it back into something like a Commodore 64.
Lukas:
It seems like Pete Warden doesn't always need a practical reason to do something, but what might be the practical reason for choosing between an Arduino and a Raspberry Pi?
Pete:
Luckily I've actually managed to justify my hobby and turn it into my full-time project, because one great example of where we use this is...let's see my phone here, let's get a hold of my phone, you know what a phone looks like. If you think about things like — I won't say the full word, because it will set off people's phones — but the OK-G wake word or the wake words on Apple or Amazon. When you're using a voice interface, you want your phone to wake up when it hears you say that word, but what it turns out is you can't afford to even run the main ARM application processor 24/7 to listen out for that word because your battery would just be drained. These main CPUs use maybe somewhere around a watt of power when they're up and running, when you're browsing the web or interacting with it. What they all do instead is actually have what's often called an "always on" hub or chip or sensor hub or something like that, where the main CPU is powered down so it's not using any energy, but this much more limited, much lower energy chip is actually running and listening to the microphone and running a very, very small — somewhere on the order of 30 kilobytes — ML model to say, "Hey, has somebody said that word, that wake word phrase that I'm supposed to be listening out for?" They have exactly the same challenges. You only have a few hundred kilobytes at most. You're running on a pretty low-end processor. You don't have an operating system, every byte counts. So you have to squeeze the library as small as possible. That's one of the real-world applications where we're actually using this TensorFlow Lite Micro. More generally, the Raspberry Pi is...you're probably looking at $25, something like that. The equivalent of the Arduino — which the Raspberry Pi Foundation just launched last year or maybe at the start of this year — is the Pico. And that's, I think, $3 retail. The Raspberry Pi, again, uses one or two watts of power, so if you're going to run it for a day, you essentially need a phone battery, which it will run down over the course of a day. Whereas the Pico is only using a hundred milliwatts, a 10th of a watt. You can run it for 10 times longer on the same battery, you can run it on a much smaller battery. These embedded devices tend to be used where there's power constraints, or there's cost constraints, or even where there's form factor constraints, because this thing is even smaller than a Raspberry Pi Zero and you can stick it anywhere and it will survive being run over and all of those sorts of things.
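
As a rough back-of-the-envelope check on those power numbers, here's the arithmetic. The battery capacity is an assumed figure for a typical phone battery, not something from the episode:

```python
# Back-of-the-envelope battery-life estimate for the power figures above.
# Assumed battery: ~3000 mAh phone battery at 3.7 V (about 11 Wh); the power
# draws are the rough numbers quoted in the conversation.
battery_wh = 3.0 * 3.7            # amp-hours * volts = watt-hours (~11.1 Wh)

pi_watts = 1.0                    # Raspberry Pi: "one or two watts"
pico_watts = 0.1                  # Pico: "a hundred milliwatts, a 10th of a watt"

print(f"Raspberry Pi: ~{battery_wh / pi_watts:.0f} hours")    # roughly a day
print(f"Pico:         ~{battery_wh / pico_watts:.0f} hours")  # ~10x longer
```
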

Reducing model size

Lukas:
Can you describe — let's take, for example, a speech recognition system — can you describe the differences of how you would think about training and deploying if it was going to the cloud or a big desktop server versus a Raspberry Pi versus an Arduino?
Pete:
Yeah. The theme again is size and how much space you actually have on these systems. You'll be thinking always about, "How can I make this model as small as possible?" You're looking at making the model probably in the tens of kilobytes for doing...we have this example of doing speech recognition and I think it uses a 20 kilobyte model. You are going to be sacrificing accuracy and a whole bunch of other stuff in order to get something that will actually fit on this really low energy device. But hopefully it's still accurate enough that it's useful.
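
For a sense of what "tens of kilobytes" means in practice, here's a sketch of a tiny keyword-spotting-style model in Keras. The input shape and layer sizes are illustrative guesses chosen to land in that size range once quantized to 8 bits; this is not the actual TensorFlow Lite Micro speech example.

```python
import tensorflow as tf

# Illustrative tiny keyword-spotting model: assumed 49x40 spectrogram input
# and 4 output classes. Layer sizes are made up to land in the tens-of-KB
# range; this is not the actual TFLite Micro speech example.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(49, 40, 1)),
    tf.keras.layers.Conv2D(8, (10, 8), strides=(2, 2), padding="same",
                           activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(4, activation="softmax"),
])
model.summary()

# Conv2D: 10*8*1*8 + 8    =    648 parameters
# Dense:  (25*20*8)*4 + 4 = 16,004 parameters
# ~16,650 parameters total -> roughly 17 KB at 8 bits per weight.
```
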
Lukas:
Right. How do you do that? How do you reduce the size without compromising accuracy? Can you describe some of the techniques?
Pete:
I actually just blogged about a trick that I've seen used but realized I hadn't seen in the literature very much. Which is where — the classic approach going back to AlexNet — after you do a convolution in an image recognition network, you often have a pooling stage. That pooling stage would either do average pooling or max pooling. What that's doing is it's taking the output of the convolution, which is often the same size as the input but with a lot more channels, and then it's taking blocks of 2 by 2 values and it's saying, "Hey, I'm going to only take the maximum of that 2 by 2 block." So, take 4 values and output 1 value, or do the same but do averaging. That helps with accuracy. But because you are outputting these very large outputs from the convolution, that means that you have to have a lot of RAM, because you have to hold the input for the convolution and you also have to hold the output, which is the same size as the input but typically has more channels, so the memory size is even larger. Instead of doing that, a common technique that I've seen in the industry is to use a stride of 2 on the convolution. Instead of having the sliding window just slide over 1 pixel every time as you're doing the convolutions, you actually have it move over 2 pixels, horizontally and vertically. That has the effect of outputting the same result as you would...or the same size, the same number of elements, that you would get if you did a convolution followed by a 2 by 2 pooling. But it means that you actually do less compute and you don't have to have nearly as much active memory kicking around.
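
Here's a small Keras sketch of that trade-off: a stride-1 convolution followed by 2x2 pooling versus a single stride-2 convolution. The input size and channel counts are arbitrary; the point is that the strided version never materializes the full-resolution intermediate activation.

```python
import tensorflow as tf

inputs = tf.keras.layers.Input(shape=(96, 96, 3))   # illustrative input size

# Classic pattern: stride-1 convolution, then 2x2 max pooling. The full
# 96x96x16 convolution output has to sit in RAM before pooling shrinks it.
conv = tf.keras.layers.Conv2D(16, 3, strides=1, padding="same",
                              activation="relu")(inputs)
pooled = tf.keras.layers.MaxPooling2D(2)(conv)
print(conv.shape, pooled.shape)    # (None, 96, 96, 16) (None, 48, 48, 16)

# Stride-2 alternative: same output size (48x48x16), but the full-resolution
# intermediate never exists, so peak activation memory and compute are both
# roughly 4x smaller.
strided = tf.keras.layers.Conv2D(16, 3, strides=2, padding="same",
                                 activation="relu")(inputs)
print(strided.shape)               # (None, 48, 48, 16)
```
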
Lukas:
Interesting. I had thought maybe with the size of the model it was just the size of the model's parameters, but it sounds like you also...obviously you need some active memory. But it's hard to imagine that even could be on the order of magnitude of the size of the model. Literally the pixels of the image and then the intermediate results can be bigger than the model?
Pete:
Yeah. That's the nice thing about convolution. You get to reuse the weights in a way that you really don't with fully connected layers. You can actually end up with convolutional models where the activation memory takes up a substantial amount of space. I'm also getting into the weeds a bit, because the obvious answer to your question is also quantization. Taking these floating point models and just turning them into 8-bit, because that immediately slashes all of your memory sizes by 75%.
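
A minimal sketch of that 8-bit quantization step, using TensorFlow Lite's post-training quantization on a stand-in Keras model. The model and representative data here are placeholders, not a real speech model:

```python
import numpy as np
import tensorflow as tf

# Stand-in model; in practice this would be your trained float model.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(49, 40, 1)),
    tf.keras.layers.Conv2D(8, 3, strides=2, activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(4, activation="softmax"),
])

def representative_data():
    # A small sample of realistic inputs, used to calibrate activation ranges.
    for _ in range(100):
        yield [np.random.rand(1, 49, 40, 1).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]

tflite_model = converter.convert()   # weights and activations stored as int8
print(f"{len(tflite_model) / 1024:.1f} KB")   # roughly 4x smaller than float32
```
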
Lukas:
I've seen people go down to 4 bits or even 1 bit. Do you have thoughts on that?
Pete:
Yeah. There's been some really interesting work. A colleague of mine actually — again, I'll send on a link to the paper — looked at...I think it showed something like the Pareto-optimal bit depth for ResNet is 4 bits. There's been some really, really good research about going down to 4 bits or 2 bits, or even going down to binary networks with 1 bit. The biggest challenge from our side is that CPUs aren't generally optimized for anything other than 8-bit arithmetic. Going down to these lower bit depths requires some advances in the hardware they're actually using.
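
To show what going below 8 bits looks like numerically, here's a small "fake quantization" sketch in NumPy that rounds weights to a given number of levels and measures the error. This simulates the precision loss in floating point, the way much of the research does; as noted above, stock CPUs still have no fast sub-8-bit arithmetic, so this doesn't by itself make inference faster.

```python
import numpy as np

def fake_quantize(weights, bits=4):
    """Round a float tensor to 2**bits evenly spaced levels, then dequantize.

    This simulates low-precision storage in floating point; it shows the
    rounding error you'd pay, but a stock CPU has no fast 4-bit arithmetic
    to actually exploit the smaller format.
    """
    levels = 2 ** bits - 1
    w_min, w_max = weights.min(), weights.max()
    scale = (w_max - w_min) / levels
    codes = np.round((weights - w_min) / scale)   # integers in 0..levels
    return codes * scale + w_min

w = np.random.randn(256).astype(np.float32)
for bits in (8, 4, 2, 1):
    err = np.abs(w - fake_quantize(w, bits)).mean()
    print(f"{bits}-bit mean absolute error: {err:.4f}")
```
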

Training on the edge

Lukas:
Do you have any thoughts about actually training on the edge? I feel people have been talking about this for a long time, but I haven't seen real world examples where you can actually do some of the training and then it passes that upstream. Is that...
Pete:
What I've seen is that, especially on the embedded edge, it's very hard to get labeled data. Right now, there's been some great advances in unsupervised learning, but our workhorse approach to solving image and audio and accelerometer recognition problems is still around actually taking big labeled datasets and just running them through training. If you don't have some implicit labels on the data that you're gathering on the edge, which you almost never do, it's very hard to justify training. The one case where I actually have seen this look like it's pretty promising is for industrial monitoring. So when you've got a piece of machinery and you basically want to know if it's about to shake itself to bits because it's got a mechanical problem, you have an accelerometer or microphone sensor sitting on this device. The hard part is telling whether it's actually about to shake itself to bits or whether that's just how it normally vibrates. One promising approach for this predictive maintenance is to actually spend the first 24 hours just assuming that everything is normal and learning, "Okay, this is normal." And then only after that, start to look for things that are outside of that...you're implicitly labeling the first 24 hours, "Okay, this is normal data," and then you're looking for anything that's an excursion out beyond that. That makes sense for some kind of a training approach. But even there, I still actually push people to consider things like using embeddings and other approaches that don't require full backpropagation to do the training. For example, if you have an audio model that has to recognize a particular person saying a word, try and have that model produce an N-dimensional vector that's an embedding, and then have the person say the word 3 times, and then just use k-nearest neighbor approaches to tell if subsequent utterances are close in that embedding space. You've done something that looks like learning, from a user perspective, but you don't have to have all this machinery of variables and changing the neural network and you're just doing it as a post-processing action.
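
A minimal sketch of that embedding-plus-nearest-neighbor idea. The `embed()` function here is a stand-in (a random projection, purely so the script runs end to end); in practice it would be a pretrained audio model producing an N-dimensional vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "embedding model": a fixed random projection from 1 second of
# 16 kHz audio down to a 64-dimensional unit vector. In practice this would
# be a pretrained neural network.
projection = rng.normal(size=(16000, 64))

def embed(utterance):
    v = utterance @ projection
    return v / np.linalg.norm(v)

# "Enrollment": the user says the word 3 times; we just store the embeddings.
# No gradients, no weight updates on the device.
enrolled = np.stack([embed(rng.normal(size=16000)) for _ in range(3)])

def matches(utterance, threshold=0.7):
    """Nearest-neighbor check in embedding space, done as post-processing."""
    similarity = (enrolled @ embed(utterance)).max()   # best cosine similarity
    return similarity > threshold

print(matches(rng.normal(size=16000)))   # random audio: almost certainly False
```
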
Lukas:
Do you see a lot of actual real world uses, like actual companies shipping stuff like models into micro controllers?
Pete:
Yeah. This is hard to talk about because these aren't Android apps and things where people are fairly open and open source. A lot of these are pretty well-established old-school industrial companies and automotive companies and things like that. But we do see...there's a bunch of products out there that are already using ML under the hood. One of the examples I like to give is when I joined Google back in 2014, I met Raziel Alvarez — who's now actually at Facebook doing some very similar stuff, I believe — but he was responsible for a lot of the OK-G work. They've been shipping on billions of phones, using ML and specifically using deep learning, to do this kind of recognition. But I had no idea that they were shipping these 30-kilobyte models to do ML, and they had been for years. From my understanding, from what I've seen of Apple and other companies, they've been using very similar approaches in the speech world for a long time. But a lot of these areas don't have the same expectation that you'll publicize work, that we tend to have in the modern ML world. It flies below the radar. These things are...there's ML models already running in your house, almost certainly right now, that are running on embedded hardware.
Lukas:
Besides the audio recognition, what might those ML models in my house be doing? Can you give me a little bit of flavor for that?
Pete:
Yeah. Accelerometer recognition, trying to tell if somebody's doing a gesture, or if a piece of machinery is doing what you're expecting. The washing machine or the dishwasher or things like that, trying to actually take in these signals from noisy sensors and tell what's actually happening.
Lukas:
Do you think there's a ML model in my washing machine?
Pete:
I would not be at all surprised.
Lukas:
Wow.
Pete:
Yeah.

What it's like to work on TensorFlow

Lukas:
I guess another question that I had for you, thinking about your long tenure on TensorFlow — which is such a well known library — is how has that evolved over the time you've been there? Have things surprised you in the directions that it's taken? How do you even think about, with a project like that, what to prioritize into the future?
Pete:
Honestly how big TensorFlow got and how fast really blew me away. That was amazing to see. I'm used to working on these weird technical problems that I find interesting and following my curiosity. I'd been led to TensorFlow by pulling on a piece of yarn and ending up there. It was really nice to see...not just TensorFlow, but PyTorch, MXNet, all of these other frameworks, there's been this explosion in the number of people interested. Especially, there's been this explosion in the number of products that have been shipping. The number of use cases that people have found for these has been really mind blowing. I'm used to doing open source projects which get 10 stars or something, and I'm happy. But seeing TensorFlow and all these other frameworks just get this mass adoption has been...yeah. It definitely surprised me, and has been really nice to see.
Lukas:
What about in terms of what it does? How has that evolved? What new functionality gets added to a library like that? Why do you make so many breaking changes?
Pete:
Yes. I would just like to say I am sorry [laughs]. It's such a really interesting problem, because we're almost coming back to what we were talking about with Alex Krizhevsky. The classic example of the ML paradigm that we're in at the moment is you need a lot of flexibility to be able to experiment and create models and iterate new approaches, but all of the approaches need to run really, really, really, really fast because you're running millions of iterations, millions of data points through each run just in order to try out one model. So you've got this really challenging combination of you need all this flexibility, but you also need this cutting edge performance, and you're trying to squeeze out the absolute maximum amount of throughput you can out of the hardware that you have. So you end up with this world where you have Python calling into these chunks of operators or layers, where the actual operators or layers themselves are highly, highly optimized, but you're expecting to be able to plug them into each other in very arbitrary ways and preserve that high performance. Especially with TensorFlow, you're also expecting to be able to do it across multiple accelerated targets. Things like the TPU, CPUs, and AMD, as well as NVIDIA GPUs. Honestly, it's just a really hard engineering problem. It's been a couple of years now since I've been on the mainline TensorFlow team, and it blew my mind how many dimensions and combinations and permutations and things they had to worry about in terms of getting this stuff just up and running and working well for people. It is tough as a user because you've got this space shuttle control panel of complexity and you probably only want to use part of it, but everybody wants a different-
Lukas:
Right, right. Well, maybe this is I guess a naive question, but when I look at the cuDNN library, it looks pretty close to the TensorFlow wrapper. Is that right? It seems like it tries to do the same building blocks that TensorFlow has. So I would think with NVIDIA, it would be a lot of just passing information down into cuDNN?
Pete:
Yeah. I mean, where I saw a lot of complexity was around things like the networking and the distribution and the very fast...making sure that you didn't end up getting bottlenecked on data transfer as you're shuffling stuff around. We've had to go in and mess around with JPEG encoding and try different libraries to figure out which one would be faster because that starts to become the bottleneck at some point when you're throwing your stuff onto the GPU fast enough. I have to admit though, I'm getting out of my...I've looked at that code in wonder. I have not tried to fix issues there, so I'm...
Lukas:
Amazing. I guess one more question on the topic. How do you test all these hardware environments? Do you have to set up the hardware somewhere to run all these things before you ship the library?
Pete:
Well, that's another pretty...the task of doing the continuous integration and the testing across all of these different pieces of hardware and all the different combinations of, "Oh, have you got 2 cards in your machine? Have you got 4? Have you got this version of Linux? Are you running on Windows? Which versions of the drivers do you have? Which versions of the accelerators on cuDNN?" All of these, there are farms full of these machines where we're trying to test all of these different combinations and permutations, or as many as we can, to try and actually make sure that stuff works. As you can imagine, it's not a straightforward task.

Improving datasets and model deployment

Lukas:
All right. Well, we're getting close to time, and we always end with two questions that I want to save time for. One question is what is an underrated topic in machine learning that you would like to investigate if you had some extra time?
Pete:
Datasets. The common theme that I've seen throughout all the time I've worked with...I've ended up working with hundreds of teams who are creating products using machine learning, and almost always what we find is that investing time in improving their datasets is a much better return on investment than trying to tweak their architectures or hyper-parameters or things like that. There are very few tools out there for actually doing useful things with datasets and improving datasets and understanding datasets and gathering datasets and data points, and cleaning up labels. I really think...I'm starting to see...I think Andrew Ng and some other people have been talking about data-centric approaches and I'm starting to see more focus on that. But I think that that's going to just continue, and it's going to be...I feel like as the ML world is maturing and more people are going through that experience of trying to put a product out and realizing, "Oh my god, we need better data tools," there's going to be way more demand and way more focus on that. That is an extremely interesting area for me.
Lukas:
Well, you may have answered my last question, but I think you're well-qualified to answer it, having done a bunch of ML startups and then working on TensorFlow. When you think about deploying an ML model in the real world and getting it to work for a useful purpose, what do you see as the major bottlenecks? I guess datasets is one, I agree, is maybe the biggest one, but do you see others too?
Pete:
Yeah. So, another big problem is there's this artificial distinction between the people who create models, who often come from a research background, and the people who have to deploy them. What will often happen is that the model creation people will get as far as getting an eval that shows that their model is reaching a certain level of accuracy in their Python environment, and they'll say, "Okay, I'm done. Here's the checkpoints for this model," which is great, and then just hand that over to the people who are going to deploy it on an Android application. The problem there is that there's all sorts of things like the actual data in the application itself may be quite different to the training data. You're almost certainly going to have to do some stuff to it like quantization or some kind of thing that involves re-training, in order to have something that's optimal for the device that you're actually shipping on. There's just a lot of really useful feedback that you can get from trying this out in a real device that someone can hold in their hand and use that you just don't get from the eval use case. So coming back to Pete Skomoroch, I first met him when he was part of the whole DJ Patil and LinkedIn crew doing some of the really early data science stuff. They had this idea...I think it was DJ who came up with the naming of data science and data scientists as somebody who would own the full stack of taking everything from doing the data analysis to coming up with models and things on it to actually deploying those on the website and then taking ownership of that whole end-to-end process. The teams I've seen be really successful at deploying ML products have had people who, formally or informally, have taken on responsibility for the whole thing, and have the people who are writing the inner loops of the assembly sitting next to the people who are creating the models. The team who created MobileNet, the Mobile Vision team with Andrew Howard and Benoit Jacob, they were a great example of that. They all work very, very closely together doing everything from coming up with new model techniques to figuring out how they're actually going to run on real hardware at the really low level. So, that's one of the biggest things that I'm hoping to see change in the next few years as more people adopt that model.

Outro

Lukas:
Well said. Thanks so much, Pete. That was super fun.
Pete:
Yeah, thanks, Lukas.
Lukas:
If you're enjoying these interviews and you want to learn more, please click on the link to the show notes in the description, where you can find links to all the papers that are mentioned, supplemental material, and a transcription that we work really hard to produce, so check it out.