Skip to main content

Transcription with Whisper

Created on December 29|Last edited on February 21


Subheader




Run set
470
State
Notes
User
Tags
Created
Runtime
Sweep
audio_duration
audio_format
audio_path
modelsize
transcript
transcription_factor
transcription_time
Failed
-
hans-ramsl
9s
-
-
-
-
-
-
-
-
Failed
-
hans-ramsl
6s
-
-
-
-
-
-
-
-
Crashed
-
hans-ramsl
21h 4m 12s
-
-
-
-
-
-
-
-
Finished
-
hans-ramsl
34s
-
-
-
-
-
-
-
-
Finished
-
hans-ramsl
6s
-
-
-
-
-
-
-
-
Finished
-
hans-ramsl
26s
-
-
-
-
-
-
-
-
Finished
-
hans-ramsl
1m 7s
-
-
-
-
-
-
-
-
Finished
-
hans-ramsl
27s
-
-
-
-
-
-
-
-
Finished
-
hans-ramsl
7m 27s
-
-
-
-
-
-
-
-
Finished
-
hans-ramsl
5m 50s
-
-
-
-
-
-
-
-
1-10
of 470



All Runs
291



Pete Warden (2021-10-21)
6



Adrien Gaidon (2021-04-23)
4




Large Models
75


[00:00.000 --> 00:07.072]   The teams I've seen being really successful at deploying ML products, they've had people
[00:07.072 --> 00:13.010]   who formerly, when formerly, have taken on that hot responsibility for the whole thing
[00:13.010 --> 00:17.072]   and have the people who are writing the inner loops of the assembly sitting next to the
[00:17.072 --> 00:20.040]   people who are creating the models.
[00:20.040 --> 00:24.078]   You're listening to Gradient Dissent, a show about machine learning in the real world,
[00:24.078 --> 00:27.004]   and I'm your host, Lukas Biewald.
[00:27.004 --> 00:30.092]   This is a conversation with Pete Worden, well-known hacker and blogger.
[00:30.092 --> 00:35.020]   Among many things that he's done in his life, he started a company Jetpack, which was a
[00:35.020 --> 00:40.072]   very early mobile machine learning app company that was bought by Google in 2014.
[00:40.072 --> 00:45.032]   He's also been a tech lead and staff engineer on the TensorFlow team since then, so he's
[00:45.032 --> 00:48.040]   been at TensorFlow since the very beginning.
[00:48.040 --> 00:53.048]   He's written a book about taking ML models and making them work on embedded devices,
[00:53.048 --> 00:56.008]   everything from an Arduino to a Raspberry Pi.
[00:56.008 --> 00:59.000]   It's something that I'm really passionate about, so we really get into it in the technical
[00:59.000 --> 01:00.000]   details.
[01:00.000 --> 01:02.040]   I think you'll really enjoy this interview.
[01:02.040 --> 01:04.072]   Quick disclaimer for this conversation.
[01:04.072 --> 01:08.012]   We had a few glitches in the audio, which are entirely my fault.
[01:08.012 --> 01:13.008]   I've been traveling with my family to Big Sur, which is a lot of fun, but I didn't bring
[01:13.008 --> 01:17.068]   all my podcasting gear, as you can probably see.
[01:17.068 --> 01:22.052]   If anything's inaudible, please check the transcription, which is provided in the notes.
[01:22.052 --> 01:23.052]   All right, Pete.
[01:23.052 --> 01:27.092]   I have a lot of questions for you, but since this is my show, I'm going to start with the
[01:27.092 --> 01:33.052]   question that I would want to ask if I was listening.
[01:33.052 --> 01:37.088]   Tell me again about the time that you hacked the Raspberry Pi to train neural nets with
[01:37.088 --> 01:38.088]   the GPU.
[01:38.088 --> 01:39.088]   Oh, God.
[01:39.088 --> 01:50.044]   Yeah, that was really fun.
[01:50.044 --> 01:57.000]   Back when the Raspberry Pi first came out, it had a GPU in it, but it wasn't a GPU you
[01:57.000 --> 02:02.012]   could do anything useful with, unless you wanted to draw things.
[02:02.012 --> 02:06.084]   Who wants to just draw things with a GPU?
[02:06.084 --> 02:15.016]   But there was some reverse engineering that had been happening and some crazy engineers
[02:15.016 --> 02:24.040]   out there on the hardware side who'd actually managed to get a manual describing how to
[02:24.040 --> 02:28.008]   program the Raspberry Pi GPU at a low level.
[02:28.008 --> 02:35.080]   This had been driving me crazy ever since I'd been at Apple years ago, because I was
[02:35.080 --> 02:44.064]   always able to use GLSL and all of these comparatively high-level languages to program GPUs.
[02:44.064 --> 02:52.004]   But I was always trying to get them to do things that the designers hadn't intended.
[02:52.004 --> 02:57.020]   When I was at Apple, I was trying to get them to do image processing rather than just doing
[02:57.020 --> 02:59.068]   straightforward graphics.
[02:59.068 --> 03:00.068]   And I never...
[03:00.068 --> 03:04.008]   You may hear a dog in the background.
[03:04.008 --> 03:07.028]   That is our new puppy, Nutmeg.
[03:07.028 --> 03:09.072]   But I always wanted to be able to program them.
[03:09.072 --> 03:15.064]   I knew that there was an assembler level that I could program them at if I only had access.
[03:15.064 --> 03:22.000]   I spent five years at Apple trying to persuade ATI and Nvidia to give me access.
[03:22.000 --> 03:28.060]   I actually managed to persuade them, but then the driver people at Apple were like, "No,
[03:28.060 --> 03:36.036]   don't give him access, because then we'll have to support the crazy things he's doing."
[03:36.036 --> 03:37.092]   When the Raspberry Pi came along...
[03:37.092 --> 03:40.096]   And this was Raspberry Pi 1 or 2 or 3?
[03:40.096 --> 03:45.060]   This was back in the Raspberry Pi 1 days.
[03:45.060 --> 03:53.076]   So it was not long after it first came out, and they actually gave you the data sheet
[03:53.076 --> 04:03.024]   for the GPU, which described the instruction format for programming all of these weird
[04:03.024 --> 04:08.000]   little hardware blocks that were inside the GPU.
[04:08.000 --> 04:12.064]   And there really wasn't anything like an assembler.
[04:12.064 --> 04:18.084]   There wasn't basically anything that you would expect to be able to use.
[04:18.084 --> 04:24.092]   All you had was the raw, "Hey, these are the machine code instructions."
[04:24.092 --> 04:29.092]   And especially back in those days, in the Raspberry Pi 1 days, there wasn't even any
[04:29.092 --> 04:36.032]   SIMD instructions on the Raspberry Pi, because it was using an ARM v6.
[04:36.032 --> 04:39.028]   But what is a SIMD instruction?
[04:39.028 --> 04:42.016]   Oh, sorry, Single Input Multiple Data.
[04:42.016 --> 04:49.092]   So if you're familiar with x86, it's things like SEE or AVX.
[04:49.092 --> 04:58.024]   It's basically a way of saying, "Hey, I've got an array of 32 numbers.
[04:58.024 --> 05:01.032]   Multiply them all."
[05:01.032 --> 05:07.076]   And specifying that in one instruction versus having a loop that goes through 32 instructions
[05:07.076 --> 05:12.090]   and does them one at a time.
[05:12.090 --> 05:19.012]   So it's a really nice way of speeding up anything that's doing a lot of number crunching, whether
[05:19.012 --> 05:25.090]   it's graphics or whether it's, in our case, machine learning.
[05:25.090 --> 05:31.088]   And I really wanted to get some cool image recognition stuff.
[05:31.088 --> 05:34.084]   This is back when AlexNet was all the way.
[05:34.084 --> 05:42.084]   I wanted to get AlexNet running in less than 30 seconds frame on this Raspberry Pi.
[05:42.084 --> 05:45.034]   And the ARM v6 really was.
[05:45.034 --> 05:53.084]   I think it was just like, Broadcom had some dumpster full of these chips they couldn't
[05:53.084 --> 05:56.098]   sell because they were so old.
[05:56.098 --> 05:57.098]   This is not official.
[05:57.098 --> 06:02.078]   I have no idea if this is true, but it feels true.
[06:02.078 --> 06:06.094]   And so they were like, "Oh, sure, use them for this Raspberry Pi thing that we're thinking
[06:06.094 --> 06:07.094]   about."
[06:07.094 --> 06:11.068]   And they were so old that it was actually really hard to even find compiler support.
[06:11.068 --> 06:18.010]   They didn't have especially these kinds of modern optimizations that you would expect
[06:18.010 --> 06:19.028]   to have.
[06:19.028 --> 06:27.004]   But I knew that this GPU could potentially do what I wanted.
[06:27.004 --> 06:29.044]   So I spent some time with the datasheet.
[06:29.044 --> 06:34.028]   There were a handful of people doing open source hacking on this stuff, so I was able
[06:34.028 --> 06:39.004]   to fork some of their projects.
[06:39.004 --> 06:50.032]   And actually, funnily enough, some of the Raspberry Pi founders were actually very interested
[06:50.032 --> 06:52.064]   in this too.
[06:52.064 --> 07:06.088]   So I ended up hacking away and managed to figure out how to do this matrix multiplication.
[07:06.088 --> 07:15.092]   And funnily enough, one of the people who was really into this was actually Evan Lupton,
[07:15.092 --> 07:20.064]   the founder of Raspberry Pi.
[07:20.064 --> 07:24.076]   He was actually one of the few people who actually applied on the forums when I was
[07:24.076 --> 07:29.020]   sending out distress signals when I was getting stuck on stuff.
[07:29.020 --> 07:37.040]   So anyway, I ended up being able to use the GPU to do this matrix multiplication, so I
[07:37.040 --> 07:46.084]   actually run AlexNet to recognize a cat or a dog in two seconds rather than 30 seconds.
[07:46.084 --> 07:53.008]   And it was some of the most fun I've had in years because it really was trying to string
[07:53.008 --> 07:58.060]   things together with sticky tape and chicken wire.
[07:58.060 --> 08:00.056]   And yeah, I had a blast.
[08:00.056 --> 08:01.056]   How does it even work?
[08:01.056 --> 08:07.092]   You're writing assembly and running it on the GPU.
[08:07.092 --> 08:13.064]   What environment are you writing this in?
[08:13.064 --> 08:19.084]   So I was just pretty much using a text editor, and then there were a couple of different
[08:19.084 --> 08:23.044]   people who had done some work on assembly projects.
[08:23.044 --> 08:29.084]   None of them really worked, or they didn't work for what I needed, so I ended up hacking
[08:29.084 --> 08:30.084]   them up together.
[08:30.084 --> 08:39.004]   So I then feed in the text into the assembler, which would produce the raw command streams.
[08:39.004 --> 08:45.016]   And then I had to figure out the right memory addresses to write to from the Raspberry Pi
[08:45.016 --> 08:49.012]   CPU to upload this program.
[08:49.012 --> 08:55.056]   And then that program would be sitting there in the, I think there was something like some
[08:55.056 --> 08:57.056]   ridiculously small number of instructions.
[08:57.056 --> 09:04.008]   I could run like 64 instructions in there or something, or 128.
[09:04.008 --> 09:09.004]   The program would be sitting there on all of these, I think it was like four or eight
[09:09.004 --> 09:10.004]   cores.
[09:10.004 --> 09:13.020]   I would then have to kick them off.
[09:13.020 --> 09:20.032]   I'd have to feed in the memory from the...
[09:20.032 --> 09:28.096]   Honestly, in terms of software engineering, it was a disaster, but it worked.
[09:28.096 --> 09:30.088]   What kind of debugging messages do you get?
[09:30.088 --> 09:32.060]   I'm thinking back to college and writing this.
[09:32.060 --> 09:37.004]   I remember the computer would just crash, I think, when there was invalid...
[09:37.004 --> 09:44.024]   I was actually writing out to a pixel so I could tell by the pixel color how far through
[09:44.024 --> 09:56.008]   the program that it had actually got, which I'm colorblind, so that didn't help.
[09:56.008 --> 10:02.008]   But yeah, it was really getting down and dirty.
[10:02.008 --> 10:05.084]   It was the sort of thing where you can just lose yourself for a few weeks in some really
[10:05.084 --> 10:10.012]   obscure technical problems.
[10:10.012 --> 10:15.036]   Having worked on projects kind of like that, how did you maintain hope that the project
[10:15.036 --> 10:16.096]   would finish in a way that it would work?
[10:16.096 --> 10:22.048]   I think that might be the hardest thing for me to work on something like that.
[10:22.048 --> 10:27.076]   At the time, I was working on a startup, and this seemed a much more practical problem
[10:27.076 --> 10:34.000]   than all of the other things I was dealing with at the startup.
[10:34.000 --> 10:40.000]   So in a lot of ways, it was just procrastination on dealing with worse problems.
[10:40.000 --> 10:41.048]   Great answer.
[10:41.048 --> 10:42.048]   Yeah.
[10:42.048 --> 10:50.052]   And I guess what was the reason that the Raspberry Pi included this GPU that they wouldn't actually
[10:50.052 --> 10:52.052]   let you directly access?
[10:52.052 --> 10:55.060]   Was it for streaming video or something?
[10:55.060 --> 11:01.088]   Yeah, it really was designed for, I think, early 2000s set-top boxes and things.
[11:01.088 --> 11:10.004]   So you were going to be able to draw a few triangles, but you weren't going to be able
[11:10.004 --> 11:11.004]   to run any...
[11:11.004 --> 11:13.096]   It wasn't designed to run any shaders or anything on it.
[11:13.096 --> 11:21.048]   So GLSL and things like that weren't even considered fluid at that time.
[11:21.048 --> 11:25.068]   I think there's been some work on that since, I think, maybe with some more modern versions
[11:25.068 --> 11:26.068]   of GPUs.
[11:26.068 --> 11:33.064]   But back in the last week, probably one day, it's just like, "You're going to draw some
[11:33.064 --> 11:34.064]   triangles."
[11:34.064 --> 11:37.024]   Have you been following the Raspberry Pi sense?
[11:37.024 --> 11:40.028]   Do you have thoughts on the floor?
[11:40.028 --> 11:43.036]   Did they talk to you about what to include there, maybe?
[11:43.036 --> 11:44.068]   No, no.
[11:44.068 --> 11:54.088]   I think they knew better than that because I'm not exactly an average user.
[11:54.088 --> 12:01.044]   As a general developer, it's fantastic because the Raspberry Pi 4 is this beast of a machine
[12:01.044 --> 12:07.008]   with multi-threading and it's got those SIMD instructions I talked about.
[12:07.008 --> 12:14.008]   There's I think support for GLSL and all these modern OpenGL things in the GPU.
[12:14.008 --> 12:21.000]   But as a hacker, I'm like, "Oh, it's just kind of..."
[12:21.000 --> 12:25.032]   It's all kind of...
[12:25.032 --> 12:26.032]   Exactly.
[12:26.032 --> 12:31.008]   Well, it's funny because I think I met you when I was trying to get TensorFlow to run
[12:31.008 --> 12:36.048]   on the Raspberry Pi 3, which is literally just trying to compile it and link in the
[12:36.048 --> 12:37.048]   proper libraries.
[12:37.048 --> 12:40.036]   I remember completely getting stuck.
[12:40.036 --> 12:44.036]   I'm ashamed to tell you that and reach out to the forums and being like, "Wow, the tech
[12:44.036 --> 12:49.048]   support from TensorFlow is unbelievably good that it's answering my questions."
[12:49.048 --> 12:52.024]   Well, I think you ended up...
[12:52.024 --> 12:54.016]   You found my email address as well.
[12:54.016 --> 12:55.088]   I think you dropped me an email.
[12:55.088 --> 13:01.004]   And again, I think you caught me in the middle of procrastinating on something that was supposed
[13:01.004 --> 13:02.004]   to be doing.
[13:02.004 --> 13:04.048]   And I was like, "Oh, wow, this is way more fun.
[13:04.048 --> 13:07.080]   Let me spend some time on this."
[13:07.080 --> 13:16.004]   But no, you shouldn't underestimate that TensorFlow has so many dependencies, which is pretty
[13:16.004 --> 13:25.032]   normal for a Python cloud server project because they're essentially free in that environment.
[13:25.032 --> 13:29.004]   You just do a PIP install or something and it will just work.
[13:29.004 --> 13:39.052]   But as soon as you're moving over to something that's not the vanilla x86 Linux environment
[13:39.052 --> 13:46.092]   that it's expecting, you suddenly have to pay the price of trying to figure out all
[13:46.092 --> 13:47.092]   of these.
[13:47.092 --> 13:48.092]   Where did this come from?
[13:48.092 --> 13:49.092]   Right.
[13:49.092 --> 13:50.092]   Right.
[13:50.092 --> 13:55.084]   So I guess one question that comes to mind for me that I don't know if you feel like
[13:55.084 --> 14:00.028]   it's a fair question for you to answer, but I'd love your thoughts on it.
[14:00.028 --> 14:03.076]   It seems like everyone trains their models, except for people at Google, train their models
[14:03.076 --> 14:06.068]   on Nvidia GPUs.
[14:06.068 --> 14:11.064]   I'm told that's because of the CUDA library that essentially compiles the code and CUDA
[14:11.064 --> 14:19.008]   NN that makes a low-level language for writing ML components and then compiling them onto
[14:19.008 --> 14:20.032]   the Nvidia chip.
[14:20.032 --> 14:28.068]   But if Pete Worden can just directly write code to do matrix multiplication on a chip
[14:28.068 --> 14:34.016]   that's not even trying to publish its docs and let anyone do this, where's the disconnect?
[14:34.016 --> 14:37.080]   Why don't we see more chips being used for compiling?
[14:37.080 --> 14:42.080]   Why doesn't TensorFlow work better on top of more different kinds of architecture?
[14:42.080 --> 14:48.016]   I think that was one of the original design goals of TensorFlow, but we haven't seen maybe
[14:48.016 --> 14:52.048]   the explosion of different GPU architectures that I think we might have been expecting
[14:52.048 --> 14:54.048]   back in 2016, 2017.
[14:54.048 --> 14:55.048]   Yeah.
[14:55.048 --> 15:10.096]   personally is it's the damn researchers.
[15:10.096 --> 15:18.040]   They keep coming up with new techniques and better ways of training models.
[15:18.040 --> 15:27.008]   What generally tends to happen is it follows the same model that Alex Frizetti originally
[15:27.008 --> 15:34.052]   did and his colleagues with AlexNet, where the thing that blew me away when I first started
[15:34.052 --> 15:39.036]   getting into deep learning was Alex had made his code available.
[15:39.036 --> 15:47.012]   He had not only been working at the high-level model creation side, he'd also been really
[15:47.012 --> 15:55.020]   hacking on the CUDA kernels to run on the GPU to get stuff running fast enough.
[15:55.020 --> 16:00.056]   It was this really interesting having to understand all these high-level concepts, these cutting
[16:00.056 --> 16:10.012]   edge concepts of machine learning, while also being this in a loop assembly, essentially,
[16:10.012 --> 16:17.076]   not quite down to that level, but intrinsic, really thinking about every cycle.
[16:17.076 --> 16:25.016]   What has tended to happen is as new techniques have come in, the researchers tend to just
[16:25.016 --> 16:31.008]   for their own, to run their own experiments, they have to write things that run as fast
[16:31.008 --> 16:32.008]   as possible.
[16:32.008 --> 16:36.088]   They've had to learn how to, the default for this is CUDA.
[16:36.088 --> 16:44.068]   You end up with new techniques coming in as CUDA implementations.
[16:44.068 --> 16:50.044]   Usually there's a C++ CPU implementation that may or may not be particularly optimized,
[16:50.044 --> 16:54.004]   and then there's definitely a CUDA implementation.
[16:54.004 --> 16:59.064]   Then the techniques that actually catch on, the rest of the world has to then figure out
[16:59.064 --> 17:08.008]   how to take what's often great code for its purpose, but is written by researchers for
[17:08.008 --> 17:14.020]   research purposes, and then figure out how to port it to different systems with different
[17:14.020 --> 17:15.020]   decisions.
[17:15.020 --> 17:25.072]   There's this whole hidden amount of work that people have to do to take all of these emerging
[17:25.072 --> 17:29.072]   techniques and get them running across all architectures.
[17:29.072 --> 17:36.096]   I think that's true across the whole ecosystem.
[17:36.096 --> 17:43.028]   It's one of the reasons that I really love for experimenting if you're in the Raspberry
[17:43.028 --> 17:50.016]   Pi form factor, but you can afford to be burning 10 watts of power.
[17:50.016 --> 17:55.084]   Have a Jetson or Jetson Nano or something, because then you've got essentially the same
[17:55.084 --> 18:03.048]   you that you'd be running in a desktop machine, but just on a much smaller form factor.
[18:03.048 --> 18:04.048]   Totally.
[18:04.048 --> 18:05.048]   Yeah.
[18:05.048 --> 18:11.096]   It makes me a little sad that the Raspberry Pi doesn't have an Nvidia chip on it.
[18:11.096 --> 18:17.080]   The heat sink alone would be.
[18:17.080 --> 18:24.000]   One thing I noticed, your book is excellent on embedded ML.
[18:24.000 --> 18:30.016]   I was in a different interview, which we should pull that clip of interview with Pete Skomorok.
[18:30.016 --> 18:34.092]   We both had your book on our desk.
[18:34.092 --> 18:35.092]   Yeah.
[18:35.092 --> 18:36.092]   Pete's awesome.
[18:36.092 --> 18:40.076]   He's been doing some amazing stuff too.
[18:40.076 --> 18:45.076]   He's another person who occasionally catches me when I'm procrastinating and I'm able to
[18:45.076 --> 18:49.008]   offer some advice and vice versa.
[18:49.008 --> 18:52.088]   Nice, maybe we should have a neighborhood.
[18:52.088 --> 18:56.096]   Yeah, hacking procrastination list.
[18:56.096 --> 19:02.012]   I guess it seems pretty obvious that you do some interesting projects in your house or
[19:02.012 --> 19:03.012]   for personal stuff.
[19:03.012 --> 19:11.096]   I was wondering if you could talk about any of your own personal ML hack projects.
[19:11.096 --> 19:19.048]   I'm obsessed with actually trying to get a magic wand working well.
[19:19.048 --> 19:28.024]   One of the things that I get to see is these applications that are being produced by industry
[19:28.024 --> 19:36.036]   professionals for things like Android, Android phones, smartphones in general.
[19:36.036 --> 19:44.092]   The gesture recognition using accelerometers just works really well on these phones.
[19:44.092 --> 19:51.080]   Because people are able to get it working really well in the commercial realm, but I
[19:51.080 --> 19:59.024]   haven't seen that many examples of it actually working well as open source.
[19:59.024 --> 20:07.068]   Even the example that we ship with TensorFlow, like Micro, is not good enough.
[20:07.068 --> 20:16.020]   It's a proof of concept, but it doesn't work nearly as well as I want.
[20:16.020 --> 20:21.040]   That's been one of my main projects I keep coming back to is, "Okay, how can I actually
[20:21.040 --> 20:26.048]   just do a Zorro sign or something holding?"
[20:26.048 --> 20:32.096]   I've got the little Arduino on my desk here.
[20:32.096 --> 20:35.096]   Do that and have it recognize.
[20:35.096 --> 20:44.028]   I want to be able to do that to the screen and have it change channels or something.
[20:44.028 --> 20:50.016]   What I really want to be able to do, we actually released some of this stuff as part of Google
[20:50.016 --> 20:56.064]   I/O, so I'll share a link maybe you can put in the description afterwards.
[20:56.064 --> 21:03.068]   My end goal, because these things actually have Bluetooth, I want it to be able to emulate
[21:03.068 --> 21:16.008]   a keyboard or a mouse or a gamepad controller and actually be able to customize it, like
[21:16.008 --> 21:21.088]   a MIDI keyboard even as well, and actually customize it so you can do some kind of gesture
[21:21.088 --> 21:27.096]   and then have it like you do a Z and it presses the Z key or something on your virtual keyboard
[21:27.096 --> 21:34.008]   and that does something interesting with whatever you've got it connected up to.
[21:34.008 --> 21:40.088]   That isn't quite working yet, but hopefully I get some tough enough problems in my main
[21:40.088 --> 21:45.056]   job that I'll procrastinate and spend some more time on that.
[21:45.056 --> 21:47.036]   Man, I hope for that too.
[21:47.036 --> 21:54.016]   I guess for people that maybe aren't experts in embedded computing systems, could you describe
[21:54.016 --> 21:58.044]   the difference between a Raspberry Pi and an Arduino and then the different challenges
[21:58.044 --> 22:02.040]   in getting ML to run on a Raspberry Pi versus an Arduino?
[22:02.040 --> 22:03.040]   Yeah.
[22:03.040 --> 22:09.072]   At the top level, the biggest difference is the amount of memory.
[22:09.072 --> 22:25.072]   This Arduino Nano BLE Sense 33, I think it has 256K of RAM and either 512K or something
[22:25.072 --> 22:32.004]   like that of flash, kind of like read-only memory.
[22:32.004 --> 22:37.084]   It's this really, really small environment that you actually have to run in and it means
[22:37.084 --> 22:44.024]   you don't have a lot of things that you would expect to have to an operating system like
[22:44.024 --> 22:52.080]   files or printf or you're really having to look at every single byte.
[22:52.080 --> 22:59.060]   The printf function itself, in a lot of implementations, it will actually take up about 20 kilobytes
[22:59.060 --> 23:06.060]   of code size just having printf because printf is essentially this big switch statement of,
[23:06.060 --> 23:08.060]   "Oh, have you got a printf?
[23:08.060 --> 23:13.040]   Oh, here's printf in a float value."
[23:13.040 --> 23:18.092]   There's hundreds of these modifiers and things you never even think of or printing things
[23:18.092 --> 23:20.040]   you never even imagine.
[23:20.040 --> 23:26.068]   All that code has to get printed in if you actually have printf in the system.
[23:26.068 --> 23:34.096]   All of these devices that we're aiming at, they often have only a couple of hundred kilobytes
[23:34.096 --> 23:39.012]   of space to write your programs in.
[23:39.012 --> 23:40.080]   You may be sensing a theme here.
[23:40.080 --> 23:48.024]   I love trying to fit, take modern stuff and fit it back into something that's like a Commodore
[23:48.024 --> 23:49.024]   64.
[23:49.024 --> 23:54.088]   Okay, it seems like Pete Worden doesn't always need a practical reason to do something, but
[23:54.088 --> 23:59.040]   what might be the practical reason for using Arduino versus a Raspberry Pi?
[23:59.040 --> 24:09.076]   Well, luckily, I've actually managed to justify my hobby and turn it into my full-time project
[24:09.076 --> 24:13.060]   because one great example of where we use these is...
[24:13.060 --> 24:16.052]   I actually don't see my phone here.
[24:16.052 --> 24:20.072]   I was going to hold onto the phone, but you know what a phone looks like.
[24:20.072 --> 24:23.044]   If you think about things like...
[24:23.044 --> 24:28.064]   And I won't say the full word because it will set off people's phones, but the OKG, wake
[24:28.064 --> 24:34.016]   word, or the wake words on Apple or Amazon.
[24:34.016 --> 24:40.092]   When you're using a voice interface, you want your phone to wake up when it hears you say
[24:40.092 --> 24:50.060]   that word, but what it turns out is you can't afford to even run the main ARM application
[24:50.060 --> 24:59.028]   processor 24/7 to listen out for that word because your battery would just be drained.
[24:59.028 --> 25:06.076]   These main CPUs use maybe somewhere around a watt of power when they're up and running
[25:06.076 --> 25:11.096]   when you're browsing the web or interacting with it.
[25:11.096 --> 25:20.012]   What they all do instead is actually have what's often called an always-on hub or chip
[25:20.012 --> 25:26.060]   or sensor hub or something like that where the main CPU is powered down, so it's not
[25:26.060 --> 25:36.060]   using any energy, but this much more limited but much more lower energy chip is actually
[25:36.060 --> 25:44.060]   running and listening to the microphone and running a very, very small, somewhere in the
[25:44.060 --> 25:54.040]   order of 30 kilobytes, ML model to say, "Hey, has somebody said that word or that wake word
[25:54.040 --> 25:59.068]   phrase that I'm supposed to be listening out for?"
[25:59.068 --> 26:04.024]   They have exactly the same challenges.
[26:04.024 --> 26:07.056]   You only have a few hundred kilobytes at most.
[26:07.056 --> 26:09.048]   You're running on a pretty low-end processor.
[26:09.048 --> 26:12.056]   You don't have an operating system.
[26:12.056 --> 26:22.004]   Every byte counts, so you have to squeeze the library as small as possible.
[26:22.004 --> 26:28.048]   That's one of the real-world applications where we're actually using this TensorFlow
[26:28.048 --> 26:32.032]   Lite Micro.
[26:32.032 --> 26:41.084]   More generally, the Raspberry Pi is, you're probably looking at $25, something like that.
[26:41.084 --> 26:48.088]   The equivalent which the Raspberry Pi Foundation just launched, I think last year or maybe
[26:48.088 --> 26:55.048]   at the start of this year, that's the equivalent of the Arduino is the Pico, and that's, I
[26:55.048 --> 27:00.020]   think, like $3 retail.
[27:00.020 --> 27:06.020]   The Raspberry Pi, again, uses one or two watts of power, so if you're going to run it for
[27:06.020 --> 27:12.096]   a day, you essentially need the phone battery that it will run down over the course of a
[27:12.096 --> 27:21.024]   day, whereas the Pico is only using 100 milliwatts, a tenth of a watt.
[27:21.024 --> 27:25.012]   You can run it for 10 times longer than the same battery, or you can run it on a much
[27:25.012 --> 27:30.026]   smaller battery.
[27:30.026 --> 27:35.016]   These embedded devices tend to be used where there's power constraints or there's cost
[27:35.016 --> 27:41.080]   constraints or even where there's form factor constraints because this thing is even smaller
[27:41.080 --> 27:45.064]   than a Raspberry Pi Zero.
[27:45.064 --> 27:52.056]   You can stick it anywhere and it will survive being run over and all of those sorts of things.
[27:52.056 --> 27:58.036]   Can you describe, let's take, for example, a speech recognition system.
[27:58.036 --> 28:01.066]   Can you describe the differences of how you would think about training and deploying if
[28:01.066 --> 28:10.044]   it was going to the cloud or a big desktop server versus a Raspberry Pi versus an Arduino?
[28:10.044 --> 28:18.052]   The theme, again, is size and how much space you actually have on these systems.
[28:18.052 --> 28:25.052]   You'll be thinking always about how can I make this model as small as possible.
[28:25.052 --> 28:31.016]   You're looking at making the model probably in the tens of kilobytes for doing...
[28:31.016 --> 28:38.088]   We have this example of doing speech recognition and I think it uses a 20 kilobyte model.
[28:38.088 --> 28:45.042]   You're going to be sacrificing accuracy and a whole bunch of other stuff in order to get
[28:45.042 --> 28:52.066]   something that will actually fit on this really low energy device, but hopefully it's still
[28:52.066 --> 28:56.090]   accurate enough that it's useful.
[28:56.090 --> 28:58.044]   How do you do that?
[28:58.044 --> 29:01.020]   How do you reduce the size without compromising accuracy?
[29:01.020 --> 29:03.064]   Can you describe some of the techniques?
[29:03.064 --> 29:04.064]   Yeah.
[29:04.064 --> 29:12.088]   I actually just blogged about one trick that I've seen used, but I realized I hadn't seen
[29:12.088 --> 29:20.024]   in the literature very much, which is where the classic going back to AlexNet approach
[29:20.024 --> 29:28.028]   after you do a convolution in an image recognition network, you often have a pooling stage.
[29:28.028 --> 29:34.028]   That pooling stage would either do average pooling or max pooling.
[29:34.028 --> 29:41.066]   What that's doing is it's taking the output of the convolution, which is often the same
[29:41.066 --> 29:45.052]   size as the input, but with a lot more channels.
[29:45.052 --> 29:52.036]   Then it's taking blocks of two by two values and it's saying, "Hey, I'm going to only take
[29:52.036 --> 29:56.008]   the maximum of that two by two block."
[29:56.008 --> 30:02.072]   Take four values and output one value or do the same, but do averaging.
[30:02.072 --> 30:13.012]   That helps with accuracy, but because you're outputting these very large outputs from the
[30:13.012 --> 30:17.068]   convolution, that means that you have to have a lot of RAM because you have to hold the
[30:17.068 --> 30:23.092]   input for the convolution and you also have to hold the output, which is the same size
[30:23.092 --> 30:27.084]   as the input, but it has more channels.
[30:27.084 --> 30:31.068]   The memory size is even larger.
[30:31.068 --> 30:37.064]   Instead of doing that, a common technique that I've seen in the industry is to use a
[30:37.064 --> 30:42.076]   stride to arm the convolution.
[30:42.076 --> 30:46.040]   Instead of having the sliding window just slide over one pixel every time as you're
[30:46.040 --> 30:53.048]   doing the convolutions, you actually have it jump two pixels horizontally and vertically.
[30:53.048 --> 31:06.004]   That has the effect of outputting the same result or the same size, the same number of
[31:06.004 --> 31:12.004]   elements you would get if you did a convolution plus a two by two pooling, but it means that
[31:12.004 --> 31:20.004]   you actually do less compute and you don't have to have nearly as much active memory
[31:20.004 --> 31:21.004]   kicking around.
[31:21.004 --> 31:23.060]   Interesting.
[31:23.060 --> 31:27.084]   I had thought of the size of the model.
[31:27.084 --> 31:32.008]   It was just the size of the model's parameters, but it sounds like you also need some active
[31:32.008 --> 31:35.084]   memory, but it's hard to imagine that even could be on the order of magnitude of the
[31:35.084 --> 31:36.084]   size of the model.
[31:36.084 --> 31:42.068]   Like the pixels of the image and then the intermediate results, I guess, can be bigger
[31:42.068 --> 31:43.068]   than the model.
[31:43.068 --> 31:44.068]   Yeah.
[31:44.068 --> 31:52.060]   That's the nice thing about convolution is you get to reuse the weights in a way that
[31:52.060 --> 31:57.072]   you really don't with fully connected layers.
[31:57.072 --> 32:05.068]   You can actually end up with convolution models, the activation memory, taking up a substantial
[32:05.068 --> 32:07.076]   amount of space.
[32:07.076 --> 32:11.028]   I guess I'm also getting into the weeds a bit here because the obvious answer to your
[32:11.028 --> 32:16.044]   question is also quantization, like taking these floating point models and just turning
[32:16.044 --> 32:24.020]   them into 8-bit because that immediately slashes all of your memory sizes by 75%.
[32:24.020 --> 32:28.088]   And what about, I mean, I've seen people get down to 4 bits or even 1 bit.
[32:28.088 --> 32:30.088]   Do you have thoughts on that?
[32:30.088 --> 32:31.088]   Yeah.
[32:31.088 --> 32:32.088]   Yeah.
[32:32.088 --> 32:35.084]   That's really, really interesting work.
[32:35.084 --> 32:41.088]   A colleague of mine actually, again, I'll send on a link to the paper, but looked at,
[32:41.088 --> 32:49.088]   I think it's something about the Pareto optimal bit depth for ResNet is like 4 bits or something
[32:49.088 --> 32:52.016]   like that.
[32:52.016 --> 32:58.096]   And there's been some really, really good research about going down to 4 bits or 2 bits
[32:58.096 --> 33:04.052]   or even going down to binary networks with 1 bit.
[33:04.052 --> 33:14.024]   And the biggest challenge from our side is that CPUs aren't generally optimized for anything
[33:14.024 --> 33:17.084]   other than like 8-bit arithmetic.
[33:17.084 --> 33:28.048]   So going down to these little bit depths requires some advances in the hardware that we're actually
[33:28.048 --> 33:29.048]   using.
[33:29.048 --> 33:32.028]   Do you have any thoughts about actually training on the edge?
[33:32.028 --> 33:36.008]   I feel like people have been talking about this for a long time, but I haven't seen examples
[33:36.008 --> 33:41.036]   where you actually do some of the training and then it passes that upstream.
[33:41.036 --> 33:51.076]   What I've seen is that especially on the embedded edge, it's very hard to get labeled data.
[33:51.076 --> 33:57.072]   And right now, there's been some great advances in unsupervised learning.
[33:57.072 --> 34:06.000]   But our workhorse approach to solving like image and audio and accelerometer recognition
[34:06.000 --> 34:15.052]   problems is still around actually taking big labeled data sets and just running them through
[34:15.052 --> 34:16.052]   training.
[34:16.052 --> 34:21.028]   And so if you don't have some kind of implicit labels on the data that you're gathering on
[34:21.028 --> 34:26.026]   the edge, which you almost never do, it's very hard to justify training.
[34:26.026 --> 34:36.084]   The one case where I actually have seen this look like it's pretty promising is for industrial
[34:36.084 --> 34:38.020]   monitoring.
[34:38.020 --> 34:42.076]   So when you've got like a piece of machinery and you basically want to know if it's about
[34:42.076 --> 34:52.024]   to shake itself to bits because it's got kind of a mechanical problem and you have an accelerometer
[34:52.024 --> 34:57.024]   or a microphone sensor kind of sitting on this device.
[34:57.024 --> 35:02.000]   And the hard part is telling whether it's actually about to shake itself to bits or
[35:02.000 --> 35:08.002]   whether that's just how it normally like kind of like vibrates.
[35:08.002 --> 35:16.022]   And so one promising approach for this kind of predictive maintenance is to actually spend
[35:16.022 --> 35:24.056]   the first 24 hours just assuming that everything is normal and kind of learning, okay, this
[35:24.056 --> 35:26.032]   is normality.
[35:26.032 --> 35:32.040]   And then only after that start to kind of like look for things that are outside of the
[35:32.040 --> 35:38.032]   -- you're implicitly labeling like the first 24 hours, okay, this is normal data.
[35:38.032 --> 35:42.094]   And then you're looking for anything that's kind of like an excursion out beyond that.
[35:42.094 --> 35:49.028]   So that sort of makes sense for some kind of a training approach.
[35:49.028 --> 35:58.056]   But even there I still actually put people to consider things like using embeddings and
[35:58.056 --> 36:06.028]   other approaches that don't require full back propagation to do the training.
[36:06.028 --> 36:12.076]   For example, if you have an audio model that has to recognize a particular person saying
[36:12.076 --> 36:20.098]   a word, try and have that model produce sort of an n dimensional vector that's embedding
[36:20.098 --> 36:24.074]   and then have the person say the word three times.
[36:24.074 --> 36:32.062]   And then just use k nearest neighbor sort of approaches to kind of tell if subsequent
[36:32.062 --> 36:37.072]   utterances are close in that embedding space.
[36:37.072 --> 36:43.096]   And then you've sort of done something that looks like learning from a user perspective,
[36:43.096 --> 36:48.096]   but you don't have to have all this machinery of variables and changing the neural network
[36:48.096 --> 36:53.052]   and you're just doing it as a post processing action.
[36:53.052 --> 36:59.044]   Do you see a lot of actual real world uses, like actual companies kind of shipping stuff
[36:59.044 --> 37:02.016]   like models into microcontrollers?
[37:02.016 --> 37:03.064]   Yeah.
[37:03.064 --> 37:15.044]   And again, this is hard to talk about because a lot of these aren't like sort of Android
[37:15.044 --> 37:21.040]   apps and things where people are fairly open and open source.
[37:21.040 --> 37:28.004]   A lot of these are pretty sort of well established, old school industrial companies and automotive
[37:28.004 --> 37:30.072]   companies and things like that.
[37:30.072 --> 37:38.048]   But we do see there's a bunch of apps that are already or a bunch of products out there
[37:38.048 --> 37:42.024]   that are already using ML under the hood.
[37:42.024 --> 37:49.008]   I mean, one of the examples I like to give is when I joined Google back in 2014, I met
[37:49.008 --> 37:56.000]   Raziel Alvarez, who's now actually at Facebook doing some very similar stuff, I believe,
[37:56.000 --> 38:01.098]   but he was responsible for a lot of the OKG work.
[38:01.098 --> 38:09.056]   And they had been shipping on billions of phones using ML and specifically using deep
[38:09.056 --> 38:12.088]   learning to do this kind of recognition.
[38:12.088 --> 38:19.076]   But I had no idea that they were shipping these like 30 kilobyte models to do ML and
[38:19.076 --> 38:22.072]   they had been for years.
[38:22.072 --> 38:26.012]   And from my understanding, from what I've seen of Apple and other companies, they've
[38:26.012 --> 38:31.076]   been using very similar approaches in the speech world for a long time.
[38:31.076 --> 38:37.092]   But a lot of these areas don't have the same kind of expectation that you will publish
[38:37.092 --> 38:45.000]   and publicize work that we tend to in the modern ML world.
[38:45.000 --> 38:47.044]   So it sort of flies below the radar.
[38:47.044 --> 38:55.036]   But yeah, there's ML models that will be running in your house, almost certainly right now,
[38:55.036 --> 38:57.052]   that are running on embedded hardware.
[38:57.052 --> 39:01.020]   And I guess besides the audio recognition, what might those ML models in my house be
[39:01.020 --> 39:02.020]   doing?
[39:02.020 --> 39:04.064]   Can you give me a little bit of a flavor for that?
[39:04.064 --> 39:13.080]   Yeah, so accelerometer recognition, like trying to tell if somebody is doing a gesture or
[39:13.080 --> 39:20.092]   if a piece of machinery is doing what you're expecting, like the washing machine or the
[39:20.092 --> 39:27.056]   dishwasher or things like that, trying to actually take in these signals from noisy
[39:27.056 --> 39:31.080]   sensors and actually try and tell what's actually happening.
[39:31.080 --> 39:37.068]   Can you ML model my washing machine?
[39:37.068 --> 39:44.040]   I would not be at all surprised.
[39:44.040 --> 39:46.098]   Wow.
[39:46.098 --> 39:53.020]   I guess another question that I had for you, thinking about your long tenure on TensorFlow,
[39:53.020 --> 39:59.088]   which is such a well-known library, is kind of like, how has that evolved over the time
[39:59.088 --> 40:00.088]   you've been there?
[40:00.088 --> 40:04.080]   Have things surprised you in the directions that it's taken?
[40:04.080 --> 40:10.040]   How do you even think about, with a project like that, what to prioritize into the future?
[40:10.040 --> 40:25.008]   I mean, honestly, how big TensorFlow got and how fast really blew me away.
[40:25.008 --> 40:27.036]   That was kind of amazing to see.
[40:27.036 --> 40:36.092]   I'm used to working on these weird technical problems that I find interesting and following
[40:36.092 --> 40:39.092]   my curiosity.
[40:39.092 --> 40:48.028]   I've been led to TensorFlow by pulling on a piece of yarn and ending up there.
[40:48.028 --> 40:59.008]   It was really nice to see not just TensorFlow, but PyTorch, MXNet, all of these other frameworks.
[40:59.008 --> 41:05.036]   There's been this explosion in the number of people interested.
[41:05.036 --> 41:10.004]   Especially there's been this explosion in the number of products that have been shipping.
[41:10.004 --> 41:17.080]   The number of use cases that people have found for these has been really mind-blowing.
[41:17.080 --> 41:25.068]   I'm used to doing open-source projects which get 10 stars or something, and I'm happy.
[41:25.068 --> 41:40.044]   Seeing TensorFlow and all these other frameworks just get this mass adoption has definitely
[41:40.044 --> 41:47.072]   surprised me and has been really nice to see.
[41:47.072 --> 41:49.076]   What about in terms of what it does?
[41:49.076 --> 41:51.004]   How has that evolved?
[41:51.004 --> 41:55.072]   What kinds of new functionality gets added to a library like that?
[41:55.072 --> 42:02.096]   Why do you think some of these breaking changes?
[42:02.096 --> 42:13.092]   I would just like to say I am sorry.
[42:13.092 --> 42:24.028]   It's such a really interesting problem because we are almost coming back to what we were
[42:24.028 --> 42:36.012]   talking about with Alex Krzyzewski, the classic example of the ML paradigm that we're in at
[42:36.012 --> 42:37.092]   the moment.
[42:37.092 --> 42:45.028]   You need a lot of flexibility to be able to experiment and create models and iterate on
[42:45.028 --> 42:46.028]   new approaches.
[42:46.028 --> 42:51.000]   But all of the approaches need to run really, really, really, really fast because you're
[42:51.000 --> 43:02.016]   running millions of iterations, millions of data points through each run just in order
[43:02.016 --> 43:04.092]   to try out one model.
[43:04.092 --> 43:11.068]   You've got this really challenging combination of you need all this flexibility, but you
[43:11.068 --> 43:17.068]   also need this cutting-edge performance and you're trying to squeeze out the absolute
[43:17.068 --> 43:27.050]   maximum amount of throughput you can out of the hardware that you have.
[43:27.050 --> 43:35.008]   You end up with this world where you have Python crawling into these chunks of these
[43:35.008 --> 43:40.048]   operators or these layers where the actual operators and layers themselves are highly,
[43:40.048 --> 43:47.096]   highly optimized, but you're expecting to be able to plug them into each other in very
[43:47.096 --> 43:57.012]   arbitrary ways and preserve that high performance.
[43:57.012 --> 44:05.044]   Similarly with TensorFlow, you're also expecting to be able to do it across multiple accelerated
[44:05.044 --> 44:20.026]   targets, things like the TPU, CPUs, and AMD, as well as NVIDIA GPUs.
[44:20.026 --> 44:26.058]   Honestly, it's just a really hard engineering problem.
[44:26.058 --> 44:36.028]   It's been a couple of years now since I've been on the mainline TensorFlow team, and
[44:36.028 --> 44:44.004]   it blew my mind how many dimensions and combinations and permutations of things they had to worry
[44:44.004 --> 44:54.028]   about in terms of getting this stuff just up and running and working well for people.
[44:54.028 --> 45:02.098]   It is tough as a user because you've got this space shuttle control panel full of complexity
[45:02.098 --> 45:11.022]   and you probably only want to use part of it, but everybody wants a different...
[45:11.022 --> 45:18.032]   This is a naive question, but when I look at the Kudanen library, it looks pretty close
[45:18.032 --> 45:21.074]   to the TensorFlow wrapper.
[45:21.074 --> 45:23.090]   Is that right?
[45:23.090 --> 45:27.092]   It seems like it tries to do the same building blocks that TensorFlow has.
[45:27.092 --> 45:34.016]   I would think with NVIDIA, it would be a lot of just passing information down into Kudanen.
[45:34.016 --> 45:35.016]   Yeah.
[45:35.016 --> 45:44.080]   Where I saw a lot of complexity was around things like the networking and the distribution
[45:44.080 --> 45:58.024]   and the very fast making sure that you didn't end up getting bottlenecked on data transfer
[45:58.024 --> 46:02.064]   as you're shuttling stuff around.
[46:02.064 --> 46:08.068]   We've had to go in and mess around with JPEG encoding and try different libraries to figure
[46:08.068 --> 46:12.032]   out which one would be faster because that starts to become the bottleneck at some point
[46:12.032 --> 46:18.096]   when you're throwing stuff onto GPU fast enough.
[46:18.096 --> 46:24.096]   I have to admit though, I've looked at that code in wonder.
[46:24.096 --> 46:31.060]   I have not tried to fix issues there.
[46:31.060 --> 46:32.060]   Amazing.
[46:32.060 --> 46:37.084]   I guess one more question on the topic, how do you test all these hardware environments?
[46:37.084 --> 46:42.028]   Do you have to set up the hardware somewhere to run all these things before you shut the
[46:42.028 --> 46:43.028]   code?
[46:43.028 --> 46:48.036]   Well, that's another pretty...
[46:48.036 --> 46:54.076]   The task of doing the continuous integration and the testing across all of these different
[46:54.076 --> 47:01.084]   pieces of hardware and all the different combinations of, "Oh, have you got two cards in your machine?
[47:01.084 --> 47:02.084]   Have you got four?
[47:02.084 --> 47:05.084]   Have you got this version of Linux?
[47:05.084 --> 47:09.012]   Are you running on Windows?
[47:09.012 --> 47:13.052]   Which versions of the drivers do you have?
[47:13.052 --> 47:17.092]   Which versions of the accelerators do you use?
[47:17.092 --> 47:18.092]   2DNN?"
[47:18.092 --> 47:23.004]   All of these...
[47:23.004 --> 47:31.056]   There are farms full of these machines where we're trying to test all of these different
[47:31.056 --> 47:37.076]   combinations and permutations, or as many as we can, to try and actually make sure that
[47:37.076 --> 47:41.004]   stuff works.
[47:41.004 --> 47:44.056]   As you can imagine, it's not a straightforward task.
[47:44.056 --> 47:45.096]   All right.
[47:45.096 --> 47:50.016]   Well, we're getting close to time and we always end with two questions that I want to save
[47:50.016 --> 47:51.096]   time for.
[47:51.096 --> 47:56.064]   One question is, what is an underrated topic in machine learning that you would like to
[47:56.064 --> 47:59.052]   investigate if you had some extra time?
[47:59.052 --> 48:02.096]   Oh, well, datasets.
[48:02.096 --> 48:07.080]   The common theme that I've seen throughout all the time I've worked with, I've ended
[48:07.080 --> 48:14.092]   up working with hundreds of teams who are creating products using machine learning.
[48:14.092 --> 48:22.000]   Almost always what they find is that investing time in improving their datasets is a much
[48:22.000 --> 48:28.056]   better return on investment than trying to tweak their architectures or hyperparameters
[48:28.056 --> 48:31.028]   or things like that.
[48:31.028 --> 48:41.008]   There are very few tools out there for actually doing useful things with datasets and improving
[48:41.008 --> 48:52.088]   datasets and understanding datasets and gathering dataset data points and cleaning up labels.
[48:52.088 --> 48:57.048]   I really think, and I'm starting to see, I think Andrew Ng and some other people have
[48:57.048 --> 49:05.008]   been talking about data-centric approaches and I'm starting to see more focus on that.
[49:05.008 --> 49:10.088]   But I think that that's going to just continue and it's going to be...
[49:10.088 --> 49:16.004]   I feel like as the ML world is maturing and more people are going through that experience
[49:16.004 --> 49:21.056]   of trying to put a product out and realizing, "Oh my God, we need better data tools," there's
[49:21.056 --> 49:27.028]   going to be way more demand and way more focus on that.
[49:27.028 --> 49:31.008]   That is an extremely interesting area for me.
[49:31.008 --> 49:35.092]   Well, you may have answered my last question, but I think you're well qualified to answer
[49:35.092 --> 49:40.072]   it having done a bunch of ML startups and then working on TensorFlow.
[49:40.072 --> 49:45.072]   When you think about deploying an ML model in the real world and getting it to work for
[49:45.072 --> 49:50.064]   a useful purpose, what do you see as the major bottlenecks?
[49:50.064 --> 49:56.028]   I guess datasets is one, I agree, is maybe the biggest one, but do you see others?
[49:56.028 --> 49:57.028]   Yeah.
[49:57.028 --> 50:05.064]   So another big problem is there's this kind of artificial distinction between the people
[50:05.064 --> 50:11.088]   who create models who often come from a research background and the people who have to deploy
[50:11.088 --> 50:13.092]   them.
[50:13.092 --> 50:22.056]   What will often happen is that the model creation people will get as far as getting an eval
[50:22.056 --> 50:30.012]   that shows that their model is reaching a certain level of accuracy in their Python
[50:30.012 --> 50:34.096]   environment and they'll say, "Okay, I'm done.
[50:34.096 --> 50:40.016]   Here's the checkpoints for this model, which is great," and then just hand that over to
[50:40.016 --> 50:46.072]   the people who are going to deploy it on an Android application.
[50:46.072 --> 50:53.072]   The problem there is that there's all sorts of things like the actual data in the application
[50:53.072 --> 50:58.084]   itself may be quite different to the training data.
[50:58.084 --> 51:02.092]   You're almost certainly going to have to do some stuff to it like quantization or some
[51:02.092 --> 51:08.024]   kind of thing that involves retraining in order to have something that's optimal for
[51:08.024 --> 51:12.020]   the device that you're actually shipping on.
[51:12.020 --> 51:17.004]   And there's just a lot of really useful feedback that you can get from trying this out in a
[51:17.004 --> 51:21.060]   real device that someone can hold in their hand and use that you just don't get from
[51:21.060 --> 51:25.088]   the eval use case.
[51:25.088 --> 51:34.060]   So coming back to Pete Scomarock, I first met him when he was part of the whole DJ Patil
[51:34.060 --> 51:43.060]   and the LinkedIn crew doing some of the really early data science stuff.
[51:43.060 --> 51:53.024]   They had this idea, and I think it was DJ who came up with the naming of data science
[51:53.024 --> 52:00.032]   and data scientists as somebody who would own the full stack of taking everything from
[52:00.032 --> 52:07.088]   doing the data analysis to coming up with models and things on it to actually deploying
[52:07.088 --> 52:16.012]   those on the website and then taking ownership of that whole end-to-end process.
[52:16.012 --> 52:21.082]   And the teams I've seen be really successful at deploying ML products, they've had people
[52:21.082 --> 52:28.012]   who formally or informally have taken on that responsibility for the whole thing and have
[52:28.012 --> 52:32.032]   the people who are writing the inner loops of the assembly sitting next to the people
[52:32.032 --> 52:36.052]   who are creating the models.
[52:36.052 --> 52:42.056]   And the team who created MobileNet, MobileVision with Andrew Howard and Omar Jacob, they're
[52:42.056 --> 52:43.096]   a great example of that.
[52:43.096 --> 52:48.088]   They all work very, very closely together doing everything from coming up with new model
[52:48.088 --> 52:55.068]   techniques to figuring out how they're actually going to run on real hardware at the really
[52:55.068 --> 52:57.096]   low level.
[52:57.096 --> 53:02.096]   So that's really one of the biggest things that I'm really hoping to see change in the
[53:02.096 --> 53:05.096]   next few years is more people kind of adopt that model.
[53:05.096 --> 53:06.096]   Well said.
[53:06.096 --> 53:07.096]   Thanks so much, Pete.
[53:07.096 --> 53:08.096]   That was super fun.
[53:08.096 --> 53:12.022]   No, thanks, Lucas.
[53:12.022 --> 53:16.060]   If you're enjoying these interviews and you want to learn more, please click on the link
[53:16.060 --> 53:21.032]   to the show notes in the description where you can find links to all the papers that
[53:21.032 --> 53:25.076]   are mentioned, supplemental material, and a transcription that we work really hard to
[53:25.076 --> 53:26.076]   produce.
[53:26.076 --> 53:27.076]   So check it out.



Tiny models
224



Run set
26



Run set
374



Run set
374


Audio
Tiny
Base
Large
1
2
3

media /
1.1MB1 subfolder, 0 files
snippet_examples.table.json
7.3KB
File<{extension: txt}>
File<(table)>
Dir