Luis Ceze — Accelerating Machine Learning Systems
From Apache TVM to OctoML, Luis gives direct insight into the world of ML hardware optimization, and where systems optimization is heading.
Luis Ceze is co-founder and CEO of OctoML, co-author of the Apache TVM Project, and Professor of Computer Science and Engineering at the University of Washington. His research focuses on the intersection of computer architecture, programming languages, machine learning, and molecular biology.
0:00 Intro and sneak peek
0:59 What is TVM?
8:57 Freedom of choice in software and hardware stacks
15:53 How new libraries can improve system performance
20:10 Trade-offs between efficiency and complexity
24:35 Specialized instructions
26:34 The future of hardware design and research
30:03 Where does architecture and research go from here?
30:56 The environmental impact of efficiency
32:49 Optimizing and trade-offs
37:54 What is OctoML and the Octomizer?
42:31 Automating systems design with and for ML
44:18 ML and molecular biology
46:09 The challenges of deployment and post-deployment
Note: Transcriptions are provided by a third-party service, and may contain some inaccuracies. Please submit any corrections to email@example.com. Thank you!
I've never seen computer systems architecture and systems optimization being as interesting as it is right now. For a while, research in this area was just about making microprocessors a little faster and compilers a little better. But now that we have to specialize, and there's this really exciting application space with machine learning that offers so many opportunities for optimization, and you have things like FPGAs, and it's getting easier to design chips, that creates all sorts of opportunities for academic research and also for industry innovation.
You're listening to Gradient Dissent, a show about machine learning in the real world. And I'm your host, Lukas Biewald.
Luis Ceze is co-founder and CEO of OctoML¹, founder of the Apache TVM Project², and a Professor of Computer Science at the University of Washington. He's an expert in making machine learning run efficiently on a variety of hardware systems, something that I'm super fascinated by and don't know a lot about. So I could not be more excited to talk to him today.
Why don't we just kind of jump right in, I guess. You're the CEO of OctoML, right? And that's based on the Apache TVM Project that I think you also authored. Can you just kind of, for people who don't know, kind of give a description of what that is?
Yeah, sure. And maybe a quick intro.
I wear two hats: I'm CEO of OctoML, and also a Professor of Computer Science and Engineering at the University of Washington, where I work on machine learning systems. So what does that mean? It means building computer systems that make machine learning applications run fast and efficiently, and do what they're supposed to do in the easiest way possible.
And often we use machine learning in making machine learning systems better, which is something that we should touch on at some point, it's an interesting topic. Apache TVM...TVM stands for Tensor Virtual Machine. It started in our research group at University of Washington, about five years or so ago. And the context there was the following.
Five years ago, which in machine learning time is just like eons ago, there was already a growing set of machine learning models that people care about, a set growing faster and faster. The fragmentation in the software ecosystem was just starting: TensorFlow, PyTorch, MXNet, Keras, and so on. And the hardware targets at that time were mainly CPUs, the beginning of GPUs, and a few accelerators.
But our observation then was that, while we have a growing set of models, growing set of hardware targets, and then this fragmentation, it's either you have a software stack that is specific to the hardware that you want to deploy your model to, or they're specific to use cases like computer vision, or NLP, and so on. We wanted to create a clean abstraction that would free data scientists, or machine learning engineers, from having to worry about how to get their models deployed.
We wanted to have them focus on the statistical properties of the model, and then target a clean, single pane of glass: a clean abstraction across all of the systems and hardware such that you can deploy your model and make the most of the hardware target.
As you all know here, since there are a lot of machine learning practitioners that listen to this, machine learning code is extremely sensitive to performance. It uses a lot of memory and a lot of memory bandwidth, which means you heavily exercise the ability to move data from memory to your compute engine and back, and it also uses a lot of raw compute power.
That's why, you know...hardware that is good for machine learning today looks more and more like the supercomputers of not too long ago: vector processing, matrix and tensor cores, a lot of linear algebra. Making the most out of that is really, really hard. I mean, code optimization is already hard. Now, if you're optimizing code for something as performance-sensitive as machine learning, you're talking about a really hard job. So anyways, I'm getting there, I know it's a long story, but hopefully it'll be worth it.
So with TVM, what started as a research question was: can we automate the process of tuning your machine learning model, and the actual code, to the hardware target that you want to deploy to? Instead of relying on hand-tuned libraries or a lot of artisan coding to get your model to run fast enough, we wanted to use machine learning to automate that process.
And the way that works is TVM runs a bunch of little experiments to build, really, a profile or personality of how your hardware behaves, and uses that to guide a search over a very large optimization space to tune your model and your code. So the end result from a user's point of view is that you feed your model into TVM, you choose a hardware target, and then TVM finds just the right way of tuning your model and compiling it to a very efficient binary on that hardware target.
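The measure-and-learn loop described here can be sketched in a few lines of toy Python. This is purely illustrative and not TVM's actual API: the `measure()` function below is made up, standing in for running a small experiment on real hardware, and a real tuner would fit a learned cost model rather than sample at random.

```python
import random

# Toy illustration of auto-tuning: each "schedule" is a candidate
# configuration, and measure() stands in for timing it on hardware.

def measure(schedule):
    # Fake cost function: pretend the hardware strongly prefers
    # tile size 32 and unroll factor 4. Lower is better.
    tile, unroll = schedule
    return abs(tile - 32) + abs(unroll - 4) * 2

def tune(search_space, budget=20, seed=0):
    # Run a budget of "little experiments" and keep the best result.
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    for _ in range(budget):
        candidate = rng.choice(search_space)
        cost = measure(candidate)
        if cost < best_cost:
            best, best_cost = candidate, cost
    return best, best_cost

space = [(t, u) for t in (8, 16, 32, 64) for u in (1, 2, 4, 8)]
best, cost = tune(space)
print(best, cost)
```

A real system replaces the random sampling with a predictive cost model trained on the measurements, so most candidates never have to run on hardware at all.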
And I guess when I think of like-
Does that answer your question, what TVM is? I know it's long, but I hope it was useful.
Yeah, no, it's great. I want to ask some more clarifying questions, I guess.
I'm not a hardware expert at all, and I guess what I've observed trying to make ML models run on various hardware types, is that it seems like it's harder and harder to abstract away the hardware. It seems like people are really like kind of building models with specific hardware in mind, sometimes specific memory sizes, and things like that. And I guess my first question-
And that's what we want to change. We want to remove that worry from the model builders. We want them to focus on building the best statistical properties possible, and then everything else should be left for engines like TVM and the Octomizer, I can tell you more about later.
And so this TVM though, is it actually like a virtual machine? Is it doing kind of real-time compiling to the hardware as the model runs?
That's part of it, yeah. TVM, by and large, does what we call just-in-time compilation. The reason just-in-time compilation is important is because, well, you learn more about the model as you run it, as you evaluate it. And then second, you can do measurements of performance and make decisions about how you're going to tune the rest of your compilation.
So, it is a virtual machine in the sense that it offers a clean abstraction. It's not a virtual machine in the VMware sense; it's more like a virtual machine in the Java virtual machine sense. Which could be a whole different conversation; it's even closer to my world as a computer systems architect, thinking about those kinds of abstractions. But TVM is a virtual machine in the sense that it exposes a well-defined interface for you to express what your model does, and lowers that down to the hardware target.
Got it. And is this typically for deployment or could it also apply for training time?
So TVM, so far, by and large has been used for inference. So you have a model that's been trained. You've often done quantization by then, and so on. And then you run it through TVM because...
We see that as a strength: you apply all the optimizations that could change the statistical properties of your model, and you validate your model that way. Then whatever we do from there on should be seen as a process that preserves exactly what your model does. We don't want to change anything, because we see our work as complementary to all of the optimizations that model builders apply before then.
So then once again, this is really like a compiler. It's a compiler plus code generator plus a runtime system, and we specialize everything to your model and the hardware target. We really produce a custom package that is ready to be deployed, with custom everything: custom operators for your model, a custom runtime system for your model, all wrapped up into a package that you can just go and deploy.
Got it. And are you picturing, typically, is this kind of like edge and like kind of low-power compute environments? Or is this more for like servers?
Yeah. Great question.
So, remember that I was telling you about automating the process and using machine learning to discover what the hardware can do and can't do well and use that to guide your optimization? That frees us from having to make that choice because essentially as long as there's no magic involved...obviously if you have a giant GPT-3-like model you want to run on a one-milliwatt-power microcontroller, this is just simply not going to work, that's obvious.
But the basic flow, having what we call cost models for the hardware target and using those predictive models to guide how to optimize the model for that specific target, is essentially the same from teeny microcontrollers all the way to giant, beefy GPUs or accelerators, or FPGA-based stuff that we support as well. That means that TVM doesn't have a preference for either.
So we've had use cases both in the open source community, in the research space as well, that we support, and we still do it ourselves. All the way to our current customers at OctoML, we have customers for both edge deployment and cloud deployment, because the basic technology is effectively the same.
Some of the actual deployment aspects and the plumbing changes a bit. If you're going to deploy it on a tiny device, you might not even have an operating system, for example. So we support some of that. That's different than a server deployment, but the core aspect of how to make your model run fast on hardware targets is essentially the same.
I guess for kind of server-level deployments, I feel like with the exception of TPUs and a few companies, it seems like almost everyone deploys on to, like, NVIDIA stuff. Is this sort of like outside of CUDA and cuDNN, or does it translate into something that can then be compiled by CUDA? How does that work?
Yeah, this is an excellent question.
So first let's think about just a world with NVIDIA, and then let's free ourselves from that tyranny, which actually is part of the goal here too. No, I love NVIDIA, I have many friends there, I admire what they do, but people should have a choice. NVIDIA makes great hardware, but there's also a lot of really good non-NVIDIA hardware.
Let's start with NVIDIA. Let's imagine a world where all you care about is deploying on NVIDIA. At the very lowest level of their compilation stack, NVIDIA does not expose what we call the instruction set. That's actually kept secret. You have to program using CUDA; that's the lowest level.
And there's cuDNN on top, and also parallel to that you have TensorRT, for example, which is more of a compiler that compiles a model to the hardware target. TVM can sit parallel to those, but at the same time use them. So here's what I mean.
Both cuDNN and TensorRT are generally guided, tuned, and improved based on models that people care about; a fair amount of hand tuning moves with where the models are going. Whereas TVM, again, generates fresh code for every fresh model. So that means that in some cases we do better than TensorRT and cuDNN, just because we can specialize enough, in a fully automatic way, to the specific NVIDIA GPU that you have.
And then we generate raw CUDA code that you just compile out. So essentially you run your model through TVM, which generates a ton of CUDA code, and then you compile that into a deployable binary for that specific NVIDIA GPU. But in the process of doing that... I mean, we do not take the dogmatic view that you should only use TVM. In some cases, of course, NVIDIA's libraries or NVIDIA's compilers like TensorRT can do better. And we want to be able to use those too.
So TVM does what we call "best of all worlds". In the process of exploring how to compile your model, for parts of your model, say a set of operators, it evaluates TVM's version against cuDNN's and TensorRT's and decides, "Oh, for this operator it's better to use cuDNN", and just puts it in.
Then we link the whole thing together, such that what we produce for you could be a franken-binary: bits and pieces of cuDNN, maybe TensorRT, or TVM-generated code, wrapped into a package that is specialized to your model, including the choice of whether you should or should not use NVIDIA's own software stack.
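The per-operator selection described here can be pictured with a toy sketch. The backend names and timings below are made up for illustration; the point is just that the winner is chosen operator by operator, and the winners are then linked into one package.

```python
# Hypothetical "best of all worlds" selection: for each operator, compare
# the measured time of every available implementation and keep the fastest.
# All numbers here are invented for illustration (milliseconds).

measured_ms = {
    "conv2d":  {"tvm": 1.9, "cudnn": 1.4, "tensorrt": 1.6},
    "matmul":  {"tvm": 0.7, "cudnn": 0.9, "tensorrt": 0.8},
    "softmax": {"tvm": 0.2, "cudnn": 0.3},
}

def pick_backends(measurements):
    # One winner per operator; the final "franken-binary" links them all.
    return {op: min(times, key=times.get) for op, times in measurements.items()}

plan = pick_backends(measured_ms)
print(plan)  # {'conv2d': 'cudnn', 'matmul': 'tvm', 'softmax': 'tvm'}
```

The same shape of decision extends to whole subgraphs, not just single operators, but the idea is identical: measure, pick, link.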
Okay. Did I answer your question on NVIDIA? So this how-
And by the way, this is just TVM. We should talk about the Octomizer later. With the Octomizer, we want to abstract all of that away even further: you upload your model and then you can choose. You have a checkbox, all sorts of hardware.
There's Intel CPUs, AMD CPUs, NVIDIA GPUs, soon AMD GPUs, and then Raspberry Pis, and in some cases you might choose to use the native stack. You don't even have to think about that. That's really what we want to offer: you do not have to worry about it.
Apache TVM, and let's just focus on the open source now, has gotten quite a bit of traction with both end users and hardware vendors. End users like Microsoft, Amazon, Facebook and so on have used it; some of them use it heavily today. But now hardware vendors are getting more and more into TVM. ARM built their CPU, GPU, and NPU compiler and software stack on top of TVM. We're working with AMD to build one for AMD GPUs as well. Qualcomm has built their software stack with TVM, and we are working with them to further broaden the reach of the hardware that is supported by that.
The reason I'm telling you this is that as we enable hardware like AMD GPUs to be used very effectively via TVM, I think we will start offering users meaningful choice here. They should go with the hardware that better serves them without having to necessarily choose that based on the software stack.
Can I ask a couple of specific questions?
Does that make sense or nah?
No, that makes total sense. So we do a lot of work with Qualcomm and they talk a lot about ONNX, which I think...my understanding is that's sort of a translation layer between models and places, like hardware that they could deploy on. How does that connect with TVM?
Yeah. So there's no visualization I can show you, but think of it as a stack. At the lowest level, you have hardware, then you have the compiler and operating system, then you have your code generator. That's where the libraries are, too; that's where TVM sits. And then on top of that, you have your model framework, like TensorFlow, PyTorch, Keras, MXNet, and so on.
ONNX as a spec is wonderful. Essentially it's a common language for you to describe models. TVM takes as input models specified in ONNX, but it also takes native TensorFlow, native PyTorch, native Keras, MXNet, and so on. And if you go to the Octomizer service today, you can upload an ONNX model, and in the guts of the Octomizer, it calls TVM to import the model and do its magic.
Think of ONNX as a language to describe models.
Do you think that...I feel like one of the reasons that I've heard that NVIDIA has been so hard to displace as sort of the main way people deploy most of their stuff is because the cuDNN library is so effective.
Do you sort of imagine that as TVM gets more powerful, it opens things up to other hardware companies?
That's right. Yeah. I think NVIDIA has been brilliant in offering... I mean, they have a really, really good software stack, and of course they have good hardware too. But the fact that they have a usable, broad, and, I would say, arguably one of the best low-level machine learning systems software stacks gives them a huge advantage.
Some other hardware could be just as good in terms of raw processing power, memory, the kind of architecture, and so on. But if it doesn't have a good software stack, it's simply not competitive. And we definitely see TVM as offering that choice too. Again, I don't want to sound like we are going to compete with NVIDIA. That's not the point. I'm just thinking...
So just think about this. Forget machine learning. Just think about operating systems. So you have Linux. Linux runs in pretty much all the hardware that you care about. You might still choose to run Windows, but at least in the same hardware, you can choose to run Windows or Linux.
Think of TVM as offering a choice of what kind of operating system you'd run on your hardware, except that you don't have to choose a proprietary one. In the machine learning world, with NVIDIA there's essentially no choice there unless you're going to go and write CUDA code directly.
So I guess one of the things, and this is probably the part of the show where I ask the dumb questions that my team is going to make fun of me for, but kind of in the back of my head, I feel like I always have this mystery where like a new version of cuDNN comes out, and the models get way faster with just a better library.
I think about what a model does, like a convolution or a matrix multiplication. It seems so simple to me. I feel like I come from a math background, so I'm just like: how could there be, many years into making a library, a 20% speedup on a matrix multiplication? What's going on?
That's a brilliant question. Yeah. Great question, Lukas. All right, we should take a whiteboard out and I'll show it to you, because then it gets even closer to my world.
Let's think about computer architecture for a second. Let's say that you are an execution engine, like a processor or a core in a GPU. So you have to grab, let's start with one reason, you have to grab data from somewhere in memory.
It turns out that computer memory is organized in ways such that, depending on where the data is in memory, which physical address it actually has, you get much better performance than others, by a huge margin. Because depending on how you lay out the data, you can make the most use of the wires between your memory and your processor, between your cache and your actual execution engine in the silicon itself.
But figuring out where that goes becomes a combinatorial problem, because not only do you have to choose where the data structures go, but when you have a bunch of nested loops that implement your convolution, you also have to choose, if you have a four-deep nested loop, in which order to execute them.
Many orders are valid. Which order should you execute them in? And then within those, what size of blocks are you going to traverse the data in? All of that is highly dependent on the parameters of your convolution. I'm just picking convolution as an example; the same goes for general matrix multiplication.
Long story short, for any given operator, there's literally potentially billions of ways in which you can compile the same bit-by-bit equivalent program in terms of outputs. But one of them is going to be potentially a thousand times faster than the slowest one. So picking the right one is hard. Often, this is done today by human intuition and some amount of automatic tuning called auto tuning.
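A quick back-of-envelope calculation shows how these choices compound. The numbers below are arbitrary, but even a single four-deep loop nest with a handful of tile sizes per loop yields tens of thousands of schedules; multiplied across operators, data layouts, vectorization, and unrolling choices, the space easily reaches billions.

```python
from itertools import permutations

# Why schedule choice is combinatorial: a 4-deep loop nest can run in any
# valid order, and each loop can be tiled with several block sizes.
# The specific counts here are illustrative, not from any real compiler.

loops = ["n", "c", "h", "w"]          # 4-deep nest of a convolution
orders = list(permutations(loops))     # 4! = 24 loop orders
tile_choices = 8                       # candidate tile sizes per loop

variants = len(orders) * tile_choices ** len(loops)
print(variants)  # 24 * 8**4 = 98304 schedules for ONE operator
```

Every one of those variants computes bit-identical outputs; they differ only in how fast the hardware can execute them.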
What's happening in cuDNN as your model gets faster is that NVIDIA can afford a large number of programmers, a lot of really talented software engineers, and they observe where the models are going. There are some models that matter to them. They go look at a model, see the parameters of all of the operators, how they're stitched together. Then they start tuning the libraries to make sure they do better data layouts, better loop ordering, better tiling of how the data structures work, choosing the direction in which they traverse data structures, and so on.
And that's just one operator. But in models, operators talk to other operators. So that's why there's something called operator fusion. If you fuse two operators, for example a matrix multiplication and a convolution, into a single operator, you can generate code in a way that keeps data as close to your processing engine as possible. You make much better use of your memory hierarchy, and that's yet another significant performance bump.
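Here's a minimal sketch of the fusion idea, in plain Python so it stays dependency-free. Computing relu(A @ B) in one pass produces the same answer as running the two operators separately, but the pre-activation intermediate matrix never has to be written out and read back.

```python
# Toy sketch of operator fusion: fuse a matmul with an elementwise ReLU.
# Matrices are plain lists of lists to keep the example self-contained.

def matmul(A, B):
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][t] * B[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]

def relu(M):
    return [[max(0, x) for x in row] for row in M]

def fused_matmul_relu(A, B):
    # Fused version: apply the activation as each output element is
    # produced, so the pre-activation matrix never exists as a whole.
    n, k, m = len(A), len(B), len(B[0])
    return [[max(0, sum(A[i][t] * B[t][j] for t in range(k)))
             for j in range(m)] for i in range(n)]

A = [[1, -2], [3, 4]]
B = [[5, 6], [7, -8]]
assert fused_matmul_relu(A, B) == relu(matmul(A, B))
```

In real generated code the win comes from keeping each output element in a register through the activation instead of round-tripping a whole intermediate tensor through memory.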
Am I giving you a general sense-
Totally, that was really helpful, yeah.
So I guess you can't actually decompose the problem down into... I was sort of picturing that each step in the compute graph, you could optimize it separately, but actually you have to-
No, you have to put them together.
In fact, TVM has produced three PhD theses³ ⁴ ⁵. At the very least, those are the ones that I've been involved in on the core of TVM. If you read the first paper⁶, which has been around for several years now, one of the key messages at the highest level was the following: a lot of the power comes from doing high-level graph optimization together with code optimization.
So essentially, say you choose to fuse two operators in the graph; now we need to generate really good code for that fused operator. So you go and use our automatic, highly specialized code generator, which uses machine learning to search for good code for this new operator that fused the two, with different parameters. By combining high-level graph optimizations with low-level code generation specialized to them, you get significant multiplicative optimization opportunities.
Does that give you...
No, that's really helpful, yeah.
Do the new TPU architectures kind of change anything about this optimization or does it change what you're doing at all?
Well, it's a different hardware architecture, so you need to go and tune for it as well.
But remember that TPUs are also made of a bunch of transistors: functional units and floating-point units and vector units, and they have wires. They have memories organized in a certain way that you want to make the most of. In a sense, that's what a lot of these specialized architectures do. In fact, TVM also has an open source TPU-like accelerator, fully open source hardware; you can stamp it out in an FPGA, and some folks have stamped it out in actual custom silicon. It gives you sort of a template for how to think about these accelerators.
They also have parameters: different sizes of memories and buffers, which data types you support, how many functional units you need to get the right throughput. It's all a balance of how you organize your memory, how much of your silicon you devote to compute versus storage, and how many wires and what kind of interconnection network you have to move data around.
The reason I'm telling you this is that many times the trade-off here is the following. You might make the hardware more complicated, harder to program, but immensely more efficient. But that means that now we need to rely even more on a compiler to make really good code generation and specialize how you're going to compile your code to that specific hardware target.
And that's a fair trade-off. Compilation you do once; it might be complicated, but work that the hardware would otherwise have to do every time data is flowing is much better done ahead of time.
I'm digging deep into my computer science education, but I feel like the story with the non-deep learning chips, hasn't it been sort of simpler, kind of like smaller instruction sets, and trying to simplify things?
It seems sort of the opposite direction of adding complexity to the hardware and then relying on the compiler to deal with it.
Yeah. Yeah. It's a great question.
There's so much there, and I think it could be a whole other conversation too. When the RISC versus CISC debate comes up in the computer architecture class that I teach (at grad level, I actually have the students hold debates), the key aspect is that by going to a simpler instruction set, you had simpler hardware, so you could clock it faster. You'd have lots of little instructions, but you execute a lot of them in a given period of time, so you can make things run faster.
It turns out that even complex instruction set computers today, like x86 from Intel, automatically break instructions down into tiny micro-operations, so it still looks like a RISC computer inside. But fast forward to today, and what's going on is that there's been a huge change in where performance gains come from in computer architecture.
As we get closer and closer to the limits of scaling of transistor technology, what happens is the following. You have a certain number of transistors, and they used to get ever smaller and more power efficient. Then there was a change: transistors keep getting smaller, but not necessarily much more power efficient, which means that you can pack more transistors on the chip, but you cannot turn all of them on at the same time.
You might be wondering, "Why am I telling you this?" Because that's the whole justification for going more and more specialized, and having a bigger chip with lots of different, more specialized functional units. They're not general, but they're much more efficient, because every time you add generality to the hardware, fundamentally, you're adding more switches.
Take a general-purpose CPU that can do anything. A large fraction, more than half of the transistors there, are just sitting there asking questions: "Am I doing this or that? If I'm doing this, I do this." And then you have to make decisions about the data that's flowing through, because it's supposed to be general.
So the trend that we're seeing now is that, well, we need to make this thing much more efficient, otherwise we can't afford the power to run a global infrastructure, or the power to run machine learning. You have to squeeze efficiency from somewhere, and the way you squeeze efficiency is to replace all these transistors just sitting there wondering what they should do with transistors that do only one thing, and do that one thing very, very well.
Sure, it makes the chip harder to program, because now you have to figure out when and how to use these specialized functional units, but it's immensely more efficient in terms of performance per watt and immensely faster than a general-purpose computer.
Did that answer your question or did I make it more complicated? Did I confuse you, or did I...
No, this is incredible. I feel like I'm finally getting clear answers to questions that have been in my head for a long time, so I'm actually really enjoying this.
What should I be imagining as like a specialized instruction? I hear on the M1 laptop, there's like a specialized thing to play videos... what does a specialized instruction look like? Is it like there's a convolution structure, so it could pass through?
Yeah. For example, it's an eight-by-eight matrix multiplier, single instruction.
Yeah. You can evoke that. You set up, you put all the data in the right place and you say eight-by-eight matrix multiply. Boom, it happens.
In one tick?
Not exactly in one tick.
It's one instruction, which means that you're giving one command. It could be broken down into multiple cycles depending on how it's scheduled. But from the programmer's point of view, there is hardware there, essentially in the arrangement of your transistors, that implements your functional units, and your memory is organized in such a way... there's something called a systolic array, I don't know if you've heard that term before.
A systolic array is an array of multiply-and-accumulate units. So think of it that way. You can flow data through it in a specific way such that, if you arrange everything just right, in one flow you've done an eight-by-eight matrix multiply. But to do that, you have to arrange all the data in the right place and then click go. Not click, issue the instruction "Go".
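The data flow described here can be mimicked with a small simulation. This is a toy model of an output-stationary systolic array, not any particular chip: each processing element owns one output element and accumulates as time-skewed operands stream past it.

```python
# Toy simulation of an output-stationary systolic array computing C = A @ B.
# The PE at grid position (i, j) accumulates one output C[i][j]; A values
# stream in from the left, B values from the top, skewed in time so that
# matching operands meet at the right PE.

def systolic_matmul(A, B):
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    # At cycle t, PE (i, j) multiplies A[i][t - i - j] with B[t - i - j][j];
    # the skew is what the physical wiring implements with registers.
    for t in range(n + m + k):
        for i in range(n):
            for j in range(m):
                s = t - i - j
                if 0 <= s < k:
                    C[i][j] += A[i][s] * B[s][j]  # one multiply-accumulate
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(systolic_matmul(A, B))  # [[19, 22], [43, 50]]
```

In silicon, of course, all PEs fire in parallel each cycle; the nested Python loops are just simulating that parallelism sequentially.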
But now, to answer your question about video compression, a video codec. We call it an instruction, but more likely it's essentially a piece of hardware that's just sitting there, knows where to read data from, and what you do is just configure it.
You're not giving it a program; the "program" is really in the actual function-specific hardware. All you do in your code is say, "Activate that now. Here's the data stream, activate that." Then you have fixed-function hardware that just starts crunching through the data and decoding your video, for example, or applying a certain computation.
Another thing that people are doing in hardware is activation functions. Some activation functions are so popular, people use them all the time, so why would you break one down into 30 or 40 instructions when you can design a piece of hardware that does that and just that? All you're doing when you call that activation function is activate that piece of hardware.
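One way to picture a fixed-function activation unit is as a small lookup table baked into silicon: the function is precomputed once, and "evaluating" it becomes a single indexed read instead of dozens of arithmetic instructions. The Python sketch below is purely illustrative; real hardware designs vary (piecewise-linear approximations are also common).

```python
import math

# Toy model of a table-based activation unit: precompute sigmoid once
# over a fixed input range, then evaluation is a single table lookup.
# Table size and range are arbitrary choices for this illustration.

TABLE_SIZE = 256
LO, HI = -8.0, 8.0
step = (HI - LO) / (TABLE_SIZE - 1)
sigmoid_table = [1 / (1 + math.exp(-(LO + i * step))) for i in range(TABLE_SIZE)]

def sigmoid_hw(x):
    # Clamp to the table's range, quantize the input to an index, read once.
    i = round((min(max(x, LO), HI) - LO) / step)
    return sigmoid_table[i]

exact = 1 / (1 + math.exp(-1.0))
assert abs(sigmoid_hw(1.0) - exact) < 0.02  # small quantization error
```

The trade-off is the usual one: a fixed amount of silicon (the table) buys a constant-time, low-energy evaluation, at the cost of a small, bounded approximation error.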
So I guess if it's sort of laws of physics that are pushing this trend, it seems like you'd probably expect this trend to continue for a long time, right?
And if it does, where would it go? Would there be sort of even more and more complicated structures possible in the hardware, and wouldn't that sort of make research harder? What if you wanted to do a new activation function that wasn't available?
Yeah. So that's a really great question, Lukas. Let me try and answer the first big question first, and then we can branch down to these other sub-questions about research and how do we continue advancing this.
So, yeah, that's the reality. Right now, we already have quite a bit of diversity across different hardware chips and hardware parts. Just look at all the AI chip companies out there; just look at what's happening to general-purpose processors, like Intel processors getting specialized instructions that are relevant to machine learning, and so on.
That's going to continue, because, honestly, there's just no other way to get efficiency. Unless, and now let me open a nerdy speculation, we can teach atoms to rearrange themselves at the atomic level, like, "Let's reconfigure where your wires are", and therefore have your chip do a new thing.
There's a kind of chip like that, right? Like an FPGA. Is that it?
Yeah, but I'm going to get there. But there's no magic.
An FPGA is just a bunch of wires that are already there; you're inserting data that tells it which wires to use. But the wires are always there. And just the fact that you have a table that says, "If I have this bit on, I'm going to use this wire; if I have that bit on, I'm going to use another wire", just that causes inefficiency. So it's always a trade-off.
Think of it as a trade-off between how general or how specialized your hardware is...so there's a generality-versus-specialization curve. More general, less energy efficient, easier to program. More specialized, more efficient, harder to program, and so on.
But then you have FPGAs. How about FPGAs? FPGAs are essentially a very general fabric with a very complicated programming model. Because what FPGAs are is a bag of wires and little routing tables, sprinkled with some multiply-and-accumulate units, and more and more activation functions and other popular compute elements, in an even fabric. And then you just set bits to figure out how you're going to route the data.
So the way you program that looks like how you design hardware, and they can be very efficient if you do it right. But fundamentally they're not going to be more efficient than true fixed-function chips. You're never going to see an FPGA beating a GPU on the very same task. You see FPGAs competing with things like GPUs when you can specialize to your application, and even with the efficiency hit of the reconfigurable hardware, you still have a win. Does that make sense?
So for example, let's say you decide that you want a two-bit data flow for...say, quantization to two bits in one layer, three bits in another layer, and one bit in yet another layer.
It just so happens there's no existing CPU or GPU silicon that can do that for you. Chances are, you're going to be living with an eight-bit data plane, and you're going to ignore some bits there and waste efficiency, or you're going to do inefficient packing. But with an FPGA, you can organize it such that you only activate...you only route your circuits to use the two bits or one bit or three bits.
In that case, because the data type is more unique, you can specialize to your model, then you can do really well with an FPGA.
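As a rough illustration of the packing problem Luis describes (the helper names are hypothetical), here is what squeezing 2-bit quantized values into an 8-bit data plane looks like in software; specialized or reconfigurable hardware avoids this shuffling entirely:

```python
# Illustrative only: on an 8-bit datapath, four 2-bit quantized weights
# can share one byte; stored unpacked, each would waste 6 of its 8 bits.

def pack_2bit(values):
    """Pack a list of 2-bit values (0..3), four per byte, LSB first."""
    packed = bytearray()
    for i in range(0, len(values), 4):
        byte = 0
        for j, v in enumerate(values[i:i + 4]):
            assert 0 <= v <= 3, "value does not fit in 2 bits"
            byte |= v << (2 * j)
        packed.append(byte)
    return bytes(packed)

def unpack_2bit(packed, count):
    """Recover `count` 2-bit values from the packed bytes."""
    return [(packed[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(count)]
```

The shifting and masking in `unpack_2bit` is exactly the overhead a two-bit-native datapath would never pay.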
That makes sense.
And on research, to answer your question on research.
Research, I think, is getting more interesting, honestly. Maybe I'm getting old here, though I'd say I'm being old and optimistic rather than a curmudgeon, but I've never seen computer systems architecture and systems optimization being as interesting as they are right now.
There was a period when research in this area was just about making microprocessors faster and making slightly better compilers. But now that we have to specialize, and there's this really exciting application space with machine learning that offers so many opportunities for optimization, and you have things like FPGAs, and it's getting easier to design chips, we create all sorts of opportunities for academic research and also for industry innovation.
Hence, we see all these wonderful new chips: Xilinx with new FPGAs, new FPGA companies, some novel reconfigurable fabrics, and all of these cool hardware targets.
I guess I'm curious, it seems like ML is becoming a bigger and bigger fraction of data centers, and data centers are becoming a bigger and bigger fraction of global energy use.
Do you feel like there's an environmental impact that you can have by making these things run more efficiently?
Absolutely, yeah. And we're not the only ones to make that claim. Essentially, every time you make an algorithm faster in the same hardware, you're saving energy, you're saving trees. You're reducing resource pressure.
Performance optimization is this wonderful thing that you can reap the benefits in so many ways. If you make it faster, you're gonna make your users happy. But also even if it's not latency sensitive, you're going to make your finance folks happier because they're gonna spend less on cloud bills. But in the end you're going to be using less energy. And that really matters.
Now, what's interesting about environmental impact specifically is that, as you pointed out, there's a growing fraction of energy in the world that's devoted to computing. I'm not going to get into cryptocurrencies. We're not going to go there right now. That's a whole separate topic, thinking about the energy costs of that.
Let's just think about the energy costs of machine learning infrastructure, which includes training and deploying models at scale. It's fair to say that in a typical application that uses machine learning today, the majority of the cycles will go to the machine learning computation, and to memory that you have to keep alive with energy.
So, absolutely. You should take every opportunity you can to reduce the energy that your models use, especially if it's applied at scale. Even if it doesn't matter from a user experience point of view, we should do it because that's just the right thing to do.
Can you really separate the model compiling and performance from the way that the model is designed? It feels like a lot of the performance improvements in models come from sort of relaxing the constraint that you need to do the convolution or the matrix multiplication exactly.
I mean, just for example, quantization, where you go to a ludicrously small level of precision and it seems to work really well.
No, absolutely. And I did not mean to imply that we should only do model compilation. Remember that I said I'm assuming you're going to come with your model tuned for the least amount of computation you can possibly use.
That's the ideal case, but you're absolutely right that there are optimizations at the model level that actually change the statistical representation of the model and enable new optimizations. And we can do that too; TVM does have growing support for quantization.
But what I'm particularly interested in, in general, is how you put things like TVM in the whole network architecture search loop. As you make decisions about your model architecture, and as you retrain for different model architectures, you can make new optimization decisions at the model layer: change the convolutions, the data types, and do all sorts of things like pruning and compression, deep compression, et cetera.
Now, put a compiler like TVM in the loop, and measure the performance that you're getting as part of your search loop, because then you really get the synergies. You're right that you can decouple them in principle, and you're still going to do relatively well. But if you do both of them together, I think you're up for more than the sum of either alone in terms of potential opportunities.
That's what TVM did in terms of high-level graph and low-level optimization. By doing them together, we showed that we can do better. And I do think the same thing... I have data points to show that the same thing could happen if you do model building and tuning decisions together with model compilation and hardware tuning.
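A toy sketch of the "compiler in the search loop" idea: candidates are ranked by measured latency rather than proxy metrics like FLOPs. In a real system, `compile_and_measure` would invoke TVM to compile and benchmark each candidate on the target device; here it is a made-up cost model, and all numbers and function names are illustrative assumptions only:

```python
# Illustrative only: architecture search with a compile-and-measure step
# in the loop. A candidate is (depth, weight_bits).

def compile_and_measure(candidate):
    """Stand-in for compiling with TVM and timing on real hardware."""
    depth, bits = candidate
    return depth * 1.0 + (32 / bits) * 0.1  # pretend milliseconds

def accuracy_estimate(candidate):
    """Stand-in for a trained candidate's validation accuracy."""
    depth, bits = candidate
    return min(0.99, 0.70 + 0.02 * depth + 0.01 * bits)

def search(candidates, latency_budget_ms):
    """Pick the most accurate candidate whose *measured* latency fits the budget."""
    feasible = [c for c in candidates if compile_and_measure(c) <= latency_budget_ms]
    return max(feasible, key=accuracy_estimate) if feasible else None
```

The point of the sketch is the feedback loop: because latency comes from actual compilation and measurement, model-level choices (depth, bit width) and compiler-level optimization are explored together rather than decoupled.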
Are there trade-offs between... Like with GCC, you can optimize for memory or you can optimize for speed. Is there a latency-memory size trade-off here? Or are they both sort of aligned with each other?
Yeah. So that's a great question.
Of course, one optimization that definitely impacts memory usage specifically is when you do model compression or if you do quantization. So if you go from FP32 to int8, you already have a 4x footprint reduction in your... You go from 32 bits to 8 bits-
But that'll also make it run faster, right? So there's no real trade-off there if the quantization keeps the performance high, right?
If you're assuming quantization that's just, you have the same model architecture and you just change the data type and go, that's sort of the easy, lazy quantization. The right way of doing it, in my opinion, is that once you change the data type, you're given an opportunity to actually go and retrain it, and some parts of your model become less...
I think the right way of doing quantization is not just "Quantize your data type and forget about it". It's actually "Close the loop and put it on a network architecture search", such that as you change the data type, you actually allow for different types of... and then in that case, I think you're up for significant changes to the model that would make quantization potentially even more effective.
But I did not answer your question. So what's the trade-off between latency and footprint? Well, it could be that. It could be that you actually quantize your model, but then you actually make it deeper to actually make up for some accuracy loss, which might make your model actually potentially slower, but use a lot less memory. So there is that trade-off there too.
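The footprint arithmetic behind this trade-off can be sketched with made-up round numbers (the parameter counts are my own illustration, not figures from the conversation):

```python
# Illustrative only: model footprint at different weight precisions.

def footprint_bytes(num_params, bits_per_param):
    """Raw weight storage: parameters times bits, converted to bytes."""
    return num_params * bits_per_param // 8

fp32 = footprint_bytes(10_000_000, 32)        # 40 MB at FP32
int8 = footprint_bytes(10_000_000, 8)         # 10 MB at int8: the 4x reduction
# A deeper int8 model that recovers lost accuracy: more parameters,
# possibly slower, yet still far smaller than the FP32 original.
int8_deeper = footprint_bytes(15_000_000, 8)  # 15 MB
```

This is only weight storage; activations, runtime buffers, and padding from inefficient packing add to the real footprint.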
I guess my experience of deploying models, and I'm just an amateur at this, but I love my Raspberry Pis and other cheap hardware.
And we support Raspberry Pis pretty well in TVM, you should try it out.
I will definitely try it after this.
So I did it kind of in the early days of trying to get TensorFlow to run, when even that was a challenge. And I felt like basically, with models, it was sort of binary where either I could fit it in the Pi's memory and it would run or I couldn't fit it in the Pi's memory and it wouldn't run. So it seemed like less about sort of optimizing and just, either I'm sort of stuck or I'm not.
Is that a common situation, or?
It's hard to say if it's common. Often, at least for the models that we get, by the time they reach the point where we pay attention to them, we know that they run, but they typically don't run at, say, the frame rates that you want to get.
Half a frame per second, say, and you have to show a path to 20 frames per second. By that time the model already fits, so you're optimizing for performance. But often this performance optimization also comes with model size reduction; quantization is one example.
Let's say if you can just go from FP16 to int8 and it works well, boom, you do that. You probably improve performance and you also reduce model size. But I've seen plenty of cases where the model already runs and what's hard is [to] actually get to target latency that would actually enable the model to be useful.
That's by and large what we tend to see: you get your model to run, you hack it enough to get there, but then it's never fast enough. And then you know you've got another 10x ahead of you for it to actually be useful.
I don't want to not ask you about your company OctoML. I feel like you're one in a growing line of people that I'm talking to that are professors and they're starting companies. I mean, what inspired you to build this company?
Yeah. Great question. So first of all, it's one of those moments where all the stars are aligned.
We started the company just a little under two years ago. TVM had gotten quite a bit of adoption by then already, and we saw more and more hardware vendors starting to choose TVM as their software stack. We ran our second conference here in Seattle and I saw a room full of people. I thought, there's an opportunity here to make what TVM can do more broadly accessible.
And then, I said the stars were aligning because I was looking to start another company, and I had become full professor a couple of years before then. A lot of the core PhD students in TVM were graduating. One of our big champions of TVM, Jason Knight, who was at Intel at that time, is one of our co-founders and was also looking to start something, and all the stars aligned.
I feel extremely lucky that we had that group of people ready to start a company. And we work really well together. There's a lot of synergy there. But that's sort of like "the stars aligned" part. Now in terms of technology, it became really clear to all of us that, look, you have this cross-product between model and hardware, and there's such a huge opportunity to create a clean abstraction there, and at the same time automate away what's becoming harder and harder about making machine learning truly useful and deployable.
Honestly, in MLOps — and I don't love that term, because it means so many things — but going from data to a deployed model, it's clear that the tools to create models got good pretty fast. There are a lot of people that can create models today, and good models, a large repository of models to start from.
But after interviewing a bunch of potential customers, we realized that people actually have a lot of difficulty getting models deployed, precisely because of the software engineering required and the level of performance and cost requirements needed to make it viable. So we formed OctoML to essentially make TVM, or technologies like TVM, even more accessible to a broad set of model builders, and also make it part of the flow.
Let me just tell you briefly what the Octomizer is. The Octomizer is a machine learning acceleration platform with TVM at its heart. You have a really clean API, just a couple of calls: upload a model, choose a hardware target, optimize, then download the optimized model.
You upload the model, then you can choose the hardware targets that you want. The Octomizer calls TVM, or can also use ONNX Runtime, and we're going to keep adding more...again, we want to offer users the abstraction that you upload the model and you get the fastest possible model ready to be deployed on your hardware in a fully automated fashion.
You either get a Python (?) ready to download, or, we're working on gRPC packaging, so we can deploy in the cloud or to cloud functions and so on. So the value add here is all this automation that we provide on top of TVM, and also the fact that, as I mentioned, TVM uses machine learning for machine learning, and we have a data set for a lot of the core hardware targets that the world cares about, just ready to go. So you don't have to go and collect it yourself.
I would think running OctoML, you would have real visibility into how the different hardware platforms can compare with each other. I'm sure you don't want to offend the hardware partners, but do you sort of have first-pass recommendations for what people should be targeting in different situations?
Yeah, and that's one of the things where I want the numbers to speak for themselves.
So what you can do, if you come to the Octomizer, is...we are open for early access and we actually have some real users already using it regularly. You upload a model, then you can choose all sorts of hardware targets, and then you're going to get a dashboard saying, "Here's your model, here's the latency on each one of these hardware targets", and we can compare TVM with other runtimes, like ONNX Runtime, for example, and we're going to show you which one you should use, and you can choose based on that.
Of course, we are working hard to improve the interface to enable users to make decisions about costs too. You might want the highest throughput per dollar, for example. I would say it's fair to say that models vary so much that it's hard to say upfront which is going to be the best. What you should do is run it through the Octomizer, get the most efficient version and binary of your model out, and then measure that.
Well, I guess that kind of actually leads me into the two questions that we always end with, which I want to give you kind of time to chew on. And I haven't asked you about a lot of your research. It seems super fascinating, but I guess I wanted to ask you, what do you think is a topic in machine learning that doesn't get enough attention?
That if you had extra time to just work on something that you're interested in, maybe you would pick to go deeper on.
Yeah. So I would say it's getting more and more attention now, but a lot of my research has always been in automating systems design with machine learning and for machine learning.
TVM is one example of using machine learning to enable better model optimization and compilation. But also, doing hardware design, and programming FPGAs for example, is really hard, and machine learning could have a huge place there.
So, designing...what I want is really "model in, and automatic hardware plus software out", ready to be deployed. That's one I'm passionate about, and I think you can have quite a bit of impact precisely because you can reap the benefits in so many ways.
You get new experiences because you enable new applications, but also make it more energy efficient. So I think we should actually always look at what is the energy cost of deploying this at scale, if it's going to be deployed at scale. Because in rich countries, you don't think about it. You just go pay the energy, even if it's high. But now if you really actually think about the environmental impact of running these at scale, it's something that one should pay attention to.
So this is actually using machine learning to optimize the model?
Using machine learning to optimize not just the model, but also the system that runs your model, such that you get better behavior out. That can be faster, higher throughput per dollar, but also much lower energy use. And I think it's definitely incredibly exciting and possible to do.
So that's one of them.
Now, let's see, one that doesn't get as much attention, but is now getting more, that's dear to my heart, is the role of machine learning in molecular biology.
Oh, right. Me too. I totally agree.
So as part of my research personality, for the past six years or so, I've been heavily involved in an effort to design systems for using DNA molecules for data storage and for simple forms of computation. Some of it is actually related to machine learning.
For example, we recently demonstrated the ability to do similarity search directly as a chemical reaction. And what's cool about that is that, not only is it cool, it's pushing a new device technology alternative that's very viable and has been time-tested by nature-
Time-tested for sure.
Yeah, it can be extremely energy efficient. And fundamentally, the design of molecular systems is so complex that I cannot imagine any other way to design them than using machine learning to actually design those molecules. And we do it all the time.
We had a paper late last year that you might find cool; it was in Nature Communications, called Porcupine⁷. We used machine learning to design DNA molecules in such a way that they look so different to a DNA sequencer that they won't be mistaken for natural DNA. You can use this to tag things.
We designed these molecules so you can go and tag art or tag clothes and so on. Basically, you take a quick sample, run it through a sequencer, and you can authenticate the item based on these molecular traces. But that was made possible because of machine learning, both in designing the molecules and in interpreting the signal out of the DNA sequencer.
I feel this space...it's not fair to say it's not getting enough attention; I think it's getting more and more now, precisely because of the pandemic and all of the other reasons why molecular biology matters. But I find it incredibly exciting, and it's a lot of the high-level motivation for things that I do both in research and in industry: enabling use cases like that, things that require so much computation that they wouldn't be possible without a very efficient, very fast system.
I guess the question we always end with, which you've touched on a lot in this conversation, is: what do you see as the big challenges today of getting machine learning working in the real world? Maybe when you talk to your customers and they optimize their models, what are the other challenges they run into when they're trying to get their optimized model deployed and working for some end use case?
I devote a good chunk of my life to deployment, to automated engineering around deployment, but I don't want to sound too self-serving and say that's the biggest problem. I think it's a huge problem. It's a huge impediment in terms of skill set, because it requires people who know about software engineering, about low-level system software, and about machine learning. So that's super hard.
That's one, definitely getting a model ready for deployment. But then there are others, like making sure that your model is behaving the way it's expected to post-deployment: observability, making sure that there aren't unexpected inputs that make your model misbehave, having fail-safe behavior, and so on. That's probably no news to this community. Some applications require it, either because it's the right thing to do, or because when a model is making decisions that are super important, you want to understand how those decisions are made and make sure they actually hold up on unexpected inputs.
So I think that's one of the harder ones, because like any engineer thinking about the whole system, you have to think about the weakest link in system failure. And I worry that if you don't do something proactively, the weakest link in these systems is going to start being the models that you can't really reason about in a principled way.
Yeah. Awesome. Well, thanks for your time. This was a lot of fun.
Of course. Thank you, Lukas, this is awesome. Yeah, I enjoyed it immensely. Thank you.
If you're enjoying Gradient Dissent, I'd really love for you to check out Fully Connected, which is an inclusive machine learning community that we're building to let everyone know about all the stuff going on in ML and all the new research coming out.
If you go to wandb.ai/fc, you can see all the different stuff that we do, including Gradient Dissent, but also salons where we talk about new research and folks share insights, AMAs where you can directly connect with members of our community, and a Slack channel where you can get answers to everything from very basic questions about ML to bug reports on Weights & Biases to how to hire an ML team.
We're looking forward to meeting you.