Guest Bio

Phil Brown leads the Applications team at Graphcore, where they're building high-performance machine learning applications for their Intelligence Processing Units (IPUs), new processors specifically designed for AI compute.

Show Notes

Topics Covered

0:00 Sneak peek, intro
1:44 From computational chemistry to Graphcore
5:16 The simulations behind weather prediction
10:54 Measuring improvement in weather prediction systems
15:35 How high performance computing and ML have different needs
19:00 The potential of sparse training
31:08 IPUs and computer architecture for machine learning
39:10 On performance improvements
44:43 The impacts of increasing computing capability
50:24 The ML chicken and egg problem
52:00 The challenges of converging at scale and bringing hardware to market

Transcript

Note: Transcriptions are provided by a third-party service, and may contain some inaccuracies. Please submit any corrections to angelica@wandb.com. Thank you!
Phil:
We can no longer rely on things just getting better, where every two or three years we'll get another 50% or 2X energy efficiency, or whatever the scaling is. That's really slowing down. So the specialization of the processors is being driven by that. So we need an architecture that is more memory-efficient. If you go back to the fundamental processor, we don't move data very far. So the whole architecture is geared around data staying local for the processing, and the physics of moving data is one of the things that really drives power consumption. So there's doing the actual operations, so driving the computational units, and then there's moving data to and from your memory subsystems. So if your memory's very close, the energy cost of moving data there is a lot lower, compared with if it's off chip, where the cost tends to be a lot higher. This goes into the power consumption of the device: where are you spending your power?
Lukas:
You're listening to Gradient Dissent, a show about machine learning in the real world, and I'm your host Lukas Biewald. Phil Brown leads Graphcore's applications team, building high performance machine learning applications for their intelligence processing units, or IPUs. Phil's background is in computational chemistry, which is maybe one of the topics that I really wish I knew more about. What he works on now is hardware for machine learning, which is the other topic that I really wish I knew more about. So I always say this, but I could not be more excited to talk to him today. I really want to talk about Graphcore and what it does broadly, and what you're doing there. But I thought it might be fun to start off with, I was looking at your background and I saw that you were originally trained as a computational chemist and then worked at Cray. And we've actually noticed at Weights and Biases a whole bunch of computational chemists using our software, which has been intriguing. I wanted to hear your career path and how you ended up at Graphcore.
Phil:
Yeah, certainly. So it's been a bit of an interesting journey and I would be interested to know what they were doing, whether they... I mean I guess running sets of molecular dynamics or quantum chemistry calculations.
Lukas:
It seems like there's a lot of drug discovery and some material science, yeah.
Phil:
Okay, yeah. That's pretty much what I used to do a long time ago. So running computational simulations of various different spaces. The way I ended up in the machine learning space was via the high performance computing arena. And actually my PhD was writing computational chemistry codes, so quantum chemistry, density functional theory embedded inside a molecular dynamics simulation, and actually looking to try and accelerate the density functional theory, the quantum chemistry bit of that, using very early accelerators. So actually, I did a PhD at the University of Bristol and there was a company in Bristol called Clearspeed that were building an early numerical accelerator. Think of the period right when the first GPUs were coming out and the Cell processor came out, if there are any people who were playing around with that. So PS3, that kind of era. And this company was trying to build double-precision, so HPC accelerators, so I was actually writing code to use those for these kinds of computational chemistry simulations. So about the 2005, 6, 7 kind of timeframe. And actually as it happens, my boss today is somebody who worked at Clearspeed and was building those systems, and a number of the team, particularly the software team, have heritage kind of going back to that. There's a bit of a Bristol group of hardware and software engineers who have done various kinds of things over the years. That was really what got me in from being a chemist; I am not a computer scientist in any sense. I kind of dabble a little bit but I'm not a software developer or a computer scientist. And that took me from a pure chemist into the HPC and computational science domain of high performance computing and building this sort of thing. I spent a couple of years in a consultancy that was specializing in helping people buy these systems, and then went to Cray, helping design and build these systems. And actually I did a variety of different things; I spent a couple of years focusing on weather forecasting. So the numerical science and how you build large production systems for weather forecasting. And particularly in the US, NOAA and the National Weather Service, here in the UK the Met Office, and actually around the world; at that time Cray were building systems for 80% of the large weather centers, so the national weather centers and those kinds of things. So that was great fun. But as the machine learning domain started taking off, I was quite interested in that as a field. It was clear actually that supercomputing, the high end, wasn't going to continue growing at an exponential rate, and there happened to be this little company in Bristol called Graphcore that had a really interesting technology and was just starting to make some waves. So I got in touch with Matt¹ actually, and a few other people, and ended up coming to join Graphcore, leading the field engineering group, so the customer-facing and technical teams working directly with our customers to build applications.
1: Matt Fyles, SVP Software at Graphcore
Lukas:
Can you talk a little bit about that? I've always been fascinated by how weather prediction works.
Phil:
So weather prediction is an interesting field. Fundamentally it's quite simple, the atmosphere is a set of fluids interacting. So you can describe that with a set of equations and you can just kind of solve using those equations. So in some sense it's just a giant fluid dynamics simulation. But it's also a bit more complicated than that because you've got particles, you've got lots of very interesting surface effects. You've got the Coriolis effect where the earth is actually rotating. You've also got quite an interesting initialization problem in that space because you don't... I mean, climate simulations are much longer duration; with weather forecasting simulations you typically only care about the next 10 hours, 12 hours, two weeks. So your initialization is actually critical. So actually the data assimilation, where they take the global set of satellite observations and other kinds of weather observations and integrate those into the model as the starting position, is a really, critically important part of that. So there's lots of quite hairy maths and lots of big computers to try and scale these systems. The other thing that's quite close to machine learning, or certainly common, is this idea that time to train, time to get to a solution, is quite important. If you're running a big simulation and you're going to have to wait three weeks for it, it's pointless actually running it. Your experimental cycle has to be manageable. And in weather forecasting, a forecast has to, say, predict two weeks ahead within two hours of compute, or something like that. So actually being able to meet that operational deadline for delivery was quite important.
Lukas:
How does the physics, the physics simulations, compare to a more machine learning approach where you make fewer assumptions about the underlying physics and just try to treat it as a standard prediction problem?
Phil:
In NWP, and in most of the computational sciences in general, you're building a simulation based on some set of physics or chemistry or material science, or whatever particular discipline you're in, biology; there will be some set of fundamental principles that you are modeling in your system, so it's very much a science-based, first-principles-based approach to solving these problems. I mean they typically do have approximations in them, so there's quite a bit of interest I think, particularly in the climate field, but also in the weather field, in replacing some of their parametrizations of systems where the physics is too expensive to run. So the particle interactions are too expensive to model directly at large scale. So up till now they have used approximations for that, and they're actually trying to replace their basic approximations with machine learning models that will be cheaper, or more accurate, or both. So there is that kind of interaction where, with everything in the entire world, you could technically simulate everything right down to the lowest quantum interaction level, but that would be phenomenally expensive. You wouldn't necessarily want to do that.
Lukas:
Also you can't observe it? I think the observations would be messy.
Phil:
Well I mean if you're going right down to an individual electron, yes, you wouldn't be able to observe that state. But the quantum interactions, the difference between the biology and the chemistry, or the molecular dynamics sphere and the quantum mechanics sphere, is where you've got these binding energies where you're actually making and breaking bonds. Those are the quantum mechanical effects starting to come in, like you're making those bonds. So you can accurately simulate those things, it's just you can't observe the individual particles at that level. So the simulation of the kind of binding energy is still possible at that level. But I mean that's phenomenally expensive. At the time I was doing it, it was difficult to model water and maybe small groups of water molecules where you've got the hydrogen bonds; that was getting a little bit expensive. I suspect a decade on we're probably a bit further than that now, but still you won't be able to model... Well, they might just be able to do a full protein or something like that. But it's also a question of, is it meaningful? You don't need that level of fidelity or that level of modeling, so where do you want to spend your compute time?
Lukas:
Or even your observation time, I'm imagining modeling the weather on planet earth, you can't get very fine grained at all, right? From observing the state of earth.
Phil:
Well that used to be the challenge, it very much used to be the challenge. It's a lot better now that they've got satellites that give them complete world coverage. The challenge before that was that you didn't have observations. Actually the Met Office have an interesting set of observations and analysis around D-Day in 1944, the invasion of Europe, and the prediction of the weather window in which they actually launched the invasion. The Germans did not think that there was going to be a weather window, based on their analysis of the weather, because they had much, much sparser observations in the North Atlantic. So at that point in time there was a real lack of observational information. I think that's been closed; I mean satellites today give you full globe coverage for a lot of things. They maybe don't give you the vertical profile in the atmosphere that you might want in some places, but they also have observations from aircraft and a range of other things.
Lukas:
So this is kind of a naïve question I guess, when I look online or go onto Dark Sky or something and get the seven day forecast, are those meaningfully improving over my lifetime?
Phil:
Yes, I mean it depends if you're looking at these things emotively. If you look at the analysis, yes, they're measurably getting better.
Lukas:
What does it mean to look at it emotively? Just feeling that it's wrong?
Phil:
I don't know, we're British, we're always complaining about the weather and we're always complaining about the predictability of the weather here. It's raining here at the moment. But the improvement in these kinds of forecasts is incremental; over a decade, though, the accuracy of a forecast a day out has improved quite significantly.
Lukas:
This is probably location dependent, but at what point does forecasting out based on the physics of what's going on stop being meaningfully better than forecasting based on climate, or the average state of the weather? Like, can you predict out three weeks and have a meaningful gain with a physics-based model?
Phil:
So the numerical systems, and this is getting to the edge of my knowledge now, I think the numerical systems are good out to two weeks. So the long range forecasts are typically out to the two to three week window. Then they're now starting to do seasonal, bridging the gap between climate, which is multi-year and decadal, and the short term NWP². They're starting to do seasonal prediction, and they are showing skill, i.e. prediction above random, prediction above the climatology³, so they're starting to show skill beyond that. And for things like El Niño prediction, they are starting to show skill out at that kind of timescale. But it's very much the mean, for example are we going to have a wet summer or a dry summer. The challenge, I think, for those organizations when they're articulating that is... So the Met Office had a wonderful thing where they said it was going to be a barbecue summer. The headlines were barbecue summer, that was picked up by the press. What actually happened was it was a little bit warmer and a little bit wetter, but people's perception of what barbecue summer means is that it's going to be nice and dry the entire time. That's not necessarily what the prediction's saying; slightly warmer than average and slightly wetter than average doesn't really translate to people's experience. So interpreting the information can be quite challenging, but making it generally understandable is the real challenge.
2: NWP = numerical weather prediction
3: Where they just look at history and base it on the average of history
Lukas:
We should talk about chips but I have one more question.
Phil:
We should, yes, we should stop-
Lukas:
One last question. What is the function that you're trying to optimize when you predict weather?
Phil:
Oh, well so they're not trying to optimize.
Lukas:
Well how do you measure success, I guess?
Phil:
So yeah, that's a better... So they have a very wide range of metrics. So they're looking at sea surface temperature, and you're comparing the state of the atmosphere that you predict against the state of the atmosphere that actually exists. So you have a set of observations: the temperatures, the atmospheric pressures, the amount of precipitation. There's a huge range of skill scores that these organizations generate. If you're interested in this, ECMWF, which is the European Centre for Medium-Range Weather Forecasts, has quite a detailed set of... if you go and dig into their webpage, quite a detailed set of analysis on their forecasts. And as they're producing new forecasts, they're producing analyses of where it is improving and where it is degrading relative to what they had before. And ideally you want all of the numbers to be green. So they're doing quite a lot of work there. And you can actually see the evolutions, and going back to computing, you can actually see the evolution of computers there as well, because they step up the resolutions as they're getting better systems. As they work as a software team to develop their software, they're delivering higher resolution forecasts, which tends to translate to better accuracy in the models.
Lukas:
Thanks for digressing, that was fun. And if anyone's listening or watching this and knows more about this, let us know. I'd love to know more.
Phil:
I will have to admit, my knowledge is very much... It's probably five years old and even at that point I was not an expert in this space, so I will apologize if I have got anything massively wrong, please correct me.
Lukas:
It's okay, so I guess you felt like high performance computing wasn't growing as fast as what Graphcore's doing. I guess what is the difference between high performance computing and Graphcore? Why isn't it the same kind of problem with the same kind of hardware solution? And I should say, I don't really know what high performance computing is so I think I need some definitions to even understand the point.
Phil:
High performance computing in the sense of numerical simulation where you're using a set of physics or chemistry to create a model of a system and generate some kind of prediction of behavior, or generate some kind of output. Typically those systems are relatively input light, so you'll be inputting a small amount of information, a model or structure of a protein that you want to have... a ligand that you've got an interaction with or a description of a furnace or something like that, and a flame. And you want to understand how that system behaves. So you actually generate huge amounts of information out of those kinds of systems. And there's a huge space and it's been going for many decades, and has in the past 20 or 30 years been growing moderately fast. The machine learning space is really geared around taking very large amounts of data, very large amounts of data and using that to build... Rather than apply a set of rules to that data, you're using the data itself to build the model system and to learn the rules itself. So it's a learning system rather than a system you're designing to solve a problem. I think that's probably the easiest, at a high level, way of describing it.
Lukas:
What does that translate to you? I can kind of see how those are different, but I'm imagining, ah there's probably a bunch of linear algebra underneath both of those problems, why do you need different types of hardware to solve them well?
Phil:
So the differences, from a computational science perspective, are that generally the HPC simulations require quite high precision. And there is a bit of debate in that community about whether you really need 64 bit everywhere, whether you should really be doing 32 bit in some places. But generally you need quite high precision for most of that field. And 90% of it, I'm guessing today, is probably done in double precision. With machine learning you're trying to learn from a very large volume of data and make actually a relatively limited set of predictions out of it. But it's the learning process, and what's become very clear is you don't need very high precision when you're in this kind of learning process. So NVIDIA started out, or the people who were leveraging NVIDIA GPUs started out, using single precision. GPUs were good at single precision and it was much faster than double precision. Then people discovered, well, you don't actually need single precision, you can do it in half precision. So somebody built some hardware that was better at half precision. So they started leveraging 16 bit, and when people are doing inference they're using 8-bit INTs and 4-bit INTs, and people are even playing around with binary formats. So it's very clear that this domain has, from a computation perspective, a very, very different characteristic, with different requirements from a numerical precision perspective. And then the other thing that's quite clear, and actually quite interesting about this space, is that today we treat almost everything that we work with as dense linear algebra. So if you look at a classic CNN model like a ResNet, that convolutional network is typically translated, when you're actually doing the maths with it on the computer, into some kind of dense structure that you're working with. Even though a convolution could be looked at as a relatively sparsely connected pattern. And if you look at transformers and these kinds of systems that we're using, which seem to be eating the world in natural language processing, they are big matmuls, big dense matrix objects. What we also know is that if we train a model, at the end we can then prune it quite aggressively and not lose very much fidelity. Particularly if you go through a few training cycles afterwards as well. And there have been a number of papers, Rigging the Lottery and a number of other ones, that are theorizing that actually what we're looking for, the systems we're interested in, are actually fundamentally sparse. So we want to be able to train sparse systems. We think if we could train these systems in a sparse way, we'd save a huge amount of FLOPS. If we only had 10% or 1% of the parameters in the system, we wouldn't be calculating all of these other numbers. So there's a real interest in these systems in actually being able to do sparse algebra efficiently. And not just for inference, but for training as well. We also are in a place where OpenAI and some of the very large organizations in this space, or organizations with access to very significant compute power, are building huge models. It would be really nice to not have to quite go as far as that. So if I didn't have to build a five trillion parameter model, and I only had to build a 500 million parameter model, that would save me a lot of compute. It would reduce the cost of using that model, it would reduce the cost of training that model.
I might still have to train it over a very big data set, but it would make it a lot cheaper to do iterations upon that. So that's the other thing that I think fundamentally differentiates the machine learning space and the problems that we're trying to solve. And that's not to say there aren't sparse problems in HPC, there definitely are. But that combination of sparse and low precision, and particularly the sparse bit, is not something that's really been catered for before.
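To make the pruning idea above concrete, here is a minimal sketch, in PyTorch, of post-training magnitude pruning: take a trained dense layer, zero out the smallest weights, and see what fraction of the multiply-accumulates a sparse kernel could in principle skip. The 90% sparsity target and the layer size are purely illustrative; this is not Graphcore's method, just the generic technique being referred to.

```python
# Minimal sketch of post-training magnitude pruning: zero out the smallest
# weights of a (nominally trained) dense layer and report how much of the dense
# compute a sparse kernel could in principle skip. Thresholds are illustrative.
import torch
import torch.nn as nn

layer = nn.Linear(1024, 1024)          # stand-in for a trained dense layer
sparsity = 0.90                        # keep only the largest 10% of weights

with torch.no_grad():
    w = layer.weight
    threshold = w.abs().flatten().kthvalue(int(sparsity * w.numel())).values
    mask = (w.abs() > threshold).float()
    w.mul_(mask)                       # prune in place

kept = mask.mean().item()
print(f"weights kept: {kept:.1%}  ->  dense FLOPs a sparse kernel could skip: {1 - kept:.1%}")
# In practice you would fine-tune for a few epochs after pruning to recover
# accuracy, which is the "few training cycles afterwards" mentioned above.
```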
Lukas:
Well the sparse bit is not something that's really supported, right? In general practice. Are there ways to take advantage of that sparseness now with existing hardware to train faster?
Phil:
So today, or as of... Not really, and this is one of these chicken and egg problems where somebody needs to go and build some hardware that allows you to solve these kinds of problems, but nobody builds the hardware until the problem's there that really justifies it. So we are starting to see these kinds of things evolve. So one of the things that I'm really excited about with our next software release is that we're including both static sparse libraries, the ability to work with static sparsity, where you know the sparsity pattern up front. This might be the attention matrix in a system, or it might be a mask or something like that; you can typically know some of these things up front. As well as dynamic sparsity, where you don't know the sparsity pattern, so you can have a changing one. And we can deliver this with actually very significant performance on our architecture. Because that's one of the things about the IPU, it was significantly designed to be a very fine-grained processing system and to be able to target these problems, as well as being fast and good at the dense stuff too. This is the thing, you can build sparse computing systems, but they typically go so much slower than the dense computing systems that actually just running the sparse problem on the dense hardware, filling it full of zeros, makes much more sense.
Lukas:
That's funny, I was going to mention that. I mean I am decades out of date on this, but I remember doing a little bit of work on this in grad school. I mean I would predict I guess, based on my incredibly old experience, that a sparsity factor of 1%, you might as well just fill in all the zeros like you were saying and not even worry about the sparseness.
Phil:
Yeah, and going back to the HPC space, people have never used the sparse solvers, or the sparse linear algebra, within the HPC space because they're so slow, unless they've got a 99.9% sparse problem, in which case they start making sense. So one of the interesting things about the characteristics we have in machine learning is that they aren't that sparse, actually. They're dense enough that doing the pure sparse arithmetic doesn't necessarily make sense, but we also believe that some of our structures are big enough that you can get away with having small, dense blocks within them. So the thing that's really difficult with 100% sparse systems is... Well, there are a couple of things that are difficult. The access patterns moving around a lot is something that's quite difficult to handle. But from a really low level computational perspective, the way that we get efficiency on all of these computer architectures is by having dense block structures that we work with, and particularly two dimensional functional units. So if you want to keep those busy, you need a block of work that's about the same size as those units. So for us those are quite small, they might be 16 by 16. So actually in big structures the accuracy degradation that you get going from a pure sparse system to one of these small block sparse systems isn't too much. And I say that, but there's been a very limited amount of work on this because the hardware just hasn't existed. But the indications are that it looks like there's a really nice compromise, where you can get really great performance whilst leveraging this big, sparse system. So I would say we're right on the cusp of people starting to be able to use these systems and fundamentally explore and develop the algorithms, both the sparse training as well as understanding where the break points are. I mean it may be that we discover, actually, no, no, 16 by 16's too big. What we really want is a four by four, or we want an eight by eight. Or we discover 16 by 16 works great if we're doing GPT-3 and you've got really big matrices, but it doesn't work so well if we're doing BERT and you've got slightly smaller matrices. So there's a trade off in terms of the block size relative to the hidden size, or something like that. So I think we don't know, and I think that's what's so exciting at the moment, is that there is some really new ground. I would say the one thing that attracted me to this space was, a) it's clearly a really interesting, growing field, but also it's virtually, I wouldn't say completely, green field; there's so much we don't know. I mean the evolution over the last five years has been astonishingly fast and it's been really exciting to be part of it.
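A rough sketch of the block-sparse idea discussed here, with the 16-by-16 block size Phil mentions: the big matrix is mostly empty, but the non-zeros live in small dense blocks, so each unit of work is still a small dense matmul that maps well onto two-dimensional functional units. This is illustrative NumPy, not Graphcore's sparse library.

```python
# Rough sketch of a block-sparse matrix multiply: the 1024x1024 matrix is mostly
# zero, but the non-zeros are stored as dense 16x16 blocks, so each piece of
# work is still a small dense matmul. Purely illustrative.
import numpy as np

B = 16                                  # block size discussed above
n_blocks = 64                           # 1024 x 1024 matrix overall
rng = np.random.default_rng(0)

# Keep ~10% of blocks: a boolean block mask plus dense data for the kept blocks.
block_mask = rng.random((n_blocks, n_blocks)) < 0.10
blocks = {
    (i, j): rng.standard_normal((B, B)).astype(np.float32)
    for i, j in zip(*np.nonzero(block_mask))
}

def block_sparse_matmul(blocks, x):
    """y = W @ x, where W is stored only as its non-zero 16x16 blocks."""
    y = np.zeros((n_blocks * B, x.shape[1]), dtype=np.float32)
    for (i, j), w_block in blocks.items():
        y[i * B:(i + 1) * B] += w_block @ x[j * B:(j + 1) * B]  # small dense matmul
    return y

x = rng.standard_normal((n_blocks * B, 8)).astype(np.float32)
y = block_sparse_matmul(blocks, x)
print(y.shape, f"blocks computed: {len(blocks)} of {n_blocks * n_blocks}")
```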
Lukas:
Can I just... I'm just trying to picture this, I'm not an expert at all on this space, but does sparsity help with something simple like for example, a convolution? I'm trying to picture what even a sparse convolution would mean, does it mean a lot of the parameters are zero? And my input data is certainly probably not going to be sparse, right?
Phil:
It possibly doesn't make sense to think of it in a convolution. Although you could clearly maybe have a larger... So typically in a convolution you have a small mask that you're moving across your image; you could potentially think about having a slightly bigger mask that had some holes in it. That would be an interesting sparse pattern. And we've gone to small masks, I think, partly because they give you a nice characteristic in that they allow you to apply the same transformation everywhere. And we seem to have standardized around three by three in a lot of places. Whereas with some of the early CNNs, people were playing around with bigger masks and seeing where the sweet spot was. So I don't know whether the standardization around three by three was about performance, as in the accuracy of the model you were making, or whether it was a computational compromise in that it was a lot cheaper and didn't cost you that much in terms of accuracy, or whether actually there's a better sweet spot with a better sparse model. I don't know.
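One existing, off-the-shelf example of "a bigger mask with holes in it" is a dilated convolution: a 3x3 kernel with dilation 2 covers a 5x5 receptive field while still using only nine weights. This is not the learned sparse mask being speculated about above, just the closest standard analogue in PyTorch.

```python
# A dilated convolution as an existing example of a "mask with holes": a 3x3
# kernel with dilation=2 covers a 5x5 region of the image with only 9 weights.
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)                       # NCHW image batch

dense_3x3 = nn.Conv2d(3, 8, kernel_size=3, padding=1)
dilated_3x3 = nn.Conv2d(3, 8, kernel_size=3, padding=2, dilation=2)

print(dense_3x3(x).shape, dilated_3x3(x).shape)     # same spatial output size
print("weights per filter:", dense_3x3.weight[0, 0].numel(), "covering 3x3 vs",
      dilated_3x3.weight[0, 0].numel(), "covering 5x5")
```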
Lukas:
It does feel like there's some intuition that, like for example if you're imagining images, pixels closer to each other would be more relevant to each other
Phil:
Yeah, and I think that's certainly... you would be picking up the edges and those kinds of things that you're thinking about when actually going through an image processing process, so I think there is some logic there. In the image context, I'm not actually very sure how we might be able to use this other than, it's a new toy, and I'm sure somebody's going to go and play and find somewhere where it's interesting. I think the area where we're seeing probably the most interest is in the places where you're currently using fully connected layers, and you don't want to have to keep paying the cost of having a fully connected layer. So stacking multiple partially connected layers together looks like quite an interesting approach, and an area that we know... I mean you see this with CNNs as well, you can prune CNNs really quite heavily after you've trained them and still maintain past performance. So can we train those fully pruned CNNs, can we train these fully pruned language models from scratch in a faster, more efficient way? So can we rig the lottery and find that lottery ticket within that large, dense model by a training process, rather than doing that from scratch? And if we could do that and it's efficient, then we might be able to access an even bigger model, because one of the things that limits my ability to train a model is, do I want to spend a month waiting for it to train? Probably not, because I'm going to have to do this 50 times, knowing the ML cycles we go through. If I could do that in a tenth of the time, or even half the time, a quarter of the time, it maybe gives me access to something that's four times as big. And it might be better. And that's the other interesting thing: if you want to keep going up the curve of model size and try and drive the accuracy higher, this is something that gives us more flexibility, another lever that we can pull, another tool in the toolbox as we're exploring this space.
Lukas:
I can see how at inference time, with a sparse fully connected layer, you could do a sparse operation; that seems quite clear. But the training seems tricky, right? If you don't know a priori where the zeros are and the non-zeros, how do you figure that out? I'm asking a deep question that's hard to answer, do you think you could explain that to me at all?
Phil:
That I think is one of the unknown spaces, because people have not explored this. So DeepMind, I think, published a paper called RigL, which is Rigging the Lottery, which proposed a way to try and discover the right sparsity pattern, where you want your parameters to be. So I think it's... I mean we train these systems through an iterative search effectively, where we're learning the parameters, and the sparsity pattern is another parameter you learn. So you'll be adding parameters in, you'll be taking parameters away elsewhere. You might have information in the backwards pass about where... So one thing you have to be careful of is you probably don't want to calculate the full set of gradients for the dense equivalent space, because, well, it depends what you're targeting, but if you're targeting something that is very big, that could get very, very expensive. So how do you get the signal for where you should be adding and removing parameters? Maybe something goes to zero and you randomly add it somewhere else. Maybe you're trying to come up with some other method for adding these in. But that's one of the things we're going to find out: can we do this efficiently? I mean it might not work, you never know. But I get to go and find out.
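For readers who want the flavor of the prune-and-regrow idea described here, this is a loose, simplified paraphrase of a RigL-style mask update (drop the smallest-magnitude active weights, regrow the same number of connections where the dense gradient is largest). It is not the exact algorithm from the paper, and not Graphcore's implementation.

```python
# Loose paraphrase of a RigL-style "prune and regrow" step: periodically drop
# the smallest-magnitude active weights and regrow the same number of
# connections where the gradient magnitude is largest. Simplified sketch only.
import torch

def rigl_update(weight, grad, mask, fraction=0.1):
    """Return an updated 0/1 sparsity mask with the same number of active weights."""
    n_active = int(mask.sum())
    n_swap = int(fraction * n_active)

    # 1. Drop: among currently active weights, find the smallest magnitudes.
    active_mag = weight.abs().masked_fill(mask == 0, float("inf"))
    drop_idx = torch.topk(active_mag.flatten(), n_swap, largest=False).indices

    # 2. Grow: among currently inactive positions, find the largest gradients.
    inactive_grad = grad.abs() * (1 - mask)
    grow_idx = torch.topk(inactive_grad.flatten(), n_swap).indices

    new_mask = mask.clone().flatten()
    new_mask[drop_idx] = 0.0
    new_mask[grow_idx] = 1.0
    return new_mask.view_as(mask)

w = torch.randn(256, 256)
g = torch.randn(256, 256)                     # stand-in for a gradient estimate
mask = (torch.rand(256, 256) < 0.1).float()   # start roughly 90% sparse
mask = rigl_update(w, g, mask)
print("active fraction after update:", mask.mean().item())
```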
Lukas:
Is this the main thrust of Graphcore's point of view on the hardware? That sparsity's important or are there other...
Phil:
I think this is one of the things we're very excited about. Actually one of the really interesting things right from the start of Graphcore is that the founders, Simon Knowles and Nigel Toon, didn't set out to say, oh, we're just going to go and solve deep learning as it was described at the time. They set out to say, we want to try and build a computer architecture that's designed for machine learning as a general problem. So what are the computational characteristics of this problem, and what do we need to do to solve that? And to an extent, take a punt at where they thought it was going to go. They got a few things right, I think they probably got a few things wrong as well. But what we've built is designed as a general purpose architecture for machine learning. So it is very, very good at dense linear algebra, and we're showing in the benchmark results, which I believe will be published by the time this goes out to the world, that we're showing world leading performance with BERT, one of the very common NLP systems, and with some of the CNNs. But we're also showing that some of the classes of models that are more efficient, fundamentally, so EfficientNet, even in the name, but don't run particularly well on a TPU architecture or GPU architecture because they break up the structures that you work with, they're finer grained in the group dimension than the other standard CNN architectures are, those work really well on our architecture. We have a significantly better advantage, a greater advantage, with those kinds of architectures than we do with the standard CNNs. And that's really, we're pretty good at both of them, but everyone else is really bad at the more efficient architectures. That's the same kind of thing with these sparse models, in that fundamentally our architecture has been designed to be massively parallel and very fine-grained. So you can map these kinds of sparse problems onto it very efficiently, and other architectures kind of weren't. They were designed to be very big, bulk, block-structured. And they're trying to bolt some capabilities onto that, but it's just fundamentally, architecturally a bit more limited in its capabilities.
Lukas:
So can you explain to me why it works better on, for example, BERT. I don't think of BERT as a... I mean BERT's an embedding, right? Those are sort of, they're not really sparse, they're dense aren't they? What's going on that it's faster?
Phil:
Well BERT as a model, it has an embedding and then it's got quite a deep stack of transformer layers. So there it's just that we are very efficient at doing dense linear algebra. So we can beat the dense systems at doing dense linear algebra.
Lukas:
Wait but why? Can you explain that to me? What are you doing?
Phil:
So fundamentally, well, a) it was designed from scratch to target this kind of workload, and we store parameters and activations locally, actually within the physical chip itself. So one of the unique things about the IPU is it's a massively parallel architecture. It has about 1000 IPU cores per IPU, but each of those cores also embeds a very significant amount of memory. So we have about 900 megabytes of memory on each IPU, and then we sort of gang multiple IPUs together into a larger system.
Lukas:
So these are like registers I guess? So you have giant registers?
Phil:
It's not really registers, it's just a very fast local working scratch. So you might think about it like an L1 cache, but it's not a cache because it doesn't really cache anything, it's the memory that we work with.
Lukas:
So I feel like what you're describing though in my ignorant brain, that's sort of how I would describe what a GPU is doing. So what's the difference here? Is it a more extreme version of that or...
Phil:
Well so a GPU, its primary memory system is HBM⁴, so it's external to the chip. It's packaged in a kind of pretty package, but literally they are stacks of memory that are glued onto a silicon wafer next to the chip. So it's not in the main silicon entity, it's right next to it; you have to go five millimeters through another silicon wafer and go back up into a stack of memory. And that five millimeters means that they can only get, "only", about a terabyte a second of memory bandwidth out of their memory systems. Something like that, maybe it's one and a half in some of the A100s. Whereas we get about 50 terabytes a second in and out of our memory systems on one IPU. And actually from a power perspective we probably get about two IPUs per one of theirs, maybe a bit more. So the amount of memory bandwidth we can actually deliver is an order of magnitude, two orders of magnitude bigger in these systems. And that makes it really good at dense linear algebra because we can move data backwards and forwards. Actually dense linear algebra is a bit more limited by the core computational unit than the memory system. But a lot of our advantage comes with systems that are not quite as dense as the pure dense linear algebra systems, or the bits that go around it. So sparse systems, some of these other kinds of flavors, that's where we really, really step out. So we're better on BERT, but we're a lot better on some of these other ones. I said this to an American who didn't really understand me, when I said it was kind of like jam today and jam tomorrow. So we're really good today, and then you also get some really great things to come tomorrow, when we start to actually be able to exploit these new kinds of applications.
4: High Bandwidth Memory
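Back-of-the-envelope arithmetic with the round numbers quoted above (roughly 1 to 1.5 TB/s for GPU HBM versus the ~50 TB/s Phil claims for on-chip IPU memory). The model size is an arbitrary example; the point is just how bandwidth bounds the time to stream parameters through the compute units.

```python
# Back-of-envelope: time for one full sweep over the weights of a ~300M
# parameter model (fp16) at the round bandwidth figures quoted above.
params = 300e6                      # arbitrary example model size
bytes_per_param = 2                 # fp16
model_bytes = params * bytes_per_param

for name, bandwidth_tb_s in [("HBM (~1 TB/s)", 1.0),
                             ("A100 HBM (~1.5 TB/s)", 1.5),
                             ("IPU on-chip (~50 TB/s)", 50.0)]:
    seconds = model_bytes / (bandwidth_tb_s * 1e12)
    print(f"{name:24s} one sweep of the weights: {seconds * 1e6:8.1f} microseconds")
```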
Lukas:
And are there any trade offs to your approach, or something done differently... well I'm assuming with TPUs, Google was imagining a fairly similar workload, right? I mean this was machine learning inspired. So were there some fundamental decisions that you made differently here? Are there any trade offs where your chip might be harder to use or worse in some scenarios?
Phil:
Yes, so the interesting thing about the TPU is actually, from a genesis and idea perspective, the architectures came up at about the same time. And the TPU went very, very big from a functional unit perspective, so they said, we're going to do really big functional units, and that makes life really easy for the compiler developer. From a software perspective, it makes it a lot easier to target. But it means they really struggle with anything that's not a big, big matmul, because their only big functional unit is a very big matmul. Whereas we've got a lot more flexibility with being able to handle smaller and more fine-grained workloads. So they were inspired by the same thing, we want to target machine learning, but the observation that they took was, okay, well that means we need to be really good at big matmuls. And the observation we took was, okay, well that means we need to be good at dense linear algebra, but we also want to have all this other flexibility. So I would say if there's a downside of our architecture, it's that it makes the work of our compiler and library team quite a lot harder. So they had to work to build the library and the software ecosystem to allow us to attach directly into the frameworks and to provide the lowering from a large scale application workload, so we write in PyTorch, we write in TensorFlow, to take that and translate it into something that maps onto our massively parallel architecture. So there's not a massive downside from a user perspective, it's a bit more of a downside for our team. I think it's taken our team a little bit longer to get that stack up and running. But what we do see, quite interestingly, with this stack is we get very predictable performance across different architectures, across different frameworks, and actually between inference and training. So whereas with some architectures you might have to go through a dedicated inference back end to get great performance, for us, we just take TensorFlow, we take PyTorch and we just compile it, run from the framework, and we get absolute tip-top performance straight out of it, because we put all of the work into the front end framework, trying to make it as fast as possible.
Lukas:
I guess it's funny, there's this thing that always makes me feel like I wasted my computer science education or something. I typically use NVIDIA chips, and when I upgrade the cuDNN library, which I think is kind of similar to what you're talking about, sometimes it'll give me a 30% speed increase. I just feel like this deep mystery of "what happened?" The hardware's the same, conceptually it seems like a fairly simple problem. How could you get such a massive increase with a smarter compiler? I guess that's some of the stuff you work on, can you talk about why this kind of conceptually simple thing is so complicated to get right, and why we can continuously improve our compilers to make these things run faster? If compiler's even the right word here, the translation from a network to a hardware...
Phil:
Yeah, I mean I think compiler is the right word, and our stack is probably about three compilers stacked, or maybe more than that. So I think the challenge is that these are... If I were a computer scientist, I think I'd say these kinds of compiler transformations are an NP-hard problem, but I might be wrong. But I think that's why actually solving these kinds of systems is quite difficult. So the compilers are typically developed to be quite general; ideally you want a compiler that you can feed anything and it will give you something that works. But it won't give you something that's 100% optimal in every domain, because that's a very, very tough problem to solve. So as you find new applications and architectures, then you might put a bit of work into trying to optimize the performance of those. So sometimes what you're seeing is that the software engineers will have found or come up with a different way of laying out the data sets, or a different... Sometimes these might be fundamental architectural innovations in that they change the behavior of a system. So that, I think, is what you're observing here: the GPUs have a very different execution model, so sometimes when they're fusing and doing some kinds of transformations, that helps them in some particular areas. And I don't really know too much about the development details of those kinds of platforms, but for us, one of the things that we've observed is that we've still got quite a lot of headroom. So one of the other things that I'm excited about is that we are quite young in our development process of the libraries and the software, and I think we've got quite a lot of performance headroom. So there are some numbers that I've done on the back of the envelope, and I know how fast the chip can go. The chip can go at 250 teraflops and it can get very close to that, sustaining linear algebra. And I know that some things I put through it don't go that fast, and they probably should go faster than they're going at the moment. So that gives me quite a lot of hope, actually, that even the things that we're talking about at the moment have quite a lot of potential. The compiler we have is doing a pretty good job, but it's not doing a perfect job. And if we go and make it better, it'll give us a better set of performance. That's work, and actually some of the people that are doing this work are, I mean, exceptionally capable engineers, so it's just a case of giving them enough time and space to do some of this optimization.
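A toy illustration of the kind of transformation compilers make that lets the same hardware go faster: fusing a chain of pointwise operations so the data makes one pass through memory instead of several. This is not how cuDNN or Graphcore's compiler stack is implemented; it only sketches the idea using PyTorch's JIT.

```python
# An unfused chain of pointwise ops makes several full passes over memory;
# handing the compiler a whole graph lets it fuse them into one pass on
# supported backends. Conceptual sketch only.
import torch

def scale_shift_relu(x):
    a = x * 1.5           # pass 1: read x, write a
    b = a + 2.0           # pass 2: read a, write b
    return torch.relu(b)  # pass 3: read b, write result

# torch.jit.script captures the whole graph, so the three pointwise ops can be
# fused into a single kernel (one read of x, one write of the result) where the
# backend supports it.
fused = torch.jit.script(scale_shift_relu)

x = torch.randn(1 << 20)
assert torch.allclose(scale_shift_relu(x), fused(x))
print(fused.graph)        # inspect the captured graph the fuser works on
```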
Lukas:
So are your chips commercially available? Could I buy one and try it out?
Phil:
Yes, absolutely. And actually we have just launched, or are just about to launch, the second generation of our processors⁵. We actually launched the first generation a year ago, I believe. They've been adopted and deployed into Microsoft Azure, so we're really excited about the second generation of our product, which we announced I think a month or two ago, and which is coming very soon. The interesting thing is we've actually slightly changed the form factor that we're deploying these in. We used to build things that looked a bit like a GPU, a PCIe card. We've actually moved to a slightly more integrated form factor that has four of our IPUs in it; it looks a bit more like a 1U pizza box sort of server. And it's designed explicitly for scale. So we've moved from thinking about systems that are server-based with a host processor and a set of accelerator cards to a system that's designed to let you just rack multiple of these IPU machines together and cable them with an interconnect, with the host remote across the network. So you disaggregate the host from the IPU processing, but you also scale IPUs; we can go from one to 64 out to thousands of IPUs in a very tight integration. So yeah, we're really excited about this, and actually the performance, the scalability, all the kinds of aspects of this technology are really interesting. And we're talking about some classes of models, BERT, ResNet, we've talked about some of these CNNs. Actually, I mean, BERT's fairly big today, a couple of hundred million parameters, but it's nowhere near the really, really big models that people are working with. So one of the things that we're really interested in is being able to drive the scale of these training systems, but also do it more efficiently. So we give people the tools to train large systems, or train systems to high levels of accuracy, without needing to go all the way into that completely dense linear algebra.
5: This episode was recorded in late 2020, before the launch of Graphcore's second-generation IPUs.
Lukas:
Do you worry about some of the things that have been in the zeitgeist lately about models getting bigger and bigger? Like only the biggest companies having access to be able to train them or carbon footprint. Is that a real effect? I imagine it might actually help you, but maybe bad for society?
Phil:
So the societal impacts of access to this technology are a fascinating topic. I'm probably not the one for this, because I suspect we could spend another hour on that alone. We're really focused on trying to make this technology available to as many people as possible, and also as efficient as possible. So I think the way that we'll lower the bar for access to this kind of thing is by enabling people to run models that are more efficient, and enabling them to work with architectures that don't require a billion dollars of compute to train the model. I mean the big challenge around that is always going to be access to the data, because, being a compute person, I think about the compute, but we also to an extent have to think about the data and access to that. And really that's the bit that seems to be favoring some of the very large organizations today, is that they have the ability to pull together the training sets that most people don't have access to. So there are two sides to the access to this technology story that I think are-
Lukas:
What about energy issues? Do you think over time these kinds of chips will become a significant user of energy?
Phil:
I'm not convinced that, compared to the rest of the fleet of web-service infrastructure in the world, ML's ever going to get to the scale where it's more expensive than they are.
Lukas:
Didn't Google say that some huge fraction of their compute centers was doing inference?
Phil:
If they have, I've missed it.
Lukas:
I could be wrong.
Phil:
So that would be an interesting observation. I mean it's not going to be zero, so the question I think is how much of a percentage of that it is. And also how much of it is going to be training versus inference? I guess if they're driving their search backend via inference and if they're driving all of the back end Google Photos and YouTube and all of those kinds of things-
Lukas:
And certainly they are, right?
Phil:
Well yes, if you follow that down, maybe it is. So yeah, you could be right, the inference workload could look quite large. But again I think that's probably an area where you would be looking to deploy dedicated chips. This is why people build dedicated chips, because they're more efficient than the general purpose chips. So the whole idea of trying to do this is to make something that is more cost effective, so it costs less in terms of dollars per model trained or dollars per inference served to your customer. And part of that's the power cost, part of that's the procurement cost of these kinds of things. So I think that comes into the factor, that's why we build these special purpose architectures, or at least specialized architectures. The other comment is that with the end, or slowing down, of Moore's Law, there is a very significant plateau in the rate of improvement of the shrink and also the energy efficiency. We can no longer rely on things just getting better, where every two or three years we'll get another 50% or 2X energy efficiency, or whatever the scaling is. That's really slowing down. So the specialization of the processors is being driven by that. So we need an architecture that is more memory-efficient. If you go back to the fundamental processor, we don't move data very far. So the whole architecture is geared around data staying local for the processing, and the physics of moving data is one of the things that really drives power consumption. So there's doing the actual operations, so driving the computational units, and then there's moving data to and from your memory subsystems. So if your memory's very close, the energy cost of moving data there is a lot lower, compared with if it's off chip, where the cost tends to be a lot higher. This goes into the power consumption of the device: where are you spending your power? And so that's... One other premise of the IPU is that it's actually fundamentally more efficient, more floating point operations per watt of energy input, because we don't move data as far; we try and keep everything as local as possible for as long as possible.
Lukas:
I guess one more question on chips, just the timing. Apple recently came out with a new M1 that a lot of folks are talking about that included some ML-focused stuff. Do you have any opinion on that?
Phil:
Well it's a really interesting bit of tech, and they showed some really interesting overall performance improvements. I think this is an example of specialization going out into all of these kinds of systems. I think it's also an example of the spread of machine learning and its workloads out into all of these kinds of systems. So in the context of Graphcore and building data center scale training and inference systems, it's probably not something that is particularly relevant in terms of marketplace, but it is interesting to see... I mean we've seen this with mobile phones, with dedicated inference chips being embedded into them; I think all of the ones that I've got kicking around have one of these things in somewhere that they're using for photos and other kinds of things. So I think you'd almost expect it, because every kind of modern, consumer-facing workload has some kind of ML embedded into it, or I would guess that most of them do.
Lukas:
Well thanks so much, I mean this has been super fun. I feel like even if it wasn't being recorded, I've learned a lot, I love it. So we always end with two questions, I'd love to ask you these. So the first is pretty open ended, well they're both open ended, but the first one is also open ended. The question is what is one underrated aspect of machine learning that you think people should pay more attention to than they do?
Phil:
So machine learning is a bit of a chicken and an egg in that, because it's built around processing very large volumes of data that require quite a lot of compute, the bar to actually get to a state-of-the-art solution is quite high, just in terms of the amount of work that you have to do from a computational perspective. So any kind of data processing algorithm has to be quite efficient and be able to run at teraflops, tens of teraflops, to be able to chew through that. So either something that's much more data efficient in the way it learns, or new computational architectures that give us efficiency on new classes of models, I think those are things that might be really interesting.
Lukas:
I have to say it's funny, we've had a bunch of computational chemists talk to us on this show and also in customer interviews, and they're all talking about graph based networks. It seems like that might be an area where there's a lot of interest.
Phil:
So one of the ones that we've been working on, and I'm not sure when we're going to be able to publish it, is actually a graph based neural network using the spectral library in TensorFlow, and it's a very small example. It's not anything fancy or ground breaking, it's just an example, I think, of doing molecular binding prediction using that kind of approach.
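For context, the propagation rule behind a basic graph convolutional layer is H' = relu(Â H W), where Â is the normalized adjacency matrix of the molecular graph (atoms as nodes, bonds as edges). The sketch below is the textbook GCN update written out in NumPy; it is not the Graphcore example Phil mentions.

```python
# Textbook graph-convolution step on a toy molecular graph:
# H_next = relu(A_hat @ H @ W), with symmetric normalization of the adjacency.
import numpy as np

rng = np.random.default_rng(0)
n_atoms, in_feats, out_feats = 9, 16, 32      # e.g. a small molecule

A = rng.random((n_atoms, n_atoms)) < 0.2      # toy adjacency (bonds)
A = np.logical_or(A, A.T).astype(np.float32)  # make it symmetric
A += np.eye(n_atoms, dtype=np.float32)        # add self-loops

deg = A.sum(axis=1)
A_hat = A / np.sqrt(np.outer(deg, deg))       # D^-1/2 A D^-1/2

H = rng.standard_normal((n_atoms, in_feats)).astype(np.float32)   # atom features
W = rng.standard_normal((in_feats, out_feats)).astype(np.float32) # learned weights

H_next = np.maximum(A_hat @ H @ W, 0.0)       # one message-passing step
print(H_next.shape)                           # (9, 32): new per-atom features
# A binding-prediction head would pool these per-atom features into a single
# molecule embedding and feed it to a small classifier or regressor.
```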
Lukas:
Cool, the final question we always ask is what's the biggest challenge of making machine learning models work in the real world, but I'm kind of tempted to modify it for you. I'm wondering, what's the biggest challenge of taking a new piece of hardware to market? It seems like there must be challenges everywhere, but where are the surprising challenges?
Phil:
So I would like to answer the first one as well, because one of the things that we've done quite a lot of... so we've talked a lot about performance. How fast does it go? And actually performance is a beautifully simple thing, because it's very easy to measure. What's the images per second? What's the sequences per second? How fast does it go? But the other bit of that is, actually, you don't just care about how fast it goes, you care about it giving you the right answer as well. So you care about your systems converging. One of the things that we've been really interested in exploring, and actually part of the reason that we're working with Weights & Biases, is building these kinds of very large convergent systems, leveraging that, and doing all of those kinds of experiments. So finding the right kind of batch size that gives you the optimal performance whilst not impacting your convergence scheme, that's one thing that we've been working on. We had quite a lot of fun, I think, with the numerical behavior in some of these systems, particularly... so we talk about low precision: good, goes much faster. Also dangerous, because you need to manage the precision a little bit more carefully than you might do in some other kinds of systems. So building a system that gives you great performance and also gives you the right answer, I think that's one of the things we've found interesting as we bring these systems up. In particular, I would say that with the first generations of our systems we had some really interesting convergence schemes running very, very low batch sizes, showing actually extremely rapid convergence, even on some big models. And they were really good, but the one thing that we observed, looking at our large scale systems today, is that they wouldn't scale. They wouldn't have enough batch size to be able to scale to very large systems, and we're actually reworking some of the systems we work with to support much larger batch sizes. So looking at optimizers, we had been using SGD or SGD-M, SGD with momentum, quite a lot. We're looking at LAMB, the very large batch optimizers that have been used by Google and NVIDIA as well for their large scale systems. So yeah, that's certainly been something that's been a whole bunch of fun, and I would say has been very challenging. I mean the number of hours of compute time that we have been spending developing these kinds of systems, and to a certain extent finding the bugs in the models sometimes, where, oh, we've got the layers wrong or there's something that's just not quite laid out correctly and that's impacting the convergence of these systems, so we need to go and find that. So there are those kinds of things. In terms of actually bringing the new hardware to market, that has been a tremendous journey. It goes all the way from a completely new architecture, with massive amounts of memory on chip: how do you, at the fundamental silicon level, test that system and make sure that your processor actually works? So that was an interesting problem that some of our team had to tackle, and we very successfully worked through how you take one of those systems and integrate it together into a cluster of 16 IPUs, a cluster of 64 IPUs, a cluster of 1000 IPUs. How do you make that kind of system work at that kind of scale? How do you take all of the various applications and map them down to the frameworks? How do you support multiple different frameworks efficiently? There's been lots of fun across all of these spaces.
So one of the things that I would observe is that building these very large scale training systems is one of the big challenges, it's one of those really big... It's a bit like building the old supercomputers, the grand challenge problems of our time essentially. So it's quite interesting to go and try and do that from scratch with a completely new set of architectures, and actually one of the fantastic things about Graphcore is how quickly we can move through some of these processes. There have been a lot of challenges through that phase; I would say we've met most of them with great success, which is quite nice. We're at the point where we can now bring this all to the world, which is very exciting.
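On the point above about low precision being fast but needing careful management: the standard trick is loss scaling, where you scale the loss up before the backward pass so small gradients don't underflow in half precision, then scale the gradients back down before the optimizer step. The sketch below writes the generic technique out by hand; real stacks automate the scale selection, and Graphcore's tooling presumably has its own mechanism.

```python
# Minimal hand-written loss scaling: scale the loss up before backward so small
# gradients don't underflow in fp16, then unscale the gradients before the step.
# Fixed scale for illustration; dynamic schemes adjust it and skip bad steps.
import torch
import torch.nn as nn

model = nn.Linear(64, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_scale = 1024.0                          # illustrative fixed scale

x, y = torch.randn(32, 64), torch.randn(32, 1)

opt.zero_grad()
loss = nn.functional.mse_loss(model(x), y)
(loss * loss_scale).backward()               # scaled backward pass

with torch.no_grad():
    for p in model.parameters():
        p.grad /= loss_scale                 # unscale before the update
        # dynamic schemes also check for inf/nan here and skip the step

opt.step()
print(float(loss))
```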
Lukas:
That's so exciting. It seems like such a fun job and congratulations on the latest benchmark, we'll definitely put a link to that in the show notes.
Phil:
Yes, thanks for having me, I mean it's been a lot of work from quite a large team of people. And actually very little from me, so the hardware and the software team at Graphcore have been beavering away for a long period of time and they've all done a really fantastic job.
Lukas:
Awesome, thanks for your time.
Phil:
Excellent, thanks very much Lukas.
Lukas:
Thanks for listening to another episode of Gradient Dissent. Doing these interviews is a lot of fun, and it's especially fun for me when I can actually hear from the people that are listening to the episodes. So if you wouldn't mind leaving a comment and telling me what you think, or starting a conversation, that would make me inspired to do more of these episodes. And also if you wouldn't mind liking and subscribing, I'd appreciate that a lot.