
Jonathan Frankle: Neural Network Pruning and Training

Jonathan Frankle and Lukas Biewald discuss neural network pruning and training, the "Lottery Ticket Hypothesis" and much more on this episode of Gradient Dissent.


About this Episode

Jonathan Frankle, Chief Scientist at MosaicML and Assistant Professor of Computer Science at Harvard University, joins Lukas Biewald on this episode of Gradient Dissent. With comprehensive infrastructure and software tools, MosaicML aims to help businesses train complex machine-learning models using their own proprietary data.
In this episode of Gradient Dissent they discuss:
  • Details of Jonathan’s Ph.D. dissertation which explores his “Lottery Ticket Hypothesis.”
  • The role of neural network pruning and how it impacts the performance of ML models.
  • Why transformers will be the go-to way to train NLP models for the foreseeable future.
  • Why the process of speeding up neural net learning is both scientific and artisanal.
  • What MosaicML does, and how it approaches working with clients.
  • The challenges for developing AGI.
  • Details around ML training policy and ethics.
  • Why data brings the magic to customized ML models.
  • The many use cases for companies looking to build customized AI models.
  • And much more.

Connect with Jonathan

Listen



Transcript (via Trint)

Jonathan Frankle: [00:00:02] I find personally a lot of impact in being downstream of these problems. If I'm going to make messes, I have to clean them up. So in some sense, my policy work is an attempt to make sure that, you know, as I'm on the bleeding edge of creating this technology, I'm also providing that same insight to policymakers so they can adapt to this as quickly as possible.
Lukas Biewald: [00:00:17] You're listening to Gradient Dissent, a show about machine learning in the real world. And I'm your host, Lukas Biewald. Jonathan Frankle is Chief Scientist at MosaicML and soon to be a Harvard professor. He wrote an exceptional paper, the lottery ticket hypothesis, about how neural networks learn and how you can prune them. He's also taught policy at Georgetown University Law Center. This is a super interesting conversation, and I hope you enjoy it as much as I did. All right. Why don't we start by hearing about your sort of journey to what you're doing now? I think you've had kind of an interesting background and career, so that's probably a great place to start.
Jonathan Frankle: [00:00:53] Yeah, it's been a winding road. If you go and look at my CV, you'll be a little bit confused, I think, about some of the things that have happened and how they came to be. But you know, the high level is, you know, I'm a computer scientist, trained from the beginning, you know, from undergraduate all the way to the present. I'm actually defending my dissertation this Friday. So, you know, I can't quite say I have three degrees yet, but, you know, very, very close to it. But it's a bit of an odd trajectory. As an undergraduate, I did some research on programming language theory, which is what I got my master's degree in, and then I went and spent a year teaching at a law school and doing technology policy work in D.C. Then I came back to MIT and wrote a paper on cryptography and somehow stumbled my way into machine learning somewhere along the way, having never taken a class on the topic prior to grad school. And I think from that journey there are really two big takeaways. I think if you want to understand me better and kind of get to know how I think about the world, one is that as you start connecting the dots on the topics that I've worked on and what I've been good at, they're all the messy, hard to measure problems. You know, I don't like to work on clean things where you get a nice proof and call it a day. I love working on the messy things that intrinsically don't have an answer: security and privacy, law and policy, and now deep learning, where there's no nice, neat proof that's going to wrap everything up and tell us all the answers. It's going to be intrinsically messy. We're dealing with complex problems in real world data, and what we do at Mosaic is to try to exploit that messiness and find, you know, find a path through it in order to deliver more efficiency to people. So, you know, if you really want to connect the dots, that's really how you put the pieces together, as far as I understand it.
Lukas Biewald: [00:02:24] Awesome. Well, you know, I'd love to start with — this is probably bad from a podcast marketing perspective, but, you know, I want to start with the thing that I'm kind of most interested to ask you about, which is, you know, you did a really well-known paper on pruning neural networks, the lottery ticket hypothesis, I believe — or what was the title of it?
Jonathan Frankle: [00:02:44] That's the one that has notoriety at this point.
Lukas Biewald: [00:02:47] And, you know, I guess before we get into it, if you could kind of describe the sort of thesis or the key results of the paper, and then I have a few questions for you.
Jonathan Frankle: [00:02:55] Yeah. Speaking of describing a thesis, in my other tab right now I've got Overleaf open with that thesis. But the really simple statement — not the 200-page version, because I'm sure nobody wants to hear that; if you want to, it'll be on arXiv — the really simple version is that the neural networks we train, you know, ask yourself why we train them in the particular way that we do, at that learning rate, with this recipe, with this optimizer, with this kind of normalization. The answer is usually, well, because someone else did it and they got it to work. You know, when it comes to a ResNet, Kaiming did it that way; when it comes to a transformer, you know, whoever did it first, that's why they did it that way. And so that's the reason why we train it that way. And in many ways, the story of my career in machine learning is questioning those choices. And in the lottery ticket work, I questioned one very specific choice: why do we use all these weights? These networks are really big. We know they're so-called overparameterized. But why? And I read at that time in my career all these papers on neural network pruning, this topic where you train the network and then delete, you know, connections that seem unnecessary, and you end up with a much smaller network that, you know, as far as we can tell, performs about the same as the original network we started with. Why did we need those weights to begin with, then? Is there something intrinsically harder about learning than there is about kind of representing what you've learned? Like, you know, is it easier to kind of know the rules of calculus than it is to, like, learn and process them for the first time? Maybe our networks have to be big early on and can get small later as they get kind of smarter and have more compact representations. That was what one of my professors at MIT told me when I asked him, why can't we train smaller networks? And the lottery ticket ideas are one way that I found to make it possible to train smaller networks. And the trick is that any weight you were going to delete at the end of training, you never really needed it. You could have gotten rid of it, you know, at the beginning or nearly at the beginning. But with one catch: when we create a neural network, we kind of set each connection to a random value at the beginning. We, you know, we have to initialize it to something; we don't know what yet. And the whole point of optimization is to get those weights to good values. But it turns out those random values aren't so random. Or rather, you know, the specific sample we get from that random distribution is really important, and each weight in this smaller, sparse, pruned network needs to be set to the right value for it to be able to learn. And what I found is that the values those weights are randomly assigned to are actually really important for making those particular weights important. This sub network won the lottery: it happened to get a set of weights that allowed it to train well. If you sample a new set of initializations for it, it does really badly. And this initialization sensitivity is something that we don't typically see when we train traditional neural networks that aren't pruned. Ironically, it's something we now see all the time with these foundation models. Like, the whole point of a foundation model is a good initialization. So I think these ideas have come back around again in a way.
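To make the experiment Jonathan describes concrete, here is a minimal sketch of one lottery-ticket round: train, prune the smallest-magnitude weights, rewind the survivors to their original initialization, and retrain. It assumes a PyTorch-style model and a hypothetical `train_fn` training loop; it is an illustration of the idea, not the paper's exact code.

```python
import copy
import torch

def lottery_ticket_round(model, train_fn, prune_fraction=0.8):
    """One round of the lottery ticket experiment: train, prune by magnitude,
    rewind the surviving weights to their values at initialization, retrain."""
    # Save the random initialization before any training happens.
    initial_state = copy.deepcopy(model.state_dict())

    # Train the dense network to completion (train_fn stands in for
    # whatever training loop you already have).
    train_fn(model)

    # Build a mask that keeps the largest-magnitude weights in each layer.
    masks = {}
    for name, param in model.named_parameters():
        if param.dim() < 2:          # skip biases and norm parameters
            continue
        k = max(1, int(param.numel() * prune_fraction))
        threshold = param.abs().flatten().kthvalue(k).values
        masks[name] = (param.abs() > threshold).float()

    # Rewind: reset every weight to its initial value, then zero out
    # the pruned connections.
    model.load_state_dict(initial_state)
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in masks:
                param.mul_(masks[name])

    # Retrain the sparse, rewound sub-network. A full implementation would
    # re-apply the masks after every optimizer step so pruned weights stay zero.
    train_fn(model)
    return model, masks, initial_state
```

The paper's iterative version repeats this loop, pruning a bit more of the surviving weights each round before rewinding again.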
Lukas Biewald: [00:05:33] So are you saying that really the point of having many more weights than you need is just that some of them randomly get assigned good initialization values? Is that what you're saying?
Jonathan Frankle: [00:05:45] It's a possibility. And the only reason I'll say it's a possibility is because I'm an empiricist. If I'm going to make a claim, I need to have an experiment to evaluate it and try to, you know, falsify it. And it's hard to figure out how to falsify that claim. If I wanted to do that, I'd really have to try every possible sub network to see how many lucky ones there are and whether, you know, there are other sub networks that got lucky that just didn't happen to emerge, or whether this one was kind of the one and only. That's my conjecture, but testing it is very difficult: that perhaps by a certain point in training, you know, the network is optimizing in a lower dimensional subspace, such that some of the weights just become unnecessary. And so learning actually can take place pretty successfully without those extraneous weights, or at least whatever subspace it's in could be axis-aligned in such a way that you could prune a bunch of weights from the network. Now, you know, again, testing that is exceedingly difficult, and if you had a way to do that, please let me know — I'd love to have another dissertation chapter. But I think that is the high level conjecture. In the original paper we called it the lottery ticket hypothesis; there's a hypothesis and a conjecture, and that is the statement of the conjecture.
Lukas Biewald: [00:06:41] And I guess one way to check it is just to go back to what they were originally set to and see that it has the same quality of performance. Right.
Jonathan Frankle: [00:06:51] So when you go back to what they were originally set to and it has the same quality of performance, that at least indicates that that sub network is sufficient. But it doesn't necessarily mean that the sub network is actually important to training when the whole network is there. It could be that we've kind of found our way into some sufficient sub network that was actually doing nothing in the context of the whole network. So being able to say what that sub network is doing as part of a whole is a little bit more difficult. It's entirely possible that the optimization picture looks completely different when you have the dense network that isn't pruned and the sparse network that is pruned. And we do know that that optimization behavior is pretty different.
Lukas Biewald: [00:07:23] So you can't just reset the weights to what they were when you started training and then remove all the other weights and get the same performance, then, I guess? Oh.
Jonathan Frankle: [00:07:33] You definitely can do that. Yeah, that's kind of the crux of the lottery ticket experiment: you know, removing all the weights except those that you kept from pruning and then just setting them back to their initial positions and training them again. That does work quite well. But the question is what purpose was that sub network serving within the context of the dense network? That sub network is good — it's able to learn on its own — but that doesn't necessarily mean it was useful in any way for the dense network. It's entirely possible that there are two completely different dynamics going on when you have the whole network versus a sub network. And, you know, I can't say for certain; that gets into tricky empirical scientific questions that we don't really have an experiment for right now.
Lukas Biewald: [00:08:09] But you observed that the performance of the sub network is similar to the entire network, right? The dense network.
Jonathan Frankle: [00:08:18] Definitely. Definitely.
Lukas Biewald: [00:08:19] I'm not quite sure what you're saying there. The sub network seems like it sort of is responsible for all of the performance then, of the entire network. Right?
Jonathan Frankle: [00:08:28] I think it's a necessity versus sufficiency distinction. The sub network is certainly sufficient to get good performance, but it's unclear whether it's actually necessary. And, you know, one way to actually test this is to take that sub network — I'll pose this to you, I love these thought experiments — take that sub network and, instead of keeping only the sub network, actually delete only the sub network and keep all the other weights. So if you've got a sub network that's like 1/10 of the size of your original network and you've just wiped it out, what do you think is going to happen to that dense network when you try to train it, except missing that hole in the middle where the sub network should be? Is it going to do really badly?
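Continuing the hedged sketch from above, the inverse thought experiment is just a matter of flipping the mask: delete the winning ticket, keep every other weight, and retrain. `model`, `initial_state`, `masks`, and `train_fn` are the hypothetical objects from the earlier sketch.

```python
import torch

def delete_ticket_and_retrain(model, initial_state, masks, train_fn):
    """The inverse experiment: remove the 'winning ticket' weights, keep the
    rest of the network, rewind to initialization, and retrain. This probes
    whether the ticket was necessary, not just sufficient."""
    inverse_masks = {name: 1.0 - mask for name, mask in masks.items()}

    model.load_state_dict(initial_state)
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in inverse_masks:
                param.mul_(inverse_masks[name])

    # If the rest of the network simply "leans on" a different subset of
    # weights, accuracy should recover; if the ticket was truly necessary,
    # it won't.
    train_fn(model)
    return model
```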
Lukas Biewald: [00:08:58] I mean, I guess it's an empirical question, but I would sort of imagine that it would find another 10% to lean on. Is that right?
Jonathan Frankle: [00:09:05] That's exactly right. And so, you know, the claim that it's necessarily leaning on that 10% is something that we can conjecture about, but it's something we can't say for certain because we don't have evidence to back it up, because if we were to delete that, it'll lean on a different 10%. If the leaning is even happening, we could make that claim, but we need some hard evidence to show that it's even leaning on that 10% to begin with. That 10% happened to have the highest magnitude weights at the end of training, but even the magnitude of a weight doesn't necessarily confer importance. It's hard to say which weights are actually important and which weights aren't for the function, and using magnitude as a heuristic is a very bad one. At least, you know, there are all sorts of fancier ones in the literature; they don't tend to work that much better. But, you know, people would argue that magnitude is a very naive thing to use.
Lukas Biewald: [00:09:47] I guess that makes sense. Do you think that, like, high rates of dropout cause more of the network to get used? I would sort of imagine that if there's a lot of dropout happening, it might force the network to use more of the weights available, at least to have, like, a redundant mechanism, maybe.
Jonathan Frankle: [00:10:04] So there are a couple of complications there. One is that dropout typically works on the neurons rather than the weights, and so it may end up having a very different effect potentially. And there does tend to be a huge difference between pruning weights and pruning neurons in terms of how well you do and how much of the network you can prune. Networks seem to like having extra representational capacity, that is, extra neurons, but each neuron seems to not use that many different inputs — hence why you can prune individual weights much more easily than pruning entire neurons, even if you're pruning, in effect, the same number of weights. The other piece here is that, you know, we have this intuition for what dropout might be doing. We don't necessarily have evidence to back up that that's what's happening. The original dropout paper makes all sorts of, you know, claims that I would consider pretty outlandish and unsupported by any empirical evidence, and I like to only say what I have evidence to show. So it's hard to say that there's, again, necessarily a relationship there unless we can come up with an experiment to test whether dropout is somehow, you know, making the network more robust to pruning or something along those lines.
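To make the weight-versus-neuron distinction concrete, here is a hedged sketch of the two pruning styles for a single linear layer: unstructured pruning of individual weights versus structured pruning of whole output neurons. The function names are illustrative, and a real pipeline would keep and re-apply masks rather than just zeroing values once.

```python
import torch
import torch.nn as nn

def unstructured_prune(linear: nn.Linear, sparsity: float) -> None:
    """Zero out the smallest-magnitude individual weights (the style used in
    the lottery ticket work). The layer keeps all its neurons but loses
    individual connections."""
    w = linear.weight.data
    k = max(1, int(w.numel() * sparsity))
    threshold = w.abs().flatten().kthvalue(k).values
    w.mul_((w.abs() > threshold).float())

def structured_prune(linear: nn.Linear, sparsity: float) -> None:
    """Zero out entire output neurons (rows of the weight matrix), the closer
    analogue of dropping units the way dropout does. This typically hurts
    accuracy at much lower sparsity than unstructured pruning."""
    w = linear.weight.data
    row_norms = w.norm(dim=1)                    # one score per output neuron
    n_prune = int(w.size(0) * sparsity)
    _, prune_idx = row_norms.topk(n_prune, largest=False)
    w[prune_idx, :] = 0.0                        # remove whole neurons
    if linear.bias is not None:
        linear.bias.data[prune_idx] = 0.0
```

The lottery ticket experiments and most of the compression ratios discussed here use the unstructured variant; structured pruning is what maps most directly onto removing units the way dropout drops them.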
Lukas Biewald: [00:11:01] Well, that does seem like an empirical question, right? Oh yeah, on the pruning: is there sort of an inflection point in the pruning? Like, do you have a sense of, like, hey, you can prune up to X percent before there are problems, that sort of generally holds across networks or across ranges of data or anything like that?
Jonathan Frankle: [00:11:18] Not that holds in general, unfortunately. And one nice way to test this is actually, even for the same network in the same training regime, you can play with the difficulty of the data in ways that make it harder to prune or easier to prune. So, you know, as an example, training a network on just a standard image task — you can prune some percent of the network. And, you know, let's say for a ResNet-20 on CIFAR-10, something that anyone listening to this can probably train in a few minutes on a few GPUs at this point, it's about 90% of the network that can be pruned, or 85%, somewhere in there, before accuracy starts to, you know, drop off. If you prune all the way, obviously things don't go very well, and there's some inflection point there. If you were to try doing this on an adversarially robust training task, which demands more of the network and is a little bit more capacity intensive, one would imagine you aren't able to prune as much typically before accuracy starts to drop off. You know, anything — the task, the way that you optimize the network, the final performance — they can all affect your ability to prune and how much you can prune. I wish there were a nice general rule of thumb. The answer is usually somewhere between two x and ten x, you know, compression via pruning, although, you know, in some crazy cases, if people set it up right, you can prune 100x, or prune down to 100x smaller. Usually people set that up to make their pruning methods look really good in the literature, even if, you know, at the end of the day, those are toy examples that are just meant to get gaudy looking numbers as opposed to, you know, really being scientific.
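A sketch of how you might probe that inflection point empirically: sweep sparsity levels for one architecture and data set and watch where accuracy falls off. `train_and_evaluate` is a hypothetical helper standing in for a full prune-and-retrain run.

```python
# Hypothetical sweep over sparsity levels to locate the inflection point
# Jonathan describes: accuracy stays roughly flat, then falls off a cliff.
sparsities = [0.0, 0.5, 0.7, 0.8, 0.85, 0.9, 0.95, 0.98]

results = {}
for sparsity in sparsities:
    # train_and_evaluate() is a stand-in for: train, prune to `sparsity`,
    # rewind or fine-tune, and report test accuracy.
    accuracy = train_and_evaluate(arch="resnet20", dataset="cifar10",
                                  sparsity=sparsity)
    results[sparsity] = accuracy
    print(f"sparsity={sparsity:.2f}  accuracy={accuracy:.2%}")
```

Where the curve bends is the inflection point in question; as Jonathan says, it moves with the task, the optimizer, and the target accuracy.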
Lukas Biewald: [00:12:31] Is it even consistent across training runs? Is the pruning performance, say, the same on the same data set with, say, different random initializations?
Jonathan Frankle: [00:12:39] It does tend to be pretty consistent across random initializations and random seeds. But then again, we've chosen — you know, in some sense we've evolved — the way that we train these networks to be consistent across initializations and seeds. You know, we've spent 20 or 30 years trying to do exactly that. And, you know, to the point I mentioned earlier about how these sparse networks are very picky about their initialization, I imagine that if we had made it the goal 30 years ago to have networks whose sub networks aren't picky about initialization, we might have a completely different architecture and completely different optimizers. So, you know, we have to remember that, you know, 30 years of grad student descent has landed us on these particular networks with good properties that, you know, in this case we're exploiting.
Lukas Biewald: [00:13:14] And so I guess, you know, there's tons of different properties of, you know, a network that you can examine. Like, is there a practical application of pruning that gets you excited about it? Or, like, what even caused you to look into this?
Jonathan Frankle: [00:13:28] I like it honestly for the scientific application. I'm really excited about the idea that we can understand how these neural networks learn more effectively. You know, right now, or at least when I started doing my research back in 2018, one thing that really struck me was just how utterly unscientific the literature is. It's just littered with all these claims about flat minima, about the noise of stochastic gradient descent, about what dropout does, about internal covariate shift — most famously with batch norm — just, you know, a term that they completely made up in the paper and never actually tested to see whether the effect was real before they proposed their, you know, supposed remedy. That was just how the science was back at that time. And I feel like I sound old when I say this: the science has gotten a lot better, and now those sorts of claims don't generally get into the literature without some evidence supporting them. And I like to hope that, you know, the lottery ticket work was part of that trend — that, you know, we do want to get a better scientific, empirical understanding, and it's not enough just to, you know, say things and not try to support them with facts, the way that a lot of the older so-called great papers, you know, from around 2014 to 2017 do. But I mean, the other piece was obviously, I was very jealous of the labs at MIT that had GPUs. My lab did not. And I thought, you know, that's not fair. Can we make this more efficient? Can't we get rid of some of those weights? Won't that reduce the cost of training? Unfortunately, you know, doing unstructured, sparse pruning — it's generally very difficult to accelerate that because, you know, it's an irregular pattern and the hardware is not designed for it. There are certain specialized chips that can do unstructured sparsity pretty well, but, you know, they're not widely accessible, and the sparsity isn't generally applicable right now. For those listening who are working on those chips, feel free to let me know if I'm wrong, but that's certainly been my experience so far. So, you know, I would say this was a bit of a swing and a miss on that front. It was certainly effective on the scientific front — we've got all sorts of cool ideas that have come out of the lottery ticket work. But I think for me, Mosaic is really kind of, you know, the second attempt at that. In some sense, it's an attempt to ask the same question: you know, how do these networks really learn empirically, and is everything we're doing necessary? Are our recipes actually good, or are there better ways to train them out there? This time without the sparsity, which is hard to take advantage of, and instead with an eye toward anything that will actually speed things up and actually produce cost savings, you know, immediately today on real hardware.
Lukas Biewald: [00:15:39] And so I guess, you know, it's funny, I was thinking maybe this is a good segue into Mosaic. But, like, when I think about, you know, Transformers and attention, that is another case, like dropout, where we have these sort of evocative words, like attention, where one wonders, you know, how real the sort of hand-waving explanations are. But we still, I think, you know, kind of generally use them. I'm curious if you have thoughts on how much Transformers are sort of a product of somebody doing something that kind of works well and everyone sort of copying the details of it, or, you know, some kind of fundamental insight. Like, do you think if you ran back history 100 times, you would get transformers? Like, what parts of transformers do you think you would get in every case, and what's just sort of the one-off of the sort of random path that we took to this architecture?
Jonathan Frankle: [00:16:26] That's a great question, and it's hard to say. I do think there is something pretty fundamental to what we call self attention. I don't know what it is that's so fundamental to it — that's very tough to say. It does seem to work quite well, and we've had plenty of attempts to replace it that have, you know, had varying degrees of success, but still nothing has supplanted it. And given that the attention is so effective and also so cheap relative to the massive feedforward layers we use in, you know, our giant models — the really, you know, 10 billion plus parameter models today — there's no reason not to use it if it's effective. It's not really asking to be replaced in some sense, the way batch norm is asking to be replaced. If anyone has a batch norm replacement, please let me know. I want to get rid of it very badly. I think it's such a simple architecture, I think we would have arrived there eventually. Like, at the end of the day, the self attention is really the most powerful new component; otherwise it's just a feedforward network, and the self attention was already kind of bouncing around the literature in various ways, and the folks who wrote the paper really put the pieces together exceedingly nicely. You know, these good inductive biases are hard to come by. I have a bet going with Sasha Rush right now that Transformers will still be the go-to way to train NLP models in another five years or so. And that bet is placed because convolutions have lasted a really long time in vision. The vision transformer is still something that I almost never get asked for at Mosaic — it's an academic curiosity by and large; the ResNet is the real workhorse still. So, you know, convolutional networks have stood the test of time, recurrent networks stood the test of time. I think Transformers will as well. These little inductive bias insights are pretty hard to come by, but when you take a step back, they're relatively simple tricks at the end of the day.
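For readers who want to see just how simple the "relatively simple trick" is, here is a minimal single-head scaled dot-product self-attention in PyTorch — no masking, no multiple heads, no output projection. It's a sketch of the mechanism, not the full block from the original paper.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Minimal single-head self-attention.

    x:             (seq_len, d_model) token representations
    w_q, w_k, w_v: (d_model, d_head) projection matrices
    Each token mixes information from every other token, weighted by
    query-key similarity.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / (k.size(-1) ** 0.5)  # (seq, seq)
    weights = F.softmax(scores, dim=-1)                      # attention matrix
    return weights @ v                                       # (seq, d_head)
```

Everything else in a transformer block (multi-head splitting, the feedforward layer, normalization, residual connections) wraps around this core.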
Lukas Biewald: [00:18:03] And I guess, what then matters for speeding up training at this point, and kind of what are the things you're working on at Mosaic around that?
Jonathan Frankle: [00:18:13] Everything matters. I wish I could tell you that when we get a seven x speed up on a model like we did with ResNet-50 on ImageNet, or a three or four x speed up on BERT pre-training — which we'll announce; probably by the time this podcast is out, it will be announced — or, you know, the speed ups we'll have coming on GPT-3 that I don't know yet but I'm sure will be out by the time this podcast is out — I wish I could tell you, here's the one neat trick you need to do to get that speed up. The answer is, the ResNet-50 recipe, that seven x, is 17 different interventions into training, affecting everything from data augmentation, the inductive bias of the network, the optimizer, regularization, the loss function — it's basically anything and everything that there is in the network, even shaping how things go over the course of training. I wish it were one thing, but, you know, as with all good systems work, it's 5% here or 5% there, and once you stack enough of that up, you get to something really impressive sounding like seven x.
Lukas Biewald: [00:19:04] But I guess the challenge, you know, with like maybe all neural net research, right, is like each experiment is kind of expensive and these things don't typically, in my experience, sort of add up linearly, like how do you even kind of know what's contributing to your speed ups?
Jonathan Frankle: [00:19:21] This is why we have a research team. This is the hard part of our jobs: trying to piece together what may work together with what, you know. People often ask me if the secret sauce is some of the speed up methods, and the answer is no. You know, we put that out open source for free. The secret sauce of Mosaic is the research team that has developed the methodologies and the intuitions and the ways of thinking about the problem. It's an emerging science, this kind of science of composition. We don't necessarily have a good recipe. I wish there were some automated system that would do it for us so I could tell the researchers to go to the beach. But a lot of it really comes down to some principles we're developing. Like: in the early part of training, nothing that important or interesting tends to get learned, so, you know, we can generally get away with, say, decreasing the resolution of images or truncating sentences or playing with the masking ratio for a BERT or something like that. Principles like that. You really only have a certain budget of regularization for a given training run length, and so you need to use that wisely on things that won't slow down training — there are regularization methods that have, you know, no effect on speed and some that are actually pretty meaningful slowdowns, and you have to choose wisely on that front. Some ideas around balancing which parts of training you speed up: you know, if you make backprop faster, you've got to make forward faster as well, otherwise you start hitting diminishing returns on anything else that makes backprop faster. But there is a lot of art to this as well. How do you get it such that it's good enough for this model, but not so overly specific to one data set that it won't work if somebody comes in and has a new data set they want to try out with this model? There's some kind of balance between, you know, how specific and fast the recipe is and how general and perhaps slower the recipe is. And again, these are all kind of, you know, in some sense subjective tradeoffs that we have to make. It is a little artisanal at this point.
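As one concrete illustration of the "early training matters less" principle, here is a hedged sketch of ramping image resolution over a training run — a generic progressive-resizing schedule, not Mosaic's actual recipe. The helper names and the linear ramp are illustrative assumptions.

```python
import torch.nn.functional as F

def resolution_for_step(step: int, total_steps: int,
                        start_res: int = 160, end_res: int = 224) -> int:
    """Linearly ramp image resolution over the first half of training,
    then train at full resolution for the rest."""
    frac = min(step / (0.5 * total_steps), 1.0)
    res = int(start_res + frac * (end_res - start_res))
    return (res // 32) * 32        # keep resolution a multiple of 32

def resize_batch(images, step, total_steps):
    """Downsample a batch of (N, C, H, W) images to the scheduled resolution."""
    res = resolution_for_step(step, total_steps)
    return F.interpolate(images, size=(res, res), mode="bilinear",
                         align_corners=False)
```

The same shape of trick — spend less compute per example early and more late — is what truncating sentences or adjusting a BERT masking ratio early in training does on the language side.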
Lukas Biewald: [00:20:57] Do you think it'll stay artisanal? Does that seem likely?
Jonathan Frankle: [00:21:00] Yeah, I think it'll stay pretty artisanal. Um, in some ways that's good for business — like, if it's not artisanal enough, my research team needs to find other stuff to work on. But it is artisanal insofar as every model is different, the way that we train it is different, and each of these interventions is different and has a weird effect. And I think of lottery ticket as being just one of these interventions among dozens that we've tried at Mosaic. And so, the way that I've spent five years getting to know lottery ticket and all the ins and outs of how sparse networks behave, we kind of have to get to know each of these interventions and how it behaves and what effects it has. And that's, you know, a long journey. And then, seeing that applied to a new model, sometimes our ideas translate over and sometimes they don't, and understanding why or why not — you know, that's new knowledge we can use to build on and try to understand these methods better. But they do almost feel like friends in a lot of ways. They, you know, they have complicated personalities, and understanding how they work together is tricky, frustrating at times, and makes our researchers want to pull their hair out. But, you know, I think there is some intuition and some high level rules of thumb that we start to use, but I don't think we'll be automating this anytime soon. It's like neural architecture search or hyperparameter search, but even more difficult, because now we're adding different changes to training whose effects are difficult to predict until you've really trained the model to the end.
Lukas Biewald: [00:22:15] And so I guess before we go too far down this path, we should probably talk about MosaicML. I mean, what's the story behind it? What do you guys offer to the world?
Jonathan Frankle: [00:22:24] I'll try out a new way of describing it — you can let me know if this is any good. I kind of like to try these things.
Lukas Biewald: [00:22:30] I love it. So.
Jonathan Frankle: [00:22:31] You know, in the hardware world, we have these foundries. We have a company like TSMC, Taiwan Semiconductor Manufacturing Company. They're, you know, all sorts of geopolitically interesting right now because they are the best place to get your semiconductors made. They have the most advanced process technology, the smallest transistors, which means, you know, the best power efficiency and the best performance of anyone. And there are a few other companies in the world that do this. There's Samsung. You know, there's GlobalFoundries, which used to be part of AMD. Intel has its own internal foundries. But at the end of the day, if you're Nvidia, AMD, Apple, anybody, you go to one of these foundries and you say, hey, there's a chip I'd like to get made. Um, TSMC gives you some high level APIs or some high level, you know, abstractions you can use to design your chip for them. Then you go and you hand it to them and say, print me the chip. And they're not experts in designing chips, you know — TSMC doesn't make its own CPU. They're really, really good at taking your designs and bringing them to fruition. They have all the latest technology, they have the most efficient stuff, they know how to improve yields. And I think of us at Mosaic as being TSMC.
Lukas Biewald: [00:23:40] And just to say — so, can you describe, like, talk about one customer and what their specification looks like?
Jonathan Frankle: [00:23:47] Definitely. So we're not in the business — we're not making a GPT-3 clone, you know, we're not training our own language model for ourselves. We're never going to put out an API and say, this is the Mosaic chip, come here instead of OpenAI or something like that. We're happy to take all comers and say, come print your model. We have the latest technology, we're doing it efficiently, we know how to improve yield as much as possible. So, you know, we know how to kind of make these training runs go well the first time. I can stretch this metaphor quite far, and we'll see how far it goes.
Lukas Biewald: [00:24:13] But yeah.
Jonathan Frankle: [00:24:14] But we've worked with, you know, we've worked with several customers now who want to train large language models that have specific properties, or where they have some specific data set they want to use, or for compliance reasons they can't use GPT-3 internally. You know, you can imagine lots of enterprise customers are pretty concerned about what might be in that model and the fact the training data isn't public. And everybody has lots of data. That's been one thing that struck me: companies don't realize they have lots of data, but everyone has tons of unlabeled data. You know, someone came to me and they were using some models pre-trained on ImageNet to do some vision stuff. And we asked, you know, do you have any unlabeled data? And they were like, yeah, we've got a little. And we said, how much? Nine petabytes. I mean, I was like, and you're pre-training on ImageNet? Are you kidding me? Someone else, you know, they were using BERT pre-trained on WikiText, and we said to them, do you have any unlabeled data in your domain? They were like, yeah, a little. We're like, how much? 300 billion tokens. You know, enough to train an exceedingly large language model. It's like, why are you using WikiText? So I think the data is out there, and, you know, people want these large language models. Often they want them with some specific properties or with, you know, something tweaked about the model. But that's what we're here for: to give you, you know, a great process technology and the ability to customize to what you'd like. So I like this foundry metaphor quite a bit, and I think it distinguishes very well what we do and what we don't do. We're not here to, you know, put the chips in computers and run them. We're not here to do inference. You know, we're not here to design your model for you. We're not a consulting company. We're here to help you build the best model you want for the lowest amount of money and to get something that works really efficiently.
Lukas Biewald: [00:25:47] But I guess when you say that, you know, you won't build the model for me — it kind of sounds like you're building the model for me.
Jonathan Frankle: [00:25:54] I guess I should specify what building the model means. I'm not going to tell you, you know, which model you should use. I'm not going to tell you, you know, you should use a ResNet for this and you should use a GPT for that. I'm not going to tell you, here's how to set up your optimization problem. I'm not going to go through and help you curate your data or things like that. Really, my focus is on: you're ready to train, let's train the best darn model we can and let's get it right the first time. So I'm not here to solve your machine learning problem or to set up your machine learning problem for you. I'm here to help you train the model that you'll eventually use in production, once you've figured out how you want to solve that problem, you know, strategically how you want to go about it.
Lukas Biewald: [00:26:34] Interesting. And I guess, like, one challenge, putting myself in your shoes, might be that it's kind of hard to know up front, at least in my experience, you know, how well training is going to go. Like, how do you work that out with a prospective customer?
Jonathan Frankle: [00:26:50] It's tricky. I think we're learning how to hold people's hands through that process a little bit, in the same way that I'm sure TSMC does not say, you know, here are our tools, let us know when you want to print a billion of these chips. You know, you want to go through and you do sampling and you try to, you know, simulate the chip before you build it. And we have a similar process. You know, before you train that 100 billion parameter model, we should probably train a billion parameter one and make sure that things work end to end and you get a result that looks reasonable. Did you use the right tokenizer? You know, are you getting results that — you know, if we train a 1 billion, then a 3 billion, are the results getting better? You know, is your data quality high enough? Can we go back through and, you know, make sure that all the inputs are looking good? We've seen everything that could possibly go wrong at this point, and there's so much that goes wrong when you train these models. I know the folks at Facebook kind of put out their logbook for all the stuff that happened when they were training the OPT model, the 175 billion one, and everything you can imagine went wrong. Hardware dies mid training run, you get loss spikes, the resumption times are really long if you're not careful — like these multi-terabyte checkpoints you have to load, and getting the data loader spun back to the point you were at in training can take a very long time — and just some weird stuff happens. I mean, some cases are just, like, you know, on a training run this long, you'll get memory errors — like, you know, that one in a billion, you know, cosmic ray strike in your memory will happen. You know, you get things like, oops, I used a different tokenizer for training than I did for evaluation, and so all my evaluation results look really bad even though my model is really great. So we try to get all that stuff ironed out at the small scale, and then we'll work with you to kind of, you know, go bigger and bigger and bigger until we're ready to, you know, build the giant chip, as it were, and really actually train the model.
Lukas Biewald: [00:28:30] And I guess in my experience, people typically, you know, don't just want to train a model one time. They want to continuously update it forever. Do you then sort of take over that process for a customer? How does that work?
Jonathan Frankle: [00:28:42] We're certainly in the loop on that front. We have APIs and a Python software development kit where, you know, if you want to do data intake and just schedule retraining to take place, you know, you can just do that on our platform. You know, really, we've got the platform and it's very easy to program around it for simple automation like that, and we have tools to help you do that. And I think you're right — like, a lot of people say to us, aren't you going to have a bad business? Like, a customer is going to come to you once, train that model, and leave? And I think the metaphor our CEO Naveen likes to use is, like, if you're building a piece of software and you get to version 1.0, do you fire all your engineers and say, I'm done, you know, the software is done? Now, of course, you've got more features you want to build, you've got things you want to update. Nothing is ever really done. And I think it's good for our business. But also, you know, once you've done that big training run, we're developing ways to make sure that your second training run is much cheaper, based on taking advantage of aspects of your big training run. And that's a place where we're investing heavily in the technology, so that each incremental run should be cheaper and cheaper than the last one — almost like a frequent flier program, you know, the more you train, the more you save, in some sense. But there's a lot of really interesting science behind how to do that without, you know, having your first model determine how all of your other models are going to go, because your data may change a lot.
Lukas Biewald: [00:29:52] And I guess, how do you think about, like, engaging with the research community? I mean, obviously you're still publishing papers, but you also kind of talked about your secret sauce. Is there sort of like a bright line in your mind about, like, you know, kind of what you publish and what you keep to yourself?
Jonathan Frankle: [00:30:07] Definitely. I mean, the first thing I'll say is we don't publish. That is one line I did draw for the team early on. We're not Google Brain. We're not here to be an open ended research team. We have a job to do and customers to serve, and we do science in service of that. But, you know, for anyone here who's looking for an interesting job, don't come here if you want to write papers or do open ended research — that's not what we do. We do like to share our results, and we'll talk about everything. We have to talk about our speed up methods — if we don't talk about them, imagine if you came to me and said, hey, I want to train this model, and I said to you, well, we're not going to train that model, we'll train something slightly different, but I won't tell you what, that's a secret. You wouldn't really trust me. We do have to be open about that algorithmically. So the secret sauce is really, I think, a couple of things. One is the expertise we built as a team to be able to really attack these models and speed them up. The secret sauce is in some sense experience and wisdom and kind of the culture and the scientific practices we have on the team. The way that we make money is that, you know, we put all our speed ups out there, but our cloud platform, you know, our orchestration software, the tools that make it really easy to train these giant models — that's kind of the managed version of this that you have to pay for. And that's really — when you're training a large language model, good luck doing it without this; you're going to have to stay up 24/7 and watch the loss for the spike and then figure out how to roll back and restart. And a lot of those tools, you know, are part of our paid offering.
Lukas Biewald: [00:31:21] So you do publish your algorithms then, if I understand that right. Oh, sorry.
Jonathan Frankle: [00:31:26] Sorry, let me clarify the word publish here, because I think we're using it differently. We don't submit papers to conferences for publication in that way, but we certainly do share openly what our algorithms are, what our recipes are, and that's all available in blog posts and in our open source Composer library. So that is, you know, freely available for anyone to see and use. But I guess publish in the academic sense — honestly, it just takes too long, and, you know, we can disseminate our results without having to go through peer review and all that good stuff.
Lukas Biewald: [00:31:51] Hmm. I see. I see. You know, another topic I want to make sure I hit with you is: it seems like you're a bit of a skeptic of this current approach sort of leading to AGI. And you seem maybe quite sure about that point of view. I wonder if you want to sort of say more about how you came to that, or if that's a fair characterization of your perspective?
Jonathan Frankle: [00:32:14] I think it's a very fair characterization of my perspective. I genuinely — you know, first of all, getting a good definition for AGI is pretty tough. I mean, either it's kind of everything or nothing. Either it's something, you know, pie in the sky that will never really be reached, true human intelligence, in which case, you know, trying to get that out of a feedforward neural network seems like — I think I've heard the metaphor of building a ladder to the moon a lot lately. You know, that's not how we're going to get there; it's going to take something fundamentally different. Or it's kind of, you know, "ChatGPT is AGI," if you have a very narrow definition of what AGI is — and I've heard some people arguing that. So, if you want to go there — I think AGI is really an all or nothing term, and I'm more of the — you know, most people talk about it as the "all" sort of definition. You know, this is really truly general humanlike intelligence and the ability to learn and adapt in an environment, in which case a feedforward neural network is not going to get us there. That's just, you know — this is fundamentally not the right technology for that. I really think AGI is being used pretty cynically by a lot of people in the field as a way to get people to give them money: either get people to give them money because they claim they're going to make something happen, or get people to give them money because they claim they're going to study something catastrophic that would happen. But either way, I take it as a cynical kind of, you know, pursuit of resources, pursuit of power and money, not something that people mean very seriously, at least, you know, other than the extent to which they're misleading others.
Lukas Biewald: [00:33:41] Interesting. What are some things — some sort of, like, reasoning tasks, maybe — that you think a feedforward network surely wouldn't be able to do?
Jonathan Frankle: [00:33:54] I mean, you know, the average feedforward network today — ChatGPT — is probably just looking at a very long context. And so, you know, if that context is essentially our memory space and our state space, and the model is able to just write back to that context and reuse it for future token prediction, that's a pretty raw way of giving the model the ability to interact with itself and interact with an environment. So, you know, it's hard to point to a task. People like to quiz me on this, like, what is the SAT score at which you'd consider this to be AGI? I had someone really, really badger me about that when I gave a talk recently where I expressed skepticism of AGI. So it's hard to pin it on a task — to say the model can't do this right now, and if it does that, that's AGI. Look at the Turing Test. I mean, we've been passing the Turing Test for 40 or 50 years, and that's been a pretty awful test of whether, you know, something like ELIZA had AGI. So, you know, I don't like to point to one task and say, you know, this is a thing that something must do in order to be AGI. But I don't think — you know, I don't think the setup of a feedforward network where we're just adding tokens to a context and hoping that it's able to take advantage of all these tokens, you know, is in any way going to lead to some kind of general intelligence.
Lukas Biewald: [00:35:03] I mean, I guess, honestly, I don't think I have a super strong point of view, you know? But, you know, you seem very empirical, and it seems like a very strong claim to say, you know, surely this approach can't do this. And I guess maybe you're saying that it's just sort of so, like, poorly defined that, you know, it's sort of a meaningless claim. But I guess it's sort of —
Jonathan Frankle: [00:35:26] I think it's a meaningless claim. But I also think that, you know, for many definitions of AGI, I don't think feedforward neural networks are really going to be able to pull it off.
Lukas Biewald: [00:35:34] Yeah. So I guess I'm just trying to get at, like, what are some things — I mean, these certainly aren't gotcha questions — just, like, what are the kinds of things that you think feedforward networks will never be able to do?
Jonathan Frankle: [00:35:46] I mean, right now we're watching ChatGPT really struggle with long context lengths, where someone goes back and forth with it for enough iterations that it starts cycling or it clearly loses track of what was happening earlier on. And we still don't even know how to solve that basic problem. That's a problem we're going to have to overcome. We can't really do, you know, large scale — you know, just handling large amounts of information, being able to somehow reason about it, you know, hierarchically or something like that — we're still nowhere close to that. And I think now that people are finding some of the soft spots of ChatGPT, we're seeing that happening in real time. That's a basic problem we're going to have to overcome. These models still attend to the tokens that are closest to the current token; they don't really attend that far back. Um, and, you know, these things end up in reasoning loops because of that. If we want things to reason, this seems like a pretty inefficient way to get something to reason in and of itself. So I'm pretty skeptical that just taking the same things and making them bigger will solve any of these problems. And those are basic problems we're going to have to overcome before we get there.
Lukas Biewald: [00:36:46] You're unusual in that you have, I think, maybe a really strong interest in policy. Can you tell us a little about that and kind of what you think is important at this moment? I guess it's December 2022. Like, what should we be doing, or what kinds of things are you advocating for?
Jonathan Frankle: [00:37:07] So I'm curious — I'm going to turn the question on you for a moment. How would you define policy? I love to do this to people because you always get interesting answers.
Lukas Biewald: [00:37:18] How to define policy. I guess my first thought is government regulation of what companies can and can't do. And then I think there's another thread of sort of, maybe outside of regulation, what companies should do to make sure that the work they do has a positive impact on the world. So what am I missing?
Jonathan Frankle: [00:37:47] So your first one there — you know, regulation — I would consider that law, but not policy. So that's an instantiation of policy. But I think the big distinction here is the question your second point got to: what should we do? What is kind of the ought — you know, what should we be accomplishing? What do we want the world to look like? And in some sense, you know, the rest is implementation details; that's when you get to lower-level concreteness. But even policy at a high level can be simple questions of what should we do, or what direction do we want the world to move in. And so from a policy perspective, I don't see policy as necessarily advocacy. Advocacy is one thing you can do — you can advocate for, you know, what we should do. But the other is, you know, simply providing consultation to the people who do make policy, to the parliamentarians from around the world or, you know, what have you, the people who are setting policy and trying to figure out what direction they want their country, or the world, to move. And that tends to be the role that I take. You know, I tend to be a technical expert that gets called in to help provide context to policymakers on topics, in this case related to machine learning, but in the past it was privacy or security. So I spent a year at Georgetown Law, kind of as the technologist in residence, helping them to, you know, make their decisions better on what kinds of policies they recommended or what kinds of research they did or how they understood, you know, what their findings were on various topics — specifically, in that case, police use of facial recognition in the U.S. We did a big study showing that, I think at that time, one third of all American adults were in a police facial recognition database. At that time, that was earth-shattering news; today, I think we all understand we're probably in Clearview and a bunch of other things. We've, you know, given in to a surveillance state in a way that we hadn't before. Today, I spend a lot of time with an organization called the OECD, which is kind of, you know, a UN-style organization that does economic policy for mostly the democratic capitalist countries and helps to do research and help them, you know, do things like think about national capacity and how they should be setting that. So it's less about advocacy. But I think the important distinction I'd make is I'm one input into this process. I'm a source of consultation and a source of expertise and a source of detailed knowledge about how A.I. does and doesn't work. I can provide feedback. I can make recommendations about, when someone has a policy goal, what the right implementation would look like. I do see a lot of us in computer science kind of expressing the hubris that we should be the policymakers or we should set the final policy — that we're not just one input into the process, that we know better than the lawyers who've been thinking about questions of, say, fairness and bias for decades, centuries, however long. The same goes for a lot of our questions around alignment or safety or things like that. You know, do we even realize there are regulatory agencies that have been dealing with, say, automobile safety for a very long time and probably have some good ideas about how to structure constraints on what we would think of as a safe car? In computer science, we tend to have the hubris to think that we can reinvent the wheel better than other people.
We like to disrupt things. In the case of a lot of these topics, I think we're leading ourselves wrong, and perhaps the right way is to engage with people who, you know, have built up expertise specifically in taking on these kinds of ambiguous questions that don't have clear answers. And, you know, we should be consultants, but we're not the only input into that process. And we should trust people who have legitimately studied this, not people who have made up new definitions of fairness because they thought it was interesting.
Lukas Biewald: [00:40:53] Maybe I'll ask the question in a different way. Like, you have front-row seats to, you know, the sort of explosion of, I guess, use cases around, you know, language and vision models. What kind of concerns you the most about where things are headed?
Jonathan Frankle: [00:41:15] I think we're getting to a place — this is not a novel concern, but I think we are getting to a place, and I think you're seeing this even with all the things I've seen on Twitter with ChatGPT — where these models are very confident even where they're full of crap. These models sound very convincing even when they're speaking complete nonsense, and we don't have a way to tell the difference right now. And, you know, we've seen this danger many times in the past, in other forms: that, you know, information from a source that seems reliable or kind of feels reliable doesn't necessarily have to be true. And the line between what does and doesn't feel true — well, let me put that a different way. I think that a lot of our training as people, about how to tell the difference between what is and isn't true and what should and shouldn't be trusted, is being exploited by some of these models in order to convince us that things are true that aren't, or make things seem real that aren't. We're not mentally prepared for a model that sounds really confident and speaks really intelligently but is just bullshitting, because it's a language model that was trained on Wikipedia and Reddit — or, you know, pictures coming out of something like a diffusion model that really seem real but aren't. The world is moving much faster than our cognitive biases are. I think we'll adapt, in the same way that, you know, people adapted to yellow journalism back at the turn of the century, you know, from the 1800s to the 1900s. You know, we've adapted, I think, reasonably well to fake news — as people, we're now pretty skeptical of what we read online, even if it looks like it comes from a publication of some kind — and we'll adapt here. But things are moving so fast, it's hard for anyone to keep up, and it's hard to really — like, I don't know, I don't think we're ready. I don't think our cognitive biases are quite ready for the onslaught that came last year, let alone the one that's coming this year, let alone the one that will come next year.
Lukas Biewald: [00:43:03] And I guess, as someone building a foundry for making lots of these models, are there things that you feel obligated to do on your side to help with these issues? Or is it really just about training the consumers of these models to not trust confidence — which might be a useful thing for people to do anyway?
Jonathan Frankle: [00:43:33] No, we're definitely obligated, and there are a lot of different ways of addressing that. I find personally a lot of impact in being downstream of these problems: if I'm going to make messes, I have to clean them up. So in some sense my policy work is an attempt to make sure that, as I'm on the bleeding edge of creating this technology, I'm also providing that same insight to policymakers so they can adapt as quickly as possible, and to make sure we're asking the right things of people as these models change. Part of it is also that we need to be responsible about who we work with. There are some companies, or some organizations, that at the end of the day we may choose not to work with if we don't think they're mature enough to handle this technology properly. That means, partially, that we need to move further down the chain — not just "how do you build this model." I'm trying to think of the right metaphor from the chip world, but it's probably something like helping a company pen-test their processor: I'm not going to tell you how to build it, but I do want to provide you with a toolkit to make sure you built it so that it's robust to X, Y, and Z, so that you don't have timing-channel attacks. So we may need to move further down the chain and help people evaluate their models effectively. That is something, though, that a lot of fantastic organizations are already out there working on, and far be it from us to reinvent the wheel. Part of it is having great partners like Weights & Biases, who we work with extensively, to make sure we can offer customers a full solution. No one company is going to solve all their problems, but there are fantastic companies out there looking at questions of bias who are probably going to adapt as these models fool us, or get around our cognitive biases, in increasingly sophisticated ways. I'm going to want to show up to a customer saying: Hi, we're Mosaic, we train models, and here's our preferred partner who we work with closely, who's an expert in helping you evaluate and test this model before you put it out in the real world — we highly recommend you work with them. In the same way that today we say, here's our preferred partner for experiment tracking, and we highly recommend you work with them because they're great. We don't want to solve everyone's problems; it's about putting all these pieces together into one great solution. But at some point, if we get big enough, we'll certainly convene an advisory board of some kind — that's something I've seen work reasonably well from my perspective as a policy person in that world. And there are certainly a lot of friends — you know who you are — who I'll be calling on for a favor to help us make good decisions: someone on the outside who has the trust of the community and has my trust. One of my friends likened what we do to building cyber weapons for people — these models can certainly be used in that way.
And we do have a responsibility to help make sure these models are being used carefully, and to keep an eye on what our customers want to do with them.
Lukas Biewald: [00:46:13] I guess, you know, you have really front-row seats into applications of these models. What's your perspective on what sort of new use cases are opening up with this technology?
Jonathan Frankle: [00:46:31] Everything. I mean, I will give you a worse answer than just browsing your Twitter feed right now. To some extent, I'm watching my research group back at MIT trying to make ChatGPT write programs. I guess I won't scoop them — this podcast won't be out for a little while — but they've just been playing around with this and looking at the strengths and weaknesses of the model. And it's really impressive, even if it's deeply flawed. It's so impressive that we've gotten to the point where the flaws we can point out are things like, "Well, it doesn't seem to remember facts that well." That's a huge win over where we were a couple of years ago, and we need to celebrate these wins even if these things aren't perfect. Or folks who are using diffusion models for all sorts of really creative things — I don't want to scoop them either — things that go beyond artistic purposes, toward generating new products or new ideas or new designs. Really, I'm not the domain expert here, and I think the beautiful thing about this is that the barriers to entry are low enough that any domain expert can see how this tool can help solve their problems. I had someone reach out recently about using this for some sports-related purposes that I thought were really cool. I wouldn't have thought of that, but this person happens to work in the sports industry and had an idea for how to use a large language model for it. It may not be quite the right thing, or it may take some more machine learning work, but this is mainstream enough now that I'm not the one to tell you the cool applications. Go look at the world and all the brilliant creative professionals out there, the people in their own industries trying to solve problems — they're the ones who should do it. In the service of not having too much hubris, of trying to be humble the way I am around policy: I'm the foundry. TSMC is not the one thinking of innovative hardware solutions. They're not Cerebras building a wafer-scale engine; they're not Graphcore thinking up a whole new way to organize a chip or anything like that. They're just saying, "Oh, that was a brilliant, cool use of our technology. Let's help you make it."
Lukas Biewald: [00:48:21] But I guess if I'm coming to Mosaic, I'm probably doing more than just fine-tuning an open-source model out there, so I must have something I really, really care about. I'd love to just get at: who comes to Mosaic to do this, and why are they doing it? I understand why you might not want to use, say, GPT-3, which has to be hosted in the cloud and where you can't hold the model — but there are open language models out there right now. What causes a company to undertake the very big expense of building one of their own foundation models?
Jonathan Frankle: [00:49:05] Data. It's specifically data, in one word. And by that I mean that your data is your identity — that's true both from a personal perspective and from a company perspective. Every company is sitting on so much unlabeled data, and we now have incredible methods to leverage that unlabeled data, between images and text, and probably soon combining the two and combining all sorts of other modalities. That data is your identity. Would you rather use the same identity as everybody else, or would you rather use your own identity? Would you rather use Reddit, which is probably a pretty large part of what's in GPT-3, or would you rather use your data and your identity? I think that's what people are really coming for.
Lukas Biewald: [00:49:44] So that makes sense. What are people doing downstream of it that's so important that they're willing to make this huge investment?
Jonathan Frankle: [00:49:51] Oh, anything and everything — from actually using this for customer service scenarios, to interesting open-ended classification tasks, to the kinds of few-shot or zero-shot tasks where you can prompt in a domain-specific way. Anything and everything, all the standard applications you see of a GPT model, applied to whatever their scenario is inside their company. It's the same applications, but with the benefit of having your data imbued into the model. One paradigm I'm thinking about a lot these days is the idea that these large language models are really databases. They know things: you can query something like ChatGPT and get all sorts of really interesting facts out of it. A lot of those facts aren't quite right — there was a beautiful thread earlier today about asking what the fastest marine mammal is, and it said a falcon — but it has knowledge and it has facts. There's a great bit of work from Ofir Press, a researcher at the University of Washington. He's a Ph.D. student right now — he's on the job market, by the way; I have to say that for anybody I know who is on the job market. He does this thing he calls self-ask, where he gets the model to reason through a task by asking repeated follow-up questions. And then he did this really cool thing where he Googled each of those follow-up questions, took whatever answer Google gave in its knowledge box, and gave that back to the model. [A rough sketch of this loop appears below.] So: having these models interact with databases, or having these models be the databases themselves, and probably some combination of the two. I kind of think of every relational database out there as an exact version of a database: you have a schema, you have a way of querying data. These large language models are, in some sense, soft databases. You can ask them natural-language questions, and they can find relationships between data that aren't expressed in a relational database but might be, if you give them enough data and teach them what text looks like, what language looks like. And you can create these things like databases, or even connect them to databases. So for me that's an emerging application area: thinking of these not as models but as soft databases that let you make connections you could never make with an exact relational database — kind of fuzzy databases, approximate databases. I'm sure someone will coin a much cleverer term than that, but that's how I think of them today. And when you think of it from that perspective: do you want your database to be whatever web crawl OpenAI used, or do you want your database to be your data? The answer is probably a combination of both.
Lukas Biewald: [00:52:15] Probably both. I was going to say, Yeah.
Jonathan Frankle: [00:52:17] But you certainly want your data in there, and if you have a lot of data — whether you're pre-training from scratch or starting from, you know, OPT or some other pre-trained starting point — you still need to imbue this model with your data.
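For concreteness, here is a rough sketch of the self-ask pattern Jonathan describes above: the model decomposes a question into follow-up questions, each follow-up is answered by an external search rather than the model's own memory, and the model then composes a final answer. The `generate` and `web_search` callables are hypothetical placeholders for whatever LLM API and search tool you use — this is a minimal sketch under those assumptions, not Ofir Press's actual implementation.

```python
def self_ask(question, generate, web_search, max_follow_ups=5):
    """Rough sketch of the self-ask loop.

    `generate(prompt)` is any text-completion call (an LLM API of your
    choice) and `web_search(query)` returns a short answer string, e.g.
    a search engine's knowledge-box snippet. Both are hypothetical
    placeholders, not a specific library's API.
    """
    prompt = (
        f"Question: {question}\n"
        "Are follow up questions needed here: Yes.\n"
    )
    for _ in range(max_follow_ups):
        # Ask the model what it needs to know next.
        follow_up = generate(prompt + "Follow up:").splitlines()[0].strip()
        if not follow_up or follow_up.lower().startswith("so the final answer"):
            break
        # Answer the follow-up with an external lookup instead of
        # trusting whatever the model happens to remember.
        intermediate = web_search(follow_up)
        prompt += f"Follow up: {follow_up}\nIntermediate answer: {intermediate}\n"
    # Let the model compose the final answer from the retrieved facts.
    return generate(prompt + "So the final answer is:").strip()
```

The exact prompt format matters in the original work; the point here is only the shape of the loop — the model proposes sub-questions, and a real search engine or database answers them.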
Lukas Biewald: [00:52:28] How many companies do you think will try to build these large models from scratch?
Jonathan Frankle: [00:52:33] Hundreds, at least. Possibly thousands. Our business so far seems to reflect that possibility. Business is really good for training large language models, and we do feel a little bit like TSMC right now, in that getting capacity on TSMC is really hard and you've got to be an Apple-sized customer to book that capacity a long way in advance. I feel that way right now in terms of how we're booking our large language model training — my team, if you manage to find them right now, will certainly tell you that as well. But the answer is everybody. Everybody is sitting on so much data. Many companies are going to need lots of these things, and lots of companies are going to need at least one of them. We're seeing this all the way down to relatively small companies that want to do some fine-tuning of these models because they have some business-specific data that is really important for them to be able to use these models effectively. So I genuinely think — what is it, ChatGPT got up to a million users after about a week? That should tell you something about the number of people who have found interesting use cases, or are at least very curious about where this technology might fit in. And if that's a company, they have a lot of data they can use to customize this model for them. At the end of the day, if we've learned one thing from these large language models, it's that it's all about the data. The models are very interesting — transformers, yeah, it's a cool technology, but transformers are pretty simple compared to an LSTM or something like that. It's not even about the way we train it, other than the fact that the way we train it is really expensive right now and Mosaic needs to make that cheaper. The data is where the magic happens. The data is what gives this model its superpowers, and that's the place where everybody's going to want to customize it.
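To ground the "imbue the model with your data" idea, here is a minimal sketch of fine-tuning an open pre-trained model on a company's own text using the Hugging Face libraries. The file name `company_docs.txt` and the choice of `facebook/opt-1.3b` are illustrative assumptions, and this is a generic recipe — not MosaicML's training stack or anything Jonathan describes directly.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# "company_docs.txt" stands in for your own unlabeled text corpus,
# one document or chunk per line.
dataset = load_dataset("text", data_files={"train": "company_docs.txt"})

# Any pre-trained causal LM works as the starting point; OPT-1.3B is
# just an illustrative choice.
model_name = "facebook/opt-1.3b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_data = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="opt-company-finetune",
        per_device_train_batch_size=4,
        num_train_epochs=1,
        learning_rate=2e-5,
    ),
    train_dataset=train_data,
    # mlm=False gives standard next-token (causal LM) training.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Whether you start from a checkpoint like this or pre-train from scratch is mostly a question of how much proprietary data you have and how different it is from the web text the base model saw.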
Lukas Biewald: [00:54:05] It's interesting you say that, as someone that does the model building. Do you offer any kind of services like active learning, or ways to improve the data? If you feel that the data is the most important input into the model, how do you engage with the data?
Jonathan Frankle: [00:54:20] Right now, I mean, we have a lot of excellent partners who are fantastic at this, and we're a startup — we need to stay focused, and our focus is on making it cheap enough that you can even contemplate doing this. The data is a problem you only think about when you can actually train the model. If I told you the model is $10 million to train, well, in many cases you don't care about data quality at that point — you just know you're not going to be able to afford to train it. If I tell you the model is $100,000 to train, then there's another conversation to be had. And I think for the first time we're having that conversation, in a way that we couldn't prior to a lot of the efficiency work happening at MosaicML and elsewhere. So now that we're in this place, it's something we're definitely thinking about, and it's something we're building tools to provide. But there are a lot of fantastic companies out there and a lot of fantastic partners that are experts in data, and far be it from us to walk in and think we know better than them.
Lukas Biewald: [00:55:11] Can you tell me some of your favorites? I'm curious — and I'm neutral in this, because I used to run a data collection company, but I haven't in years. Do you have a favorite?
Jonathan Frankle: [00:55:21] I don't want to play favorites in public or anything like that. There are a lot of great folks out there; we've worked with a lot of different folks in the past, and we're probably going to work with a lot of folks in the future, so I won't play favorites here. But it's a really competitive space, which means there are a lot of really smart people working hard to do better at data curation and data labeling, and I would trust them over me right now, for sure. I haven't been doing this for years the way many of these companies have, so go to the experts. We're really, really good at training, and we do that really well. And depending on what kinds of data someone's working on and what kinds of companies they're looking to work with, we can point them in a number of different directions. But it's fun being a startup — it's like being in academia in some sense. Nobody's an expert in everything, and if you want to accomplish anything significant, you have to collaborate. I like to look at the world as being full of awesome collaborators we can work with.
Lukas Biewald: [00:56:17] All right. Well, we always finish with two questions, and I want to make sure I have time to get them in. One of them is something I'm sure you'll have a great answer for: outside of what you're doing now, what's another area of research that you wish you had time to look into? What do you think is, I guess, an underappreciated research topic in machine learning?
Jonathan Frankle: [00:56:39] Oh man, there are so many. I would tell everyone who's working on adversarial examples or federated learning or anything else that's kind of academic and not very exciting to go work on any one of these instead. For me, I'm really excited about the data questions: understanding how much data quality really matters, and understanding how much things like reinforcement learning with human feedback, or just instruction fine-tuning, actually matter. Are these red herrings? OpenAI said they were important — why should we believe them? Why should we believe anyone until we've reproduced this and seen whether it's actually valuable? There's a great opportunity there for those of us in academia — in my other life, I will be an academic again next fall, and technically I guess I still am for another three days until I defend. For those of us in the academic world, these are fantastic questions to ask. I don't think we should take for granted that someone like OpenAI said "this is what we do," or talked about it, and therefore it must be what they're doing. These are questions we can approach scientifically, and the beautiful thing is that they're not that expensive to take on in a lot of ways — especially the fine-tuning questions, which just involve fine-tuning a model, not training something from scratch. The data questions are a little trickier, but even at small scales we should be able to see effects that give us a sense of what may happen at large scales. So for me, that seems to be the key leverage point right now. The other questions that always get me excited are the questions of how these models really learn. This is a wildly complex process where we're taking tons of data, tons of parameters, and tons of compute, throwing it all together in a stew and mixing it — and sometimes good things seem to come out. But what are the chemical processes actually happening there? How does learning take place? What does the network learn over time, how does it learn, and what are those dynamics? That always gets back to the application — the Mosaic question of how we speed it up — but the mere act of understanding how learning happens is itself a fascinating scientific question. And for those looking for Ph.D. programs — I don't think it's too late to apply to Harvard this fall, but you'd better really start writing and get your letters, like, yesterday — if you want to come do a Ph.D. at some point, those are the questions I'm most excited about academically, because I find the fact that you can stir all this stuff together and get something like GPT-3 out to be endlessly mind-blowing.
Lukas Biewald: [00:58:52] Totally agree. Totally different direction: when you look at taking these models and actually getting them running in the real world — both inside of Mosaic and then, I guess, with your customers once you hand off the model — where do you see the unexpected bottlenecks? What are the actual hard parts about getting a big model doing something useful for someone in the real world?
Jonathan Frankle: [00:59:19] It's always the stupid stuff. Just like anyone who's ever done software engineering knows, it's always the semicolon you forgot; it's never that your algorithm has the wrong time complexity. It's always that you mixed up two variable names, or that you accidentally overrode a variable somewhere in your for loop and didn't realize it. We see that all the time — like the example I gave of using a different tokenizer for training than for evaluation and thinking that your model's not training at all. It's those kinds of mistakes, and there are so many different places where this can happen, because these are such complex, nuanced systems. It's the dumb mistakes that kneecap you. Like, did you know — and I didn't know this — that on an A100 or a V100, if you change the memory format of a ResNet to channels-last, which doesn't change anything about the model, you get a 30% speedup? [A minimal code sketch of this appears at the end of this answer.] Did you know? I sure didn't. Ph.D.-student me really wishes he had known that — it would have saved me a huge amount of time and a lot of money for my advisor, and I would have done a lot more science. Who knew? It was written down in a beta PyTorch documentation page somewhere, and a few people in the know knew about it. We wrote it down, put it out very publicly, and we try to do that with everything so people know where to find it. But it's the little things that always kill you, not the big stuff. At the end of the day we can figure out how to get the stuff running, but it's the little things that drag you down — it's death by a thousand cuts a lot of the time. The other killer is always efficiency: when you're training at this scale, if it's not efficient, it's going to take the lifetime of the universe. That's the Mosaic problem in some sense. The other is that there's a difference between getting it running and getting it running smoothly and reliably and consistently — the difference between that and having something jury-rigged. I'll give you a quick anecdote here; whether this makes the cut or not, we'll see. Here's how I trained my models back in my Ph.D. Google was kind enough to give me some free TPU capacity, but there wasn't a job scheduler associated with it. You had to manually spin up a TPU, SSH in, start your job, let it go, and check in on it. The TPU sometimes froze up or died, or my code crashed, so I wrote a little orchestration platform on top of it. I made a Google spreadsheet and used the Google Sheets API, so that whenever I typed a line into the spreadsheet with an experiment, a little daemon process I had would read the experiment off, check whether it was already running, and if not, kick it off, run it, and update the spreadsheet with its status — change the color of the line and everything. And I used this, over the course of my Ph.D., to collect almost a million checkpoints. I just deleted those checkpoints — they were taking up God knows how much space, and it was over a million files deleted. This was the jankiest thing in the world. It worked — to my advisor's chagrin, it worked. It was terrible and unscalable and unsustainable, and if I'd had to do something even bigger, if I were running a business and this were what was holding things together, God help me. It was good enough for a Ph.D. student.
And a lot of us settle for that quality of solution. A lot of us say we're okay with that — you know, in the Weights & Biases context, a lot of us say, "I'm okay with TensorBoard, even though it requires like 128 CPU cores and gigabytes or terabytes of memory, and it still crashes all the time and doesn't really do anything right." In some sense there's nicer stuff out there — we know how to build better tools — but many of us have been in academia for so long that we feel like we don't deserve better. You know, why should I have access to the nice tools? I'm an academic. My research team says this to me all the time. Our engineers will say, "Why didn't you report this bug? Why did you just do this horrific thing to work around it?" And after we go back and forth for hours, it typically comes down to, "Well, I didn't want to disrupt you, and I didn't feel like I deserved to have this bug fixed." We've accepted self-flagellation as part of being an academic or a researcher. But to some extent, there are clean, good solutions out there that just work for a lot of these problems, and if there aren't, someone's building them right now. That janky solution you knitted together on top of Slurm — you shouldn't use that. That TensorBoard-based setup that crashes half the time and gets the lines mixed up and has these weird spikes and gaps in the data — there's something better out there than that, and we should be willing to use it. We also have this philosophy in computer science of "I shouldn't have to pay for things." My poor friend was using LibreOffice when they could have just paid a relatively small amount of money to Microsoft or Google and gotten a product that genuinely did work — but instead they really wanted to do it that way. And I had a professor I worked with who really wanted to use his Linux laptop, even though it almost never connected to the Wi-Fi and barely ever printed. I see things like Slurm and TensorBoard as the LibreOffice-type solution: it will kind of, sometimes, get the job done. But it's not that expensive to get the real thing, especially if you're an academic — there's academic pricing for all this stuff. And if you're a company, the cost of your engineers' time absolutely dwarfs the amount of money you'll pay to any one of these companies to get a product that works. So I do think there's something tied up in this. Did I home-brew something crappy because I didn't know any better? Did I feel I didn't deserve better? Do I have this trained instinct not to pay for things because of the open-source ethos in computer science? All of those things I see dragging people down a lot. To some extent the tools didn't exist when I needed them, but now there are fantastic tools out there, and if someone today had the same infrastructure, the same capabilities, the same hardware I had, they should not do it the way that I did it. In this day and age there are tools that will help you use that infrastructure more effectively. So we should be focusing on the problem here.
And I think, you know, I've strayed away from your question and rambled a lot — we'll see how this gets cut up when all is said and done. But to make a very long story short: it's not as hard as it used to be. There really are tools that will help you avoid making mistakes, and you should just use them.
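As a reference for the memory-format trick Jonathan mentions earlier in this answer, here is a minimal PyTorch sketch of switching a ResNet to channels-last. The roughly 30% figure is his reported observation on A100/V100 GPUs; the actual gain depends on the model, the GPU, and whether mixed precision is enabled, so treat this as a sketch rather than a guaranteed result.

```python
import torch
import torchvision

# Convert both the model and its inputs to the channels-last (NHWC)
# memory format. The model's math and outputs are unchanged; only the
# tensor layout in memory changes, which lets the GPU's convolution
# kernels run faster, especially with mixed precision.
model = torchvision.models.resnet50().cuda()
model = model.to(memory_format=torch.channels_last)

images = torch.randn(64, 3, 224, 224, device="cuda")
images = images.to(memory_format=torch.channels_last)

with torch.autocast("cuda", dtype=torch.float16):
    output = model(images)  # same predictions, different memory layout
```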
Lukas Biewald: [01:05:15] Well, there's a role here for you as a head of marketing. If you're just getting started, it's good to get the good stuff.
Jonathan Frankle: [01:05:25] As a dedicated user of your product, you know, we're willing to pay to not have to suffer, and it's worth it for us.
Lukas Biewald: [01:05:32] Nice. Thank you. We appreciate it. If you're enjoying these interviews and you want to learn more, please click on the link to the show notes in the description where you can find links to all the papers that are mentioned, supplemental material and a transcription that we work really hard to produce. So check it out.


