
Peter & Boris — Fine-tuning OpenAI's GPT-3

Peter and Boris dive into the world of GPT-3: how people are applying OpenAI's flagship language model, why fine-tuning GPT-3 improves performance, and the development of OpenAI's GPT-3 API.
Created on February 8|Last edited on November 18


About this episode

Peter Welinder is VP of Product & Partnerships at OpenAI, where he runs product and commercialization efforts of GPT-3, Codex, GitHub Copilot, and more. Boris Dayma is Machine Learning Engineer at Weights & Biases, and works on integrations and large model training.
Peter, Boris, and Lukas dive into the world of GPT-3:
  • How people are applying GPT-3 to translation, copywriting, and other commercial tasks
  • The performance benefits of fine-tuning GPT-3
  • Developing an API on top of GPT-3 that works out of the box, but is also flexible and customizable
They also discuss the new OpenAI and Weights & Biases collaboration, which enables a user to log their GPT-3 fine-tuning projects to W&B with a single line of code.

Timestamps

0:00 Intro
1:01 Solving real-world problems with GPT-3
6:57 Applying GPT-3 to translation tasks
14:58 Copywriting and other commercial GPT-3 applications
20:22 The OpenAI API and fine-tuning GPT-3
28:22 Logging GPT-3 fine-tuning projects to W&B
38:25 Engineering challenges behind OpenAI's API
43:15 Outro

Transcript

Note: Transcriptions are provided by a third-party service, and may contain some inaccuracies. Please submit any corrections to angelica@wandb.com. Thank you!

Intro

Peter:
We have these kind of two camps of users. The researchers and the developers. Developers keep telling us like, "Hey, I just want one button. I just want the best model to come out." And then a lot of the researchers want to fiddle more with the parameters. I think we can probably satisfy both for a long time.
Lukas:
You're listening to Gradient Dissent, a show about machine learning in the real world, and I'm your host Lukas Biewald. Today, I'm talking with Peter Welinder, longtime friend and currently VP of Product and Partnerships at OpenAI, running GPT-3 and other things. Before that, Research Lead at OpenAI, where he was one of Weights & Biases' very first customers. And before that, Head of Machine Learning at Dropbox. I'm also talking with Boris Dayma, Machine Learning Engineer at Weights & Biases, and we're going to talk about GPT-3 and the recently announced integration between GPT-3 and Weights & Biases. This should be a lot of fun.

Solving real-world problems with GPT-3

Lukas:
Peter, the last time we talked I think you were working on research at OpenAI. That's most of the time that I've known you, but now we find that you're VP of Product and Partnerships at OpenAI. I'm kind of curious what that means and what you're doing day to day.
Peter:
Yeah, sure. What I do today is quite different from when I did research, for sure. For me, doing research has always been about solving the hardest problems that are out there, in order to actually have some sort of impact on the world. I'm kind of personally much more driven by the end goals of research rather than the research itself. It's really fun to do research, you know, go down and explore things research-wise, but it's always been with some goal at the end of it. One exciting thing that has happened with GPT-3...a lot of the things that I did when I started at OpenAI was like, I did things on the robotics side. With robotics, there's still some gap from the stuff you can do in the lab and what you can do in the real world. With GPT-3, when we got our first results in GPT-3, it was kind of clear that we had something that we could start applying to real-world problems rather than just do cool demos. When I worked in robotics, what we got at the end was a really cool demo of a robotic hand solving a Rubik's cube, but it's not like you start deploying this in everybody's home. Even if it worked robustly enough to do that, I don't know how useful it would be to solve a Rubik's cube. It's a very expensive way of doing that. But with GPT-3, we had a language model that you can now apply to solve all kinds of different problems. Everything from translations to summarization to things like classification and question answering and so on. It was a very flexible model. So, what we set out to do was to kind of start just seeing if this was good enough of a model to actually solve real-world problems. For me, that's just a really fun area to focus on. When you have this really powerful new technology that has the potential of just changing a lot of things in the way they work, it's all about finding the right problems to go after. And then seeing how you take the tools you have in your toolbox to solve those problems. The difference is that what I did as a researcher was very much kind of coming up with the right kind of benchmarks and the right ways to measure progress, where there was a goal that was really far out and you needed to come up with these toy ways of evaluating progress. And now it's like customers telling us like, "Hey, I'm trying to apply GPT-3 to this use case," and it doesn't work or it's too slow or something like that. Those problems are much more concrete. My day-to-day...right now, it's much more around building a team that can solve these real-world problems with the technology that we have developed at OpenAI.
Lukas:
When you look at GPT-3 versus the other approaches for large language models out there — that kind of seems to be a trend — are there key differences that you notice in how it works? Is the take different somehow?
Peter:
Yeah, it's a good question. I think that what I really like about GPT-3 — and the main way in my mind that it's different — is that it's just extremely simple. All that GPT-3 does... So, GPT-3 is a large language model, big neural network. It's using this Transformer architecture that Google introduced a couple of years ago that has been really popular. It's basically powering all different language models these days, and it's starting to make its way into other areas like computer vision as well. But the way GPT-3 is set up, it's very simple. It has some context, which basically means it has...you can look at a history of texts. Like, if you're reading a book, you can look at the page of texts or the paragraph of text. And then it's trying to predict the next word. That's the way that GPT-3 is trained. It's just trained on lots of texts from lots of different sources, mostly from the internet. It's just trained to kind of over and over again, based on some words it's seen, predict the next word. You can start with only a few words, but when we train these models today, we train them on the order of like a thousand or a few thousand words. You can look back at those thousand words and then try to predict the next word. So the setup is super, super simple and you just train it on these huge datasets of texts in order to keep on predicting the next word and get really, really good at that. I think the surprising thing with GPT-3 was that if you do that, and then you make the model really, really large — so it has a huge capacity of learning — then it gets really good at a bunch of tasks for which you previously needed specialized models. If you wanted to do a translation, you would need a specialized kind of translation neural network. Or if you wanted to do summarization, similarly you would set up your network in a particular way, and then train it on only summarization tasks. What we found with GPT-3 is that you actually get very close to state-of-the-art performance on a number of these benchmarks — that measure things like summarization, translation, question answering, and so on — with a model that has just been trained on the Internet to not do any of those tasks specifically, but by just being able to reproduce text in a similar way that it has read it.

Applying GPT-3 to translation tasks

Lukas:
Practically though, how do you apply it to a translation task? How do you take "predicting the next word" and make it do a translation?
Peter:
Yeah, that's a great question. In a lot of those other large language models, there are certain steps where you would take a piece of text and you would encode it. So you would create some representation in your neural network, and then you would have a decoder that would take that and then write some sentence. If you did translation, for example, you would encode that into some sort of representation, and then you would have a separate piece of your neural network that took that representation and tried to output what you wanted. The input might be like a sentence in German and the output might be a sentence in English. And, you know, it's been trained specifically for that. For GPT-3, to your question then, what do you do with GPT-3? The simplest way you would do it is that you would provide a few examples of what translations might look like, in just pure text. You would write, "German:" and some sentence in German, and then "English:" and some sentence in English. If you provide only a single one, that setup is called one-shot. You can provide a few examples of basically "German: English:" pairs, and then you would put in the new sentence that you would want to translate. That's called few-shot training, where you have a few examples and the model, just by looking at the pattern of what it's now seeing in its context, can predict...it can produce a translation. It's a very simple setup. Basically, the way I think about telling GPT what to do is a little bit like how you would actually tell a human to do the same thing. Like, if you're writing an email...if I'm writing an email to you, "Hey Lukas, I want you to translate some sentences," what I would do is like, I would just ask you, "Please translate these sentences." And I would maybe provide a few examples to give you a sense of the tone. Like, do I want a more formal translation, more casual translation, and so on. You would pick up on the pattern. Then, given a sentence in German, you — I don't know if you know German — you will be able to translate it to English. It turns out now with our latest models, you don't actually even have to provide those examples. You can often just ask the models just as you would ask a human. Like, "Hey, translate this sentence for me," or "Summarize this piece of text". We just found that that's how people wanted to use the models. We made them work more like that, but that's how simple it is. You just tell it what you want to do and it will do its best attempt at just doing it.
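To make the few-shot setup concrete, here is a minimal sketch using the legacy OpenAI Python client, roughly as it looked around the time of this episode; the engine name, sampling parameters, and example sentences are illustrative assumptions, not details from the conversation.

```python
import openai  # pip install openai (legacy, pre-1.0 client assumed)

openai.api_key = "YOUR_API_KEY"  # placeholder

# Few-shot prompt: a couple of German -> English pairs, then the sentence to translate.
prompt = (
    "German: Guten Morgen, wie geht es dir?\n"
    "English: Good morning, how are you?\n"
    "German: Ich habe heute keine Zeit.\n"
    "English: I don't have time today.\n"
    "German: Das Wetter ist heute schön.\n"
    "English:"
)

response = openai.Completion.create(
    engine="davinci",   # illustrative model name
    prompt=prompt,
    max_tokens=60,
    temperature=0.3,
    stop=["\n"],        # stop at the end of the translated line
)
print(response["choices"][0]["text"].strip())
```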
Lukas:
Did you make a concerted effort to train the model on multiple languages or was it mostly English? Where did the corpus come from?
Peter:
We actually did the opposite. Initially, when we trained GPT-3, we made a concerted effort not to train it on languages other than English. It turns out that even though these models are huge, there's a trade-off in your dataset mix. If you train it on English, but then lots of other languages, it would just not end up being as good at English tasks. And ultimately when we train this, we want to see, just generally, how good can it be at more general capabilities? We didn't care as much about translation. So whenever we put in extra languages, that would just be at the cost of being good at performing other tasks in English, like question answering, and summarization, and so on. But it turned out even by explicitly trying to filter out most other languages, probably a few small percentage points of the data turned out to be in other languages. And even with that, the model is just incredibly good at translation. It's close to state-of-the-art in a lot of translation tasks. I'm a native Swedish speaker, but I've lost my ability to write things in Swedish these days because I never do it. What I do these days is, I write it in English and I ask GPT-3 to translate it for me. That's usually my starting point. It won't get it perfect, I need to fiddle with a few things, but it's surprisingly good. And the amount of Swedish training data in the model was really, really small. We've been constantly updating our models and making them better and better, so now we are introducing more and more language data, as we've kind of figured out how to make these trade-offs in more optimized ways. But yeah, originally we actually wanted the opposite. We just wanted it to be really good at English.
Lukas:
Is it predicting words or is it predicting one character at a time? How does that work?
Peter:
It's neither of those. It's actually predicting something called tokens, which is like...."part of words" is maybe the way to think about it. The most common English words, they are captured by a single token. A token, it's basically...in our current set up, we have about 50,000 of these tokens and we map them onto sequences of characters. It ends up being like...a common word like "hi" or "the" ends up being one token. But if you have a more uncommon word, like "encyclopedia" or something, you're probably going to break it up into two or three tokens. It's like word pieces that just make it easier and more efficient for these language models to consume texts. In principle, you can actually do it at the character level as well. It just gets very inefficient. But you know, that's where the field is probably moving. Eventually it's going to just do that at the character level.
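As a rough illustration of tokens versus words, here is a small sketch using the Hugging Face GPT-2 tokenizer, which uses a byte-pair encoding in the same family as the one Peter describes; the assumption here is that its splits only approximate GPT-3's actual tokenizer.

```python
from transformers import GPT2TokenizerFast  # pip install transformers

tok = GPT2TokenizerFast.from_pretrained("gpt2")

for word in ["the", "hi", "encyclopedia"]:
    ids = tok.encode(word)
    print(f"{word!r} -> {len(ids)} token(s): {tok.convert_ids_to_tokens(ids)}")
# Common words come out as a single token; rarer words are split into several pieces.
```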
Lukas:
But I would think that might make foreign languages really hard. Like, for example, would Asian languages be impossible then? If they have far more tokens. Or I guess maybe you could argue they've sort of done the tokenization for you by having a larger number of characters that encode a bigger chunk of meaning.
Peter:
Yeah, it is definitely the case that the way you train your tokenizer would have an impact on the performance for different languages. Usually those two things are trained in two different steps. You would train your tokenizer on some corpus of data, and then you would separately train your models with that tokenizer on some other datasets. And in order to get your model really good at different languages, you need to train the tokenizer over multiple languages as well. It's definitely...it's more expensive to use other languages because they end up...a German word just ends up being more tokens because we've trained on much less of it, while English is very efficient, where a lot of words are a single token. So it makes it both a little bit worse at other languages and more expensive.
Lukas:
Could I translate something into Japanese? Would that even be possible for GPT-3?
Peter:
Oh yeah. One comment I remember was a Japanese user of ours. They really liked to use GPT-3 to translate technical documentation between English and Japanese, because they found that GPT-3 was much better at this translation of technical documentation than Google Translate. This was like a year back, so it's possible that Google Translate is better now. But probably just a chance thing based on the datasets that we had. The really cool thing, actually, with the translation capabilities of GPT-3 is that we haven't really trained the model on explicit pairs of input and output, translated pieces of texts. Like what you usually call "aligned pieces of text". It's just seen a lot of Japanese. It's seen a lot of...well, not super much. It's seen a bunch of Japanese, but a whole ton of English. Somehow, through learning how to predict the next word, there's been enough little pieces of texts, blog posts, or whatever — where the author is switching between Japanese and English and maybe doing like some translation on some sentences — where it found the mapping and then somehow has a representation that's good enough then to generalize to arbitrary translation tasks. For me, that's just magical. That it's just by reading lots of English text, lots of Japanese text, and then maybe like accidentally finding a few kind of aligned pairs in all of the data, it's able to do that translation. That's pretty crazy to me.

Copywriting and other commercial GPT-3 applications

Lukas:
That is really amazing. Is this performance tangibly different than earlier versions of GPT? Like, was there something that happened in GPT-3, where OpenAI thought, "Okay, we can use this for real-world commercial applications"? Was it a performance level that it needed to get above?
Peter:
Yeah, definitely. I think the big difference between GPT-2 and GPT-3 was really...it was trained on more data and it was a bigger model. Like by two orders of magnitude. I think the original GPT-2 was about 1.5 billion parameters and GPT-3, the biggest model, was 175 billion parameters. It went up by two orders of magnitude, and since it was a much bigger model, it also needed more data. The surprising thing is that that's what it took to go from feeling fairly kind of dumb to interact with...like GPT-2 was kind of cool, but it also felt kind of incredibly stupid most of the time. I think with GPT-3, it went to being sometimes just surprisingly good. Don't get me wrong, GPT-3 still makes a lot of silly mistakes. But it does the right thing probably like 30-50% of the time on some tasks, and sometimes even better than that. It's sort of like suddenly...before you would need to sample and try out tasks and maybe once every 20 or something you would see like, "Oh, this looks pretty good." And with GPT-3, it started happening like every third time, or every second time, or every fifth time. And you're like, "Oh my God, this is actually..." For things like summarizing text...one example we have is summarizing a piece of text in the style of a second grader. It's just incredible how the model is able to simplify words, get the gist of a piece of text, and so on. Again, it's not perfect, but it's just really good. Obviously, there's a lot of academic benchmarks. You can run these models and you can see it's just getting much better on those academic benchmarks. But it was a whole different feel to it when you wanted to prototype something. The difference is that now it's just easy to get something that works pretty well. That's sort of why we decided like, "Hey, now it seems useful." GPT-2 didn't seem really useful to the same extent, but GPT-3, for all these tasks we felt like, "Okay, it's close enough to state-of-the-art." If you have a specialized model or whatever, a clever programmer should be able to apply it to whatever tasks they have. That was what we set out to validate with the API.
Lukas:
What are some of the use cases that you feel really proud of, where it really works? Are there any that you could point us to, where we could go interact with it in a commercial setting somewhere?
Peter:
Yeah, sure. I think some of the areas where we were most surprised were copywriting and question answering. Generally, creative writing. For copywriting, what happened there was that there was a number of companies that started building on top of our platform. Some of these companies are like...Copysmith was one of the first ones; CopyAI; there was also Jarvis, I think recently they changed their name to a different name; and a number of other of these companies. What they did was really clever, because they realized that — as I said — when you're using GPT-3 to do some task, it's not perfect. Every now and then, you would get something that doesn't really make sense. But if you're doing copywriting tasks, like if you want to write some engaging product description based on some attributes of a product — like a shoe, maybe the type of sole, the color, some other attributes of the shoe — and you want to write something really engaging about that, then the problem that you as a human face is that you get into some kind of writer's block. Like, where do I even start? What these companies started doing is they took GPT-3, and they used it to generate a few starting points or a few variations of how you could write product descriptions. What you find is more often than not, if you generate like five of those examples, one of them would look really good and you can use that as your starting point. You maybe just take it as it is, or you make some small tweaks to it. It's a way to almost aid in human creativity, you know? I think that's just so cool. Writers would tell us like, "Hey, I've been trying to write this book for like half a year now. I just keep on getting stuck in writer's block. Then I started using a playground for GPT-3, and now it took me two weeks to turn out the whole book." When you get stuck, it can create an interesting storyline. As a creative writer, you start exploring that like, "Okay. I wouldn't have thought of this character going down in that direction, but let's explore that." And then it becomes a much more fun, engaging process. It's almost like as a human, now we have a brainstorming partner that you can apply to all these different tasks. I think what I found was really cool is to see a number of companies really leveraging that and creating new experiences that you couldn't do before. I think that one is really exciting. I think question answering is also really cool, but this one was quite unexpected. I don't think we would have predicted that one being such a big use case.

The OpenAI API and fine-tuning GPT-3

Lukas:
It seems like one of the advantages of GPT-3 is that it works right out of the box. But I could also imagine for some teams there might be a concern about what do you do if something goes wrong. I guess I'm curious. Do you typically work with ML teams inside of companies, or is it more engineers that view the benefit here as that they don't have to figure out how machine learning works to get the benefit of natural language processing? Or do you tend to integrate this with ML teams into a kind of bigger ML workflow?
Peter:
Yeah, that's a good question. It's a bit of a mix, I would say. We've had multiple machine learning teams who already had their own models that...they would have downloaded the models online, and so on, and they would have adapted them for their tasks. And then they find our API and start doing the same thing using our API, and it just turns out that you can get much better performance from our models. Like, just because there doesn't exist... there isn't an open source version of the biggest models that we have, or the best models. So for a lot of tasks, that's what works the best. But I think probably the majority of our customers are more in the other camp of just "really smart developers". When I say "developers", it's a pretty broad group. We see everything from programmers and engineers, to designers and PMs. A number of people have told us that the OpenAI API was sort of what got them into programming, because they got really good results from just our playground, where you can interact with our models. They got ideas, and they started to learn how to code, and got connected with no-code tools like Bubble IO and stuff like that. It's really lowered that barrier. You don't have to become a machine learning expert to get really good results out of these models. You just kind of have to be good at iterating and figuring out how to write the instructions to the model. It's a little bit like...everybody becomes a manager. You have to give really good instructions to your employee if you want them to do the task as you want it to be done. It's very similar with these models. Like, if you under-specify your tasks, you're going to get very high variance in the outputs. But if you get really good at specifying — even providing a few examples — then you get really good results. That's not a machine learning skill, that's almost more of a task specification, management skill. I feel like a lot of people can pick that up really quickly. I've been really excited about that, just seeing so many people get access to these models that just seemed like you had to have a PhD in machine learning to work with before.
Lukas:
I feel like I've heard of people talk about a new role called "Prompt Engineer" that might be related to this. Figuring out how to prompt GPT-3 to get it to do what you want it to do.
Peter:
This one is interesting because...early on when we had the first version of the API, we had a really smart guy who is a world-renowned author, but also a programmer: Andrew Mayne. He was one of the early users of the API and he got the internal name of "the prompt whisperer," or "GPT-3 whisperer". He really knew how to craft the prompts to get the best results. Since it's been trained on the internet, you kind of need to put your mind in like, "How would the text on the internet start?" If you wanted a really good recipe, you have to start writing in the tone of a recipe book or a food blog post or something like that. It's not like you could just ask the model to do what you wanted it to do. I think, initially, there was a big piece to that. You really had to be good at understanding the intricacies of GPT-3 and design really good prompts. Over the past one and a half years since we launched, we saw people struggling with this a lot, so we developed a new set of models. We call it InstructGPT, which actually just like last week became the default in our API. The reason we're calling it InstructGPT is because you just provide instructions. So I would say prompt design is a little bit less of a thing now. You could just tell the model what you want it to do and provide a few examples. There's still a little thing about...the formatting might impact how you provide your examples and so on. GPT-3 is super robust to that, but sometimes it does matter a little bit. Some tweaking matters. But I would say it's less of a thing now than it was a year ago. And my hope is that it becomes less and less of a thing, and it becomes much more interactive.
Lukas:
You've also launched the ability to fine-tune the models. What's the thinking there and where's that useful?
Peter:
The surprising thing with GPT-3 was that you got really good results zero-shot, where you only provided an example...no example, just the instructions of like, "Hey, translate this sentence from German to English." Or you provided few-shot examples, where you provide a few pairs of German and English. With just a few-shot examples, you could get surprisingly good results. But what that meant in practice is that...the accuracies are very task-dependent. For some tasks, maybe 30% of the time you got an output that was kind of acceptable to put in a product. And then for other tasks that were more simple, you'll get it like maybe 70% of the time. When it's not good every time, you have to be very clever in the way you can expose it in your product. That's why, for example, it worked well for a lot of those copywriting companies. You could just provide a few examples and you kind of knew that at least one of them would be good, and that's all the user needs. But with fine-tuning, what you can do is basically...you can customize your model. You can provide it more examples of the inputs and outputs you want it to do. If you want to do translation, or if you want to summarize articles, you can provide a few hundred examples of articles that have human-written summaries, and you can actually update GPT-3 to do much better at that task. You couldn't put all those examples in your prompt. The prompt has limited space. But with fine-tuning, you're working these examples into the connections of this neural network, into the weights of the neural network. In some way you have like an infinite prompt. You can provide as many examples as you want. Obviously, the more examples, the longer it will take to fine-tune and the more costly it will be. But fine-tuning is basically that concept of taking a bunch of input and output examples, and kind of working them into the model, and getting a new version of the model out that's really good at that task for which you provided examples. It turns out with only a few hundred examples — or around a hundred examples — you can get significant boosts in accuracy. We have a number of customers that have used it. Like Keeper Tax, they're analyzing transactions to find these tax write-offs and stuff like that. What they're doing is they're extracting the relevant pieces of texts, they're classifying, and so on. They fine-tuned models and got much, much better results with fine-tuned models, for example. We've seen that over and over again with our customers. They can get really good results that can often be good enough for a prototype, but then in order to get it to high enough accuracy to put it in production — which is usually more than 90% or 95 or 99% — fine-tuning on some datasets that they have, or they put together, gets them all the way. That's enabled many more applications than you could do before. So we just made it very simple to do this kind of fine-tuning.
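For a sense of what that workflow looks like in practice, here is a sketch using the legacy OpenAI Python client's fine-tuning endpoints: prompt/completion pairs go into a JSONL file, the file is uploaded, and a fine-tune job is kicked off. The file name, base model, and example data are assumptions for illustration, not details from the episode.

```python
import json
import openai  # legacy, pre-1.0 client assumed

openai.api_key = "YOUR_API_KEY"

# A few hundred input/output pairs, one JSON object per line.
examples = [
    {"prompt": "Article: <full article text>\n\nSummary:",
     "completion": " <human-written summary>"},
    # ... more pairs ...
]
with open("summaries.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Upload the dataset, then start a fine-tune against a base model.
training_file = openai.File.create(file=open("summaries.jsonl", "rb"), purpose="fine-tune")
job = openai.FineTune.create(training_file=training_file["id"], model="curie")
print(job["id"])  # when the job finishes, it produces a fine-tuned model name you can call directly
```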

Logging GPT-3 fine-tuning projects to W&B

Lukas:
Cool. And you know, I have to ask you about the Weights & Biases integration. I mean, we're so excited about it. I don't know if people listening would know that you used Weights & Biases from the very early days and provided a ton of incredibly useful feedback that's in the product. But I was curious how you thought about how that integration might be useful for users of GPT-3.
Peter:
So, this is the background of my usage of Weights & Biases. I was one of the first users and it just improved my research workflow so much that I'm a big Weights & Biases spokesperson now. Basically what it does, right, is that it allows you to track your experiments in a really nice way. As you're training your models, you can get all the stats. Anybody who has trained machine learning models knows that you have to look at a bunch of curves as you're doing your training, to make sure that the models are learning in the way that you want. A lot of the work you do as a machine learning engineer is to do that sort of iteration on your models and see if you can improve your results. And a lot of that is looking at those learning graphs and so on. It's really good because Weights & Biases provides you with this history of the experiments you've run. They let you compare experiments and let you track your progress and share it with your teammates and so on. What we did is basically make an integration, so that as you're fine-tuning your models — your GPT models — via our API, all your experiments, all your training runs show up in the Weights & Biases interface. You get that same convenience, but now for things that are training in our clusters. You can see as our fine-tuning process is happening — as the model is updating its weights based on each new iteration of going through the dataset — you can see your metrics, and so on, improve. You can also...we provide a number of different parameters, so it lets you iterate and try out different parameters and see your progress. It's just much more delightful to train your models that way, to have that place where you can go and look at your results in an ongoing way. That was a super exciting integration for us. It lets you keep track of all your fine-tunes in a much better way than...we have a command line interface, it's not at all as pretty as the Weights & Biases way of tracking things.
Lukas:
Boris, you actually said you did the integration. You said it was one line, is that right? I mean, my question for you is more how you thought about how it might be used, but I'm curious. Was it really a one-line integration?
Boris:
There's a few more in the code, but the way for the user is just to type a line, to type "openai wandb sync", and it can automatically sync all these runs to a dashboard. The idea was that there's a lot of people who use the API that are not ML engineers, so you don't want them to have to learn, "Okay. What am I supposed to log? How do I take care of a dataset?" The OpenAI API, it was so convenient. When you want to train a model, you just pass a file that is your dataset and it cleans up the dataset, and then you pass a new command and it fine-tunes everything. It was a bit the idea of keeping the same simplicity. You will just type that one command, and then all the magic happens behind the scenes. You have all your visuals and you can compare your models and see, "Is it worth giving more training samples? How much did my model improve from that? What is the effect of tweaking that little parameter here? What dataset did I have when I trained that model?" It's trying to make it as easy as possible for users to benefit from all the features when they don't necessarily know Weights & Biases initially.
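The one-liner Boris mentions is a CLI command; as a sketch, it can also be driven from a Python script by shelling out, assuming both the `openai` and `wandb` packages are installed and their API keys are configured.

```python
import subprocess

# Runs the sync Boris describes: it pulls your past fine-tune jobs from the
# OpenAI API and logs them as runs in a Weights & Biases project, so you can
# compare datasets, parameters, and results in the dashboard.
subprocess.run(["openai", "wandb", "sync"], check=True)
```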
Lukas:
I guess for both of you, what are the parameters that you can actually tweak? Because the way you've described it, it sounds to me like there might not be any parameters. How do parameters get involved here?
Peter:
Before I answer that question, one thing that Boris said that really stands out to me...why I really liked this integration generally was that there is this concept of just making these advanced things very simple. I still remember when Lukas, you, Shawn, and Chris did the first Weights & Biases demo. It was basically just like "import wandb" to just start logging an experiment. I think that philosophy of just making it super simple to get going is something we have tried to also do in our API. You "import openai" and then a single API call in Python or JavaScript gets you to use GPT-3 and start creating completions and stuff. I really liked that simplicity, and that's what we tried to do within this integration. But, to your question about the kind of parameters, we tried to make this quite simple in our API. We tried to make the defaults very, very good. Generally, you can get really good results with fine-tuning without fiddling much with our parameters at all, but some make more of a difference. You can set, for example, the learning rate. That's how much you're updating the weights with each learning step. You can set things like how many passes you want to go through the data. It turns out if you go through the data too many times, then you're going to overfit on your dataset. These GPT-3 models being really big, you often only need on the order of two to five iterations through your data to get really good results. If you go further than that, you sometimes overfit. There are more advanced parameters as well, but I kind of feel like playing a bit with the number of epochs you want to train it for and the learning rate, that gets you 90% of the way there. If you start fiddling with other parameters, it's not going to give you that much more.
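A sketch of how those couple of knobs might be passed when starting a fine-tune, again with the legacy Python client; the parameter names (`n_epochs`, `learning_rate_multiplier`) and the values shown are assumptions about that client's fine-tunes endpoint, not figures from the episode.

```python
import openai  # legacy, pre-1.0 client assumed

openai.api_key = "YOUR_API_KEY"

# The defaults are meant to be good; these are the two knobs Peter says tend to matter.
job = openai.FineTune.create(
    training_file="file-abc123",   # ID of an uploaded JSONL dataset (placeholder)
    model="curie",                 # illustrative base model
    n_epochs=4,                    # roughly 2-5 passes over the data is usually enough
    learning_rate_multiplier=0.1,  # scales the learning rate used during fine-tuning
)
print(job["id"])
```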
Lukas:
Was part of the thinking of leaving the parameters in to just give the person...you can get the joy of messing with parameters?
Peter:
Honestly, I would love it if it was completely automatic. That said, we do have a number of more research-oriented customers who really do like the fiddling. So I think it would be hard for us to remove it. But, as I said, we have these kind of two camps of users. The researchers and the developers. Developers keep telling us like, "Hey, I just want one button. I just want the best model to come out." And then a lot of the researchers want to fiddle more with the parameters. I think we can probably satisfy both for a long time.
Lukas:
Boris, I don't know which category you put yourself in. You make some amazing, beautiful demos, and you also love to tweak parameters. I'm curious your experience playing with the GPT-3 model.
Boris:
I definitely like having a good default, because initially you don't really know what you should change on it. Let's say you would choose the wrong parameter and nothing works. It wouldn't be a nice experience. So I like that if you don't choose anything, it's already going to be pretty good. Then, I really like to tweak the parameters to see, "Okay, what would be the effect?" and try to play with intuition. In addition to the parameters that Peter mentioned, there's two that interest me a lot too. You can decide which model you fine-tune. There's models of different sizes. If you use a larger model, maybe your API is going to be a bit slower but your [?] will be better. Maybe sometimes you don't need it, maybe sometimes indeed. So I like to see the effect of which model I use. I like to also see the effect of "How many training samples can I give?". Like if I give only 20 samples, versus giving 100 or 200. Because then it gives you an idea on how much my model is going to be better as I develop a larger data set. There's all kinds of parameters I like to play with and see what are the predictions based on these.
Peter:
Yeah, that last one, it's actually super important. I think it's one of the most common pieces of advice we give people over and over again. It's like, start with a small set of examples, then double it and see how much of an improvement you get. You usually...if you double your amount of training data, then you get to see some linear improvement in your error rates. So if you have 10% error rate or something, and you double your training data, you're going to get down to maybe 8% error rate. And then you double it again, you get down to 6% error rate, and so on. If you can start seeing that trend, then you can suddenly get a sense of, "How much would it actually cost me — in terms of labeling more data and so on — to get the result that I want?" and so on. It's a very powerful thing to do.
Lukas:
Are the results of training these models reproducible? How much variability is there each time you fine-tune it? Would you get the same model if you fine-tuned on the same data two different times?
Peter:
In principle, you can set it up to be quite reproducible. If you basically train it on the same data...basically what you want to do when you train, is on each training iteration you have a batch of data, like a number of examples. You can actually...in the API you can set the batch size, how many examples per update you want. I think it defaults to 32 or something like that. When you do that, you also want to shuffle the data. You want to take a random sample of your training data. As long as you keep those randomizations consistent between your training runs, you're essentially gonna get the same model at the end of it. It's going to be fairly reproducible. The only caveat is that, in practice — this is true even for inference. We have a parameter called temperature where you can set the variability in the output. Higher temperature, the more variability — even if you set it to zero, there's no real guarantee that you're going to get completely deterministic output. There's enough noise and a little weirdness with floating point arithmetic and so on in these GPUs with these really big models, that it's very hard to guarantee complete determinism. We get people asking about that a lot, and the answer is always like, "Well, unfortunately we cannot provide that, but you can get something that's fairly [?]." But you should just make your experiment robust enough that you don't really care too much about the determinism.
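To illustrate the temperature knob Peter mentions, here is a small sketch: sampling the same prompt twice at temperature 0 usually, but not always, returns identical text, for the floating-point reasons he describes. The engine name and prompt are illustrative assumptions.

```python
import openai  # legacy, pre-1.0 client assumed

openai.api_key = "YOUR_API_KEY"

prompt = "Summarize in one sentence: GPT-3 is a large language model trained to predict the next token."

# temperature=0 makes sampling greedy, which is close to deterministic,
# but identical outputs across calls still aren't strictly guaranteed.
for _ in range(2):
    out = openai.Completion.create(
        engine="davinci",  # illustrative model name
        prompt=prompt,
        max_tokens=30,
        temperature=0,
    )
    print(out["choices"][0]["text"].strip())
```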

Engineering challenges behind OpenAI's API

Lukas:
I would think, operationally, having everyone have their own fine-tuned model would be much more of an infrastructure challenge than everybody using the API that hits the same model. Has that been a big undertaking to allow that to happen? Like, do you have to swap in and out of the different models as people start to use them?
Peter:
Yeah, no, for sure. When we started out, the way we did fine-tuning was basically...in some way, you almost rented a set of GPUs that the models ran on. For some of the absolutely earliest fine-tuning customers, we essentially charged them by GPU hour, to some extent. Like per hour, how much they were using the models. Even from the very beginning — I think like within six months after launching the API, we had a few select customers that had fine-tuned models and stuff like that — that's sort of the way it worked. The problem with that is, if you're trying something new, GPU hours are expensive. You don't really want to pay to reserve a GPU for even a fraction of an hour. It just adds up really, really quickly. We just set a goal of saying, "Well, as soon as you have fine-tuned your model, you should immediately be able to just use that model, and you should just have to pay for basically the tokens that go into it at inference time." Like, whatever you put in your prompt. That was definitely a huge engineering challenge to make that experience really great. You just kick off your fine-tune, and when it's done you get a fine-tuned model name out. Now you can use that model in the API to just get a result immediately. And you're not going to be charged by the hour or whatever, you're just going to be charged the same way you're going to be charged for the API. That was really tricky. We have an amazing engineering team at OpenAI that really figured out a lot of tricks around balancing where these models end up, and caching them in the right way, and so on, to create a great experience around that.
Boris:
I'm curious if you fine-tune the entire model or you fine-tune just part of it to make it more efficient.
Peter:
There's just lots of tricks that we're using to make this happen. We're constantly trying to figure out new ways of doing it. There are challenges if you want to fine-tune a whole 175 billion parameter model. It can get really expensive and hard and so on, and there are tricks you can do to make it much faster.
Lukas:
Do you feel like the thing between you and everyone using GPT-3 for natural language tasks is more quality and performance of the model itself? Or is it something else? Is it something about integration, or monitoring in production, or something like that?
Peter:
Definitely the key things we focused on when we built the API were...what matters the most is really the capability of the models. Then number two is like, you need to have fast inference. Before we created our API, for language models nobody cared about the inference. Everybody just cared about how quickly you could train them, because that's what mattered, you know? So you could get your benchmarks resolved at the end of the day. We did just a ton of engineering to make inference super, super fast. I can remember over the course of the first few months of us getting the first prototype of the API to a customer starting to use it, we increased the inference speed like 200-fold or something like that. Lots of effort went into making that super fast. The third thing is safety. One of the reasons we invested in these InstructGPT models is that we saw that sometimes you can get surprising outputs from the models that you don't expect. For example, you might write a very innocent sentence and it might turn very dark for some reason, or you might get some biased outputs in different ways. With our instruct-oriented models, by default they behave in a much more expected way, but you can also specify the behavior in a much better way. It turns out safety and capability come hand-in-hand...it just becomes a better product when you can control it better. Those are definitely the things we have focused on, and I think we're doing much better on those than the alternatives that are out there. But there's also one more thing that we have put a lot of focus on, which is just making it really simple to use. The fact that you don't have to load up models, that calling a fine-tuned model is just a single line of Python against the API, that's also been really central to us. We want this to be easy to use by everyone.

Outro

Lukas:
Awesome. Well, thank you very much. It's really nice to talk to you and congratulations on making such a successful product. If you're enjoying these interviews and you want to learn more, please click on the link to the show notes in the description, where you can find links to all the papers that are mentioned, supplemental material, and a transcription that we worked really hard to produce. So check it out.
