Emad Mostaque — Stable Diffusion, Stability AI, and What’s Next
Emad shares the story and mission behind Stability AI, a startup and network of decentralized developer communities building open AI tools.
Created on November 10 | Last edited on November 19
About this episode
Emad Mostaque is CEO and co-founder of Stability AI, a startup and network of decentralized developer communities building open AI tools. Stability AI is the company behind Stable Diffusion, the well-known, open source, text-to-image generation model.
Emad shares the story and mission behind Stability AI (unlocking humanity's potential with open AI technology), and explains how Stability's role as a community catalyst and compute provider might evolve as the company grows. Then, Emad and Lukas discuss what the future might hold in store: big models vs "optimal" models, better datasets, and more decentralization.
🎶 Special note 🎶
This week’s theme music was composed by Weights & Biases’ own Justin Tenuto with help from Harmonai’s Dance Diffusion. Learn more:
Connect with Emad and Stability AI
Links
Listen
Timestamps
Transcript
Note: Transcriptions are provided by a third-party service, and may contain some inaccuracies. Please submit any corrections to angelica@wandb.com. Thank you!
Intro
Emad:
We have to decide what should be open and a public good, and what should be closed. This is not from a business perspective, but from a societal perspective. Should the tools that allow anyone to be creative, anyone to be educated, and other things like that be run by private companies? Probably not.
Lukas:
You're listening to Gradient Dissent, a show about machine learning in the real world. I'm your host, Lukas Biewald.

Emad Mostaque is the CEO and co-founder of Stability AI, which is one of the most exciting companies in the AI space right now. Before that, he was a hedge fund manager, and before that he was an engineer and analyst.

This is a super fun interview. I hope you enjoy it.
The story and goals behind Stability AI
Lukas:
Alright, do you mind if I just go “rapid-fire” questions?
Emad:
Yeah, sure. Go for it.
Lukas:
All right.
Emad:
Good to see you, Lukas.
Lukas:
Good to see you, Emad.

Well, I think we need to start with defining, in your words, Stability. I think everyone probably has heard of it, but everyone seems to have a slightly different impression of exactly what the company is and what it does. So let’s hear from the source directly.
Emad:
Yeah, so our official mission at Stability is to build the foundation to activate humanity’s potential with a motto of, “Let’s make people happier.”

Stability was basically set up in the belief that these new models that we have — these transformer-based models and similar — are essential for basically unlocking people’s potential in some of the most powerful tech that we’ve seen, and the belief that having them open source so people could build on them and use them was not only a great business model but essential for closing the digital divide and getting this out as widely as possible.

So, we basically catalyzed the building of open source AI models, and then we take those models and we scale and customize them for customers. And that’s what we do.
Lukas:
How did you get started with this? Your background isn’t originally in AI — or is it?
Emad:
So, actually, I started my career in math and computer science at Oxford. I was an enterprise developer in my gap year. Then I did hedge fund managing for many years. So I was a huge AI and video game investor. But then I took a break when my son was diagnosed with autism, and I used AI to do drug discovery. So, biomolecular pathway analysis of neurotransmitters and literature review to repurpose drugs to help ameliorate some of the symptoms, while advising a bunch of hedge funds and governments on AI and tech and geopolitics, et cetera.

Going through that experience — that was about 12 years ago that I started that — it was super interesting. And then we saw that a lot of the technologies were evolving, but not until the last few years has it retaken off, obviously. So, I went back to running a hedge fund after that, and it was fine.

And then a couple of years ago I was one of the lead architects of CAIAC, which was Collective and Augmented Intelligence Against COVID-19, which launched at Stanford in July of 2020 to take the world’s COVID knowledge and then use AI to compress it down and make it useful. That’s when I first really got exposed to, again, some of these new types of models.

I was like, “Holy crap, this is huge. And they’re getting good enough, fast enough and soon cheap enough to go everywhere.” And, “Does it make sense that all this tech that’s so amazingly powerful is going to be controlled by big companies and they believe their edge is that?” Not really. Let’s go away from that.

So, I’ve got some AI experience and others, but mostly what I do is see big pictures and big patterns and then put them together. A bit of mechanism design, as it were.
Lukas:
That’s cool. You’ve had such a meteoric ascension in the collective consciousness. I’m curious, has it happened exactly how you drew it up or has it been surprising? Like, when you started the company, what were you thinking? Because it wasn’t even that long ago, right? And then how has it unfolded differently than what you’d expected?
Emad:
Yeah, so we had the idea of Stability three years ago. The first thing my co-founder and I did was we took the Global XPRIZE for Learning, which is a $15 million prize, for the first app that could teach literacy and numeracy without internet. That was backed by Elon Musk and Tony Robbins. We were deploying tablets into refugee camps, saying, “What happens if we use AI to make this better and more powerful?” We didn’t use AI yet, but we just finished our RCTs showing 13 months’ worth of literacy and numeracy gains from one hour a day of teaching for refugees in camps.
Lukas:
Wow.
Emad:
There’ll be some big announcements about the “AI-ification” of that next year. But that was, like, a fuzzy one.

Then, we set up Stability properly two years ago to do the United Nations-backed AI work on COVID-19 and fell into a lot of bureaucracy and other things. We really kicked off properly literally a year ago. I think nobody at that time would have expected that it would have gone like this.

Like, originally, we helped support the communities at Eleuther and LAION and others. And the thinking was, like, this is a web3 DAO of DAOs. Like, “Let’s reward all the community members and get them together.” But then, after a month or so, we realized that commercial open source software at scale and as a service was the way.

And, while I was funding the entire open source art space, I thought it would be at least until next year that we got anywhere near the quality that we’ve seen now. So I think there’s that pace of compression of knowledge and the ease of use and being able to get it onto people’s devices. But that surprised me, because I thought it would be another couple of years, at least, before we got there.

But, I think that’s been the main catalyst, right — Stable Diffusion being the first model that is good enough, fast enough, and cheap enough that anyone can run. Like, it’s a two-gigabyte file from a hundred thousand gigabytes of data. That was the insane thing that I think has allowed it to go off massively.
Lukas:
Is it an accident that the name Stable Diffusion and Stability AI are connected like that?
Emad:
Well, so, this is an interesting thing. What’s the actual role of Stability? We’ve got over a hundred people. We’ve got some amazing researchers. But our role is a catalyst in the community, right?

So with Stable Diffusion, it built on the work of the CompVis lab, formerly at the University of Heidelberg and now at LMU Munich, under Björn Ommer. And so the two lead authors of Stable Diffusion were Patrick Esser at Runway ML and then Robin Rombach, who works with us.

They came up with the name all themselves, because — we provide compute and infrastructure support, and then obviously [??] themselves there — but we always try to give developers lots of flexibility, especially when working in these collaborations. It does get complicated there. We can discuss that a bit later. And they came up with it, and I was like, “Yeah, I love that name. Go for it.”

But at the same time, there is this inherent tension, because a lot of people want us to manage the whole community, but that’s not how open source works, right?

The whole thing about open source is that there’s lots of different things, even if you’ve got, like, Linux or Red Hat or something like that. And for models, it’s also a bit different. Because with normal open source software, you have loads and loads of contributors. Like hundreds. Thousands. You don’t really have that for models.

You can do the whole thing just with a team of two to ten people. Or, if you’re like lucidrains, you do that all by yourself. He’s one of the developers that we support. He just cranks out models every day. If you’re a programmer that wants to feel bad, go and look at github.com/lucidrains for productivity.
Lukas:
All right, I’ll put a link in, but I don’t want to look at it right now. I’ve been very unproductive over the last few years.
Emad:
Yeah, it’ll make you feel terrible. Like, “Ah, jeez.”
Lukas:
I was curious about exactly, like, how that interaction works today with people building models. What’s your way of working with folks?
Emad:
So yeah, I think that the best way is always collaboration. We have our supercomputer cluster here. It was four thousand A100s originally. Now it’s going much, much larger, because I view that as a key unlock, and then the infrastructure support to make stuff usable there.

We had the communities that were spinning out into independent foundations like Eleuther and others, where we provide employment and benefits and equity, et cetera. And then collaborations with academia and non-academic independent researchers.

I think the goal for the open source side of things is to put a lot more structure around that, so everyone knows when stuff is meant to be released, what happens if you’ve got ethical concerns, and things like that. But again, really be a catalyst for the community.

Some of the models you’ll see released over the next period are entirely Stability models. Some of them are combination models, but we want to make sure that these things are clearly defined, because otherwise people get sad. And it’s understandable, as well — attribution should be given.

One of the unique things that we have brought in, though, is that we’re building an entire infrastructure to be able to scale and train these models. And if we do inference on any open source model, we actually put aside 10% of the revenue from that for the developers.

So 5% goes into a community pool, which will be activating in a month or two, where every developer affiliated with Stability can vote to allocate to the coolest research they can find. And the other 5% goes to the developers themselves, even if they don’t work at Stability.

So again, we’re really trying to give back a bit to the community and recognize the authors — and they can donate it or whatever — from that angle, and trying to make it so it’s clear how we interact with these.

Because we are the fastest providers of compute, technical support, and input of anyone in the market.
You could access supercompute before, but it was only really through these giant clusters with, like, 6- to 12-month application processes, from JUWELS — which is pretty good — to Summit — which is much more bureaucratic — and some of the others. And that obviously doesn’t keep pace with the pace of AI development now, which is literally exponential.

This is why… what happened is that a lot of academics basically had to leave, either to their own start-ups — which, as you and I both know as CEOs, is incredibly difficult — or to join a big tech company, which isn’t so much of an option anymore given the freeze that’s going on. And that was it. And then, that doesn’t fit with academia.

So academia is one area that we’re supporting in general. And again, I think compute is the key unlock there.

But over time, it’s going to be increasing the infrastructure side of things and having standardized stuff. Like, right now, not everyone uses excellent tools like Weights & Biases, for example, to track their runs. We would like to move to more and more open runs so you can actually see how they’re doing, like BLOOM did with their updates, et cetera. So there’s a lot of work to go, but we’re trying to be as collaborative as possible.
What it's like to get funding from Stability AI
Lukas:
Say I’m a researcher and I have an interesting area of work, and I’m looking for infrastructure support.

How do I apply to Stability, and how would you view my application? Like, what would you consider? How would you decide whether or not to fund it and how much to fund it?
Emad:
So the way that we do it at the moment is that if you’re an active member of any of the communities — from HarmonAI for music, to Eleuther for language models, to LAION for images — you’re most likely to get compute that way.

And that can be from an A100 up to 500 A100s, depending on how good your thing is, particularly if you bring in members of that community as your team. That’s the primary way.

Right now, we’re setting up a grant-making portal, and we’re working with certain universities in that regard, but then also trying to figure out how we do, like, large clouds of almost “Google Colab on steroids” to allow people to unlock things from day one.

This fits in, as well, with the next stage of our program, which is that we’ve funded a handful of PhDs so far who’ve been active members of the community. We’re planning to fund 100 in the next year. And they will come with dedicated compute support for their labs and their projects, as well. And there’s an independent board being set up for deciding that because, again, one of the tensions is always going to be our business side versus the broader side.

Like, why are we funding OpenBioML? Because it’s useful. There’s no business logic to it at the moment. But we want to keep that mix of supporting the entire ecosystem so we have a nice place in it and then focusing on some of the business stuff, which is generative media at the moment.

So I’d say for the moment, generative media: if there’s anything interesting, you can just reach out on the communities, and we fund most things in there. The other stuff, we’re building up the infrastructure, but just join those communities. Join OpenBioML and the other communities and contribute.

That’s the best interview of all, right? You’re more likely to help people who help your communities.
Lukas:
And then, like, what’s required of me? Say I’m someone with a new idea for generating, like, awesome music. Does that mean that I need to contribute my model to the community after it’s done training?
Emad:
No. We encourage open source, but a large part of it is open access, as well. Like, we have an incubator arm coming, a VC arm, and others for those who don’t want to go open source, but we heavily encourage open source.

I think not everything needs to be open source. What needs to be open source is the benchmark models. It’s like, “leave nobody behind,” but the reality is that open source will always lag closed source.

Midjourney just released version four, which is amazing, right? And DALL-E 3 will come out soon, which will be even more amazing. Why? Because they can take open source bases and go ahead, or they can just do something different.

So, Midjourney version four was completely different, but Midjourney version three with Stable Diffusion was a mixture of the two. So you will always get this iterating, where open source will lag behind. We’re just trying to make it so the lag is minimal and people start on that same basis.

But for people who come and use our cluster, the priority for the first cluster is open source, but we’re going to have more clusters that will also be for the companies that we’re incubating, our own use, and other things like that. Yeah.
Stability AI's short- and long-term business models
Lukas:
How do you think about the sort of broad buckets? It sounds like you do it by use case.

It seems like you’re good at recognizing larger scale patterns. Do you have an opinion between the value of investing in infrastructure for audio generation, image generation, these large language models? How do you even approach that question of allocation?
Emad:
Right now, I would say, from a business perspective, media is by far the most lucrative, and that can fund a lot of other stuff.

So Google and Meta have amazing research labs that they fund through advertising. That’s basically… we all hate advertising. Advertising is manipulative, and particularly with these new models it has become even more manipulative. The area that we’ve focused on is the world’s content. So audio, video and others, those will all be in foundation models in the next five to 10 years, and we’re focusing on that to fund everything else.

I think that’s a reasonable model because the Disneys and Paramounts of the world will eventually have to transform their entire archives. Like the VHS-to-DVD uplift on steroids, because you know how difficult doing these models is.

So that’s our core focus from a business perspective. From an impact perspective, it’s not more difficult.

This is also why, like, one of the things we’ve done now is, again, within a year, we’ve built this giant cluster. So 4,000 A100s isn’t the largest private cluster, but on the public TOP500 list, it’s in the top 10, probably. Like, JUWELS Booster, with 3,744, is number 11. The fastest supercomputer in the UK, Cambridge-1, is 640. The same with Narval in Canada, for example, and NASA’s got about the same. So this is a chunky old beast.

The reality is that it should be a public good eventually, and there is a national research cloud discussion led by Stanford and a bunch of others that says this is needed for US universities. I think it’s needed for international universities.
And so hopefully we can figure out a way to transfer over there with this value function that you’re discussing, because otherwise it turns into fiefdoms.

Right now it’s quite a centralized thing, where we’re just like, “What can be most beneficial for the community and attracting assets to the community?” And this was media for us.

We’re still doing the LM training, but large language models, I think, are less impactful, because language was already 80% there and we’ve gone to 90% there. Whereas a lot of this image stuff was like 10% there and suddenly we’ve gone to 80 and now 90% there, and so it’s a lot more immediate for people.

This brings us to the final bit, which is that the nature of these models — and the data that they run on — is that they can do just about anything. So if you have them converging in terms of quality from different players and then an open source version, where’s the value?

The value can’t be in models if they can do anything, right? The value has to be elsewhere. And so that’s going to be very interesting to see, especially from, like, say, the societal value versus business value.
Lukas:
But what’s interesting is, as far as I could tell, the main thing that you’re doing — the thing that you’re really passionate about — is democratizing access to creating and opening up these models. So if the value isn’t there in your mind, how do you think about creating a long-term sustainable business?
Emad:
Basically, the value is in going into Hello Kitty as a business and transforming all their assets into interactive ones — it can be for the metaverse, it can be for new experiences, it can be for wherever — and then building tools to enable them to access their models and other people to access their models, piping it around the world.

Our main play as a business is basically content: helping big companies with that and then helping everyone else through the software that we’ve built. Like, DreamStudio Lite is just a very basic piece of software. DreamStudio Pro — that’s going to be released in late November — is a fully functional animation suite with storyboarding, and fine-tuning capabilities, and the ability to create your own models, and other things like that.

So again, being in that infrastructure layer and allowing the infrastructure to be usable is where we’re at. Plus, of course, our APIs, which are industrial scale. We’re negotiating the cost down and down and down, because the data on how models are used, as many people on this call will know, is as useful as the models themselves.

Because then you can instruct them, and you can guide them. The Carper team, led by Louis, has done an exceptional job in releasing the first open source instruct model framework. And now we’re training new models to be able to instruct them across modalities, as well, based on some of this data.

So, I think that’s where the sustainable edge is: a mixture of content and a mixture of experience. And the content, to give you an example: we have a deal with Eros in Bollywood in India, which is the Netflix of India, with 200 million daily active users.

All the Bollywood assets are going to be converted by us. And then all the music will pretty much sound the same, but it’s like… that data will eventually be converted; we’re just doing it five years before anyone else otherwise would have.
Lukas:
Sorry, when you say “converted” — converted into what?
Emad:
So, you take all the Bollywood music, and then you have a text-conditioned audio model that can generate any Bollywood music. And that doesn’t need to be open source, as a business thing. But then we can use the open source Dance Diffusion models and the new text-conditioned ones we’re working on to be the framework for that.

It’s like, you go and you do a MySQL database with someone, and they load their data into it, right? And they’re like, “Okay, well, I’m paying you to implement this,” because that’s MySQL’s model, or PostgreSQL’s model, or any of these other open source database providers’ or service providers’. And that commercial software model is very well established.

There’s an extra wrinkle in this, in that they load their data into a model that then converts it into a couple of gigabytes that they can then use for their internal processes and then external things.

The extra wrinkle is that it’s hard. It’s hard to train these models. Even to fine-tune these models isn’t that easy. We’ll make it easier. And the pace of model development means that they have to retrain every so often, as well.

Until, I think, in image, you get to a steady state in two years. In video, probably three to four years. Audio is probably about two years, as well, for having a standard model in the space.
Lukas:
But what are you doing for the Bollywood application today? What’s the conversion that’s happening?
Emad:
Oh, it’s just, like, interest, right? Like, Bollywood is just… well, we can’t discuss it because we haven’t announced it, but basically, it’s more TikTok-type stuff and Snapchat-type stuff. And the things that you’ve seen with the use of Stable Diffusion right now are image-based, quite static — but it’s inevitable that entire audio tracks and movies will be created using this technology. Not zero-shot, but in a pipeline of different things.

This is what we’ve seen with, like… you take EbSynth, Koe, and Stable Diffusion, and you can map a monster onto your face with the full thing. That’s the type of thing that we’re thinking around this.

Most of the Bollywood stuff is going to be used internally now to save costs on production. And then over the next few years, you will see it go from cost savings to new revenue streams as people have new and interactive experiences across modalities.
How Stability AI thinks about research and product
Lukas:
So when you think about your own internal allocation of resources — humans, right — you have about a hundred people, you said? Is that right?
Emad:
Yeah.
Lukas:
How do you break down, like, who works on the foundation models versus who works on the commercialization? Or is that even the right way to think about what you’re doing?
Emad:
We split it into two, basically, whereby the researchers — who are open source researchers — actually have it in their contracts that they can open source anything they create unless we specifically agree otherwise. And they’re given a lot of independence and a lot of free rein to make mistakes.

So you could say we went overboard on compute, but that’s what allowed us to experiment with different things. And we’ll continue to ramp that up, because a lot of researchers are constrained by compute and other resources. So, it’s like one training run and they’re done, or they’ve only got like a 50% buffer or something like that.

We thought that was the wrong way to have breakthroughs.

Separate from that are the product and deployment teams — like, the customer solutions teams — because we don’t want product to influence research too much.

Like, people are aligned in that they want to create a great business so it can be self-sustaining. But when you have product influencing research, you get bad outcomes. So the product team does its own thing. They work closely with the research team.

And they have discussions to influence at a high level, but there’s no forcing function. You know? So it’s not like you have to have a model ready by this deadline in order for this product release. Because if you do that, you will never have proper research.

So, that’s one of the ways that we split it out, and there’s infrastructure that supports all of them.
Lukas:
Do you worry about someone else coming along and taking your open source models and then building their own rival applications to yours?
Emad:
I really hope that other people release more open source models! That means that I don’t have to, right?
Lukas:
True!
Emad:
Because again, our role is to help grow these communities, and it’s to provide the support for people doing that. So, if someone wants to come along and create their own model, we can provide compute for them.

There’s a lot of different entities that we’re providing compute for — who people would see as competitors — because I think this whole market just grows massively. Like, with Midjourney as an example on the art side, I gave a grant for the first A100s for the beta.

And I said, when Stable Diffusion launched, they would be better than we are. You know? And it’s fantastic they are. Other people have had issues with quotas and other things. I’ve stepped in to try and help them, even though they might be viewed as competitors on the API. So again, I think the whole market will just grow massively.

The key potential displacement point for us is basically another company coming and doing exactly what we do and supporting the community in this very strange way, and being decentralized and having this division.

But then it’s like, why wouldn’t you just stick with us? I think our replacement cost is quite high. And the role of our company will change in the coming years.

So now we’re a catalyst to make sure and force people to go open, as a forcing function. In a few years’ time, we’ll be more of a services company that is building national-level models for the Indians and Filipinos and other things, and for the largest content providers.

And then I hope, over time, we move into being an AI platform — just making AI easy and accessible for everyone. Because all the models will be pushed to the edge. I think they’ll get smaller and smaller and smaller, and you’re seeing custom silicon in, like, an iPhone and all these other architectures, whereby a lot of these models will just be a few hundred megabytes big. And you’ve got your own model, and I’ve got my own model, and we’re interacting with big models in the cloud.

I think that’s a really interesting flip of the internet.
And that’s what we’re aiming for. I don’t think anyone I’ve ever seen is really doing the same. And even if they are, they might as well join us. We’re cool. We’re fun.
Moving from larger models to better datasets
Lukas:
Overall, it sounds like you think models in the cloud will get bigger and bigger, but there’ll be smaller versions of them for ease of deployment and cost. Is that a fair statement?
Emad:
No. I think that if you look at the Chinchilla scaling paper, what basically sounds like more epochs of training actually means better data when you disaggregate it. I think data quality will become essential.

I think the models will become relatively small. But then on the edge, they become even smaller. So it’ll be a hybridized experience.

Like, when you use the Neural Filters in Photoshop, part of the processing happens in the cloud and the rest of the render processing stays on your computer, right? Kind of this hybridized experience — or Microsoft Flight Simulator — will become quite commonplace for running these models efficiently.

But I don’t think that models will continue to scale. Like, we’ll see a trillion-parameter model or something like that. But instead, I think a mixture-of-experts-type approach — where you have multiple models that are good at various things — will be key to this.

Like, right now, on the Stable Diffusion example, you’re seeing people using DreamBooth to create a GTA model, an Elden Ring model, or something like that. That’s an optimal way, rather than having potentially one model that can do everything.

But we’re not quite sure. And DreamBooth maybe isn’t the best way to do it. Maybe it’s hypernetworks or something else. But I think different models for different things and your own personal model — like, a million models — is the better way, rather than one model that can do everything, even though that’s very attractive because it’s like, yeah, let’s just chuck it all in. And we’ve seen this development of skills as you’ve scaled up.

So, I thought scale was everything. Now, I think data quality will be everything, and model usage for instruct models will be everything. And the value is going to shift there, even as compute becomes plentiful enough to allow for ridiculous scale.
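As an aside for readers, the Chinchilla trade-off Emad references can be sketched with some back-of-envelope arithmetic. This assumes the widely cited ~20-tokens-per-parameter heuristic and the standard C ≈ 6·N·D approximation for training compute; the numbers are illustrative, not from the episode.

```python
# Rough Chinchilla-style sizing sketch. Assumptions: ~20 training tokens
# per parameter is compute-optimal, and training compute C ~= 6 * N * D.

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Return (params, tokens) that spend the FLOP budget compute_flops
    under D = tokens_per_param * N and C = 6 * N * D."""
    # C = 6 * N * (r * N)  =>  N = sqrt(C / (6 * r))
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    return n_params, tokens_per_param * n_params

# GPT-3's rough budget: 175B params x 300B tokens => ~3.15e23 FLOPs.
budget = 6 * 175e9 * 300e9
n, d = chinchilla_optimal(budget)
print(f"compute-optimal: ~{n / 1e9:.0f}B params on ~{d / 1e9:.0f}B tokens")
```

Under these assumptions, GPT-3’s training budget would "prefer" a model roughly a third the size trained on over three times the data, which is the shape of the "smaller models, better data" argument above.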
Lukas:
So you predict a reversal of the current trend of people building bigger and bigger models. You actually think they’re going to start to get smaller.
Emad:
Well, I mean, like, InstructGPT, at 1.3 billion parameters, is as performant as GPT-3, right? Similarly, if you look at FLAN-T5 and some of these other models that Google has released recently, which are among the most performant models out there.

Because, like, these are big neurons. You don’t need all of that stuff. Similarly, with the compute scarcity, relatively speaking, we just chucked a lot of random data into these things.

But if you think of these models a bit like the human brain, what’s better: a diet of, like, every piece of media out there, or just the media that you need? Yeah? And what does that look like for these models? We don’t know yet.

Also, it’s moving so quickly that we haven’t been able to keep up. Like, a year ago, if I told you the image models would be like they are now, you’d be like, “No way.” Like, even I can’t believe it, right?

And so this opens up a big question. Like, why is an image model… why is Stable Diffusion two gigabytes and 890 million parameters, whereas you’ve got 175 billion parameters of GPT-3? You know?

What’s the amount of information they can convey? Does it make sense that text is so much bigger than image?
Lukas:
I don’t know. I mean, it seems plausible that it’s bigger than image. I mean, my understanding was that these models — at least the language models — generally get better on a broad set of benchmarks as the model size grows. But I mean, certainly other things matter.
Emad:
No, they do. And again, this has been shown. But then, like I said, the Chinchilla paper showed that they also get better as you train them more, for similar parameters. So, a 67-billion-parameter, five-times-trained model can effectively outperform a 180-billion-parameter model.

But then you see other things. Like, with image models it’s the same. Google has a different type of model called Parti, whereby they scaled it to 20 billion parameters, and it learned language and things like that on the way. But, like I said, Stable Diffusion being this performant at just a couple of gigabytes and 890 million parameters makes you question, “What happens if we start optimizing the data?”

Because we just chucked in an unfiltered dataset, relatively speaking — some of the bad stuff removed — just two billion images into that.

What’s the minimum number of images to have Stable Diffusion-quality output? Is it 12 million? The model that [??] released in December last year, CC12M — that was used for the original version of Midjourney and a lot of stuff — was only 12 million images.

How many images do you need? How much text do you need? And then what effect does that have on the size of the models?

I think it’s not all scaling laws anymore. Even as, like I said, the compute becomes available now to scale infinitely. Like, some of the clusters I see being built are insane.
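A quick sanity check on the file sizes being compared here, assuming roughly 2 bytes per weight (16-bit precision). Real checkpoints bundle extra components such as the text encoder and VAE, so actual files differ somewhat; this is just arithmetic.

```python
# Check the "two gigabytes, 890 million parameters" figure, assuming
# 2 bytes per weight (fp16). Real checkpoints vary; this is illustrative.

def weights_gb(n_params, bytes_per_param=2):
    return n_params * bytes_per_param / 1e9

sd_gb = weights_gb(890e6)      # Stable Diffusion's parameter count
gpt3_gb = weights_gb(175e9)    # GPT-3 at the same precision
print(f"Stable Diffusion: ~{sd_gb:.1f} GB, GPT-3: ~{gpt3_gb:.0f} GB")
```

So the roughly two-gigabyte figure is consistent with half-precision weights, while a GPT-3-sized model at the same precision would be about 200x larger, which is the size gap being questioned here.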
Lukas:
It’s sort of surprising — it’s interesting — your insight, maybe, as you put it earlier, was that people really needed this massive compute to make it broadly available.

But then it’s an interesting contrast to your current prediction that the models will become smaller and more specific. Does that make you have any plans to sort of change resource allocation or the kinds of compute that you want to get ready for researchers?
Emad:
Yeah, I think we basically don’t need to infinitely scale compute anymore. It becomes, then, about dataset acquisition, and we’re building out a couple-of-dozen-person data team to provide the right data for open source research.

I think data quality is underestimated in terms of its importance right now for these models, because people are like, “Scale is all you need. Stack more layers.” And it was difficult to build a cluster, even in the thousands of A100s, just because there wasn’t availability.

But now, you look at next year. I know of three 20,000-H100 clusters that are being built. An H100 is probably about three times as performant as an A100, so that’s like 60,000 A100s. Like, 15 times bigger than our cluster. One of these clusters can probably train a GPT-3 in six hours or something like that.

So compute’s no longer really a bottleneck, but I think what we’ll see, again, is that people will take the standardized models and customize them down and then have a load of different models, and maybe there’ll be one or two more big models.

But I think it’s not about big models anymore. It’s about optimal models, and we don’t know what an optimal foundation model is — across the data, training, and other architectural parameters — yet, because we’ve been so constrained by compute, data, and talent. And each of those is being unlocked right now.
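The cluster comparison is straightforward arithmetic — a sketch, assuming the ~3x H100-to-A100 speedup Emad cites, and a ~4,000-A100 home cluster (the size implied by his numbers, since 60,000 divided by 15 is 4,000):

```python
H100_TO_A100 = 3  # assumed relative performance, per the conversation

def a100_equivalents(n_h100: int) -> int:
    """Express an H100 cluster as an equivalent number of A100s."""
    return n_h100 * H100_TO_A100

new_cluster = a100_equivalents(20_000)  # one of the clusters being built
our_cluster = 4_000                     # implied size of Stability's cluster
ratio = new_cluster / our_cluster

print(f"{new_cluster:,} A100-equivalents, {ratio:.0f}x our cluster")
```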
Lukas:
That’s really cool. What kinds of datasets are you thinking about building?
Emad:
So, like, we’re talking to national governments about, like, national broadcaster data. You’ve got really interesting, highly structured things there that are high quality versus crawls of the internet. And these are public goods that should be available, right?
Lukas:
Sorry, what would that be? I’m not familiar with it.
Emad:
Well, so, you have PBS in the US, right? Like, their data should be available for model creation for academia, right?
Lukas:
Oh, I see. So, you would just acquire that dataset or somehow get a license to make it available?
Emad:
Exactly. To researchers initially, and then hopefully more people because, again, this is public. It’s paid for by the people. So it should be available to the people in various ways.

So if you’re training a model on all the PBS radio station work, and they’ve all got, like, transcripts, you could do it in various different ways. You could create synthetic datasets off that. So looking at some of these media datasets has been quite interesting to us.

But then in other areas, it’s about more than that. So like OpenBioML — we’re doing the usual protein folding, some DNA stuff, and supporting things there. But in bio ML, there’s just a lack of quality data. So one of the things we’ll probably do soon — we’re just deciding on this — is a prize to basically identify what datasets should be built, and then bringing in external funders to help build those datasets. Protein folding was quite good because there was a great dataset, and there was an objective function of quality. And so people could build around that. So you have OpenFold, you have [??] that we’re doing, and other things to make that more and more efficient. Other things in bio ML don’t have that.

Within the language thing, we’re doing the Pile version 2. The Pile version 1 from Eleuther was very widely used, and version 2 is much bigger. With images, we had LAION. The largest image dataset was 100 million images — YFCC100M, which was the Flickr dataset from 2013.

LAION did LAION-400M — which is 400 million image-text pairs — last year, and that was used by Google and Meta and a whole bunch of others in their models. That’s how good it was, because Google and Meta and others are actually constrained about using their user data because of FCC regulations and other things, weirdly enough.
Now they’ve done LAION-5B, which is 5 billion image-text pairs — actually 5.8 — and they’re going to go even bigger.

So it’s creating these big open source datasets, replacing a lot of the scraped lower-quality stuff with some of this public sector data, encouraging others to contribute to it, and then building great datasets for every modality so that everyone, again, is on the same page.

I think we’ve got to the point now where the communities that we support and our own internal teams are building better datasets, in some cases, than even private companies have access to.
Emad's thoughts on time series and Transformers
Lukas:
Yeah. I think one of the disconnects that we see talking to a lot of researchers and companies — of course, there’s a lot of overlap in applications, and deep learning is incredibly practical in lots of ways. But I think a lot of companies are looking for more research around time series and structured data. Do you think about investing in that realm at all?
Emad:
We’ve had some approaches for time series analysis and things like that. I’m not sure these foundation models are the best things for that, to be honest, because I view them more like principle-based analysis in the brain.

Like, with my son — with his autism, ASD — the main thing about that is there’s typically a GABA-glutamate imbalance. GABA calms you down, like when you pop a Valium, and glutamate excites you. There’s too much noise.

And then once you calm down that noise, you do repetitive trial teaching so that you can rebuild things like words, because if there’s too much noise you can’t learn the connections between concepts and words. Like, a cup is a World Cup, cup your hands — all the different cup meanings. And then you rebuild that. These models are the same, in that they can figure out the latent spaces, or hidden meanings, of connections between different labeled datasets.

And with time series and things like that, I’m not sure this is the appropriate thing for that. Again, we’re funding a little bit of research in that area. But I think that a lot of the classical ML things are a lot better at that, because you typically don’t do out-of-sample stuff there. And, like, looking at hedge fund stuff, you are typically inferencing and extrapolating, versus trying to do first-principles analysis of, like, “What is a Van Gogh painting mixed with a Banksy painting,” and these types of things.

But, again, I think 80% of research now in AI — I think this is in the AI Index report that was released by Stanford — is in foundation models. So we’re one area of funding of this and, again, quite focused around media and language. There’s just a whole world of funding around this area.

So if it is useful for time series, I’m sure we’ll find out sooner rather than later. Or maybe we won’t. Maybe they’ll just take it, run a hedge fund and be like, “Hahahaha, get all the money!”
Lukas:
Do you have an opinion on other architectures? Are you seeing anything?

I feel like it’s amazing, the convergence around transformers in so many different applications. Do you see any signs of that changing, or no?
Emad:
Potentially. There are some promising things that I’ve seen. You know, you don’t necessarily need attention, as some recent papers have shown — I’m trying to remember which ones. And there’s some Attention Free Transformer stuff being done with one of the projects that we’re supporting around RWKV on the language model side.

But I think transformers are probably going to be the primary way of things for the next couple of years, at least, just because they’ve got momentum and they have talent. And again, with the commonality of architectures around this, you’re like, “Hey, let’s just chuck it at this or that,” and you’re like, “It works.”

And we’re just scratching the surface. Like — for those who don’t know — for images, the big breakthrough last year, in January/February, by Ryan Murdock and Katherine Crowson and some others, was to take the open source CLIP model that OpenAI released — a model that takes images to text — and a generative model, VQGAN (that was Robin Rombach who did that one with his team, CompVis), and bounce them back and forth against each other to guide the output to get more and more coherent stuff.

In December, Katherine postulated CLIP conditioning would be the best way — taking a CLIP model, the language model, and a diffusion generative model and combining them together — and somehow it learned the stuff.

Then Google, with the Imagen team, took a language model, T5-XXL — that was a pure language model — mixed it together with the diffusion model, and somehow it learned how to write images, and it got even better. Everyone was like, “Wait, what?”

We still don’t exactly know how these things work, to be honest, or the potential of extending these. So I think transformers have a long way to go.

But again, there’s a paper that — I don’t know if you saw it — the number of papers on arXiv in ML is literally an exponential with a 24-month doubling. It’s just going crazy everywhere.
Who knows what people are going to come up with.

The interest in this area compared to basically the rest of the global economy means there will be more and more resources just deployed towards this, because it’s finally actually showing usefulness. It’s just… where that usefulness and value will lie, nobody really knows. Until then, just take some data and chuck it into the H100s, stir it up, and see what pops out the other side.
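The CLIP-guidance loop Emad describes — a scoring model steering a generative model toward a prompt, back and forth — can be illustrated with a toy numeric version. Here the “image” is just a vector, the “CLIP score” is similarity to a target embedding, and the update is plain gradient ascent; real implementations backpropagate through an actual CLIP model and a VQGAN or diffusion sampler, but the control loop has this shape:

```python
def clip_like_score(image_vec, text_vec):
    """Toy stand-in for a CLIP score: negative squared distance to the
    prompt embedding (higher means more 'on prompt')."""
    return -sum((i - t) ** 2 for i, t in zip(image_vec, text_vec))

def guide(image_vec, text_vec, steps=100, lr=0.1):
    """Toy guidance loop: repeatedly nudge the 'image' toward a higher
    score, mirroring how CLIP gradients steer a generator's output."""
    img = list(image_vec)
    for _ in range(steps):
        # Analytic gradient of the toy score with respect to the image.
        grad = [-2 * (i - t) for i, t in zip(img, text_vec)]
        img = [i + lr * g for i, g in zip(img, grad)]
    return img

start = [0.0, 0.0, 0.0]    # random-init "image"
prompt = [1.0, -2.0, 0.5]  # "text embedding" of the prompt
result = guide(start, prompt)
print(result)  # converges toward the prompt embedding
```

The point of the toy is the architecture: neither model alone generates coherent prompt-matching images, but the scorer’s gradient signal, applied iteratively, pulls the generator’s output into coherence.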
Why Stability AI is focusing on media applications first
Lukas:
It seems a little surprising that you have this amazing company that does all this cutting edge research in ML and model generation, and the first really big application is generating media.I never would have thought that a priori. Do you have other areas that you expect to take off or that you’re looking into?
Emad:
You shouldn’t underestimate media.

The easiest way for us to communicate is doing what we’re doing now. We’re having a chat with our words. The next hardest is writing each other emails or chats. To write a really good one is very hard.

Like, “I made this message long because I could not spare the effort to make it shorter,” I think someone once said.
Lukas:
Right.
Emad:
The hardest thing for us to do is communicate visually as a species. This is why artists are great. PowerPoints — we’ve all been there and stuck there. With the combination of a language model, a vision model, a language generation model, and a code model, you don’t need PowerPoint anymore. You can speak and create beautiful slides every time.

With art and visual communication, anyone now… my mom can create memes and send them to me about why I don’t call her enough, in an instant.

Like, humanity can finally communicate both through text now, with these language models — and you’ve seen how things like Copy.ai and Sudowrite and Jasper have made that easier — and now visually as well. And the next step will be 3D.

That’s a change in the way humanity communicates, which is a huge deal.

Again, language was valuable, but it was already getting there. You already had help — your Gmail suggesting to tell him to bugger off in your replies, or whatever.

Now it’s the next step there, which is image, and then 3D, and things like that will follow. That’s valuable because, again, we have to look at where the money is.

The previous iteration of the web was all about AI being used to target you ads. Now it’s about something else, where you’re moving from maybe consumption to creation. My focus has been in this area as a main driver there.

But in terms of impact and global stuff, the ability to switch between structured and unstructured data dynamically at a human level — because it understands the principles, when combined with, like, retrieval augmentation and other things to check for factual accuracy — is such a huge deal, because it means that you can write reports, you can do legal stuff, you can get rid of bureaucracy.

It’s the first technology that enables so many things, because it’s so general, that we’re not sure where the value will be. But I do see the value in anyone being able to express themselves and communicate better.
I think that we shouldn’t underestimate that particular aspect of things.
The ethics and regulation of open source, large models
Lukas:
I also wanted to ask you — and you’ve talked about this a fair amount, but I’d love to hear directly. You made this decision to make all your models really open, in contrast to what OpenAI and others were doing, which I think people got really excited about, because it sort of felt like… I think with some of the earlier models, there were these gatekeepers. Like, no one could really access them. Some models, like, really no one except people at the company could access.

I remember the reason that some of these models didn’t get opened up was said to be ethical concerns at the time. Do you think that there’s any merit to that argument? Do you think about that at all — like, models being used to trick people, or spam people, or things like that?
Emad:
Well, I think it’s a valid point of view. Basically, the logic there is similar to the logic of orthodox and ultra-orthodox religions, which say anything that leads to a sin is in itself sinful — so, just in case.

But it’s understandable, because these models are so powerful that you move from a risk-minimization framework — where you’ve got an expected utility, “What’s the positives? What’s the negatives?”, and you try to figure that out roughly, right — to minimax regret: “If I release this model and something goes wrong, my company could get blown up.” I minimize my maximum regret. And we don’t know what it can be used for, because it can be used for anything.

However, I think the last few years have shown this: GPT-2, too powerful to release. GPT-Neo and the other things come along. The world hasn’t ended. Stable Diffusion has been in the hands of 4chan now for 10 weeks, and all they’ve basically created is, like, Cronenbergs that have given themselves nightmares. Like, it’s not great at creating these things.

The bad guys already have the technology. The nation states… Russia has tens of thousands of A100s, right? And the people can’t run them. So they can build it. And we don’t have immunity to this new alien technology being out there. Because, ultimately, we live in a society that regulates against stuff.

So if you are creating bad things, you’ll be regulated against. If you are using it for bad purposes, again, the means of distribution — the social networks — have rules and regulations in place.

Because what you’re really trying to regulate is not content. Because bad content is bad. You’re trying to regulate behavior, and that’s about who’s allowed within these communities and not allowed within these communities. And all of this stuff gets mixed up.

Then the other aspect of it is this AI safety alignment issue of the technology killing us all.
I will say quite clearly, I think that GPT-4, when it comes, will be more dangerous than GPT-4chan.

Because a model like GPT-4chan — that was trained by Yannic on 4chan and produces just pure old rubbish — isn’t really going to go anywhere. It’s just going to produce pure old rubbish a bit easier. Whereas a GPT-4 — which, God knows what it will be, but I’m sure they’ll do an amazing piece of work — the large models that they’re creating now are getting to human level. And we don’t know how exactly they work. And they’re being created by unregulated entities, with these models that are as powerful as any technology out there.

Small models are not the issue — they’re being widely used, with the communities regulating them. Big models are the issue. And we should have more oversight on that, just in case some of this AI alignment stuff turns out to be correct and these things are dangerous, which I think they probably are.
Lukas:
But you believe that these small models are also very powerful. So why would the regulations be different for the size of the model?
Emad:
Oh, because they’re not open, right? So when they’re open, everyone can check it. So right now everyone’s poking around and saying, “Oh, those artists. Are they going to be compensated on LAION, and this and that?” And we’re like, “Cool. Let’s have that discussion in the open space. What’s the best mechanism to do this?”

We’ve got a $200,000 deepfake detection prize coming up. We’ll give it to the best implementation of open source deepfake detection. It’s available for everyone, and everyone can be a part of it. Whereas with the big guys, there is no control.

Like, again, the example I gave a bit earlier: imagine that Apple or Amazon or Google or someone integrated emotional text-to-speech into their models, right? So Siri suddenly has a very alluring-type voice and whispers to you that you should be buying stuff. You’ll probably buy it more. Is that going to be regulated? It’s not currently, and it won’t be in time.

Whereas putting these models out into the open will get people to think, “Actually, that’s something that probably should be regulated.” And if something is regulated, that is fine, because it’s a democratic process.

Whereas companies using this technology to manipulate us — literally, because that’s the advertising model — I don’t think is appropriate. And again, it’s not just Western influences and deepfakes and elections and stuff like that, because when you look at that, there is a herd immunity thing — not a COVID-type thing, though there was lots of work on that in COVID.

People understanding this technology will mean that people will be more discerning over curated outputs, and then it will be a mixture of that and detection technology. And then, for example, we’re part of contentauthenticity.org, where all our future models will have EXIF files — well, special metadata files — showing that they are generated, by default, in the package.

Now, will people choose to use them or not? They may choose not to use them, in which case they won’t have a tick next to them, right?
So there are all sorts of ways to do this, but the reality is that, again, this is a complex debate that cannot be decided basically in San Francisco. It’s something that is important because the technology will inevitably be around the world.

And if you actually poke people and you say, “Okay, so you don’t want this technology to be used by Indians,” they’re like, “Well, of course we do!” “When?” “When it’s safe to.” “Who decides that?” “We do.” “So they’re not smart enough to decide it?” “No, they need to be educated.” And then it gets really bad, right?

But again, I think it’s understandable, because it’s scary, and cool, and scary all at the same time.
Lukas:
Are there any applications currently of the models that you’ve built that make you uncomfortable that you would like to try to prevent?
Emad:
There was an example of a DreamBooth model being trained on a specific artist’s style. And so it was, like, a cute, Teen Titans-type artist, and it was announced and released as that artist’s model. But they had nothing to do with it.

I felt uncomfortable with that because — I don’t think that styles can be copyrighted, but it was, like, almost this co-opting of the name of that artist to do this. Like, eventually it got changed after discussion. There was a piece about that.

We’re entering some of these gray areas where we have to decide these things, and we have to figure out things like attribution mechanisms and other stuff.

DeepFaceLab has existed for years now. It has 35,000 GitHub stars for doing deepfakes at high quality. Maybe with this technology you can use it a bit easier, but that’s the inevitable pace of it. I think we have to figure out some of the things around attribution, around giving back, and around making sure that people’s things are used appropriately, right?

Because — in general, with attribution and copyright and things like this — these models do not create replicas when they’re doing the training, if you look at how a diffusion model works in particular. They just learn principles. Again, styles cannot be copyrighted, so it’s very difficult to do that.

But when it comes down to the individual basis, I’m still struggling a bit with how we prevent that from happening — people co-opting other people’s things — other than in a court of law.

Is there any automated system? Because you have the ethical, moral, and legal. Community typically enforces moral. Ethical is a more individual thing, and we have a CreativeML OpenRAIL license for that. And legal is obviously a whole other thing. We don’t want things to get down to legal.

It’s like, how can you encourage community norms? I’d say that’s probably the primary one here that just made me a bit uneasy.
Lukas:
I see, interesting. Do you do any — like, in your APIs that you offer — do you put restrictions in there that you don’t have in the open source models when used directly?
Emad:
No, 100%. So again, it’s regional-specific and it’s general, and it’s very safe for work, shall we say, because again — it’s a private implementation of an open API.

Even with the models… like, Stable Diffusion ships with a safety filter that’s primarily pornographic/nudity-based, just in case you get an output that you didn’t like. Like, the new versions of it will be more accurate in reflecting what you want and, again, trained on potentially safer datasets, etc. But there’s obviously a different bar for a private implementation.

Again, our basic thing is that these models should be released open as benchmark models with safety around them. So, like I said, there was a safety filter. If you trip the safety filter in the open source version, it shows you a picture of Rick Astley, and you can adjust the safety filter or you can turn it off.

And then there’s an ethical use license. Any other suggestions for improvements there, we’d love to know.

And again, I expect that this technology will proliferate, because we catalyzed it. There were the contributions from Patrick at Runway, from the LMU CompVis team, and others, and it was led by those two developers.

There’ll be a variety of models of different types being created by a variety of entities. And some of it will be safe for work, some of it will be not safe for work, but I think we need to try and figure out some standardization norms around this as this technology starts to proliferate.

But again, that should be a communal process.
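The filter-with-fallback pattern Emad describes — score the output, swap in a placeholder if it trips, and let the user adjust or disable the threshold — can be sketched generically. The scoring function below is a hypothetical stand-in; the real Stable Diffusion release wires a learned NSFW classifier into the generation pipeline:

```python
RICK_ASTLEY = "rick_astley.png"  # placeholder shown when the filter trips

def nsfw_score(image) -> float:
    """Hypothetical stand-in for a learned safety classifier's score."""
    return image.get("nsfw", 0.0)

def generate(image, threshold=0.5):
    """Return the image, or the placeholder if the filter trips.
    Passing threshold=None disables the filter entirely (user's choice)."""
    if threshold is not None and nsfw_score(image) > threshold:
        return RICK_ASTLEY
    return image

safe = {"pixels": "...", "nsfw": 0.1}
risky = {"pixels": "...", "nsfw": 0.9}

print(generate(safe))                   # passed through unchanged
print(generate(risky))                  # replaced by the placeholder
print(generate(risky, threshold=None))  # filter turned off by the user
```

The design point is that the filter lives in the open source package rather than behind an API, so the user, not the provider, holds the knob.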
Community structures and governance
Lukas:
You know, you keep mentioning these communal processes. And I’m curious: what happens when the community has, like, deep disagreement with itself? I imagine that happens all the time. Like, how do you resolve a community where people might have really different senses of what’s moral and draw lines in different places?

Has that happened yet in your community? And how do you expect to…
Emad:
100%. It happened in the wake of the Stable Diffusion release. People were like, “This can be used for not-safe-for-work, and we don’t feel comfortable with that and supporting that internally within Stability.”

And so we had a discussion as a team, and we decided not to release any more not-safe-for-work models as Stability itself. Some people weren’t happy with that. Most people were fine with that, but that was easier because it was a team decision.

On a community basis, that comes under governance structures. So right now, one of the things we’re doing is we’re looking at EleutherAI, and we want to spin that out into an independent community, because it’s got lots of different entities and lots of different points of view.

What is the appropriate governance structure for it? Is it Linux Foundation? PyTorch? It has a lot of OSS things. It’s a bit different, because these technologies are not like… what can you do with Linux, really, right?
Lukas:
Yeah, exactly.
Emad:
Whereas, what can you do with the most advanced language model in the world? It’s a lot more complicated and needs a lot more voices there, and that’s why we’re taking some time, just trying to get the governance structure right from day one. But we need to make it adaptive, because we’re not sure exactly where this stuff will go.

Right now, we as Stability have a lot of control over GPU access and a lot of this stuff. It’s the spice. That shouldn’t be the case going forward, because no one entity — whether it’s us, OpenAI, DeepMind, or another — should have control over this technology that’s a common good.

So, again, we want to be contributors to, like, an independent not-for-profit, as it were, as opposed to controlling this technology, and then have our part in supporting and boosting it being open source.

I think eventually what will happen is, if people really disagree, they’ll just fork. We’ve seen that in various communities. Just fork it, right? It’s the beauty of open source.
Lukas:
Yeah.
Emad:
And you can go and do your own thing.
Lukas:
Although I imagine it might be easier to fork a model, because one or two people could, like, take it in a different direction.
Emad:
Yeah. I mean, you can fine-tune models. You can fork models.

I think the key thing here is the benchmark model. That’s a lot of compute up front, right? And then fine-tuning and running it is relatively little compute. This is the opposite of the current paradigm of Google or Facebook, which is relatively little compute to get it into a database structure, and most of the compute is spent at inference time.

So you can take a Stable Diffusion model right now and you can train it on your face with 10 images or 100 images, and then boom, you’ve got your own, like, Lukas model that can put Lukas in anything, right?
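The “compute up front, cheap to fine-tune” point can be made concrete with a back-of-envelope ratio — a toy sketch, assuming training cost scales with the number of example-updates (images seen times passes over them); the step and epoch counts are illustrative orders of magnitude, not measured figures:

```python
def example_updates(n_images: int, passes: float) -> float:
    """Rough proxy for training cost: images seen x passes over them."""
    return n_images * passes

# Pretraining: on the order of the ~2B LAION images mentioned earlier,
# each seen a handful of times.
pretrain = example_updates(2_000_000_000, 3)

# A DreamBooth-style personalization: ~100 personal images, on the order
# of 1,000 update passes over them (hypothetical but typical-order numbers).
finetune = example_updates(100, 1_000)

ratio = pretrain / finetune
print(f"Pretraining is roughly {ratio:,.0f}x the fine-tuning cost")
```

Even with generous assumptions for the fine-tune, the pretraining run dwarfs it by four to five orders of magnitude, which is why forking a trained model is cheap while forking the benchmark-model training process is not.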
Lukas:
Yeah, that’s super cool.
Emad:
That’s a flipping of the entire paradigm. But that isn’t a forking of the community.

A community fork will be over disagreements about safe-for-work versus not-safe-for-work, or about the datasets — crawled or licensed — or things like that. And I imagine we will see different communities form around some of these key questions.
Lukas:
Although what’s tricky maybe about this, and a little different than other communities, is you’re holding this very valuable resource in terms of compute.So at the end of the day, you will have to arbitrate more aggressively, maybe. Like, for sure, anyone could easily fork stuff, but then they would have to potentially ask you to get the compute resources to really make a meaningful fork, right?
Emad:
Yeah, 100%. Right now we have a lot of control, because we’re the fastest supplier of compute. But a part of what we’re trying to do, as we spin these off independently, is make it so they can access their own compute, and also stimulate some of these national clusters to be more open. So it doesn’t take six to 12 months to get A100 or H100 access anymore.

I think, again, it deserves to be a bit more diverse — so, multiple parties at the table as opposed to centralized. And this is a deliberate action by us, to move towards more and more distribution and decentralization, both from an ethical and moral perspective, but then also, like I said, from a business perspective — it works for us as well.

Because if we’re considered to be in control of everything, like, we don’t know what’s going to happen there. And it’s really a lot of effort to coordinate an entire community, and it likely won’t be positive, because it’s going to be a lot if this goes to 100 million, a billion people, as we expect. Coordinating all of those. Instead, it should be an independent entity doing that, where all the voices can be heard.

And we’ve got our own part to play within that.

So we go from being the main provider of compute, to being a provider of compute, to — hopefully — all the compute being provided by the world, effectively, to do this properly, because it is a public good. And that’s good for us because it saves our costs, right? The open source models get created without cost to us.
Lukas:
So you imagine a world where a huge fraction of the world’s population is training models. Did I understand that right?
Emad:
No. I think everyone in the world will use these models. I reckon there will be, like, thousands of developers creating these models, to certain standards established by the various communities and others in interrelation with each other.

So you will have standard benchmark models — like Red Hat version seven or something like that, or Ubuntu 20. Like, there will be regular releases of these models. It will be independent. The countries and others will provide the compute for it. We’ll be one of the voices at the table, doing our little bit. And then people will build on those benchmark models and fine-tune them for themselves.

So, on the Apple architecture, like I said, there is a neural engine that’s not really used. Others are having these same foundation model engines that are coming through. So I think in five to 10 years, you will have AI at the edge, AI in the cloud, and the hybrid interaction of those two will be super powerful across modalities.

This is also one reason why we are fully multimodal, if people are like, “Why don’t you just focus on image?” Because you don’t know where the learnings will come from, or the value, across all of these.

So it makes sense for us to be that layer-one infrastructure there, to get things going, and then have a business model on scaling this.
Using AI to improve and democratize education
Lukas:
Yeah, that makes sense.I want to make sure I asked you about education. I mean, that comes up every time we talk, it comes up in every interview. It’s obviously something that you’re super passionate about. How does education fit into Stability?
Emad:
A large part of Stability, for my own personal focus, is around the rights of children, because a lot of ethics is complex. And things like that. But we all agree that children don’t have agency, and so they have rights.

I’m not talking about effective altruism a million years from now. I’m talking about kids right now, today. And I was like, “If I could go to the future and bring back technology to make kids’ lives better, what would I do?” I’d allow them to create any image and use these tools, allow them to do code — you know, the type of stuff that Amjad at Replit does. I would allow them to communicate and be educated and have healthcare.

So with the education thing, it was first proving that an app on a tablet could actually make a difference, which we’ve done now through the RCTs. Now it’s about bringing the world together and saying, “What’s the best darn experience we can have to teach these kids?”

Because it doesn’t make sense that we teach arithmetic in a different way across every single country. And we don’t know what the best way to teach linear algebra is.

But then having an AI model that teaches the kids and learns from the kids at scale — because you do entire countries at once — is the best data in the world for creating national-level models.

So if you want to create a Malawi model, you need to capture the Malawian culture and all the contexts. And if it’s trained by little Malawian kids, that’s a national-level resource. So this is what I discussed in [??].

Like, we’re not feeding AI models the right things.
We’re feeding them a mishmash of stuff, but if we actually intentionally create data that teaches them to learn, that’s going to make the best models out there.

And similarly, like I said, with the discussion that we’ve had about AI models going to the edge: having control over the hardware, software, and deployment means that we can standardize these tablets to be little AI machines, which will be amazing, because they’ll have a richer experience than anyone else.

And I personally think — I don’t know if you’ve got kids, Lukas — but 13 months at one hour a day to learn literacy and numeracy is good for any kid anywhere in the world. In a refugee camp, where people earn a few bucks a day at best — like, I think Malawi’s like $5 to $10 a month. It’s crazy, especially when you’ve got one teacher per 400 kids.

How else are you going to educate them other than with this technology? How else are you going to do it other than creating an open source standard that’s scalable, and working with the World Bank and others to scale it?

I think this technology has a huge role to play in education. I think that incorporating it into the West will be incredibly difficult — an uphill battle. Taking it where the ROI is largest, in emerging markets and places like that, is going to be the best. And then we’ll create a system that’s better for everyone.

Because, again, we have to decide what should be open and a public good. This is not from a business perspective, but from a societal perspective: what should be closed?

Should the tools that allow anyone to be creative, anyone to be educated, and other things like that be run by private companies? Probably not. They should be a public good. Should they be United Nations and other bureaucratic hellholes?
Probably not.

So, with this technology coming right now, there's a little window where we can create better, more adaptive systems and bring them to the people where they can have the most value, and that's what Stability is focused on. Because I think they could be real infrastructure for the next generation.
Lukas:
Just to be concrete about this, you’re imagining making a tablet that has an AI teacher that’s literally talking to students and teaching them things like linear algebra?
Emad:
Yep. I want to call it "One AI Per Child," but others are against that. But that's the concept: you have an AI that helps you. Because what is AI but information classification? So what's the information that can help that kid — be it in Malawi or Brooklyn — to the next part of their journey? And then having a standardized architecture for that, so you can take what works in Malawi and apply it to Ethiopia, apply it to Benin, apply it anywhere. Makes sense.

And the output data of that is customized datasets that are ideal for local language models, and local image models, and local video models, if you execute correctly.

So, this is why I think we are not OpenAI or DeepMind. We don't train giant models. The entire focus is AI that's accessible for everyone. It's emerging markets and creativity. These are our two focuses.

Again, I don't really care about AGI, except for it not killing us. I don't want to create a generalized intelligence. I want to create specific intelligences that are widely available, so we close the digital divide and make people's lives better. That's the key focus and lodestar of what we do.
Lukas:
That totally resonates with me, but don't you feel like the trends lately have been creating better specific intelligences by creating better general intelligences? Like, watching the last 20 years of machine learning, it seems like more and more general-purpose things that are then fine-tuned on specific applications. Do you expect that trend to change?
Emad:
I think it's an arc, right? So it was "scaling is all you need" and more layers, and now it's better datasets, right? And so as you have this adaptation, I think the intelligence goes to the edge.

I think instruct models and the combination of reinforcement learning and deep learning are the next big trend that we're seeing start to accelerate. And again, that's why we've got CarperAI as our reinforcement learning lab.

I think it'll be loads of models. Because these big models were there, but they weren't really used, right? Now they're being used. Stable Diffusion is probably being used by millions of people each day. As it gets better and as people release more models, this technology will be used by more and more people, be it private or public.

And so I think it then becomes about inference and cost. Because if you've got an open source model that's 80% as good as a closed model — and open source models will always lag closed models a bit, because you can always just take an open model, make it closed, and train it on better data — then that creates a different paradigm.

And again, I think there was this breakthrough point whereby "stack more layers" became less effective as you went up. Now it's a case of "make the layers more effective," as it were, and figuring out how we optimize these models once we can start doing A/B tests and training 10 of them at once. Where are the key optimization points?

I think the optimization will come from a model that's used by a million people, versus a model that's used by an internal team. A million people will always win, because people will figure out all sorts of tricks.

Like DreamBooth training — that's where you take a few pictures of yourself and it's fine-tuning for the image model. When it first came out, that required 48 gigabytes of VRAM.
After three weeks of the community building on it, it was down to eight gigabytes.

And having that — having hundreds and hundreds of developers hacking away at these things and figuring out how to put them into processes, as opposed to zero-shot. These won't be the best for zero-shot, but they will be more useful because they're in pipelines. I think we've shown that with Stable Diffusion versus other image models that live within their own closed systems. But now we just have to upgrade the models again.
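Editor's note: the 48 GB → 8 GB drop Emad describes came from community memory optimizations such as lower-precision weights and lighter optimizer state. As a rough, hypothetical back-of-envelope — the ~860M figure is the commonly cited Stable Diffusion v1 UNet parameter count, not a number from this conversation — the arithmetic looks like this:

```python
def training_mem_gb(n_params, bytes_weights=4, bytes_grads=4, bytes_opt=8):
    """Rough training memory for weights + gradients + Adam moment buffers.

    Activations are excluded; they depend on batch size and resolution,
    which is why real-world numbers differ from this estimate.
    """
    return n_params * (bytes_weights + bytes_grads + bytes_opt) / 1024**3

unet = 860e6  # approximate Stable Diffusion v1 UNet parameter count

# fp32 weights/grads + fp32 Adam (two moment buffers): the "before" regime
print(round(training_mem_gb(unet), 1))  # → 12.8

# fp16 weights/grads + 8-bit optimizer state: one "after"-style regime
print(round(training_mem_gb(unet, bytes_weights=2, bytes_grads=2, bytes_opt=1), 1))  # → 4.0
```

The exact community numbers involved more tricks (gradient checkpointing, offloading), but the sketch shows why precision and optimizer-state choices dominate the VRAM bill.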
Emad's thoughts on autism and machine learning
Lukas:
I have one more question that I didn't actually prepare, and I'm curious if you have thoughts on it. You've talked about your autistic son a few times, and I actually have a little sister who's autistic. Autism has come up often in these interviews — many guests have autistic family members.

I'm curious, do you see any connection between autism and machine learning?
Emad:
100%, and this is why I really love transformer-based architectures. Because what I did with my son in terms of repurposing drugs for him — and we'll do a full formal thing about this in the next year or two, where we'll share all the learnings — is about reducing the noise and getting him to pay attention by reducing the imbalance. So there's too much glutamate making him excited and not enough GABA calming him down.

And then having things like applied behavior analysis, where he does rapid iterations to learn that a "cup" means things in various contexts, with a variable reward schedule where he gets rewarded at random so he's more motivated to rebuild these things. It's similar for a stroke victim and other things.

But, again, you look at what these models do with transformer-based architectures: attention is all you need. They pay attention to the important parts, and that interconnection of creating latent spaces or hidden layers of meaning is almost exactly the same — well, it's not the same, but the same principle — as what we do for rebuilding the language capabilities of our kids.

And so this is one of the things that really drew me to it, and I was like, "I kind of get that." Like, I have Asperger's myself, so I had to rebuild and re-figure out a lot of stuff. Principle-based approaches. That's why I was like, it's almost like type-one versus type-two thinking, retrieval versus instinct. A combination of those is the most powerful combination we've ever had as humanity. And again, I think it'll really be able to help with this.

The other aspect of it is personalized medicine and education and other stuff. We don't have enough teachers. We don't have enough doctors. These technologies are reaching human level in very narrow fields.
What if we could put this on tablets out there?

"One AI Per Child" doesn't just mean, like, something… it's literally an AI that can help them in everything — if they've got special needs, or if they're neurotypical, or anything like that — and personalize things for them, because our education system treats everyone like a number. It's like ergodicity versus the non-ergodicity of humans. Tossing a thousand coins at once is the same as tossing one coin a thousand times — but people don't work that way. The reality is we are all unique, but we didn't have the tools to personalize until now. This is the first technology that could do that.

So in doing that, we can figure out systemic diseases and conditions like autism, like COVID and others. This is why I focused on COVID. This was a multi-systemic disease that modern science wouldn't be able to deal with. Like, why do you have massive ferritin levels and other things in the blood? Is it serotonin syndrome? Is it this or that? The first-principles analysis of COVID is still lacking even today.

Thankfully, we found treatment. And, again, the models are one piece of the science, but information isn't getting to where it's needed on a personalized basis. And, again, we can build systems for that. But AI models are only one part of that; it's more classical open source AI for the rest of it.

So, yeah, I think there are parallels. And of course, being in our industry, it is very, very prevalent, right? It's like a double-edged sword.
Lukas:
Well, I’m curious, do you think your Asperger’s has given you some advantages in building this really unique company?
Emad:
Yeah, no, 100%. My biggest skill is mechanism design. I know how to convince governments and multilaterals and others. Stability has huge international support because I've positioned it just right at the right time. And my Asperger's and ADHD typically balance each other out, I like to say. You've got to focus on what you're good at, and that's what I'm good at.

That's my job here: to absorb the hate and to do the big things, while letting the real heroes — the developers and the community — get on with things.

It also allows me to have a different perspective, in that most companies would try to control this. But, really, we are just trying to catalyze it and get it out there because, I think — from a mechanism design perspective and morally — that is the right thing to do.
The importance of data and latency
Lukas:
Interesting. Well, we always end with two questions, and I want to make sure I get them in. The second-to-last is pretty open-ended: we usually ask, what's a topic in machine learning that you think is underrated? You've mentioned a whole bunch, but is there anything else you think deserves more study than it gets right now?
Emad:
In machine learning? I think it's really data, to be honest. You could say classical machine learning was largely data science, but the role of data in these models is vastly under-examined — just not looked at at all. I think you can use 10 or 100 times less data for better outcomes on these models once we really look at the data, how it impacts latency, and some of this other stuff.

So, like I said, we're building a team for that. Other people have been doing data cleaning, but I don't think that's enough. I think we'll see some remarkable advances in that area.
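Editor's note: as a toy illustration of the kind of data work being described — a hypothetical sketch, not Stability's actual pipeline — even exact-duplicate removal after light normalization can shrink a training set noticeably:

```python
def dedup_exact(records):
    """Drop exact duplicates (after whitespace and case normalization),
    keeping the first occurrence of each record."""
    seen, kept = set(), []
    for r in records:
        key = " ".join(r.split()).lower()  # collapse whitespace, lowercase
        if key not in seen:
            seen.add(key)
            kept.append(r)
    return kept

data = ["A cat.", "a  cat.", "A dog.", "A cat."]
print(dedup_exact(data))  # → ['A cat.', 'A dog.']
```

Real dataset curation goes much further (near-duplicate detection, quality filtering, balancing), but even this trivial pass changes what the model sees most often.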
Lukas:
It's so funny, because my last company, which I ran for 10 years, did data collection. And we always found that data cleaning was actually the most important thing anyone could do to make their models better, but we could never convince people to do as much of it as we thought they should.
Emad:
Everyone’s like, it’s cooler to stack more layers, right?
Lukas:
Yeah.
Emad:
It’s data cleaning, data structure. There’s a whole bunch in there.
Lukas:
The last question we always ask is: what's a hard part about taking a model and actually turning it into a product? You've obviously just recently built some products on top of these big models. Outside of training the model itself, what have been some unexpected challenges in making the whole product work cohesively?
Emad:
We have DreamStudio Lite and DreamStudio Pro coming up very soon. Probably the key challenge is getting it responsive enough to have a user experience that is really seamless. We've gotten to sub-one-second inference now, but that was very difficult to do. We had to do a lot of optimization there because, again — even if this is relatively small, it's still a large model, right?

The second part, I think, is around some of the fine-tuning and creating custom models. That's a pretty different take on things. There's a lot of work going into where we actually store and keep the models, and the user-data aspects of that become a very curious thing.

I think the most important thing is maintaining snappy consumer feedback loops for these large models, especially because we're doing animation, which people don't want to wait around for. They either wait a long time, or they don't want to wait at all. Like, "Why isn't it real time?" Because normally this would take, like, three weeks, you know?
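Editor's note: a "sub-one-second" target like the one Emad mentions is typically tracked with a simple latency harness. This is a generic timing sketch, not Stability's actual tooling:

```python
import time
import statistics

def measure_latency_ms(fn, n=50):
    """Call fn n times and return (median, p95) latency in milliseconds.

    Median shows typical responsiveness; p95 catches the slow tail
    that users actually notice in an interactive product.
    """
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        fn()  # stand-in for a model inference call
        samples.append((time.perf_counter() - start) * 1000)
    samples.sort()
    p95 = samples[min(n - 1, int(0.95 * n))]
    return statistics.median(samples), p95

# Example with a dummy 5 ms "model"
median, p95 = measure_latency_ms(lambda: time.sleep(0.005), n=20)
print(f"median={median:.1f}ms p95={p95:.1f}ms")
```

Tail latency (p95/p99) matters more than the average for the "snappy feedback loop" feeling, since a user who hits the slow tail perceives the whole product as slow.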
Outro
Lukas:
That does sound challenging! Well, thank you so much. I really appreciate it. That was a fun interview.
Emad:
No problem, Lukas. Cheers, buddy.