
Cristóbal Valenzuela — The Next Generation of Content Creation and AI

Cris gives a demo of Runway, a new video editing platform that uses machine learning to make content creation easier, and discusses the future of computation and creativity.



About this episode

Cristóbal Valenzuela is co-founder and CEO of Runway ML, a startup that's building the future of AI-powered content creation tools. Runway's research areas include diffusion systems for image generation.
Cris gives a demo of Runway's video editing platform. Then, he shares how his interest in combining technology with creativity led to Runway, and where he thinks the world of computation and content might be headed to next. Cris and Lukas also discuss Runway's tech stack and research.

AI Film Festival:


Runway’s AI Film Festival is accepting submissions through January 23. They are looking for art and artists at the forefront of AI filmmaking. Submissions should be between 1 and 10 minutes long, and a core component of the film should include generative content.

Timestamps

0:00 Intro
40:01 Outro

Transcript

Intro

Cris:
I think a big mistake of research — specifically in the area of computer creativity — is this idea that you're going to automate it entirely. You see one-click solutions to do X, Y, or Z. I think that's missing the bigger picture of how most creative workflows actually work.
That probably means that you've never actually worked with an agency where the client was asking you to change things every single hour, or make it bigger, make it smaller, right?
Lukas:
You're listening to Gradient Dissent, a show about machine learning in the real world. I'm your host, Lukas Biewald.
Cris Valenzuela is an artist, technologist, and entrepreneur, and the CEO and founder of a company called Runway, which makes ML-powered video editing software. But I feel that description doesn't even do justice to how incredible and innovative his product is.
This interview actually starts off with a live demo of his product. I really recommend switching to video if you're listening to this on audio only, because his demo is absolutely incredible.

How Runway uses ML to improve video editing

Lukas:
Well, all right, Cris, we don't normally do this, but I thought it would be fun to start with a product demo if you're down for it. You have such a cool, compelling product. Would you be up for that?
Cris:
Sure. What do you want me to demo? There's a lot I can do. I want to make sure I can focus on what you want to see.
Lukas:
Well, this is an ML podcast. So I think people would probably be interested in the most flashy ML features. How about that?
Cris:
In short, Runway is a full video creation suite. It allows you to do things that you might be able to do in more traditional video editing software. The main difference is that everything that runs behind the scenes...so, most of the core components of Runway are ML-driven.
There are two main reasons for making everything ML-based. One is, it helps editors, content creators, and video makers automate and simplify really time-consuming and expensive processes when making video or content. There's a lot of stuff you do in traditional software that's very repetitive in nature, very time-consuming, or expensive.
Runway aims basically to simplify and reduce the time of doing this stuff. If you have a video you want to edit, an idea you want to execute, spending the time, and the minutes, and the hours, and sometimes days on this very boring stuff is not the thing that you really want to do. So we build algorithms and systems that help you just do that in a very easy way.
And then there's another aspect of Runway, that it's not only about automation, but it's about generation. We build models, and algorithms, and systems that allow our users and customers to create content on demand.
And everything...the baseline for us is that everything happens in the browser. It's web-based and cloud native, which means you don't rely anymore on native computers, or native applications, or desktop compute. You have access to our GPU cluster on demand, and you can render videos in 4K or 6K pretty much in real time. Plus you can do all of this AI stuff in real time as well.
A lot of folks are using Runway now — CBS, The Late Show with Stephen Colbert, the folks who edit Top Gear, or sometimes creators who do stuff for Alicia Keys, or just for TikTok or movies — they're all leveraging these AI tools via this web-based, cloud-based editor.
So that's a short, five-minute intro to what the product does and how ML or AI plays a role in the product itself. But I'm happy to now show you how everything goes together and the experience of using the editor, if that makes sense.
Lukas:
Please, yeah.
Cris:
Cool. Any questions before we do that? I can double down, or if you want me to clarify?
Lukas:
Well, I actually didn't realize that professional video teams like The Colbert Show use Runway. Do they use it for all of their video processing, or is there a certain part where they use it? How does that work?
Cris:
It depends. Some editors and some folks are using it as an end-to-end tool to create videos. Some other folks use a combination of different software to make something.
The folks who use it for movies sometimes add in Nuke or Flame. We have a big Flame community, so Runway becomes a part of that workflow.
It's replacing either something you do on a very manual basis. It's sometimes replacing a contractor you hired to make that work for you, or it's sometimes replacing your own work of trying to do it yourself in this old software.
But you still use other aspects of it, or other software to combine [with] it. It really depends on the type of content you have and the level of outcomes that you need. But we do have folks that use it as an end-to-end content creation and editing tool.
Lukas:
Cool. Well, I mean the extent of my video editing is basically modifying videos of my daughter to take out the boring parts and send them to my parents. That's as far as I go.
Maybe you could sort of give me a little bit of an overview of the cool stuff you can do with Runway.
Cris:
Totally. You can do all of that in Runway in the browser, which is...you might start using Runway for that.
The one thing I would emphasize is, everything is running on the cloud, on the web. You can just open any project with a URL. You can also create teams, and you have this baseline collaboration aspect that just runs out-of-the-box.
Cool. Anything else? No, just go demo?

A demo of Runway's video editing capabilities

Lukas:
Yeah, let's see a demo. Totally, yeah. Show me the cool stuff.
Cris:
Perfect.
So, this is what Runway looks like. If you've ever edited video before, it's a very common interface.
We have tracks on the bottom. We have a multi-editing system with audio tracks, and keyframe animations, and text layers, and image support. You can preview your assets on the main window and have a bunch of effects and filters on the right. Again, everything running pretty much on the cloud in real time.
The idea here is that there are a lot of things that you can do that are very similar to stuff that you can do in other applications, plus there are things that you can't do anywhere else. Let me give you an example of something that a lot of folks are using Runway for.
I'm going to start with a fresh composition here. I'm going to click one of the demo assets that I have here. I'm going to click this. I have a surfer, right? On that shot, let's say I want to apply some sort of effect or transformation to the background of this shot. Or I want to maybe replace the person here and take it somewhere else.
The way you'd do that today would be a combination of frame-by-frame editing, where you're basically segmenting and creating an outline of your subject, and for every single frame you move forward, you have to do it one more time.
For that, we built our video object segmentation model — which we actually published a blog post and a paper around — that allows you to do real-time video segmentation. In film, this is actually called rotoscoping.
You can just literally go here and guide the model with some sort of input reference. I tell the model this is what I want to rotoscope, and it can go as deep as I need. I can select the whole surfer here, or go deeper for more control over it. Once the model has a good understanding of what you want to do, it propagates that single keyframe or single layer to all the frames of the video in real time.
You get a pretty smooth, consistent segmentation mask that you can either export as a single layer, export as a PNG layer, or take back to your editing timeline and start modifying. Say you want to cut it, compose it, or do some sort of transformation...you can do that directly from here.
Let's say I have my baseline — or my base video — here, I have my mask on top of that, and now I can just literally move it around like this. I have two layers, right, with a surfer.
So, something that looks very simple but in traditional software may take you a couple of hours of work, here you can do pretty much in real time. Again, it's something that most editors know how to do, but it just takes them a lot of time to actually do.
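For a rough sense of what a mask like this looks like in code, the sketch below runs a pretrained, off-the-shelf torchvision segmentation model frame by frame to produce a person mask. It's not Runway's model — Runway's video object segmentation propagates a user-drawn keyframe across the clip, which this stand-in doesn't do — and the input file name is a placeholder.

```python
# A rough stand-in for rotoscoping: per-frame person segmentation with a
# pretrained torchvision model (no keyframe-guided propagation).
import cv2
import numpy as np
import torch
from torchvision import transforms
from torchvision.models.segmentation import deeplabv3_resnet50

device = "cuda" if torch.cuda.is_available() else "cpu"
model = deeplabv3_resnet50(weights="DEFAULT").eval().to(device)

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

PERSON_CLASS = 15  # Pascal VOC "person" label index

cap = cv2.VideoCapture("surfer.mp4")  # placeholder input clip
masks = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    batch = preprocess(rgb).unsqueeze(0).to(device)
    with torch.no_grad():
        logits = model(batch)["out"][0]  # (num_classes, H, W)
    # Binary matte: 255 where the model predicts "person", 0 elsewhere.
    mask = (logits.argmax(0) == PERSON_CLASS).cpu().numpy().astype(np.uint8) * 255
    masks.append(mask)
cap.release()
```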
Lukas:
And did you just run that in the browser?
Cris:
Yeah.
Lukas:
That segmentation mask, it figured it out in the browser and it's calculating all...it doesn't go to the server?
Cris:
No, it goes to the server. Yeah, there's an inference pipeline that we built that processes real-time videos and allows you to do those things. The compute part is everything running on the cloud.
You just see the previews and sometimes — depending on your connection — you can see a downsampled version of it, so it runs really smoothly and plays really nicely. Also, for every single video there are a few layers that we run that help guide something like a segmentation mask.
For instance, we get depth maps and we estimate depth maps for every single video layer. You can also export these depth maps as independent layers and use them for specific workflows. That's also something very useful for folks to leverage.
So you have this and you can export this. Behind the scenes, we're using this for a bunch of things.
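As a rough illustration of the kind of depth layer Cris describes, the sketch below estimates a depth map for a single frame with MiDaS loaded from torch.hub. It's a generic open-source stand-in, not Runway's internal depth model, and the file names are placeholders.

```python
# Estimate a depth map for one frame with MiDaS (small variant) and export it
# as a grayscale layer — a stand-in for the per-layer depth maps described above.
import cv2
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small").eval().to(device)
transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform

frame = cv2.cvtColor(cv2.imread("frame_0001.png"), cv2.COLOR_BGR2RGB)  # placeholder frame
with torch.no_grad():
    pred = midas(transform(frame).to(device))            # (1, H', W') relative depth
    depth = torch.nn.functional.interpolate(
        pred.unsqueeze(1), size=frame.shape[:2],
        mode="bicubic", align_corners=False,
    ).squeeze().cpu().numpy()

# Normalize to 0-255 so it can be saved as a grayscale depth layer.
depth_u8 = cv2.normalize(depth, None, 0, 255, cv2.NORM_MINMAX).astype("uint8")
cv2.imwrite("depth_0001.png", depth_u8)
```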
Lukas:
Cool.
Cris:
That's one of the things you can do. You can go very complex on stuff. Let's say, instead of the surfer, I just want the — let me refresh this — I just want the background. I don't want the surfer.
I can inpaint or remove that surfer from the shot. So I'm just gonna paint over it. Again, I'm giving the model one single keyframe layer, and the model is able to propagate those consistently for the entirety of the video.
That's also something we — as a product philosophy — really want to think about. Which is, you need to have some layer of control over the input.
The hard part of that should just be handled by the model itself, but there's always some level of human-in-the-loop process, where you're guiding the model. You're telling it, "Hey, this is what I want to remove. Just go ahead and do the hard work of actually doing that for the whole video sequence."
Lukas:
Wow, that's really amazing. That's like magic, right there. The surfer’s really just gone.
Cris:
Yeah. That's something we see a lot, when people find out about it, or when they start using it. "Magic" is a word we hear a lot.
It's something that...again, if you're editing or you've worked in film or content before, you know how hard, time-consuming, and just painful it is. Just seeing it work so instantaneously really triggers that idea of magic in everyone's minds. Which is great, because we've really thought of the product as something very magical to use.
So, there's stuff like that. There are a few things like green screen and inpainting — which I'm showing you now — plus motion tracking, that we consider baseline models in Runway.
Those are just...you can use them as unique tools, as I'm showing you right now. You can also combine them to create all sorts of interesting workflows and dynamics.
There's the idea of, "You want to transform or generate this video, and take this surfer into another location," you can actually generate the background, and have the camera track the position of the object in real time, and then apply the background that you just generated in a consistent manner, so everything looks really smooth.
The way you do that is by combining all of these models in real time, behind the scenes.
You might have seen some of those demos on Twitter, which we've been announcing and releasing. This is a demo of running a few of those underlying models, combined.
There's a segmentation model that's rotoscoping the tennis player in real time. There's a motion-tracking model that's tracking the camera movement, and then there's an image-generation model behind the scenes that is generating the image in real time. Those are all composed at the same time. Does that make sense?
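The compositing step at the end of that chain is conceptually simple once you have the pieces. Below is a minimal sketch of blending one frame: keep the subject where the rotoscoping mask is white and drop a generated background in everywhere else. The file names are placeholders, and this illustrates the idea rather than Runway's actual pipeline.

```python
# Composite one frame: segmented subject over a generated background.
import cv2
import numpy as np

frame = cv2.imread("tennis_frame.png")                       # original frame (H, W, 3)
mask = cv2.imread("tennis_mask.png", cv2.IMREAD_GRAYSCALE)   # subject mask, 0-255
background = cv2.imread("generated_background.png")          # generated image

# Match the background size to the frame, then alpha-blend: subject where the
# mask is white, generated background everywhere else.
background = cv2.resize(background, (frame.shape[1], frame.shape[0]))
alpha = (mask.astype(np.float32) / 255.0)[..., None]          # (H, W, 1) in [0, 1]
composite = (alpha * frame + (1.0 - alpha) * background).astype(np.uint8)
cv2.imwrite("composite.png", composite)
```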
Lukas:
Yeah, yeah. Totally.
Cris:
Those are, I would say, underlying baseline models and then you can combine them in all sorts of interesting and different ways.
Lukas:
Totally.
Alright, well, thanks for the demo. That was so cool. We'll switch to the interview format. Although now I really want to modify this video in all kinds of crazy ways.
Cris:
We should replace the background with some stuff while we're talking.
Lukas:
Totally. Get this microphone out.

How Cris entered the machine learning space

Lukas:
One question I really wanted to ask you is, I think your background is actually not in machine learning originally, right?
I always think it's really interesting how people enter the machine learning space. I'd just love to hear your story, a little bit, of how you ended up running this super cool machine learning company. It seems you're very technically deep, also. And so, how did you manage to get that depth mid-career?
Cris:
Totally. Long story short, I'm originally from Chile. I studied econ in Chile and I was working on something completely unrelated. But it was 2016 or 2017, I think, and I just randomly fell into a rabbit hole of ML- and AI-generated art.
It was very early days of Deep Dream and ConvNets and AlexNet, and people were trying to make sense of how to use this new stuff in the context of art making. There were some people like Mike Tyka, and Mario Klingemann, and Gene Kogan who were posting these very mind-blowing demos — things that now feel like you could run them on your iPhone in real time.
But around that time it was someone...I remember Kyle McDonald — who is an artist — walking around with his laptop, just showing people a livestream of a camera. He had, I think, an ImageNet model running in real time, just describing what it saw. And it just blew my mind.
Again, it was 2016. Now it's pretty obvious, but around that time it was pretty special. I just went into a rabbit hole of that for too long. It was too much; I was just fascinated by it. I actually decided to quit my job and leave everything I had. I got a scholarship to study at NYU and just spent two years really going very deep into this.
Specifically in the context of, I would say, creativity. My area of interest was the idea of computational creativity. How do you use technology? How do you use deep learning or ML for really creative tool-making and art-making? That two-year-long research process and exploration ended up with Runway.
Runway was my thesis at school. It was a very different version of what you see now. But the main idea was pretty much the same.
It's like, "Hey, ML and AI are basically a new compute platform. They offer new ways of either manipulating or creating content. And so there needs to be some sort of new tool-making suite that leverages all of this, and allows people to tap into those kinds of systems in a very accessible and easy way."
The first version of Runway was a layer of abstraction on top of Docker, where you could run different algorithms and different models in real time on this Electron app. You could click and run models in real time and connect those models via either sockets, or UDP, or a web server to Unity or Photoshop.
We started building all these plugins where you can do the stuff that you're able to see now on Twitter. Like, "Here, I built a Photoshop or Figma plugin that does image generation." We were building all that stuff running Docker models on your computer locally, and you could stream those. It was 2018, 2019.
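The general pattern behind that first version — a locally running model exposed to plugins over a lightweight server — might look something like the sketch below. Flask, the endpoint name, and the dummy "model" are illustrative assumptions, not Runway's actual implementation.

```python
# Pattern sketch: wrap a locally running model in a small web server so a
# Photoshop/Unity/Figma plugin can call it over HTTP. The "model" here is a
# placeholder; a real setup would call into a containerized ML model.
import base64
import io

from flask import Flask, jsonify, request
from PIL import Image, ImageOps

app = Flask(__name__)

def run_model(image: Image.Image) -> Image.Image:
    # Placeholder "model": just invert the image.
    return ImageOps.invert(image.convert("RGB"))

@app.route("/infer", methods=["POST"])
def infer():
    # The plugin sends a base64-encoded image; we return the processed result.
    data = base64.b64decode(request.json["image"])
    result = run_model(Image.open(io.BytesIO(data)))
    buf = io.BytesIO()
    result.save(buf, format="PNG")
    return jsonify({"image": base64.b64encode(buf.getvalue()).decode()})

if __name__ == "__main__":
    app.run(port=8000)
```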
Lukas:
Interesting. It must have been a much more technical audience at the time then, right? If you have to run Docker on your local machine. That's not something everyone can do, right?
Cris:
Totally, totally. I think that also says a lot about how much progress the field has made, and how much more mainstream and accessible things have become.
Trying to put this set of new platforms and compute ideas in front of creators, and video makers, and filmmakers required you to know how to install CUDA and manage cuDNN, and I don't know what else. It was just too much. But people still wanted to do it. There were some folks who were like, "Hey, this is really unique. I want to understand how to use this."
But then we realized it wasn't enough. You need to go [to] higher layers of abstraction on top of that to really enable creative folks to play with this, without having to spend months trying to set up their GPU machines.
Runway has really evolved, and we have a really experiment-driven thesis and way of working on the product. But it's all about trying ideas and testing them out with people really fast.
We're building something that hasn't been done before. And so it's really easy to get sidetracked into things that you think are going to work, or ideas that you think are going to be impactful. But since you're working with new stuff all the time, being close to our user base has been really, really important for us.
Every time we iterate on the product, I think one consistent line of evolution has been this idea of simplifying...making higher abstraction layers on top of it.
The first versions of rotoscoping or inpainting required you to select the underlying model architecture, understand what a mask was, and [know] how propagation works.
If you're really a filmmaker, you don't care about any of that stuff. You just want to click once, and you want to get a really good result.
For us, it's "How do you build from there, using what we're building behind the scenes?"

Cris' thoughts on the future of ML for creative use cases

Lukas:
Were you surprised by how well these approaches have worked to generate images? It sounds like you started your work in 2017, 2018. The space has changed so much.
Do you feel you saw it coming, or have things unfolded differently than you thought?
Cris:
I mean, things have definitely accelerated. But I think our thesis — when we started Runway three and a half years ago — was pretty much the same. It was, we're entering literally a new paradigm of computation and content. We're not going to be...we're soon going to be able to generate every single piece of content and multimedia content that we see online.
I've been demoing generative models for creative use cases for the last three years. What I was showing three years ago, people were like...it was like, "Hey, this is how it works. This is how you train a model. This is what the outcome of the model is."
Of course, at that time, it was a blurry 100x100-pixel image. Some sort of representation of what you were describing. Most people took it as a joke, like, "Oh yeah, cool. Very cool. Cool thing." Or as a toy, like, "That's a fun thing, right? You kind of use it once. But of course, I will never use this in production."
I remember speaking with this huge...one of the biggest ad agencies in the world, and I was presenting to their executives. "Here's the future of content, type anything you want." And something blurry came out, and they were like, "Cool, not for now."
And they reached out three weeks ago being like, "Hey, how many licenses can we get for this, tomorrow?" Because the models are getting just so much better that it's obvious. It's transforming their industries and a lot of other things.
I think what has changed for us is pretty much the speed. Now we're entering a really nice moment where things are converging, and there's a good understanding of what's going to be possible, and where things are going. Scaling laws are getting to a good point.
And so we're continuing the same, but the thesis of the company was always built on the idea that this would happen, and it's happening sooner rather than later.
Lukas:
Do you have a perspective on whether this acceleration will continue, or if we're just seeing a breakthrough, and then we're going to need new breakthroughs to get to the next level of quality?
Cris:
Sure. I think there's definitely more compute that needs to be added to this, more data sets. I think we're still scratching the surface of what it will become.
There's still this...I was discussing this with a friend the other day, this idea of a curiosity phase where people are entering the realm of what's possible and coming up with all these solutions and ideas. But there's still a difference between those concepts, explorations, and ideas, and meaningful products that are built on top of them for the long term.
What I'm interested in seeing is how many of those ideas will actually convert over time into meaningful products. I think that conversion into products is not just pure research or pure new models; there needs to be a layer of infrastructure to support those things.
It's great that you can run one single model to do one single thing at X percent. But if you're trying to do that at scale, on a real-time basis, for 10 people who then use it on a team and depend on it for their work, that's a slightly different thing.
But I think we're about to see way more stuff around video, specifically. I think image might be solved in a couple more months, and video is starting to now catch up with that. It's a really exciting time for that.
Lukas:
What does something being solved mean to you? Like, you could just get any image that you would ever want or imagine?
Cris:
Yeah, that's a good one. That's a good question.
I would say that I would consider it solved [when you're] able to translate something like words or a description into a meaningful image or content that pretty much matches what you're imagining. And if it doesn't, you're able to adjust it really quickly and easily to get to the point where you can arrive at your final idea.
That's why the combination of models really makes sense. It's going to be hard to have a full model that does exactly what you want. For instance, for image generation.
I think it's a combination. You have a model that does the first step, which is generating something — there are no pixels, you generate the pixels. The second step is that you're able to quickly modify it, or inpaint it, or grade it in some way, and steer it in some other way.
But that whole thing just happens in a few seconds or a few minutes, right?
If you speak with anyone in the industry — VFX, ad agencies, content creation, post-production companies — this is stuff these folks do all the time. This is what they do for a living, right? They're able to create content out of nothing.
The thing is just it's really expensive. It's really, really expensive. And it involves a lot of time and rendering and skilled people to get to that point.
I think for me, "solved" is, anyone can have access to that professional-level grade VFX-type of content from their computers and from a browser.
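The two-step workflow Cris sketches — generate the pixels, then quickly modify a region — maps fairly directly onto open-source diffusion pipelines, including checkpoints Runway has released. The sketch below uses Hugging Face diffusers; the prompts, file names, and mask are placeholders, and this is an illustration rather than Runway's product.

```python
# Step 1: generate pixels from a prompt. Step 2: modify part of the result with
# inpainting, guided by a mask (white = region to regenerate).
import torch
from diffusers import StableDiffusionPipeline, StableDiffusionInpaintPipeline
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Step 1: there are no pixels — generate the pixels.
txt2img = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)
image = txt2img("a surfer riding a wave at sunset, cinematic lighting").images[0]
image.save("generated.png")

# Step 2: quickly modify it. The mask is assumed to already exist (e.g. drawn
# by the user or produced by a segmentation model).
mask = Image.open("mask.png")  # placeholder mask
inpaint = StableDiffusionInpaintPipeline.from_pretrained("runwayml/stable-diffusion-inpainting").to(device)
edited = inpaint(prompt="a red sailboat in the distance",
                 image=image, mask_image=mask).images[0]
edited.save("edited.png")
```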
Lukas:
Do you ever think about making a version of Photoshop, instead of video editing software? If you think images are closer to being solved. Certainly I can't go into Photoshop and get exactly the image I want.
I love to play with all the image generation tools out there. But I do think they're amazing at first, and then you kind of hit this point where, if you really want the image to look the way you want, it gets kind of frustrating.
It seems there's also room for an image version of what you're doing. Is that something you'd consider doing? Or, why not make that?
Cris:
Totally. Yeah. The answer is absolutely.
I think, a few things. One, I think we're converging more to this idea of multi-modal systems where you can transfer between images, and videos, and audio. Up to now, we've built software to deal with each medium independently.
There's audio editing software, and video editing software, and image editing software, and text-based software...and now you have models that can quickly translate between all of those.
Content — let's say video — it's a combination of different things. You have images, you have videos, you have audio, you have voice. All of those things are now possible.
I think for us, when I think about the product philosophy of Runway, it's less about, "How do you build a better Photoshop or a better Premiere?" Fundamentally, these models are just allowing you to do the things that none of those others can do.
If you think about marginal integrations of those things...yeah, you build a better Photoshop that has a better paintbrush, or a better content-aware tool. But ultimately, when you combine them in new ways, you create a new thing. It's completely new.
It's not Photoshop, it's just a new way of making videos, and editing images, and editing audio. All in one, single component or tool. For me, what's really interesting is the multi-modal aspect of things, and translating also into those.
And 3D, for instance, is one of those areas...you're going to start to see a lot of translation between images and videos and 3D.
Lukas:
Totally.
So, I have to ask you your thoughts on deep fakes and things like that. I'm sure everyone asks you that, but I'm really curious what you think about that.
Do you think that you would want to put limitations into your software to not allow certain things? Do you think this is about to change the way we view videos, as this technology gets more standardized and available to everyone?
Cris:
For sure. As [with] every major technology breakthrough, there are always social concerns about how it might be misused or used in ways that weren't intended. It's a good exercise to look at history and see what has happened before.
There's this really good YouTube video about Photoshop when it was first released — I would think around the early 90s. It's kind of a late-night show, and they're discussing the ethical implications of manipulating images in magazines.
And they're like, should we allow people to manipulate images and put them in magazines? Half of the panel was like, "No, we shouldn't. It breaks the essence of what photography is," right?
20 years after that, it makes no sense to think about not doing something like that, right? There's always an adaptation process, I would say, where people need to...we need to collectively ask, "Hey, how is it going to be used?"
But I think ultimately, you understand what the limitations are, and you also fine-tune your eyes and your understanding of the world to make sense of that thing. Now everyone knows that "Photoshop" is a verb that you can use to describe something that's manipulated.
You do that same exercise, and you go back in time, and you see the same thing. When film was just starting to appear, there's this interesting story about one of the first films ever made, of a train arriving at a station.
They were projecting it in a room, and when people saw the train coming toward the station, everyone ran away because they thought a train was literally coming at them.
But then you make sense of it, and you're like, "Yeah, this is not real. I understand that this is just a representation of something." Ultimately, I think with AI and with generated content, we'll enter a similar phase, where it's going to become commonplace and something people are familiar with.
Of course, there are going to be misuses and bad uses. Of course, people can use Photoshop in all sorts of evil ways. But for 99% of people, their lives have been changed forever in a positive way because of this.

Runway's tech stack

Lukas:
Interesting.
Well, look, I'd love to hear more about your tech stack. This is a show for ML nerds of all types. I think you're doing pretty hardcore ML at scale.
What have been the challenges of making this work, making the interface as responsive as it was? What were the key things to scale up your models?
Cris:
Sure. There's a lot of things that we had to kind of come up [with] creatively, to make this work in real time.
On the one hand — on the ML side — we mostly use PyTorch for all of our models. We have a cluster — basically, an AWS cluster — that scales based on compute and demand, where we're running all those models for training. We sometimes use Lightning and, of course, Weights & Biases to follow up and better understand what's working in our model training.
For serving, we optimize for different GPU levels or compute platforms, depending on availability. We've made some systems to scale up depending on demand. On the frontend side of things, everything's TypeScript- and React-based. There's some WebGL acceleration stuff we're doing to make things really smooth.
And then there's the inference pipeline, where we're writing everything in C++ to make it super, super efficient and fast, specifically since you're decoding and encoding videos in real time. We also built this streaming system that passes video frames through different models to do the things that I just showed you. And so we also had to come up with that creatively.
That's kind of a big picture of our tech stack.
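As a concrete, minimal sketch of the training side of that stack — PyTorch for the model, Weights & Biases for experiment tracking — something like the following would work. The model, data, and project name are placeholders, not Runway's actual setup.

```python
# Minimal PyTorch training loop with Weights & Biases tracking.
import torch
import torch.nn as nn
import wandb

wandb.init(project="video-model-demo", config={"lr": 1e-3, "epochs": 5})  # placeholder project
cfg = wandb.config

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
optimizer = torch.optim.Adam(model.parameters(), lr=cfg.lr)
loss_fn = nn.CrossEntropyLoss()

# Synthetic data standing in for real features.
x = torch.randn(1024, 128)
y = torch.randint(0, 10, (1024,))

for epoch in range(cfg.epochs):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    wandb.log({"epoch": epoch, "train/loss": loss.item()})  # track each step

wandb.finish()
```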
Lukas:
One challenge that I'm seeing some of our customers run into — as these models kind of get bigger and more important — is that the actual serving cost of the application increases. Is that an issue for you? Do you do things like quantization? Is lowering your inference costs an important project for you all?
Cris:
For sure. Yeah, for sure. I mean, our biggest cost right now is AWS — GPU costs, and inference costs, and serving these models.
There are two main areas for sure. We have an HPC, and we're doing large-scale training of language models and video models. That takes a lot of resources and time. But just on serving...I would say the tradeoff between precision and speed really matters.
Quantizing models is great. But you also need to make sure that you're not affecting the quality of the model, because if you're affecting something on a pixel level, it might change the result from being okay to bad. And that might mean users churning. And so, if you're going to spend a few more seconds rendering, that might actually be better. There's always a tradeoff of how much.
But yeah, we always try to figure out what's the right balance there. We're still exploring some stuff on the browser. I think the browser is becoming really powerful.
The only constraint about the browser is just memory and RAM. And you get...it's a sandbox, so you can't really do a lot of things specifically with video. But you can run some stuff on the browser. And so we would send some things specifically, and convert some things, and make them smooth enough.
But I think we're not 100% there yet.
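To make the precision-versus-speed tradeoff Cris mentions concrete, here's a generic PyTorch sketch that dynamically quantizes a toy model's Linear layers to int8 and compares latency and output error against the fp32 original. It's an illustration of the tradeoff, not Runway's serving pipeline; dynamic quantization runs on CPU and mainly targets Linear/LSTM layers.

```python
# Dynamic int8 quantization on a toy model: measure the speed gain and the
# numerical error it introduces relative to fp32.
import time
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512)).eval()
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(64, 512)
with torch.no_grad():
    t0 = time.perf_counter()
    ref = model(x)        # fp32 reference output
    t1 = time.perf_counter()
    out = quantized(x)    # int8 output
    t2 = time.perf_counter()

print(f"fp32: {(t1 - t0) * 1e3:.1f} ms, int8: {(t2 - t1) * 1e3:.1f} ms")
print("max abs error:", (ref - out).abs().max().item())  # the quality cost of quantizing
```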
Lukas:
But you're also training your own large language models and large image models. That sounds like training would be a major cost for you as well.
Cris:
Yeah, for sure. Retraining some stuff to make sure it works in our domain is one of our core competencies. Now we're starting a huge training job on our HPC. That's going to take a big percentage of our costs for the next few months.

Creativity, and keeping humans in the loop

Lukas:
Wow.
I have to ask. That language interface that you showed me was so compelling and cool. But I have been seeing language interfaces for the past 20 years, and the challenge with these language interfaces is that when they don't work, they're just enraging. Actually, you sort of addressed that by showing how it creates these things, and you can undo them and kind of modify them.
Do you feel that that kind of conversational interface is at the point where, for you, it's an interface that you really want to use?
Cris:
I like to think [of] it as a tool. It's not the sole answer to everything you need. This is not going to be a replacement for all of the workflows in making content, video, images, or sound, or whatever it is. It's just a speed-up in the way you can do those kinds of things.
I think the sweet spot is a combination of both. Being able to have that constant feedback loop with the system, where you're stating something [and] the system is reacting in some way that matches your idea. And then you have that level of control, so you're going in the direction you want and doing what you want. Or, if it's not working, you just do it yourself, right?
I think a big mistake of research — specifically in the area of computer creativity — is this idea that you're going to automate it entirely. You see one-click solutions to do X, Y, or Z. I think that's missing the bigger picture of how most creative workflows actually work.
That probably means that you've never actually worked with an agency where the client was asking you to change things every single hour, or make it bigger, make it smaller, right?
Lukas:
Right.
Cris:
It's hard for me to imagine a world where you have a one-click solution for everything. That feels boring, to be honest. You want to have that control.
I think language interfaces are a huge step towards accelerating the speed at which you can execute. Are they the final answer for everything? I'm not sure, but they do make you move faster on your ideas.
Lukas:
Did I understand you right that you want to build your own large language model? I would assume you would take one of the many off-the-shelf language models today. Are you actually training your own?
Cris:
Yeah, I think it's...we are, but it's also the fact that the infra for models, and the models themselves, are becoming commodities.
It's great for companies like us, because some stuff we kind of need to build on our own. There's a lot of things in Runway that you won't find anywhere else.
But there's a lot of stuff, like large language models, that you can just use off the shelf. You have all these companies offering similar services. As a consumer of those, if we want to use them, it's just a cost situation — we'll use whoever offers the best model.
And at some point, it might make sense to do our own. So yeah, sometimes we don't have to do everything ourselves. You can just buy it off the shelf. But other times, you just need to do it because it doesn't exist.
Lukas:
Sorry — large language models, you think you might do those yourself, even?
Cris:
We're doing a combination of both. We're using APIs but also re-training some of our own.
Lukas:
I see, I see. Have you experimented with all the large models out there? Do you have a favorite of the existing offerings?
Cris:
I think GPT-3 works. I think, actually, the model is Davinci. It's probably GPT-4 by now. I think OpenAI has been making-
Lukas:
-right, right.
Cris:
-improvements to that silently behind the scenes. It works really well. That's the one I'd say we're experimenting with the most, and the one we get the best results from.

The potential of audio generation and new mental models

Lukas:
Cool. Well, look, we always end with two questions. I want to make sure I get them in.
The second-to-last question is, what is a topic that you don't get to work on, that you wish you had more time to work on? Or, what's something that's sort of underrated for you in machine learning right now?
I realize it's a funny question to ask an obsessed ML founder. But I’ll ask it anyway.
Cris:
I think, audio generation. I think it's catching up now, but no one has really been paying a lot of attention to it.
There are some really interesting open source models out there, from Tacotron to a few other things. I think that's going to be really, really transformative for a bunch of applications. We're already kind of stepping into some stuff there.
But it's hard to focus as an industry — or as a research community — on a lot of things at the same time. And now that image understanding has kind of been solved, people are moving to other specific fields. I think one of the ones we're going to start seeing very soon is audio generation.
So yeah, excited for that for sure.
Lukas:
Yeah, I totally agree. Do you have a favorite model out there? We just recently talked to the folks behind Dance Diffusion, at Harmonai, who were doing some cool audio generation stuff.
Cris:
Yeah, there's one — let me search for it — that just blew my mind. tortoise-tts, I don't know if you've seen that one.
Lukas:
No.
Cris:
Yeah. tortoise-tts is, I think, the work of just one single person, James Betker. It works really well, and he's been...someone used it to create the Lex Fridman generative podcast. I'll share the audio with you.
It's a whole podcast series that comes out every week, where everything is generated. The script is generated by GPT-3 and the audio is generated by tortoise. And you can hear it — it's a podcast. You can't really tell.
Yeah, really excited for stuff like that.
Lukas:
Cool.
The final question is: for you, what's been the hardest part about getting the actual ML to work in the real world? Going from ideas, models, or research to something deployed and working for users.
Cris:
I think these models — and things like image generation and video generation — require a different mental model of how you can leverage this in creative ways. I think a big mistake has been to try to use existing principles of image or video generation and patch them with this stuff.
I think, ultimately, you need to think about it in very different ways. Navigating a latent space is not the same as editing an image, right?
What are the metaphors and the abstractions they need to have? We've come up with those before, in the software pipeline that we have right now. You have a brush, and a paint bucket, and a content-aware tool, and you're editing stuff.
But when you have large language models that are able to translate ideas into content, and you navigate and move across specific space or vector direction in ways you want, you need new metaphors and you need new abstractions.
What's been really interesting and challenging is, what are those metaphors? What are those interfaces? How do you make sure the systems you're building are really expressive?
I think two things that drive a lot of what we do are control and expressiveness. "Control" as in you, as a creator, want to have full control over your making. That's really important. And how do you make it so it's also expressive? So you can move in the specific ways you're intending to.
So yeah, it's really exciting for us, and we're passionate about inventing some of that stuff.

Outro

Lukas:
Well, it’s really impressive what you did. Thanks so much for the interview.
Cris:
Of course, thanks so much for hosting me.
Lukas:
If you're enjoying these interviews and you want to learn more, please click on the link to the show notes in the description where you can find links to all the papers that are mentioned, supplemental material and a transcription that we work really hard to produce. So check it out.