
Imagen Video: Google Shows Off Text-To-Video AI Only A Week After Meta AI's

Google has revealed Imagen Video, a new text-to-video generation model, coming quickly after Meta AI's recent Make-A-Video reveal.
Created on October 6 | Last edited on October 7
In an unexpected turn of events, one week after Meta AI showed off their impressive Make-A-Video project, Google has revealed Imagen Video, an extension of their Imagen model, for text-to-video generation.


How does Imagen Video work?

The Imagen Video model is effectively split into two parts: first, an initial video generation model, and second, a cascade of super-resolution models.
The video generation portion of the model uses the input text prompt to create a short 16-frame, 24x40-pixel video. Frames are generated using spatial convolution and self-attention layers, while temporal convolution and self-attention layers keep each frame consistent with the frames before and after it, producing coherent motion.
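The factorized space-time processing described above can be sketched with plain array reshapes: spatial layers treat each frame's pixels as one sequence, while temporal layers treat each pixel location's trajectory across frames as one sequence. A minimal NumPy sketch, where the tensor layout and helper names are illustrative rather than taken from the paper:

```python
import numpy as np

# Toy video tensor shaped (frames, height, width, channels),
# sized like the base model's 16-frame, 24x40 output.
video = np.random.rand(16, 24, 40, 3)

def spatial_view(v):
    """Flatten each frame's pixels into one sequence, so a spatial
    attention layer mixes information only within a single frame."""
    f, h, w, c = v.shape
    return v.reshape(f, h * w, c)

def temporal_view(v):
    """Group the same pixel location across all frames into one sequence,
    so a temporal attention layer mixes information only across time."""
    f, h, w, c = v.shape
    return v.transpose(1, 2, 0, 3).reshape(h * w, f, c)

print(spatial_view(video).shape)   # (16, 960, 3): 16 per-frame sequences
print(temporal_view(video).shape)  # (960, 16, 3): 960 per-pixel sequences
```

Running attention over these two views alternately is what lets the model trade full 3D attention for something far cheaper while still relating every frame to its neighbors.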
The generated frames then pass through a cascade of spatial and temporal super-resolution models that upscale the video in both pixel resolution and frame rate. At the end of the process, Imagen Video produces a 128-frame, 1280x768-pixel video.
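The cascade can be sketched as alternating temporal and spatial upsampling stages. The sketch below uses naive frame and pixel repetition as a stand-in for the learned super-resolution diffusion models, and the per-stage factors are illustrative choices that simply multiply out to the final output size (a single uint8 channel keeps the toy arrays small):

```python
import numpy as np

def spatial_upsample(v, s):
    """Nearest-neighbour pixel repetition: a stand-in for a learned
    spatial super-resolution (SSR) model."""
    return v.repeat(s, axis=1).repeat(s, axis=2)

def temporal_upsample(v, s):
    """Frame repetition: a stand-in for a learned temporal
    super-resolution (TSR) model that fills in intermediate frames."""
    return v.repeat(s, axis=0)

# Base sample: 16 frames at 24x40, laid out (frames, height, width, channels).
video = np.zeros((16, 24, 40, 1), dtype=np.uint8)

# Alternating TSR/SSR stages whose factors multiply out to the
# final 128-frame, 1280x768 video (8x more frames, 32x more pixels per side).
for upsample, factor in [(temporal_upsample, 2), (spatial_upsample, 4),
                         (temporal_upsample, 2), (spatial_upsample, 4),
                         (temporal_upsample, 2), (spatial_upsample, 2)]:
    video = upsample(video, factor)

print(video.shape)  # (128, 768, 1280, 1)
```

The point of the cascade is that each stage only has to solve a modest upscaling problem, which is far more tractable than generating the full-resolution video in one shot.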


Phenaki for creating long-form video

Another project closely tied to Imagen Video, also created by Google researchers, is Phenaki, a model that can generate full, coherent videos of arbitrary length from long paragraphs of input text.
Currently, Phenaki is limited in resolution and image quality, but the researchers behind both projects plan to incorporate Imagen Video into Phenaki, combining the high-resolution output of Imagen Video with the incredible long-form length and coherency of Phenaki.

A boom for text-to-video generation?

In the early boom of text-to-image generation, DALL·E 2 was shortly followed by Google's Imagen and Meta AI's Make-A-Scene. With Craiyon serving as the first image generation model the public could easily play with, text-to-image generation moved machine learning into the vocabulary of the average person.
A week ago, Meta AI revealed Make-A-Video, realizing the idea of text-to-video generation. Now, with Google's Imagen Video, we could be seeing an opportunity for someone to create the Craiyon (or DALL·E Mini) of text-to-video models, let alone a Stable Diffusion for text-to-video.
Of course, text-to-video requires significantly more processing power than text-to-image, so it might be a bit less feasible for the average enthusiast to jump right into creating video content with AI.

Find out more

Tags: ML News