Creating videos from static images with Stable Video Diffusion
The model known for generating images has been upgraded to handle video! We will cover the basics of the model, and also generate some sample videos!
Latent diffusion models have emerged as a powerhouse for high-quality image synthesis. Rooted in stochastic processes, these models gradually transform noise into structured data, akin to an artist starting with a blank canvas.
This unique approach has positioned diffusion models at the forefront of image generation tasks. Researchers have taken this idea a step further by adapting these models to generate videos.
We'll walk through how to use a popular model, Stable Video Diffusion, and run a few experiments that turn static images into short videos.

Here's what we'll cover:
The basics: Latent diffusion models
Adapting latent diffusion models to video
Temporal layers
The prediction model
The interpolation model
The temporally aware decoder
The upscaler
Training and Inference
The code
Our first experiment
Experiment 2
Summary
The basics: Latent diffusion models
Latent diffusion models (LDMs) are generative models that operate in a compressed, or latent, space of data representations, which significantly reduces computational cost. Autoencoders are central to LDMs: the encoder compresses input data into a latent representation, and the decoder reconstructs it.
The diffusion process, which gradually adds noise to and then denoises these latent representations, is trained to generate new data samples (latent representations that can be decoded into images). By working in the latent space, LDMs achieve efficient learning and high-quality generation, making them powerful tools for tasks like image synthesis, where they can produce detailed images from textual descriptions or modify existing images in nuanced ways.
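To make the latent space idea concrete, here is a minimal sketch of that encode/decode round trip using a Stable Diffusion autoencoder from diffusers. Treat the checkpoint name and image URL as illustrative placeholders; this is not part of Stable Video Diffusion's training code, just a demonstration of how much smaller the latent representation is than the pixels it encodes.
import numpy as np
import torch
from diffusers import AutoencoderKL
from diffusers.utils import load_image

# Load a Stable Diffusion VAE (an illustrative checkpoint choice).
vae = AutoencoderKL.from_pretrained(
    "stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16
).to("cuda")

# Prepare an image as a (1, 3, H, W) tensor scaled to [-1, 1].
image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png"
).convert("RGB").resize((512, 512))
pixels = torch.from_numpy(np.array(image)).half().to("cuda")
pixels = (pixels / 127.5 - 1.0).permute(2, 0, 1).unsqueeze(0)

with torch.no_grad():
    # Encode: 3x512x512 pixels -> 4x64x64 latents (8x smaller in each spatial dimension).
    latents = vae.encode(pixels).latent_dist.sample()
    # A latent diffusion model adds and removes noise in this compressed space,
    # then hands the final latents back to the decoder.
    reconstruction = vae.decode(latents).sample

print(pixels.shape, latents.shape)  # torch.Size([1, 3, 512, 512]) torch.Size([1, 4, 64, 64])
The diffusion model itself never touches the 512x512x3 pixel grid; it learns to denoise the much smaller 4x64x64 latents, which is where the efficiency gains come from.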
Adapting latent diffusion models to video
The leap from static images to dynamic video synthesis introduces the dimension of time, requiring the model not only to understand spatial content but also how scenes evolve.
To address this, researchers have innovated by integrating temporal layers into latent diffusion models. These layers are designed to capture the essence of motion and change across frames, enabling the model to generate sequences of images that are coherent not just visually but temporally. This adaptation involves a sophisticated understanding of temporal dynamics, akin to teaching the model the principles of cinematography.
Temporal layers
Temporal layers are added to the architecture of a U-Net block (the core architecture in the LDM) in video synthesis models to imbue the network with the ability to understand and generate temporally coherent sequences.
In image synthesis, standard U-Net architectures process each image independently, without accounting for the sequence or the passage of time. However, video data is inherently temporal, with each frame connected to the next in a sequence that conveys motion and change over time.
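To make this concrete, here is a minimal, illustrative sketch of the general pattern: a temporal mixing layer that reshapes the hidden states so the frames of each video can attend to one another. This is a toy layer written for clarity, not the exact block used in Stable Video Diffusion.
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Toy temporal mixing layer: the frames of each video attend to one another."""

    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, hidden, num_frames):
        # hidden: (batch * frames, channels, height, width) -- the shape the spatial
        # U-Net layers operate on, with frames simply treated as extra batch items.
        bf, c, h, w = hidden.shape
        b = bf // num_frames

        # Fold space into the batch and expose time as the sequence dimension:
        # (b*f, c, h, w) -> (b, f, c, h, w) -> (b*h*w, f, c)
        x = hidden.reshape(b, num_frames, c, h, w)
        x = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, num_frames, c)

        # Each spatial location attends across the frames of its own video.
        normed = self.norm(x)
        x = x + self.attn(normed, normed, normed, need_weights=False)[0]

        # Undo the reshape so the next spatial layer sees the usual shape.
        x = x.reshape(b, h, w, num_frames, c).permute(0, 3, 4, 1, 2).reshape(bf, c, h, w)
        return x

# Example: 2 videos x 14 frames with 64-channel feature maps at 32x32 resolution.
features = torch.randn(2 * 14, 64, 32, 32)
out = TemporalAttention(channels=64)(features, num_frames=14)
print(out.shape)  # torch.Size([28, 64, 32, 32])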
The diagram below shows how these temporal layers are added to the model. When a batch of frames is passed in, the spatial layers work as they normally would, while the hidden states are reshaped across the batch dimension and fed into temporal layers, which learn to adjust each frame's hidden states so the frames become more temporally aligned.

The prediction model
The prediction model is central to extending LDMs for video generation. In fact, it's the core model of interest in Stable Video Diffusion. It shifts the focus from generating individual frames in isolation to predicting future frames based on a given context, such as an initial frame or sequence.
This generation process is conditioned on an encoded image corresponding to 0, 1, or 2 initial frames, which allows the model to generate future frames with reference to those initial frames. The generated frames are referred to as "key frames": frames with more temporal spacing than a regular video.
In later stages of the inference process, another model is used to “interpolate” between these key frames, and we will cover this interpolation model in the next section.
The interpolation model
To achieve high frame rates essential for smooth video playback, an interpolation model is employed. This model is adept at generating intermediate frames between key frames, effectively increasing the temporal resolution of the video. It's akin to filling in the gaps, ensuring that transitions between frames are seamless and that motion appears fluid.
This model's ability to interpolate frames is fundamental to enhancing the realism of synthesized videos, making the difference between a choppy sequence and a lifelike portrayal of motion.
The temporally aware decoder
A temporally aware decoder plays a pivotal role in translating latent representations back into pixel space while ensuring temporal consistency across frames.
Fine-tuning the decoder with video data, alongside a temporal discriminator, ensures that the generated sequences don’t just resemble a collection of independent frames but rather a smooth, cohesive video. This aspect of the model is crucial for overcoming the flickering artifacts typical of frame-by-frame generation, instead producing videos with fluid motion that captures the natural progression of time.
The upscaler
Video synthesis at high resolutions demands an upscaler model capable of enhancing the spatial resolution of generated videos without compromising temporal coherence. This component of the LDM architecture scales up the video's resolution, ensuring that the details are crisp and clear, even at large display sizes. The upscaler operates by enhancing the detail in each frame while maintaining consistency across frames, a critical factor for ensuring that the high-resolution output remains faithful to the source material in motion and appearance.
Training and Inference
The training process of these advanced video synthesis models involves several stages, each tailored to instill a deep understanding of both spatial content and temporal dynamics. Initially, models are trained on images to learn high-quality spatial representations. Subsequent stages involve fine-tuning with video data to introduce temporal coherence, training prediction models for generating sequences, and enhancing frame resolution with upscalers.
During inference, the model starts with a noise distribution and progressively refines it through the reverse diffusion process, conditioned on temporal information. This iterative denoising, guided by the learned temporal dynamics and spatial details, culminates in the generation of coherent video sequences from noise. It's important to note that the generation process can optionally be conditioned on one or two initial frames, so that the generated video stays grounded in the contextual image(s).
The code
Ok, now that we have covered some of the theoretical details of the model, we're ready to move on to some coding! We will run inference with the model, and generate some videos.
Gathering data
Before you get started with the Stable Video Diffusion (SVD) model, gather a set of images that will serve as the initial condition for the video generation process. Considering the model's ability to capture dynamic motion, select images where significant movement or flow between frames is expected or implied. These images could be of athletes in motion, vehicles in transit, or any subject that implies movement, providing a rich dataset for generating synthetic videos.
Here are some of the images I used for testing the model:
Once you have your conditioning images, you can explore the model's adjustable parameters that influence the generated videos. Two such parameters are particularly impactful:
motion_bucket_id: This parameter determines the motion profile of the generated video. It's part of a predefined set of motion buckets that the model can draw upon to infuse the right kind of movement into the video. Each bucket corresponds to a certain type of motion, with higher IDs typically representing more intense or faster movements. By increasing the motion_bucket_id, you encourage the model to generate videos with more pronounced motion, which can be particularly useful when simulating actions or events that naturally involve rapid movement.
noise_aug_strength: This controls the level of stochasticity or randomness introduced into the conditioning image. A higher noise_aug_strength leads to videos that may diverge more from the initial image, incorporating more novel elements or variations. This can be especially useful when you desire a higher degree of unpredictability or creativity in the generated videos. Additionally, it's noted that increasing noise augmentation can indirectly increase the perception of motion in the video, making the scenes appear more dynamic.
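Before scaling up to full experiments, here is a minimal single-image example with the diffusers StableVideoDiffusionPipeline showing where both parameters plug in. The rocket image is the standard diffusers example, and the specific values are just starting points, not recommendations.
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

# Load the image-to-video pipeline in half precision.
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid", torch_dtype=torch.float16, variant="fp16"
)
pipe.enable_model_cpu_offload()  # offload submodules to CPU to reduce peak VRAM

# One conditioning image, resized to the resolution the model expects.
image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png"
).resize((1024, 576))

generator = torch.manual_seed(42)
frames = pipe(
    image,
    decode_chunk_size=8,      # decode a few frames at a time to save memory
    generator=generator,
    motion_bucket_id=180,     # higher -> more pronounced motion
    noise_aug_strength=0.1,   # higher -> more deviation from the conditioning image
).frames[0]

export_to_video(frames, "rocket.mp4", fps=7)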
Our first experiment
To start, we'll vary motion_bucket_id while keeping noise_aug_strength fixed at 0.1 to observe how the model generates videos from images across different motion buckets. This setup aims to explore how Stable Video Diffusion interprets and generates motion from static images, and to assess the realism and fluidity of the generated movement.
Here is the code to run inference:
from diffusers.utils import load_image, export_to_video
from diffusers import StableVideoDiffusionPipeline
import torch
import os
import numpy as np
import wandb
from PIL import Image

# Initialize the video diffusion pipeline
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid", torch_dtype=torch.float16, variant="fp16"
)
pipe.enable_model_cpu_offload()

# List of image URLs
image_urls = [
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png",
    "https://rocklandpeakperformance.com//wp-content/uploads/2019/12/Swing-Cycle-5.jpg",
    "https://image-cdn.hypb.st/https%3A%2F%2Fhypebeast.com%2Fimage%2F2017%2F11%2Fsteph-curry-teaching-basketball-masterclass-11.jpeg?cbr=1&q=90",
    "https://images.sidearmdev.com/resize?url=https%3A%2F%2Fdxbhsrqyrr690.cloudfront.net%2Fsidearm.nextgen.sites%2Ftamu.sidearmsports.com%2Fimages%2F2020%2F6%2F15%2FQGASRJSCZLRYJHI.20121125034733.jpg&height=300",
    "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQwYGAUp0TJvO8iF-9tHtTuqrt0oeqonL_1CtZFKg7UOw&s",
    "https://middleofnowheregaming.files.wordpress.com/2015/07/rocketleague-2015-07-13-16-39-09-68.jpg",
]

motion_buckets = [120, 180, 240, 300, 360]  # Example motion bucket IDs
noise_aug_strength = 0.1  # Keeping noise augmentation strength constant

# Iterate over each motion bucket first
for bucket in motion_buckets:
    # Initialize Weights & Biases run for each bucket
    wandb.init(entity='byyoung3', project='mlnews2', name=f'bucket_{bucket}', reinit=True, group='exp_1')

    # Then, process each image for the current motion bucket
    for i, url in enumerate(image_urls):
        # Load the conditioning image
        image = load_image(url).convert("RGB")  # Ensure image is in RGB format
        image = image.resize((1024, 576))

        # Generate video frames with the current motion bucket
        generator = torch.manual_seed(42)
        result = pipe(image, decode_chunk_size=8, generator=generator,
                      motion_bucket_id=bucket, noise_aug_strength=noise_aug_strength)

        # Convert frames to a numpy array and transpose the axes to (time, channels, height, width)
        frames_np = np.stack([np.array(frame) for frame in result.frames[0]])
        frames_np = frames_np.transpose((0, 3, 1, 2))

        # Log video to wandb with specific motion bucket ID
        wandb.log({f"image_{i}": wandb.Video(frames_np, fps=7, format="mp4")})

    # Finish the wandb run for the current motion bucket
    wandb.finish()
This code iterates over images, converting each into a video based on predefined motion patterns, or "buckets." For each image, after loading and resizing, it generates video frames using a specific motion bucket setting and a constant noise augmentation strength. A generator with a fixed seed ensures consistency in the video creation process. The frames are then converted into a numpy array, with axes transposed to match the required format for video logging. Each video is logged to Weights & Biases with a unique identifier for the image and bucket. After processing all images in a bucket, the code concludes the logging session, moving on to the next motion bucket.
Here are the results for my run:
Experiment 2
Experiment 2 delves deeper into the effects of noise augmentation on video generation while keeping the motion intensity constant. By varying noise_aug_strength from 0.1 to 0.4, we aim to understand how different levels of noise influence the visual characteristics of the generated videos, especially in terms of motion clarity, texture fidelity, and overall visual coherence. This setup provides a focused lens on the role of noise in the generative process.
Here is the code for experiment 2:
from diffusers.utils import load_image, export_to_video
from diffusers import StableVideoDiffusionPipeline
import torch
import os
import numpy as np
import wandb
from PIL import Image

# Initialize the video diffusion pipeline
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid", torch_dtype=torch.float16, variant="fp16"
)
pipe.enable_model_cpu_offload()

# List of image URLs
image_urls = [
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png",
    "https://rocklandpeakperformance.com//wp-content/uploads/2019/12/Swing-Cycle-5.jpg",
    "https://image-cdn.hypb.st/https%3A%2F%2Fhypebeast.com%2Fimage%2F2017%2F11%2Fsteph-curry-teaching-basketball-masterclass-11.jpeg?cbr=1&q=90",
    "https://images.sidearmdev.com/resize?url=https%3A%2F%2Fdxbhsrqyrr690.cloudfront.net%2Fsidearm.nextgen.sites%2Ftamu.sidearmsports.com%2Fimages%2F2020%2F6%2F15%2FQGASRJSCZLRYJHI.20121125034733.jpg&height=300",
    "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQwYGAUp0TJvO8iF-9tHtTuqrt0oeqonL_1CtZFKg7UOw&s",
    "https://middleofnowheregaming.files.wordpress.com/2015/07/rocketleague-2015-07-13-16-39-09-68.jpg",
]

noise_aug_strengths = [0.1, 0.2, 0.3, 0.4]
motion_bucket_id = 120

# Iterate over each noise augmentation strength
for noise_aug_strength in noise_aug_strengths:
    # Initialize Weights & Biases run for each noise strength
    wandb.init(entity='byyoung3', project='mlnews2', name=f'noise_{noise_aug_strength}', reinit=True, group='exp_2')

    # Process each image for the current noise augmentation strength
    for i, url in enumerate(image_urls):
        # Load the conditioning image
        image = load_image(url).convert("RGB")  # Ensure image is in RGB format
        image = image.resize((1024, 576))

        # Generate video frames with the current noise augmentation strength
        generator = torch.manual_seed(42)
        result = pipe(image, decode_chunk_size=8, generator=generator,
                      motion_bucket_id=motion_bucket_id, noise_aug_strength=noise_aug_strength)

        # Convert frames to a numpy array and transpose the axes to (time, channels, height, width)
        frames_np = np.stack([np.array(frame) for frame in result.frames[0]])
        frames_np = frames_np.transpose((0, 3, 1, 2))

        # Log video to wandb with specific noise augmentation strength
        wandb.log({f"image_{i}": wandb.Video(frames_np, fps=7, format="mp4")})

    # Finish the wandb run for the current noise augmentation strength
    wandb.finish()
Here are the results:
Now, unfortunately, these results were a bit less impressive than some of the videos that OpenAI's Sora model has produced. That's OK! Open source has routinely run about 6-12 months behind OpenAI for a while now, and many of the details around the Sora model are public, so expect to see more impressive generative videos in the future. I'd say there's a good chance Meta or Mistral AI has an awesome video model training right now, so just be patient! Additionally, now is the perfect time to prepare custom datasets for your specific application, so that when the new open-source models drop, you will be ready to go.
Summary
The adaptation of latent diffusion models for video synthesis represents a remarkable convergence of spatial understanding and temporal intuition. By integrating temporal layers, prediction models, and specialized decoders, alongside interpolation and upscaling techniques, researchers have paved the way for generating high-resolution videos that are not only visually compelling but also temporally coherent. This evolution of diffusion models showcases the extraordinary potential of generative models in bridging the gap between static imagery and dynamic video content, opening new avenues for creative and practical applications in digital media. I hope you enjoyed this work!