Text2Bricks: Fine-tuning Open-Sora in 1,000 GPU Hours
Fine-tuning a model to LEGO-ify video clips
Text-to-video models have opened up a world of possibilities for developers and content creators. However, proprietary models can be difficult to access or may produce unsatisfactory results for specific needs.
Fine-tuning an open-source model with your own data allows you to improve its ability to generate videos tailored to your projects, such as creating unique styles or enhancing quality for specific subjects. For example, you can recreate classic movie scenes in a unique artistic style, as you can see in this game we made.
In this article, we'll explore the technical steps involved in fine-tuning an Open-Sora 1.1 Stage3 model to create stop-motion animations. At Lambda, we've released two models:
- lambdalabs/text2bricks-360p-64f: trained in 1,000 GPU hours (NVIDIA H100) and generates 360p videos of up to 64 frames.
- lambdalabs/text2bricks-360p-32f: trained in 170 GPU hours (NVIDIA H100) and generates 360p videos of up to 32 frames.
We have released the code (our fork of Open-Sora), dataset, and models (32f and 64f). You can play with the 64f model in this Gradio demo. Also, here is an awesome blog post from the creators of the original Open-Sora for anyone who missed it: https://hpc-ai.com/blog/open-sora
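If you just want to poke at the weights, a minimal sketch like the following pulls the 64f checkpoint locally with the huggingface_hub library; actual sampling still goes through our Open-Sora fork and its inference configs.

```python
# Minimal sketch: fetch the fine-tuned checkpoint from the Hugging Face Hub.
# Running inference still requires the Open-Sora fork and its sampling configs.
from huggingface_hub import snapshot_download

ckpt_dir = snapshot_download(repo_id="lambdalabs/text2bricks-360p-64f")
print(f"Checkpoint downloaded to: {ckpt_dir}")
```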
Here are a couple of example outputs of our fine-tuned model:
Setup
Hardware: Our training infrastructure is a 32-GPU Lambda 1-Click Cluster. It has four NVIDIA HGX H100 servers, each powered by 8 x NVIDIA H100 SXM Tensor Core GPUs and connected by NVIDIA Quantum-2 400 Gb/s InfiniBand networking. The node-to-node bandwidth is 3,200 Gb/s, which enables distributed training to scale linearly across multiple nodes. The cluster also comes with pay-as-you-go Lambda Cloud shared filesystem storage, which allows data, code, and the Python environment to be shared across all the nodes. You can find more information about 1-Click Clusters in this blog post.
Software: The 32-GPU cluster comes with the NVIDIA driver pre-installed. We wrote a tutorial for creating a Conda environment that manages the rest of the dependencies for Open-Sora, including NVIDIA CUDA, NVIDIA NCCL, PyTorch, Transformers, Diffusers, Flash-Attention, and NVIDIA Apex. We put the Conda environment on the shared filesystem storage so that it can be activated from any node.
This cluster of 32 NVIDIA GPUs delivers a training throughput of 97,200 video clips per hour (360p, 32 frames per video).
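Before kicking off a long multi-node job, it can be worth running a quick NCCL sanity check to confirm that all ranks can talk to each other over InfiniBand. Below is a minimal sketch (not part of our released code) that could be launched with torchrun on each node:

```python
# sanity_allreduce.py -- minimal NCCL all-reduce check.
# Launch on every node, e.g.:
#   torchrun --nnodes=4 --nproc_per_node=8 --rdzv_backend=c10d \
#            --rdzv_endpoint=<head-node-ip>:29500 sanity_allreduce.py
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)

# Each rank contributes a tensor of ones; after the all-reduce, every rank
# should hold the world size in every element.
x = torch.ones(1024, device="cuda")
dist.all_reduce(x)
assert torch.allclose(x, torch.full_like(x, dist.get_world_size()))
if dist.get_rank() == 0:
    print(f"all-reduce OK across {dist.get_world_size()} GPUs")
dist.destroy_process_group()
```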
Data
Data Source: Our dataset consists of videos sourced from a few popular YouTube channels, including MICHAELHICKOXFilms, LEGO Land, FK Films, and LEGOSTOP Films. These videos are high-quality stop-motion animations created with LEGO® bricks. The full dataset is available on HuggingFace.
We provide a script to help you create your own customized dataset from YouTube URLs. The data processing pipeline follows Open-Sora's guidelines: it first cuts the videos into clips of 15-200 frames and then annotates them using a vision-language model. In total, we have 24k 720p/16:9 video clips. Open-Sora also recommends using static images to help the model learn object appearance in finer detail. To add images to our dataset, we simply collect the middle frame of each video clip.
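As a rough illustration of the image-collection step (the exact logic lives in our released scripts, so the paths and naming here are hypothetical), grabbing the middle frame of each clip can be done with OpenCV along these lines:

```python
# Hypothetical sketch: grab the middle frame of each clip to build the image set.
# Paths and naming are illustrative, not the exact layout of the released scripts.
from pathlib import Path
import cv2

clip_dir = Path("clips")        # directory of trimmed video clips
image_dir = Path("images")      # where middle frames are written
image_dir.mkdir(exist_ok=True)

for clip_path in clip_dir.glob("*.mp4"):
    cap = cv2.VideoCapture(str(clip_path))
    n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Seek to the middle frame and decode it.
    cap.set(cv2.CAP_PROP_POS_FRAMES, n_frames // 2)
    ok, frame = cap.read()
    if ok:
        cv2.imwrite(str(image_dir / f"{clip_path.stem}.jpg"), frame)
    cap.release()
```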
Data Caption: We use GPT-4o with a specific prompt to annotate the video clips. This is our prompt:
A stop motion lego video is given by providing three frames in chronological order, each pulled frame from 20%, 50%, and 80% through the full video. Describe this video and its style to generate a description. If the three frames included do not give you enough context or information to describe the scene, say 'Not enough information'. If the three frames all appear identical, say 'Single image'. If the three frames depict very little movement, say 'No movement'. Do not use the term video or images or frames in the description. Do not describe each frame/image individually in the description. Do not use the word lego or stop motion animations in your descriptions. Always provide descriptions for lego stop motion videos but do not use the word lego or mention that the world is blocky. Pay attention to all objects in the video. The description should be useful for AI to re-generate the video. The description should be less than six sentences.
We also gave GPT-4o a few prompts from OpenAI's Sora demo as few-shot examples, including "A stylish woman confidently and casually walks down a Tokyo street", "Mammoths tread through a snowy meadow", and "Big Sur". The middle frames of the video clips are also captioned using GPT-4o, with the prompt adjusted for image data.
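To give a feel for how such a captioning call looks, here is a hedged sketch using the OpenAI Python client; the three frames are the ones pulled from 20%, 50%, and 80% through the clip, CAPTION_PROMPT stands in for the prompt above, and the helper names are illustrative rather than the exact code in our pipeline.

```python
# Illustrative sketch of a three-frame GPT-4o captioning call.
# `CAPTION_PROMPT` stands in for the prompt shown above; file handling is simplified.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def caption_clip(frame_paths: list[str], prompt: str) -> str:
    # Frames are pulled from 20%, 50%, and 80% through the clip, in order.
    content = [{"type": "text", "text": prompt}]
    for p in frame_paths:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{encode_image(p)}"},
        })
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content
```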
Despite being generated by the latest and most advanced GPT model, the captions can still contain errors. This can be seen in the example below, where the bolded annotations are wrong. This highlights the difficulty of obtaining high-quality data labels in this specific subject domain.

A character with a shocked expression is seated inside what appears to be a bathroom, with its expression progressively changing to one that is more relaxed and content. To the character's side, there is a brown cabinet and a white object that resembles a sink. Adjacent to the character lies a floor that transitions from a blue to a green surface, with an item resembling a briefcase or a satchel cast aside on the ground. The overall setting conveys a simplistic indoor scene where a figure experiences a rapid shift in emotions while in a seated position.
Model
Pre-trained Model: We use the latest Open-Sora model (released on 2024.04.25) due to its flexibility in continuing training with different spatio-temporal resolutions and aspect ratios. Our plan is to fine-tune the pre-trained OpenSora-STDiT-v2-stage3 model with the BrickFilm dataset so it can generate videos in a similar style. The configuration and commands to train the model can be found in this guide.
Our first successful model (text2bricks-360p-64f) produces 360p videos with up to 64 frames. It took a total of 1,017.6 H100 hours to train on this platform. Here is the breakdown:
- Stage One (160 H100 hours): we first focus on generating videos at 360p resolution and 16 frames. To make fine-tuning more stable, we use 500 cosine warmup steps before keeping the learning rate constant at 1e-5. Doing so gradually "rebuilds" the optimizer state and avoids catastrophic model behavior at the beginning of training.
- Stage Two (857.6 H100 hours): we add the image dataset, as well as 32-frame and 64-frame clips, to the bucket configuration (see the configuration sketch below).
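For readers unfamiliar with Open-Sora's bucket mechanism, the sketch below shows the shape of such a bucket configuration in Open-Sora 1.1's config format; the probabilities and batch sizes are placeholders, and the real values are in our released training configs.

```python
# Sketch of an Open-Sora 1.1 style bucket_config for Stage Two.
# Keys are resolutions, inner keys are frame counts (1 = static image),
# and each tuple is (sampling probability, per-GPU batch size).
# The numbers below are placeholders; see the released configs for the real values.
bucket_config = {
    "360p": {
        1: (1.0, 32),    # static images (middle frames of clips)
        16: (1.0, 8),    # short clips carried over from Stage One
        32: (1.0, 4),    # added in Stage Two
        64: (1.0, 2),    # added in Stage Two
    },
}
```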
We also trained a second model (text2bricks-360p-32f) with only 169.6 H100 hours, using a one-cycle learning rate schedule (sketched after the list below). It produces comparable results at 360p resolution with up to 32 frames.
- Stage One (67.84 H100 hours): We first increased the learning rate from 1e-7 to 1e-4 with 1500 cosine warmup steps.
- Stage Two (101.76 H100 hours): We then decreased the learning rate to 1e-5 with 2500 cosine annealing steps.
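Put together, the one-cycle schedule looks roughly like the following PyTorch sketch (the model and optimizer are stand-ins; in practice the schedule is driven by the Open-Sora training config):

```python
# Hedged sketch of the one-cycle schedule used for text2bricks-360p-32f:
# cosine warmup 1e-7 -> 1e-4 over 1,500 steps, then cosine annealing
# 1e-4 -> 1e-5 over 2,500 steps. The model and optimizer here are stand-ins.
import math
import torch

peak_lr, start_lr, final_lr = 1e-4, 1e-7, 1e-5
warmup_steps, anneal_steps = 1500, 2500

model = torch.nn.Linear(8, 8)              # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr)

def lr_lambda(step: int) -> float:
    if step < warmup_steps:
        # Cosine ramp from start_lr up to peak_lr.
        t = step / warmup_steps
        lr = start_lr + (peak_lr - start_lr) * (1 - math.cos(math.pi * t)) / 2
    else:
        # Cosine anneal from peak_lr down to final_lr.
        t = min((step - warmup_steps) / anneal_steps, 1.0)
        lr = final_lr + (peak_lr - final_lr) * (1 + math.cos(math.pi * t)) / 2
    return lr / peak_lr  # LambdaLR scales the optimizer's base learning rate

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```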
Results
The panels below show how the model's outputs evolved during the fine-tuning stages. We fixed the random seed to ensure apples-to-apples comparisons.
Pre-trained Model
text2bricks-360p-64f: Stage One (360p / 16 frames)
text2bricks-360p-64f: Stage Two (360p / 64 frames)
text2bricks-360p-32f: Stage Two (360p / 32 frames)
Metrics
Training Metrics
We observed that the loss does not decrease during the fine-tuning process. However, our logged validation results (shown above) show that the model has not collapsed. Instead, the quality of the generated samples gradually improves. This indicates that the model is improving its performance in ways that are not directly reflected by the loss value.
System Metrics
Through the Weights & Biases monitoring panel, we observed very low CPU usage for this fine-tuning task. In contrast, the GPUs were consistently busy processing data. Despite occasional drops due to evaluation and checkpointing, the GPUs ran at full capacity in terms of both compute and memory.
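For context, getting these panels requires nothing beyond initializing a W&B run and logging scalars; GPU utilization and memory curves are collected automatically by the wandb client. A minimal sketch with illustrative metric names:

```python
# Minimal sketch of how training metrics end up in W&B.
# System metrics (GPU utilization, memory) are collected automatically once a
# run is initialized; the metric names and values below are illustrative.
import wandb

run = wandb.init(project="text2bricks", config={"resolution": "360p", "frames": 64})

for step in range(10):                      # stand-in for the training loop
    loss = 1.0 / (step + 1)                 # placeholder value
    wandb.log({"train/loss": loss}, step=step)

run.finish()
```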
This highlights the importance of efficient scaling in training foundational models. It showcases the benefits of Lambda 1-Click Clusters, featuring interconnected NVIDIA H100 Tensor Core GPUs with NVIDIA Quantum-2 400Gb/s InfiniBand networking.
Future Work
Despite the promising results, there are several areas for improvement in the current model:
Temporal Consistency in Longer Sequences: We observe weaker temporal consistency in longer outputs. The ST-DiT-2 architecture applies attention over the spatial and temporal dimensions in separate steps. While this significantly reduces computation cost, it may limit attention to a "local" context window, leading to drift in generated videos. Enhancing the integration of spatial and temporal attention could address this issue.
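To make the factorization concrete, here is a simplified sketch of spatial-then-temporal attention over a latent of shape (batch, frames, tokens per frame, channels); it illustrates the idea rather than reproducing the actual ST-DiT-2 block.

```python
# Simplified illustration of factorized spatio-temporal attention on a latent
# of shape (batch, frames, tokens_per_frame, channels). This is not the actual
# ST-DiT-2 block, just the attention factorization it relies on.
import torch
import torch.nn as nn

class FactorizedAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, S, C = x.shape
        # Spatial step: attend across tokens within each frame.
        xs = x.reshape(B * T, S, C)
        xs, _ = self.spatial_attn(xs, xs, xs)
        x = xs.reshape(B, T, S, C)
        # Temporal step: attend across frames at each spatial location.
        xt = x.permute(0, 2, 1, 3).reshape(B * S, T, C)
        xt, _ = self.temporal_attn(xt, xt, xt)
        x = xt.reshape(B, S, T, C).permute(0, 2, 1, 3)
        return x

# Example: 2 videos, 16 frames, 64 tokens per frame, 128 channels.
x = torch.randn(2, 16, 64, 128)
y = FactorizedAttention(128)(x)
```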
Noise in Condition-Free Generation: Noisy outputs are observed in condition-free generation (cfg=0). This indicates that the model's learning of brick-animation representations can still be improved. Possible solutions include further expanding the dataset and finding ways for the model to learn the representation more efficiently.
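For reference, classifier-free guidance blends the conditional and unconditional predictions, so setting cfg=0 exposes the unconditional branch alone; schematically (not Open-Sora's exact sampling code):

```python
# Schematic of classifier-free guidance (not Open-Sora's exact sampling code).
# With cfg_scale = 0 the sample relies entirely on the unconditional prediction,
# so noise there reveals how well the model has learned the data itself.
def guided_prediction(eps_uncond, eps_cond, cfg_scale: float):
    return eps_uncond + cfg_scale * (eps_cond - eps_uncond)
```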
Resolution and Frame Count: Pushing the output beyond 360p and 64 frames would be an exciting direction for future development. Achieving higher resolution and longer sequences will enhance the utility and applicability of the model.
Dataset: Both the quality and the quantity of our dataset can be improved. Stay tuned for our future releases!