Unconditional Image Generation Using HuggingFace Diffusers
In this article, we explore how to train unconditional image generation models using HuggingFace Diffusers, and how to track these experiments and compare the results using Weights & Biases.
In this article, we will look at a simple and flexible pipeline for training diffusion models for unconditional image generation, primarily using the Diffusers and Accelerate libraries built by HuggingFace. We'll also see how to track these experiments and how to explore and compare their results using Weights & Biases.
Lastly, we'll briefly examine what diffusion models are and how they are different from existing techniques for image generation, such as Generative Adversarial Networks (GANs), Variational AutoEncoders (VAE), Normalizing Flows, etc.
If you'd like to follow along with the code in a Colab, just click the following link, and we've got you covered:
And, if you'd like to take a look at any of our previous work on image generation, you'll find some of that here:
How to Implement Deep Convolutional Generative Adversarial Networks (DCGAN) in Tensorflow
In this short tutorial, we explore how to implement Deep Convolutional Generative Adversarial Networks in Tensorflow, with a Colab to help you follow along.
How to Implement Deep Convolutional Generative Adversarial Networks (DCGANs) in PyTorch
A short tutorial about implementing Deep Convolutional Generative Adversarial Networks in PyTorch, with a Colab to help you follow along.
Disentangling Variational Autoencoders
Princeton University, School of Architecture
NEU 560 - Final Project
December 2020
An Introduction to VAE-GANs
VAE-GAN was introduced to simultaneously learn to encode, generate, and compare dataset samples. In this blog, we explore VAE-GANs and the paper that introduced them: Autoencoding beyond pixels using a learned similarity metric.
Towards Deep Generative Modeling With Weights & Biases
In this article, we'll learn about Autoencoders and Variational Autoencoders and then dive into Generative Adversarial Modeling.
Measuring Mode Collapse in GANs Using Weights & Biases
In this article, we evaluate and quantitatively measure the GAN failure case of mode collapse — when the model fails to generate diverse enough outputs.
Table of Contents:
- What Are Diffusion Models?
- What Is Diffusers?
- What Is Accelerate?
- Implementing a Simple Training Pipeline
- Building the Input Pipeline
- Building the Model
- The Training Pipeline
- The Training Loop
- Results
- Sources & Further Reading
What Are Diffusion Models?
On the most basic level, diffusion models work by adding noise to training data and then denoising that data to recover it. So, think of adding static to a picture of a dog, then removing that static to create a different, but related, dog. Let's get a bit more technical, though:
Diffusion models are a family of latent-variable-based deep generative models, inspired by considerations from nonequilibrium thermodynamics, that can synthesize diverse, high-quality images (and other forms of data as well, such as audio).
These models have actually been around for a while, or at least their underlying principles have been. However, only recently have we been able to harness their potential for training deep generative models that can generate high-quality images while offering desirable properties such as distribution coverage, a stationary training objective, and easy scalability. Researchers from OpenAI have shown that Diffusion Models Beat GANs on image synthesis in terms of the Fréchet Inception Distance (FID) score (you can read more about the FID score in this report).
Before trying to understand how diffusion models are trained to generate images, let's try to understand the concept of diffusion from nonequilibrium thermodynamics in physics:
The thermodynamic equilibrium of a thermodynamic system is the state or condition of the system when its properties do not change with time and that can be changed to another condition only at the expense of effects on other systems.
This is perhaps easiest explained by an example. Let's say that we have a can of air freshener. It has a lot of perfumed aerosol molecules under high pressure. Once you spray it in your room, initially, there is a high concentration of aerosol in the air. But the aerosol eventually spreads across the whole room from the point where it was sprayed to a lower concentration. This evens out its concentration across the room, achieving thermodynamic equilibrium. This process of transitioning into a state of thermodynamic equilibrium is called diffusion.
The family of deep generative models proposed in the paper Denoising Diffusion Probabilistic Models is, in fact, inspired by this thermodynamic phenomenon! A probabilistic diffusion model is basically a Markov chain in which noise is added to the data at each step of the chain. This process is referred to as the forward diffusion process: we sequentially add a certain amount of noise to the data (an input image). At the end of this Markov chain, the input signal is completely destroyed, and we are left with an image that is pure noise.
A Markov Chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.
Now that the input signal has been reduced to noise after a finite number of steps, we can train a neural network to reverse this forward diffusion process. In this reverse diffusion process, the network is applied at each step to take the sample at the current timestep back to the previous, less noisy one. The network can be trained by optimizing (a variational bound on) the negative log-likelihood of the training data.

A graphical representation of the forward and reverse diffusion process
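To make the forward process concrete, here is a minimal sketch in plain PyTorch of the closed-form noising used in the DDPM paper. The shapes and the linear beta schedule below are illustrative placeholders, not the exact settings we train with later in this report:

import torch

# Sketch of the forward diffusion process: noise is added over T steps
# according to a variance schedule beta_1, ..., beta_T.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # a simple linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal-retention factor


def forward_diffuse(x0, t, noise):
    # Jump directly to step t of the Markov chain using the DDPM closed form:
    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise


x0 = torch.randn(4, 3, 64, 64)      # stand-in for a batch of images
noise = torch.randn_like(x0)
t = torch.randint(0, T, (4,))       # a random timestep for each image
xt = forward_diffuse(x0, t, noise)  # approaches pure noise as t -> T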
What Is Diffusers?
Diffusers is a library built by HuggingFace that provides pre-trained diffusion models and serves as a modular toolbox for the training and inference of such models.

More precisely, Diffusers offers:
- State-of-the-art diffusion pipelines that can be run in inference with just a couple of lines of code (see the short example after this list). For more information, check out the official docs for Pipelines.
- Various noise schedulers that can be used interchangeably for the preferred speed vs. quality trade-off in inference. For more information, check out the official docs for Schedulers.
- Multiple types of models, such as UNet, that can be used as building blocks in an end-to-end diffusion system. For more information, check out the official docs for Models.
- Training examples to show how to train the most popular diffusion model tasks. For more information, check out the official docs for Training.
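As a quick illustration of the first point, running a pre-trained pipeline for inference really does come down to a couple of lines. The sketch below assumes the publicly available google/ddpm-cat-256 checkpoint on the HuggingFace Hub:

from diffusers import DDPMPipeline

# Load a pre-trained unconditional DDPM pipeline from the HuggingFace Hub
pipeline = DDPMPipeline.from_pretrained("google/ddpm-cat-256")

# Run the full reverse diffusion process and save the generated image
image = pipeline().images[0]
image.save("generated_cat.png")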
What Is Accelerate?
Accelerate is another library built by HuggingFace. It enables us to run the same PyTorch code across any distributed configuration by adding just four lines of code, making training and inference at scale simple, efficient, and adaptable.
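As a rough sketch of what those few lines look like (with a toy model, optimizer, and dataloader standing in for the real training objects), wrapping an existing PyTorch loop with Accelerate typically amounts to creating an Accelerator, calling prepare, and swapping loss.backward() for accelerator.backward(loss):

import torch
from accelerate import Accelerator

# Toy training objects standing in for a real model, optimizer, and dataloader
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataloader = torch.utils.data.DataLoader(torch.randn(64, 10), batch_size=8)

accelerator = Accelerator()                          # 1. create the Accelerator
model, optimizer, dataloader = accelerator.prepare(  # 2. prepare the training objects
    model, optimizer, dataloader
)

for batch in dataloader:
    optimizer.zero_grad()
    loss = model(batch).pow(2).mean()                # dummy loss for illustration
    accelerator.backward(loss)                       # 3. replace loss.backward()
    optimizer.step()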
Implementing a Simple Training Pipeline
Now that we have an idea of the underlying principles governing our unconditional image generation diffusion model and of the libraries we'll be using to build the pipeline, let's jump straight into the code:
Building the Input Pipeline
We'll be using the HuggingFace Datasets library, which makes it easy to access and share datasets for audio, computer vision, and natural language processing (NLP) tasks.
Using the datasets.load_dataset function, we can easily load our dataset either from a local image directory or from the HuggingFace Hub. For this report, we'll be using the huggan/flowers-102-categories dataset of flower images from the HuggingFace Hub.
We'll apply augmentations and transforms to pre-process the dataset using the transforms function defined below. After that, we wrap it in a standard PyTorch DataLoader using torch.utils.data.DataLoader.
import torch
from datasets import load_dataset
from torchvision.transforms import (
    CenterCrop,
    Compose,
    InterpolationMode,
    Normalize,
    RandomHorizontalFlip,
    Resize,
    ToTensor,
)


def transforms(examples):
    # Pre-processing: resize, crop, flip, convert to tensors, and normalize to [-1, 1]
    augmentations = Compose([
        Resize(config.resolution, interpolation=InterpolationMode.BILINEAR),
        CenterCrop(config.resolution),
        RandomHorizontalFlip(),
        ToTensor(),
        Normalize([0.5], [0.5]),
    ])
    images = [augmentations(image.convert("RGB")) for image in examples["image"]]
    return {"input": images}


def build_dataloader():
    # Load the dataset either from the HuggingFace Hub or from a local image folder
    dataset = (
        load_dataset(
            config.dataset_name,
            config.dataset_config_name,
            cache_dir=config.cache_dir,
            split="train",
        )
        if config.dataset_name is not None
        else load_dataset(
            "imagefolder",
            data_dir=config.train_data_dir,
            cache_dir=config.cache_dir,
            split="train",
        )
    )
    dataset.set_transform(transforms)
    return torch.utils.data.DataLoader(
        dataset,
        batch_size=config.train_batch_size,
        shuffle=True,
        num_workers=config.dataloader_num_workers,
    )
Building the Model
First, let's define the neural network that we'll train to generate images during the aforementioned reverse diffusion process. We'll use a U-Net backbone similar to an unmasked PixelCNN++, like the one used in the paper Denoising Diffusion Probabilistic Models.
from diffusers import UNet2DModel


def build_unet_model():
    # U-Net backbone with self-attention blocks in the middle resolutions
    return UNet2DModel(
        sample_size=config.resolution,
        in_channels=3,
        out_channels=3,
        layers_per_block=2,
        block_out_channels=(128, 128, 256, 256, 512, 512),
        down_block_types=(
            "DownBlock2D",
            "DownBlock2D",
            "DownBlock2D",
            "DownBlock2D",
            "AttnDownBlock2D",
            "DownBlock2D",
        ),
        up_block_types=(
            "UpBlock2D",
            "AttnUpBlock2D",
            "UpBlock2D",
            "UpBlock2D",
            "UpBlock2D",
            "UpBlock2D",
        ),
    )
The Training Pipeline
Step 1: We initialize accelerate.Accelerator, which handles device placement, mixed precision, gradient accumulation, and distributed training for us.
from accelerate import Accelerator

accelerator = Accelerator(
    gradient_accumulation_steps=config.gradient_accumulation_steps,
    mixed_precision=config.mixed_precision,
    log_with="wandb",
    logging_dir=config.logging_dir,
)
Step 2: We now initialize the train data loader and the UNet model.
from diffusers import DDIMScheduler, DDPMScheduler

# Initialize Train Dataloader
train_dataloader = build_dataloader()

# Initialize Model
model = build_unet_model()

# Initialize the Noise Scheduler (discussed in Step 3 below)
noise_scheduler = (
    DDPMScheduler(num_train_timesteps=config.num_train_timesteps)
    if config.diffusion_pipeline == "ddpm"
    else DDIMScheduler(num_train_timesteps=config.num_train_timesteps)
)
Step 3: Defining the Noise Scheduler
The noise scheduler for our diffusion process was initialized alongside the model in the snippet above; let's take a closer look at what it does. The scheduler functions from Diffusers take in the output of a trained model, the sample that the diffusion process is currently iterating on, and a timestep, and return a denoised sample.
Schedulers in the Diffusers library define the methodology for iteratively adding noise to an image or for updating a sample based on model outputs.
- During training, the scheduler defines how noise is added to the images; different noising strategies correspond to different diffusion training algorithms.
- During inference, the scheduler defines how to update a sample based on the output of a pre-trained model.
Schedulers are typically defined by a noise schedule and an update rule for solving the underlying differential equation.
For our experiments, we use either the DDPMScheduler (Denoising Diffusion Probabilistic Models) or the DDIMScheduler (Denoising Diffusion Implicit Models), selected via config.diffusion_pipeline in the snippet above.
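The training-time role of the scheduler (adding noise) shows up in the training loop later in this report. The inference-time role, updating a sample step by step, looks roughly like the following sketch; the model, sample shapes, and number of inference steps here are placeholders rather than the exact code run by the pipelines:

import torch
from diffusers import DDPMScheduler, UNet2DModel

# Placeholder model and scheduler, purely to illustrate the update rule
model = UNet2DModel(sample_size=64, in_channels=3, out_channels=3)
noise_scheduler = DDPMScheduler(num_train_timesteps=1000)

# Start from pure Gaussian noise and iteratively denoise it
sample = torch.randn(1, 3, 64, 64)
noise_scheduler.set_timesteps(50)  # number of denoising steps at inference time

for t in noise_scheduler.timesteps:
    with torch.no_grad():
        noise_pred = model(sample, t).sample  # the model predicts the noise residual
    # The scheduler turns that prediction into the sample at the previous timestep
    sample = noise_scheduler.step(noise_pred, t, sample).prev_sample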
Step 4: We define the AdamW optimizer and use the diffusers.optimization.get_scheduler utility from Diffusers to create our learning rate scheduler.
from diffusers.optimization import get_scheduler

# Initialize AdamW optimizer
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=config.learning_rate,
    betas=(config.adam_beta1, config.adam_beta2),
    weight_decay=config.adam_weight_decay,
    eps=config.adam_epsilon,
)

# Initialize Learning Rate Scheduler
lr_scheduler = get_scheduler(
    config.lr_scheduler,
    optimizer=optimizer,
    num_warmup_steps=config.lr_warmup_steps,
    num_training_steps=(len(train_dataloader) * config.num_epochs)
    // config.gradient_accumulation_steps,
)
Step 5: Now we prepare all the objects used for distributed, mixed-precision training with the accelerator.prepare method. This method not only prepares the dataloader, model, optimizer, and learning rate scheduler for mixed precision but also handles device placement automatically.
model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    model, optimizer, train_dataloader, lr_scheduler
)
Step 6: Exponential Moving Average
When training a model, it is often beneficial to maintain moving averages of the trained parameters. Evaluations that use averaged parameters sometimes produce significantly better results than the final trained values. We use the EMAModel utility from Diffusers to maintain an exponential moving average of the model weights.
from diffusers.training_utils import EMAModel

ema_model = EMAModel(
    model,
    inv_gamma=config.ema_inv_gamma,
    power=config.ema_power,
    max_value=config.ema_max_decay,
)
Step 7: Initializing Experiment Tracker
Now, we initialize the Weights & Biases tracker for Accelerate using the accelerator.init_trackers function, which starts a run for the specified tracker. We also create a Weights & Biases Table to log the generated images epoch by epoch, which we can use to compare and summarize the performance of our model at the end of training.
import wandb

if accelerator.is_main_process:
    accelerator.init_trackers(
        project_name=config.wandb_project,
        init_kwargs={
            "wandb": {
                "entity": config.wandb_entity,
                "config": config.to_dict(),
            }
        },
    )
    wandb_table = wandb.Table(columns=["Epoch", "Step", "Generated-Images"])
The Training Loop
Now, let's dissect the code that goes inside our training loop:
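The full loop lives in the accompanying Colab; as a condensed sketch (based on the official Diffusers unconditional training example that this report follows, with variable names matching the snippets above), one epoch looks roughly like this:

import torch.nn.functional as F

global_step = 0
for epoch in range(config.num_epochs):
    model.train()
    for step, batch in enumerate(train_dataloader):
        clean_images = batch["input"]

        # Forward diffusion: sample noise and random timesteps, then noise the images
        noise = torch.randn(clean_images.shape, device=clean_images.device)
        timesteps = torch.randint(
            0,
            noise_scheduler.config.num_train_timesteps,
            (clean_images.shape[0],),
            device=clean_images.device,
        ).long()
        noisy_images = noise_scheduler.add_noise(clean_images, noise, timesteps)

        with accelerator.accumulate(model):
            # Reverse diffusion training step: predict the noise residual and
            # regress it against the true noise with a simple MSE loss
            noise_pred = model(noisy_images, timesteps).sample
            loss = F.mse_loss(noise_pred, noise)
            accelerator.backward(loss)
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()

        if accelerator.sync_gradients:
            ema_model.step(model)  # update the exponential moving average of the weights
            global_step += 1

        # Log training metrics to Weights & Biases through the Accelerate tracker
        accelerator.log(
            {"loss": loss.detach().item(), "lr": lr_scheduler.get_last_lr()[0]},
            step=global_step,
        )

At the end of each epoch, we also generate a batch of sample images (using a DDPMPipeline or DDIMPipeline built from the EMA weights) and log them to the wandb_table we created earlier, so that image quality can be compared epoch by epoch.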
Results
Let us now take a look at our training metrics and some cool generated images!
Sources & Further Reading
- The code used in this report has mostly been derived from the official Diffusers example for unconditional image generation.
- The ideas discussed in this report are borrowed mostly from the papers referenced above, in particular Denoising Diffusion Probabilistic Models and Denoising Diffusion Implicit Models.
- To get a better understanding of the underlying theories for the diffusion model, you can check out the following sources:
- To check out implementations of diffusion models from scratch, you can check out the following sources:
How To Train a Conditional Diffusion Model From Scratch
In this article, we look at how to train a conditional diffusion model and find out what you can learn by doing so, using W&B to log and track our experiments.
Making My Kid a Jedi Master With Stable Diffusion and Dreambooth
In this article, we'll explore how to teach and fine-tune Stable Diffusion to transform my son into his favorite Star Wars character using Dreambooth.
Improving Generative Images with Instructions: Prompt-to-Prompt Image Editing with Cross Attention Control
A primer on text-driven image editing for large-scale text-based image synthesis models like Stable Diffusion & Imagen
A Technical Guide to Diffusion Models for Audio Generation
Diffusion models are jumping from images to audio. Here's a look at their history, their architecture, and how they're being applied to this new domain.
A Gentle Introduction to Dance Diffusion
Diffusion models are everywhere for images, but have yet to gain real traction in audio generation. That's changing thanks to Harmonai.
Running Stable Diffusion on an Apple M1 Mac With HuggingFace Diffusers
In this article, we look at running Stable Diffusion on an M1 Mac with HuggingFace diffusers, highlighting the advantages — and the things to watch out for.
Stable Diffusion Settings and Storing Your Images
In this article, we explore the impact of different settings used for the Stable Diffusion model and how you can store your generated images for quick reference.