
Unconditional Image Generation Using HuggingFace Diffusers

In this article, we explore how to train unconditional image generation models using HuggingFace Diffusers, and how to track these experiments and compare their results using Weights & Biases.
In this article, we will look at a simple and flexible pipeline for training diffusion models for unconditional image generation, built primarily on the Diffusers and Accelerate libraries from HuggingFace. We'll also see how to track these experiments and how to explore and compare their results using Weights & Biases.
Lastly, we'll briefly examine what diffusion models are and how they are different from existing techniques for image generation, such as Generative Adversarial Networks (GANs), Variational AutoEncoders (VAE), Normalizing Flows, etc.
If you'd like to follow along with the code in a Colab, just click the following link, and we've got you covered:



And, if you'd like to take a look at any of our previous work on image generation, you'll find some of that here:







What Are Diffusion Models?

On the most basic level, diffusion models work by adding noise to training data and then denoising that data to recover it. So, think of adding static to a picture of a dog, then removing that static to create a different (but related) dog. Let's get a bit more technical, though:
Diffusion models are a family of latent-variable-based deep generative models, inspired by considerations from nonequilibrium thermodynamics, that can synthesize diverse, high-quality images (as well as other forms of data, such as audio).
These models have actually been around for a while, or at least their underlying principles have been. However, only recently have we been able to harness their potential for training deep generative models that can generate high-quality images while offering desirable properties such as distribution coverage, a stationary training objective, and easy scalability. Researchers from OpenAI have shown that Diffusion Models Beat GANs on image synthesis in terms of the Fréchet Inception Distance (FID) score (you can read more about the FID score in this report).
Before trying to understand how diffusion models are trained to generate images, let's try to understand the concept of diffusion from nonequilibrium thermodynamics in physics:
The thermodynamic equilibrium of a thermodynamic system is the state or condition of the system when its properties do not change with time and that can be changed to another condition only at the expense of effects on other systems.
This is perhaps easiest explained by an example. Let's say we have a can of air freshener: it holds a lot of perfumed aerosol molecules under high pressure. Once you spray it in your room, there is initially a high concentration of aerosol near the point where it was sprayed. But the aerosol eventually spreads out, evening out its concentration across the whole room and achieving thermodynamic equilibrium. This process of transitioning into a state of thermodynamic equilibrium is called diffusion.
The family of deep generative models proposed in the paper Denoising Diffusion Probabilistic Models is, in fact, inspired by this thermodynamic phenomenon! A probabilistic diffusion model is basically a Markov Chain where noise is added to the data at each step of the chain. This is referred to as the forward diffusion process, in which we sequentially add a certain amount of noise to the data (an input image). At the end of this Markov Chain, the input signal is completely destroyed, and we are left with an image that is pure noise.
A Markov Chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.
Now that the input signal has been reduced to noise after a finite number of steps, we can train a neural network to reverse this forward diffusion process. In this reverse diffusion process, the network is applied at each step to take the image from the current timestep back to the previous, less noisy one. The neural network can be trained by optimizing a variational bound on the negative log-likelihood of the training data.
A graphical representation of the forward and reverse diffusion process
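For readers who prefer equations, here is a compact restatement in the notation of the DDPM paper (all symbols come from that paper; this is just a summary of the processes described above): the forward process adds Gaussian noise according to a variance schedule, the reverse process is parameterized by the neural network, and training minimizes a simple noise-prediction objective.

q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\right)

p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)

L_{\text{simple}} = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[\left\| \epsilon - \epsilon_\theta\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\ t\right) \right\|^2\right]

where \bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s) and \epsilon_\theta is the noise-prediction network (the UNet we train below).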



What Is Diffusers?

Diffusers is a library built by HuggingFace that provides pre-trained diffusion models and serves as a modular toolbox for the training and inference of such models.

More precisely, Diffusers offers:
  • State-of-the-art diffusion pipelines that can run inference with just a couple of lines of code (see the short example after this list). For more information, check out the official docs for Pipelines.
  • Various noise schedulers that can be used interchangeably for the preferred speed vs. quality trade-off in inference. For more information, check out the official docs for Schedulers.
  • Multiple types of models, such as UNet, that can be used as building blocks in an end-to-end diffusion system. For more information, check out the official docs for Models.
  • Training examples that show how to train models for the most popular diffusion tasks. For more information, check out the official docs for Training.
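As a quick illustration of the first point above, here is a minimal inference sketch (google/ddpm-cat-256 is just one of several pre-trained unconditional checkpoints on the HuggingFace Hub; any compatible pipeline id would work):

from diffusers import DDPMPipeline

# Load a pre-trained unconditional DDPM pipeline from the HuggingFace Hub
pipeline = DDPMPipeline.from_pretrained("google/ddpm-cat-256")

# Run the full reverse diffusion process and save the generated PIL image
image = pipeline().images[0]
image.save("generated_cat.png")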

What Is Accelerate?

Accelerate is another library built by HuggingFace. It lets us run the same PyTorch code across any distributed configuration by adding just four lines of code, making training and inference at scale simple, efficient, and adaptable.
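Here is a minimal sketch of that pattern, using a dummy linear model and random data purely to show where the Accelerate calls go (the model, dataset, and loss below are placeholders, not part of our diffusion pipeline):

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Dummy model, optimizer, and data, standing in for your own PyTorch objects
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
dataloader = DataLoader(TensorDataset(torch.randn(64, 8), torch.randn(64, 1)), batch_size=8)

accelerator = Accelerator()

# Wrap the existing objects; Accelerate handles device placement,
# distributed wrappers, and mixed precision behind the scenes
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, targets in dataloader:
    optimizer.zero_grad()
    loss = F.mse_loss(model(inputs), targets)
    # Use accelerator.backward instead of loss.backward
    accelerator.backward(loss)
    optimizer.step()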

Implementing a Simple Training Pipeline

Now that we have an idea of the underlying principles governing our unconditional image generation diffusion model and the libraries we'll be using to build the pipeline, let's jump straight into the code.

Building the Input Pipeline

We'll be using the HuggingFace Datasets library, which makes it easy to access and share datasets for Audio, Computer Vision, and Natural Language Processing (NLP) tasks.
Using the datasets.load_dataset function, we can easily load our dataset either from a local image directory or from the HuggingFace Hub. For this report, we'll be using the huggan/flowers-102-categories dataset of flower images from the HuggingFace Hub.
We apply augmentations and transforms to pre-process the dataset using the transforms function below, and then wrap it in a standard PyTorch dataloader using torch.utils.data.DataLoader.
import torch
from datasets import load_dataset
from torchvision.transforms import (
    CenterCrop, Compose, InterpolationMode, Normalize,
    RandomHorizontalFlip, Resize, ToTensor,
)


def transforms(examples):
    augmentations = Compose(
        [
            Resize(config.resolution, interpolation=InterpolationMode.BILINEAR),
            CenterCrop(config.resolution),
            RandomHorizontalFlip(),
            ToTensor(),
            Normalize([0.5], [0.5]),
        ]
    )
    # Pre-processing: convert each image to RGB, augment, and normalize
    images = [augmentations(image.convert("RGB")) for image in examples["image"]]
    return {"input": images}


def build_dataloader():
    # Load the dataset either from the HuggingFace Hub or from a local image directory
    dataset = (
        load_dataset(
            config.dataset_name, config.dataset_config_name,
            cache_dir=config.cache_dir, split="train",
        )
        if config.dataset_name is not None else
        load_dataset(
            "imagefolder", data_dir=config.train_data_dir,
            cache_dir=config.cache_dir, split="train",
        )
    )

    # Apply the transforms lazily and wrap the dataset in a PyTorch dataloader
    dataset.set_transform(transforms)
    return torch.utils.data.DataLoader(
        dataset, batch_size=config.train_batch_size,
        shuffle=True,
        num_workers=config.dataloader_num_workers,
    )


Building the Model

First, let's define the neural network that we'll train to generate images during the aforementioned reverse diffusion process. We'll use a U-Net backbone similar to an unmasked PixelCNN++, like the one used in the paper Denoising Diffusion Probabilistic Models.
We create the UNet model using diffusers.UNet2DModel.



from diffusers import UNet2DModel


def build_unet_model():
    # A UNet that takes a noisy 3-channel image plus a timestep and predicts the noise
    return UNet2DModel(
        sample_size=config.resolution,
        in_channels=3,
        out_channels=3,
        layers_per_block=2,
        block_out_channels=(128, 128, 256, 256, 512, 512),
        down_block_types=(
            "DownBlock2D", "DownBlock2D", "DownBlock2D", "DownBlock2D",
            "AttnDownBlock2D", "DownBlock2D",
        ),
        up_block_types=(
            "UpBlock2D", "AttnUpBlock2D",
            "UpBlock2D", "UpBlock2D", "UpBlock2D", "UpBlock2D",
        ),
    )
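To make the model's interface concrete, here is a small, hypothetical forward call, assuming the config object used throughout this report is in scope: the UNet takes a batch of noisy images and their timesteps and returns its noise prediction via the .sample attribute of the output.

import torch

model = build_unet_model()

# Dummy noisy images and integer timesteps, purely to illustrate the call signature
noisy_images = torch.randn(4, 3, config.resolution, config.resolution)
timesteps = torch.randint(0, config.num_train_timesteps, (4,)).long()

# The model predicts the noise that was added at these timesteps
noise_pred = model(noisy_images, timesteps).sample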


The Training Pipeline


Step 1: We initialize accelerate.Accelerator, which is the main workhorse behind all of the distributed training options!
accelerator = Accelerator(
    gradient_accumulation_steps=config.gradient_accumulation_steps,
    mixed_precision=config.mixed_precision,
    log_with="wandb",
    logging_dir=config.logging_dir,
)
Step 2: We now initialize the train dataloader, the UNet model, and the noise scheduler (discussed in Step 3).
# Initialize Train Dataloader
train_dataloader = build_dataloader()

# Initialize Model
model = build_unet_model()

# Initialize Noise Scheduler (see Step 3)
noise_scheduler = DDPMScheduler(
    num_train_timesteps=config.num_train_timesteps
) if config.diffusion_pipeline == "ddpm" else DDIMScheduler(
    num_train_timesteps=config.num_train_timesteps
)
Step 3: Defining the Noise Scheduler
We now define the noise scheduler for our diffusion process. A scheduler in Diffusers takes in the output of a trained model, the sample that the diffusion process is currently iterating on, and a timestep, and returns a denoised sample.


Schedulers in the Diffusers library define the methodology for iteratively adding noise to an image or for updating a sample based on model outputs:
  • during training, the scheduler defines how noise is added to the images, i.e., the algorithmic process used to train the diffusion model.
  • during inference, the scheduler defines how to update a sample based on the output of a pre-trained model.
Schedulers are typically defined by a noise schedule and an update rule for solving the underlying differential equation. For our experiments, we use either the DDPMScheduler (Denoising Diffusion Probabilistic Models) or the DDIMScheduler (Denoising Diffusion Implicit Models).
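Here is a rough, self-contained sketch of those two roles (the tensors below are placeholders, and the exact return types can vary slightly across Diffusers versions):

import torch
from diffusers import DDPMScheduler

noise_scheduler = DDPMScheduler(num_train_timesteps=1000)

# Training role: add noise to clean images at randomly sampled timesteps
clean_images = torch.randn(4, 3, 64, 64)
noise = torch.randn_like(clean_images)
timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps, (4,)).long()
noisy_images = noise_scheduler.add_noise(clean_images, noise, timesteps)

# Inference role: given the model's output at timestep t, compute the previous (less noisy) sample
t = int(timesteps[0])
model_output = torch.randn_like(noisy_images[:1])  # stand-in for model(noisy_images[:1], t).sample
previous_sample = noise_scheduler.step(model_output, t, noisy_images[:1]).prev_sample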

Step 4: We define the AdamW optimizer and use the diffusers.optimization.get_scheduler utility from Diffusers to create our learning rate scheduler.
# Initialize AdamW optimizer
optimizer = torch.optim.AdamW(
    model.parameters(), lr=config.learning_rate,
    betas=(config.adam_beta1, config.adam_beta2),
    weight_decay=config.adam_weight_decay,
    eps=config.adam_epsilon,
)

# Initialize Learning Rate Scheduler
lr_scheduler = get_scheduler(
    config.lr_scheduler, optimizer=optimizer,
    num_warmup_steps=config.lr_warmup_steps,
    num_training_steps=(
        len(train_dataloader) * config.num_epochs
    ) // config.gradient_accumulation_steps,
)

Step 5: We now prepare all the objects used for distributed, mixed-precision training with the accelerator.prepare method. This method not only wraps the dataloader, model, optimizer, and LR scheduler for mixed precision but also handles device placement automatically.
model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    model, optimizer, train_dataloader, lr_scheduler
)

Step 6: Exponential Moving Average
When training a model, it is often beneficial to maintain moving averages of the trained parameters. Evaluations that use averaged parameters sometimes produce significantly better results than the final trained values.
We use diffusers.training_utils.EMAModel to maintain an exponential moving average of our model's weights.
ema_model = EMAModel(
    model,
    inv_gamma=config.ema_inv_gamma,
    power=config.ema_power,
    max_value=config.ema_max_decay,
)
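During training, the EMA weights are then updated once per optimization step. In the version of Diffusers this pipeline targets, that update is a call along the lines of ema_model.step(model) placed right after optimizer.step(); newer releases take the model's parameters instead, so check the API of the version you have installed.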

Step 7: Initializing Experiment Tracker
Now, we initialize the Weights & Biases tracker for Accelerate using accelerator.init_trackers, which starts a run for the specified tracker. We also initialize a Weights & Biases Table to log the generated images epoch by epoch, which lets us compare and summarize the performance of our model at the end of training.

if accelerator.is_main_process:
    accelerator.init_trackers(
        project_name=config.wandb_project,
        init_kwargs={
            "wandb": {
                "entity": config.wandb_entity,
                "config": config.to_dict(),
            }
        }
    )
    wandb_table = wandb.Table(
        columns=["Epoch", "Step", "Generated-Images"]
    )


The Training Loop

Now, let's dissect the code that goes inside our training loop:
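The full loop lives in the accompanying Colab; the condensed sketch below outlines its essential steps, assuming the objects from the previous steps (model, noise_scheduler, optimizer, lr_scheduler, ema_model, accelerator, train_dataloader, and config) are in scope. Treat it as an outline rather than the exact notebook code:

import torch
import torch.nn.functional as F

global_step = 0
for epoch in range(config.num_epochs):
    model.train()
    for batch in train_dataloader:
        clean_images = batch["input"]

        # Sample Gaussian noise and a random timestep for every image in the batch
        noise = torch.randn_like(clean_images)
        timesteps = torch.randint(
            0, noise_scheduler.config.num_train_timesteps,
            (clean_images.shape[0],), device=clean_images.device,
        ).long()

        # Forward diffusion: add noise to the clean images at the sampled timesteps
        noisy_images = noise_scheduler.add_noise(clean_images, noise, timesteps)

        with accelerator.accumulate(model):
            # The UNet predicts the noise that was added; minimize the MSE against it
            noise_pred = model(noisy_images, timesteps).sample
            loss = F.mse_loss(noise_pred, noise)
            accelerator.backward(loss)
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()

        # Update the EMA copy of the weights (older EMAModel API, as in Step 6)
        ema_model.step(model)

        # Log training metrics to Weights & Biases through Accelerate
        accelerator.log(
            {"loss": loss.detach().item(), "lr": lr_scheduler.get_last_lr()[0]},
            step=global_step,
        )
        global_step += 1

    # At the end of each epoch, you would typically sample a few images with a
    # DDPMPipeline or DDIMPipeline and add them to wandb_table for comparison.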





Results

Let us now take a look at our training metrics and some cool generated images!





Sources & Further Reading

