
A Gentle Introduction to Diffusion

Diffusion models power some of the world's most advanced image generation systems. In this article, we dive into the theory and code behind how they work, and train one of our own.
In this article, we'll be digging into how exactly diffusion models work. The goal is to focus more on the overall theory of how these models function—as well as some of the code that implements them—with less of the mathematical derivations. We'll also build a simple diffusion model and train it on the Stanford Cars dataset.


What is a Diffusion Model?

Diffusion models work by first introducing noise into structured data, like images or audio, and then learning to remove that noise incrementally. By learning to reverse the noise addition, a process known as denoising, the model can produce coherent and realistic outputs.
Their ability to generate high-quality results has made them prominent in tasks like image synthesis. Stable Diffusion and DALL-E 2 exemplify this technique, harnessing it to create diverse and sophisticated images.

The Two Stages of Diffusion Models

Diffusion models work in two stages: a "forward" process and a "reverse" process. In the forward process, Gaussian noise is added to the image until it is essentially pure noise (think of a static-filled screen on an old TV). The reverse process undoes this noising, using a neural network to approximate the noise that was added in the forward process. If the network can predict the added noise, that noise can essentially be subtracted from the noised image, and a new image is generated.

The Forward Process  

We must first add noise to the data over several steps. Technically, adding noise is an incremental process, but thanks to a few convenient mathematical properties, we can compute the noised image at any step in closed form, without generating the noise sequentially (meaning we can jump directly to the noise level of any step instead of adding noise over several steps one at a time).
The ability to calculate the noise in closed form makes the forward process much more efficient. As can be seen below, the image gets slightly noisier with every time step until it is pure noise. The function q denotes the forward process: as t grows larger, so does the amount of noise added to the image. The function p is the neural network that approximates the noise added at a particular time step, which allows us to remove that noise and generate a new image.


Before going too much further, a crucial detail to discuss is something called the variance schedule. The variance schedule defines how much noise is added at each step of the forward process. The schedule is typically a predetermined sequence of variances that dictate the noise level added to the image at each time step. This schedule is crucial because it directly influences the effectiveness of the noise prediction by the neural network during the reverse process. A well-tuned variance schedule ensures that the noise addition is gradual and consistent, allowing the neural network to learn more effectively how to reverse this process. 
In practical terms, the variance schedule strikes a balance: adding too much noise too quickly makes it difficult for the network to learn the reverse process, while adding too little means the final time step never gets close to pure noise.
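To make this concrete, here is a minimal sketch of a linear variance schedule and the quantities derived from it. The tensor names (betas, alphas_cumprod, and so on) and the specific values (T = 300, the 0.0001 to 0.02 range) are my own illustrative choices, not taken from the tutorial repo:
import torch

T = 300  # total number of diffusion steps (assumed value for illustration)

def linear_beta_schedule(timesteps, start=0.0001, end=0.02):
    # Variance added at each step, increasing linearly from start to end
    return torch.linspace(start, end, timesteps)

betas = linear_beta_schedule(T)
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)

# These two tensors are all we need for the closed-form forward process:
# x_t = sqrt(alphas_cumprod[t]) * x_0 + sqrt(1 - alphas_cumprod[t]) * noise
sqrt_alphas_cumprod = torch.sqrt(alphas_cumprod)
sqrt_one_minus_alphas_cumprod = torch.sqrt(1.0 - alphas_cumprod)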

The Reverse Process

After we've added noise to the image, we're ready to reverse the process and predict that noise. I find the math derivations involved in the reverse process quite complex compared to the actual code implementation, so I will avoid going into too many details about the derivation (though I'll link several resources below for the interested reader/mathematician). Traditionally, diffusion models are optimized by maximizing a "variational lower bound." For our implementation, the loss simplifies to a distance between the predicted and the actual noise (we'll use an L1 loss in the code below).
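For intuition only, here is a minimal sketch of what a single denoising step looks like at inference time, following the sampling rule from the DDPM paper. It reuses the schedule tensors from the sketch above (betas, alphas, alphas_cumprod) and assumes a noise-predicting network like the one we define in the next section; the actual sampling code used for this tutorial lives in the repo linked below.
@torch.no_grad()
def denoise_step(model, x_t, t_index):
    # Look up the schedule values for this time step
    beta_t = betas[t_index]
    alpha_t = alphas[t_index]
    alpha_bar_t = alphas_cumprod[t_index]

    # Predict the noise present in x_t with the network
    t = torch.full((x_t.shape[0],), t_index, device=x_t.device, dtype=torch.long)
    predicted_noise = model(x_t, t)

    # Estimate the slightly less noisy previous image
    mean = (x_t - beta_t / torch.sqrt(1.0 - alpha_bar_t) * predicted_noise) / torch.sqrt(alpha_t)

    if t_index == 0:
        return mean
    # For all but the final step, add back a small amount of fresh noise
    return mean + torch.sqrt(beta_t) * torch.randn_like(x_t)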

The Model 

For our diffusion model, we will use a U-Net. The U-Net is a type of convolutional neural network that was originally designed for biomedical image segmentation. This design allows the U-Net to capture both high-level and low-level details through a series of downsampling convolutional layers followed by upsampling layers, with skip connections between the corresponding down and up layers.
In diffusion models, U-Net is used due to its ability to operate at multiple resolutions or scales. This is particularly useful in diffusion processes because they require the model to iteratively refine the generated data, adding details in each step. The multi-scale representation learned by the U-Net is adept at handling the various levels of details needed as the data transitions from a noisy state to a coherent structure.
Below is a visualization for the U-net, which involves downsampling the image with convolution and scaling the features back up to the original size. Note that this also involves concatenating the intermediate downscaled features with the features in the upscaling portion of the model. Something interesting about this architecture (and other autoencoder architectures I’ve had success with) is that there are no fully connected layers (outside of time step embeddings, which we will cover later).

An example of a U-Net

The Code

Now, let's implement a simple version of this model. This won't generate samples as impressive as models like Stable Diffusion. However, its simplistic nature will allow for a concise explanation of the inner workings of the model. I want to give a lot of credit to DeepFindr on YouTube for their implementation, as it was extremely helpful for this tutorial.
In order to implement this architecture, we can break down each downscale and upscale block of the U-Net into a single torch module that can be instantiated dynamically with different sizes, which will make our implementation of the U-Net much cleaner.
import math
import torch
from torch import nn
import torch.nn.functional as F

class Block(nn.Module):
    def __init__(self, in_ch, out_ch, time_emb_dim, up=False):
        super().__init__()
        self.time_mlp = nn.Linear(time_emb_dim, out_ch)
        if up:
            # Upsampling blocks receive the skip connection concatenated
            # along the channel dimension, hence 2*in_ch input channels
            self.conv1 = nn.Conv2d(2*in_ch, out_ch, 3, padding=1)
            self.transform = nn.ConvTranspose2d(out_ch, out_ch, 4, 2, 1)
        else:
            self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
            self.transform = nn.Conv2d(out_ch, out_ch, 4, 2, 1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.bnorm1 = nn.BatchNorm2d(out_ch)
        self.bnorm2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU()

    def forward(self, x, t):
        # First conv
        h = self.bnorm1(self.relu(self.conv1(x)))
        # Time embedding
        time_emb = self.relu(self.time_mlp(t))
        # Extend the last 2 dimensions so the embedding broadcasts over H and W
        time_emb = time_emb[(..., ) + (None, ) * 2]
        # Add the time embedding channel-wise
        h = h + time_emb
        # Second conv
        h = self.bnorm2(self.relu(self.conv2(h)))
        # Down- or upsample
        return self.transform(h)
A small detail we haven't mentioned yet is the time step embedding. Just as the attention mechanism has no inherent way of knowing the position of a token within a sequence, our U-Net has no inherent knowledge of which noise step it is denoising, and this is critical information for accurately predicting the noise at a given time step.
To solve this, we create a positional embedding that produces a unique vector for each step of the diffusion process. Here's the code:
class SinusoidalPositionEmbeddings(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def forward(self, time):
        device = time.device
        half_dim = self.dim // 2
        embeddings = math.log(10000) / (half_dim - 1)
        embeddings = torch.exp(torch.arange(half_dim, device=device) * -embeddings)
        embeddings = time[:, None] * embeddings[None, :]
        embeddings = torch.cat((embeddings.sin(), embeddings.cos()), dim=-1)
        return embeddings
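As a quick illustration (not part of the original tutorial), passing a batch of integer time steps through this module yields one embedding vector per step:
# Hypothetical usage: two time steps -> two 32-dimensional embedding vectors
pos_emb = SinusoidalPositionEmbeddings(32)
t = torch.tensor([5, 100])
print(pos_emb(t).shape)  # torch.Size([2, 32])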
Now that we've created the U-net blocks, as well as our time step embedding module, we are ready to define our U-net model, which will combine the previous modules.
class Unet(nn.Module):
    def __init__(self):
        super().__init__()
        image_channels = 3
        down_channels = (64, 128, 512, 1023, 1600)
        up_channels = (1600, 1023, 512, 128, 64)
        out_dim = 3
        time_emb_dim = 32

        # Time embedding
        self.time_mlp = nn.Sequential(
            SinusoidalPositionEmbeddings(time_emb_dim),
            nn.Linear(time_emb_dim, time_emb_dim),
            nn.ReLU()
        )

        # Initial projection
        self.conv0 = nn.Conv2d(image_channels, down_channels[0], 3, padding=1)

        # Downsample
        self.downs = nn.ModuleList([
            Block(down_channels[i], down_channels[i+1], time_emb_dim)
            for i in range(len(down_channels)-1)
        ])
        # Upsample
        self.ups = nn.ModuleList([
            Block(up_channels[i], up_channels[i+1], time_emb_dim, up=True)
            for i in range(len(up_channels)-1)
        ])
        self.output = nn.Conv2d(up_channels[-1], out_dim, 1)

    def forward(self, x, timestep):
        # Embed the time step
        t = self.time_mlp(timestep)
        # Initial conv
        x = self.conv0(x)
        # U-Net: downsample, storing activations for the skip connections
        residual_inputs = []
        for down in self.downs:
            x = down(x, t)
            residual_inputs.append(x)
        for up in self.ups:
            residual_x = residual_inputs.pop()
            # Concatenate the skip connection as additional channels
            x = torch.cat((x, residual_x), dim=1)
            x = up(x, t)
        return self.output(x)
Now that we have a model, we need a loss function. We can use a simple L1 loss between the 'actual' noise and the predicted noise!
def get_loss(model, x_0, t):
    x_noisy, noise = forward_diffusion_sample(x_0, t, device)
    noise_pred = model(x_noisy, t)
    return F.l1_loss(noise, noise_pred)
The get_loss function in a diffusion model is responsible for training the model to accurately predict the noise added to an image at various time steps. It starts by creating a noisy version of the input image and recording the exact noise added. This is done through the forward_diffusion_sample function, which applies the diffusion process (using a closed-form equation) to the original image x_0 based on a specific time step.
This function calculates the mean absolute error between the actual noise and the predicted noise, providing a quantitative measure of the model's performance in replicating the diffusion process. The goal during training is to minimize this loss, thereby improving the model's ability to reverse the diffusion process.
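The repo implements forward_diffusion_sample itself, and we don't reproduce it here, but as a rough sketch of the closed-form noising it performs (my own illustration, reusing the schedule tensors from the variance-schedule sketch above and matching the signature used in get_loss), it might look like this:
def forward_diffusion_sample(x_0, t, device):
    # Sample the noise that will be added to the image
    noise = torch.randn_like(x_0)

    # Look up the closed-form coefficients for each time step in the batch,
    # reshaped so they broadcast over the image dimensions (B, 1, 1, 1)
    sqrt_ac = sqrt_alphas_cumprod[t.cpu()].view(-1, 1, 1, 1).to(device)
    sqrt_one_minus_ac = sqrt_one_minus_alphas_cumprod[t.cpu()].view(-1, 1, 1, 1).to(device)

    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
    x_noisy = sqrt_ac * x_0.to(device) + sqrt_one_minus_ac * noise.to(device)
    return x_noisy, noise.to(device)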

The Data

For our dataset, we will use the Stanford Cars dataset, which consists of over 16,000 images of cars spanning many different models. You'll need to download the dataset from this Kaggle link and then extract it. We'll use the following PyTorch Dataset class to load it:
import os
from PIL import Image

class StanfordCars(torch.utils.data.Dataset):
    def __init__(self, root_path, split='train', transform=None):
        self.root_path = root_path
        self.transform = transform
        self.split = split
        self._load_images()

    def _load_images(self):
        split_path = os.path.join(self.root_path, f'cars_{self.split}/cars_{self.split}')
        if not os.path.exists(split_path):
            raise ValueError(f"Path not found: {split_path}")

        self.images = [os.path.join(split_path, file)
                       for file in os.listdir(split_path)
                       if file.endswith(('.jpg', '.jpeg', '.png'))]

    def __len__(self):
        return len(self.images)

    def __getitem__(self, index):
        image_file = self.images[index]
        image = Image.open(image_file).convert("RGB")
        if self.transform:
            image = self.transform(image)
        return image
I'll go ahead and apply some preprocessing, which is shown below:
from torchvision import transforms

def load_transformed_dataset():
    data_transforms = [
        transforms.Resize((IMG_SIZE, IMG_SIZE)),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),                    # Scales data into [0, 1]
        transforms.Lambda(lambda t: (t * 2) - 1)  # Scale into [-1, 1]
    ]
    data_transform = transforms.Compose(data_transforms)
    train = StanfordCars(root_path="YOUR_PATH_TO_DS", transform=data_transform)
    test = StanfordCars(root_path="YOUR_PATH_TO_DS", transform=data_transform, split='test')
    return torch.utils.data.ConcatDataset([train, test])
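The training loop below iterates over a DataLoader called d_loader, which isn't shown elsewhere in the snippets. A minimal way to create it might look like the following; the IMG_SIZE and BATCH_SIZE values are placeholders I've chosen for illustration:
IMG_SIZE = 64
BATCH_SIZE = 128

data = load_transformed_dataset()
# drop_last=True keeps every batch at exactly BATCH_SIZE,
# matching the fixed-size time step sampling in the training loop
d_loader = torch.utils.data.DataLoader(
    data, batch_size=BATCH_SIZE, shuffle=True, drop_last=True
)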


Training

For the sake of brevity, I will skip over the exact implementations of the functions that generate noised samples for the forward process used in training, as well as the full sampling functionality for inference (the sketches above only give the rough intuition), since both involve somewhat advanced math derivations to really make sense of. If you are interested in the details, check out the repo for the tutorial as well as the papers linked below.
Now it's time to write our training loop, which will sample batches of images, add noise corresponding to a random time step, and train the model to predict the noise that was added. We will also use W&B logging to log sample images at each save step, as well as the average and cumulative loss for each epoch.
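The loop relies on a few objects that haven't been defined in the snippets so far: the model, the optimizer, and some constants. A minimal setup might look like this, where the specific values (T, the learning rate, the number of epochs, and the models_dir path) are placeholders I've picked for illustration:
import os
import wandb
from torch.optim import Adam

device = "cuda" if torch.cuda.is_available() else "cpu"
T = 300            # must match the number of steps in the variance schedule
epochs = 200
models_dir = "./models"
os.makedirs(models_dir, exist_ok=True)

model = Unet().to(device)
optimizer = Adam(model.parameters(), lr=0.001)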
wandb.init(project="diff_models", entity="byyoung3")

for epoch in range(epochs):
    cumulative_loss = 0.0
    num_steps = 0

    for step, batch in enumerate(d_loader):
        optimizer.zero_grad()
        t = torch.randint(0, T, (BATCH_SIZE,), device=device).long()
        loss = get_loss(model, batch, t)
        loss.backward()
        optimizer.step()
        cumulative_loss += loss.item()
        num_steps += 1

        if epoch % 5 == 0 and step == 0:
            print(f"Epoch {epoch} | step {step:03d} Loss: {loss.item()} ")

            image_path = sample_plot_image(epoch)
            wandb.log({"samples": [wandb.Image(image_path, caption=f"Epoch {epoch}")]}, step=epoch)

            model_save_path = os.path.join(models_dir, f"model_epoch_{epoch}.pt")
            torch.save(model.state_dict(), model_save_path)
            print(f"Saved model state to {model_save_path}")

    # Log cumulative loss at the end of each epoch
    average_loss = cumulative_loss / num_steps
    wandb.log({"cumulative_loss": cumulative_loss, "average_loss": average_loss}, step=epoch)

One of the cool features of Weights & Biases is that it lets me log the generated sample images as training progresses. Below is the line that does this:
wandb.log({"samples": [wandb.Image(image_path, caption=f"Epoch {epoch}")]}, step=epoch)
Below are the results from my training run. By about epoch 180, I was able to get some results that resemble a car. I was a bit limited in terms of compute. However, scaling up the model as well as the data will likely lead to much more impressive results.


[W&B panels: results from run stilted-cherry-10]

Overall, I hope you enjoyed this introductory tutorial on diffusion models. It's clear that diffusion models work quite differently from other approaches in AI, like supervised learning, and it's amazing to see the results not only from large models but also from small ones like the one we trained today.
Here's the repo for the project. If you have any questions or comments, feel free to leave them in the comments below!

W&B Reports:



Papers:

Articles:

Videos:


















