
New Techniques for Generating Images With Text

In this article, we'll look at image generation with CLOOB-Conditioned Latent Denoising Diffusion Generative Adversarial Networks (GANs), or CCLDDG for short.
I've been running a free AI Art Course for the past few months, and a paper that I threw in as a last-minute additional reference ended up totally nerd-sniping me into spending this past week on a side quest: learning about and implementing the tongue-twisting title of this post, a CLOOB-Conditioned Latent Denoising Diffusion GAN. Hopefully, by the end of this report, those words will seem less like random gibberish.
Diffusion models are in the news at the moment, thanks to the amazing text-to-image capabilities of OpenAI's DALL-E 2 system. The central idea is fairly simple: start with an image and repeatedly add a small amount of noise, then train a model to 'undo' this process (see AIAIART #7 for a more thorough explanation, or the minimal sketch just after the list below). They make fairly good (if slow) generative models, and there have been various modifications/additions that either add functionality or address various shortcomings of this approach. A few key ones:
  • Conditioning these models on text (with a few extra tricks like classifier-free guidance) turns them into text-to-image systems. See the GLIDE paper, for example.
  • Advanced sampling methods like DDIM can cut the number of steps required to generate an image from thousands down to as few as 25 or 50 - important, since generating images with these models is normally very slow.
  • Working in the latent space of an autoencoder (rather than directly operating on pixels) cuts down the compute required to train these models, with the tradeoff that autoencoders might introduce visible artifacts.
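To make the noising process concrete, here's a minimal PyTorch sketch of the forward step described above (illustrative only - the schedule values are common defaults rather than anything specific to the models in this post):

```python
import torch

# Forward diffusion: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)      # cumulative product up to each timestep

def add_noise(x0, t):
    """Noise a batch of images x0 to (per-image) timesteps t."""
    noise = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
    return x_t, noise                          # the denoising model learns to undo this

# Example: noise a batch of 8 dummy 3x64x64 images to random timesteps
x0 = torch.rand(8, 3, 64, 64)
t = torch.randint(0, T, (8,))
x_t, target_noise = add_noise(x0, t)
```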
With these ideas combined, we get fantastic models like CompVis' latent diffusion project. But even with fancy sampling methods, diffusion models are still a little slow to sample. Enter Denoising Diffusion GANs from NVLabs:
They introduce a technique that uses a much smaller number of steps (say, 4 vs. 4000) and adds a discriminator to the training process, making it quite similar to a traditional GAN. The claim is that this approach results in a 'best of both worlds' model that has the mode coverage and high quality of a diffusion model while keeping the rapid generation capabilities of a GAN.
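To get a feel for why so few steps are enough, here is roughly what few-step sampling could look like. This is my own illustrative take rather than the paper's exact parameterization: `G` stands for a generator that predicts a clean sample from a noisy one and a step index.

```python
import torch

@torch.no_grad()
def sample(G, shape, n_steps=4, device="cpu"):
    # A coarse noise schedule with only n_steps levels (illustrative values).
    alpha_bars = torch.linspace(0.9, 0.1, n_steps, device=device)
    x = torch.randn(shape, device=device)          # start from pure noise
    for i in reversed(range(n_steps)):
        t = torch.full((shape[0],), i, device=device, dtype=torch.long)
        x0_pred = G(x, t)                          # generator's guess at the clean sample
        if i > 0:
            # Re-noise the prediction down to the next (lower) noise level.
            a = alpha_bars[i - 1]
            x = a.sqrt() * x0_pred + (1 - a).sqrt() * torch.randn_like(x0_pred)
        else:
            x = x0_pred
    return x
```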
I liked the idea and decided to try extending it with some of the other modifications that have worked for vanilla diffusion models - namely latent diffusion and text conditioning. This report shows some preliminary results.

Implementation

I decided to implement my own version of a Denoising Diffusion GAN based on the paper alone, starting from some model definitions borrowed from LabML's annotated diffusion model that I had already modified for a few earlier experiments. To keep things organized, I packaged things like the model definitions and data-related setup into an nbdev project.
For those who haven't come across it before, nbdev (nbdev.fast.ai) lets you build libraries using Jupyter notebooks while providing a host of extra features like automatic testing, continuous integration, auto-generated docs, and seamless export from the notebooks into a 'proper' Python module. It's a great way to keep things neat, organized, and version-controlled - I recommend giving it a go!
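As a taste of the workflow, a notebook cell in an nbdev project might look something like this (directive syntax differs between nbdev versions - older releases use `# export`, newer ones `#| export` - and `TinyBlock` is a made-up example rather than a class from this repo):

```python
# --- first notebook cell: name the module this notebook exports to ---
#| default_exp models

# --- a later cell, tagged for export into that module ---
#| export
import torch.nn as nn

class TinyBlock(nn.Module):
    "A trivial example layer, just to show what exported code looks like."
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        return self.conv(x)

# Running `nbdev_export` (or `nbdev_build_lib` in older nbdev versions) then
# writes the exported cells into the package, e.g. my_project/models.py.
```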
The result is a GitHub repository that I can clone to whatever machine I'm working on and a documentation website showing the code alongside tests, demonstrations, and examples that further explain how the different components work.
The training code looks somewhat like a regular GAN training loop (in fact, I closely followed the structure of the PyTorch DCGAN tutorial). You can try out a minimal example in this Colab notebook or brave the train_cclddg.py file on GitHub if you don't mind work-in-progress code.
I deviated from the paper by adding an optional 'reconstruction loss' for the generator in addition to the discriminator loss, and I skipped extras like the regularization term for the discriminator. A simplified sketch of a single training step follows, and after that, some charts from a typical training run:
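This sketch shows the overall structure rather than the actual code from train_cclddg.py: `G`, `D`, their optimizers, and a `noise_fn` (like the `add_noise` sketch earlier) are assumed to be defined elsewhere.

```python
import torch
import torch.nn.functional as F

def train_step(G, D, opt_g, opt_d, z0, t, cond, noise_fn, recon_weight=1.0):
    """One GAN-style step on a batch of clean latents z0 with conditioning `cond`."""
    z_t, _ = noise_fn(z0, t)                       # forward-noise the clean latents

    # --- Discriminator update: real vs. generated denoising steps ---
    opt_d.zero_grad()
    with torch.no_grad():
        z0_fake = G(z_t, t, cond)
    d_real = D(z0, z_t, t)
    d_fake = D(z0_fake, z_t, t)
    loss_d = F.softplus(-d_real).mean() + F.softplus(d_fake).mean()
    loss_d.backward()
    opt_d.step()

    # --- Generator update: fool D, plus the optional reconstruction term ---
    opt_g.zero_grad()
    z0_fake = G(z_t, t, cond)
    loss_adv = F.softplus(-D(z0_fake, z_t, t)).mean()
    loss_rec = F.mse_loss(z0_fake, z0)
    loss_g = loss_adv + recon_weight * loss_rec
    loss_g.backward()
    opt_g.step()

    return loss_d.item(), loss_g.item()
```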


Over time, the model gets fairly good at reconstructing an image (or rather, the latent representation of an image) given a noised version. The Generator tries to fool the Discriminator, and the two stay in tension... Or, sometimes, go crazy!
This is GAN training, and while papers and tutorials always make it look easy, it doesn't take much to end up in a situation where things go wonky and your losses shoot up.
This is one reason why logging all experiments with something like Weights & Biases is so useful - at the very least, you'll be able to see when things go wrong and how a given run differs from one that went better or worse.
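Getting that logging in place only takes a few lines - something like this, where the project name and config values are just placeholders:

```python
import wandb

wandb.init(project="cclddg", config={"recon_weight": 1.0, "n_steps": 4})

for step in range(1000):
    loss_d, loss_g = 0.7, 1.2          # stand-ins for the values from train_step
    wandb.log({"loss_d": loss_d, "loss_g": loss_g})

wandb.finish()
```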



Results

Here are some images generated by a model that was trained for half an hour on images of celebrity faces:


These are not the most perfect, high-resolution images you'll see. But there is something interesting going on here that I'd like to highlight, which is, in essence, the key party trick of this particular model: we can steer the generation by feeding in some text as conditioning information! And if you feel that exclamation point is unjustified, remember that this dataset has NO captions!
NB: We'll dig into this in the next section, but before we move on, it's worth taking a brief aside here to remember that these datasets and models can contain all sorts of biases and issues. If you're making or using a model that uses data from the internet to decide what 'a doctor' or 'a criminal' looks like, you better take some time to think about how this might be harmful. Please use these things responsibly.

CLOOB As the Magic Ingredient

So how does this work? Well, CLOOB is a model that has been trained to map both images and text into a shared embedding space. By using a CLOOB embedding as conditioning, we can choose to feed in either an image or some text to the model at any point, with the hope that either one will provide some useful information that the model can use as it tries to reconstruct an image from noise and fool the discriminator as it does so.
Images based on the prompt 'Autumn watercolor', made using another model that uses CLOOB embeddings as conditioning.
The upshot is that we can train entirely on images, using their CLOOB embeddings as conditioning, and then, at inference time, embed a text prompt and use that to generate something which (ideally) matches the description. I've explored this idea before for CLOOB-Conditioned Latent Diffusion, building on work by @JDP and others - you can read about that project here (some outputs pictured above).
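In code, the conditioning step can be as simple as the sketch below. The `embed_image`/`embed_text` method names are hypothetical stand-ins for whatever your CLOOB wrapper exposes - the point is just that images and text land in the same space, so either one can be handed to the generator.

```python
def make_condition(cloob, image=None, text=None):
    """Return a unit-normalised CLOOB embedding from either an image or a text prompt."""
    if image is not None:
        cond = cloob.embed_image(image)    # hypothetical method name
    else:
        cond = cloob.embed_text(text)      # hypothetical method name
    return cond / cond.norm(dim=-1, keepdim=True)

# Training time:  cond = make_condition(cloob, image=image_batch)
# Inference time: cond = make_condition(cloob, text="Autumn watercolor")
```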
When some captions are available, we can randomly alternate between using the CLOOB embeddings of the text and of the images during training - a quick sketch of this follows, and after that, some outputs from a run where I trained a CCLDDG on the first million or so images from the Conceptual Captions 12M dataset.
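The alternation itself can be a per-example coin flip (again, the CLOOB method names here are hypothetical):

```python
import random

def pick_embedding(cloob, image, caption=None, p_text=0.5):
    """Condition on the caption's embedding with probability p_text, else on the image's."""
    if caption is not None and random.random() < p_text:
        return cloob.embed_text(caption)   # hypothetical method name
    return cloob.embed_image(image)        # hypothetical method name
```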

[Table: generated examples for nine prompts, with cfg_scale varying from 0 to 2.]
Nothing to rival DALL-E 2, but some of these concepts are starting to show up in the generated images despite the fact that this is a model trained from scratch in an hour or two on a single GPU. And it's fast - the images above are generated with a model that uses only four or five steps, and since all the diffusion happens in the latent space, we can process a lot of images at once. A 256px image is shrunk down to a 32x32x4 tensor by the autoencoder, so we can fit hundreds in a single batch.
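A quick back-of-the-envelope check of that last claim:

```python
# Each 3x256x256 image becomes a 4x32x32 latent after the autoencoder.
pixels_per_image = 3 * 256 * 256               # 196,608 values per image
latents_per_image = 4 * 32 * 32                # 4,096 values per latent
print(pixels_per_image / latents_per_image)    # 48.0 -> ~48x fewer values to push through the model
```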

Conclusion

This was a fun little project. I don't know that this particular model will end up being particularly good at anything. Still, I get the feeling that 'a few more papers down the line' we'll reach a point where it is easier than ever to train powerful multi-modal models to generate whatever you can imagine. And it's nice to think that, while a lovely avocado armchair remains out of reach, we mere mortals can still dip our toes into training something like this and get some abstract outputs to be proud of.
If you have questions on the project that aren't covered in this post, feel free to reach out to me on Twitter or join us in the AIAIART course discord. I hope this has inspired you to mess around with the cool ideas floating about in the world of generative modeling!
-Jonathan Whitaker