
GPT-Draw: pretraining GPT-2 for text-to-image

Using GPT-2 to generate images
A popular method for text-to-image generation is diffusion: it is how Stable Diffusion, DALL-E 2, and plenty of other models generate images. There is also an approach that doesn't use diffusion, autoregressive text-to-image, in which a GPT-like transformer generates the image patch by patch.
The goal of GPT-Draw is not to build a model specifically for text-to-image, but to take an existing model meant for text generation and adapt it to images.
How GPT-Draw generates images (slowed down)
GPT-Draw currently generates images at a resolution of 256x256.
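To make the autoregressive loop concrete, here is a minimal generation sketch under the setup described in the sections below: a HuggingFace-style GPT-2 whose tokenizer includes one pseudo-word per VQGAN code (e.g. "i123"). All names, the "iN" format, and the sampling settings are illustrative assumptions, not GPT-Draw's exact code.

```python
import torch

@torch.no_grad()
def generate_image(model, tokenizer, vqgan, prompt):
    # Tokenize the caption, then sample the 256 image codes that follow it
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=256, do_sample=True, top_k=100)
    # Recover integer codebook indices from the generated pseudo-words (assumed "iN" format)
    text = tokenizer.decode(out[0, ids.shape[1]:])
    codes = torch.tensor([int(w[1:]) for w in text.split() if w.startswith("i")])
    # Look up codebook vectors and decode to pixels with VQGAN
    # (assumes the taming-transformers interface; 256 is the latent channel dim)
    z = vqgan.quantize.get_codebook_entry(codes, shape=(1, 16, 16, 256))
    return vqgan.decode(z)  # a (1, 3, 256, 256) image tensor in [-1, 1]
```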

Dataset

The model was trained on the Conceptual Captions dataset, which comprises approximately 2.7 million text-image pairs.

Model architecture

GPT-Draw's architecture is inspired by DALL-E, but on a smaller scale. Given a dataset of text-image pairs, the images are first encoded into discrete codes using VQGAN, a publicly available pre-trained model. The string representations of these image encodings are then concatenated with their respective captions, and the resulting sequence is tokenized and fed into GPT-2, treating image generation as a plain language modeling task.
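As a rough sketch of this pipeline, the function below encodes an image with VQGAN and turns the codes into a string appended to the caption. The taming-transformers `encode` signature and the "iN" pseudo-word format are assumptions for illustration; the post doesn't specify the exact string representation.

```python
import torch

@torch.no_grad()
def make_training_sequence(vqgan, tokenizer, caption, image):
    # image: a (3, 256, 256) tensor scaled to [-1, 1]
    # taming-transformers VQModel.encode returns
    # (quantized latents, codebook loss, (perplexity, one-hot encodings, indices))
    _, _, (_, _, indices) = vqgan.encode(image.unsqueeze(0))
    # 16x16 grid of codebook indices -> one pseudo-word per code (assumed format)
    code_string = " ".join(f"i{int(i)}" for i in indices.reshape(-1))
    # Caption and image codes form a single sequence for language modeling
    # (the tokenizer is assumed to have the "iN" words added as tokens)
    return tokenizer(caption + " " + code_string).input_ids
```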

VQGAN

GPT-Draw employs a pre-trained f16 VQGAN model with a codebook of 16,384 entries. The f16 variant downsamples each spatial dimension by a factor of 16, so a 256x256 image becomes a 16x16 grid, i.e. a sequence of 256 tokens.

GPT-2

GPT-2 forms the core of the text-to-image generator.
However, some modifications have been made (a configuration sketch follows below):
  • Increased number of attention heads
  • Removed dropout
  • Reduced context length
Concatenating the prompt tokens with the 256 VQGAN codes yields a sequence of 320 tokens, leaving 64 tokens for the caption.
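With HuggingFace transformers, the modified configuration might look like the sketch below. The post only states the three changes and the 320-token context, so the width, depth, exact head count, and vocabulary layout are illustrative assumptions:

```python
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=50257 + 16384,  # GPT-2 BPE vocab plus 16,384 VQGAN codes (assumed layout)
    n_positions=320,           # reduced context: 64 caption tokens + 256 image codes
    n_embd=768,                # GPT-2 small width (assumption)
    n_layer=12,                # GPT-2 small depth (assumption)
    n_head=16,                 # increased from GPT-2 small's 12 (illustrative)
    resid_pdrop=0.0,           # dropout removed
    embd_pdrop=0.0,
    attn_pdrop=0.0,
)
model = GPT2LMHeadModel(config)
```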

Training

All the images were pre-encoded with VQGAN to speed up data loading.
A few settings (a training-loop sketch follows the list):
  • Batch size: Autoregressive text-to-image models benefit from large batch sizes. GPT-Draw was trained with a per-device batch size of 16 and 64 gradient accumulation steps, for an effective batch size of 1024.
  • Optimizer: AdamW with its default settings.
  • Learning rate: A constant learning rate of 1e-3 with a linear warm-up over the first 500 steps.
  • Hardware and duration: Training such models demands substantial computational resources. The model was trained for one epoch, equivalent to 2,690 iterations, so it saw roughly 2.75 million samples (2,690 × 1024). Training took approximately 60 hours on a single 12GB RTX 3060 GPU.
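Putting these settings together, here is a minimal sketch of the training loop. The `loader` of pre-encoded, pre-tokenized batches is assumed, and `model` is the modified GPT-2 from above:

```python
import torch
from transformers import get_constant_schedule_with_warmup

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)  # default betas/eps
scheduler = get_constant_schedule_with_warmup(optimizer, num_warmup_steps=500)

accum_steps = 64  # gradient accumulation
# loader is assumed to yield batches of 16 pre-tokenized 320-token sequences
for step, batch in enumerate(loader):
    loss = model(input_ids=batch, labels=batch).loss  # next-token prediction
    (loss / accum_steps).backward()
    if (step + 1) % accum_steps == 0:  # effective batch: 16 * 64 = 1024
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```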


Conclusion

While this project uses only a tiny fraction of the compute behind today's top image generators, it shows promising potential at its scale.
I plan to further refine and expand this work:
  • Different datasets: The LAION dataset seemed to contain a lot of digital graphics, which is reflected in the generated images. Experimenting with datasets such as Conceptual Captions and YFCC100M could improve performance and the diversity of the generated images.
  • Better image encoder: Using SBER-MoVQGAN may produce higher-quality images than the original VQGAN.
  • Larger batch size: Due to limited hardware, the batch size used was only half of what was planned. Increasing it to 2048 may yield better results and faster convergence.
  • Learning rate: Further tuning the learning rate and adding a decay schedule.
  • Optimizer: Evaluating Adafactor for better GPU memory efficiency.
  • Extended training: Training for longer once more powerful hardware is available.
The model and a demo for the early version of GPT-Draw (v0.2) have been released.
"happy businessman at office in meeting stock photography"