Outperforming Diffusion Models with GigaGAN

Can GANs still compete with diffusion models? Yes!
A while ago, I wrote about StyleGAN-T, whose results, though promising, still fell short of diffusion models. Now, just a few papers down the line, GigaGAN outperforms those diffusion models!
The premise of their paper is to investigate whether GANs are still competitive and, if not, what might be holding back their progress.

TL;DR

They experimented with scaling up StyleGAN2 and discovered that it doesn't scale naively. GANs work well for a single object category, or a handful of them, but not for synthesizing the millions of different real-world objects and scenes found in large text-image datasets. Their hypothesis is that the convolutional layers of a GAN struggle to capture the important semantic information of countless objects and pictures on their own. To tackle this problem, they introduced the architecture shown above.
GigaGAN is broken into:
  • Text branch
    • a fixed-weight CLIP model + a few additional learnable attention layers
  • Style Mapping Network
    • an MLP that controls which styles are passed into the synthesis network
  • Generator (synthesis network)
    • they found that interspersing the convolutional layers with self-attention and cross-attention layers helps greatly
    • this network takes the text conditioning and the style embedding vector as input; the intuition is that the input text acts as a switch, selectively picking which parts of the intermediate feature maps to attend to, while the style embedding vector adds the style specified in the input text (see the wiring sketch after this list)
    • to add extra expressivity to the convolutional layers, they use something called sample-adaptive kernel selection, which conditions the convolutional kernels on the text via the style embedding vector (sketched in code below)
  • Discriminator
    • the generator (and perhaps the upsampler too, though I'm not sure) produces images at several scales, and the discriminator accepts this image pyramid, predicting at every scale whether an image is real or fake given the text condition; this makes it a multi-scale input, multi-scale output (MS-I/O) discriminator (see the discriminator sketch below)
  • Upsampler
    • they use a GAN-based super-resolution model, which they suggest could also serve as an upsampler for diffusion-based methods
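
To make the data flow concrete, here's a minimal PyTorch sketch of how the text branch, style mapping network, and generator might fit together. Everything here is hypothetical and heavily simplified: the CLIP encoder is stubbed as a frozen embedding table, the generator is a single linear layer, and names like `GigaGANSketch` are mine, not the paper's.

```python
import torch
import torch.nn as nn

DIM = 512  # hypothetical feature width

class GigaGANSketch(nn.Module):
    """Toy wiring of the text branch, mapping network, and generator."""
    def __init__(self, vocab_size=49408, dim=DIM):
        super().__init__()
        # Text branch: a frozen CLIP-like encoder (stubbed here as a frozen
        # embedding table) followed by a few learnable attention layers.
        self.clip_stub = nn.Embedding(vocab_size, dim)
        self.clip_stub.weight.requires_grad = False
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.text_attn = nn.TransformerEncoder(layer, num_layers=2)
        # Style mapping network: MLP from (noise z, pooled text) to style w.
        self.mapping = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        # Generator stub; the real generator interleaves conv blocks with
        # self- and cross-attention over the text tokens.
        self.generator = nn.Linear(dim, 3 * 64 * 64)

    def forward(self, tokens, z):
        t = self.text_attn(self.clip_stub(tokens))            # (B, T, dim)
        w = self.mapping(torch.cat([z, t.mean(dim=1)], -1))   # (B, dim) style
        return self.generator(w).view(-1, 3, 64, 64)          # fake image

tokens = torch.randint(0, 49408, (2, 16))
z = torch.randn(2, DIM)
print(GigaGANSketch()(tokens, z).shape)  # torch.Size([2, 3, 64, 64])
```

In the real model, the token features would enter the generator through cross-attention at multiple resolutions, not only through the pooled style vector.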
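The sample-adaptive kernel selection mentioned above can be sketched as a learned bank of kernels mixed per sample according to the style vector. This is my simplified reading of the idea, not the paper's implementation; the grouped-convolution trick is just one way to apply a different kernel to each sample in a batch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveKernelConv(nn.Module):
    """Sample-adaptive kernel selection (simplified): the style vector w
    softly selects a per-sample kernel from a learned filter bank."""
    def __init__(self, in_ch, out_ch, style_dim, bank_size=8, k=3):
        super().__init__()
        self.in_ch, self.out_ch, self.k = in_ch, out_ch, k
        # Bank of candidate convolution kernels.
        self.bank = nn.Parameter(0.02 * torch.randn(bank_size, out_ch, in_ch, k, k))
        # Affine layer mapping the style vector to selection logits.
        self.select = nn.Linear(style_dim, bank_size)

    def forward(self, x, w):
        B, _, H, W = x.shape
        probs = F.softmax(self.select(w), dim=-1)                  # (B, bank_size)
        # Per-sample kernel: convex mixture of the bank's kernels.
        kernel = torch.einsum('bn,noihw->boihw', probs, self.bank)
        # Grouped conv applies each sample's kernel to its own feature map.
        out = F.conv2d(x.reshape(1, B * self.in_ch, H, W),
                       kernel.reshape(B * self.out_ch, self.in_ch, self.k, self.k),
                       padding=self.k // 2, groups=B)
        return out.reshape(B, self.out_ch, H, W)

x, w = torch.randn(4, 64, 32, 32), torch.randn(4, 512)
print(AdaptiveKernelConv(64, 128, 512)(x, w).shape)  # torch.Size([4, 128, 32, 32])
```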
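And here's a rough sketch of the multi-scale, text-conditioned discriminator: one head per pyramid level, each producing a real/fake logit. Conditioning on the text via a projection-style dot product is an assumption on my part, chosen for brevity.

```python
import torch
import torch.nn as nn

class MultiScaleDiscriminatorSketch(nn.Module):
    """Sketch of an MS-I/O discriminator: every level of the image pyramid
    gets its own text-conditioned real/fake prediction."""
    def __init__(self, scales=(64, 32, 16), ch=64, text_dim=512):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Conv2d(3, ch, 4, stride=2, padding=1),
                          nn.LeakyReLU(0.2),
                          nn.AdaptiveAvgPool2d(1),
                          nn.Flatten()) for _ in scales)
        self.text_proj = nn.Linear(text_dim, ch)

    def forward(self, pyramid, text_emb):
        # pyramid: list of images, highest resolution first.
        t = self.text_proj(text_emb)                      # (B, ch)
        # One text-conditioned logit per scale (projection-style dot product).
        return [(h(img) * t).sum(dim=1) for h, img in zip(self.heads, pyramid)]

pyr = [torch.randn(2, 3, s, s) for s in (64, 32, 16)]
logits = MultiScaleDiscriminatorSketch()(pyr, torch.randn(2, 512))
print([l.shape for l in logits])  # three (2,)-shaped logit tensors
```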


Results

They have an abundance of test images, but I found this one to be eye-catching!

As for performance, their experiments are very promising!

Their model boasts not only better FID but also much faster inference, since a GAN generates an image in a single forward pass rather than the sequential denoising steps of a diffusion model. The speed also owes partly to the model's smaller size: at 1B parameters, it is much more compact than popular text-to-image models like DALL-E, DALL-E 2, and Imagen.
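The speed argument is easy to see in pseudocode: a GAN needs one call to the generator, while a diffusion sampler loops over many denoising passes. This is a schematic contrast under assumed function names, not either model's actual sampler.

```python
import torch

@torch.no_grad()
def gan_sample(generator, z, text):
    # GANs: a single forward pass from noise to image.
    return generator(z, text)

@torch.no_grad()
def diffusion_sample(denoiser, x_t, text, steps=50):
    # Diffusion: a sequential chain of denoising passes (simplified; a real
    # sampler also applies a noise schedule at each step).
    for t in reversed(range(steps)):
        x_t = denoiser(x_t, t, text)
    return x_t
```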
To find out more, visit the reference below!

References

Kang, M., Zhu, J.-Y., Zhang, R., Lehtinen, J., Liu, M.-Y., and Park, T. "Scaling up GANs for Text-to-Image Synthesis." CVPR 2023. arXiv:2303.05511.