StyleGAN-T: GANs for Text2Image
Competing against diffusion models with GANs.
Created on January 26 | Last edited on January 26

StyleGAN-T, though it does not outperform every diffusion model, brings GANs back into the competitive spotlight for text-to-image synthesis. The model is a redesigned version of StyleGAN-XL, with a variety of changes:
- drop the StyleGAN3 layers that preserve equivariance
- use the StyleGAN2 backbone for synthesis
- incorporate residual convolution blocks, allowing a 2.3x to 4.5x model scale-up
- introduce stronger text conditioning (the latent vector z tended to dominate over the text input)
- use a DINO-pretrained ViT-S as the feature network in the discriminator
- use simple, multi-head discriminator blocks
- apply classifier guidance to provide extra gradients during training
- and more
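The text-conditioning point above can be illustrated with a toy sketch: conditioning works by feeding both the latent z and a text embedding into the mapping network, whose output style vector then modulates feature channels in the synthesis path. This is a minimal numpy sketch, not StyleGAN-T's actual implementation; all dimensions, weight matrices, and function names here are hypothetical stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions chosen only for illustration.
Z_DIM, C_DIM, W_DIM, FEAT = 64, 32, 64, 16

# Random matrices stand in for learned mapping-network weights.
W_map = rng.normal(size=(Z_DIM + C_DIM, W_DIM)) * 0.1
W_affine = rng.normal(size=(W_DIM, FEAT)) * 0.1

def mapping(z, c):
    """Toy mapping network: concatenating the latent z with the text
    embedding c lets the text signal reach every style vector."""
    return np.tanh(np.concatenate([z, c]) @ W_map)

def modulate(x, w):
    """Style modulation: an affine projection of the style vector w
    produces per-channel scales for the features x (the core
    StyleGAN conditioning mechanism, simplified)."""
    scale = 1.0 + w @ W_affine
    return x * scale

z = rng.normal(size=Z_DIM)   # random latent
c = rng.normal(size=C_DIM)   # stands in for a CLIP-style text embedding
x = rng.normal(size=FEAT)    # one feature vector from the synthesis path

w = mapping(z, c)
y = modulate(x, w)
print(y.shape)  # (16,)
```

If the modulation depended only on z, the text embedding would be ignored entirely; routing c through the mapping network is one simple way to strengthen text conditioning relative to the latent.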
The results are comparable to those of diffusion models such as Stable Diffusion, GLIDE, and DALL-E 2. In particular, StyleGAN-T achieved a better FID on MS-COCO 64x64 than the diffusion models it was compared against; in other settings it performed somewhat worse. The takeaway is that StyleGAN-T, and GANs at large, can go head-to-head with diffusion models, though they have their own limitations, and much work remains before they consistently match diffusion model performance.
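For context on the FID comparison above: FID is the Fréchet distance between two Gaussians fitted to real and generated image features. The full metric uses a matrix square root of the covariances; the sketch below is a simplified, diagonal-covariance version to show the formula's shape, not the evaluation code used in the paper.

```python
import numpy as np

def fid_diagonal(mu1, var1, mu2, var2):
    """Fréchet distance between two Gaussians, simplified to diagonal
    covariances: d^2 = ||mu1 - mu2||^2 + sum(v1 + v2 - 2*sqrt(v1*v2)).
    (The real FID uses full covariance matrices and a matrix sqrt.)"""
    mean_term = np.sum((np.asarray(mu1) - np.asarray(mu2)) ** 2)
    cov_term = np.sum(np.asarray(var1) + np.asarray(var2)
                      - 2.0 * np.sqrt(np.asarray(var1) * np.asarray(var2)))
    return float(mean_term + cov_term)

# Identical statistics give a distance of 0; shifting one mean by 1
# in a single dimension (same variances) gives exactly 1.
same = fid_diagonal([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
shift = fid_diagonal([0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [1.0, 1.0])
print(same, shift)  # 0.0 1.0
```

Lower FID means the generated-feature distribution is closer to the real one, which is why a lower score on MS-COCO 64x64 counts in StyleGAN-T's favor.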
Tags: ML News