StyleGAN-T: GANs for Text2Image
Competing against diffusion models with GANs.
Created on January 26 | Last edited on January 26

StyleGAN-T, though it does not outperform every diffusion model, brings GANs back into the competitive spotlight for text-to-image synthesis. The model is a redesigned version of StyleGAN-XL, with a variety of changes:
- drop the StyleGAN3 layers that preserve equivariance
- use the StyleGAN2 backbone for synthesis
- incorporate residual convolution blocks, allowing a 2.3x to 4.5x model scale-up
- introduce stronger text conditioning (the latent vector z tended to dominate over the text input)
- use a DINO-pretrained ViT-S as the feature network in the discriminator
- use simple, multi-head discriminator blocks
- apply classifier guidance to provide extra gradients during training
- and more
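The text-conditioning point above can be illustrated with a toy sketch: conditioning works by feeding both the latent z and a text embedding into the mapping network, whose output style vector then modulates feature channels in the synthesis path. This is a minimal numpy sketch, not StyleGAN-T's actual implementation; all dimensions, weight matrices, and function names here are hypothetical stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions chosen only for illustration.
Z_DIM, C_DIM, W_DIM, FEAT = 64, 32, 64, 16

# Random matrices stand in for learned mapping-network weights.
W_map = rng.normal(size=(Z_DIM + C_DIM, W_DIM)) * 0.1
W_affine = rng.normal(size=(W_DIM, FEAT)) * 0.1

def mapping(z, c):
    """Toy mapping network: concatenating the latent z with the text
    embedding c lets the text signal reach every style vector."""
    return np.tanh(np.concatenate([z, c]) @ W_map)

def modulate(x, w):
    """Style modulation: an affine projection of the style vector w
    produces per-channel scales for the features x (the core
    StyleGAN conditioning mechanism, simplified)."""
    scale = 1.0 + w @ W_affine
    return x * scale

z = rng.normal(size=Z_DIM)   # random latent
c = rng.normal(size=C_DIM)   # stands in for a CLIP-style text embedding
x = rng.normal(size=FEAT)    # one feature vector from the synthesis path

w = mapping(z, c)
y = modulate(x, w)
print(y.shape)  # (16,)
```

If the modulation depended only on z, the text embedding would be ignored entirely; routing c through the mapping network is one simple way to strengthen text conditioning relative to the latent.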
The results are comparable to those of diffusion models such as Stable Diffusion, GLIDE, and DALL-E 2. In particular, StyleGAN-T achieved a better FID on MS-COCO 64x64 than the diffusion models it was compared against; in other settings it performed somewhat worse. The takeaway is that StyleGAN-T, and GANs at large, can go head-to-head with diffusion models, though they have their own limitations, and much work remains before they consistently match diffusion model performance.
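For context on the FID comparison above: FID is the Fréchet distance between two Gaussians fitted to real and generated image features. The full metric uses a matrix square root of the covariances; the sketch below is a simplified, diagonal-covariance version to show the formula's shape, not the evaluation code used in the paper.

```python
import numpy as np

def fid_diagonal(mu1, var1, mu2, var2):
    """Fréchet distance between two Gaussians, simplified to diagonal
    covariances: d^2 = ||mu1 - mu2||^2 + sum(v1 + v2 - 2*sqrt(v1*v2)).
    (The real FID uses full covariance matrices and a matrix sqrt.)"""
    mean_term = np.sum((np.asarray(mu1) - np.asarray(mu2)) ** 2)
    cov_term = np.sum(np.asarray(var1) + np.asarray(var2)
                      - 2.0 * np.sqrt(np.asarray(var1) * np.asarray(var2)))
    return float(mean_term + cov_term)

# Identical statistics give a distance of 0; shifting one mean by 1
# in a single dimension (same variances) gives exactly 1.
same = fid_diagonal([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
shift = fid_diagonal([0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [1.0, 1.0])
print(same, shift)  # 0.0 1.0
```

Lower FID means the generated-feature distribution is closer to the real one, which is why a lower score on MS-COCO 64x64 counts in StyleGAN-T's favor.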
Tags: ML News