Building An Image Encoder With ViT-VQGAN
In this article, we attempt to reproduce the findings from ViT-VQGAN for image encoding and experiment with further adaptations, using W&B to track our results.
Image encoders compress an image into a smaller representation, sometimes even quantizing it into a discrete space (such as the VQGAN from taming-transformers used in Craiyon).
In this article, we try to reproduce the results from ViT-VQGAN ("Vector-quantized Image Modeling with Improved VQGAN") and experiment with further adaptations.
Here's what we'll be covering:
Table of Contents
- How Does ViT-VQGAN Work?
- Experimental Setup
- Adjusting Loss Coefficients
- Results
- Future Experiments
- Acknowledgments
Let's dig in!
How Does ViT-VQGAN Work?

[Figure: ViT-VQGAN architecture. Source: Vector-quantized Image Modeling with Improved VQGAN]
- Images are rearranged into n × n patches (for example, 16×16 for f16 models and 8×8 for f8 models); each patch is then embedded either with a convolution or with a reshape followed by a linear transformation (see the sketch after this list)
- Position embeddings are added to image embeddings
- A transformer model is applied (sequence of attention blocks + feed-forward blocks)
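To make the patch step concrete, here is a minimal JAX sketch of the reshape + linear variant, with learned position embeddings added before the transformer blocks. All names (`patchify_and_embed`, `proj`, `pos_emb`) are illustrative assumptions, not the actual implementation.

```python
import jax.numpy as jnp

def patchify_and_embed(images, proj, pos_emb, patch_size=16):
    """Minimal sketch of the encoder front-end described above.

    `images` is (batch, height, width, channels); `proj` is a
    (patch_size * patch_size * channels, embed_dim) matrix standing in for
    the linear transformation; `pos_emb` is (num_patches, embed_dim).
    """
    b, h, w, c = images.shape
    # Rearrange into non-overlapping patch_size x patch_size patches.
    x = images.reshape(b, h // patch_size, patch_size, w // patch_size, patch_size, c)
    x = x.transpose(0, 1, 3, 2, 4, 5).reshape(b, -1, patch_size * patch_size * c)
    # Reshape + linear transformation (a strided convolution is the alternative).
    x = x @ proj
    # Add learned position embeddings before the transformer blocks.
    return x + pos_emb[None, :, :]
```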
The resulting embeddings are then quantized through a codebook with a limited vocabulary (not required when using a KL loss). Note that for ViT-VQGAN we project to a lower dimension prior to quantization.
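Below is a minimal sketch of this quantization step, assuming a learned projection to the (lower) codebook dimension, a nearest-neighbor lookup, and the standard straight-through estimator; the function and parameter names are illustrative.

```python
import jax
import jax.numpy as jnp

def quantize(z_e, codebook, proj_down):
    """Project encoder outputs to the codebook dimension, then snap each
    vector to its nearest codebook entry.

    `z_e` is (batch, num_patches, embed_dim), `proj_down` is
    (embed_dim, codebook_dim), `codebook` is (vocab_size, codebook_dim).
    """
    z = z_e @ proj_down                                   # project to codebook dimension
    # Squared distances to every codebook entry.
    d = (jnp.sum(z**2, axis=-1, keepdims=True)
         - 2 * z @ codebook.T
         + jnp.sum(codebook**2, axis=-1))
    indices = jnp.argmin(d, axis=-1)                      # discrete codes
    z_q = codebook[indices]                               # quantized vectors
    # Straight-through estimator: forward pass uses z_q, gradients flow to z.
    z_q = z + jax.lax.stop_gradient(z_q - z)
    return z_q, indices
```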
The decoder is based on another ViT model with a final projection to image space.
Experimental Setup
We perform several short experiments to understand the model better:
- Hardware: TPU v3-8
- Parallelism: 1 model per device + optimizer preconditioners sharded across devices
- Batch size: 256
- Learning rate: 1e-4 with warmup (we performed a quick search; see the schedule sketch after this list)
- Precision: computations in bfloat16, weights/gradients in float32
- Dropout: not used (should not be needed with a large dataset)
- Weight decay: not used (same reason)
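As a concrete illustration of the learning-rate setting above, here is one possible warmup-then-constant schedule written with optax. The warmup length and the Adam stand-in are assumptions; the report only states that the optimizer's preconditioners are sharded across devices.

```python
import optax

warmup_steps = 2_000  # assumed value, not from the report
schedule = optax.join_schedules(
    schedules=[
        # Linear warmup from 0 to the peak learning rate of 1e-4.
        optax.linear_schedule(init_value=0.0, end_value=1e-4, transition_steps=warmup_steps),
        # Constant learning rate afterwards.
        optax.constant_schedule(1e-4),
    ],
    boundaries=[warmup_steps],
)
optimizer = optax.adam(learning_rate=schedule)  # stand-in optimizer; no weight decay, as noted above
```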
The training loss is a weighted combination of several terms, and each coefficient needs to be adjusted (a sketch of the weighted sum follows below):
- L1 loss: not used at the moment; it is actually supposed to be a logit-Laplace loss, but that would require outputting more channels and it does not seem essential based on Parti's findings
- L2 loss
- Codebook losses: alignment + commitment losses, used to learn the codebook and to push encoder embeddings toward their codebook entries
We also add a gradient penalty on the discriminator.
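Putting these pieces together, the generator loss could be assembled roughly as follows. The coefficient defaults are illustrative placeholders, except cost_stylegan_generator = 0.005, which is quoted later in this report, and the 0.25 commitment factor discussed in the next section.

```python
def generator_loss(loss_l2, loss_lpips, loss_stylegan, loss_q_latent, loss_e_latent,
                   cost_l2=1.0, cost_lpips=0.1, cost_stylegan_generator=0.005,
                   cost_codebook=1.0):
    """Weighted sum of the generator loss terms described above.

    Coefficient defaults are assumptions, except cost_stylegan_generator,
    which matches the value quoted later in this report.
    """
    return (cost_l2 * loss_l2
            + cost_lpips * loss_lpips
            + cost_stylegan_generator * loss_stylegan
            # Codebook terms, with the common 0.25 commitment weight.
            + cost_codebook * (loss_q_latent + 0.25 * loss_e_latent))
```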
Possible model variants (summarized in the config sketch after this list) are:
- Patch size: we use 16x16 in all experiments for now
- Patch creation: convolution layer or reshape + feed-forward
- Number of layers
- Embedding size
- LayerNorm positions: post-LN or NormFormer
- Use of GLU variants
- Activation function
- Codebook dimension
- Convolution layers can be added in feed-forward or attention blocks
- Bias: we don't use any.
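For reference, these variants can be grouped into a single configuration object. The sketch below is purely illustrative: field names and defaults are assumptions, not the actual configuration class.

```python
from dataclasses import dataclass

@dataclass
class ViTVQConfig:
    """Illustrative grouping of the model variants listed above."""
    patch_size: int = 16           # 16x16 patches in all experiments for now
    use_conv_patches: bool = True  # convolution layer vs. reshape + feed-forward
    num_layers: int = 12           # assumed default
    embed_dim: int = 768           # assumed default
    ln_style: str = "normformer"   # "postln" or "normformer"
    use_glu: bool = True           # GLU variant in feed-forward blocks
    activation: str = "gelu"       # e.g. "gelu" or "tanh"
    codebook_dim: int = 8          # dimension used for quantization
    use_conv_in_ffn: bool = False  # extra convolutions in feed-forward/attention blocks
    use_bias: bool = False         # we don't use any bias
```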
Adjusting Loss Coefficients
The generator (image encoder/decoder) loss is the sum of multiple losses that need to be scaled.
Here is the approach followed for the initial adjustment:
- You generally want the terms to have a similar order of magnitude; otherwise, some of them contribute so little that they are effectively not learned at all.
- The codebook losses are lower in this case (and can be adjusted), but they are also somewhat independent from the rest because they are mainly used to learn the codebook weights. It is also common to set loss_e_latent = 0.25 * loss_q_latent (see the sketch after this list). Other methods exist (using a codebook EMA, for example), but they have not been widely adopted and are not used in ViT-VQGAN.
- The L2 loss (or L1, not used here) is a pure pixel reconstruction loss; it is the most important term but on its own leads to blurry images.
- The lpips loss is a perceptual loss that adds sharpness but also blocky artifacts.
- The StyleGAN loss is supposed to teach the model to create more realistic images, also leading to sharper outputs but with some artifacts. It is more powerful than lpips but also more unstable.
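Here is a minimal sketch of the codebook (alignment) and commitment losses with the 0.25 convention mentioned above; the names and the exact reduction are assumptions.

```python
import jax
import jax.numpy as jnp

def codebook_losses(z, z_q, commitment_weight=0.25):
    """Codebook (alignment) and commitment losses.

    `z` are the (projected) encoder outputs and `z_q` the selected codebook
    entries, both (batch, num_patches, codebook_dim). The 0.25 weight follows
    the convention loss_e_latent = 0.25 * loss_q_latent.
    """
    # Alignment loss: pull codebook entries toward the (frozen) encoder outputs.
    loss_q_latent = jnp.mean((jax.lax.stop_gradient(z) - z_q) ** 2)
    # Commitment loss: pull encoder outputs toward the (frozen) codebook entries.
    loss_e_latent = jnp.mean((z - jax.lax.stop_gradient(z_q)) ** 2)
    return loss_q_latent + commitment_weight * loss_e_latent
```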
The losses below have been rescaled (so they are already adjusted by their multiplication factor).
[W&B panel — Run: lr 1e-4, high lpips/stylegan]
For the discriminator losses, we do the following:
- The StyleGAN loss can be unstable, so we add a gradient penalty on the discriminator. We use the "R1 regularizer" from the paper, which is computed on real images (see the sketch after this list).
- We scale the gradient penalty disc_loss_gradient_penalty so it is lower than the vanilla discriminator loss disc_loss_stylegan but still non-negligible in the total discriminator loss disc_loss.
- For StyleGAN to be useful, it is important to have a good discriminator. In this case, we want the vanilla StyleGAN loss on the discriminator side (train/disc_loss_stylegan) to be lower than its generator-side counterpart, train/loss_stylegan / cost_stylegan_generator (dividing by the coefficient undoes the rescaling mentioned earlier). In this particular example, we had cost_stylegan_generator = 0.005. If the discriminator has a high loss, the generator does not learn anything from it.
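A minimal sketch of the R1 gradient penalty, assuming a discriminator function `disc_fn(params, images)` that returns per-image logits (an illustrative interface, not the actual API):

```python
import jax
import jax.numpy as jnp

def r1_gradient_penalty(disc_fn, params, real_images):
    """Penalize the squared norm of the discriminator's gradient
    with respect to real images, as described above."""
    def disc_sum(images):
        # Sum the logits so the gradient has the same shape as the images.
        return jnp.sum(disc_fn(params, images))

    grads = jax.grad(disc_sum)(real_images)       # d(logits) / d(real images)
    # Mean over the batch of each image's squared gradient norm.
    return jnp.mean(jnp.sum(grads**2, axis=(1, 2, 3)))
```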
[W&B panel — Run: lr 1e-4, high lpips/stylegan]
Results
For these short experiments, I like to inspect samples manually to understand what's happening rather than just computing metrics such as FID.
Gradient Penalty
Gradient penalty is essential to train the model.
Codebook Dimension
A codebook dimension of 4 seems too low, but 8 could work.
Conv Patches
Using convolutions or not for the initial patches should not make a difference, but there is a small difference in our implementation: we have an additional feed-forward block in the decoder when not using convolutions (we could adjust this in a future version).
Strangely enough, the version with convolutions (which has one fewer feed-forward layer in the decoder) seems better.
Scaling lpips/StyleGAN
Higher lpips and StyleGAN loss coefficients seem to lead to better results.
Adding Convolutions
Using extra convolutions in feed-forward layers leads to better results.
GLU Variants
When using GLU variants, we also need to adjust the hidden dimension to keep a similar total number of parameters (see the sketch below). GLU leads to better results.
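For reference, a GEGLU-style feed-forward block looks roughly like this; the weight names are illustrative, and the hidden dimension would be reduced (e.g. to about 2/3 of the usual value) to keep the parameter count comparable, as noted above.

```python
import jax
import jax.numpy as jnp

def geglu_ffn(x, w_gate, w_value, w_out):
    """GLU-variant feed-forward block (GEGLU-style sketch).

    `x` is (..., embed_dim); `w_gate` and `w_value` are (embed_dim, hidden_dim);
    `w_out` is (hidden_dim, embed_dim).
    """
    gate = jax.nn.gelu(x @ w_gate)   # gated branch with the activation
    value = x @ w_value              # linear branch
    return (gate * value) @ w_out    # element-wise gating, then output projection
```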
Rotary Positional Embeddings
Using rotary positional embeddings does not seem to make a big difference.
Tanh vs. GELU in FFN (ONGOING)
Using tanh instead of GELU in feed-forward blocks: TBC
Pre-LN vs. NormFormer (ONGOING)
TBC
Effect of StyleGAN Loss
The StyleGAN loss can create instabilities if its coefficient in the generator is too high.
High Loss On Codebook
The codebook loss coefficient cannot be set too high either: it leads to slower training and instabilities.
Discriminator Learning Rate
Using too high a learning rate on the discriminator brings instability.
Interestingly, there is not much variation in the StyleGAN losses across a large range of learning rates (1e-4 to 3e-4).
Future Experiments
- Train an f16 model for longer based on the most promising config
- Train an f8 model
- Evaluate codebook loss vs. KL loss, which does not let us quantize to a codebook but avoids the related quantization loss (see the sketch after this list)
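For context, the KL alternative would regularize a continuous latent toward a standard normal instead of quantizing it. A standard VAE-style formulation (an assumption, not taken from this report) is sketched below.

```python
import jax.numpy as jnp

def kl_loss(mean, logvar):
    """KL divergence between a diagonal Gaussian latent, parameterized by
    `mean` and `logvar`, and a standard normal prior."""
    return -0.5 * jnp.mean(1.0 + logvar - mean**2 - jnp.exp(logvar))
```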
Acknowledgments
- Jiahui Yu for his advice on the implementation based on the work on ViT-VQGAN and Parti
- Phil Wang and Katherine Crowson for suggesting adding convolutions in the model.