Building An Image Encoder With ViT-VQGAN
In this article, we attempt to reproduce the findings from ViT-VQGAN for image encoding and experiment with further adaptations, using W&B to track our results.
Image encoders compress an image into a smaller representation, sometimes even quantizing it into a discrete space (such as the VQGAN from taming-transformers used in Craiyon).
In this article, we try to reproduce the results from ViT-VQGAN ("Vector-quantized Image Modeling with Improved VQGAN") and experiment with further adaptations.
Here's what we'll be covering:
Table of Contents
- How Does ViT-VQGAN Work?
- Experimental Setup
- Adjusting Loss Coefficients
- Results
- Future Experiments
- Acknowledgments
Let's dig in!
How Does ViT-VQGAN Work?

[Figure: ViT-VQGAN architecture. Source: Vector-quantized Image Modeling with Improved VQGAN]
- Images are rearranged into n × n patches (for example, 16×16 for f16 models and 8×8 for f8 models); each patch is then embedded either with a convolution or with a reshape followed by a linear transformation (see the sketch after this list)
- Position embeddings are added to image embeddings
- A transformer model is applied (sequence of attention blocks + feed-forward blocks)
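To make the patch step concrete, here is a minimal JAX sketch of the reshape + linear variant, with learned position embeddings added before the transformer blocks. All names (`patchify_and_embed`, `proj`, `pos_emb`) are illustrative assumptions, not the actual implementation.

```python
import jax.numpy as jnp

def patchify_and_embed(images, proj, pos_emb, patch_size=16):
    """Minimal sketch of the encoder front-end described above.

    `images` is (batch, height, width, channels); `proj` is a
    (patch_size * patch_size * channels, embed_dim) matrix standing in for
    the linear transformation; `pos_emb` is (num_patches, embed_dim).
    """
    b, h, w, c = images.shape
    # Rearrange into non-overlapping patch_size x patch_size patches.
    x = images.reshape(b, h // patch_size, patch_size, w // patch_size, patch_size, c)
    x = x.transpose(0, 1, 3, 2, 4, 5).reshape(b, -1, patch_size * patch_size * c)
    # Reshape + linear transformation (a strided convolution is the alternative).
    x = x @ proj
    # Add learned position embeddings before the transformer blocks.
    return x + pos_emb[None, :, :]
```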
The resulting embeddings are then quantized through a codebook with a limited vocabulary (not required when using a KL loss). Note that for ViT-VQGAN we project to a lower dimension prior to quantization.
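Below is a minimal sketch of this quantization step, assuming a learned projection to the (lower) codebook dimension, a nearest-neighbor lookup, and the standard straight-through estimator; the function and parameter names are illustrative.

```python
import jax
import jax.numpy as jnp

def quantize(z_e, codebook, proj_down):
    """Project encoder outputs to the codebook dimension, then snap each
    vector to its nearest codebook entry.

    `z_e` is (batch, num_patches, embed_dim), `proj_down` is
    (embed_dim, codebook_dim), `codebook` is (vocab_size, codebook_dim).
    """
    z = z_e @ proj_down                                   # project to codebook dimension
    # Squared distances to every codebook entry.
    d = (jnp.sum(z**2, axis=-1, keepdims=True)
         - 2 * z @ codebook.T
         + jnp.sum(codebook**2, axis=-1))
    indices = jnp.argmin(d, axis=-1)                      # discrete codes
    z_q = codebook[indices]                               # quantized vectors
    # Straight-through estimator: forward pass uses z_q, gradients flow to z.
    z_q = z + jax.lax.stop_gradient(z_q - z)
    return z_q, indices
```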
The decoder is based on another ViT model with a final projection to image space.
Experimental Setup
We perform several short experiments to understand the model better:
- Hardware: TPU v3-8
- Parallelism: 1 model per device + optimizer preconditioners sharded across devices
- Batch size: 256
- Learning rate: 1e-4 with warmup (we performed a quick search; see the schedule sketch after this list)
- Precision: computations in bfloat16, weights/gradients in float32
- Dropout: not used (should not be needed with a large dataset)
- Weight decay: not used (same reason)
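As a concrete illustration of the learning-rate setting above, here is one possible warmup-then-constant schedule written with optax. The warmup length and the Adam stand-in are assumptions; the report only states that the optimizer's preconditioners are sharded across devices.

```python
import optax

warmup_steps = 2_000  # assumed value, not from the report
schedule = optax.join_schedules(
    schedules=[
        # Linear warmup from 0 to the peak learning rate of 1e-4.
        optax.linear_schedule(init_value=0.0, end_value=1e-4, transition_steps=warmup_steps),
        # Constant learning rate afterwards.
        optax.constant_schedule(1e-4),
    ],
    boundaries=[warmup_steps],
)
optimizer = optax.adam(learning_rate=schedule)  # stand-in optimizer; no weight decay, as noted above
```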
The training loss is a weighted combination of several terms, and each coefficient needs to be adjusted (a sketch of the weighted sum follows below):
- L1 loss: not used at the moment; it is actually supposed to be a logit-Laplace loss, but that would require outputting more channels and it does not seem essential based on Parti's findings
- L2 loss
- Codebook losses: alignment + commitment losses, used to learn the codebook and to push encoder embeddings toward their codebook entries
We also add a gradient penalty on the discriminator.
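Putting these pieces together, the generator loss could be assembled roughly as follows. The coefficient defaults are illustrative placeholders, except cost_stylegan_generator = 0.005, which is quoted later in this report, and the 0.25 commitment factor discussed in the next section.

```python
def generator_loss(loss_l2, loss_lpips, loss_stylegan, loss_q_latent, loss_e_latent,
                   cost_l2=1.0, cost_lpips=0.1, cost_stylegan_generator=0.005,
                   cost_codebook=1.0):
    """Weighted sum of the generator loss terms described above.

    Coefficient defaults are assumptions, except cost_stylegan_generator,
    which matches the value quoted later in this report.
    """
    return (cost_l2 * loss_l2
            + cost_lpips * loss_lpips
            + cost_stylegan_generator * loss_stylegan
            # Codebook terms, with the common 0.25 commitment weight.
            + cost_codebook * (loss_q_latent + 0.25 * loss_e_latent))
```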
Possible model variants (summarized in the config sketch after this list) are:
- Patch size: we use 16x16 in all experiments for now
- Patch creation: convolution layer or reshape + feed-forward
- Number of layers
- Embedding size
- LayerNorm positions: post-LN or NormFormer
- Use of GLU variants
- Activation function
- Codebook dimension
- Convolution layers can be added in feed-forward or attention blocks
- Bias: we don't use any.
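For reference, these variants can be grouped into a single configuration object. The sketch below is purely illustrative: field names and defaults are assumptions, not the actual configuration class.

```python
from dataclasses import dataclass

@dataclass
class ViTVQConfig:
    """Illustrative grouping of the model variants listed above."""
    patch_size: int = 16           # 16x16 patches in all experiments for now
    use_conv_patches: bool = True  # convolution layer vs. reshape + feed-forward
    num_layers: int = 12           # assumed default
    embed_dim: int = 768           # assumed default
    ln_style: str = "normformer"   # "postln" or "normformer"
    use_glu: bool = True           # GLU variant in feed-forward blocks
    activation: str = "gelu"       # e.g. "gelu" or "tanh"
    codebook_dim: int = 8          # dimension used for quantization
    use_conv_in_ffn: bool = False  # extra convolutions in feed-forward/attention blocks
    use_bias: bool = False         # we don't use any bias
```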
Adjusting Loss Coefficients
The generator (image encoder/decoder) loss is the sum of multiple losses that need to be scaled.
Here is the approach followed for the initial adjustment:
- You generally want the terms to have a similar order of magnitude; otherwise, some of them contribute so little that they are effectively not learned at all.
- The codebook losses are lower in this case (and can be adjusted), but they are also somewhat independent from the rest because they are mainly used to learn the codebook weights. It is also common to set loss_e_latent = 0.25 * loss_q_latent (see the sketch after this list). Other methods exist (using a codebook EMA, for example), but they have not been widely adopted and are not used in ViT-VQGAN.
- The L2 loss (or L1, not used here) is a pure pixel reconstruction loss; it is the most important term but on its own leads to blurry images.
- The lpips loss is a perceptual loss that adds sharpness but also blocky artifacts.
- The StyleGAN loss is supposed to teach the model to create more realistic images, also leading to sharper outputs but with some artifacts. It is more powerful than lpips but also more unstable.
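Here is a minimal sketch of the codebook (alignment) and commitment losses with the 0.25 convention mentioned above; the names and the exact reduction are assumptions.

```python
import jax
import jax.numpy as jnp

def codebook_losses(z, z_q, commitment_weight=0.25):
    """Codebook (alignment) and commitment losses.

    `z` are the (projected) encoder outputs and `z_q` the selected codebook
    entries, both (batch, num_patches, codebook_dim). The 0.25 weight follows
    the convention loss_e_latent = 0.25 * loss_q_latent.
    """
    # Alignment loss: pull codebook entries toward the (frozen) encoder outputs.
    loss_q_latent = jnp.mean((jax.lax.stop_gradient(z) - z_q) ** 2)
    # Commitment loss: pull encoder outputs toward the (frozen) codebook entries.
    loss_e_latent = jnp.mean((z - jax.lax.stop_gradient(z_q)) ** 2)
    return loss_q_latent + commitment_weight * loss_e_latent
```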
The losses below have been rescaled (so they are already adjusted by their multiplication factor).
[W&B panel — Run: lr 1e-4, high lpips/stylegan]
For the discriminator losses, we do the following:
- The StyleGAN loss can be unstable, so we add a gradient penalty on the discriminator. We use the "R1 regularizer" from the paper, which is computed on real images (see the sketch after this list).
- We scale the gradient penalty disc_loss_gradient_penalty so it is lower than the vanilla discriminator loss disc_loss_stylegan but still non-negligible in the total discriminator loss disc_loss.
- For StyleGAN to be useful, it is important to have a good discriminator. In this case, we want the vanilla StyleGAN loss on the discriminator side (train/disc_loss_stylegan) to be lower than its generator-side counterpart, train/loss_stylegan / cost_stylegan_generator (dividing by the coefficient undoes the rescaling mentioned earlier). In this particular example, we had cost_stylegan_generator = 0.005. If the discriminator has a high loss, the generator does not learn anything from it.
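A minimal sketch of the R1 gradient penalty, assuming a discriminator function `disc_fn(params, images)` that returns per-image logits (an illustrative interface, not the actual API):

```python
import jax
import jax.numpy as jnp

def r1_gradient_penalty(disc_fn, params, real_images):
    """Penalize the squared norm of the discriminator's gradient
    with respect to real images, as described above."""
    def disc_sum(images):
        # Sum the logits so the gradient has the same shape as the images.
        return jnp.sum(disc_fn(params, images))

    grads = jax.grad(disc_sum)(real_images)       # d(logits) / d(real images)
    # Mean over the batch of each image's squared gradient norm.
    return jnp.mean(jnp.sum(grads**2, axis=(1, 2, 3)))
```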
[W&B panel — Run: lr 1e-4, high lpips/stylegan]
Results
For these short experiments, I like to inspect samples manually to understand what's happening rather than just computing metrics such as FID.
Gradient Penalty
Gradient penalty is essential to train the model.
Codebook Dimension
A codebook dimension of 4 seems too low, but 8 could work.
Conv Patches
Using convolutions or not for the initial patches should not make a difference, but there is a small difference in our implementation: we have an additional feed-forward block in the decoder when not using convolutions (we could adjust this in a future version).
Strangely enough, the version with convolutions (which has one fewer feed-forward layer in the decoder) seems better.
Scaling lpips/StyleGAN
Higher lpips and StyleGAN loss coefficients seem to lead to better results.
Adding Convolutions
Using extra convolutions in feed-forward layers leads to better results.
GLU Variants
When using GLU variants, we also need to adjust the hidden dimension to keep a similar total number of parameters (see the sketch below). GLU leads to better results.
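For reference, a GEGLU-style feed-forward block looks roughly like this; the weight names are illustrative, and the hidden dimension would be reduced (e.g. to about 2/3 of the usual value) to keep the parameter count comparable, as noted above.

```python
import jax
import jax.numpy as jnp

def geglu_ffn(x, w_gate, w_value, w_out):
    """GLU-variant feed-forward block (GEGLU-style sketch).

    `x` is (..., embed_dim); `w_gate` and `w_value` are (embed_dim, hidden_dim);
    `w_out` is (hidden_dim, embed_dim).
    """
    gate = jax.nn.gelu(x @ w_gate)   # gated branch with the activation
    value = x @ w_value              # linear branch
    return (gate * value) @ w_out    # element-wise gating, then output projection
```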
Rotary Positional Embeddings
Using rotary positional embeddings does not seem to make a big difference.
Tanh vs. GELU in FFN (ONGOING)
Using tanh instead of GELU in feed-forward blocks: TBC
Pre-LN vs. NormFormer (ONGOING)
TBC
Effect of StyleGAN Loss
The StyleGAN loss can create instabilities if its coefficient in the generator is too high.
High Loss On Codebook
The codebook loss coefficient cannot be set too high either: it leads to slower training and instabilities.
Discriminator Learning Rate
Using too high a learning rate on the discriminator brings instability.
Interestingly, there is not much variation in the StyleGAN losses across a large range of learning rates (1e-4 to 3e-4).
Future Experiments
- Train an f16 model for longer based on the most promising config
- Train an f8 model
- Evaluate codebook loss vs. KL loss, which does not let us quantize to a codebook but avoids the related quantization loss (see the sketch after this list)
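For context, the KL alternative would regularize a continuous latent toward a standard normal instead of quantizing it. A standard VAE-style formulation (an assumption, not taken from this report) is sketched below.

```python
import jax.numpy as jnp

def kl_loss(mean, logvar):
    """KL divergence between a diagonal Gaussian latent, parameterized by
    `mean` and `logvar`, and a standard normal prior."""
    return -0.5 * jnp.mean(1.0 + logvar - mean**2 - jnp.exp(logvar))
```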
Acknowledgments
- Jiahui Yu for his advice on the implementation based on the work on ViT-VQGAN and Parti
- Phil Wang and Katherine Crowson for suggesting adding convolutions in the model.