An Introduction to VAE-GANs

VAE-GAN was introduced as a model that simultaneously learns to encode, generate, and compare dataset samples. In this blog, we explore VAE-GANs and the paper that introduced them: Autoencoding beyond pixels using a learned similarity metric.
Shambhavi Mishra


Introduction

The combination of a Generative Adversarial Network (GAN) and a Variational Autoencoder (VAE) has been explored in some recent state-of-the-art research. "Hierarchical Patch VAE-GAN" [1] generates a wide range of samples in both the image and video domains, while "f-VAEGAN-D2" [2] (a CVPR 2019 paper) combines a VAE and a GAN in a conditional generative model for any-shot learning. (It also demonstrates that the learned features are interpretable.)
In this blog, however, we'll discuss the canonical 2016 ICML paper that has been cited more than 1400 times! "Autoencoding beyond pixels using a learned similarity metric" [3] proposed the idea of combining those two architectures (VAE & GAN) into a new generative model, entitled, of course: VAE-GAN.
The major contributions of this paper are:
  1. Combining a VAE and a GAN into an unsupervised generative model that simultaneously learns to encode, generate, and compare dataset samples.
  2. Showing that generative models trained with learned similarity measures produce better image samples than models trained with element-wise error measures.
  3. Demonstrating that unsupervised training results in a latent image representation with disentangled factors of variation, such as attribute vectors.
Let's dig in:

What is a VAE-GAN?

While a VAE learns to encode a given input (say, an image) and then reconstruct it from that encoding, a GAN learns to generate new data that can't be distinguished from real data. The key difference lies in what we keep: with an autoencoder, we care about the latent representations produced by the encoder, which can be reused for various tasks.
With a GAN, on the other hand, we care about the generated 'fake' images, which turn out to be not so fake after all, thanks to the discriminator, which reprimands the generator until it learns to produce convincingly 'real' fakes!
Animated version of the model flow illustrated in the paper [3].
As illustrated above, in a VAE-GAN we collapse the decoder and the generator into a single network!
The discriminator in a GAN is responsible for learning an efficient similarity metric, so that it can tell real samples from fake ones. The authors exploit this property of the discriminator to improve the reconstruction error of the VAE.
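Concretely, the forward flow looks something like the sketch below. The tiny fully-connected modules are placeholders for the real convolutional networks (described in the next section); only the wiring follows the paper:

import torch
import torch.nn as nn

# Placeholder networks: tiny stand-ins for the real conv nets
latent_dim = 128
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 2 * latent_dim))
decoder = nn.Sequential(nn.Linear(latent_dim, 3 * 64 * 64), nn.Tanh())  # doubles as the generator
discriminator = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 1), nn.Sigmoid())

x = torch.randn(8, 3, 64, 64)                          # a batch of "images"
mu, logvar = encoder(x).chunk(2, dim=1)                # VAE encoding
z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
x_recon = decoder(z)                                   # decoding a reconstruction...
x_sampled = decoder(torch.randn(8, latent_dim))        # ...and generating from the prior
d_real = discriminator(x)                              # the discriminator judges real images,
d_recon, d_fake = discriminator(x_recon), discriminator(x_sampled)  # reconstructions, and samples

Note how x_recon and x_sampled come from the same decoder: that one network plays both the VAE-decoder and GAN-generator roles.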

Architecture of a VAE-GAN

In the paper, the authors propose replacing the VAE reconstruction error term (the expected log-likelihood over pixels) with a reconstruction error expressed in terms of features learned by the GAN discriminator. Since both the decoder of a VAE and the generator of a GAN operate on the latent space z to produce an image x, a single decoder is used in place of a separate generator.
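As a rough sketch (the tensor names are our choice, not from the paper's code), the three training terms look like this, with the pixel-wise likelihood replaced by a Gaussian likelihood on the discriminator's l-th layer features Dis_l(x):

import torch
import torch.nn.functional as F

def vaegan_loss_terms(mu, logvar, disl_real, disl_recon, d_real, d_recon, d_sampled):
    # L_prior: KL divergence between q(z|x) and the N(0, I) prior
    l_prior = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    # L_llike: reconstruction error measured on the discriminator's l-th layer
    # features Dis_l(x) instead of raw pixels (an MSE, up to constants)
    l_llike = F.mse_loss(disl_recon, disl_real)
    # L_GAN: real images should be classified as real; reconstructions and
    # samples decoded from the prior should be classified as fake
    l_gan = (F.binary_cross_entropy(d_real, torch.ones_like(d_real))
             + F.binary_cross_entropy(d_recon, torch.zeros_like(d_recon))
             + F.binary_cross_entropy(d_sampled, torch.zeros_like(d_sampled)))
    return l_prior, l_llike, l_gan

In the paper, each network is updated with only a subset of these terms: the encoder with the prior and reconstruction terms, the decoder with a weighted reconstruction term minus the GAN term (a weight trades off reconstructing well against fooling the discriminator), and the discriminator with the GAN term alone. So in practice the three terms are kept separate rather than simply summed.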
The authors experiment with different generative models:
  1. Plain VAE with an element-wise Gaussian observation model.
  2. VAE with a learned distance.
  3. The combined VAE/GAN model.
  4. A plain GAN.
The architecture of the encoder, decoder, and discriminator was the same across all of these experiments. The table below contains the architectural details.
Detailed Architecture for the VAE-GAN
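For a feel of what these networks look like in code, here is a DCGAN-style sketch for 64x64 images. The channel counts and latent size below are illustrative assumptions, not copied from the paper's table:

import torch
import torch.nn as nn

latent_dim = 128  # assumption; pick to match the table above

def conv_block(c_in, c_out):
    # 5x5 strided convolution that halves the spatial resolution
    return nn.Sequential(nn.Conv2d(c_in, c_out, 5, stride=2, padding=2),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

# Encoder: strided convolutions down to a fully-connected layer producing mu and logvar
encoder = nn.Sequential(
    conv_block(3, 64), conv_block(64, 128), conv_block(128, 256),
    nn.Flatten(), nn.Linear(256 * 8 * 8, 2 * latent_dim))

# Decoder / generator: a fully-connected layer followed by upsampling convolutions
decoder = nn.Sequential(
    nn.Linear(latent_dim, 256 * 8 * 8), nn.ReLU(inplace=True),
    nn.Unflatten(1, (256, 8, 8)),
    nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.BatchNorm2d(128), nn.ReLU(inplace=True),
    nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(inplace=True),
    nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh())

# Discriminator: convolutions down to a single real/fake probability
discriminator = nn.Sequential(
    conv_block(3, 64), conv_block(64, 128), conv_block(128, 256),
    nn.Flatten(), nn.Linear(256 * 8 * 8, 1), nn.Sigmoid())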

A Glance at the Results from the Paper

The authors compare the different architectures on the CelebA dataset for generation as well as reconstruction.
Results from the VAE/GAN paper [3]

Exploring Visual Attributes

The authors also explore the latent space of the trained VAE-GAN, with the aim of generating face images conditioned on facial attributes such as eyeglasses or bushy eyebrows.
Let's look at some results!
Using the VAE/GAN model to reconstruct dataset samples with visual attribute vectors added to their latent representations.
Another interesting result from the paper shows faces generated by conditioning on attribute vectors from the test set.
Do you notice how pointy the nose reconstructed by the VAE-GAN is? And those eyeglasses are spot-on compared to the ones from the plain VAE.
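The attribute vectors themselves are simple to compute: encode images with and without the attribute, and take the difference of the mean latent codes. A minimal sketch (the encode/decode helpers are our hypothetical wrappers that map images to latent means and back):

def attribute_vector(encode, imgs_with, imgs_without):
    # Mean latent code of images that have the attribute, minus the mean
    # latent code of images that lack it
    return encode(imgs_with).mean(dim=0) - encode(imgs_without).mean(dim=0)

# e.g. glasses = attribute_vector(encode, faces_with_glasses, faces_without_glasses)
#      edited  = decode(encode(face) + glasses)  # reconstruct the face, now with glasses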

Let's give it a try!

We used a pretrained ResNet-34 model in our experiments, and the GIF below shows what we got (the results improve over time)!
We plot the F1 score in the panels below. Don't forget to check out the runs!
Annnnnnd, can we ever forget that our readers might want to get their hands dirty with some code?
Check out this awesome GitHub repository [4] for an implementation!
Oh, you want to log your results to compare them with what we got? We got you covered.
Here's a Colab notebook of the code with W&B integration:

VAE-GAN with W&B

All you need to do is enter your username as the entity and your API key in the cell below 👇🏻.
import os
import wandb

os.environ["WANDB_API_KEY"] = "your_key"
run = wandb.init(project="vaegan-new", entity="your_username")
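Once the run is initialized, logging training curves is a single call per step. For example (the metric names are just our choice, and the loss variables come from the loss sketch earlier in this post):

# Inside the training loop, after computing the three loss terms:
wandb.log({"loss/prior": l_prior.item(),
           "loss/llike": l_llike.item(),
           "loss/gan": l_gan.item()})

run.finish()  # close the run once training is done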

Conclusion

VAE-GAN is indeed a hybrid of a VAE and a GAN: the authors train Dec jointly, as both a VAE decoder and a GAN generator. The paper also introduces learned similarity measures as a promising step towards scaling generative models up to more complex data distributions.
VAE-GAN was also one of the first attempts at unsupervised learning of encoder-decoder models. Another interesting aspect of the model is its ability to expose visual attributes in the high-level representation of the latent space. And, as discussed at the beginning of this blog, VAE-GAN is being utilized in different application domains.

References

  1. Hierarchical Patch VAE-GAN: Generating Diverse Videos from a Single Sample
  2. f-VAEGAN-D2: A Feature Generating Framework for Any-Shot Learning
  3. Autoencoding beyond pixels using a learned similarity metric
  4. GitHub repository to explore the code!