Most of us are familiar with the concept of a discriminative model: given an input, say an image, the discriminative model predicts, for instance, whether it's a cat or a dog. Usually, in a discriminative model, each training example has a label, so discriminative modeling is synonymous with supervised learning.
Formally speaking, discriminative modeling estimates p(y|x) — the probability of a label y (cat or dog) given an observation x (image).
A generative model, on the other hand, describes how a dataset is generated in terms of a probabilistic model. By sampling from such a probabilistic model we can generate new data. Usually, a generative model is applied to unlabeled training examples (unsupervised learning).
Formally speaking, generative modeling estimates p(x) — the probability of observing an observation x. In the case of a labeled dataset, we can also build a conditional generative model p(x|y) — the probability of the observation x given its label y.
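To make the distinction concrete, here is a tiny, purely illustrative 1-D example (the class names and cluster centers are made up): fitting a Gaussian per class gives a generative model p(x|y) we can sample from, and Bayes' rule turns it into a discriminative p(y|x).

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 1-D "images": cats cluster near 0, dogs near 4
cats = rng.normal(0.0, 1.0, 500)
dogs = rng.normal(4.0, 1.0, 500)

# Generative modeling: estimate p(x|y) by fitting a Gaussian per class
mu_cat, sd_cat = cats.mean(), cats.std()
mu_dog, sd_dog = dogs.mean(), dogs.std()

def gauss(x, mu, sd):
    # 1-D normal density
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

# Sampling from the fitted p(x|y='cat') generates a new "cat" observation
new_cat = rng.normal(mu_cat, sd_cat)

# Discriminative view: Bayes' rule turns p(x|y) into p(y|x)
def p_dog_given_x(x):
    num = gauss(x, mu_dog, sd_dog)
    return num / (num + gauss(x, mu_cat, sd_cat))
```

A point near the dog cluster gets a high p(dog|x), while a point near the cat cluster gets a low one, which is exactly the discriminative question answered with generative machinery.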
This blog post is divided into two parts. The first part will discuss autoencoders and then variational autoencoders, which are among the most fundamental architectures for deep generative modeling.
The second part of this blog will deal with the theoretical underpinnings of Generative Adversarial Networks (GANs).
I build upon the keras-gan repository and instrument it with W&B to understand the underlying concepts through experimentation.
Let's get started :smile:
An autoencoder is a combination of an encoder function, h(x), which takes in input data and converts it into a different representation, and a decoder function, g(z), which takes this representation back to the original domain. Here z = h(x) is the latent representation of the input. Thus the encoder compresses the high-dimensional input space (the data) into a low-dimensional latent space (the representation), while the decoder decompresses the given representation back to the original domain.
(Fig 1: Simplistic overview of an autoencoder architecture, from Generative Deep Learning by David Foster)
So, how can we use an autoencoder as a generative model?
As discussed, the latent space is a compressed, high-level representation of the training dataset. So if we take a sample from the distribution of this latent space and pass it through the decoder, we can generate new data. Autoencoders have many other interesting use cases; you can check out the introduction to image inpainting with deep learning by me and Sayak Paul, which discusses the application of autoencoders to deep image inpainting.
Now let’s build our autoencoder and use Weights and Biases to visualize the latent space and see how the size of the latent space affects the reconstruction. For simplicity, we'll use the MNIST dataset of handwritten digits. To follow along, check out this colab notebook, which implements the class `Autoencoder`.
```python
def build_encoder(self):
    ## ENCODER
    encoder_input = keras.layers.Input(shape=self.input_shape)
    x = keras.layers.Dense(self.intermediate_dim, activation='relu')(encoder_input)
    ## This is the latent space
    encoder_output = keras.layers.Dense(self.latent_space, activation='relu')(x)
    self.encoder = keras.models.Model(inputs=[encoder_input], outputs=[encoder_output])
    return encoder_input, encoder_output
```
For simplicity I am using `Dense` layers with `relu` activation. Given the simplicity of MNIST, we will only use one intermediate layer. The `encoder_output` is our latent representation, whose size is determined by the layer's units. We will shortly look at the effect of the size of the latent space on the reconstruction. Notice that I am also saving an instance of the encoder model in `self.encoder`.
Now let’s go through the `build_decoder` method.
```python
def build_decoder(self):
    ## DECODER
    decoder_input = keras.layers.Input(shape=self.latent_space)
    x = keras.layers.Dense(self.intermediate_dim, activation='relu')(decoder_input)
    ## This is the reconstruction
    decoder_output = keras.layers.Dense(self.original_dim, activation='sigmoid')(x)
    self.decoder = keras.models.Model(inputs=[decoder_input], outputs=[decoder_output])
```
The decoder takes the latent representation as input. Note that the decoder does not need to be symmetrical to the encoder, but its output shape must match the input shape. Now that we have both components of the autoencoder architecture, let’s join them and train the model. To join them we call the `build_model` method of our `Autoencoder` class.
```python
def build_model(self):
    ## Initialize encoder model
    encoder_input, encoder_output = self.build_encoder()
    ## Initialize decoder model
    self.build_decoder()
    ## Join encoder and decoder
    decoder_output = self.decoder(encoder_output)
    ## Build autoencoder model
    return keras.models.Model(inputs=[encoder_input], outputs=[decoder_output])
```
We first initialize (build) the encoder model, followed by the decoder model. We pass `encoder_output`, which is the latent representation, to the decoder model. The output of the decoder, `decoder_output`, is the reconstruction. Now we have the Keras implementation of our autoencoder model. This model represents the flow of an image through the encoder and back out through the decoder.
```python
ae = Autoencoder(input_shape=(28,28), latent_space=2)
model = ae.build_model()
model.compile('adam', 'mean_squared_error')
```
We compile our model with the `adam` optimizer and a simple loss function, `mean_squared_error`. We use the `WandbCallback` to log training metrics, along with a custom callback, `ReconstructionLogger`, which logs the reconstruction of one batch of test data at the end of each epoch. You can check out the implementation in the notebook. We implemented a similar logger in the introduction to image inpainting with deep learning report.
Now that we have trained our autoencoder, let’s analyze its performance by looking at the images represented in the latent space. We set the latent size to 2 in this run to make it easy to analyze.
```python
latent_val1, latent_val2 = [], []
for img, label in tqdm(zip(example_images, example_labels)):
    latent_val = ae.encoder.predict(img.reshape((1,)+img.shape))[0]
    latent_val1.append(latent_val[0])
    latent_val2.append(latent_val[1])

plt.figure(figsize=(8,8))
plt.scatter(latent_val1, latent_val2, cmap='rainbow', c=example_labels, alpha=0.5, s=2);
wandb.log({"chart": plt})
```
We take `n` test images and pass them through the encoder; this is why we saved an instance of the encoder model. Using `wandb.log()` you can even log the `plt` object, which is really cool.
From our observations of the latent space plot in Fig 3, we can point out some serious limitations:
How should we go about choosing random points in the latent space? From our first observation, the points are not centered around zero, and the distribution of these points is undefined. We could potentially sample a point from the entire 2D space, but that would result in a bad generative model (as we'll see next).
From our second observation, we can conclude that there will be a lack of diversity in the generated images. Some digits are represented over a larger region of the 2D space than others. While randomly sampling from the latent space, the classes occupying a greater surface area will show up more often in the generated images.
From our third observation, we find that even when samples are generated from a single ‘region’, the generated images are of varying quality.
We can argue that since the latent space only has two dimensions to work with, the autoencoder squashed the digit groups together with relatively small gaps and overlapping regions in some places. But even if you increase the latent space to 100 dimensions the problem remains, as an autoencoder inherently doesn't know how to use the latent space to encode the digits with enough space between them to generate well-formed images. So how can we solve these problems? We will tackle this shortly.
First, let’s quickly see the effect of latent size on the reconstruction.
To do so we will use W&B’s Hyperparameter Sweeps to automatically run experiments. In our case, we are only interested in the effect of latent size. I highly recommend checking out Running Hyperparameter Sweeps to Pick the Best Model by Sayak Paul for a simple introduction to this amazing tool. To follow along, go to this colab notebook.
```python
sweep_config = {
    'method': 'grid',
    'parameters': {
        'latent_space': {
            'values': [2, 10, 100]
        }
    }
}
```
We set up our `sweep_config` with the values of `latent_space` that we will experiment with. As you would expect, increasing the latent size improves the quality of the reconstruction. Using the `ReconstructionLogger` as a custom W&B callback I was able to log the reconstruction for each experiment. You can visit this sweep page to check out the experiment results.
To counter the issues of an autoencoder we must do something about how the encoder uses the latent space to encode the high-level features. Maybe we can take our earlier undefined distribution of the latent space, and define it in a way that facilitates better random samples? We will now try to understand how an autoencoder can be changed to a variational autoencoder to realize a truly generative model.
In an autoencoder, each image is mapped directly to a point in the latent space. In a variational autoencoder, each image is instead mapped to a multivariate normal distribution around a point in the latent space. A multivariate normal distribution is the generalization of the normal distribution to higher dimensions. In one dimension, it’s characterized by the famous bell curve. Mathematically a normal distribution, also called a Gaussian distribution, is defined as –

f(x | μ, σ²) = 1 / (σ√(2π)) · exp(−(x − μ)² / (2σ²))
where μ is the mean and σ is the standard deviation of the distribution. If μ = 0 and σ = 1 we have the standard normal distribution. For a multivariate normal distribution of k dimensions, the probability density function is defined as –

f(x) = (2π)^(−k/2) |Σ|^(−1/2) · exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ))
where x and μ are k-dimensional vectors and Σ is the k × k covariance matrix. For a more detailed explanation I would recommend checking out this and this. To visualize and compare the latent space of a variational autoencoder with that of an autoencoder, we require a bivariate (2-dimensional) normal distribution. For a 2-dimensional distribution –
Before implementing a variational autoencoder, let’s make a few assumptions:
There is no correlation between any of the dimensions of the latent space. This implies that the input image is mapped to individual high-level features without modeling the covariance between dimensions. Loosely speaking, we assume that these high-level features are independent.
Because of this zero covariance, the covariance matrix is diagonal, so we just have to map each input to a mean vector and a variance vector. Also, since variance is always positive, we predict the log of the variance instead, so that it can take values in the range (−∞, ∞), which is the natural output range of a neural network.
Now that we have established some key concepts for variational autoencoders, let’s implement one. There will only be two changes to our original `Autoencoder` implementation: the encoder and the loss function. In the accompanying notebook you will find the `VariationalAutoencoder` class. Let’s go through its `build_encoder` method and break it down.
```python
def build_encoder(self):
    self.inputs = keras.layers.Input(shape=self.input_shape, name='encoder_input')
    x = keras.layers.Dense(self.intermediate_dim, activation='relu')(self.inputs)
    ## Mean layer
    self.z_mean = keras.layers.Dense(self.latent_dim, name='z_mean')(x)
    ## Log variance layer
    self.z_log_var = keras.layers.Dense(self.latent_dim, name='z_log_var')(x)
    ## Latent space
    self.z = keras.layers.Lambda(self.sampling, output_shape=(self.latent_dim,), name='z')([self.z_mean, self.z_log_var])
    # Instantiate the encoder model
    self.encoder = keras.models.Model(self.inputs, [self.z_mean, self.z_log_var, self.z], name='encoder')
```
The encoder takes each input image and encodes it into two vectors, `z_mean` and `z_log_var`, which together define a multivariate normal distribution in the latent space. As we're working with an easy dataset, we use only one intermediate layer. Notice how the intermediate layer is not directly connected to the latent space (`z`). Instead, `z` is sampled from the multivariate normal distribution; it is a sampling layer. We use Keras' `Lambda` function to wrap our sampling function as a Keras `Layer` object.
But such a sampling layer is stochastic, and therein lies the issue: we cannot backpropagate gradients through a sampling layer because of its stochastic nature. Backpropagation requires deterministic nodes so it can iteratively pass gradients and apply the chain rule. The breakthrough idea that solves this issue is to reparameterize the sampling layer so that the network can be trained end to end. The key insight is that given the mean (μ) and standard deviation (σ) of the distribution as vectors, we can sample a point z from this distribution using the simple formulation –

z = μ + σ · ε

where ε (epsilon) is sampled from a standard normal distribution (the prior).
(Fig 5: VAE with and without reparametrization. (Source))
```python
def sampling(self, args):
    z_mean, z_log_var = args
    batch = K.shape(z_mean)[0]
    dim = K.int_shape(z_mean)[1]
    # By default, random_normal has mean = 0 and std = 1.0
    epsilon = K.random_normal(shape=(batch, dim))
    return z_mean + K.exp(0.5 * z_log_var) * epsilon
```
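To convince ourselves that reparameterized samples really follow N(μ, σ²), here is a standalone NumPy check (separate from the Keras code) for a single latent dimension with μ = 2 and σ = 2:

```python
import numpy as np

rng = np.random.default_rng(42)
z_mean, z_log_var = 2.0, np.log(4.0)  # target distribution: N(2, 2^2)

# Reparameterization: z = mu + sigma * eps, with eps ~ N(0, 1)
eps = rng.standard_normal(100_000)
z = z_mean + np.exp(0.5 * z_log_var) * eps
```

The sample mean and standard deviation come out close to 2 and 2, even though the randomness lives entirely in `eps`, which is exactly what lets gradients flow through `z_mean` and `z_log_var`.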
I highly recommend checking out the “Reparameterization” trick in Variational Autoencoders by Sayak Paul for a more detailed coverage of this idea. We will shortly see how this small change in the autoencoder makes this a truly generative model.
The second change to our original autoencoder is the loss function. Previously, our loss function consisted only of the `mean_squared_error` loss between images and their reconstructions after being passed through the autoencoder. The same reconstruction loss also appears in the variational autoencoder, but we need one extra component: the Kullback–Leibler (KL) divergence loss.
KL divergence measures how different one probability distribution is from another. We use it to regularize how our multivariate distribution is learned: we measure how far our normal distribution, with parameters `z_mean` and `z_log_var`, is from the standard normal distribution. Mathematically, the KL divergence for our variational autoencoder is defined as –

D_KL = −½ Σ (1 + log(σ²) − μ² − σ²)

where the sum runs over the dimensions of the latent space.
To summarize, the KL divergence term penalizes the network for encoding observations to `z_mean` and `z_log_var` values that differ significantly from the parameters of a standard normal distribution, namely μ = 0 and σ = 1.
```python
def add_loss(self):
    # VAE loss = reconstruction (mse or xent) loss + KL loss
    reconstruction_loss = keras.losses.mse(self.inputs, self.outputs)
    reconstruction_loss *= self.original_dim
    kl_loss = 1 + self.z_log_var - K.square(self.z_mean) - K.exp(self.z_log_var)
    kl_loss = K.sum(kl_loss, axis=-1)
    kl_loss *= -0.5
    vae_loss = K.mean(reconstruction_loss + kl_loss)
    self.vae.add_loss(vae_loss)
```
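The `kl_loss` lines above implement the closed-form KL divergence to a standard normal. A small NumPy version (a standalone sketch mirroring those lines, not part of the model) makes it easy to check edge cases:

```python
import numpy as np

def kl_to_standard_normal(z_mean, z_log_var):
    # Closed-form KL(N(mu, sigma^2) || N(0, 1)), summed over latent dimensions
    return -0.5 * np.sum(1 + z_log_var - np.square(z_mean) - np.exp(z_log_var))

# A distribution that is already standard normal has zero divergence
kl_zero = kl_to_standard_normal(np.zeros(2), np.zeros(2))
# Shifting the mean away from zero is penalized
kl_shifted = kl_to_standard_normal(np.ones(2), np.zeros(2))
```

As expected, the divergence is zero only when the encoded distribution matches the standard normal, which is exactly the pull toward a well-behaved latent space.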
Let’s train our model. To build our variational autoencoder, we simply do this.
```python
keras.backend.clear_session()
vae = VariationalAutoencoder(input_shape=(784,),
                             original_dim=784,
                             intermediate_dim=512,
                             latent_dim=2)
model = vae.build_model()
model.compile('adam')
```
Notice how the model is compiled with only an optimizer; the loss function was added to the model when we called the `build_model` method. Just like we trained our autoencoder with the `WandbCallback` and `ReconstructionLogger`, we do the same for our VAE.
We have spent a lot of time understanding and improving our simple autoencoder for generative modeling. Now that we have improved it in the form of a variational autoencoder, let’s use it to finally generate new digits. To do so, we can simply sample random points from a standard normal distribution and feed them through the decoder of our VAE.
```python
generated_images = []
for i in tqdm(range(32)):
    ## Sample from a standard normal distribution
    l_sample = np.random.normal(size=2)
    l_sample = l_sample.reshape((1,)+l_sample.shape)
    ## Pass the sample to the decoder
    gen_img = vae.decoder(l_sample)
    gen_img = gen_img.numpy()
    generated_images.append(gen_img.reshape(28,28))

wandb.log({"generated_images": [wandb.Image(image) for image in generated_images]})
```
GANs are one of the most exciting advancements in deep learning. GANs stage a battle between two adversaries, the generator and the discriminator. As the name suggests, the generator is responsible for learning to map the latent space to new images, without directly encoding images into that latent space as we did with an autoencoder. The discriminator, on the other hand, is responsible for telling apart the generated images from those in the training dataset. Since images are not directly encoded, the latent space must start from random noise. Thus the generator tries to convert random noise into observations that look as if they were sampled from the original dataset, while the discriminator keeps a watchful eye out to catch the generator trying to fool it.
(Fig 10: Simplistic overview of a GAN, from Generative Deep Learning by David Foster)
At the start of this battle, the generator outputs noisy images and the discriminator predicts randomly. Note that the two networks are trained separately. The key idea of GANs lies in how we alternately train the generator and the discriminator. By alternating the training of the two networks, the generator becomes more adept at fooling the discriminator, while the discriminator becomes better at catching images artificially created by the generator. This forces the generator to come up with new ways to fool the discriminator, and the cycle continues.
To understand this tug of war, let’s implement a GAN ourselves. I have instrumented the wonderful keras-gan repository. Let’s walk through the code! We'll start with the discriminator, as it’s the easiest to implement.
Navigate to the `dcgan` directory in the linked repository. You will find the script `dcgan.py`, which contains a class `DCGAN` with `build_generator`, `build_discriminator`, `train`, and `show_imgs` methods. Let's walk through the first three to understand their implementations.
```python
def build_discriminator(self):
    model = Sequential()
    model.add(Conv2D(32, kernel_size=3, strides=2,
                     input_shape=self.img_shape, padding="same"))
    model.add(LeakyReLU(alpha=0.2))
    model.add(Dropout(0.25))
    ...
    model.add(Conv2D(256, kernel_size=3, strides=1, padding="same"))
    model.add(BatchNormalization(momentum=0.8))
    model.add(LeakyReLU(alpha=0.2))
    model.add(Dropout(0.25))
    model.add(Flatten())
    model.add(Dense(1, activation='sigmoid'))
```
The goal of the discriminator is to predict whether an image is real or fake, which, as you can guess, is a simple binary classification problem. Our discriminator’s `Sequential` model has the same network architecture as a supervised image classifier. Since we are classifying whether the image is real or fake, the output layer is a `Dense` layer with one neuron. And since the architecture uses convolutional layers, it’s called a Deep Convolutional GAN, or DCGAN.
Now let’s build the generator.
```python
def build_generator(self):
    model = Sequential()
    model.add(Dense(128 * 7 * 7, activation="relu", input_dim=self.latent_dim))
    model.add(Reshape((7, 7, 128)))
    model.add(UpSampling2D())
    model.add(Conv2D(128, kernel_size=3, padding="same"))
    model.add(BatchNormalization(momentum=0.8))
    model.add(Activation("relu"))
    model.add(UpSampling2D())
    model.add(Conv2D(64, kernel_size=3, padding="same"))
    model.add(BatchNormalization(momentum=0.8))
    model.add(Activation("relu"))
    model.add(Conv2D(self.channels, kernel_size=3, padding="same"))
    model.add(Activation("tanh"))
```
Since the aim is to generate images from random noise representing our latent space, we start with a `Dense` layer whose units are then reshaped into a feature map. We use `UpSampling2D` layers to increase the spatial dimensions of the feature maps. The last `Conv2D` block has 3 channels or 1 channel depending on whether the training dataset is RGB or grayscale, and has the same spatial size as the training images. Notice that it’s activated using `tanh`, so our training data must be rescaled to (-1, 1). The network architecture looks similar to that of a decoder. The interesting part is that this decoder will never see any training data.
Both adversaries were easy to implement, but as mentioned, the trick lies in the training process. Let’s try to understand this key aspect. We can train our discriminator by randomly sampling some images from our training set and some from the generated images. The discriminator should output `1` for real images and `0` for fake images, so we can treat this as a supervised learning problem. But what about our generator? We don’t want to map the latent vector to some particular true image; there is no training set to do so. We want to fool the discriminator into thinking that the generated image is real.
Thus, to train the generator we will connect it with the discriminator, i.e. feed the output image of the generator to the discriminator so that the output of this combined model is the probability that the image is real according to the discriminator.
```python
# Build the discriminator
self.discriminator = self.build_discriminator()

# Build the generator
self.generator = self.build_generator()

# The generator takes noise as input and generates images
z = Input(shape=(self.latent_dim,))
img = self.generator(z)

# For the combined model we will only train the generator
self.discriminator.trainable = False

# The discriminator takes generated images as input and determines validity
valid = self.discriminator(img)

# The combined model (stacked generator and discriminator)
# trains the generator to fool the discriminator
self.combined = Model(z, valid)
self.combined.compile(loss='binary_crossentropy', optimizer=optimizer)
```
To train this combined model, the input is a randomly generated latent vector and the target output is set to `1`. Note that while doing so, the discriminator is frozen so that its weights don’t get updated. This ensures that the discriminator does not learn to adjust itself while the generator is learning. We want generated images to be predicted close to 1 (real) because the generator is strong, not because the discriminator is weak.
The `train` method implements the logic for GAN training. I integrated W&B into the `dcgan.py` script so that we can log `discriminator_loss` and `generator_loss` along with `discriminator_accuracy`. I am also logging generated images every `gen_interval` steps, which you can specify as a command line argument. Let’s train it. 🍩
After a sufficient number of epochs, the discriminator and generator come to an equilibrium; by then the generator has learned useful features and generates more realistic images.
*Fig 11: Generated images logged using W&B.*
It’s worth mentioning that the generator started from random noise and slowly learned high-level features to generate images. It’s really impressive and mind-blowing. Let’s look at the loss plot.
If you have come this far, congratulations. Take a moment and reflect on what you learned, especially on those ideas that you had while reading this post.
Autoencoders and variational autoencoders are really powerful deep learning architectures. Apart from generative modeling, they have many practical applications like image denoising, anomaly detection, and inpainting. GANs, on the other hand, are a very popular deep learning architecture, and every day we see continued progress in this field, opening up many new ways to harness their power.
While training GANs may be hard, tracking tons of experiments is not at all hard with Weights and Biases.
Thank you for reading this article until the end. I would like to thank Sayak Paul for providing initial direction on this article. I would also like to especially thank Lavanya for going back and forth with this article. Please feel free to let me know if you have any feedback on twitter @ayushthakur0. I would appreciate it :)