In the words of Yann LeCun, Generative Adversarial Networks (GANs) are "The most interesting idea in Machine Learning in the last 10 years". This is not surprising, since GANs have been able to generate almost anything: high resolution images of people "resembling" celebrities, building layouts and blueprints, all the way to memes. Their strength lies in their incredible ability to model complex distributions. While autoencoders have attempted to be as versatile as GANs, they have (at least until now) not had the same generative power, and historically have learnt entangled representations. The authors of the paper draw inspiration from recent progress in GANs and propose a novel autoencoder which addresses these fundamental limitations. In the next few sections, we'll dive deeper and find out how.
-> Source <-
An autoencoder is a combination of an encoder function, $g_{\phi}(x)$, which takes in input data and converts it into a different representation, and a decoder function, $f_{\theta}(z)$, which maps this representation back to the original domain. Here $z = g_{\phi}(x)$ is the latent representation of the input. The encoder thus compresses the high dimensional input space (the data) into a low dimensional latent space (the representation), while the decoder reconstructs the original domain from a given representation. An autoencoder can therefore combine generative and representational properties by learning an encoder and a generator (decoder) simultaneously.
-> Basic Autoencoder (Source) <-
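To make the encoder/decoder split concrete, here is a minimal sketch of an autoencoder in PyTorch. This is purely illustrative and not from the ALAE repo; the class name, layer sizes, and dimensions are our own arbitrary assumptions.

```python
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    """Illustrative autoencoder: encoder g_phi compresses 784-d inputs
    to an 8-d latent z; decoder f_theta maps z back to the data domain."""
    def __init__(self, in_dim=784, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                     nn.Linear(64, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                     nn.Linear(64, in_dim))

    def forward(self, x):
        z = self.encoder(x)      # z = g_phi(x): latent representation
        x_rec = self.decoder(z)  # f_theta(z): reconstruction in data space
        return x_rec, z
```

Training such a model typically minimizes a reconstruction loss between x and x_rec, which, as we'll see below, is exactly what ALAE avoids doing in the data space.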
A GAN stages a battle between two adversaries: the Generator ($G$) and the Discriminator ($D$). As the name suggests, the generator learns to map samples from the latent space ($Z$), drawn from a known prior $p(z)$, to new images representing a synthetic distribution $q(x)$, without directly encoding images into that latent space as an autoencoder does. The discriminator, on the other hand, is responsible for telling the generated images apart from those in the training dataset, which represents the true distribution $p_D(x)$. A GAN aims to learn $G$ such that $q(x)$ is as close as possible to $p_D(x)$. This is achieved by playing a zero-sum two-player game with the discriminator. You can read more on autoencoders, GANs, and latent representations here.
-> Basic GAN (Source) <-
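To make the zero-sum game concrete, here is a minimal sketch of the standard (non-saturating) GAN losses in PyTorch. The function names are ours and not from the ALAE code; this is just one common way the two-player objective is implemented.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_real_logits, d_fake_logits):
    # D tries to classify real samples as 1 and generated samples as 0.
    real = F.binary_cross_entropy_with_logits(
        d_real_logits, torch.ones_like(d_real_logits))
    fake = F.binary_cross_entropy_with_logits(
        d_fake_logits, torch.zeros_like(d_fake_logits))
    return real + fake

def generator_loss(d_fake_logits):
    # G tries to make D output 1 on generated samples (fooling D).
    return F.binary_cross_entropy_with_logits(
        d_fake_logits, torch.ones_like(d_fake_logits))
```

Alternating updates with these two losses drives $q(x)$ towards $p_D(x)$, which is the same adversarial mechanism ALAE reuses below.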
Even though autoencoders have been extensively studied, two questions have not been fully addressed: can autoencoders match the generative power of GANs, and can they learn disentangled representations?
Points in the latent space hold relevant information about the input data distribution. If these points are less entangled amongst themselves, we have more control over the generated data, as each point contributes to one relevant feature in the data domain. The authors of Adversarial Latent Autoencoders have designed an autoencoder which addresses both issues jointly. Next, let's take a closer look at the architecture.
The ALAE architecture is a modification of the original GAN, decomposing the Generator $(\textbf{G})$ and Discriminator $(\textbf{D})$ networks into two networks each, such that $\textbf{G}$ = $\textit{G} \circ \textit{F}$ and $\textbf{D}$ = $\textit{D} \circ \textit{E}$. The architecture is shown in Figure 3. It is assumed that the latent space between the decomposed networks is the same and is denoted $\mathcal{W}$. Let's go through each network block one by one.
-> Architecture of ALAE (Source) <-
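The decomposition above can be sketched with toy modules. Everything here (shapes, linear layers, variable names) is our own illustration, not the actual ALAE networks:

```python
import torch
import torch.nn as nn

latent = 8
# The GAN generator is split into F (prior -> W) and G (W -> image);
# the discriminator is split into E (image -> W) and D (W -> logit).
Fnet = nn.Linear(latent, latent)   # F: z -> w
Gnet = nn.Linear(latent, 16)       # G: w -> x (toy 16-d "image")
Enet = nn.Linear(16, latent)       # E: x -> w
Dnet = nn.Linear(latent, 1)        # D: w -> real/fake logit

z = torch.randn(4, latent)         # samples from the known prior p(z)
x_fake = Gnet(Fnet(z))             # composed generator  G ∘ F
logit = Dnet(Enet(x_fake))         # composed discriminator  D ∘ E
```

The key point is that both compositions pass through the same latent space $\mathcal{W}$, which is what lets ALAE enforce reciprocity there.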
This network is responsible for converting the prior distribution $p(z)$ into an intermediate latent ($\omega$) distribution, $q_F(\omega)$. It has been shown in this paper that an intermediate latent space, further removed from the input space, tends to have better disentanglement properties. The authors of ALAE assume $\textit{F}$ to be a deterministic map in the most general case.
Thus $\textit{F}$ takes in samples from the known prior, $p(z)$, and outputs samples from $q_F(\omega)$.
Network definition in the Model class in model.py:
## (F) Mapping from known prior p(z)->W
self.mapping_fl = MAPPINGS["MappingFromLatent"](
num_layers=2 * layer_count,
latent_size=latent_size,
dlatent_size=latent_size,
mapping_fmaps=latent_size,
mapping_layers=mapping_layers)
In the generate method of the same class, this is how self.mapping_fl is used:
if z is None:
z = torch.randn(count, self.latent_size)
## (F) maps p(z) to latent distribution, W
styles = self.mapping_fl(z)[:, 0]
This is the good old generator from our GAN, but with two differences:
The input to our old Generator was sampled directly from the latent space, whereas the input to our new generator comes from the intermediate latent space, which, as you will soon see, is learned from the space of the input training data.
The output of our old Generator was fed to the Discriminator, which is nothing more than a binary classifier. In the ALAE architecture, the output of $\textbf{G}$ is fed to an Encoder ($\textit{E}$), as shown in the figure above.
The authors assume that $\textbf{G}$ may optionally depend on an independent noise input, $\eta$, sampled from a known fixed distribution $p_\eta(\eta)$.
Thus the inputs to the Generator ($\textbf{G}$) are samples from $q_F(\omega)$ and, optionally, from $p_\eta(\eta)$. The output distribution is given by

$$q(x) = \int_{\omega} \int_{\eta} q_G(x \vert \omega, \eta)\, q_F(\omega)\, p_\eta(\eta)\, d\eta\, d\omega$$

where $q_G(x \vert \omega, \eta)$ is the conditional probability of the generated image $x$ given $\omega$ and $\eta$.
Network definition in the Model class in model.py:
## Generator (G) Takes in latent space and noise
self.decoder = GENERATORS[generator](
startf=startf,
layer_count=layer_count,
maxf=maxf,
latent_size=latent_size,
channels=channels)
In the generate method, this is how self.decoder is used:
rec = self.decoder.forward(styles, lod, blend_factor, noise)
They must have named it decoder because the Generator of GAN is similar to the Decoder of an autoencoder.
As the name suggests, the Encoder ($\textit{E}$) encodes the data space into the latent space. The latent space between $\textbf{G}$ = $\textit{G} \circ \textit{F}$ and $\textbf{D}$ = $\textit{D} \circ \textit{E}$ is the same. That is, the Encoder should encode the data space into the same intermediate latent space, $\mathcal{W}$.
During training, the input to the Encoder is either real images from the true data distribution $p_D(x)$ or generated images representing synthetic distribution $q(x)$. This is shown in figure 3.
The output of the Encoder when the input is drawn from the synthetic distribution $q(x)$ is

$$q_E(\omega) = \int_x q_E(\omega \vert x)\, q(x)\, dx$$

where $q_E(\omega \vert x)$ is the conditional probability distribution of the latent $\omega$ given the data $x$.
The output of the Encoder when the input is drawn from the true data distribution $p_D(x)$ is

$$q_{E,D}(\omega) = \int_x q_E(\omega \vert x)\, p_D(x)\, dx$$

Since ALAE is trained with an adversarial strategy, $q(x)$ will eventually move towards $p_D(x)$. This also implies that $q_E(\omega)$ moves towards $q_{E,D}(\omega)$.
The assumption on the latent space implies that the output distribution of the Encoder ($\textit{E}$), $q_E(\omega)$, should match the input distribution to the Generator ($\textbf{G}$), $q_F(\omega)$.
This is achieved by minimizing the squared difference between samples from the two distributions. Quite simple, yet quite wondrous.
In a vanilla autoencoder, a reconstruction loss such as the $\ell_2$ norm is computed in the data space. This, however, does not reflect human visual perception. It has been observed that computing the $\ell_2$ norm in image space is one of the reasons why autoencoders haven't been able to generate images as sharp as GANs. This is where enforcing reciprocity in the latent space comes to the rescue.
The model definition in the Model class in model.py:
## Encoder (E): Encodes image to latent space W
self.encoder = ENCODERS[encoder](
startf=startf,
layer_count=layer_count,
maxf=maxf,
latent_size=latent_size,
channels=channels)
In the encode method of the same class, this is how self.encoder is used:
## Encode generated images into ~W
Z = self.encoder(x, lod, blend_factor)
In the forward method of the Model class, this is implemented as follows:
## Known prior p(z)
z = torch.randn(x.shape[0], self.latent_size)
## generate method returns input(s) and output(rec) to generator.
s, rec = self.generate(lod, blend_factor, z=z, mixing=False, noise=True, return_styles=True)
## encode method returns encoder output(Z)
Z, d_result_real = self.encode(rec, lod, blend_factor)
## Mean squared error in the latent space-l2 norm.
Lae = torch.mean(((Z - s.detach())**2))
This network is fed by the encoder and outputs a tensor of shape (batch_size, 1), used as logits for binary classification; it plays the role of the discriminator head ($\textit{D}$).
During training it is used twice: once when the encoder's input is an image produced by the generator, and once when the encoder's input is real data.
Network definition in the Model class in model.py:
self.mapping_tl = MAPPINGS["MappingToLatent"](
latent_size=latent_size,
dlatent_size=latent_size,
mapping_fmaps=latent_size,
mapping_layers=3)
In the forward method, this is how self.mapping_tl is used:
## generate method returns the generated image.
Xp = self.generate(lod, blend_factor, count=x.shape[0], noise=True)
## encode method takes in real data x and return real labels.
_, d_result_real = self.encode(x, lod, blend_factor)
## encode method takes in generated data and return fake label.
_, d_result_fake = self.encode(Xp.detach(), lod, blend_factor)
To summarize, the authors of ALAE have designed an autoencoder (AE) architecture where the latent distribution is learned from data rather than imposed, the output data distribution is learned with an adversarial strategy, and autoencoder reciprocity is enforced in the latent space rather than the data space.
Now that we have an understanding of the building blocks of ALAE, let's quickly go through the StyleALAE architecture, which can generate 1024x1024 face images comparable to StyleGAN, the state of the art for face generation.
There are two components of StyleALAE: a StyleGAN-based generator and an encoder designed to be symmetric to it.
The style information is extracted from the $i^{th}$ layer of the encoder by introducing Instance Normalization (IN) in that layer. The IN layer normalizes its input and outputs channel-wise means $(\mu)$ and standard deviations $(\sigma)$, which represent the style content of that layer. The style content of each encoder layer is fed to the Adaptive Instance Normalization (AdaIN) layer of the symmetric generator layer, which is linearly related to the latent space $\omega$. Thus, the style content of the encoder is mapped to the latent space via a multilinear map.
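The IN statistics extraction and the AdaIN operation can be sketched as follows. This is a minimal illustration with our own function names; the actual repo layers also include learned affine maps and other details.

```python
import torch

def instance_stats(x, eps=1e-5):
    """Channel-wise mean and std per sample for x of shape (N, C, H, W).
    These statistics are what IN extracts as the 'style' of a layer."""
    mu = x.mean(dim=(2, 3), keepdim=True)
    sigma = x.std(dim=(2, 3), keepdim=True) + eps
    return mu, sigma

def adain(content, mu_s, sigma_s):
    """Adaptive Instance Normalization: normalize the content feature map,
    then re-apply a target style (mu_s, sigma_s)."""
    mu_c, sigma_c = instance_stats(content)
    return sigma_s * (content - mu_c) / sigma_c + mu_s
```

After adain, the per-channel statistics of the output match the injected style, which is how style information flows from the latent space into each generator layer.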
The authors use progressive resizing during training. That is, training starts with low-resolution images (4 x 4 pixels), and the resolution is gradually increased by blending new blocks into the Encoder and Generator.
-> Modifications to ALAE to get StyleALAE (Source) <-
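The progressive blending described above can be sketched as follows. This mirrors the blend_factor argument seen in the repo snippets, but the exact wiring in the official code may differ; treat it as an assumption-laden illustration.

```python
import torch
import torch.nn.functional as F

def blend(low_res, high_res, blend_factor):
    """Fade in a newly added high-resolution block: at blend_factor=0 only
    the upsampled low-res path is used, at blend_factor=1 only the new block."""
    up = F.interpolate(low_res, scale_factor=2, mode='nearest')
    return (1.0 - blend_factor) * up + blend_factor * high_res
```

Ramping blend_factor from 0 to 1 over training lets the new resolution be introduced smoothly instead of shocking the network with a sudden architecture change.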
ALAE is a modification of a vanilla GAN architecture with some novel tweaks. Training this architecture thus involves solving a $\min\max$ game with respect to the Generator and Discriminator pair. A vanilla GAN is trained with a two-step training procedure: by alternating the training of the generator and the discriminator networks, the generator becomes more adept at fooling the discriminator, while the discriminator becomes better at catching the images artificially created by the generator. This forces the generator to come up with new ways to fool the discriminator, and the cycle continues.
In the case of ALAE, the assumption on the latent space, which requires the output distribution of the Encoder ($\textit{E}$), $q_E(\omega)$, to match the input distribution to the Generator ($\textbf{G}$), $q_F(\omega)$, introduces a third training step. These three updates are shown in the figure below.
Step $I$ updates the discriminator, network blocks E and D.
encoder_optimizer.zero_grad()
loss_d = model(x, lod2batch.lod, blend_factor, d_train=True, ae=False)
tracker.update(dict(loss_d=loss_d))
loss_d.backward()
encoder_optimizer.step()
Step $II$ updates the generator, network blocks F and G.
decoder_optimizer.zero_grad()
loss_g = model(x, lod2batch.lod, blend_factor, d_train=False, ae=False)
tracker.update(dict(loss_g=loss_g))
loss_g.backward()
decoder_optimizer.step()
Step $III$ updates the latent space of the autoencoder, network blocks G and E.
encoder_optimizer.zero_grad()
decoder_optimizer.zero_grad()
lae = model(x, lod2batch.lod, blend_factor, d_train=True, ae=True)
tracker.update(dict(lae=lae))
(lae).backward()
encoder_optimizer.step()
decoder_optimizer.step()
-> Training algorithm of ALAE (Source) <-
We tried to train the ALAE architecture on the MNIST dataset but ran into a bit of trouble. The authors use their own library, Dareblopy, to prepare a training data generator. This library uses the TFRecord format. Unfortunately, we were not able to run the training script train_alae.py on Colab.
After a lot of intensive debugging and fixes, we decided to implement the MLP based ALAE ourselves and train it on the MNIST dataset. The results you see are from our unofficial implementation and not from the official repo. You can check out our implementation in the link below:
Note: the network weights are randomly initialized, so if your losses are not behaving as they should, try restarting and running all cells again. It worked out for us.
We trained the network for 75 epochs using the hyperparameters recommended by the authors. As described earlier, training ALAE involves three stages of optimization, so we have three separate loss metrics to monitor the performance of the network. You can see the logged metrics below. Note that a smoothing factor of 0.5 is applied to show the trend properly.
disc_loss: This loss is associated with the discriminator's ability to tell fake (generated) images from real ones. The discriminator loss decreases with a lot of fluctuation, which is common in adversarial training.
gen_loss: This loss is associated with the generator's ability to fool the discriminator. Our gen_loss seems to go up, but the generated images improved over time. It's not perfect, but it certainly draws out the potential of our unofficial implementation.
latent_loss: This loss measures the similarity of the latent space between the F and G networks and between the E and D networks. Unlike a GAN, where the latent space is simply sampled from a known distribution, this distribution is learned in ALAE. Over time the latent_loss decreases. We will soon see the effect of this.
The generated images after epochs 1, 25, 50, and 75 are shown below.
The authors unleash their StyleALAE network on three datasets, FFHQ, LSUN, and Celeb-A. They train the network to generate, reconstruct and mix samples on these datasets. Let's take a look at their results.
The authors train the StyleALAE described above on the FFHQ dataset, which consists of 70k images of people's faces that are aligned and cropped at a resolution of $1024 \times 1024$. The dataset is split into 60k images for training and 10k images for testing. As expected, training a network on a large collection of high-resolution images needs significant computing resources. They use 8 x Titan RTX GPUs to train their network for 147 epochs, of which 18 use training samples at the full $1024 \times 1024$ resolution. They use the *"progressive resizing"* method, where the input image resolution is first $4 \times 4$ and grows progressively to the highest resolution over the course of training.
One of the interesting aspects here is that StyleALAE is able to produce robust results at the highest resolution, like the StyleGAN paper, but spends only 1M images of training time at the full $1024 \times 1024$ resolution, while StyleGAN spends 15M images of training time at the same resolution.
The FID or Fréchet Inception Distance is a measure of image visual quality and is generally accepted as an equivalent of human estimation of quality. The StyleALAE achieves very good FID scores but is still not as good as the StyleGAN per this metric. The table below shows the results on both the FFHQ and LSUN datasets.
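For reference, FID is the Fréchet distance between two Gaussians fitted to feature activations (in practice, Inception-v3 features of real and generated images). A minimal sketch, with our own function name and using raw feature arrays instead of an Inception network:

```python
import numpy as np
from scipy import linalg

def frechet_distance(feat1, feat2):
    """Frechet distance between Gaussians fitted to two sets of
    feature vectors, each of shape (n_samples, n_features)."""
    mu1, mu2 = feat1.mean(axis=0), feat2.mean(axis=0)
    s1 = np.cov(feat1, rowvar=False)
    s2 = np.cov(feat2, rowvar=False)
    # Matrix square root of the covariance product; discard tiny
    # imaginary parts introduced by numerical error.
    covmean = linalg.sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu1 - mu2
    return diff @ diff + np.trace(s1 + s2 - 2.0 * covmean)
```

Identical feature distributions give a distance near zero; the further apart the generated and real feature statistics drift, the larger the score.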
What's interesting however is that while measuring the level of disentanglement of representations via the Perceptual Path Length (PPL) metric, the StyleALAE outperforms the StyleGAN as shown below:
The next logical question to ask would be "How does the StyleALAE stack up against other comparable methods?". The answers to that question are also available, courtesy of this comparison on the Celeb-A dataset:
We've re-run some of the generation, reconstruction, and mixing experiments the authors have shared in their repo, and the results are below. Note that the pre-trained models for these experiments are available in the official repo. If you'd like to follow along, here's the colab notebook for your quick reference.
Here are some results of using the StyleALAE to generate some random images based on the three datasets it was trained on.
This is honestly one of the most impressive showings from an autoencoder architecture since sliced bread. It was a lot of fun for us to replicate some of the experiments and understand this work better. We hope our report has helped make this work more accessible, or at the very least has made you excited about interpolating between images. For any feedback feel free to drop us a note on Twitter: @DSaience, @ayushthakur0. Fin.