With the rise of Generative Adversarial Networks (GANs) in recent years, there has been a lot of research on GAN failure scenarios and on how to train such models properly. While extremely powerful, GANs can be unstable and suffer from many issues, ranging from convergence problems to a lack of diversity in the generated data. Most evaluation metrics are either unsuitable for certain use cases or rely on qualitative measures such as how "good" the results look to a human evaluator.
In 2018, On GANs and GMMs by Richardson and Weiss introduced a new metric to evaluate a GAN failure case known as mode collapse, which occurs when the model fails to generate diverse enough outputs. Such a metric allows for quantitative measurement of a GAN's performance.
This report will extend the paper by integrating this metric into DCGANs to evaluate their failure state on the CelebA dataset and visualize how the score changes during training.
A GAN, aka Generative Adversarial Network, is a type of generative model used to generate realistic data by letting two sub-models compete against each other. A GAN consists of the generator, whose mission is to produce convincing fake data, and the discriminator, which seeks to distinguish real data from the generator's fake data.
During training, each of these models gets more proficient at its task, and the ideal end state is an equilibrium where neither model overpowers the other. The generator learns features of the training data to produce realistic images, while the discriminator trains on both the labeled fake images and real images to tell them apart. For a more detailed but easy-to-understand explanation of GANs, I highly recommend Machine Learning Mastery's article.
You can think of the generator as a criminal trying to produce ever better fake banknotes and the discriminator as an investigator developing better ways to detect fake notes.
As stated earlier, while GANs are powerful, they are also highly unstable. There are several failure scenarios that a GAN may find itself stuck in. The two most common types of failures are convergence failure (failing to produce good quality outputs), and mode collapse (failing to produce a variety of different looking outputs) - we will focus on the latter in this report.
A well-trained GAN can generate a wide variety of outputs. When producing images of human faces, you would want the generator to create batches of different-looking faces with different features. Mode collapse happens when the generator can only produce a single type of output or a small set of outputs. This may happen due to problems in training, such as when the generator finds a type of data that easily fools the discriminator and thus keeps generating that one type. Because there is no incentive for the generator to switch things up, the entire system will over-optimize on that one output.
There is no single right way of measuring mode collapse. Qualitative measures such as manually looking at the images only work when the collapse is self-evident and may fail for more complex cases or larger amounts of data. Other quantitative measures, such as Inception Score (IS) or Frechet Inception Distance (FID), rely on pre-trained models with a specific set of object classes. Their scales can also be hard to interpret: FID, for example, lacks an upper bound (hypothetically speaking, the highest possible score is infinity).
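For reference, FID is commonly defined as the Fréchet distance between two Gaussians fitted to Inception features of the real and generated images. The symbols below (feature means \mu_r, \mu_g and covariances \Sigma_r, \Sigma_g for real and generated data) are notation chosen here for illustration, not taken from this report:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

The first term grows with the distance between the two feature means, which is why the score has no finite maximum.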
The Number of Statistically-Different Bins (NDB) score is an exciting method proposed in 2018 by Richardson and Weiss. It is based on the following concept from their paper: given two sets of samples from the same distribution, the number of samples that fall into a given bin should be the same up to sampling noise.
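To make "the same up to sampling noise" concrete, here is a minimal sketch of a pooled two-proportion z-test for a single bin. The function names and the example counts are assumptions made for illustration, and this is not the paper's reference code:

```python
import math

def two_proportion_z(count_p, n_p, count_q, n_q):
    """Z-statistic comparing the share of samples in one bin for two sets."""
    p = count_p / n_p
    q = count_q / n_q
    # Pooled proportion under the hypothesis that both sets share one distribution.
    pooled = (count_p + count_q) / (n_p + n_q)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_p + 1 / n_q))
    if se == 0.0:
        return 0.0  # bin is empty (or full) in both sets, so no evidence of a difference
    return (p - q) / se

def is_statistically_different(count_p, n_p, count_q, n_q, alpha=0.05):
    """True when the two-sided p-value falls below the significance level."""
    z = two_proportion_z(count_p, n_p, count_q, n_q)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of a standard normal
    return p_value < alpha

# Hypothetical bin: 1,000 of 20,000 real samples vs. 250 or 500 of 5,000 generated ones.
same = is_statistically_different(1000, 20000, 250, 5000)
shifted = is_statistically_different(1000, 20000, 500, 5000)
```

In this toy case the first bin holds 5% of both sets and is not flagged, while the second holds 5% of the real samples but 10% of the generated ones and is flagged.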
To see how it roughly works, let us break it down step by step from a procedural perspective. Given training samples (real data) and test samples (generated/fake data):
If the train and test distributions align perfectly, then the fraction of bins counted as statistically different in step 4 should be approximately equal to the significance level chosen for the two-sample test in step 3. The paper's experiments use a default significance level of 0.05.
The paper also uses Voronoi cells to allocate bins, so all bins are guaranteed to contain samples (no empty bins).
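The overall procedure can be sketched roughly as below. This is a simplified illustration, not the paper's implementation: it assumes the bin centroids are already available (for example from K-means on the training samples), uses tiny 2-D points instead of images, and returns the count of different bins divided by the number of bins. All names are hypothetical.

```python
import math

def nearest_bin(sample, centroids):
    """Index of the closest centroid, i.e. the Voronoi cell containing the sample."""
    return min(range(len(centroids)), key=lambda k: math.dist(sample, centroids[k]))

def bin_counts(samples, centroids):
    counts = [0] * len(centroids)
    for sample in samples:
        counts[nearest_bin(sample, centroids)] += 1
    return counts

def ndb_fraction(train, test, centroids, alpha=0.05):
    """Fraction of bins whose train/test proportions differ significantly."""
    n_p, n_q = len(train), len(test)
    different = 0
    for c_p, c_q in zip(bin_counts(train, centroids), bin_counts(test, centroids)):
        pooled = (c_p + c_q) / (n_p + n_q)
        se = math.sqrt(pooled * (1 - pooled) * (1 / n_p + 1 / n_q))
        if se == 0.0:
            continue  # identical degenerate bins carry no evidence of a difference
        z = (c_p / n_p - c_q / n_q) / se
        if math.erfc(abs(z) / math.sqrt(2)) < alpha:  # two-sided p-value
            different += 1
    return different / len(centroids)

# Toy data: four well-separated clusters (assumed centroids).
centroids = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
train = [c for c in centroids for _ in range(100)]
diverse_test = [c for c in centroids for _ in range(25)]
collapsed_test = [centroids[0]] * 100  # every generated sample lands in one bin

diverse_score = ndb_fraction(train, diverse_test, centroids)
collapsed_score = ndb_fraction(train, collapsed_test, centroids)
```

The diverse test set matches the training proportions in every bin, so no bin is flagged; the collapsed test set overfills one bin and leaves the others empty, so every bin is flagged.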
In our experiment, we integrated the NDB score calculation into DCGAN training on the CelebA dataset. Every 500 batches, or approximately half an epoch, we generate a set of images and perform an NDB score calculation.
Due to the scale of the dataset, NDB scoring uses 20,000 training images and 5,000 generated test images. This avoids memory limit issues when running in a Kaggle/Colab notebook. The DCGAN itself is still trained on the full dataset of 200,000 images for five epochs. The underlying assumption is that the 20,000 images represent the full distribution well enough, which is sufficient for demonstrating NDB scoring.
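The periodic scoring can be pictured as a small hook in the training loop. The sketch below is framework-agnostic and uses stand-in callables; `train_step` and `compute_score` are hypothetical placeholders for the real DCGAN update and NDB calculation, not functions from this report's notebook:

```python
def train_with_periodic_scoring(num_batches, train_step, compute_score, every=500):
    """Run training and record a score every `every` batches."""
    history = []
    for batch_index in range(1, num_batches + 1):
        train_step(batch_index)  # one generator/discriminator update (placeholder)
        if batch_index % every == 0:
            # In the real setup this would generate images and score them.
            history.append((batch_index, compute_score()))
    return history

# Stand-ins so the sketch runs on its own.
steps_seen = []
history = train_with_periodic_scoring(1500, steps_seen.append, lambda: 0.5)
```

Keeping the score history as `(batch_index, score)` pairs makes it straightforward to plot the score against training progress afterwards.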
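One simple way to pick the scoring subset is a seeded random sample of image indices. This is only a sketch under the assumption that the subset is drawn uniformly at random; the helper name and the seed are illustrative:

```python
import random

def sample_real_subset(dataset_size, subset_size, seed=0):
    """Pick distinct image indices uniformly at random, reproducibly."""
    rng = random.Random(seed)  # fixed seed so every run scores against the same subset
    return rng.sample(range(dataset_size), subset_size)

# 20,000 of the roughly 200,000 real images are used as the NDB reference set.
reference_indices = sample_real_subset(200_000, 20_000)
```

Fixing the seed keeps the reference bins identical across the runs being compared.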
We visualize the results of three different runs:
Good run: This will be a stable training example.
Bad run: In this run, we will reduce the latent dimension size from 100 to 5, which hinders the GAN's ability to pick up image features. This, in theory, would cause partial mode collapse.
Worst run: Reduce latent dimension size to 1, which causes a complete mode collapse.
As a sanity check, samples from the largest bin are displayed for each run below.
While our GAN cannot produce high-quality images given our memory and data limits, we can see some basic patterns.
For example, the worst run is suffering from an apparent mode collapse since it seems to be one image repeated over and over again. Could this pattern be repeated in other bins of the worst run, thereby affecting the entire distribution?
What about for the bad run - exactly how much partial mode collapse is it suffering from?
Once again, it can be hard to tell qualitatively whether mode collapse is there or not, even if you displayed all the bins. Except in cases of evident mode collapse, there is no way we can recognize all the patterns. The entire point of NDB scoring is to provide a quantitative way to measure this.
Therefore, it is more useful at this point to stop looking at the images and check the NDB score graphs instead. Here are the charts for the NDB score and an accompanying Jensen-Shannon divergence score for each NDB score.
This gives us a better idea of performance. Over time, the good run's NDB score decreases and settles at a relatively stable value. The worst run, on the other hand, suffers from visible mode collapse, with its NDB score hovering around 1.0. There is no strict rule on what a "good" or "bad" NDB score is; however, the paper lists results from experiments on other generative models and datasets that can be used as a rough guideline. In general, a higher NDB score means the GAN is doing worse and could be suffering from mode collapse.
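The Jensen-Shannon divergence complements the NDB score by comparing two distributions as a whole. Below is a minimal sketch, assuming it is computed between the real and generated bin proportions, each given as a list summing to 1, and using natural logarithms; the function names are illustrative:

```python
import math

def kl_divergence(p, q):
    """KL(p || q), skipping bins where p is zero (they contribute nothing)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Symmetric divergence between two bin-proportion vectors."""
    # The mixture m is positive wherever p or q is, so the KL terms stay finite.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

identical = js_divergence([0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25])
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
```

With natural logarithms the value ranges from 0 for identical distributions to ln 2 for distributions that share no bins.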
One more thing to note is that the NDB score does not directly give a measure of a generated image's sharpness or quality. While there could be a correlation between a more stable GAN and being able to produce higher quality images, the NDB score does not directly measure this.
GANs are extremely powerful and have widespread applications as they develop and mature. However, the problem of mode collapse continues to plague GANs, especially as they are adapted to different applications and scenarios. The NDB score is one way to measure the effects of mode collapse quantitatively. We show that logging the score during training can help detect mode collapse: a score that does not decrease substantially over time is a warning sign. The key takeaway is that it is useful to have another quantitative metric for an issue as important as mode collapse, instead of relying purely on visual methods or IS/FID scores.
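As a rough illustration of what "no substantial decrease" could mean in practice, the sketch below compares the average of the first and last few logged scores. The window size and the minimum drop are arbitrary assumptions chosen for the example, not values from the paper or from the runs in this report:

```python
def score_has_decreased(scores, window=3, min_drop=0.1):
    """Compare the mean of the first and last `window` logged scores."""
    if len(scores) < 2 * window:
        raise ValueError("need at least 2 * window logged scores")
    early = sum(scores[:window]) / window
    late = sum(scores[-window:]) / window
    return early - late >= min_drop

# Made-up score histories, only to show the two outcomes.
improving = score_has_decreased([0.9, 0.85, 0.8, 0.5, 0.4, 0.35, 0.3])
stuck = score_has_decreased([1.0, 0.98, 1.0, 0.99, 1.0, 0.97])
```

A check like this is only a heuristic; looking at the full score curve remains the more reliable way to judge a run.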
I would recommend the paper Pros and Cons of GAN Evaluation Measures by Ali Borji. This paper gives an excellent rundown of GAN evaluation metrics along with their advantages and drawbacks.
For further reading about different types of GAN failure scenarios, I found two articles to be good reads:
If you would like to learn more about the field of generative models (GAN, VAE, and Autoencoders), check out Ayush Thakur's well-written report on the evolution of generative modeling: Towards Deep Generative Modeling with W&B
The training and scoring in this report were performed only on a subset of the CelebA dataset. One interesting thing to try for those with more compute power is to test NDB scoring on the full CelebA dataset in addition to training for more epochs. This will allow for better quality generated images and allow the full distribution of the dataset to come into play when calculating the NDB score.
I encourage you to reproduce the results and build upon my analysis by running this colab for yourself.
If you enjoyed reading this report or have feedback, feel free to connect with me on Twitter: @kevinkwshen.