Train Generative Adversarial Networks (GANs) With Limited Data
In this report, we'll learn about the adaptive discriminator augmentation technique that enables us to train GANs with a limited training dataset
Training GANs is hard for a lot of reasons: training instability, the difficulty of determining when to stop training, the need for a lot of training data, and mode collapse, to name just a few. Despite these challenges, Generative Adversarial Networks (GANs) have paved the way for many exciting applications of deep learning.
It's worth mentioning that, since their inception in 2014 by Goodfellow et al., a lot of progress has been made to mitigate the challenges of training GANs while making them powerful enough to generate images that are hard to distinguish from real images (this person does not exist is a fantastic example of this). GANs are one of the newer classes of generative models and, if you want a thorough introduction to deep generative models, this W&B report is a great place to start.
With that said, let's dig into a new paper. In this report, we will summarize "Training Generative Adversarial Networks with Limited Data" by Karras et al.
Check out the video below for a quick summary before we get started:
Official Repo | Paper
GANs And Limited Data
GANs are trained in a two-player game configuration where the generator and the discriminator compete against each other. The generator ($G$) network is tasked with producing "real"-looking images, while the discriminator ($D$) network is tasked with predicting whether a given image is "real" or "fake". We call the images generated by $G$ "fake".
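If you'd like a refresher on how this two-player game looks in code, here is a minimal PyTorch sketch of a single training step with the non-saturating GAN loss. It assumes `G`, `D`, their optimizers, and the data loading already exist; it is not the StyleGAN2 training code.

```python
import torch
import torch.nn.functional as F

def gan_training_step(G, D, real_images, opt_G, opt_D, z_dim=128):
    """One round of the two-player game: D learns to separate real from fake,
    then G learns to fool D. Uses the non-saturating GAN loss."""
    batch_size = real_images.size(0)
    device = real_images.device

    # --- Discriminator update: reals should score high, fakes low ---
    z = torch.randn(batch_size, z_dim, device=device)
    fake_images = G(z).detach()                    # no gradients into G here
    d_loss = (F.softplus(-D(real_images)).mean()   # -log sigmoid(D(real))
              + F.softplus(D(fake_images)).mean()) # -log(1 - sigmoid(D(fake)))
    opt_D.zero_grad()
    d_loss.backward()
    opt_D.step()

    # --- Generator update: fakes should now score high ---
    z = torch.randn(batch_size, z_dim, device=device)
    g_loss = F.softplus(-D(G(z))).mean()           # non-saturating loss
    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()

    return d_loss.item(), g_loss.item()
```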
To successfully train GANs, ideally, a lot of (real) training data is required. For use cases such as generating faces, cats, dogs, scenery, etc., data is readily available. (Note that a more practical use case of GANs is to generate new samples that can be used for another task, such as image classification. For example, if you have a small training dataset for leaf disease classification, you can benefit from generating new leaf images using a GAN trained on that dataset.)
The problem, however, is that training GANs with too little data typically leads to discriminator overfitting, causing training to diverge.
A common metric to evaluate GANs is the Fréchet Inception Distance (FID). Check out "How to Evaluate GANs using Frechet Inception Distance (FID)" to learn more.
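As a quick reference, FID compares the mean and covariance of Inception-v3 features extracted from real and generated images; lower is better. Here is a minimal NumPy/SciPy sketch, assuming the feature extraction with a pretrained Inception network happens elsewhere:

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(feats_real, feats_fake):
    """FID between two sets of Inception features (arrays of shape N x 2048)."""
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_f = np.cov(feats_fake, rowvar=False)

    # FID = ||mu_r - mu_f||^2 + Tr(sigma_r + sigma_f - 2 (sigma_r sigma_f)^(1/2))
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_f, disp=False)
    covmean = covmean.real  # discard tiny imaginary parts from numerical error
    return float(np.sum((mu_r - mu_f) ** 2)
                 + np.trace(sigma_r + sigma_f - 2.0 * covmean))
```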
Prevent Overfitting: Augmentation
In good ol' image classification with a good ol' deep neural network, image augmentation is a widely used regularization technique to prevent overfitting. Overfitting is when the classifier (in this case) memorizes the training data and thus performs poorly on unseen (test) data.
To a classifier, the same image (to us humans) rotated by some angle (or modified by any other transformation, for that matter) is an entirely new image. Used properly, this principle lets us fill in the missing images in the training data, which can help represent the entire distribution of the dataset. From the paper:
Training an image classifier under rotation, noise, etc., leads to increasing invariance to these semantics-preserving distortions— a highly desirable quality in a classifier.
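For context, here is what such an augmentation pipeline typically looks like for an image classifier, using torchvision (a generic example, not tied to the paper):

```python
from torchvision import transforms

# A typical training-time augmentation pipeline for a classifier: every epoch
# the model sees a slightly different version of each training image, which
# acts as a regularizer and encourages invariance to these distortions.
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])
```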
Can we leverage image augmentation to train GANs with limited data? Sadly, a GAN trained under similar dataset augmentations (as used in classification) learns to generate the augmented images. In other words, the GAN will start generating images that have been rotated, had noise added, been recolored, etc. This is called augmentation "leakage." We don't want leakage.
To investigate this, I trained a simple DCGAN to generate Fashion-MNIST images. The GAN was trained with random horizontal/vertical flipping and random rotation. You can clearly see that some of the generated images are either flipped upside down or rotated by some angle.
Stochastic Discriminator Augmentation

Figure 1: Stochastic Discriminator Augmentation
The authors propose a straightforward augmentation strategy, shown in Figure 1, where the discriminator is evaluated only on augmented images. The real images, as well as the generated images, are augmented with the same augmentation pipeline (rotation, flipping, etc.).
Discriminator augmentation corresponds to putting distorting, perhaps even destructive goggles on the discriminator, and asking the generator to produce samples that cannot be distinguished from the training set when viewed through the goggles.
Here, stochastic refers to the use of a probability $p$ that determines the strength of the augmentations: each augmentation in the pipeline is applied with probability $p$.
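To make the mechanics concrete, here is a minimal PyTorch sketch of the idea, with toy flip/rotate transforms standing in for the paper's much richer augmentation pipeline. It assumes square NCHW image batches and existing `G` and `D` networks:

```python
import torch

def augment(images, p):
    """Apply each toy transform per-image with probability p.
    Assumes a square (N, C, H, W) batch."""
    b = images.size(0)
    # Random horizontal flip with probability p
    flip = torch.rand(b, device=images.device) < p
    images = torch.where(flip.view(-1, 1, 1, 1), images.flip(-1), images)
    # Random 90-degree rotation with probability p
    rot = torch.rand(b, device=images.device) < p
    rotated = torch.rot90(images, k=1, dims=(2, 3))
    images = torch.where(rot.view(-1, 1, 1, 1), rotated, images)
    return images

def discriminator_inputs(D, G, real_images, z, p):
    """Stochastic discriminator augmentation: D only ever sees augmented
    images, and reals and fakes go through the same augmentation pipeline.
    The ops are differentiable, so gradients can flow through the
    augmentation into G during the generator update."""
    real_scores = D(augment(real_images, p))
    fake_scores = D(augment(G(z), p))
    return real_scores, fake_scores
```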
At first glance, this should not work, since the discriminator never sees the real images unaugmented. How can it guide the generator to produce images it has never actually seen? And moreover, how can the augmentations be applied so that they do not leak?
Augmentations That Don't Leak
From the paper:
Bora et al. [4] consider a similar problem in training GANs under corrupted measurements, and show that the training implicitly undoes the corruptions and finds the correct distribution, as long as the corruption process is represented by an invertible transformation of probability distributions over the data space. We call such augmentation operators non-leaking.
Note that invertible here refers to the transformation of probability distributions, not to undoing the corruption of any individual image. Again from the paper:
For instance, an augmentation as extreme as setting the input image to zero 90% of the time is invertible in the probability distribution sense: it would be easy, even for a human, to reason about the original distribution by ignoring black images until only 10% of the images remain. On the other hand, random rotations chosen uniformly from {0°, 90°, 180°, 270°} are not invertible: it is impossible to discern differences among the orientations after the augmentation.
Instead of always applying a uniformly chosen rotation, if it is applied only with a probability $p < 1$, the relative occurrence of 0° increases, and now the augmented distributions can match only if the generated images have the correct orientation.
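Here is a small sketch (not the paper's implementation) of such a skippable rotation, and why $p < 1$ keeps the original orientation overrepresented:

```python
import torch

def random_rotation_non_leaking(image, p):
    """Rotate a square (C, H, W) image by a multiple of 90 degrees,
    but only with probability p.

    With p < 1, the 0-degree orientation occurs with probability
    (1 - p) + p / 4, so it stays overrepresented and the augmentation is
    invertible in the distribution sense (non-leaking). With p = 1, all four
    orientations are equally likely and the augmentation leaks."""
    if torch.rand(()) < p:
        k = int(torch.randint(0, 4, ()))   # number of quarter turns
        return torch.rot90(image, k, dims=(1, 2))
    return image
```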
Thus, augmentations are non-leaking if they can be skipped with non-zero probability, i.e., applied with probability $p < 1$. But how large can $p$ get before the augmentation starts to leak? In other words, what is the upper bound on $p$?
To answer this, the authors ran a series of experiments, training GANs with different values of $p$ under stochastic discriminator augmentation.

Figure 2: The FID score against the choice of $p$.
The authors conclude that some augmentation policies, like isotropic image scaling, are inherently non-leaking. The major conclusion, however, is that as long as $p$ remains below 0.8, leaks are unlikely to happen in practice.
Adaptive Discriminator Augmentation
The experiments conducted by the authors gave a reasonable threshold within which to keep $p$. However, this value depends on the model architecture they used (StyleGAN2), the dataset, the augmentation policies, and so on. The same value might not work for our use case and would require extensive hyperparameter tuning to find the right augmentation strength.
To this end, the authors propose adaptive discriminator augmentation, which turns the knob ($p$) automatically based on heuristics derived from some clever observations.

Figure 3: FID scores against the number of training images seen by the discriminator for the different training datasets.
From figure 3, the authors observed that:
- When overfitting kicks in, the validation set starts behaving increasingly like the generated images. Here, the FID score is computed against a held-out validation set. Note that this observation might not be directly useful, since it depends on a validation set that may not be available when the dataset is small.
- With the non-saturating loss used by StyleGAN2, the discriminator outputs for real and generated images diverge symmetrically around zero as the situation gets worse.
Based on the last observation, the authors propose two heuristics for quantifying overfitting. Let $D_{train}$, $D_{validation}$, and $D_{generated}$ denote the discriminator's outputs for the training set, validation set, and generated images, respectively, and let $\mathbb{E}[\cdot]$ denote their mean over $N$ consecutive minibatches. The authors used $N = 4$ with a batch size of 64.
- $r_v = \frac{\mathbb{E}[D_{train}] - \mathbb{E}[D_{validation}]}{\mathbb{E}[D_{train}] - \mathbb{E}[D_{generated}]}$: This heuristic expresses the output for the validation set relative to the training set and the generated images. However, it is not used in practice because it depends on a validation set, which is usually not available for a small dataset. The authors use it only for comparison against the other heuristic.
- $r_t = \mathbb{E}[\text{sign}(D_{train})]$: This heuristic estimates the portion of the training set that gets positive discriminator outputs. This is the one used in practice.
For both heuristics, $r = 0$ means no overfitting and $r = 1$ means complete overfitting. Based on the $r_t$ heuristic, the authors propose an update rule for $p$. From the paper:
We control the augmentation strength $p$ as follows: We initialize $p$ to zero and adjust its value once every four minibatches based on the chosen overfitting heuristic. If the heuristic indicates too much/little overfitting, we counter by incrementing/decrementing $p$ by a fixed amount. We set the adjustment size so that $p$ can rise from 0 to 1 sufficiently quickly, e.g., in 500k images. After every step we clamp $p$ from below to 0. We call this variant adaptive discriminator augmentation (ADA).
The authors used 0.6 as the target value for the $r_t$ heuristic.
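Putting the pieces together, here is a minimal sketch of an ADA-style controller. The class and variable names are my own, and the constants follow the description above rather than the official implementation:

```python
import torch

class AdaptiveAugmentController:
    """Adjusts the augmentation probability p based on the r_t heuristic."""

    def __init__(self, target_rt=0.6, update_every=4, batch_size=64,
                 ramp_images=500_000):
        self.p = 0.0                       # augmentation strength, starts at 0
        self.target_rt = target_rt         # target value for r_t
        self.update_every = update_every   # adjust p every N minibatches
        # step size chosen so p can go from 0 to 1 in ~ramp_images images
        self.step = (update_every * batch_size) / ramp_images
        self._signs = []

    def accumulate(self, d_real_logits):
        # r_t = E[sign(D_train)]: grows toward 1 as more real images get
        # positive (confident) discriminator outputs, i.e. as D overfits.
        self._signs.append(torch.sign(d_real_logits).mean().item())
        if len(self._signs) >= self.update_every:
            r_t = sum(self._signs) / len(self._signs)
            self._signs.clear()
            # Too much overfitting -> increase p; too little -> decrease p.
            self.p += self.step if r_t > self.target_rt else -self.step
            self.p = max(self.p, 0.0)      # clamp p from below to 0
        return self.p
```

During training, you would call `accumulate()` with the discriminator's outputs on (augmented) real images after each discriminator step and feed the returned `p` into the augmentation pipeline.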
Conclusion
Many deep learning models can be data greedy but we often simply do not have the means to collect more of that data. It's easy enough to get new images for your facial recognition model, more handwritten text for your OCR project, or more social media messages for your sentiment analysis model. What you can't do is simply create more car accident data to train your self-driving car models.
Creating data is an elegant solution here, but augmentation comes with some level of peril: after all, you don't want that autonomous vehicle model basing its decisions on upside-down highways or unrealistic road conditions. The authors have presented some really novel ways around these problems, training GANs with less data without substantively sacrificing performance by keeping the augmentations "non-leaking". It's a fascinating idea that could have real benefits in domains where collecting or creating data is difficult, costly, or unethical.