Modern Data Augmentation Techniques for Computer Vision
Introduction
The advancement of deep learning is credited to faster compute and to huge datasets. But for many real-world problems, the data is hard to come by. The best way to regularize a model or make it more robust is to feed it more data, but how can one get more data?
The obvious answer is to go out and collect more data, but that can be expensive or even physically impossible. One could generate new samples using generative models like GANs, but that is often unnecessary. The simplest way to train a model on a wider variety of data is data augmentation, which significantly increases the diversity of data available for training without actually collecting new samples.
Simple image augmentation techniques like flipping, random cropping, and random rotation are commonly used to train large models. This works well for most toy datasets and problem statements, but in reality there can be huge data shift. Is our model robust to data shift and data corruption? As it stands, models don't generalize robustly under shifts in the data distribution. If models could identify when they are likely to be mistaken, or estimate uncertainty accurately, the impact of such fragility might be reduced. Unfortunately, models tend to be overconfident in their predictions.
In this report, we will dive into modern data augmentation techniques for computer vision. Here's a quick outline of what you should expect from this report:
- Theoretical know-how of some modern data augmentations along with their implementations in TensorFlow 2.x.
- Some interesting ablation studies.
- Comparative study between these techniques.
- Benchmarking of models trained with these augmentation techniques on the CIFAR-10-C dataset.
Try it on Google Colab
Experimental Setup and Baseline
I used the following experimental setup for training the models with the different augmentation techniques:
Dataset
- The CIFAR-10 dataset is used for training our models with the different augmentation strategies.
- The CIFAR-10-C dataset is used to measure a model's resilience to data shift. It is constructed by corrupting the original CIFAR-10 test set with 15 corruption types (noise, blur, weather, and digital), each appearing at 5 severity levels. Since these corruptions are meant to measure network behaviour under data shift, they are not introduced into the training procedure (augmentation). In CIFAR-10-C, the first 10,000 images in each `.npy` file are the test set images corrupted at severity 1, and the last 10,000 images are the test set images corrupted at severity 5 (see the loading sketch below).
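Here's a minimal sketch of loading a single CIFAR-10-C corruption; the directory path and the `gaussian_noise.npy` file name are just examples of the dataset's `.npy` files, and the slicing simply follows the severity layout described above.

```python
import numpy as np

# Each corruption file holds the 10,000 CIFAR-10 test images at severities 1..5,
# concatenated along the first axis (50,000 images in total).
corruption = np.load("CIFAR-10-C/gaussian_noise.npy")  # example corruption file
labels = np.load("CIFAR-10-C/labels.npy")              # labels repeat for every severity

severity = 5                                           # pick a severity level in 1..5
x_sev = corruption[(severity - 1) * 10000 : severity * 10000]
y_sev = labels[(severity - 1) * 10000 : severity * 10000]
print(x_sev.shape, y_sev.shape)                        # (10000, 32, 32, 3) (10000,)
```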
Miscellaneous pointers
- No data augmentation apart from the augmentation strategy under experimentation was used.
- The initial weights were saved, and all models were trained from `initial_wt.h5` so that every run starts from the same weight initialization.
- The models were trained with early stopping, monitoring `val_loss` with a patience of 10 epochs. The upper bound on the number of epochs was 100. A sketch of this training setup is shown after this list.
- The trained weights were saved to be used for robustness benchmarking, along with the weights from the various ablation studies. You can find all the weights here.
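Here's a minimal sketch of that training setup in TensorFlow 2.x. It assumes `model` is the compiled ResNet20 described in the next section and that `train_ds`/`val_ds` are the CIFAR-10 pipelines with the augmentation under test; the callback arguments mirror the pointers above, everything else (variable names, the saved file name) is illustrative.

```python
import tensorflow as tf

# Early stopping as described above: monitor val_loss with a patience of 10 epochs.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10)

# Start every run from the same saved initialization.
model.load_weights("initial_wt.h5")

history = model.fit(
    train_ds,                 # assumed tf.data pipeline with the augmentation under test
    validation_data=val_ds,
    epochs=100,               # upper bound; early stopping usually ends training sooner
    callbacks=[early_stop],
)

# Save the trained weights for the robustness benchmarking on CIFAR-10-C.
model.save_weights("resnet20_augmentation.h5")  # hypothetical file name
```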
Architecture
- All the experiments were conducted with the ResNet20 architecture, built using the `resnet_v1` model builder from the official Keras documentation example on CIFAR-10 (a minimal instantiation sketch follows). You can find the model definition in the repo included with this report.
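For reference, here is a minimal instantiation sketch, assuming `resnet_v1` keeps the signature from the Keras CIFAR-10 example (`resnet_v1(input_shape, depth, num_classes)`, with depth = 6n + 2, so depth 20 gives ResNet20). The import path and compile settings are assumptions, not the exact code from the repo.

```python
from resnet import resnet_v1  # assumed module name; see the repo for the actual path

# ResNet20 for 32x32x3 CIFAR-10 images.
model = resnet_v1(input_shape=(32, 32, 3), depth=20, num_classes=10)
model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
```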
⭐ Let's get started with the fun part. Experiments. 👇
Baseline Model
The baseline model will be used to compare all the augmentation techniques: it serves as the reference point for measuring a model's resilience to data shift/corruption. I trained the model using the above-mentioned configuration. The metric plots along with my observations are shown below. 👇
Cutout Augmentation
Today's models have tens to hundreds of millions of learnable parameters, which provide the necessary representational power. But with this much capacity, models tend to overfit to the training distribution rather than generalize well on the task at hand. To overcome this, models must be regularized properly, either with data augmentation or with the judicious addition of noise to activations, parameters, data, etc.
One of the most common uses of noise to improve model accuracy is Dropout, which stochastically (randomly) drops neuron activations during training and, as a result, discourages the co-adaptation of feature detectors. Dropout tends to work well for fully connected layers but lacks that regularization power for convolutional layers. The authors of Cutout offer two explanations for this:
- Convolutional layers already have far fewer parameters than fully-connected layers, and therefore require less regularization.
- Neighbouring pixels in images share much of the same information, so if individual pixels or activations are dropped, the information they carried is largely still available from their neighbours.
Autoencoders are good at learning useful representations from images in a self-supervised manner, especially context encoders, where part of the input is corrupted (removed) and the network must reconstruct it, using the remaining pixels as context to determine how best to fill in the blanks. Such architectures develop a better understanding of the global content of the image, and therefore learn better higher-level features.
Cutout
Improved Regularization of Convolutional Neural Networks with Cutout
Inspired by dropout and context encoders, Cutout is a simple regularization technique that removes a contiguous section (patch) from the input image during training. It can be viewed as a dropout operation applied in the input space. This way the model is forced to look at the entire image rather than fixating on a few key features. We will see whether this makes models more robust to data shift. Let's implement it.
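Below is a minimal TensorFlow 2.x sketch of Cutout. The `cutout` helper, its `mask_size` default, and the fill value are illustrative choices rather than the exact code used in the experiments, and images are assumed to be float32.

```python
import tensorflow as tf

def cutout(image, mask_size=8, fill_value=0.0):
    """Mask out a random square patch of an HxWxC float32 image."""
    h, w = tf.shape(image)[0], tf.shape(image)[1]
    # Pick a random patch centre; the patch may partially fall outside the image,
    # which the clipping below handles (as in the original Cutout formulation).
    cy = tf.random.uniform([], 0, h, dtype=tf.int32)
    cx = tf.random.uniform([], 0, w, dtype=tf.int32)
    half = mask_size // 2
    y1, y2 = tf.clip_by_value(cy - half, 0, h), tf.clip_by_value(cy + half, 0, h)
    x1, x2 = tf.clip_by_value(cx - half, 0, w), tf.clip_by_value(cx + half, 0, w)
    # Binary mask: 0 inside the patch, 1 everywhere else.
    mask = tf.pad(
        tf.zeros((y2 - y1, x2 - x1), dtype=image.dtype),
        [[y1, h - y2], [x1, w - x2]],
        constant_values=1,
    )
    mask = mask[..., tf.newaxis]  # broadcast the mask over the channel dimension
    return image * mask + fill_value * (1.0 - mask)

# Example: apply Cutout on the fly in a tf.data pipeline (assumed to exist).
# train_ds = train_ds.map(lambda x, y: (cutout(x, mask_size=8), y))
```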
Ablation Study
Mixup Augmentation
Recent state-of-the-art neural networks are trained by minimizing the average error over the training data (empirical risk minimization), while the number of parameters grows roughly linearly with the size of the dataset. It has been observed that large neural networks tend to memorize the training data instead of generalizing from it, even in the presence of strong regularization.
Data augmentation is the process of training the model on examples that are similar to, but different from, the training examples. These virtual examples can be drawn from a vicinity distribution around the training examples to enlarge the support of the training distribution. For example, in image classification it is common to define the vicinity of an image as the set of its horizontal reflections, slight rotations, and mild scalings. However, human knowledge is required to describe such a vicinity or neighbourhood around each example in the training data.
While data augmentation consistently leads to improved generalization, it is limited by two issues:
- The procedure is dataset dependent, and thus requires the use of expert knowledge.
- Data augmentation assumes that the examples in the vicinity (neighbourhood) share the same class, and it does not model the vicinity relation across examples of different classes.
Mixup
mixup: BEYOND EMPIRICAL RISK MINIMIZATION
In order to address these problems, the authors of Mixup came up with a data-agnostic (data-independent) and simple-to-implement data augmentation strategy. In a nutshell, it creates virtual examples by

$$\tilde{x} = \lambda x_i + (1 - \lambda) x_j$$
$$\tilde{y} = \lambda y_i + (1 - \lambda) y_j$$
where $(x_i, y_i)$ and $(x_j, y_j)$ are two data points randomly drawn from the training dataset and $\lambda \in [0, 1]$. Thus Mixup performs linear interpolation in the input space with the same interpolation in the associated target space. This improves model robustness to corrupt labels, avoids overfitting (it is hard to memorize virtual labels), and increases generalization.
$\lambda$ is drawn from a $\mathrm{Beta}(\alpha, \alpha)$ distribution, where $\alpha$ controls the strength of interpolation in the input/target space. More on the Beta distribution here.
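Here is a minimal TensorFlow 2.x sketch of this interpolation, applied per batch. The `mixup` helper and the default $\alpha$ are illustrative; labels are assumed to be one-hot encoded floats, and a single $\lambda$ is shared by the whole batch.

```python
import tensorflow as tf

def mixup(images, labels, alpha=0.2):
    """Mix a batch with a shuffled copy of itself (float images, one-hot labels)."""
    batch_size = tf.shape(images)[0]
    # Sample lambda ~ Beta(alpha, alpha) from two Gamma draws:
    # if g1 ~ Gamma(a, 1) and g2 ~ Gamma(b, 1), then g1 / (g1 + g2) ~ Beta(a, b).
    g1 = tf.random.gamma([], alpha)
    g2 = tf.random.gamma([], alpha)
    lam = g1 / (g1 + g2)
    # Pair every example with a random partner from the same batch.
    index = tf.random.shuffle(tf.range(batch_size))
    mixed_images = lam * images + (1.0 - lam) * tf.gather(images, index)
    mixed_labels = lam * labels + (1.0 - lam) * tf.gather(labels, index)
    return mixed_images, mixed_labels

# Example: apply Mixup after batching in a tf.data pipeline (assumed to exist).
# train_ds = train_ds.batch(128).map(mixup)
```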