How to Handle Image Distortions and Misclassifications
Spoiler alert: encoding high-level features.
Introduction
In many applications, the high dimensionality of the inputs makes it challenging to build robust systems that generalize well to unknown distortions. Image classification systems are a good example: changing even a single pixel can significantly change the final prediction.

Example of an adversarial attack.
Now, solutions to this problem usually involve changing the Deep Convolutional Neural Network (DCNN) architecture to better deal with distortions or using a structure such as a Deep Variational Autoencoder (DVAE) to clean the images.
Unfortunately, these two solutions are not always applicable. Why exactly? Let's see:
- In many applications that rely on transfer learning, changing the DCNN architecture means giving up the knowledge acquired through pre-training. After every architectural change, the model has to be retrained from scratch on large datasets (such as ImageNet or COCO).
- Changes to the DCNN architecture (mathematical operations, connections between layers, activation functions, etc.) can be very effective against one specific kind of distortion yet ineffective against others, so this type of solution doesn't generalize well.
- Cleaning images is a complex task. DVAEs are commonly used for it because their objective function differs from that of a vanilla autoencoder. But since even small images contain many pixels, these models need many parameters and a large number of training samples to generalize well. Applications that process real-time images at high throughput may not tolerate the latency such denoising systems add. On top of that, these solutions are reported to fail on heavily corrupted images, where denoising simply becomes too complex a problem.
Digging into autoencoders
In the work Encoding high-level features: an approach to robust Transfer Learning, we show that variational autoencoders can be used as an intermediary between the feature extractor and the classifier to improve classification results, both for clean images from the CIFAR10 and CIFAR100 datasets and for noisy images from the CIFAR-Corrupted dataset. We compared the classical structure (feature extractor plus classifier) against the same structure with the proposed change, using different DCNNs pre-trained on ImageNet. The change led to higher accuracy and a significant reduction in loss for every model tested.

Illustration of the proposed change.
The idea is simple: instead of feeding the feature maps directly to the classifier, we place a variational autoencoder in the middle, trained to encode the feature maps into a latent space and to recover them from it. But instead of taking the output of the decoder, we feed the latent representations directly to the classifier. The classifier now receives information encoded in a much smaller space, in the form of a sequence of distributions.
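A minimal sketch of this arrangement in TensorFlow/Keras is shown below. The layer sizes and names are illustrative (a 1024-dimensional feature vector compressed into a 256-dimensional latent space, as in the VFE 256 variant), not the exact architecture from the paper:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical sizes: a 1024-dim feature vector (e.g. DenseNet121 after
# Global Average Pooling) compressed into a 256-dim latent space, as in
# the VFE 256 variant. CIFAR10 is assumed for the number of classes.
FEATURE_DIM = 1024
LATENT_DIM = 256
NUM_CLASSES = 10

# Encoder: maps the feature vector to the parameters of a latent distribution.
features_in = layers.Input(shape=(FEATURE_DIM,))
z_mean = layers.Dense(LATENT_DIM, name="z_mean")(features_in)
z_log_var = layers.Dense(LATENT_DIM, name="z_log_var")(features_in)

# Reparameterization trick: sample z = mean + sigma * epsilon.
def sample_z(args):
    mean, log_var = args
    eps = tf.random.normal(tf.shape(mean))
    return mean + tf.exp(0.5 * log_var) * eps

z = layers.Lambda(sample_z, name="z")([z_mean, z_log_var])

# Decoder: reconstructs the feature vector (used only to train the VAE).
reconstruction = layers.Dense(FEATURE_DIM, name="reconstruction")(z)

# Classifier: reads the latent representation, not the raw feature maps.
logits = layers.Dense(NUM_CLASSES, name="classifier")(z)

vae_classifier = tf.keras.Model(features_in, [reconstruction, logits])
```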

Essentially, we are using dimensionality reduction and the objective function of the VAE to preserve only the most important parts of the information (consequently leaving the noise behind) while also trying to generate useful representations for the classifier.
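For intuition, the sketch below spells out a standard VAE objective applied to feature vectors: a reconstruction term plus a KL term that pulls the latent distributions toward a standard normal prior. The squared-error reconstruction and the `kl_weight` knob are assumptions for illustration; the exact loss weighting used in the paper may differ.

```python
import tensorflow as tf

def vfe_loss(features, reconstruction, z_mean, z_log_var, kl_weight=1.0):
    # Reconstruction term: how well the decoder recovers the feature vector.
    recon = tf.reduce_mean(
        tf.reduce_sum(tf.square(features - reconstruction), axis=-1))
    # KL term: pulls each latent distribution toward a standard normal prior,
    # discouraging the encoder from memorizing input noise.
    kl = -0.5 * tf.reduce_mean(
        tf.reduce_sum(1.0 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var),
                      axis=-1))
    return recon + kl_weight * kl
```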
Isn't it the same as using a DVAE to clean images? Not quite.
Although the objective is the same (remove or ignore noise), DVAEs that work on images use convolutional layers to extract features and deconvolutional layers to reconstruct the image. This structure is much more complex because of the size of the information it receives and outputs.
On the other hand, using a VAE in the high-level feature space (output of DCNN) doesn't require convolutional layers as we are already dealing with feature maps extracted with a pre-trained DCNN. Working in the high-level feature space has advantages as DCNNs output an abstract but structured representation of the image in the form of feature maps. This means that the input and output of this VAE are much smaller than the ones used directly with images.
For example, DenseNet121 expects 224 × 224 × 3 RGB images (150,528 values) but outputs only a 1,024-dimensional feature vector after Global Average Pooling. Because of this, the structure placed between the feature extractor and the classifier can generalize better with fewer training samples while representing a much smaller addition to the overall system.
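For reference, here is how a frozen, ImageNet-pre-trained DenseNet121 feature extractor with Global Average Pooling can be set up in TensorFlow; the snippet only confirms the input and output sizes discussed above.

```python
import tensorflow as tf

# Pre-trained DenseNet121 used as a frozen feature extractor:
# 224 x 224 x 3 images in, a 1024-dimensional vector out after
# Global Average Pooling.
backbone = tf.keras.applications.DenseNet121(
    include_top=False, weights="imagenet", pooling="avg",
    input_shape=(224, 224, 3))
backbone.trainable = False

images = tf.random.uniform((1, 224, 224, 3))
features = backbone(images)
print(features.shape)  # (1, 1024)
```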

Comparison of DenseNet121 with and without the Variational Feature Encoder (VFE 256) on CIFAR10 and CIFAR100 datasets.

Three types of distortions in five levels of severity from the CIFAR-Corrupted dataset.

DenseNet121 and Inception ResNet V2 with and without the VFE 256 on the CIFAR-Corrupted images. All of these distortions are unknown to the system (it was trained only on clean images).

Results on CIFAR10 and CIFAR100 using VFEs with different latent channel capacities (64, 128, and 256).
We hope this simple idea will be helpful to those seeking to improve the robustness of their systems when changing architectures, giving up transfer learning, or adding large structures isn't feasible. An accompanying ipynb file shows how simple it is to add this structure to a classical DCNN-based image classifier; the code is written in TensorFlow.
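To give a rough idea of how little wiring is involved, the sketch below reuses the hypothetical `backbone` and `vae_classifier` objects from the snippets above to build a single end-to-end model for inference. Feeding the latent means to the classifier at inference time is one common choice, not necessarily what the notebook does.

```python
import tensorflow as tf

# End-to-end assembly: frozen DCNN -> VFE encoder -> classifier.
# `backbone` and `vae_classifier` come from the earlier sketches.
images = tf.keras.Input(shape=(224, 224, 3))
feats = backbone(images, training=False)              # frozen feature extractor
z_mu = vae_classifier.get_layer("z_mean")(feats)      # latent means
preds = vae_classifier.get_layer("classifier")(z_mu)  # class logits
full_model = tf.keras.Model(images, preds)
```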
Thank you & Good Luck!