
FAN: Computer Vision & Handling Corrupt Input

Computer vision has been a staple of machine learning for many years and keeps improving, yet models still struggle to interpret corrupted inputs. In a recent paper, Zhou et al. explore what makes certain model types better at handling corrupted input.
Computer vision is one of the most important fields of machine learning, yet it has always had serious trouble with something that might seem trivial to a human observer: basic, sometimes unnoticeable, visual corruption.
Most vision models handle clean input surprisingly well, consistently differentiating objects without much error, but apply something as simple as JPEG compression, often imperceptible to a human, to the input data and performance can collapse.
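To make that failure mode concrete, here is a minimal sketch (not from the paper) of the kind of experiment being described: round-trip an image through low-quality JPEG compression and compare a pretrained classifier's predictions on the clean and compressed copies. The model choice, file path, and quality setting are illustrative assumptions.

```python
# Minimal sketch (not from the paper): round-trip an image through low-quality
# JPEG compression and compare a pretrained classifier's predictions on the
# clean vs. compressed copy. Model, file path, and quality are assumptions.
import io

import torch
from PIL import Image
from torchvision import models, transforms

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def jpeg_compress(img: Image.Image, quality: int = 10) -> Image.Image:
    """Simulate aggressive JPEG compression via an in-memory re-encode."""
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2).eval()

img = Image.open("example.jpg").convert("RGB")          # hypothetical input image
clean = preprocess(img).unsqueeze(0)
corrupted = preprocess(jpeg_compress(img)).unsqueeze(0)

with torch.no_grad():
    print("clean prediction:    ", model(clean).argmax(1).item())
    print("corrupted prediction:", model(corrupted).argmax(1).item())
```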
Some model types struggle more than others: classic convolutional networks degrade far more than the more recent vision transformers. Zhou et al. analyze what makes different architectures better or worse at handling visual corruption, and go on to propose their own line of models, Fully Attentional Networks (FANs), which compare favorably with existing models.
Their paper can be viewed here: https://arxiv.org/abs/2204.12451.
Additionally, they maintain a GitHub repo that hosts their pretrained FAN models: https://github.com/NVlabs/FAN

How FAN models compare to others


FANs (Fully Attentional Networks) are themselves modified vision transformers, differing chiefly in how the blocks that make up a conventional vision transformer are constructed. Vision transformers, which take cues from natural language model architecture in their use of tokenization, have generally had the edge over convolutional neural networks in handling corrupted input data.
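As a rough sketch of what "fully attentional" means here (my reading, not the authors' implementation): a standard ViT block mixes tokens with self-attention and then mixes channels with an MLP, while a FAN-style block also applies attention across the channel dimension. The layer sizes and the exact form of the channel attention below are simplifying assumptions.

```python
# Rough sketch of a "fully attentional" block (an approximation, not the
# authors' code): token mixing via standard self-attention, then channel
# mixing via attention computed across the channel axis, then an MLP.
import torch
import torch.nn as nn

class FANStyleBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, mlp_ratio: float = 4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        hidden = int(dim * mlp_ratio)
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, channels)
        # 1) token mixing, exactly as in a conventional ViT block
        h = self.norm1(x)
        attn_out, _ = self.token_attn(h, h, h, need_weights=False)
        x = x + attn_out

        # 2) channel mixing: attention over the channel axis (the "fully
        #    attentional" part), instead of relying on the MLP alone
        h = self.norm2(x)
        sim = h.transpose(1, 2) @ h / (h.shape[1] ** 0.5)  # (B, C, C) channel similarity
        x = x + h @ torch.softmax(sim, dim=-1)             # re-weight channels

        # 3) the usual feed-forward MLP
        return x + self.mlp(self.norm3(x))

# Example: FANStyleBlock(dim=384)(torch.randn(2, 196, 384)) -> shape (2, 196, 384)
```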
In testing, the authors recorded how well each model performs on clean data versus how it performs on various types of corrupted data. The gap between the two is treated as the model's robustness: its ability to perform on corrupted data relative to clean data.
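A minimal sketch of that comparison, assuming you already have dataloaders for a clean test set and a corrupted copy of it: evaluate the same model on both and report the gap. The "retention" ratio below is just one convenient summary, not the paper's exact metric (corruption benchmarks such as ImageNet-C are typically reported as mean corruption error).

```python
# Minimal sketch of the clean-vs-corrupted comparison described above.
# `clean_loader` and `corrupted_loader` are assumed to yield (images, labels)
# batches; the "retention" ratio is illustrative, not the paper's metric.
import torch

@torch.no_grad()
def accuracy(model: torch.nn.Module, loader) -> float:
    model.eval()
    correct = total = 0
    for images, labels in loader:
        correct += (model(images).argmax(dim=1) == labels).sum().item()
        total += labels.numel()
    return correct / total

def robustness_report(model, clean_loader, corrupted_loader) -> dict:
    clean_acc = accuracy(model, clean_loader)
    corrupt_acc = accuracy(model, corrupted_loader)
    return {
        "clean_acc": clean_acc,
        "corrupt_acc": corrupt_acc,
        "retention": corrupt_acc / clean_acc,  # 1.0 means no degradation at all
    }
```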
One notable finding is that while the recent ConvNeXt model does outperform standard vision transformers and even Swin, it is less robust than both a basic FAN vision transformer (FAN-ViT) and a combination of FAN and Swin (FAN-Swin). However, plain ConvNeXt did maintain higher performance on non-corrupted data. This led the team to their final model, FAN-Hybrid, which combines architectural elements from ConvNeXt with FAN principles.
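Schematically, the hybrid idea is to let convolutional stages handle the early, high-resolution features and hand off to attention blocks at lower resolution. In the sketch below, nn.TransformerEncoderLayer merely stands in for the actual FAN blocks, and the stage widths and depths are arbitrary assumptions rather than the published FAN-Hybrid configurations.

```python
# Schematic sketch of the hybrid layout: convolutional stages early, attention
# stages late. nn.TransformerEncoderLayer stands in for the actual FAN blocks;
# widths and depths are arbitrary, not the published FAN-Hybrid configurations.
import torch
import torch.nn as nn

class HybridBackboneSketch(nn.Module):
    def __init__(self, dim: int = 256, num_classes: int = 1000):
        super().__init__()
        # early stages: strided convolutions downsample the image (ConvNeXt-like role)
        self.conv_stages = nn.Sequential(
            nn.Conv2d(3, dim // 4, kernel_size=4, stride=4), nn.GELU(),
            nn.Conv2d(dim // 4, dim, kernel_size=2, stride=2), nn.GELU(),
        )
        # late stages: attention blocks over the flattened feature map (FAN-block role)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.attn_stages = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.conv_stages(x)                # (B, dim, H/8, W/8)
        tokens = feats.flatten(2).transpose(1, 2)  # (B, num_tokens, dim)
        tokens = self.attn_stages(tokens)
        return self.head(tokens.mean(dim=1))       # average-pool tokens, then classify

# Example: HybridBackboneSketch()(torch.randn(1, 3, 224, 224)) -> shape (1, 1000)
```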
FAN-Hybrid comes in various sizes and was tested against existing standout models on a variety of vision tasks, covering both clean and corrupted input data, and showed improvements across the board.
