Backpropagated Gradient Representations for Anomaly Detection

A reproduction of the paper 'Backpropagated Gradient Representations for Anomaly Detection' by Kwon et al., in Proceedings of the European Conference on Computer Vision (ECCV), 2020 for the Reproducibility Challenge 2021.
Shambhavi Mishra

Reproducibility Summary

In this report, we attempt to reproduce the paper 'Backpropagated Gradient Representations for Anomaly Detection' by Kwon et al., accepted in Proceedings of the European Conference on Computer Vision (ECCV 2020). We cover each aspect of reproducing the results and claims put forth in the paper.
The paper proposes a novel approach that uses gradient-based representations to achieve state-of-the-art anomaly detection performance on benchmark image recognition datasets. Although we could reproduce the results with some code fixes, obtaining them was computationally expensive.

Scope of Reproducibility

The paper proposes using backpropagated gradients as representations to characterize anomalies. The authors highlight the computational efficiency and simplicity of the proposed method in comparison with other state-of-the-art methods that rely on adversarial networks or autoregressive models, which require at least 27 times more model parameters than the proposed method. The authors build an anomaly detection algorithm on these gradient-based representations and show that it outperforms state-of-the-art algorithms using activation-based representations.

Methodology

We used the code provided by the authors in their GitHub repository. We fixed the dependencies and the code, and we have uploaded our implementation here as well. Total training time for each dataset (MNIST and CIFAR) ranged from approximately 5 to 8 hours on four NVIDIA Tesla P100 GPUs. Further details are presented in run set 2. We used Weights & Biases as our default logging and tracking tool to store all results in a single dashboard.

Results

Due to high computational requirements, we were not able to replicate all of the experiments presented in the paper. We reproduced the results for the Convolutional Autoencoder on two datasets, MNIST and CIFAR. The results we obtained do not exactly match the ones reported in the paper. One possible reason for the discrepancy is the random seed used by the authors, which they did not provide.

What was easy

The paper was understandable and its structure was easy to follow. Alongside the theoretical concept, the mathematical equations made it easier to reformulate the paper. In addition, the authors provided the original implementation, which was easy to run with very few modifications.

What was difficult

It was difficult to run the code on the available system, an NVIDIA GTX 1060, and we had to switch to more expensive compute. The results did not match the ones in the paper, and the random seed was not provided by the authors. It was therefore difficult to assess the authors' claim that the proposed method outperforms the state of the art.

Communication with original authors

The authors have so far not responded to the emails we sent regarding the incorrect claims (a discrepancy of approx. 5%) made in the paper, nor to our requests for accurate training code to ensure reproducibility. When we raised a GitHub issue, they also could not satisfactorily answer the concern.

Introduction

This reproducibility submission is an effort to validate the ECCV 2020 paper by Kwon et al. (2020), titled 'Backpropagated Gradient Representations for Anomaly Detection.' This report evaluates the central claim of the paper, which proposes an anomaly detection algorithm using gradient-based representations and shows that it outperforms state-of-the-art algorithms using activation-based representations. We provide evidence of a discrepancy between our results and those reported in the paper.

Scope of Reproducibility

By analyzing model behavior on anomalies, the authors suggest adopting a gradient-based representation for anomaly identification. They present a geometric interpretation of gradients and construct an anomaly score based on gradient deviation from the directional constraint.
The authors claim that gradient-based representations are more effective than activation-based representations for detecting anomalies. Furthermore, on image recognition datasets, the proposed anomaly detection method, GradCon, which combines the reconstruction error and the gradient loss, is claimed to achieve state-of-the-art performance.
The authors also claim that, in terms of computational efficiency, GradCon has fewer model parameters and a faster inference time than other state-of-the-art anomaly detection methods.
The major contributions as listed in the paper are:

Proposed Approach: Gradient-based Representations

Distance information assessed using a specified loss function characterizes the anomaly in activation-based representations. The gradients, on the other hand, give directional information, indicating the movement of the manifold in which data representations are located. This movement describes the direction in which the abnormal data distribution deviates from normal representations.
Furthermore, in comparison to current representations of normal data, the gradients generated from several layers give a complete viewpoint to describe abnormalities. As a result, the directional information from gradients may be used in conjunction with the distance information from the activation to provide additional information.

Theoretical Interpretation of Gradients

Gradient-based representations describe model updates from query data and distinguish normal from abnormal data using the Fisher kernel.
An autoencoder configuration is used, but the encoder and decoder are treated as probability distributions.
Given the latent variable, z, the decoder models the input distribution through a conditional distribution, P(x|z). The autoencoder is trained to minimize the negative log-likelihood, -log P(x|z). When x is real-valued and P(x|z) is assumed to be a Gaussian distribution, the decoder estimates the mean of the Gaussian, and minimizing the negative log-likelihood corresponds to using the mean squared error as the reconstruction error. When x is binary, P(x|z) is assumed to be a Bernoulli distribution, and the negative log-likelihood is formulated as a binary cross entropy loss. Treating the decoder as a conditional probability makes it possible to interpret the gradients using the Fisher kernel.
Fisher kernels allow discriminant characteristics to be extracted from generative models, and they have been used in various applications including image categorization, image classification, and action recognition.
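The link between the Gaussian assumption and the mean squared error can be made explicit with a short derivation. This is a sketch under assumptions that are not in the original text: a fixed isotropic variance \sigma^2, a decoder output \hat{x}(z), and a D-dimensional input x.

```latex
% Gaussian decoder: the NLL is a scaled squared error plus a constant
-\log P(x \mid z)
  = -\log \mathcal{N}\!\left(x;\, \hat{x}(z),\, \sigma^{2} I\right)
  = \frac{1}{2\sigma^{2}} \left\lVert x - \hat{x}(z) \right\rVert_2^{2}
    + \frac{D}{2} \log\!\left(2\pi\sigma^{2}\right)

% Bernoulli decoder: the NLL is the binary cross entropy
-\log P(x \mid z)
  = -\sum_{d=1}^{D} \left[ x_d \log \hat{x}_d(z)
    + (1 - x_d) \log\!\left(1 - \hat{x}_d(z)\right) \right]
```

With \sigma held fixed, the second term of the Gaussian case does not depend on the model parameters, so minimizing the negative log-likelihood is equivalent to minimizing the squared reconstruction error.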
The Fisher kernel is used to quantify the distance between training data and normal test data, as well as between training data and abnormal test data, so that the distribution may be generalized to test data.
The Fisher kernels for normal data (inliers), K^{in}_{FK}, and abnormal data (outliers), K^{out}_{FK}, are derived as follows, respectively:
K^{in}_{FK}(X_{tr}, X_{te,in}) = {U^{X_{tr}}_{\phi}}^{T} F^{-1} U^{X_{te,in}}_{\phi,z}
K^{out}_{FK}(X_{tr}, X_{te,out}) = {U^{X_{tr}}_{\phi}}^{T} F^{-1} U^{X_{te,out}}_{\phi,z}
where X_{tr}, X_{te,in}, and X_{te,out} are the training data, normal test data, and abnormal test data, respectively. U denotes the Fisher score (the gradient of the log-likelihood with respect to the model parameters) and F the Fisher information matrix.
The distance between x_{out} and x_{in} is formulated as the reconstruction error and characterizes the abnormality of the data as shown in the above figure.
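To make the form U^T F^{-1} U concrete, the following toy sketch evaluates a Fisher kernel for a one-dimensional Gaussian whose only parameter is its mean. It is an illustration only: the paper applies the kernel to decoder parameters of an autoencoder, whereas the model, the function names, and the sample values here are hypothetical.

```python
def fisher_score(x, mu, sigma2):
    """Score of N(mu, sigma2) with respect to mu: d/dmu log p(x | mu)."""
    return (x - mu) / sigma2


def fisher_kernel(x_a, x_b, mu, sigma2):
    """K(x_a, x_b) = U_a * F^{-1} * U_b for the scalar mean parameter.

    For a Gaussian with known variance, the Fisher information of the
    mean is F = 1 / sigma2.
    """
    fisher_information = 1.0 / sigma2
    score_a = fisher_score(x_a, mu, sigma2)
    score_b = fisher_score(x_b, mu, sigma2)
    return score_a * score_b / fisher_information


# Hypothetical values: a training point, a nearby "normal" test point,
# and a far-away "abnormal" test point on the other side of the mean.
mu, sigma2 = 0.0, 1.0
x_train = 0.5
k_in = fisher_kernel(x_train, 0.4, mu, sigma2)
k_out = fisher_kernel(x_train, -3.0, mu, sigma2)
print(k_in, k_out)
```

In this toy setting the kernel is positive when the two scores point in the same direction and negative when they point in opposite directions, which mirrors the directional reading of gradients used in the report.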

Method: Gradient Constraint

Modeling the normalcy of data is frequently used to separate inliers and outliers in the representation space. The irregularity is captured by the deviation from the normalcy model. Constraints imposed during training are frequently used to mimic normalcy. Normal data is easily restricted by the constraint, whereas abnormal data deviates.
The authors propose to train an autoencoder with a directional gradient constraint to model normality. In particular, based on the interpretation of gradients from the Fisher kernel perspective, they enforce alignment between gradients. This constraint aligns the gradients from normal data with each other, which results in small changes to the manifold. The gradients from abnormal data, on the other hand, are not aligned with the others and cause abrupt changes to the manifold.
The gradient loss is calculated by averaging the cosine similarity across all layers in the decoder at the kth iteration of training:
L_{grad} = -E_{i}\left[ cosSIM\left( \left(\frac{\partial J^{k-1}}{\partial \phi_{i}}\right)_{avg}, \frac{\partial L^{k}}{\partial \phi_{i}} \right) \right], \quad \left(\frac{\partial J^{k-1}}{\partial \phi_{i}}\right)_{avg} = \frac{1}{k-1} \sum_{t=1}^{k-1} \frac{\partial J^{t}}{\partial \phi_{i}}
J = L + \Omega + \alpha L_{grad}
The first and second terms, L and \Omega, are the reconstruction error and the latent loss, respectively, and their form depends on the type of autoencoder. The gradient loss L_{grad} is weighted by \alpha.
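The gradient loss and the combined objective can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: each layer's gradient is represented as a flat list of floats, and the function names, the example gradients, and the value of alpha are hypothetical.

```python
import math


def cosine_similarity(a, b, eps=1e-12):
    """Cosine similarity between two flattened gradient vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, eps)


def gradient_loss(avg_grads, current_grads):
    """L_grad: negative mean, over decoder layers, of the cosine similarity
    between the averaged past gradient and the current gradient."""
    sims = [cosine_similarity(a, g) for a, g in zip(avg_grads, current_grads)]
    return -sum(sims) / len(sims)


def update_running_average(avg_grads, new_grads, k):
    """Fold the gradients of iteration k into the mean over iterations 1..k,
    so the result can serve as the average at iteration k + 1."""
    return [
        [(a * (k - 1) + g) / k for a, g in zip(avg_layer, new_layer)]
        for avg_layer, new_layer in zip(avg_grads, new_grads)
    ]


def total_objective(recon_loss, latent_loss, grad_loss, alpha):
    """J = L + Omega + alpha * L_grad."""
    return recon_loss + latent_loss + alpha * grad_loss


# Two hypothetical decoder layers with two parameters each.
avg = [[1.0, 0.0], [0.0, 2.0]]
aligned = [[2.0, 0.0], [0.0, 1.0]]    # same directions as the average
opposed = [[-1.0, 0.0], [0.0, -3.0]]  # opposite directions

loss_aligned = gradient_loss(avg, aligned)
loss_opposed = gradient_loss(avg, opposed)
print(loss_aligned, loss_opposed)
print(total_objective(0.5, 0.0, loss_aligned, alpha=0.1))  # alpha is arbitrary
```

Aligned gradients give the lowest possible gradient loss and opposed gradients the highest, so minimizing J encourages the gradients of normal training data to keep a consistent direction.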

Methodology

Model Descriptions

The paper uses a convolutional autoencoder (CAE) for GradCon. The encoder and the decoder are symmetric, each consisting of 4 convolutional layers, and the dimension of the latent variable is 3 x 3 x 64.
For the baseline experiments, the authors also train four different autoencoders: CAE, CAE with the gradient constraint (CAE + Grad), VAE, and VAE with the gradient constraint (VAE + Grad). VAEs are trained using binary cross entropy as the reconstruction error and Kullback-Leibler (KL) divergence as the latent loss.
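The two VAE loss terms mentioned above can be sketched as follows. The sketch assumes the standard VAE setup of a diagonal Gaussian posterior and a standard normal prior, which the report does not state explicitly; the function names and inputs are hypothetical.

```python
import math


def binary_cross_entropy(x, x_hat, eps=1e-7):
    """Reconstruction error: BCE summed over pixels, with x_hat in (0, 1)."""
    total = 0.0
    for target, pred in zip(x, x_hat):
        p = min(max(pred, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= target * math.log(p) + (1.0 - target) * math.log(1.0 - p)
    return total


def kl_to_standard_normal(mu, log_var):
    """Latent loss: KL(N(mu, diag(exp(log_var))) || N(0, I)),
    summed over latent dimensions."""
    return -0.5 * sum(
        1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var)
    )


# Hypothetical two-pixel input and one-dimensional latent code.
recon = binary_cross_entropy([1.0, 0.0], [0.5, 0.5])
latent = kl_to_standard_normal([1.0], [0.0])
print(recon, latent)
```

In the notation of the objective J, the first function plays the role of L and the second the role of \Omega for the VAE variants.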

Datasets

The paper utilizes four benchmark datasets :
We reproduced the experiments on CIFAR-10 and MNIST, since the code for these datasets was shared. Due to computational limitations, we could not perform the remaining experiments.

Hyper-parameters

Throughout all our experiments, we used the same hyper-parameters as in the paper. The authors clearly state all hyper-parameters to train the models in the experiments.

Computational requirements

To run the experiments, we used 4 NVIDIA Tesla P100 GPUs.
The integration of the existing code base with Weights & Biases and the modifications needed to run the code can be found here.

Results

We reproduced the GradCon model with a Convolutional Autoencoder (CAE) on two datasets, MNIST and CIFAR-10. Below, we plot the three losses obtained from the models: the Loss, or MSE (Mean Squared Error) Loss, the Reconstruction Loss from the autoencoder, and the Grad Loss as defined above.

Results on CIFAR-10

Anomaly detection AUROC results on CIFAR-10

Class      Paper    Run 1    Run 2
Plane      0.760    0.721    0.759
Car        0.598    0.472    0.526
Bird       0.648    0.632    0.606
Cat        0.586    0.594    0.587
Deer       0.733    0.725    0.702
Dog        0.603    0.568    0.519
Frog       0.684    0.689    0.695
Horse      0.567    0.522    0.538
Ship       0.784    0.782    0.750
Truck      0.678    0.451    0.529
Average    0.664    0.616    0.621
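The averages and the per-class gaps in the CIFAR-10 table can be checked with a few lines of Python. The numbers are copied from the table above; the helper names are hypothetical and only serve this illustration.

```python
classes = ["Plane", "Car", "Bird", "Cat", "Deer",
           "Dog", "Frog", "Horse", "Ship", "Truck"]
paper = [0.760, 0.598, 0.648, 0.586, 0.733, 0.603, 0.684, 0.567, 0.784, 0.678]
run_1 = [0.721, 0.472, 0.632, 0.594, 0.725, 0.568, 0.689, 0.522, 0.782, 0.451]
run_2 = [0.759, 0.526, 0.606, 0.587, 0.702, 0.519, 0.695, 0.538, 0.750, 0.529]


def mean(values):
    """Arithmetic mean of a list of AUROC values."""
    return sum(values) / len(values)


def largest_gaps(reference, reproduced, names, top=3):
    """Classes with the largest drop (reference minus reproduced AUROC)."""
    gaps = [(ref - rep, name)
            for ref, rep, name in zip(reference, reproduced, names)]
    return sorted(gaps, reverse=True)[:top]


for label, column in [("Paper", paper), ("Run 1", run_1), ("Run 2", run_2)]:
    print(label, round(mean(column), 3))

for gap, name in largest_gaps(paper, run_1, classes, top=2):
    print(name, round(gap, 3))
```

For Run 1, the largest drops relative to the paper are on the Truck and Car classes, which account for most of the gap in the average.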

Results on MNIST

Anomaly detection AUROC results on MNIST

Class      Paper    Run 1    Run 2
0          0.995    0.996    0.996
1          0.999    0.999    0.999
2          0.952    0.933    0.924
3          0.937    0.954    0.958
4          0.969    0.568    0.566
5          0.977    0.955    0.961
6          0.994    0.472    0.471
7          0.979    0.663    0.633
8          0.919    0.900    0.896
9          0.973    0.577    0.582
Average    0.973    0.802    0.799
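For readers less familiar with the metric, AUROC can be read as the probability that a randomly chosen abnormal sample receives a higher anomaly score than a randomly chosen normal one. The sketch below computes it by direct pairwise comparison; the function name and the example scores are hypothetical, and an evaluation pipeline would typically rely on a library routine instead.

```python
def auroc(scores_normal, scores_abnormal):
    """Fraction of (abnormal, normal) pairs in which the abnormal sample
    has the higher anomaly score; ties count as half a win."""
    wins = 0.0
    for abnormal in scores_abnormal:
        for normal in scores_normal:
            if abnormal > normal:
                wins += 1.0
            elif abnormal == normal:
                wins += 0.5
    return wins / (len(scores_normal) * len(scores_abnormal))


# Hypothetical anomaly scores.
perfect = auroc([0.1, 0.2], [0.8, 0.9])   # every abnormal score is higher
partial = auroc([0.1, 0.4], [0.3, 0.9])   # one of four pairs is misordered
print(perfect, partial)
```

Under this reading, a value near 0.5 corresponds to chance-level ranking, and a value below 0.5 means that normal samples tend to receive higher anomaly scores than abnormal ones.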

Visualizing the losses for CIFAR and MNIST Datasets

The plots below show the losses obtained during the training of GradCon, where the x-axis is the number of steps and the y-axis is the value of the loss.