Backpropagated Gradient Representations for Anomaly Detection

A reproduction of the paper 'Backpropagated Gradient Representations for Anomaly Detection', in Proceedings of the European Conference on Computer Vision (ECCV).
Shambhavi Mishra
Created on August 15|Last edited on June 26
In this article, we attempt to reproduce the paper 'Backpropagated Gradient Representations for Anomaly Detection' by Kwon et al., accepted in Proceedings of the European Conference on Computer Vision (ECCV 2020). We cover each aspect of reproducing the results and claims put forth in the paper. 
This paper proposes a novel approach to utilize gradient-based representations to achieve state-of-the-art anomaly detection performance in benchmark image recognition datasets. Although we could reproduce the results with some code fixes, getting the results was computationally expensive.
Here's what we'll cover: 
Table of Contents
Scope of Reproducibility
Methodology
Results
What was easy
What was difficult
Communication with original authors
Introduction
Scope of Reproducibility
Proposed Approach: Gradient-based Representations
Theoretical Interpretation of Gradients
Method: Gradient Constraint
Methodology
Model Descriptions
Datasets
Hyper-parameters
Computational requirements
Results
Results on CIFAR-10
Results on MNIST
Visualizing the losses for CIFAR and MNIST Datasets

Scope of Reproducibility
The paper proposes utilizing backpropagated gradients as representations to characterize anomalies. The authors highlight the computational efficiency and the simplicity of the proposed method in comparison with other state-of-the-art methods relying on adversarial networks or autoregressive models, which require at least 27 times more model parameters than the proposed method. In the paper, the authors propose an anomaly detection algorithm using gradient-based representations and show that it outperforms state-of-the-art algorithms using activation-based representations.
Methodology
We used the code provided by the authors in their GitHub repository. We fixed the dependencies and several issues in the code, and we've uploaded our implementation here as well. Total training time for each dataset (MNIST and CIFAR-10) was approximately 5 to 8 hours on four NVIDIA Tesla P100 GPUs. Further details are presented in run set 2. We used Weights & Biases as our default logging and tracking tool to store all results in a single dashboard.
Results
Due to high computational requirements, we were not able to replicate all of the experiments presented in the paper. We reproduced the results for the Convolutional Autoencoder on two datasets, MNIST and CIFAR-10. The results we obtained do not match those reported in the paper. One possible reason for the discrepancy is the random seed used by the authors, which they did not provide.
What was easy
The paper was understandable and its structure was easy to follow. Along with the theoretical concept, the mathematical equations made the method straightforward to reformulate. In addition, the authors provided the original implementation, which was easy to run with very few modifications.
What was difficult
It was difficult to run the code on the available system, an NVIDIA GTX 1060, and we had to switch to more expensive compute. The results did not match those in the paper, and the authors did not provide the random seed. It was therefore difficult to verify the authors' claim that the proposed method outperforms the state of the art.
Communication with original authors
The authors have so far not responded to our emails about the claims in the paper that we could not reproduce (a discrepancy of approx. 5%), or to our requests for the exact training code to ensure reproducibility. We also raised a GitHub issue, but the response did not satisfactorily address the concern.
Introduction
This reproducibility submission is an effort to validate the ECCV 2020 paper by Kwon et al. (2020), titled 'Backpropagated Gradient Representations for Anomaly Detection.' This report evaluates the central claim of the paper, which proposes an anomaly detection algorithm using gradient-based representations and shows that it outperforms state-of-the-art algorithms using activation-based representations. We provide evidence of a discrepancy between our results and those reported in the paper.
Scope of Reproducibility
By analyzing model behavior on anomalies, the authors suggest adopting a gradient-based representation for anomaly identification. They present a geometric interpretation of gradients and construct an anomaly score based on gradient deviation from the directional constraint.
In comparison to activation-based representations, the authors claim that gradient-based representations are more effective at detecting anomalies. Furthermore, they claim that the proposed anomaly detection method, GradCon, which combines the reconstruction error and the gradient loss, achieves state-of-the-art performance on benchmark image recognition datasets.
In terms of computational efficiency, the authors also claim that GradCon has fewer model parameters and a faster inference time than other state-of-the-art anomaly detection methods.
The major contributions as listed in the paper are:
The authors propose utilizing backpropagated gradients as representations to characterize anomalies.
The authors validate the representation capability of gradients for anomaly detection in comparison with activation through comprehensive baseline experiments.
The authors propose an anomaly detection algorithm using gradient-based representations and show that it outperforms state-of-the-art algorithms using activation-based representations.
Proposed Approach: Gradient-based Representations
Distance information assessed using a specified loss function characterizes the anomaly in activation-based representations. The gradients, on the other hand, give directional information, indicating the movement of the manifold in which data representations are located. This movement describes the direction in which the abnormal data distribution deviates from normal representations.
Furthermore, in comparison to current representations of normal data, the gradients generated from several layers give a complete viewpoint to describe abnormalities. As a result, the directional information from gradients may be used in conjunction with the distance information from the activation to provide additional information. 
Theoretical Interpretation of Gradients
Gradient-based representations describe model updates from query data and distinguish normal from abnormal data using the Fisher kernel.
An autoencoder configuration is used, but the encoder and decoder are treated as probability distributions. 
Given the latent variable, z, the decoder models the input distribution through a conditional distribution, P(x|z). The autoencoder is trained to minimize the negative log-likelihood, -log P(x|z). When 'x' is a real value and P(x|z) is assumed to be a Gaussian distribution, the decoder estimates the mean of the Gaussian, and minimizing the negative log-likelihood corresponds to using a mean squared error as the reconstruction error. When 'x' is a binary value, P(x|z) is assumed to be a Bernoulli distribution, and the negative log-likelihood is formulated as a binary cross-entropy loss. Considering the decoder as the conditional probability enables us to interpret gradients using the Fisher kernel.
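The correspondence between the negative log-likelihood and the two reconstruction losses can be checked numerically. The snippet below is a minimal sketch, not taken from the authors' code: the function names are ours, the Gaussian is assumed to have a fixed variance of 1, and scalars stand in for pixels.

```python
import math

def gaussian_nll(x, mean, sigma=1.0):
    """-log N(x; mean, sigma^2) for a scalar x (sigma is an assumed fixed value)."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (x - mean) ** 2 / (2 * sigma ** 2)

def squared_error(x, x_hat):
    """Per-element squared error, the quantity averaged by an MSE loss."""
    return (x - x_hat) ** 2

def bernoulli_nll(x, p):
    """-log of the Bernoulli pmf p^x * (1 - p)^(1 - x) for x in {0, 1}."""
    return -math.log(p ** x * (1 - p) ** (1 - x))

def binary_cross_entropy(x, p):
    """Binary cross-entropy between a binary target x and a predicted probability p."""
    return -(x * math.log(p) + (1 - x) * math.log(1 - p))

# With sigma = 1, the Gaussian NLL is half the squared error plus a constant
# that does not depend on the decoder output.
const = 0.5 * math.log(2 * math.pi)
print(gaussian_nll(0.8, 0.5) - const, 0.5 * squared_error(0.8, 0.5))

# The Bernoulli NLL and the binary cross-entropy are the same expression.
print(bernoulli_nll(1, 0.9), binary_cross_entropy(1, 0.9))
```

Because the constant term does not depend on the decoder output, minimizing the Gaussian negative log-likelihood and minimizing the squared error lead to the same optimum.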
Fisher kernels allow discriminant characteristics to be extracted from generative models, and they've been utilized in various applications, including image categorization, image classification, and action recognition.
The Fisher kernel is used to quantify the distance between training data and normal test data, as well as between training data and abnormal test data, so that the distribution may be generalized to test data. 
The Fisher kernels for normal data (inliers), $K^{in}_{FK}$, and abnormal data (outliers), $K^{out}_{FK}$, are derived as follows, respectively:
$$K^{in}_{FK}(X_{tr}, X_{te,in}) = {U^{X_{tr}}_{\phi}}^{T} F^{-1} U^{X_{te,in}}_{\phi,z}$$
$$K^{out}_{FK}(X_{tr}, X_{te,out}) = {U^{X_{tr}}_{\phi}}^{T} F^{-1} U^{X_{te,out}}_{\phi,z}$$
where $X_{tr}$, $X_{te,in}$, and $X_{te,out}$ are training data, normal test data, and abnormal test data, respectively.
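To make the kernel form U^T F^{-1} U concrete, here is a toy sketch for a one-parameter model. It is an illustration only and not the autoencoder setting of the paper: we assume a scalar Gaussian N(theta, sigma^2) whose only parameter is the mean, so the Fisher score and the Fisher information are scalars. All names and values are hypothetical.

```python
def score(x, theta, sigma):
    """Fisher score U = d/d(theta) log N(x; theta, sigma^2)."""
    return (x - theta) / sigma ** 2

def fisher_information(sigma):
    """Fisher information of the mean parameter of N(theta, sigma^2)."""
    return 1.0 / sigma ** 2

def fisher_kernel(x_i, x_j, theta, sigma):
    """K(x_i, x_j) = U_i^T F^{-1} U_j, which reduces to a product of scalars here."""
    f_inv = 1.0 / fisher_information(sigma)
    return score(x_i, theta, sigma) * f_inv * score(x_j, theta, sigma)

theta, sigma = 0.0, 1.0   # assumed model parameters
x_train = 0.5             # a hypothetical training sample

# A test sample whose score points the same way as the training score
# gives a positive kernel value; one pulling the mean the other way gives
# a negative value.
k_in = fisher_kernel(x_train, 0.4, theta, sigma)
k_out = fisher_kernel(x_train, -3.0, theta, sigma)
print(k_in, k_out)
```

In this toy case the kernel is simply a scaled inner product of gradients of the log-likelihood, which is the sense in which it compares the model updates that two samples would induce.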
﻿
The distance between $x_{out}$ and $x_{in}$ is formulated as the reconstruction error and characterizes the abnormality of the data as shown in the above figure.
Method: Gradient Constraint
Modeling the normalcy of data is frequently used to separate inliers and outliers in the representation space. The irregularity is captured by the deviation from the normalcy model. Constraints imposed during training are frequently used to model normalcy. Normal data is easily restricted by the constraint, whereas abnormal data deviates.
The authors propose to train an autoencoder with a directional gradient constraint to model the normality. In particular, based on the interpretation of gradients from the Fisher kernel perspective, they enforce the alignment between gradients. This constraint makes the gradients from normal data align with each other and results in small changes to the manifold. On the other hand, the gradients from abnormal data will not be aligned with the others and will guide abrupt changes to the manifold.
The gradient loss is calculated by averaging the cosine similarity across all layers in the decoder at the kth iteration of training:
$$L_{grad} = -\mathbb{E}\left[\mathrm{cosSIM}\left(\frac{\partial J^{k-1}}{\partial \phi_i}_{avg}, \frac{\partial L^{k}}{\partial \phi_i}\right)\right], \qquad \frac{\partial J^{k-1}}{\partial \phi_i}_{avg} = \frac{1}{k-1}\sum_{t=1}^{k-1}\frac{\partial J^{t}}{\partial \phi_i}$$
$L_{grad}$ - Gradient loss (a regularization term in the entire loss function)
$\frac{\partial L^{k}}{\partial \phi_i}$ - The gradients of a certain layer i in the decoder at the kth iteration of training.
$\frac{\partial J^{k-1}}{\partial \phi_i}_{avg}$ - The average of the training gradients of the same layer i obtained until the (k-1)th iteration
$$J = L + \Omega + \alpha L_{grad}$$
The reconstruction error and latent loss are the first and second terms, respectively, and they are determined by different types of autoencoders. The gradient loss is given a weight called $\alpha$.
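The gradient loss and the combined objective can be sketched in a few lines. This is a simplified illustration and not the authors' implementation: gradients are represented as flat Python lists (one list per decoder layer), the function names are ours, and the numbers, including the value of alpha, are made up for the example. It also assumes no gradient vector is all zeros.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two flattened gradient vectors (assumed non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def gradient_loss(avg_grads, cur_grads):
    """L_grad: negative cosine similarity, averaged over the decoder layers."""
    sims = [cosine_similarity(a, g) for a, g in zip(avg_grads, cur_grads)]
    return -sum(sims) / len(sims)

def update_average(avg_grads, new_grads, k):
    """Running mean of per-layer gradients after including iteration k (1-indexed)."""
    return [[a + (g - a) / k for a, g in zip(avg, new)]
            for avg, new in zip(avg_grads, new_grads)]

def total_loss(recon, latent, grad_loss, alpha):
    """J = L + Omega + alpha * L_grad."""
    return recon + latent + alpha * grad_loss

# Two decoder "layers", each with a two-element gradient (illustrative values).
avg = [[1.0, 0.0], [0.0, 2.0]]
aligned = [[2.0, 0.0], [0.0, 1.0]]     # same direction as the running average
opposed = [[-1.0, 0.0], [0.0, -2.0]]   # opposite direction

print(gradient_loss(avg, aligned))   # lowest possible value: fully aligned
print(gradient_loss(avg, opposed))   # highest possible value: fully opposed
print(total_loss(0.5, 0.0, gradient_loss(avg, aligned), alpha=0.03))
```

Since the loss depends only on direction, rescaling a gradient leaves it unchanged; gradients that agree with the running average lower J, while gradients that point elsewhere raise it.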
Methodology
Model Descriptions
The paper uses a convolutional autoencoder (CAE) for GradCon. The encoder and the decoder are symmetric and consist of 4 convolutional layers, and the dimension of the latent variable is 3 x 3 x 64.
They also train four different autoencoders, which are CAE, CAE with the gradient constraint (CAE + Grad), VAE, VAE with the gradient constraint (VAE + Grad) for the baseline experiments. VAEs are trained using binary cross entropy as the reconstruction error and Kullback Leibler (KL) divergence as the latent loss.
Datasets
The paper utilizes four benchmark datasets:
CIFAR-10 - abnormal class detection, 60,000 color images in 10 classes.
MNIST - abnormal class detection, 70,000 handwritten digit images from 0 to 9.
Fashion-MNIST (fMNIST) - abnormal class detection, 10 classes of fashion products with 7,000 images per class.
CURE-TSR - abnormal condition detection, 637,560 color traffic sign images covering 14 traffic sign types under 5 levels of 12 different challenging conditions.
We reproduced the results on CIFAR-10 and MNIST because the code for these datasets was shared; due to computational limitations, we could not perform the remaining experiments.
Hyper-parameters
Throughout all our experiments, we used the same hyperparameters as in the paper. The authors clearly state all hyper-parameters to train the models in the experiments.
Computational requirements
To run the experiments, we used 4 NVIDIA Tesla P100 GPUs.
The integration of the existing code base with Weights & Biases and the modifications needed to run the code can be found here.
Results
We reproduced the GradCon model with a Convolutional Autoencoder (CAE) on the two datasets, MNIST and CIFAR-10. We plot the three losses obtained from the models below: the Loss (the Mean Squared Error, MSE, loss), the Reconstruction Loss from the autoencoder, and the Grad Loss as defined above.
Results on CIFAR-10
Anomaly detection AUROC results on CIFAR-10

Class	In the Paper	Run 1	Run 2
Plane	0.760	0.721	0.759
Car	0.598	0.472	0.526
Bird	0.648	0.632	0.606
Cat	0.586	0.594	0.587
Deer	0.733	0.725	0.702
Dog	0.603	0.568	0.519
Frog	0.684	0.689	0.695
Horse	0.567	0.522	0.538
Ship	0.784	0.782	0.750
Truck	0.678	0.451	0.529
Average	0.664	0.616	0.621
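As a quick sanity check on the CIFAR-10 table, the averages and the gap to the paper can be recomputed from the per-class values. The snippet is only arithmetic on the numbers listed above; the variable names are ours.

```python
# Per-class AUROC values copied from the CIFAR-10 table above
# (Plane, Car, Bird, Cat, Deer, Dog, Frog, Horse, Ship, Truck).
paper = [0.760, 0.598, 0.648, 0.586, 0.733, 0.603, 0.684, 0.567, 0.784, 0.678]
run_1 = [0.721, 0.472, 0.632, 0.594, 0.725, 0.568, 0.689, 0.522, 0.782, 0.451]
run_2 = [0.759, 0.526, 0.606, 0.587, 0.702, 0.519, 0.695, 0.538, 0.750, 0.529]

def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

for name, values in [("Paper", paper), ("Run 1", run_1), ("Run 2", run_2)]:
    print(f"{name}: {mean(values):.3f}")   # Paper: 0.664, Run 1: 0.616, Run 2: 0.621

# Gap between the reported average and each of our runs, in AUROC points.
gap_1 = mean(paper) - mean(run_1)
gap_2 = mean(paper) - mean(run_2)
print(f"{100 * gap_1:.2f} {100 * gap_2:.2f}")
```

The recomputed averages agree with the Average row, and the gap to the paper is between 4 and 5 AUROC points for both runs.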
﻿
Results on MNIST
Anomaly detection AUROC results on MNIST

Class	In the Paper	Run 1	Run 2
0	0.995	0.996	0.996
1	0.999	0.999	0.999
2	0.952	0.933	0.924
3	0.937	0.954	0.958
4	0.969	0.568	0.566
5	0.977	0.955	0.961
6	0.994	0.472	0.471
7	0.979	0.663	0.633
8	0.919	0.900	0.896
9	0.973	0.577	0.582
Average	0.973	0.802	0.799
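For readers less familiar with the metric in the tables above, AUROC can be read as the probability that a randomly chosen anomaly receives a higher anomaly score than a randomly chosen normal sample. The sketch below uses that standard pairwise definition; it is not how the original code computes the metric, and the scores and labels are invented for illustration.

```python
def auroc(scores, labels):
    """Pairwise AUROC: fraction of (anomaly, normal) pairs ranked correctly.

    labels use 1 for anomalies and 0 for normal samples; ties count as half.
    This is O(n^2) and only meant for small illustrative inputs.
    """
    anomalies = [s for s, y in zip(scores, labels) if y == 1]
    normals = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for a in anomalies:
        for n in normals:
            if a > n:
                wins += 1.0
            elif a == n:
                wins += 0.5
    return wins / (len(anomalies) * len(normals))

# Hypothetical anomaly scores: one normal sample (0.8) outranks one anomaly (0.35).
scores = [0.1, 0.2, 0.35, 0.8, 0.9]
labels = [0, 0, 1, 0, 1]
print(auroc(scores, labels))   # 5 of 6 pairs are ranked correctly

# Identical scores carry no ranking information and give chance level.
print(auroc([0.5, 0.5, 0.5, 0.5], [0, 1, 0, 1]))   # 0.5
```

Under this reading, a value near 0.5 means the score ranks anomalies no better than chance, and a value below 0.5 means anomalies tend to receive lower scores than normal samples.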
﻿
Visualizing the losses for CIFAR and MNIST Datasets
The plots below show the losses obtained during the training of GradCon, where the x-axis is the number of steps and the y-axis is the value of the loss.
﻿
[Plots: Run set 2]
Tags: Intermediate, Computer Vision, Classification, Research, GradCon, Github, Plots, CIFAR10, MNIST, RC