A reproduction of the paper 'Backpropagated Gradient Representations for Anomaly Detection' by Kwon et al., published in the Proceedings of the European Conference on Computer Vision (ECCV) 2020, for the Reproducibility Challenge 2021.

In this report, we attempt to reproduce the paper 'Backpropagated Gradient Representations for Anomaly Detection' by Kwon et al., published in the Proceedings of the European Conference on Computer Vision (ECCV 2020). We cover each aspect of reproducing the results and claims put forth in the paper.

The paper proposes a novel approach that uses gradient-based representations to achieve state-of-the-art anomaly detection performance on benchmark image recognition datasets. Although we were able to reproduce the results after some code fixes, doing so was computationally expensive.

The paper proposes utilizing backpropagated gradients as representations to characterize anomalies. The authors highlight the computational efficiency and simplicity of the proposed method compared with other state-of-the-art methods that rely on adversarial networks or autoregressive models, which require at least 27 times more model parameters than the proposed method. Using these gradient-based representations, the authors construct an anomaly detection algorithm and show that it outperforms state-of-the-art algorithms that use activation-based representations.

We used the code provided by the authors in their GitHub repository. We fixed the dependencies and the code, and have uploaded our implementation here as well. Total training time for each dataset (MNIST and CIFAR-10) was approximately 5-8 hours on four NVIDIA Tesla P100 GPUs. Further details are presented in run set 2. We used Weights & Biases as our default logging and tracking tool to store all results in a single dashboard.

Due to high computational requirements, we were not able to replicate all of the experiments presented in the paper. We reproduced the results for the convolutional autoencoder on the two datasets, MNIST and CIFAR-10. The results we obtained do not exactly match those reported in the paper; one possible reason for the discrepancy is the random seed, which the authors did not provide.

The paper was understandable and its structure was easy to follow. The mathematical formulation, together with the theoretical exposition, made the method straightforward to re-derive. In addition, the authors provided the original implementation, which ran with very few modifications.

It was difficult to run the code on our available hardware (an NVIDIA GTX 1060), so we had to switch to more expensive compute. Our results did not match those in the paper, and the random seed was not provided by the authors. It was therefore difficult to verify the claim that the proposed method outperforms the state of the art.

The authors have so far been unresponsive to our emails concerning the discrepancies (approximately 5%) between our results and the claims made in the paper, as well as our requests for the exact training code needed to ensure reproducibility. A GitHub issue we raised was also not satisfactorily addressed.

This reproducibility submission is an effort to validate the ECCV 2020 paper by Kwon et al. (2020), titled 'Backpropagated Gradient Representations for Anomaly Detection.' This report evaluates the paper's central claim: that an anomaly detection algorithm using gradient-based representations outperforms state-of-the-art algorithms using activation-based representations. We provide evidence of discrepancies between our results and those reported in the paper.

By analyzing model behavior on anomalies, the authors suggest adopting a gradient-based representation for anomaly identification. They present a geometric interpretation of gradients and construct an anomaly score based on gradient deviation from the directional constraint.

Compared to activation-based representations, the authors claim that gradient-based representations are more effective for detecting anomalies. Furthermore, they claim that the proposed anomaly detection method, GradCon, which combines the reconstruction error with the gradient loss, achieves state-of-the-art performance on image recognition datasets.

The authors also claim that, in terms of computational efficiency, GradCon has fewer model parameters and a faster inference time than other state-of-the-art anomaly detection methods.

The major contributions as listed in the paper are:

- The authors propose utilizing backpropagated gradients as representations to characterize anomalies.
- The authors validate the representation capability of gradients for anomaly detection in comparison with activation through comprehensive baseline experiments.
- The authors propose an anomaly detection algorithm using gradient-based representations and show that it outperforms state-of-the-art algorithms using activation-based representations.

In activation-based representations, the anomaly is characterized by distance information measured with a specified loss function. Gradients, on the other hand, provide directional information, indicating how the manifold on which the data representations lie would move. This movement describes the direction in which the abnormal data distribution deviates from the normal representations.

Furthermore, gradients obtained from several layers provide a complementary viewpoint, relative to the current representation of normal data, for describing abnormalities. As a result, the directional information from gradients can be combined with the distance information from activations to provide additional information.

Gradient-based representations describe model updates from query data and distinguish normal from abnormal data using the Fisher kernel.

An autoencoder configuration is used, but the encoder and decoder are treated as probability distributions.

Given the latent variable z, the decoder models the input distribution through a conditional distribution, P(x|z). The autoencoder is trained to minimize the negative log-likelihood, -log P(x|z). When x is real-valued and P(x|z) is assumed to be a Gaussian distribution, the decoder estimates the mean of the Gaussian, and minimizing the negative log-likelihood corresponds to using a mean squared error as the reconstruction error. When x is binary-valued, P(x|z) is assumed to be a Bernoulli distribution, and the negative log-likelihood becomes a binary cross-entropy loss. Treating the decoder as a conditional probability makes it possible to interpret the gradients using the Fisher kernel.
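To make this correspondence concrete, here is a small sketch (assuming a unit-variance Gaussian; all names and values are illustrative, not the authors' code) showing that the Gaussian negative log-likelihood reduces to a scaled squared error and the Bernoulli one to binary cross-entropy:

```python
import math

def gaussian_nll(x, mu, sigma=1.0):
    # Negative log-likelihood of x under N(mu, sigma^2)
    return 0.5 * math.log(2 * math.pi * sigma**2) + (x - mu) ** 2 / (2 * sigma**2)

def bernoulli_nll(x, p):
    # Negative log-likelihood of binary x under Bernoulli(p), i.e. BCE
    return -(x * math.log(p) + (1 - x) * math.log(1 - p))

# Up to an additive constant, the Gaussian NLL is half the squared error,
# so minimizing it is equivalent to minimizing the MSE reconstruction error.
x, mu = 0.8, 0.5
const = 0.5 * math.log(2 * math.pi)
assert abs((gaussian_nll(x, mu) - const) - 0.5 * (x - mu) ** 2) < 1e-12
```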

Fisher kernels allow discriminant characteristics to be extracted from generative models, and they have been utilized in various applications including image classification and action recognition.

The Fisher kernel is used to quantify the distance between training data and normal test data, as well as between training data and abnormal test data, so that the distribution learned on the training data generalizes to the test data.

The Fisher kernels for normal data (inliers), K^{in}_{FK}, and abnormal data (outliers), K^{out}_{FK}, are defined as follows, respectively:

K^{in}_{FK}(X_{tr}, X_{te,in}) = U^{X_{tr}\,T}_{\phi} F^{-1} U^{X_{te,in}}_{\phi}

K^{out}_{FK}(X_{tr}, X_{te,out}) = U^{X_{tr}\,T}_{\phi} F^{-1} U^{X_{te,out}}_{\phi}

where X_{tr}, X_{te,in}, and X_{te,out} are the training data, normal test data, and abnormal test data, respectively, U^{X}_{\phi} is the Fisher score (the gradient of the log-likelihood with respect to the decoder parameters \phi), and F is the Fisher information matrix.
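As an illustration, the kernels above can be sketched with a diagonal approximation of the Fisher information matrix F (a common simplification, not something the paper specifies); all vectors here are toy values:

```python
# Sketch of the Fisher kernel K_FK = U_tr^T F^{-1} U_te, assuming a
# diagonal approximation of the Fisher information matrix F.
# All vectors below are toy values, not gradients from a real model.

def fisher_kernel(u_tr, u_te, fisher_diag):
    # With diagonal F, F^{-1} just rescales each coordinate.
    return sum(a * b / f for a, b, f in zip(u_tr, u_te, fisher_diag))

u_train = [0.2, -0.1, 0.4]        # Fisher score of the training data
u_test_in = [0.19, -0.12, 0.38]   # normal test data: similar gradients
u_test_out = [-0.3, 0.5, -0.1]    # abnormal test data: dissimilar gradients
f_diag = [0.05, 0.02, 0.08]       # diagonal of the Fisher information

k_in = fisher_kernel(u_train, u_test_in, f_diag)
k_out = fisher_kernel(u_train, u_test_out, f_diag)
# The inlier kernel value is larger because the gradient directions agree.
assert k_in > k_out
```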

The distance between x_{out} and x_{in} is formulated as the reconstruction error and characterizes the abnormality of the data, as shown in the figure above.

Modeling the normality of data is frequently used to separate inliers and outliers in the representation space, with abnormality captured as deviation from the normality model. Normality is often modeled through constraints imposed during training: normal data readily satisfies the constraint, whereas abnormal data deviates from it.

The authors train an autoencoder with a directional gradient constraint to model normality. In particular, based on the interpretation of gradients from the Fisher kernel perspective, they enforce alignment between gradients. This constraint keeps the gradients from normal data aligned with each other, resulting in small changes to the manifold; the gradients from abnormal data, by contrast, will not be aligned with the others and will drive abrupt changes to the manifold.

The gradient loss at the k-th iteration of training is the negative cosine similarity between the current gradients and their running average, averaged over all layers in the decoder:

L_{grad} = -\mathbb{E}\left[\mathrm{cosSIM}\left(\frac{\partial J^{k-1}}{\partial \phi_{i_{avg}}}, \frac{\partial L^{k}}{\partial \phi_{i}}\right)\right], \qquad \frac{\partial J^{k-1}}{\partial \phi_{i_{avg}}} = \frac{1}{k-1}\sum_{t=1}^{k-1}\frac{\partial J^{t}}{\partial \phi_{i}}

- L_{grad} - the gradient loss (a regularization term in the overall objective)
- \frac{\partial L^{k}}{\partial \phi_{i}} - the gradient of a given layer i in the decoder at the k-th iteration of training
- \frac{\partial J^{k-1}}{\partial \phi_{i_{avg}}} - the average of the training gradients of the same layer i accumulated up to the (k-1)-th iteration
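The loss above can be sketched in a few lines; the flattened per-layer gradient vectors, the incremental running-average update, and all values are illustrative:

```python
import math

# Sketch of the gradient loss L_grad over flattened per-layer gradients.

def cos_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def gradient_loss(avg_grads, cur_grads):
    # L_grad = -E[cosSIM(average gradient up to k-1, gradient at k)],
    # with the expectation taken over the decoder layers.
    sims = [cos_sim(g_avg, g_k) for g_avg, g_k in zip(avg_grads, cur_grads)]
    return -sum(sims) / len(sims)

def update_running_average(avg_grad, new_grad, k):
    # Incremental form of (1/(k-1)) * sum_{t=1}^{k-1} grad^t.
    return [(a * (k - 2) + g) / (k - 1) for a, g in zip(avg_grad, new_grad)]

avg = [[1.0, 0.0], [0.0, 1.0]]    # per-layer averaged gradients
cur = [[0.9, 0.1], [0.05, 0.95]]  # current-iteration gradients
# Aligned gradients give cosine similarity near 1, hence a loss near -1.
assert gradient_loss(avg, cur) < -0.9
```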

J = L + \Omega + \alpha L_{grad}

The first and second terms are the reconstruction error L and the latent loss \Omega, respectively; these depend on the type of autoencoder. The gradient loss is weighted by \alpha.

The paper uses a convolutional autoencoder (CAE) for GradCon. The encoder and decoder are symmetric, each consisting of 4 convolutional layers, and the dimension of the latent variable is 3 × 3 × 64.

They also train four different autoencoders for the baseline experiments: CAE, CAE with the gradient constraint (CAE + Grad), VAE, and VAE with the gradient constraint (VAE + Grad). The VAEs are trained using binary cross-entropy as the reconstruction error and Kullback-Leibler (KL) divergence as the latent loss.
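For reference, the VAE objective used in these baselines can be written down directly; the sketch below uses a scalar latent variable and illustrative values, with the closed-form KL divergence to a standard normal prior:

```python
import math

# Sketch of the VAE baseline objective: binary cross-entropy
# reconstruction error plus a KL latent loss. The scalar latent
# variable and all inputs are illustrative.

def bce(x, x_hat):
    return -(x * math.log(x_hat) + (1 - x) * math.log(1 - x_hat))

def kl_to_standard_normal(mu, log_var):
    # Closed form of KL( N(mu, sigma^2) || N(0, 1) )
    return -0.5 * (1 + log_var - mu**2 - math.exp(log_var))

def vae_loss(x, x_hat, mu, log_var):
    return bce(x, x_hat) + kl_to_standard_normal(mu, log_var)

# A latent that matches the prior (mu = 0, sigma = 1) adds zero KL loss.
assert kl_to_standard_normal(0.0, 0.0) == 0.0
```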

The paper utilizes four benchmark datasets:

- CIFAR-10 - abnormal class detection; 60,000 color images in 10 classes.
- MNIST - abnormal class detection; 70,000 images of handwritten digits 0 to 9.
- Fashion-MNIST (fMNIST) - abnormal class detection; 10 classes of fashion products, with 7,000 images per class.
- CURE-TSR - abnormal condition detection; 637,560 color traffic sign images covering 14 traffic sign types under 5 levels of 12 different challenging conditions.

We reproduced the experiments on CIFAR-10 and MNIST, for which the code was shared; due to computational limitations, we could not perform the remaining experiments.

Throughout all our experiments, we used the same hyperparameters as in the paper; the authors clearly state all hyperparameters needed to train the models.

For our experimental setup, we used 4 NVIDIA Tesla P100 GPUs.

The integration of the existing codebase with Weights & Biases and the modifications needed to run the code can be found here.

We reproduced the GradCon model with the convolutional autoencoder (CAE) on the two datasets, MNIST and CIFAR-10. Below we plot the three losses obtained from the models: the total (MSE) loss, the reconstruction loss from the autoencoder, and the gradient loss defined above.
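As described earlier, GradCon scores a sample by combining its reconstruction error with the gradient loss. A minimal sketch of such a combined score follows; the weight alpha and all numbers are illustrative, not the paper's values:

```python
# Sketch of a GradCon-style anomaly score: reconstruction error plus a
# weighted gradient loss. alpha and all values are illustrative.

def anomaly_score(recon_error, grad_loss, alpha=0.03):
    # Poor reconstruction and misaligned gradients (grad_loss closer to
    # 0 or positive) both increase the score.
    return recon_error + alpha * grad_loss

normal = anomaly_score(0.01, -0.98)    # well reconstructed, aligned gradients
abnormal = anomaly_score(0.09, -0.35)  # poorly reconstructed, misaligned
assert abnormal > normal
```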

Per-class results on CIFAR-10:

Class | In the Paper | Run 1 | Run 2 |
---|---|---|---|
Plane | 0.760 | 0.721 | 0.759 |
Car | 0.598 | 0.472 | 0.526 |
Bird | 0.648 | 0.632 | 0.606 |
Cat | 0.586 | 0.594 | 0.587 |
Deer | 0.733 | 0.725 | 0.702 |
Dog | 0.603 | 0.568 | 0.519 |
Frog | 0.684 | 0.689 | 0.695 |
Horse | 0.567 | 0.522 | 0.538 |
Ship | 0.784 | 0.782 | 0.750 |
Truck | 0.678 | 0.451 | 0.529 |
Average | 0.664 | 0.616 | 0.621 |

Per-class results on MNIST:

Class | In the Paper | Run 1 | Run 2 |
---|---|---|---|
0 | 0.995 | 0.996 | 0.996 |
1 | 0.999 | 0.999 | 0.999 |
2 | 0.952 | 0.933 | 0.924 |
3 | 0.937 | 0.954 | 0.958 |
4 | 0.969 | 0.568 | 0.566 |
5 | 0.977 | 0.955 | 0.961 |
6 | 0.994 | 0.472 | 0.471 |
7 | 0.979 | 0.663 | 0.633 |
8 | 0.919 | 0.900 | 0.896 |
9 | 0.973 | 0.577 | 0.582 |
Average | 0.973 | 0.802 | 0.799 |

The plots below show the losses obtained during the training of GradCon, where the x-axis is the number of steps and the y-axis is the value of the loss.