Latent Embedding Feedback and Discriminative Features for Zero-Shot Classification

A reproduction of the paper "Latent Embedding Feedback and Discriminative Features for Zero-Shot Classification" by Narayan et al. (2020), accepted to ECCV 2020.
Animesh Gupta
The original version of this report was published on OpenReview as part of the Papers With Code 2021 Reproducibility Challenge. Our documentation, together with all the code needed to reproduce our results, can be found here.

Reproducibility Summary

In this study, we report our results and experience in replicating the paper titled "Latent Embedding Feedback and Discriminative Features for Zero-Shot Classification" (Narayan et al. [2020]). We updated the model to a recent PyTorch version and were able to reproduce both the quantitative and qualitative results reported in the paper, covering the inductive setting, fine-tuning, and reconstruction of the original images from synthesized features. The authors have open-sourced their code for the inductive setting; we implemented the code for the fine-tuning setting and for image reconstruction ourselves.

Scope of Reproducibility

TF-VAEGAN (Narayan et al. [2020]) proposes to enforce a semantic embedding decoder (SED) at the training, feature synthesis, and classification stages of (generalized) zero-shot learning. The authors introduce a feedback loop from the SED that iteratively refines the synthesized features during both the training and feature synthesis stages. The synthesized features, along with their corresponding latent embeddings from the SED, are then transformed into discriminative features and utilized during the classification stage to reduce ambiguities among the categories.

Methodology

As the TF-VAEGAN method was only available in PyTorch 0.3.1, we ported the entire pipeline to PyTorch 1.6.0 and implemented the fine-tuning and reconstruction code from scratch. Our implementation is based on the original code and on discussions with the authors. Total training times ranged from 2 to 8 hours on Caltech-UCSD Birds (Welinder et al. [2010]) (CUB), Oxford Flowers (Nilsback et al. [2008]) (FLO), SUN Attribute (Patterson et al. [2012]) (SUN), and Animals with Attributes 2 (Xian et al. [2018]) (AWA2) on a single NVIDIA Tesla V100 GPU. Further details are presented in Table 2.

Results

We were able to reproduce the quantitative results reported in the original paper on each of the four datasets. Additionally, we were able to reconstruct the original images from the generated features.

What was easy

The authors’ code was well written and documented, and we were able to reproduce the preliminary results using the documentation provided with the code. The authors were also extremely responsive and helpful via email.

What was difficult

The feature reconstruction code from Dosovitskiy et al. [2016] and Mahendran et al. [2015] is not available in PyTorch, so we had to implement it in PyTorch and run a hyperparameter search to obtain the images. We also performed a hyperparameter search to obtain the fine-tuning results.

Communication with original authors

We reached out to the authors a few times via email to ask for clarifications and additional implementation details.

Report

1. Introduction

Zero-shot learning (ZSL) is a challenging vision task that involves classifying images into new "unseen" categories at test time, without having been provided any corresponding visual examples during training. In the generalized variant (GZSL), the test samples can belong to either seen or unseen categories. Most recent work in ZSL and GZSL recognition (Xian et al. [2018], Felix et al. [2018], Xian et al. [2019]) is based on Generative Adversarial Networks (GANs), where a generative model is learned from the seen-class feature instances and the corresponding class-specific semantic embeddings.
Feature instances of the unseen categories, whose real features are unavailable during training, are then synthesized using the trained GAN and used along with the real feature instances from the seen categories to train zero-shot classifiers in a fully supervised setting.
In this reproducibility report, we study the work proposed by Narayan et al. [2020] in detail: we implement the architecture described in the paper, run the experiments, report important details about issues encountered during reproduction, and compare the obtained results with those reported in the original paper. We report our numbers on seen accuracy, unseen accuracy, and harmonic mean in Table 4.

2. Scope of Reproducibility

The core finding of the paper is that utilizing a semantic embedding decoder (SED) at all stages (i.e., training, feature synthesis, and classification) of a VAE-GAN based ZSL framework yields absolute gains of 4.6\%, 7.1\%, 1.7\%, and 3.1\% over the baseline on Caltech-UCSD Birds (CUB), Oxford Flowers (FLO), SUN Attribute (SUN), and Animals with Attributes 2 (AWA2), respectively, for generalized zero-shot (GZSL) object recognition. To achieve this, the authors introduce two components: a discriminative feature transformation and a feedback module (described in Sections 3.1.2 and 3.1.3).
To provide an effective re-implementation, we make sure that the quantitative results are reproduced with only marginal errors, which may be caused by porting the code to newer PyTorch, TorchVision, and CUDA Toolkit versions. We also verify that our visual results look similar to those presented in the original paper.

3. Methodology

First, we ran the authors' publicly available code on all four datasets (CUB, FLO, SUN, and AWA2) to obtain preliminary results. Second, we ported the publicly available code to a recent PyTorch version and verified that it produces results on par with the original code. Third, we used the fine-tuned features shared by the authors to train the inductive method and obtain the fine-tuned results, together with a hyperparameter search. Lastly, we implemented the code for reconstructing images from the synthesized features. Furthermore, we integrated the WandB (Biewald et al. [2020]) library into the training loop to track our experiments during training.

3.1 TF-VAEGAN

The TF-VAEGAN architecture is a VAE-GAN based network with an additional semantic embedding decoder (SED) Dec at both the feature synthesis and (G)ZSL classification stages. The authors introduce a feedback module F, which is used during the training and feature synthesis stages together with the decoder Dec. We employ the same architecture as the authors, in which the VAE-GAN consists of an Encoder E, a Generator G, and a Discriminator D.
Real features of the seen classes x and the semantic embeddings a are input to E, which outputs the parameters of a noise distribution. A KL divergence loss \mathcal{L}_{KL} is applied between these parameters and a zero-mean unit-variance Gaussian prior. The network G synthesizes features \hat{x} from noise z and embeddings a. Further, a binary cross-entropy loss \mathcal{L}_{BCE} is applied between the synthesized features \hat{x} and the original features x.
The discriminator D takes either x or \hat{x} together with the embeddings a as input and outputs a real number, thus determining whether the input is real or fake. The WGAN loss \mathcal{L}_{W} is applied at the output of D, which learns to distinguish between real and fake features. The architecture design focuses on integrating the additional semantic embedding decoder (SED) Dec at both the feature synthesis and (G)ZSL classification stages, and on using the feedback module F, along with Dec, during training and feature synthesis.
Together, Dec and F address the objectives of enhanced feature synthesis and reduced ambiguity among categories during classification. Dec takes either x or \hat{x} and reconstructs the embeddings \hat{a}; it is trained using a cycle-consistency loss \mathcal{L}_R. The learned Dec is subsequently used in the (G)ZSL classifiers. The feedback module F transforms the latent embedding of Dec and feeds it back into the latent representation of G.
Figure 1: Representation of the TF-VAEGAN architecture. Obtained from the paper Narayan et al. [2020].
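To make the description above concrete, below is a minimal PyTorch sketch of the three VAE-GAN components (E, G, D). Layer widths and activations follow Section 4.3, the attribute dimensionality corresponds to CUB (312-d), and all class and argument names are our own illustration rather than the authors' code; the optional feedback argument of G anticipates the module F described in Section 3.1.3.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """E(x, a) -> (mu, log_var) of the latent noise distribution."""
    def __init__(self, feat_dim=2048, attr_dim=312, hidden=4096):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(feat_dim + attr_dim, hidden),
                                nn.LeakyReLU(0.2, inplace=True))
        self.mu = nn.Linear(hidden, attr_dim)        # z has the same dimension as a
        self.log_var = nn.Linear(hidden, attr_dim)

    def forward(self, x, a):
        h = self.fc(torch.cat([x, a], dim=1))
        return self.mu(h), self.log_var(h)

class Generator(nn.Module):
    """G(z, a) -> synthesized feature x_hat (Sigmoid output for the BCE loss)."""
    def __init__(self, attr_dim=312, hidden=4096, feat_dim=2048):
        super().__init__()
        self.fc1 = nn.Linear(attr_dim + attr_dim, hidden)
        self.fc2 = nn.Linear(hidden, feat_dim)
        self.act = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, z, a, feedback=None, delta=1.0):
        g = self.act(self.fc1(torch.cat([z, a], dim=1)))
        if feedback is not None:                     # additive modulation from module F
            g = g + delta * feedback
        return torch.sigmoid(self.fc2(g))

class Discriminator(nn.Module):
    """D(x, a) -> real-valued critic score (WGAN)."""
    def __init__(self, feat_dim=2048, attr_dim=312, hidden=4096):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim + attr_dim, hidden),
                                 nn.LeakyReLU(0.2, inplace=True),
                                 nn.Linear(hidden, 1))

    def forward(self, x, a):
        return self.net(torch.cat([x, a], dim=1))
```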

3.1.1 Semantic Embedding Decoder

The authors introduce a semantic embedding decoder Dec:\mathcal{X} \rightarrow \mathcal{A}, which reconstructs the semantic embeddings a from the generated features \hat{x}. This enforces a cycle-consistency on the reconstructed semantic embeddings, ensuring that the generated features are mapped back to the same embeddings that generated them. As a result, semantically consistent features are obtained during feature synthesis. The cycle-consistency of the semantic embeddings is achieved using an \ell_1 reconstruction loss:
\mathcal{L}_{R} = \mathbb{E}[||Dec(x) - a||_1] + \mathbb{E}[||Dec(\hat{x}) - a||_1]
The loss formulation for training the proposed TF-VAEGAN can be defined as:
\mathcal{L}_{total} = \mathcal{L}_{vaegan} + \beta \mathcal{L}_R
where \beta is a hyper-parameter weighting the decoder reconstruction error. The authors utilize the SED at all three stages of the VAE-GAN based ZSL pipeline: training, feature synthesis, and classification.
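As an illustration, a minimal sketch of Dec and the cycle-consistency term \mathcal{L}_R is given below, assuming a two-layer decoder whose hidden layer exposes the latent embedding h that is reused later by the feedback module and the classifiers (class and function names are ours, not the authors').

```python
import torch.nn as nn

class SemanticDecoder(nn.Module):
    """Dec: X -> A; also exposes the hidden embedding h reused by F and the classifiers."""
    def __init__(self, feat_dim=2048, attr_dim=312, hidden=4096):
        super().__init__()
        self.fc1 = nn.Linear(feat_dim, hidden)
        self.fc2 = nn.Linear(hidden, attr_dim)
        self.act = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x):
        h = self.act(self.fc1(x))
        return self.fc2(h), h

def cycle_consistency_loss(dec, x_real, x_fake, a):
    """L_R = E[||Dec(x) - a||_1] + E[||Dec(x_hat) - a||_1]."""
    a_real, _ = dec(x_real)
    a_fake, _ = dec(x_fake)
    return ((a_real - a).abs().sum(dim=1).mean()
            + (a_fake - a).abs().sum(dim=1).mean())

# Total objective: L_total = L_vaegan + beta * L_R
```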

3.1.2 Discriminative feature transformation

Next, the authors introduce a discriminative feature transformation scheme to effectively utilize the auxiliary information in the semantic embedding decoder (SED) at the ZSL classification stage. The generator G learns a per-class "single semantic embedding to many instances" mapping using only the seen class features and embeddings.
The SED is likewise trained using only the seen classes, but learns the inverse, per-class "many instances to one embedding" mapping. Thus, the generator G and the SED Dec are likely to encode complementary information about the categories. The authors therefore propose to use the latent embedding from the SED as an additional source of information at the classification stage. A detailed overview of the architecture is presented in Figure 2.
Figure 2: Integration of SED: Taken from the original paper Narayan et al. [2020]. The authors used the decoder at the ZSL/GZSL classification stage.
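A short sketch of this transformation, reusing the SemanticDecoder from the previous sketch: the Dec hidden embedding h is concatenated with the visual feature before the final single-layer classifier. The helper below is hypothetical and not part of the released code.

```python
import torch
import torch.nn as nn

def transform_features(dec, x):
    """Concatenate visual features with the Dec latent embedding: x (+) h."""
    with torch.no_grad():
        _, h = dec(x)                    # dec returns (reconstructed embedding, hidden h)
    return torch.cat([x, h], dim=1)

# Final ZSL/GZSL classifier: a single FC layer over the transformed features
# (2048-d visual feature concatenated with the 4096-d Dec hidden embedding).
num_test_classes = 50                    # e.g. the unseen classes of CUB
classifier = nn.Linear(2048 + 4096, num_test_classes)
```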

3.1.3 Feedback Module

Lastly, the authors introduce a feedback loop for iteratively refining the generated features during both the training and feature synthesis phases. The feedback loop connects the semantic embedding decoder Dec and the generator G via the feedback module F (see Fig. 1). The module F enables the effective utilization of Dec during both the training and feature synthesis stages. Let g^l denote the output of the l^{th} layer of G and \hat{x}^f denote the feedback component that additively modulates g^l. The feedback modulation of the output g^l is given by
g^l \leftarrow g^l + \delta \hat{x}^f
where \hat{x}^f = F(h), with h the latent embedding of Dec, and \delta controls the feedback modulation. The authors base their feedback loop on Sharma et al. [2019], which introduces a similar feedback module for the task of image super-resolution. They make the necessary modifications to the module in order to use it for zero-shot recognition, since a naive plug-and-play of the module gives sub-optimal performance. A detailed overview of the feedback module is given in Figure 3.
Figure 3: Feedback module brief: Taken from the original paper Narayan et al. [2020].
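A minimal sketch of the feedback module under these assumptions (two FC layers with 4096 units, matching the hidden sizes of Dec and G stated in Section 4.3; the class is our illustration, not the authors' code):

```python
import torch.nn as nn

class FeedbackModule(nn.Module):
    """F: maps the Dec latent embedding h to the feedback signal x_hat^f."""
    def __init__(self, hidden=4096):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(hidden, hidden),
                                 nn.LeakyReLU(0.2, inplace=True),
                                 nn.Linear(hidden, hidden))

    def forward(self, h):
        return self.net(h)

# Used inside the generator (see the sketch in Section 3.1):  g_l = g_l + delta * F(h)
```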

3.2 Datasets

We evaluated the TF-VAEGAN method on four standard zero-shot object recognition datasets: Caltech-UCSD Birds (Welinder et al. [2010]) (CUB), Oxford Flowers (Nilsback et al. [2008]) (FLO), SUN Attribute (Patterson et al. [2012]) (SUN), and Animals with Attributes 2 (Xian et al. [2018]) (AWA2), containing 200, 102, 717, and 50 categories, respectively. CUB contains 11,788 images of 200 different bird species annotated with 312 attributes.
SUN contains 14,340 images of 717 scenes annotated with 102 attributes. The FLO dataset has 8,189 images of 102 different types of flowers, without attribute annotations. Finally, AWA2 is a coarse-grained dataset with 30,475 images, 50 classes, and 85 attributes. We use the same splits as the original paper for AWA2, CUB, FLO, and SUN, ensuring that none of the test classes is present in ImageNet (Deng et al. [2009]). Statistics of the datasets are presented in Table 2.
Dataset Attributes Seen classes Unseen classes Training Time (h) Memory (GB)
CUB 312 100 + 50 50 4 2.6
FLO - 62 + 20 20 5.5 3.1
SUN 102 580 + 65 72 7 2.6
AWA2 85 27 + 13 10 2.75 2.6

Table 2: Statistics of the CUB, SUN, FLO, and AWA2 datasets: number of attributes per class (Attributes), number of training + validation classes (Seen classes), and number of test classes (Unseen classes). Total training time (hours) and GPU memory consumption (GB) for each dataset are also reported.

3.3 Fine-tuning

For the fine-tuning results, we use the same approach as discussed in the original paper. We take a ResNet-101 (He et al. [2016]) pre-trained on ImageNet-1k (Deng et al. [2009]) and fine-tune its last layer on the seen-class training data of CUB, AWA2, FLO, and SUN, respectively. The fine-tuned network is then used to extract the seen visual features on which the TF-VAEGAN method is trained.
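The sketch below illustrates this step under our assumptions: a torchvision ResNet-101 pre-trained on ImageNet-1k whose final layer is replaced and trained on the seen classes, with an Adam optimizer whose settings are our guess rather than the paper's.

```python
import torch
import torch.nn as nn
from torchvision import models

num_seen_classes = 150                    # e.g. CUB: 100 train + 50 val seen classes

# ImageNet-1k pre-trained backbone; only the final classification layer is re-trained here.
model = models.resnet101(pretrained=True)
for p in model.parameters():
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, num_seen_classes)   # new layer, trainable

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-4)   # assumed settings
criterion = nn.CrossEntropyLoss()

# ...train model.fc on seen-class images, then extract the 2048-d average-pooled
# features (the input to model.fc) for training TF-VAEGAN.
```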

3.4 Reconstruction

We follow a strategy similar to Narayan et al. [2020] and use an upconvolutional neural network to invert feature embeddings back to image pixel space. A generator consisting of a fully connected layer followed by 5 upconvolutional blocks is used for the reconstruction task. Each upconvolutional block is built from an upsampling layer, a 3x3 convolution, BatchNorm, and a ReLU non-linearity. We reconstruct images at a resolution of 64x64.
A discriminator then processes the image through 4 downsampling blocks; the feature embedding is passed through a linear layer, spatially replicated, and concatenated with the image embedding, and this combined embedding is passed through a convolutional layer and a sigmoid to obtain the probability that the sample is real.
We use an L1 loss between the ground-truth image and the inverted image, along with a perceptual loss: both images are passed through a pre-trained ResNet-101 and an L2 loss is computed between the feature vectors at the conv5_4 and average-pooling layers. To further improve image quality, we add an adversarial loss by feeding the image and the feature embedding to the discriminator. We train this model on all real feature-image pairs of the 102 classes of the FLO (Nilsback et al. [2008]) dataset and use the trained generator to invert synthetic features into images. Reconstructed images can be seen in Figure 5.
Figure 4: Feature reconstruction results. Taken from the original paper (Narayan et al. [2020])
Figure 5: Our reproduced feature reconstruction results. The figure shows that the TF-VAEGAN method generates features that are more visually similar to the ground truth, and that the feedback module in particular yields features that are closer to the ground truth in terms of colors and shape.
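For reference, the following is a minimal sketch of the inversion generator described in Section 3.4 (one fully connected layer followed by five upconvolutional blocks producing a 64x64 image); the channel widths and the Tanh output are our assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

def upconv_block(in_ch, out_ch):
    """Upsampling -> 3x3 convolution -> BatchNorm -> ReLU, as described above."""
    return nn.Sequential(nn.Upsample(scale_factor=2, mode='nearest'),
                         nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                         nn.BatchNorm2d(out_ch),
                         nn.ReLU(inplace=True))

class FeatureInverter(nn.Module):
    """Maps a 2048-d feature to a 64x64 RGB image via an FC layer and 5 upconv blocks."""
    def __init__(self, feat_dim=2048):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 512 * 2 * 2)             # 2x2 spatial seed
        self.blocks = nn.Sequential(upconv_block(512, 256),    # 4x4
                                    upconv_block(256, 128),    # 8x8
                                    upconv_block(128, 64),     # 16x16
                                    upconv_block(64, 32),      # 32x32
                                    upconv_block(32, 16))      # 64x64
        self.to_rgb = nn.Conv2d(16, 3, kernel_size=3, padding=1)

    def forward(self, feat):
        x = self.fc(feat).view(-1, 512, 2, 2)
        return torch.tanh(self.to_rgb(self.blocks(x)))
```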

4. Implementation details

4.1 Training Strategy

We follow the same training strategy as the paper for the discriminative feature transformation. First, the feature generator G and the semantic embedding decoder Dec are trained. Then, Dec is used to transform the features (real and synthesized) to the embedding space \mathcal{A}. The latent embeddings from Dec are then combined with the respective visual features. Let h_s and \hat{h}_u \in \mathcal{H} denote the hidden-layer (latent) embeddings of Dec for inputs x_s and \hat{x}_u, respectively. The transformed features are x_s \oplus h_s and \hat{x}_u \oplus \hat{h}_u, where \oplus denotes concatenation. In the proposed TF-VAEGAN method, the transformed features are used to learn the final ZSL and GZSL classifiers as
f_{zsl}:\mathcal{X} \oplus \mathcal{H} \rightarrow \mathcal{Y}^u \qquad \text{and} \qquad f_{gzsl}:\mathcal{X} \oplus \mathcal{H} \rightarrow \mathcal{Y}^s \cup \mathcal{Y}^u
As a result, the final classifiers learn to distinguish the categories using the transformed features. The authors use the latent embedding of the semantic embedding decoder Dec as the additional input because Dec is trained to reconstruct the class-specific semantic embeddings from feature instances. In the original paper, G and F are trained alternately (Sharma et al. [2019]) to utilize the feedback for improved feature synthesis. In this alternating training strategy, the generator training iteration is unchanged; during the training iterations of F, however, two sub-iterations are performed: first, features \hat{x}[t] are synthesized and passed through Dec to obtain the latent embedding h; second, the feedback F(h) is added to the latent representation of G, which then outputs the refined features \hat{x}[t+1].
The refined features \hat{x}[t+1] are input to D and Dec, and the corresponding losses are computed for training. In practice, the second sub-iteration is performed only once. The feedback module F allows the generator G to view the latent embedding of Dec corresponding to the currently generated features, which enables G to iteratively refine its output, leading to an enhanced feature representation. A detailed overview of the model architecture is given in Table 1.
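The alternating update with its two sub-iterations can be sketched as follows, reusing the modules from the earlier sketches; the critic updates and the WGAN gradient penalty are omitted for brevity, so this is an illustration of the training flow rather than the authors' exact procedure.

```python
import torch

def train_feedback_iteration(netG, netDec, netF, netD, z, a, delta, beta, opt_G, opt_F):
    """One illustrative G/F update with the two sub-iterations described above."""
    opt_G.zero_grad()
    opt_F.zero_grad()

    # Sub-iteration 1: synthesize x_hat[t] and obtain the Dec latent embedding h.
    x_fake = netG(z, a)
    _, h = netDec(x_fake)

    # Sub-iteration 2: add the feedback F(h) to G's latent representation -> x_hat[t+1].
    x_refined = netG(z, a, feedback=netF(h), delta=delta)

    # Losses on the refined features: WGAN generator term + cycle consistency (L_R).
    loss_gen = -netD(x_refined, a).mean()
    a_rec, _ = netDec(x_refined)
    loss_cycle = torch.nn.functional.l1_loss(a_rec, a)

    loss = loss_gen + beta * loss_cycle
    loss.backward()
    opt_G.step()
    opt_F.step()
    return loss.item()
```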

4.2 Experimental Setup

In this study, we have followed the same training procedures for all the settings, as described in the original paper. The parameters for all training settings can be found in the configuration file in our GitHub repository. Our implementation is open-sourced and can be accessed here.
Visual features and embeddings: Following the same approach as discussed in the paper, we extracted the average-pooled feature instances of size 2048 from the ImageNet-1k (Deng et al. [2009]) pre-trained ResNet-101 (He et al. [2016]). For semantic embeddings, we use the class-level attributes for CUB (312-d), SUN (102-d) and AWA2 (85-d). For FLO, fine-grained visual descriptions of images are used to extract 1024-d embeddings from a character-based CNN-RNN (Reed et al. [2016]).
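A sketch of this feature extraction step, assuming a standard torchvision ResNet-101 and ImageNet-normalized input batches (the helper function is ours, not from the released code):

```python
import torch
import torch.nn as nn
from torchvision import models

# ResNet-101 without its classification head: outputs 2048-d average-pooled features.
backbone = models.resnet101(pretrained=True)
extractor = nn.Sequential(*list(backbone.children())[:-1])    # drop the final fc layer
extractor.eval()

@torch.no_grad()
def extract_features(images):
    """images: (N, 3, 224, 224) tensor normalized with ImageNet statistics -> (N, 2048)."""
    return extractor(images).flatten(1)
```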

4.3 Hyperparameters details

The discriminator D, encoder E, and generator G are implemented as two-layer fully-connected (FC) networks with 4096 hidden units. The dimensions of z and a are set to be equal (\mathbb{R}^{d_z}=\mathbb{R}^{d_a}). The semantic embedding decoder Dec and the feedback module F are also two-layer FC networks with 4096 hidden units. The input and output dimensions of F are set to 4096 to match the hidden units of Dec and G. We use the same activation functions as the authors: LeakyReLU with a negative slope of 0.2 everywhere, except at the output of G, where a Sigmoid activation is used for the BCE loss.
The network is trained using the Adam optimizer with a learning rate of 10^{-4}. The final ZSL/GZSL classifiers are single fully-connected layers with output units equal to the number of test classes. The hyper-parameters \alpha, \beta, and \delta are set to 10, 0.01, and 1, respectively. The WGAN is trained with a gradient penalty coefficient \lambda = 10 (Arjovsky et al. [2017]). We also ran a hyperparameter search on the SUN dataset for the fine-tuned inductive setting; detailed results are shown in Table 3.
GAN_lr Decoder_lr Feedback_lr a1 a2 ZSL acc. Unseen acc. Seen acc. H
0.00001 0.00001 0.00001 0.1 0.01 63.3 37.5 47.7 38.5
0.0001 0.0001 0.00001 0.1 0.01 64.3 42.2 47.0 44.4
0.0001 0.00001 0.00001 0.1 0.01 65.0 44.3 46.7 45.5
0.0001 0.00001 0.0001 0.1 0.01 64.3 42.6 49.3 45.7
0.0001 0.00001 0.00001 0.01 0.01 64.6 41.8 50.7 45.9
0.0001 0.00001 0.00001 0.1 0.01 64.6 42.4 50.0 45.9
0.0001 0.00001 0.000001 0.1 0.01 64.5 42.0 50.8 46.0
0.0001 0.00001 0.00001 0.1 0.1 65.0 42.3 50.8 46.1
0.0001 0.00001 0.000001 0.01 0.1 66.2 41.5 51.3 46.0

Table 3: Detailed results of the hyperparameter search on the SUN dataset for the fine-tuned inductive setting. By varying the learning rates of the GAN, decoder, and feedback module, we were able to replicate the reported H to within a margin of 0.2.
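A sweep of this kind can be scripted with a simple grid loop, as sketched below; `train_and_eval` is a hypothetical wrapper around one fine-tuned inductive SUN run and is not a function from the released code.

```python
import itertools

gan_lrs      = [1e-5, 1e-4]
decoder_lrs  = [1e-5, 1e-4]
feedback_lrs = [1e-6, 1e-5, 1e-4]

best = None
for gan_lr, dec_lr, fb_lr in itertools.product(gan_lrs, decoder_lrs, feedback_lrs):
    # train_and_eval is assumed to run one fine-tuned inductive SUN experiment and
    # return (zsl_acc, unseen_acc, seen_acc, harmonic_mean); it is not in the released code.
    zsl, unseen, seen, h = train_and_eval(gan_lr=gan_lr, dec_lr=dec_lr, fb_lr=fb_lr)
    if best is None or h > best[0]:
        best = (h, gan_lr, dec_lr, fb_lr)
print("best H = %.1f with learning rates" % best[0], best[1:])
```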

4.4 Computational Requirements

All experiments were run on an NVIDIA Tesla V100 with 32 GB of GPU memory. A breakdown of the total training time for each dataset is provided in Table 2.

5. Results

We implemented the model from scratch by following the descriptions provided in the original paper, and we were able to replicate the claimed results of TF-VAEGAN by referring to the published code. Overall, our implementation of TF-VAEGAN achieves ZSL accuracy and harmonic mean close to the reported values in both the inductive and fine-tuned inductive settings on all four datasets: Caltech-UCSD Birds (Welinder et al. [2010]) (CUB), Oxford Flowers (Nilsback et al. [2008]) (FLO), SUN Attribute (Patterson et al. [2012]) (SUN), and Animals with Attributes 2 (Xian et al. [2018]) (AWA2). We were also able to generate similar-looking reconstructed images from the features, showing the effectiveness of the TF-VAEGAN feature synthesis stage.
In Table 4, we compare the results reported in the original paper with our reproduced results in the two training settings, inductive and fine-tuned inductive.
Caltech-UCSD Birds: Our implementation replicates the results reported in the original paper to within 0.5-0.7\%.
Animals with Attributes 2: We outperform the reported generalized zero-shot seen-class accuracy by 1\% and replicate the other metrics to within 0.2-1.0\%. We observe a drop of 0.7\% in zero-shot learning accuracy.
Oxford Flowers: Our implementation outperforms the reported generalized zero-shot unseen-class accuracy and harmonic mean by 0.6\% and 0.1\%, respectively. We also replicate the other results reported in the original paper to within 0.5-1.0\%.
SUN dataset: We improve the generalized zero-shot seen-class accuracy by 0.8\% and replicate the other results from the original paper to within 0.4-1.9\%.
Feature Visualization: To qualitatively assess the feature synthesis stage, we use the same approach as the original paper and train an upconvolutional network to invert feature instances back to image space, following a strategy similar to Dosovitskiy et al. [2016] and Xian et al. [2019].
The model is trained on all real feature-image pairs of the 102 classes of the FLO dataset (Nilsback et al. [2008]). Reconstructions from the Baseline and Feedback synthesized features for four example flowers are shown in Fig. 4 for the original paper and in Fig. 5 for our reproduction.
For each flower class, a ground-truth (GT) image is shown along with three images inverted from its GT feature, the Baseline synthesized feature, and the Feedback synthesized feature, respectively. Generally, inverting the features synthesized with the Feedback module yields an image that is semantically closer to the GT image than the one obtained from the Baseline synthesized feature, suggesting that the Feedback module improves the feature synthesis stage over the Baseline, where no feedback is present.

6. Discussion

We find that the proposed feedback module in the VAE-GAN model helps modulate the generator's latent representation, improving feature synthesis. By enforcing the generation of semantically consistent features at all stages, the authors were able to outperform previous zero-shot approaches on four challenging datasets. The qualitative results generated with our replication are similar to those shown in the paper, strengthening the authors' claim of a highly effective feedback module.
Based on our replicated quantitative results, we affirm that our implementation of TF-VAEGAN is consistent with the one provided by the authors. Overall, the paper and the provided code were sufficient for replicating the results in the inductive and fine-tuned inductive settings. For the re-implementation, we ported the code to a recent PyTorch version and obtained performance comparable to that reported in the paper in all settings. Lastly, to give an indication of run-time, we used the same hardware as the authors, a Tesla V100 GPU, and report the run-times on the four datasets in Table 2.
Recommendations for reproducibility:
Overall, the paper was clearly written, and it was easy to follow the explanation and reasoning behind the experiments. We ran into several obstacles while setting up the fairly old PyTorch 0.3.1 environment. To reproduce the quantitative results, we had to ensure that most, if not all, training and evaluation details were true to the experiments in the paper.
We are extremely grateful to the original authors for their swift responses to our questions. Nevertheless, it would have been easier to reproduce the results if the code had been compatible with the latest PyTorch version. We hope our report and published code will help future use of the paper.
Recommendations for reproducing papers
We recommend communicating early with the original authors to determine undisclosed parameters and pin down the experimental setup. In particular, for reproducing training processes, we suggest monitoring how training is progressing in as many different ways as possible. In our case, this involved tracking the progression of H, the harmonic mean of the GZSL seen and unseen accuracies, and examining the training curves of the individual loss terms, both of which helped us pinpoint issues.