Fine-Grained Image Classification (FGIC) with B-CNNs

Bi-linear convolutions for classifying images with minute differences, complete with code, resources, and some dogs for good measure.
Rajesh Shreedhar Bhat


Fine-Grained Image Classification (FGIC) is a branch of image recognition in which we differentiate between closely related categories such as dog breeds, bird species, airplane models, etc.
There are two main challenges associated with such fine-grained tasks. First, the visual differences between classes are so small that they can easily be overwhelmed by factors such as pose, location of the object, etc.; even one small section of the object can make a big impact on the classification. Second, it is very hard to identify the minute discriminating parts within the images. Various methodologies have been proposed to identify and locate those parts, from orderless descriptors such as Fisher vectors and VLAD to neural models like Bilinear Convolutional Neural Networks (B-CNNs). In this report, we take a deep dive into B-CNNs, which have proven to be one of the most successful techniques for obtaining strong results on Fine-Grained Image Classification tasks.

1. Introduction

Convolutional Neural Networks (CNNs) have become standard for many Computer Vision tasks, to the extent that CNNs can outperform humans on some of them. But when regular CNNs try to categorize objects more finely, e.g. the breed of a dog or the species of a bird, it becomes quite a challenge.
This is evident from the set of images shown in Fig. 1 and Fig. 2; the differences between these breeds are relatively slim. The major challenge in such a task is to create a methodology that can identify and differentiate the minute variations between categories. This can be achieved if we can locate the parts of the images containing the finer discriminative details by identifying the appropriate interactions among the various features of the image.
Table 1: Different Fine-grained image classification datasets with the number of classes and average samples per class.
We use Transfer Learning approaches, which let us leverage knowledge from previously learned tasks and apply it to similar ones. Transfer Learning is useful when we have limited training data and the source and target domains belong to different label spaces. As shown in Table 1, in Fine-Grained Image Classification tasks the number of labels/classes is comparatively high relative to the number of images per class. This adds to the complexity of the problem and restricts the number of parameters we can afford when building a generalizable model. B-CNNs have shown significant accuracy improvements over traditional methodologies such as VLAD and Fisher Vectors with SIFT features in Fine-Grained Image Classification tasks.

2. Related Work

There has been a lot of interesting work in the field of computer vision related to FGIC. In the fine-grained classification model, we extract the features from the images and then train a classifier from the derived features. The differences between each class are generally small and can be influenced by many factors.
A way to address these nuances is to first localize the various parts of the object and build a classifier on the detected parts. For detecting the parts, traditional approaches such as sliding-window techniques with feature extractors like HOG [10] or SIFT [11] can be combined with a classifier that decides whether the object/part of interest is present. Such part detectors must be trained in a supervised manner, and labeling the parts requires far more domain expertise than simply providing a label for the overall image.
Additionally, these traditional approaches offer no end-to-end learning for detecting the parts of the object in focus, unlike recent CNN-based models such as YOLO [12] and Single Shot Multibox Detectors [13]. One of the early papers on bilinear models, by Tenenbaum and Freeman, concerns two-factor problems in vision [8]. They used bilinear models to “learn the style content structure of a pattern analysis or synthesis problem which can then be generalized to solve related tasks using different styles and/or content."
Here, we discuss the Bilinear CNN architecture for fine-grained visual recognition [1], where models pre-trained on the ImageNet dataset [2] are truncated at a convolutional layer to extract features. Using pre-trained ImageNet models (Transfer Learning) implicitly provides additional training data when domain-specific data is limited; a number of recognition tasks ranging from texture detection to object detection and fine-grained recognition provide proof of this [3, 4, 5, 6].
There has been significant research on detecting and localizing key parts of images using convolutional neural networks in a supervised framework, which has substantially improved on past approaches that relied on handcrafted and bag-of-words-based features like HOG. But a major challenge of this approach is the creation of labels for training data, as it is extremely costly to extract the parts and label them for all the images. Another approach is to use robust descriptors such as Fisher Vectors and VLAD with SIFT [11] features or pre-trained image embeddings [7], but this approach fails to efficiently capture the significant feature interactions, yielding lower accuracy than the part-based models.

3. Methodology

Fig 3: Bilinear CNN Model Architecture
In a Bilinear CNN there are two CNN streams (as shown in Fig. 3), which can be the same or different models; we use ResNet50 for both streams. The outer products of the two streams' feature maps are taken to obtain a bilinear vector that captures pairwise interactions between different features in a translationally invariant fashion; this vector is then flattened and passed to a fully connected layer for classification. The authors of the B-CNN paper argue that such orderless features are more useful for fine-grained image classification than the spatially ordered features of a typical CNN.

3.1 Feature Interaction Matrix

Fig 4: BCNN Feature Interaction Matrix
In the Fine-Grained Image Classification (FGIC) task, where we differentiate minor categories such as dog breeds, bird species, airplanes, etc., the major challenge lies in finding the minute visual differences among features such as pose, location, shape, size, etc. Bilinear Convolutions have proven highly successful at efficiently capturing these part-feature interactions and have shown promising results on Fine-Grained Image Classification tasks.
To explain in more detail, consider the task of separating fine-grained categories of birds by breed. It is extremely complex because the number of training data points per category/breed is very low, which directs us towards a transfer-learning-based framework. However, if we extract features from any standard CNN, we end up with generic features that do not carry enough information to differentiate these fine-grained categories.
One interesting observation from Fig. 4 above is that a significant amount of information lies in specific parts of the image. For example, it is clear from Fig. 4 that there is a significant difference in the belly color of the birds, whereas the other aspects are almost identical. Hence, if we can devise a methodology that efficiently captures the interaction between the color and position features, we can solve this problem, and that is exactly how the B-CNN captures feature interactions.
As shown in Fig. 5, suppose CNN stream A of the B-CNN framework extracts the color features in the image and CNN stream B detects the different parts of the bird. Then, thanks to the outer product operation of the B-CNN, we obtain a feature interaction matrix in which the most significant interactions stand out, as shown in Fig. 5: a bird with a gray belly will have a high value at that position of the interaction matrix and low values elsewhere. On subsequent processing, this matrix generates meaningful, discriminative feature vectors that can differentiate between the fine-grained classes even with limited labels per class, thereby contributing hugely to the success of the B-CNN network.
Fig 5: BCNN Feature Interaction Matrix
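To make the interaction matrix concrete, here is a hypothetical toy version of the scenario in Fig. 5 with made-up activations; the two "color" channels, two "part" channels, and three image locations are assumptions for illustration only:

```python
import numpy as np

# Hypothetical per-location activations over C = 3 image locations.
# Stream A channels: [red, gray]; Stream B channels: [head, belly].
F_a = np.array([[0.9, 0.1],   # location 1: mostly red
                [0.1, 0.9],   # location 2: mostly gray
                [0.2, 0.1]])  # location 3: background
F_b = np.array([[0.8, 0.1],   # location 1: head region
                [0.1, 0.9],   # location 2: belly region
                [0.0, 0.1]])  # location 3: neither part

# Pooled outer product over locations: (colors x parts) interaction matrix.
x = F_a.T @ F_b
print(x)
```

Because "gray" and "belly" fire at the same location, the (gray, belly) entry of `x` dominates the matrix, which is exactly the signal a downstream classifier can use to separate a gray-bellied breed from the others.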

3.2 Bilinear Convolution

A bilinear model B for image classification consists of two feature functions f_a and f_b (CNNs in our case: CNN streams A and B) that extract features from an input image, which are pooled together using a pooling function P to obtain the resulting feature representation, as shown in Fig. 3.
The feature function transforms the image into an embedding space of latent representations encoding the most relevant information in the image. W and H represent the width and height of the feature maps extracted for an input image, and M and N represent the number of channels of the feature maps from CNN streams A and B respectively. Hence the features extracted from streams A and B have dimensionality W \times H \times M and W \times H \times N, which can also be written as C \times M and C \times N respectively, where C = W \times H. Here, C represents the locations in the image, and the M and N channels can be thought of as features at each of the C locations. Let the reshaped features extracted from CNN streams A and B be represented by F_a \in R^{C\times M} and F_b \in R^{C\times N} respectively. In this setting, the pooled bilinear feature matrix x is given by
x = F_a^T \times F_b
where x \in R^{M\times N}. This outer-product pooling captures the pairwise interactions between the two streams' features in a translationally invariant fashion. x is then flattened into a vector in R^{MN\times 1}. The resulting bilinear feature vector is passed through a signed square root step, x' = sign(x) \times \sqrt{|x|}, followed by an l2 normalization, x'' = \frac{x'}{\Vert x' \Vert_2}, to improve backpropagation, and is finally fed to the classification layer.
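The pooling, signed square root, and l2 normalization steps can be sketched in PyTorch; `bilinear_pool` is a hypothetical helper name and the code is a minimal sketch of the equations above, not the authors' exact implementation:

```python
import torch
import torch.nn.functional as F

def bilinear_pool(fa, fb):
    """Bilinear pooling of two feature maps.
    fa: (B, M, W, H) from CNN stream A; fb: (B, N, W, H) from stream B."""
    B, M = fa.shape[0], fa.shape[1]
    N = fb.shape[1]
    fa = fa.reshape(B, M, -1)                  # (B, M, C) with C = W*H
    fb = fb.reshape(B, N, -1)                  # (B, N, C)
    # Sum of outer products over the C locations, averaged: x = F_a^T F_b.
    x = torch.bmm(fa, fb.transpose(1, 2)) / fa.shape[-1]   # (B, M, N)
    x = x.reshape(B, M * N)                    # flatten to (B, M*N)
    x = torch.sign(x) * torch.sqrt(x.abs() + 1e-10)  # signed square root
    x = F.normalize(x)                         # l2 normalization per sample
    return x
```

A quick shape check: feature maps of shape (2, 4, 3, 3) and (2, 5, 3, 3) pool into vectors of shape (2, 20), each with unit l2 norm.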
Fig 6: Directed Acyclic Architecture of the BCNN Network
As shown in Fig. 6, since the B-CNN architecture resembles a directed acyclic graph, the parameters of the model can be trained easily by backpropagating the classification loss. Let \frac{dl}{dx} be the gradient of the loss function l w.r.t. x; then by the chain rule of gradients we can write:
\frac{dl}{dF_a} = F_b\times (\frac{dl}{dx})^T and \frac{dl}{dF_b} = F_a\times \frac{dl}{dx}, and the gradients of the remaining operations are straightforward to compute.
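These closed-form gradients can be sanity-checked against PyTorch's autograd; the scalar loss below is an arbitrary stand-in chosen only to make the check runnable:

```python
import torch

C, M, N = 6, 4, 5
Fa = torch.randn(C, M, requires_grad=True)
Fb = torch.randn(C, N, requires_grad=True)

x = Fa.t() @ Fb              # bilinear feature matrix x = F_a^T F_b, (M, N)
l = (x ** 2).sum()           # arbitrary scalar loss for the check
l.backward()                 # autograd fills Fa.grad and Fb.grad

dl_dx = (2 * x).detach()     # dl/dx for this particular loss
# Closed-form gradients from the chain rule:
grad_Fa = Fb @ dl_dx.t()     # dl/dF_a = F_b (dl/dx)^T, shape (C, M)
grad_Fb = Fa.detach() @ dl_dx  # dl/dF_b = F_a (dl/dx), shape (C, N)

assert torch.allclose(Fa.grad, grad_Fa, atol=1e-5)
assert torch.allclose(Fb.grad, grad_Fb, atol=1e-5)
```

Both assertions pass, confirming that the analytical expressions match what backpropagation computes.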

4. Experiments

4.1 Training Data

The Stanford Dogs dataset contains 20,580 images spanning 120 classes (dog breeds). The images were taken from the ImageNet database and come with annotations marking the bounding box around the dog present in each image. The location of the bounding boxes varies, and scenes can be non-uniform within each class; background objects, poses, occlusion, and colors can also differ within and across classes. As a pre-processing step, the images are first cropped to the bounding box and then resized to 224 x 224 for the training and validation sets. Each RGB channel is normalized with the ImageNet mean (0.485, 0.456, 0.406) and standard deviation (0.229, 0.224, 0.225) respectively.

4.2 B-CNN Implementation[14] in PyTorch

Please read the inline comments along with the shapes mentioned for getting a clear understanding of how the BCNN model is implemented.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

features = 2048  # depth (number of channels) of the ResNet50 feature maps
fmap_size = 7    # W & H of the feature map for a 224 x 224 x 3 input (224 / 32)

class BCNN(nn.Module):
    def __init__(self, fine_tune=False):
        super(BCNN, self).__init__()
        resnet = models.resnet50(pretrained=True)
        # Freeze the backbone parameters unless fine-tuning is requested.
        for param in resnet.parameters():
            param.requires_grad = fine_tune
        ### removing the average-pooling and fully connected layers from ResNet
        layers = list(resnet.children())[:-2]
        self.resnet = nn.Sequential(*layers).cuda()
        ### Fully connected layer from the flattened feature interaction matrix
        ### (features ** 2 dimensions) to the classification layer.
        ### In this case we have 120 dog breeds/classes.
        self.fc = nn.Linear(features ** 2, 120)
        self.dropout = nn.Dropout(0.5)
        # Initialize the fc layer.
        nn.init.xavier_normal_(self.fc.weight.data)
        if self.fc.bias is not None:
            nn.init.constant_(self.fc.bias.data, val=0)

    def forward(self, x):
        ## x: (bs, 3, 224, 224); N = batch size
        N = x.size()[0]
        ## x: (bs, 2048, 7, 7)
        x = self.resnet(x)
        ### reshaping the features from
        ### (batch_size, 2048, 7, 7) --> (batch_size, 2048, 7*7)
        x = x.view(N, features, fmap_size ** 2)
        x = self.dropout(x)
        # Batch matrix multiplication to get the feature interaction matrix:
        # (bs, 2048, 49) matmul (bs, 49, 2048) = (bs, 2048, 2048),
        # averaged over the 49 locations.
        x = torch.bmm(x, torch.transpose(x, 1, 2)) / (fmap_size ** 2)
        ## flattening, sqrt, l2 normalization and dropout
        ### shape: (bs, 2048 * 2048)
        x = x.view(N, features ** 2)
        x = torch.sqrt(x + 1e-5)  # entries are non-negative after ReLU
        x = F.normalize(x)
        x = self.dropout(x)
        ## classification layer: (bs, 2048 * 2048) --> (bs, 120)
        x = self.fc(x)
        return x
```

4.3 Training, Validation Accuracy + Loss, and model predictions.

The model was trained on Kaggle GPUs with early stopping. Below are the graphs showing training and validation accuracy over 8 epochs. The best model is then used to make predictions on the validation data; predictions for a few images are shown along with the ground truth for different dog breeds.
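A minimal sketch of such a training loop with early stopping on validation accuracy; the `train` helper and the `loaders` dict of 'train'/'val' DataLoaders are assumptions for illustration, not the exact Kaggle setup:

```python
import torch

def train(model, loaders, epochs=8, patience=2, lr=1e-3):
    """Train with early stopping: stop once validation accuracy fails to
    improve for `patience` consecutive epochs, and keep the best weights."""
    opt = torch.optim.Adam(
        filter(lambda p: p.requires_grad, model.parameters()), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    best_acc, bad_epochs, best_state = 0.0, 0, None
    for epoch in range(epochs):
        model.train()
        for xb, yb in loaders['train']:
            opt.zero_grad()
            loss = loss_fn(model(xb), yb)
            loss.backward()
            opt.step()
        # Evaluate on the validation set.
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for xb, yb in loaders['val']:
                correct += (model(xb).argmax(1) == yb).sum().item()
                total += yb.size(0)
        acc = correct / total
        if acc > best_acc:  # improvement: save weights, reset the counter
            best_acc, bad_epochs = acc, 0
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
        else:               # no improvement: count towards early stopping
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    if best_state is not None:
        model.load_state_dict(best_state)  # restore the best model
    return best_acc
```

Passing the `BCNN` model and the Stanford Dogs loaders into this loop reproduces the train-then-select-best workflow described above.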

4.4 Conclusion

B-CNNs give superior performance on FGIC tasks, but the computational complexity of the model is exceptionally high, since the entire feature interaction matrix is used as the feature vector, leading to a very high-dimensional feature space. In our next article, we will focus on efficiently reducing the dimensionality of this feature space while still capturing all the relevant information.
Stay tuned for the next article!
Link to full code:


[1] Lin, T. Y., RoyChowdhury, A., & Maji, S. (2015). Bilinear CNN models for fine-grained visual recognition. In Proceedings of the IEEE international conference on computer vision (pp. 1449-1457).
[2] Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009, June). Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition (pp. 248-255). IEEE.
[3] Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., & Vedaldi, A. (2014). Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3606-3613).
[4] Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., & Darrell, T. (2014, January). Decaf: A deep convolutional activation feature for generic visual recognition. In International conference on machine learning (pp. 647-655).
[5] Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 580-587).
[6] Sharif Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops (pp. 806-813).
[7] Wah, C., Branson, S., Welinder, P., Perona, P., & Belongie, S. (2011). The caltech-ucsd birds-200-2011 dataset.
[8] Freeman, W. T., & Tenenbaum, J. B. (1997, June). Learning bilinear models for two-factor problems in vision. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (pp. 554-560). IEEE.
[9] Farrell, R., Oza, O., Zhang, N., Morariu, V. I., Darrell, T., & Davis, L. S. (2011, November). Birdlets: Subordinate categorization using volumetric primitives and pose-normalized appearance. In 2011 International Conference on Computer Vision (pp. 161-168). IEEE.
[10] Dalal, N., & Triggs, B. (2005, June). Histograms of oriented gradients for human detection.
[11] Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2), 91-110.
[12] Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp.779-788).
[13] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., & Berg, A. C. (2016, October). Ssd: Single shot multibox detector. In European conference on computer vision (pp. 21-37). Springer, Cham.
[14] Colab BCNN for Fine Grained Analysis. -