How Well Can a CNN Detect Architectural Style?

Convolutional neural nets perform well on objects with disparate features. But do they perform as well on objects with a high number of similar features? Let's find out.
Mason Sanders

Contents

  1. Abstract
  2. Introduction
  3. Datasets
  4. Model Description
  5. Visualizations and Reporting
  6. Hypothesis Verification
  7. Demonstrative Application and User's Guide
  8. Learning Experience
  9. The Next Step
  10. References and Links

Abstract

This attempt to create a convolutional neural net that can classify the architectural style of a building was my first experience with machine learning. Specifically, this is an attempt to push classification beyond distinguishing between dissimilar objects and into the realm of classifying objects with a high number of shared features.
The resulting model can classify a building into one of ten architectural styles with 60-65% accuracy. This demonstrates that computers are capable of detecting distinguishing features within objects that are otherwise very similar.

Introduction

There are many standard datasets for testing computer vision models (ImageNet, MNIST, Fashion-MNIST, and CIFAR-100, for example). And while these datasets are extensive in their breadth of data, none of them makes the difference between largely similar objects the central challenge. As an example, the Fashion-MNIST dataset consists of 70,000 28x28 grayscale images of clothing items spread across 10 classes. However, the classes represented in the dataset are, in general, quite distinct, which helps a model differentiate between them.
In contrast, something like architecture style is much more subtle; the differences between these styles are much more nuanced than comparing a dress to a sandal. The purpose of this experiment is to determine how well a convolutional neural net can distinguish between similar objects (buildings) and between similar features within those objects (architectural style).

Datasets

The raw dataset used for training was found on Kaggle (linked in the references below). It contained 25 architectural styles, of which nine were selected; images of the Japanese Traditional style were then added for a total of ten. The resulting dataset was supplemented by adding, on average, 15 images per style collected from Google Image searches in order to increase the diversity and size of the dataset. The ten classes are shown below:
The dataset was cleaned by reducing or increasing the sample size of each style to 150 items with a test set of 45 items. These images were then cropped (see below) to reduce the amount of non-feature noise.
Before Cropping
After Cropping
After training on the dataset with 150 items per class, it became apparent that the variety of features within each class was too diverse for the model in its current form. Rectifying this at full scale would have pushed the model well beyond the intended prototype. Instead, the number of features per style was limited, and the dataset was reduced to 50 training items and 25 testing items per class. The cropping, sorting, and cleaning of the dataset was done by hand, with no machine assistance during this normalization process.
Pictured here is a sample comparing the diversity within the Art Nouveau and Baroque styles before and after normalization.

Before Normalization:

After Normalization:

Again, it is easy to see how distinct features become more readily accessible to the machine after this process. It should be noted, however, that these two styles showed a higher-than-average amount of crossover during prediction, which further suggests there is room for improvement by iterating on this method after the prototyping stage.

Model Description

Once a version of the dataset was ready, it was uploaded to a Google Drive that could be accessed from the development environment. The environment used was Google Colaboratory, hereafter abbreviated as Colab. Computer vision is very computationally intensive, and the machines Colab makes available for cloud processing are significantly more powerful than those otherwise accessible.
The notebook extracted the files from the Drive, reduced them to a uniform size, and converted them to grayscale, both to increase training efficiency. After this was completed, the images were run through the model for training. We used a convolutional neural network (CNN) to predict which style of architecture an image belongs to. Weights & Biases was used to track hyper-parameters and resource usage and to automate the training process. By the end of the project, we had run over 1,500 distinct tests, producing over 100 gigabytes of metadata.
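As a concrete illustration, here is a minimal sketch of that ingest step, written with standard Keras utilities. The target resolution, directory layout, and Drive path are all assumptions; the article specifies only that the images were made a uniform size and converted to grayscale.

```python
import tensorflow as tf

IMG_SIZE = (256, 256)  # assumed; the article says only "a uniform size"

# Hypothetical Drive layout: one sub-folder per architectural style.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "/content/drive/MyDrive/architecture/train",  # hypothetical path
    color_mode="grayscale",  # grayscale conversion, as described above
    image_size=IMG_SIZE,     # uniform resize
    batch_size=32,
)
```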
The first layer was a preprocessing layer which applied random rotation, zoom, and horizontal flipping to the images. This effectively expands the dataset: even if the same image is seen multiple times, the augmentation prevents the machine from simply memorizing it. This is, of course, a well-documented technique for enabling training on small datasets for computer vision, but the difference in performance between the final model with and without preprocessing is shown in the embedded chart.
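This augmentation stage maps naturally onto the Keras preprocessing layers of the same names. The sketch below uses the 0.1 factors given in the model summary later in this article:

```python
from tensorflow import keras
from tensorflow.keras import layers

augment = keras.Sequential([
    layers.RandomFlip("horizontal"),  # mirror the facade left/right
    layers.RandomZoom(0.1),           # zoom by up to 10%
    layers.RandomRotation(0.1),       # rotate by up to 10% of a full turn
])
```

Because these layers are random at training time, the same photo yields a slightly different variant every epoch, which is what effectively enlarges the small dataset.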
Afterwards come the processing layers. The architecture has 6 convolutional layers, which extract features from the image – lines, shapes, and so on. These feed into 3 dense (or fully connected) layers, which are responsible for the prediction itself. A final dense layer outputs the prediction: it takes the aggregated information from the preceding layers, attempts to detect patterns in the features, and assigns a weight to each of the 10 categories, indicating how likely it is that the image belongs to that class.
A variety of models were created and trained on the dataset, which gives some insight into the impact of hyper-parameters on the model's performance. The first step was establishing a high-performing layer format: in this case, 6 convolutional layers feeding into 3 dense layers, with 320 connecting nodes in each convolutional layer and an average of 480 nodes across the dense layers.
All charts in this section represent the mean accuracy of 5 runs with the same architecture in order to account for variance in training. The min, max, and standard deviation within each 5-run set can be viewed by hovering your mouse over the bar representing the group you'd like more information about. The runs are not cherry-picked; however, for the sake of clarity and brevity, each run displayed here deviated by at most one hyper-parameter from the final model.
Here is a chart comparing the final model (6 convolutional layers and 3 dense layers) with a model with 3 convolutional and 3 dense layers, and a model with 6 convolutional layers and 1 dense layer.
Here is a chart comparing the performance in the final structure (6 convolutional and 3 dense layers) when given a different number of nodes.
Note: while layers with more than 320 connections would probably increase accuracy further, the testing environment could not handle a model of that size.
The next thing to fall into place was dropout. Dropout in the final model was included after the middle 4 convolutional layers at an intensity of .1, and after the first 2 dense layers at intensities of .3 and .6 respectively.
The last hyper-parameter to be solidified was the filter size for the convolutional layers. The filter size with the greatest accuracy was 2 (a 2x2 kernel).
While it must be admitted that this is a generalization, it should demonstrate the process through which the model was optimized. As an example, we examined the learning rate at every step along the revision process, but for this particular model a learning rate of .0001 was the most consistent and highest performing, as the embedded chart shows.
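To make the training configuration concrete, here is a hedged sketch of how that rate might be wired in. Only the learning rate itself comes from the article; the Adam optimizer and the sparse categorical cross-entropy loss are assumptions, and `model` refers to the network summarized below.

```python
from tensorflow import keras

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-4),  # the .0001 rate above
    loss="sparse_categorical_crossentropy",  # integer class labels assumed
    metrics=["accuracy"],
)
```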
Finally, at the end of testing, the model's final shape can be described as follows:
```
Model = {
    Preprocessing layer: {
        Horizontal_Flip
        Random_Zoom(.1)
        Random_Rotation(.1)
    }
    Convolutional_Layer(filters = 320, kernel_size = (2,2))
    MaxPooling2D
    Convolutional_Layer(filters = 320, kernel_size = (2,2))
    Dropout(.1)
    MaxPooling2D
    Convolutional_Layer(filters = 320, kernel_size = (2,2))
    Dropout(.1)
    MaxPooling2D
    Convolutional_Layer(filters = 320, kernel_size = (2,2))
    Dropout(.1)
    MaxPooling2D
    Convolutional_Layer(filters = 320, kernel_size = (2,2))
    Dropout(.1)
    MaxPooling2D
    Convolutional_Layer(filters = 320, kernel_size = (2,2))
    MaxPooling2D
    Flattening_Layer
    Dense_Layer(units = 640)
    Dropout(.3)
    Dense_Layer(units = 480)
    Dropout(.6)
    Dense_Layer(units = 320)
    Dense_Layer(units = 10)  # Softmax layer for classification
}
```
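For readers who want to experiment, here is a hedged transcription of that structure into Keras. The input resolution and the hidden-layer activations are assumptions (the article fixes grayscale input and a softmax output, but not the activation functions or the exact image size):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(256, 256, 1)),  # grayscale; size assumed
    # Preprocessing (augmentation) layers
    layers.RandomFlip("horizontal"),
    layers.RandomZoom(0.1),
    layers.RandomRotation(0.1),
    # 6 convolutional layers; dropout after the middle 4
    layers.Conv2D(320, (2, 2), activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(320, (2, 2), activation="relu"),
    layers.Dropout(0.1),
    layers.MaxPooling2D(),
    layers.Conv2D(320, (2, 2), activation="relu"),
    layers.Dropout(0.1),
    layers.MaxPooling2D(),
    layers.Conv2D(320, (2, 2), activation="relu"),
    layers.Dropout(0.1),
    layers.MaxPooling2D(),
    layers.Conv2D(320, (2, 2), activation="relu"),
    layers.Dropout(0.1),
    layers.MaxPooling2D(),
    layers.Conv2D(320, (2, 2), activation="relu"),
    layers.MaxPooling2D(),
    # 3 dense layers plus the classification output
    layers.Flatten(),
    layers.Dense(640, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(480, activation="relu"),
    layers.Dropout(0.6),
    layers.Dense(320, activation="relu"),
    layers.Dense(10, activation="softmax"),  # one probability per style
])
```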
Once the model was finished, the predictive aspect of the project was complete. For the descriptive aspect, K-means clustering was used to group the images by the features the model extracts, visualizing what the final layer of the model is doing. To accomplish this, the model was loaded and the predicting layer stripped off. With the predictive layer removed, the model outputs the raw features of each image, which can be fed into a K-means clustering algorithm to group similar feature vectors. This lets an observer intuit the final step of the process without needing a deep understanding of everything that happens during prediction.
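A minimal sketch of that descriptive step, assuming the trained model was saved as "model.h5", that `images` is the array of preprocessed images, and that stripping the predicting layer means taking the output of the last 320-unit dense layer:

```python
from tensorflow import keras
from sklearn.cluster import KMeans

model = keras.models.load_model("model.h5")  # hypothetical filename

# Re-wire the network to stop just before the softmax layer, so it
# outputs raw 320-dimensional feature vectors instead of class scores.
feature_extractor = keras.Model(
    inputs=model.input,
    outputs=model.layers[-2].output,
)

features = feature_extractor.predict(images)            # images: (n, 256, 256, 1)
clusters = KMeans(n_clusters=10).fit_predict(features)  # or n_clusters=20
```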

Visualizations and Reporting

Clustering is a visualization method that allows the user to see a representation of what the machine is doing. While K-means clustering does not correspond one-to-one with the prediction method used, it lets someone less familiar with these processes gain enough understanding to evaluate the program's merit.
This visualization takes a sample of 10 images from each cluster to show the similarities found by the program. The images are labelled by their category and can be viewed in either raw or processed form, and the user can choose between seeing the images sorted into 10 clusters or 20. This demonstrates that clustering is distinct from prediction: it puts visually similar images together irrespective of their class. Clustering is a technique that would likely be used in further iterations of the project to categorize sections of an image and increase accuracy. A sketch of the sampling logic appears below.
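Here is a hedged sketch of how such a sampler might be written, reusing the hypothetical `images` and `clusters` arrays from the previous snippet:

```python
import numpy as np
import matplotlib.pyplot as plt

N_CLUSTERS = 10  # the application also offers 20

for c in range(N_CLUSTERS):
    members = np.where(clusters == c)[0]
    if len(members) == 0:
        continue  # skip empty clusters
    sample = np.random.choice(members, size=min(10, len(members)), replace=False)
    fig, axes = plt.subplots(1, len(sample), figsize=(20, 2))
    for ax, i in zip(np.atleast_1d(axes), sample):
        ax.imshow(images[i].squeeze(), cmap="gray")  # processed (grayscale) view
        ax.axis("off")
    fig.suptitle(f"Cluster {c}")
plt.show()
```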
Taking a look at cluster 0, we can see that the images are quite similar.
Above are two images from a single cluster, along with the prediction made on each image. Confidence and accuracy are both high on these predictions. The two images are classed together by the clustering method but are predicted to be two different styles, so it can be seen that although clustering is a useful metaphor, it is not always accurate.

Hypothesis Verification

The hypothesis of the project was that computer vision can be used to differentiate the nuanced details that distinguish architectural styles. With ten distinct classes, random chance would put the machine's accuracy somewhere near 10% in training and testing. Training accuracy frequently reached 90% or higher (inflated by the small sample size), and testing accuracy peaked at 65%. It is, therefore, easy to conclude that the hypothesis was correct: even an unguided CNN can distinguish nuanced architectural features with reasonable accuracy, without bounding-box techniques to classify sections of an image before aggregating them into a summative prediction. The use of bounding boxes would, however, be the next step toward a more accurate model.
As previously established, the highest accuracy reached by a model was 65% on the smaller dataset. Seen below is a set of confusion matrices generated during the training of a model which reached 60% testing accuracy. Using the gear in the top left, you can step through the training epochs to see how the model's performance improved over time.
The y axis indicates the index of the class the images actually belonged to. The x axis indicates the class the model predicted.
A cell (y, x) therefore gives the number of test-set images whose true class was y but which the model predicted as x. If we look at the matrix from step 323, cell 6,6 has a value of 17, meaning that 17 images were classified as Japanese Traditional and were in fact Japanese Traditional. On the other hand, of the 25 Art Deco images, only 8 were correctly categorized, as seen in cell 1,1. Further information can be gained from this matrix: a cell like 1,9 can indicate why Art Deco predictions are inaccurate. This cell has a value of 15, which tells us that 15 of the images in the Art Deco class were misidentified as Tudor Revival. In fact, in the final version of the model before loading the best weights, 58 images were misidentified as Tudor Revival. That indicates a large bias in the model toward Tudor Revival, likely a result of over-fitting to the training set.
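For reference, here is a hedged sketch of how such a matrix is produced and read, assuming integer labels `y_true` and a held-out `test_images` array. The article's convention (row = true class, column = predicted class) matches scikit-learn's:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_pred = np.argmax(model.predict(test_images), axis=1)
cm = confusion_matrix(y_true, y_pred)  # cm[y, x]: true class y, predicted x

print(cm[6, 6])  # Japanese Traditional images identified correctly
print(cm[1, 1])  # Art Deco images identified correctly
print(cm[1, 9])  # Art Deco images misread as Tudor Revival
```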

Demonstrative Application and User’s Guide

Demonstrative Notebook
Training Notebook
The notebooks to run the demonstrative application and the training environment are linked above. The demonstrative application is contained in the notebook called 'Architecture Dataset.ipynb'; opening it will take the user into a Google Colab session. Once in the Colab session, the user needs to run the two initialization cells to load all the relevant data; after that, they can use the features of the application. These instructions are also included there for completeness. If a user wants to watch the training process, they can use the notebook called 'Training.ipynb' and follow the directions inside, starting with running the initialization cell to load the data. Note that they must initiate a cloud-based GPU session, or the process will not complete in a reasonable amount of time. With a GPU session, training typically takes 20-40 minutes, depending on the GPU the service assigns. Detailed instructions on setting up a GPU session are included in the notebook. For the sake of curiosity, a user can also train the same architecture on the Fashion-MNIST dataset, where it typically reaches about 90% accuracy.

Learning Experience

This project is easily the most complicated piece of software I've ever worked on. It is also, looking back now that it's finished, the project I was least prepared for at the onset. While I had experience working with Python and am very comfortable creating structures to accomplish a task, I had no experience with any of the other tools I used here. I ended up spending a fair amount of time reading documentation for each of them, something that, luckily, I am comfortable with; if I were not already used to digging through library documentation, I doubt this task could have been completed. Documentation, however, is no replacement for actual knowledge when it comes to neural nets. I had no practical experience working with CNNs before this project, so I had to do a fair bit of research throughout the process to reach enough of an understanding to first improve the model, and second to recognize when the model had reached its practical limits given the resources available. Although I had no practical experience creating machine learning models at the onset of this project, I have found the process very enjoyable and intend to continue researching ways to improve this model. I have also been making plans to start a handful of other machine learning projects which I anticipate will force me to grow in similar ways.

The Next Step

I currently have tentative plans to continue improving the model. However, to truly improve upon what currently exists would require entirely redesigning it. My current hypothesis is that implementing bounding-box techniques would allow the model to focus on specific features within a building's architecture and analyze them separately. From there, the model could aggregate its analyses of sections of the building before making a final assessment. This should increase accuracy while also making the model more tolerant of messy data. Additionally, the model could be modified to produce a more detailed breakdown of the architectural styles represented in an image, as well as to highlight which specific features influence a prediction.

References & Links

Weights & Biases:
Biewald, Lukas. Experiment Tracking with Weights and Biases. 2020. https://www.wandb.com
Kaggle dataset from the following paper:
Xu, Z., Tao, D., Zhang, Y., Wu, J., & Tsoi, A. (2014). Architectural Style Classification Using Multinomial Latent Logistic Regression. ECCV.
Demonstrative Notebook
Training Notebook
Dataset on Github