Towards Representation Learning for an Image Retrieval Task

This report explains self-supervised and regularized supervised image retrieval with the help of the latent space of an autoencoder.
Aritra Roy Gosthipaty


In an image retrieval task, there are two fundamental units: the image repository and the query image. Simply put, an image repository is a collection of images. A query image is a reference used to retrieve other images from the repository that are perceptually close to it. The task of image retrieval is to rank the images in an index set by their relevance to a query image.

In this report, we approach the image retrieval problem from an unsupervised perspective. The foundation of our work lies in the latent space representation of the images learned through a self-supervised learning task. The goal here is to capture the latent space embeddings of images and then determine the distances among them in the latent space. With this approach, we are focusing on the perceptual realm of an image. We validate the quality of the learned representations through a clustering task and measure its performance with the normalized mutual information score and the Rand index. Then we identify the issues with learning in a purely unsupervised scenario and show the enhancement in the information content of the learned representations with a hint of supervision: we train a regularized autoencoder with the supervised information. Finally, we validate the performance in a retrieval framework on the test set.

Full code in colab notebooks →


There are two parts to an autoencoder: the encoder and the decoder. The encoder compresses (encodes) the input image into embeddings. The bottleneck of an autoencoder gives rise to the latent space. The decoder, on the other hand, tries to recreate the input image from these embeddings. In the process, the embeddings are adjusted back and forth during training until they become low-dimensional representations of the high-dimensional images. To know more about autoencoders, one can quickly glance through this report.


In the training phase, we use the autoencoder to compress the high-dimensional images into latent space embeddings. These embeddings can be thought of as points in the latent space. They are then clustered by the k-means algorithm. After training, we have a repository of embeddings that are clustered in the latent space.


In the testing phase, we first acquire the latent space embedding of our query image. With the query embedding, we predict the cluster to which it belongs. We then extract all the image embeddings from the repository that fall in that cluster and calculate the Euclidean distance from each of them to the query embedding. The smaller the distance, the more similar the images are.
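The testing-phase procedure can be sketched in a few lines of plain Python. This is a minimal illustration, not the notebook code: the function names `nearest_cluster` and `retrieve`, the toy 2-D embeddings, and the hand-picked centroids are all assumptions made for the example. In practice the centroids and cluster labels would come from the fitted k-means model.

```python
import math

def nearest_cluster(query, centroids):
    """Index of the centroid closest to the query embedding."""
    return min(range(len(centroids)), key=lambda c: math.dist(query, centroids[c]))

def retrieve(query, repo_embeddings, repo_clusters, centroids, top_k=5):
    """Return (index, distance) pairs for the top_k repository items that
    share the query's cluster, nearest first."""
    cluster = nearest_cluster(query, centroids)
    candidates = [
        (i, math.dist(query, emb))
        for i, (emb, c) in enumerate(zip(repo_embeddings, repo_clusters))
        if c == cluster
    ]
    candidates.sort(key=lambda pair: pair[1])  # smaller distance = more similar
    return candidates[:top_k]

# Toy 2-D embeddings (hypothetical): two well-separated clusters.
repo = [(0.0, 0.0), (0.2, 0.1), (1.0, 0.0), (5.0, 5.0), (5.1, 4.9)]
clusters = [0, 0, 0, 1, 1]
centroids = [(0.4, 0.03), (5.05, 4.95)]

for index, distance in retrieve((0.1, 0.1), repo, clusters, centroids, top_k=2):
    print(index, round(distance, 3))
```

Restricting the search to one cluster keeps the number of distance computations small, at the cost of missing relevant images that k-means happened to place in a neighboring cluster.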


Upon giving this process some thought, it turns out to be logical. The embeddings represent images in a vector space. Moreover, they represent abstract features of the images. They need to be distinct for different classes and close together for similar ones. In this report, we observe that the autoencoder learns neat representations in the latent space.


We consider different metrics to assess the quality of the feature transformation and its impact on clustering the data.

Rand index

$RI=\frac{TP + TN}{TP + FP + FN + TN}$

where TP, TN, FP, and FN denote true positive, true negative, false positive, and false negative predictions of cluster labels with respect to the ground truth class labels available for the data points. RI is a pair-counting measure, so these counts are taken over pairs of data points: for example, a true positive is a pair that shares both a cluster and a class. RI takes values in the interval $[0,1]$, and it tends to show higher values for big, equal-sized clusters.
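To make the pair counting concrete, here is a small plain-Python sketch of the Rand index. The function name `rand_index` is ours. The loop visits every pair of points, so it is meant for illustration on small inputs rather than for tens of thousands of embeddings.

```python
from itertools import combinations

def rand_index(true_labels, cluster_labels):
    """Fraction of point pairs on which the clustering agrees with the classes.

    A pair counts as an agreement if it is together in both partitions (TP)
    or separated in both partitions (TN).
    """
    if len(true_labels) != len(cluster_labels):
        raise ValueError("label sequences must have the same length")
    if len(true_labels) < 2:
        raise ValueError("need at least two points to form a pair")
    agreements = 0
    total_pairs = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_class = true_labels[i] == true_labels[j]
        same_cluster = cluster_labels[i] == cluster_labels[j]
        agreements += same_class == same_cluster  # TP or TN
        total_pairs += 1
    return agreements / total_pairs

# Cluster ids are arbitrary, so a relabeled but identical partition scores 1.0.
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

Because only pair memberships matter, the measure does not require the cluster ids to match the class ids.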

Normalized Mutual Information

The measure Normalized Mutual Information (NMI), in its commonly used geometric-mean-normalized form, is given by

$NMI=\frac{\sum_{j=1}^{k}\sum_{j'=1}^{k^*} N_{jj'}\log\left(\frac{N\,N_{jj'}}{N_{j}N_{j'}}\right)}{\sqrt{\left(\sum_{j=1}^{k} N_{j}\log\frac{N_{j}}{N}\right)\left(\sum_{j'=1}^{k^*} N_{j'}\log\frac{N_{j'}}{N}\right)}}$

where $N$ denotes the total number of observations in the dataset, and the true number of clusters and the number of clusters derived are given by $k^*$ and $k$, respectively. Here $N_{jj'}$ denotes the number of agreements between cluster $j$ and true class $j'$. The numbers of observations in cluster $j$ and true class $j'$ are given by $N_{j}$ and $N_{j'}$, respectively. NMI is a measure based on the mutual information between the true data partition and the obtained clusters. NMI can have higher values for unequal-sized clusters, especially in the presence of multiple smaller clusters. It also takes values in the interval $[0,1]$, and higher values suggest better data clusters.
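As a rough sketch, NMI can be computed directly from the counts $N$, $N_{j}$, $N_{j'}$, and $N_{jj'}$ defined above. The version below assumes the geometric-mean normalization (the square root of the product of the two entropy terms); other normalizations exist, and the function name `nmi` and the zero-denominator convention are our own choices for illustration.

```python
import math
from collections import Counter

def nmi(true_labels, cluster_labels):
    """Normalized mutual information with geometric-mean normalization."""
    n = len(true_labels)
    cluster_sizes = Counter(cluster_labels)            # N_j
    class_sizes = Counter(true_labels)                 # N_j'
    joint = Counter(zip(cluster_labels, true_labels))  # N_jj'

    # Numerator: sum of N_jj' * log(N * N_jj' / (N_j * N_j')) over non-empty cells.
    mutual = sum(
        n_jj * math.log(n * n_jj / (cluster_sizes[j] * class_sizes[jp]))
        for (j, jp), n_jj in joint.items()
    )
    # Each sum is <= 0 (minus N times an entropy), so their product is >= 0.
    h_clusters = sum(n_j * math.log(n_j / n) for n_j in cluster_sizes.values())
    h_classes = sum(n_jp * math.log(n_jp / n) for n_jp in class_sizes.values())
    denominator = math.sqrt(h_clusters * h_classes)
    # Convention chosen here: a single-cluster partition carries no information.
    return mutual / denominator if denominator > 0 else 0.0

print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))  # identical partitions up to relabeling
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))  # statistically independent partitions
```

The first call compares identical partitions and the second compares independent ones, which correspond to the two ends of the $[0,1]$ range.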



As the foundations have already been laid, we now describe how the experiments are done. There are two experiments: one with a vanilla autoencoder and another with an autoencoder with a hint of supervision. With the vanilla autoencoder, we solely extract the embeddings from the bottleneck and then proceed with the clustering. With the second approach, we attach a classification head to the bottleneck and train both the autoencoder and the classifier. The goal here is to optimize the two objectives jointly. While the autoencoder learns about the image as a whole, the classifier provides information about which image belongs to which class. This way, the embeddings of similar classes flock together. As we will see later in the report, the two objectives work against each other. Here the embeddings not only represent images but also align themselves in clusters. The added advantage of this method is that the model learns in a regularized supervised environment, which adds to the generalizability of the learned representation.


For our experiments, we are using the CIFAR-10 dataset. The dataset consists of 60,000 images (50,000 train and 10,000 test) in 10 classes. The images are of dimension (32, 32, 3), which is the height, width, and number of channels, respectively. As this is going to be an image retrieval task, we have decided to use a bigger training set and a smaller test set.

The dataset that we decided upon has the following dimensions:

X_train shape: (57000, 32, 32, 3)

y_train shape: (57000, 1)

X_test shape: (3000, 32, 32, 3)

y_test shape: (3000, 1)
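The report does not say how this split was produced. One way to arrive at these sizes, sketched below as an assumption, is to pool all 60,000 images and hold out 300 images per class for testing. The helper `stratified_split` and the stand-in label list are hypothetical and only illustrate the arithmetic.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_per_class, seed=0):
    """Return (train_idx, test_idx) with exactly test_per_class test items per class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        test_idx.extend(indices[:test_per_class])
        train_idx.extend(indices[test_per_class:])
    return train_idx, test_idx

# Stand-in labels: 10 classes x 6,000 images, as in the pooled CIFAR-10 data.
labels = [c for c in range(10) for _ in range(6000)]
train_idx, test_idx = stratified_split(labels, test_per_class=300)
print(len(train_idx), len(test_idx))  # 57000 3000
```

Holding out the same number of images per class keeps both subsets balanced across the 10 classes.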

The dataset is equally distributed among its 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck.

Vanilla Autoencoder

In the autoencoder, we have used blocks of $Conv-BatchNorm-ReLU$. In the encoder, these blocks are followed by $MaxPooling$ layers. In the decoder, these blocks are preceded by $UpSampling$ layers. The last $Conv$ layer in the encoder is the bottleneck, and the output of this is used as our latent space embeddings.

A vanilla autoencoder is trained on the training data. The model uses the Adam optimizer with default parameter settings. The loss is the mean squared error between the input image and the reconstructed image. The loss, in this case, is self-supervised, as we have not used any label information.
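The architecture described above might look roughly like the following Keras sketch. The block order, optimizer, and loss follow the text. The number of blocks, the filter counts, the bottleneck size, the sigmoid output (which assumes pixel values scaled to [0, 1]), and the names `conv_block` and `build_autoencoder` are assumptions made for illustration and are not taken from the notebooks.

```python
from tensorflow import keras
from tensorflow.keras import layers

def conv_block(x, filters):
    """One Conv-BatchNorm-ReLU block."""
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def build_autoencoder(input_shape=(32, 32, 3)):
    inputs = keras.Input(shape=input_shape)

    # Encoder: each block is followed by MaxPooling.
    x = conv_block(inputs, 32)
    x = layers.MaxPooling2D(2)(x)            # 16 x 16
    x = conv_block(x, 64)
    x = layers.MaxPooling2D(2)(x)            # 8 x 8

    # The last Conv layer of the encoder is the bottleneck (assumed 8 x 8 x 16).
    latent = layers.Conv2D(16, 3, padding="same", name="bottleneck")(x)

    # Decoder: each block is preceded by UpSampling.
    x = layers.UpSampling2D(2)(latent)       # 16 x 16
    x = conv_block(x, 64)
    x = layers.UpSampling2D(2)(x)            # 32 x 32
    x = conv_block(x, 32)
    outputs = layers.Conv2D(3, 3, padding="same", activation="sigmoid")(x)

    autoencoder = keras.Model(inputs, outputs, name="autoencoder")
    encoder = keras.Model(inputs, latent, name="encoder")
    # Adam with default settings and MSE between input and reconstruction.
    autoencoder.compile(optimizer="adam", loss="mse")
    return autoencoder, encoder

autoencoder, encoder = build_autoencoder()
```

Training would then call `autoencoder.fit(X, X, ...)` with the images as both input and target, and `encoder.predict(X)` would return the latent space embeddings.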

The Reconstruction

One can see from the reconstructions of the test images that the autoencoder did well. This suggests that the latent space has indeed learned a representation of the images.

Figures: original images from the test set, followed by their reconstructions.

t-SNE of the Embeddings

t-SNE is a powerful dimensionality reduction tool. It reduces the dimensions of a dataset to 2 or 3 while trying to keep points that are close in the original space close in the projection. After training is over, the training images are sent through the model in a forward pass, and the embeddings are extracted. We then plot and analyze the t-SNE projection of these embeddings. Our motive for plotting the t-SNE was to run a sanity check on our thought experiment: had the t-SNE plot been any less promising, we would have known that the clustering technique would not work.


Evaluation on Clustering

The clustering process is simple: the embeddings extracted from the training images are passed to the k-means algorithm.

classifier = KMeans(n_clusters=10, random_state=0).fit(pred)

Here, classifier is the fitted k-means object, which can be used to predict the cluster to which another embedding belongs.
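A slightly fuller sketch of this step is shown below. It assumes the encoder output is a 4-D array that must be flattened to one row per image before clustering, because scikit-learn's `KMeans` expects a 2-D array. The random arrays and the embedding shape `(4, 4, 8)` are stand-ins for real embeddings, and `n_init=10` is set explicitly so the call behaves the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the encoder output on the training images: (n_images, h, w, channels).
rng = np.random.default_rng(0)
train_embeddings = rng.normal(size=(200, 4, 4, 8))

# k-means expects one row per image, so flatten each embedding to a vector.
pred = train_embeddings.reshape(len(train_embeddings), -1)

classifier = KMeans(n_clusters=10, n_init=10, random_state=0).fit(pred)

# Cluster assignment for a new (query) embedding.
query = rng.normal(size=(1, 4, 4, 8)).reshape(1, -1)
query_cluster = classifier.predict(query)[0]

# Indices of the repository items that share the query's cluster.
same_cluster = np.flatnonzero(classifier.labels_ == query_cluster)
print(query_cluster, len(same_cluster))
```

The attribute `classifier.labels_` holds the cluster of every training embedding, which is what the retrieval step filters on.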

As described above, we evaluate the clusters with NMI and RI. For the vanilla autoencoder, the clustering metrics are NMI = 0.074 and RI = 0.811.

The Final Retrieval of the Images

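In this step, we take a random image from the test set and follow the retrieval procedure described earlier: compute its embedding, predict its cluster, and rank the repository embeddings in that cluster by their Euclidean distance to the query.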

The test results look promising, but there is a flavor of bias, since the representations were learned in an unsupervised framework with no information on the label space. Though the high-confidence results seem logical, factors such as color affect the representations, which in turn leads to certain irrelevant images being retrieved for the given query.

Autoencoders with a Hint of Supervision

In this method, the autoencoder remains the same; that is to say, the architecture and the configuration are unchanged. The only change is the addition of a classification head to the bottleneck. Tying a classifier to the bottleneck has its pros and cons. Now, the latent space embeddings are updated through backpropagation not only by the loss of the autoencoder but also by the classification loss. Here we are not only instructing the embeddings to pick up the representation of the images but also providing the class information so that similar images flock together in the embedding space.
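The joint objective can be written down as a sketch. The report does not state the exact classification loss or how the two terms are weighted, so the cross-entropy term and the weight $\lambda$ below are assumptions for illustration. Let $f$ be the encoder, $g$ the decoder, $h$ the classification head with softmax outputs $h_c$, $z = f(x)$ the embedding, $y$ the one-hot class label, and $D = 32 \cdot 32 \cdot 3$ the number of pixel values per image:

$\mathcal{L}(x, y) = \mathcal{L}_{rec} + \lambda\,\mathcal{L}_{cls}, \qquad \mathcal{L}_{rec} = \frac{1}{D}\lVert x - g(z) \rVert_2^2, \qquad \mathcal{L}_{cls} = -\sum_{c=1}^{10} y_c \log h_c(z)$

The embedding therefore receives gradient from both terms:

$\frac{\partial \mathcal{L}}{\partial z} = \frac{\partial \mathcal{L}_{rec}}{\partial z} + \lambda\,\frac{\partial \mathcal{L}_{cls}}{\partial z}$

Setting $\lambda = 0$ recovers the vanilla autoencoder, while a larger $\lambda$ pulls the embeddings towards class-wise alignment.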

The Reconstruction

Here we notice something interesting. The reconstructions are not as good as with the vanilla autoencoder. This means that the classification head is interfering with the embeddings. Come to think of it, the two objectives (reconstruction and classification) are fighting against each other in the process. This leads to a poorer reconstruction but a better latent space alignment, as is shown later in the report.

Figures: original images from the test set, followed by their reconstructions.

t-SNE of Embeddings

The purpose of t-SNE has already been laid out; let's look at the plot.


The Evaluation on the Clustering

In this section, we compare the two approaches from a clustering perspective since our primary intent is image retrieval.

| Model | NMI | RI |
| --- | --- | --- |
| Vanilla autoencoder | 0.074 | 0.811 |
| Autoencoder with a hint of supervision | 0.433 | 0.876 |

Both NMI and RI increase with the addition of the classification head: NMI rises from 0.074 to 0.433 and RI from 0.811 to 0.876. This suggests that the representation learned in the second case is much more informative and carries contextual information.

The Final Retrieval of the Images

The process of image retrieval has already been discussed. Here we look at the results with the new representations learned with a hint of supervision.



The retrievals are much better in terms of perceptual closeness. The color bias seems to go down, while similar images cluster together. The addition of a single classification head has indeed boosted our performance in the image retrieval task.


This report came into existence due to human thoughts and the will to collaborate. It talks about representation learning for an image retrieval mechanism, first in an unsupervised framework and then with a hint of supervision. We showed that representations learned in an unsupervised framework carry a particular bias or sensitivity towards specific attributes, which affects their quality. We also demonstrated that by adding a hint of supervision to the framework, the representations learned are much more informative and meaningful. The same idea has been illustrated in a simple image retrieval framework to understand the enhancements.

Another critical point to highlight is that this report does not cover state-of-the-art methods for image retrieval. The primary purpose was to understand the quality of the representations learned in both scenarios and how specific unwanted attribute effects can be removed using a hint of supervision in the system. We would love to hear from the reader. You can reach both of us via Twitter.

References:

[1] Adrian Rosebrock, 'Keras: Multiple outputs and multiple losses' (blog). Gives a brilliant illustration of multiple-output classification with multiple loss functions.

[2] Souradip Chakraborty, Amlan Jyoti Das, and Sai Yaswanth, 'Risks and Caution on applying PCA for Supervised Learning Problems' (blog). Explains a very crucial problem with using projections in unsupervised frameworks.