# Understanding what works (and why) in Deep Metric Learning

Made by Karsten Roth using Weights & Biases

## Overview

Deep Metric Learning (DML) aims to provide insights into how one can best learn models that encode the notion of (image) similarity. Most commonly, the DML setting involves training a deep neural network that takes an image and returns a dense, vectorized representation (this is different from e.g. classification where your network outputs categorical representations!). In a perfect world, distances between these representations then directly relate to "semantic" similarities in the original data space. An example: I have three images - two containing a picture of the same car, but rotated differently, and one showing a car of a different make. A good DML model then provides representations (embeddings) for each image such that the distance between the representations of the first two images is smaller than that of any of those two images to the third one.
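The three-image example can be sketched numerically. This is a minimal illustration only: the 3-d vectors are made-up stand-ins for model outputs (real embeddings are much higher-dimensional), and Euclidean distance is assumed as the distance function.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical embeddings, invented for illustration only.
car_a_view1 = [0.9, 0.1, 0.0]  # first car
car_a_view2 = [0.8, 0.2, 0.1]  # same car, rotated differently
car_b = [0.1, 0.9, 0.3]        # car of a different make

d_same = euclidean(car_a_view1, car_a_view2)
d_diff_1 = euclidean(car_a_view1, car_b)
d_diff_2 = euclidean(car_a_view2, car_b)

# A good embedding places the two views of the same car closer together
# than either view is to the other car.
is_consistent = d_same < min(d_diff_1, d_diff_2)
print(is_consistent)
```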

Such models have many applications, in particular image similarity tasks that reduce to image retrieval: given one query image and a large database of other images, show me which of my database images are semantically similar to my query. Such retrieval applications are important for product recommendations, face re-identification, or, in the medical domain, grouping together variations of a specific cell type. For example, Yang et al. used similarity-based approaches to aggregate similar images of cells under a microscope, making it easier for physicians to diagnose diseases.
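Once embeddings exist, retrieval reduces to a nearest-neighbour search. The sketch below assumes Euclidean distance and a brute-force scan over a tiny made-up database; the function name `retrieve` and all vectors are hypothetical, and a real system would typically use an approximate nearest-neighbour index instead.

```python
import math

def retrieve(query, database, k=2):
    """Return the indices of the k database embeddings closest to the query."""
    dists = [(math.dist(query, emb), idx) for idx, emb in enumerate(database)]
    return [idx for _, idx in sorted(dists)[:k]]

# Hypothetical 2-d embeddings of four database images.
database = [[0.0, 1.0], [0.9, 0.1], [1.0, 0.0], [0.1, 0.8]]
query = [1.0, 0.05]

top2 = retrieve(query, database, k=2)
print(top2)  # indices of the two most similar database images
```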

Image from https://github.com/easonyang1996/DML_HistoImgRetrieval

Unfortunately, DML research is plagued by inconsistency: papers use different training protocols, architectures, and parameters, which makes it impossible to compare techniques and to find out what works and what doesn't. To address this, our paper introduces the following crucial elements:

1. We quantified how factors of variation in DML research pipelines affect the final reported results. These include, for example, different network architectures, different batch sizes, varying data augmentation, or weight regularization. We are also the first to look at the influence that the selection of the mini-batch has on training performance, something that had never really been put into question. We show that all these factors, if adjusted correctly, can boost performance independently of the actually proposed method. While this is good to know if one wishes to go for peak-performance models, it also breaks fair comparisons. Thus, based on these variations, we proposed a standardized training setup that allows for fair comparison between different methods.
2. We benchmarked models from 15 papers covering baseline and proclaimed state-of-the-art methods using our fair and standardized setup. We found that under fair comparison, most of the methods published in recent years perform very similarly, putting into question the actual progress made in DML research.
3. While similar, some methods still outperform others. To understand why, we investigated a set of embedding space metrics to see what properties better-performing methods (implicitly) optimize for. We found that metrics commonly used in the literature to explain improvements in performance over other approaches were not able to explain the difference in performance on the test set when evaluated against our large set of investigated objectives. To remedy this issue, we introduced two other embedding space metrics that had a much stronger link to the test performance.
4. Based on these insights, we propose a regularization that implicitly optimizes for embedding space properties beneficial for generalization, and that can be applied on top of most DML methods to boost generalization performance.
5. Finally, to facilitate research for people new to this field, we provide an introduction to key methods, metrics and approaches in DML in the supplementary, as well as a code repository containing implementations for everything done in the paper.

Green shows regularized variants of existing objectives, while orange shows the performance of said objectives under our fair training and evaluation setting. As can be seen, there is a large performance plateau, which is only broken by applying our proposed regularization.

### 1. Understanding how one can "cheat" in Deep Metric Learning

Note: Figures in this section are results for the CARS196 benchmark dataset. For the other benchmarks, just take a look at the paper; the insights gained are transferable.

To understand how dependent reported performance is on the utilized training pipeline, we first benchmarked how different common factors of variation can contribute to performance improvements, such as choices of backbone architecture, dimensionality of the embedding space, augmentation protocols and batch sizes:

This plot visualizes relative performance changes. As can be seen, performance changes notably across different methods solely by adjusting these factors. This can, either intentionally or unintentionally, be used to improve the performance of methods that would otherwise fall behind under fair comparison.

In addition, we accounted for an often unaddressed factor, namely the construction of the mini-batches. As most DML methods utilize a ranking surrogate over tuples sampled from the mini-batch, it also seems sensible to look at how the distribution of samples within such a batch can influence performance:

In short, there is a notable benefit to improving the diversity of classes (and thus negative samples) instead of the diversity of samples from single classes. This is something that is most often left unreported in the literature.
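The trade-off between class diversity and per-class diversity can be made concrete with a small batch sampler. This is a hedged sketch and not the sampler from the paper's repository: the function `sample_batch` and its parameters are hypothetical, and the toy labels are invented. For a fixed batch size, fewer samples per class means more distinct classes in the batch.

```python
import random
from collections import defaultdict

def sample_batch(labels, batch_size, samples_per_class, rng):
    """Draw batch_size indices: samples_per_class items from each of
    batch_size // samples_per_class randomly chosen classes."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    n_classes = batch_size // samples_per_class
    # Only classes with enough samples can contribute to the batch.
    eligible = [c for c, idxs in by_class.items() if len(idxs) >= samples_per_class]
    chosen = rng.sample(eligible, n_classes)

    batch = []
    for c in chosen:
        batch.extend(rng.sample(by_class[c], samples_per_class))
    return batch

labels = [i // 8 for i in range(160)]  # toy data: 20 classes with 8 samples each
rng = random.Random(0)

diverse = sample_batch(labels, batch_size=16, samples_per_class=2, rng=rng)  # 8 classes
narrow = sample_batch(labels, batch_size=16, samples_per_class=8, rng=rng)   # 2 classes
print(len({labels[i] for i in diverse}), len({labels[i] for i in narrow}))
```

Both batches contain 16 samples; the first one exposes each anchor to negatives from seven other classes, the second to negatives from only one.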

Quantifying these factors thus allows us to provide a consistent training pipeline that ensures a fair comparison in the next section.

### 2. Examining how published methods perform under a fair comparison

To examine how much DML models actually differ in performance, we trained both recent state-of-the-art as well as established baseline methods under a comparable training pipeline.

For reproduction and further study, we made these runs public. They contain the performance measured across all relevant DML evaluation metrics, as well as the values and progression of the embedding space metrics of interest that are also used in the subsequent sections:

As we care primarily about the mean peak performance, results are also summarized in a table for all methods and benchmarks:

As can be seen, the majority of improvements are captured within the standard deviations, which stands in contrast to most methods claiming state-of-the-art performance, often by a not insignificant margin!

### 3. Understanding what embedding spaces should look like for maximum generalization performance

However, while there exists a notable saturation in performance, one can still see differences in the zero-shot generalization performance of methods. The question we asked ourselves thus is why some methods perform better than others - how do they structure and learn their representation space to encourage better generalization performance?

To understand this, we look into two common metrics used to explain changes in performance: Intra-class and Inter-class distances.

Intra-class distance $\pi_\text{intra}$ measures how tightly samples from the same class are clustered in the embedding space, while inter-class distance $\pi_\text{inter}$ measures how far apart different class clusters are projected. Both metrics were (and still are) commonly used to justify changes in performance.
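Both quantities are straightforward to compute from a set of labelled embeddings. The sketch below makes several assumptions that the text does not fix: Euclidean distance, intra-class distance as the mean over all same-class pairs, and inter-class distance as the mean distance between class centers. The helper name `class_distances` and the toy data are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations

def class_distances(embeddings, labels):
    """Return (pi_intra, pi_inter) under the assumptions stated above.
    Needs at least one class with two samples and at least two classes."""
    groups = defaultdict(list)
    for emb, label in zip(embeddings, labels):
        groups[label].append(emb)

    # Intra-class: mean pairwise distance between samples sharing a label.
    intra = [math.dist(a, b) for g in groups.values() for a, b in combinations(g, 2)]

    # Inter-class: mean pairwise distance between class centers (per-dimension means).
    centers = [[sum(col) / len(g) for col in zip(*g)] for g in groups.values()]
    inter = [math.dist(a, b) for a, b in combinations(centers, 2)]

    return sum(intra) / len(intra), sum(inter) / len(inter)

# Toy data: two classes, two samples each.
embeddings = [[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]]
labels = [0, 0, 1, 1]
pi_intra, pi_inter = class_distances(embeddings, labels)
print(pi_intra, pi_inter)
```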

But when we study the correlation between test metrics and these metrics on the training data, there is no consistent link between changes in these metrics and the downstream zero-shot test performance. (This analysis uses our W&B runs introduced in Section 2; a script to download, evaluate and visualize these runs is available in the official GitHub repository.)

Thus, we introduce two novel metrics that may provide a better understanding. The first is the embedding space density $\pi_\text{ratio}$, which is just the ratio of intra-class and inter-class distance and measures how uniformly and densely the training data is embedded in the available embedding space. The second is the spectral decay $\rho(\Phi)$, which essentially computes the singular value distribution of the learned embedding space via Singular Value Decomposition (SVD) and compares it to a uniform distribution, with the goal of measuring how many different features are used to solve the training objective.
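One plausible way to write the two metrics down is given here. The notation is an assumption for illustration: $\mathcal{S}_\Phi$ denotes the singular values of the embedding matrix normalized to sum to one, $\mathcal{U}$ is the uniform distribution over the same number of entries, and the Kullback-Leibler divergence is assumed as the way of "comparing" the two, which the text above does not specify.

$$
\pi_\text{ratio}(\Phi) = \frac{\pi_\text{intra}(\Phi)}{\pi_\text{inter}(\Phi)},
\qquad
\rho(\Phi) = \mathrm{KL}\left(\mathcal{U} \,\|\, \mathcal{S}_\Phi\right)
$$

Under this assumed form, the divergence is zero exactly when all singular values are equal, that is, when every direction of the embedding space is used equally.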

In both cases, we find a much stronger link to generalization performance, which agrees with subsequent work by Wang et al. that highlights similar benefits of embedding space uniformity for generalization in self-supervised representation learning. The spectral decay in particular appears to be the most reliable predictor of generalization performance to novel classes. Intuitively, one may attribute this to the fact that a higher feature diversity can increase the chance of having descriptors for the novel data and classes at test time.

### 4. Improve generalization of your DML method with this one simple trick!

Finally, we try to incorporate the insights gained to improve the performance of DML methods by implicitly aiming to improve the spectral decay $\rho(\Phi)$ and density $\pi_\text{ratio}$.

To do so, we introduce a regularization term, dubbed $\rho$-regularization, that can be added to any ranking-based DML objective and is very easy to implement: simply swap a negative sample $x_{n_i}$ in your training tuple $(x_a, x_p, x_{n_1}, ..., x_{n_N})$ (with anchor sample $x_a$ and positive sample $x_p$ from the same class) with any positive sample. This both keeps classes from being embedded too tightly together and introduces intra-class features that, while not necessarily important for solving the training task, may come in handy at test time. Understanding how to separate samples within a class, such as a specific car make, may introduce features that help distinguish between different cars at test time.
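The swap itself fits in a few lines. The following is a minimal sketch, not the implementation from the official repository. The function name `rho_regularize`, the switching probability `p_switch` (the text above does not say how often the swap is applied) and the choice to exclude the anchor and the existing positive from the candidates are all assumptions made for illustration.

```python
import random

def rho_regularize(tuple_indices, labels, p_switch, rng):
    """With probability p_switch, replace one negative in the tuple
    (anchor, positive, negative_1, ..., negative_N) with a sample
    from the anchor's own class."""
    anchor, positive, *negatives = tuple_indices
    if negatives and rng.random() < p_switch:
        # Candidate replacements: same class as the anchor, but not the
        # anchor or the positive already in the tuple (an assumption).
        candidates = [i for i, y in enumerate(labels)
                      if y == labels[anchor] and i not in (anchor, positive)]
        if candidates:
            negatives[rng.randrange(len(negatives))] = rng.choice(candidates)
    return (anchor, positive, *negatives)

labels = [0, 0, 0, 1, 1, 2]   # toy labels for six samples
rng = random.Random(0)
original = (0, 1, 3, 5)       # anchor 0, positive 1, negatives 3 and 5

switched = rho_regularize(original, labels, p_switch=1.0, rng=rng)
unchanged = rho_regularize(original, labels, p_switch=0.0, rng=rng)
print(switched, unchanged)
```

The ranking objective is left untouched; it now simply has to push apart two samples of the same class whenever a swap occurs.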

We find that such $\rho$-regularization indeed introduces features that may be beneficial at test time, while not necessarily being important to solve the training task.

Let's first take a look at a toy example:

Here, training data consists of point groups (marked by colour) along a diagonal, and the goal of the ranking objective is to train a small network to provide a unit sphere projection in which these are still separated (second picture).
While this is easily achieved, the small network has a hard time separating newly introduced test points that are aligned both vertically and horizontally, even though the training data provides context on both vertical and horizontal alignment!

This is where $\rho$-regularization comes in handy, as learning to separate samples within a class more explicitly introduces the idea of "vertical" and "horizontal" separation (second-to-last picture). Think about it: to solve this specific toy task, the network only needs to learn to separate training samples either horizontally or vertically. And networks love shortcuts! However, it is more than reasonable to expect that horizontally or vertically aligned classes occur at test time. And, as expected, we see an improved spectral decay $\rho(\Phi)$ (a "flatter" singular value distribution).

Finally, when applied to actual benchmarks, we see a large performance improvement, especially on datasets with more samples per class such as CUB200-2011 and CARS196, that noticeably outperforms all state-of-the-art methods under fair comparison.

### 5. Conclusion

To better understand the state of Deep Metric Learning, we perform a large-scale study of existing methods under fair comparison, simultaneously evaluating a large set of performance and embedding space metrics. We find that most DML methods perform very similarly, even though each claims state-of-the-art performance in its respective publication. In addition, we are able to attribute the differences that do exist to changes in two newly introduced embedding space metrics.

Such a study involved thousands of training runs, each logging up to 40 different metrics. Without W&B to aggregate runs from different servers and to provide an easy way to summarize and share them, this work would have required significantly more time, especially at this scale.