# Unsupervised Visual Representation Learning with SwAV

In this report, we explore the SwAV framework, as presented in the paper "Unsupervised Learning of Visual Features by Contrasting Cluster Assignments" by Caron et al. We also discuss how SwAV addresses common problems in existing self-supervised methods.
Ayush Thakur

Unsupervised visual representation learning is progressing at an exceptionally fast pace. Most modern training frameworks in this area (SimCLR [1], BYOL [2], MoCo v2 [3]) pre-train a self-supervised model with some contrastive learning objective. Saying these frameworks hold their own against supervised pre-training would be an understatement, as evident from the figure below -

-> Figure 1: Top-1 accuracy of linear classifiers trained with the frozen features of different self-supervised methods w.r.t the fully supervised methods (Source: SwAV [4]). <-

Moreover, when the features learned by these self-supervised methods are fine-tuned with as little as 1% or 10% of the labeled training data, they show tremendous performance -

-> Figure 2: Performance of different semi-supervised and self-supervised frameworks on fine-tuning with very little labeled data (Source: SwAV [4]). <-

From the above two figures, it is clear that SwAV (published in July 2020) is currently prevailing in the results and is the SoTA in self-supervised learning for visual recognition. This report will discuss the novel components that make SwAV such a powerful self-supervised method, along with short code walkthroughs.

We expect that you are already familiar with how self-supervised learning works at a high level. If not, this blog post by Jeremy Howard can help you get started.

## Existing Problems with Self-Supervised methods

Methods for self-supervised learning generally formulate some kind of supervised signal from an unlabeled dataset - for example, predicting rotation angles, the next word from a sequence of words, or masked words in a sequence. These tasks are typically referred to as pretext tasks. Jing et al. [5] provide a comprehensive overview of such pretext tasks for visual representation learning. Existing works show that representations learned using these pretext tasks can transfer quite well (often beating fully supervised approaches) to downstream tasks such as image classification, semantic segmentation, and object detection. We get a fair sense of this from the figures shown above.

There are two distinct families of self-supervised learning frameworks for visual representation learning -

• Clustering-based approaches that first extract features from a large pool of images using a feature extractor (typically a ResNet50) and then apply some form of clustering on top of those features to group semantically similar image features [6] [7]. The underlying model is then trained on these cluster assignments (treated as pseudo-labels) in a similar manner as supervised training.
• Noise contrastive estimation based approaches where models learn to maximize the agreement between semantically similar images while minimizing the same for different images. These approaches rely on two primary elements - a. contrastive loss, b. strong image transformations [1] [2] [3] [8]. Please note that this form of learning is also referred to as contrastive learning. You can refer to this report if you want to get a more detailed overview of contrastive learning.

Both of these approaches suffer from several problems:

• Most of the clustering-based approaches are generally offline, i.e., they require a forward pass on the entire dataset to calculate the cluster assignments. For large datasets, this quickly becomes computationally expensive.

• Methods based on noise contrastive estimation generally operate by comparing different pairs of images and then calculating a contrastive loss. These pairs are generally formed using strong data augmentation techniques like random resized crops, color distortions, and horizontal flips. Unlike the clustering-based approaches, these methods base the loss calculation on feature-wise comparisons, which becomes computationally expensive for large datasets.

Besides, in order to have enough negative samples (to make the model aware of semantically different images), these methods maintain a large memory bank - a large queue of previously computed features - or a sizeable batch size. This also introduces computational expense. Methods like BYOL [2] and MoCo [3] use a momentum encoder to compute a separate set of features from the queued features to enhance the contrastive learning objective.

To give you a fair idea of how computationally expensive these methods can be, here are some quotes from the SimCLR[1] and MoCo[3] papers respectively -

"With 128 TPU v3 cores, it takes ∼1.5 hours to train our ResNet-50 with a batch size of 4096 for 100 epochs."

"In contrast to SimCLR’s large 4k∼8k batches, which require TPU support, our “MoCo v2” baselines can run on a typical 8-GPU machine and achieve better results than SimCLR."

Motivated by these problems, SwAV presents a simplified training pipeline for self-supervised visual representation learning. Furthermore, it introduces a multi-crop augmentation policy that produces an increased number of views from the training images. This dramatically improves performance, as we will see in a moment.

## High-Level Overview of SwAV

In the coming sections, we will unravel the finer details of this training paradigm. The authors of Unsupervised Feature Learning via Non-Parametric Instance Discrimination [9] investigated a question:

Can we learn a meaningful metric that reflects apparent similarity among instances via pure discriminative learning?

To answer this, they devised a novel unsupervised feature learning approach called instance-level discrimination. Here, each image and its transformations/views are treated as instances of the same class, while every image is treated as its own separate class. The aim is to learn an embedding mapping $x$ (image) to $v$ (feature) such that semantically similar instances (images) are closer in the embedding space. The current state-of-the-art self-supervised learning algorithms follow this instance-level discrimination as a pretext task.

Successful implementation of instance discrimination depends on:

• Contrastive loss - conventionally, this loss compares pairs of image representations to push away representations from different images while bringing similar representations closer.
• Image augmentation - techniques to apply one or more affine transforms to an image to create different views of the same image. These augmentations typically include random resized crops, horizontal flips, color distortions, and Gaussian blurs.
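To make the contrastive objective concrete, here is a minimal NumPy sketch of an InfoNCE-style loss over two views of a batch. The function name, temperature value, and simplifications (we only contrast view-a against view-b, unlike the full NT-Xent loss of SimCLR) are our own illustrative choices:

```python
import numpy as np

def info_nce_loss(z_a, z_b, temperature=0.1):
    """z_a, z_b: (N, D) L2-normalized embeddings of two views of N images."""
    n = z_a.shape[0]
    # cosine similarity between every view-a / view-b pair, scaled by temperature
    logits = (z_a @ z_b.T) / temperature  # (N, N)
    # the matching view of image i sits on the diagonal, so the "label" of row i is i
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(log_probs[np.arange(n), np.arange(n)])
```

Minimizing this pulls the two views of the same image together while pushing away the other images in the batch - the other batch elements act as negative samples, which is why these methods depend on large batches or memory banks.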

SwAV provides improvements to both the aforementioned components - an improvement to the objective function (contrastive loss) and the augmentation policy.

SwAV uses a clustering-based approach and introduces online cluster assignment, which allows the algorithm to scale well, unlike its predecessors. Cluster assignment is achieved by mapping the features to a set of prototype vectors and passing the result through the Sinkhorn Knopp algorithm (more on Sinkhorn Knopp later). The prototype vectors are nothing more than the weights of a Dense/Linear layer; thus, the prototypes are learned by backpropagating the loss. To compare cluster assignments and thereby contrast different image views, SwAV proposes a simple "swapped" prediction problem where the code of one view of an image is predicted from the representation of another view of the same image. The rationale is that if two views are semantically similar, their codes should also be similar. The features are learned by Swapping Assignments between multiple Views of the same image (SwAV).

SwAV also proposes a unique data augmentation policy referred to as multi-crop. Most contrastive methods compare one pair of transformations per image (each transformation is a view), even though prior work on self-supervised learning of pretext-invariant representations [10] suggests that the more views available for comparison, the better the resulting model. However, with multiple views (multiple augmentations) per image, both memory and computational requirements increase. SwAV proposes a multi-crop augmentation policy wherein the same image is randomly cropped to get a pair of high-resolution (e.g., 224x224) views and additionally cropped to get several low-resolution (e.g., 96x96) views. By doing so, we not only get more views for comparison, but the resulting model also becomes more scale-invariant.

-> Figure 3: High-level overview of SwAV (Source: SwAV [4]). <-

Figure 3 gives a high-level view of SwAV. In SwAV,

• First, multiple views of a batch of images are generated using multi-crop and other augmentation operations applied sequentially, like color distortion, random flipping, and random grayscaling (more in the next section).
• The views are passed through a CNN (ResNet50) backbone to get the embedding vector (output of the last Global Average Pooling layer).
• This embedding vector then goes to a shallow non-linear network. The output of this network is the projection vector, denoted by $Z$.
• The projection vector is fed to a single linear layer, i.e., a layer without any non-linearity. This is the prototype layer, denoted by $C$; it maps $Z$ to $K$ trainable prototype vectors. The output of the layer is the dot product between $Z$ and the prototypes. The associated weights matrix of this layer (the one updated during backpropagation) can be considered a learnable prototype bank.
• Finally, a "part" of this linear layer's output is used for cluster assignment via the Sinkhorn Knopp algorithm, and a swapped prediction problem is set up. The output of Sinkhorn Knopp is denoted by $Q$.

We will get into the finer details soon and will dissect it using code.
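As a quick sanity check of the shapes involved, the projection head and prototype layer can be sketched as a small Keras model. The hidden width and helper name below are our own illustrative choices; the 128-d projection and 15 prototypes match the settings used later in this report:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def get_projection_prototype(projection_dim=128, n_prototypes=15):
    # embedding from the ResNet50 Global Average Pooling layer
    inputs = layers.Input((2048,))
    # shallow non-linear projection network -> Z
    x = layers.Dense(512, activation="relu")(inputs)
    x = layers.Dense(projection_dim)(x)
    # L2-normalize so the dot product with the prototypes is a cosine similarity
    projection = layers.UnitNormalization(axis=-1)(x)
    # prototype layer C: a single linear layer (no bias, no non-linearity);
    # its weight matrix is the learnable prototype bank
    prototypes = layers.Dense(n_prototypes, use_bias=False)(projection)
    return models.Model(inputs, [projection, prototypes])
```

Calling this model on a batch of embeddings returns the projection vector $Z$ together with the dot products between $Z$ and the prototypes, which is what feeds the Sinkhorn Knopp step.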

## Multi-Crop Augmentation Policy

This is a simple yet very effective augmentation policy: simple because it produces multiple views of the same image instead of just a pair, without quadratically increasing the memory and computational requirements. This is achieved by the proposed multi-crop strategy, where two standard high-resolution (e.g., 224x224) crops are generated, as shown in Figure 4. These can be seen as global views of the image. The strategy also samples $V$ additional low-resolution (e.g., 96x96) views along with these two. The use of low-resolution views ensures only a small increase in compute cost.

-> Figure 4: Multi-crop augmentation policy as introduced in SwAV (Source: SwAV [4]). <-

This brings us to the effectiveness of this strategy. The authors observed that mapping small parts of a scene to global views of the image can significantly boost performance, as shown in Figure 5.

-> Figure 5: Improvement gains with multi-crop on several self-supervised frameworks (Source: SwAV [4]). <-

However, the multi-crop strategy is just one essential part of this augmentation policy. The policy also includes techniques like color distortion, random flipping, grayscaling, and Gaussian blurring (similar to the augmentation operations used in SimCLR [1]).

Check out our Multi-Crop augmentation code here $\rightarrow$

We implemented the original PyTorch version of the multi-crop augmentation using TensorFlow's tf.data APIs. We believe we have implemented a one-to-one replica of the original implementation.

The next few paragraphs will be a rundown of the key pieces from our implementation.

The first few code blocks are meant to build one view of the image. We will then tie them together. The custom_augment function applies a series of sequential affine transformations (augmentations) to an image (view). We use the random_apply function to apply each transformation with a given probability; loosely speaking, this controls the strength of that transformation across a batch of views. We apply random flips, Gaussian blur (code adapted from here), color jitter, and finally color drop. One can experiment with the order of these transformations.

```python
def custom_augment(image):
    # Random flips
    image = random_apply(tf.image.flip_left_right, image, p=0.5)
    # Randomly apply Gaussian blur
    image = random_apply(gaussian_blur, image, p=0.5)
    # Randomly apply color distortions with probability p
    image = random_apply(color_jitter, image, p=0.8)
    # Randomly apply grayscale
    image = random_apply(color_drop, image, p=0.2)
    return image
```


Up next, we have the star of the show. Our random_resize_crop function tries to mimic torchvision.transforms.RandomResizedCrop, which was used in the original implementation. Learn more about this PyTorch API here.

Notice that we first conditionally resize the input image to either 260x260 or 160x160, depending on the required crop_size. Note that crop_size is not the size we crop from the input image; it is the size of the resized output image after cropping. We crop by randomly sampling a size from a uniform distribution whose minimum and maximum are set to min_scale and max_scale times image_shape.

We finally get a crop from the image, and we resize it to crop_size.

```python
def random_resize_crop(image, min_scale, max_scale, crop_size):
    # Conditional resizing
    if crop_size == 224:
        image_shape = 260
        image = tf.image.resize(image, (image_shape, image_shape))
    else:
        image_shape = 160
        image = tf.image.resize(image, (image_shape, image_shape))
    # Get the crop size for the given min and max scale
    size = tf.random.uniform(shape=(1,), minval=min_scale*image_shape,
                             maxval=max_scale*image_shape, dtype=tf.float32)
    size = tf.cast(size, tf.int32)[0]
    # Get the crop from the image
    crop = tf.image.random_crop(image, (size, size, 3))
    # Resize the crop to the desired output resolution
    crop_resize = tf.image.resize(crop, (crop_size, crop_size))
    return crop_resize
```


Finally, we tie together our augmentation policy to generate one view of the image. We first scale the image pixels to the range [0, 1], followed by random_resize_crop. Once we have a high-resolution or low-resolution view of the input image, we apply a set of sequential transformations using custom_augment.

```python
def tie_together(image, min_scale, max_scale, crop_size):
    # Retrieve the image features
    image = image['image']
    # Scale the pixel values
    image = scale_image(image)
    # Random resized crops
    image = random_resize_crop(image, min_scale,
                               max_scale, crop_size)
    # Color distortions & Gaussian blur
    image = custom_augment(image)
    return image
```


The next few code blocks describe the actual logic behind building multiple views and making it available on the fly while training our SwAV architecture.

The get_multires_dataset function takes our usual tf.data.Dataset as input and spits out a list of tf.data.Dataset objects with varying sizes. The default value of the num_crops argument is [2, 3] in our implementation, denoting two high-resolution views and three low-resolution views. The value of num_crops ensures the order of the views; thus, with the default value, the first two views will be 224x224 while the rest will be 96x96.

We use the .map() method to map tie_together, which holds the augmentation logic, onto the tf.data.Dataset.

```python
def get_multires_dataset(dataset,
                         size_crops,
                         num_crops,
                         min_scale,
                         max_scale,
                         options=None):
    loaders = tuple()
    for i, num_crop in enumerate(num_crops):
        for _ in range(num_crop):
            loader = (
                dataset
                .shuffle(1024)
                .map(lambda x: tie_together(x, min_scale[i],
                     max_scale[i], size_crops[i]), num_parallel_calls=AUTO)
            )
            if options != None:
                loader = loader.with_options(options)
            loaders += (loader,)
    return loaders
```

We decided to go with 2 high-resolution crops (224x224) and 3 low-resolution crops (96x96), totaling five views. To get these 5 views on the fly while training, we use tf.data.Dataset.zip.

Check out this Colab notebook to play with our implementation

```python
# Zipping
trainloaders_zipped = (
    tf.data.Dataset.zip(trainloaders)
    .batch(BS)
    .prefetch(AUTO)
)
```


Let's end this section by looking at the image shape of different views and some examples from a view.

```python
im1, im2, im3, im4, im5 = next(iter(trainloaders_zipped))
print(im1.shape, im2.shape, im3.shape, im4.shape, im5.shape)
```


This would print (32, 224, 224, 3) (32, 224, 224, 3) (32, 96, 96, 3) (32, 96, 96, 3) (32, 96, 96, 3). The first dimension denotes the batch size. Note that every time next() is called the order of views remains the same. To play around with this, we suggest checking out the Colab Notebook linked above.

-> Figure 6: Examples from the first view(224x224 resolution). <-

## Cluster Assignments and Contrasting Them

Existing works either use K-Means for the clustering part [6] or use the Sinkhorn Knopp algorithm on the entire feature set [7]. Neither of these approaches scales well. The SwAV authors treat cluster assignment as an optimal transport problem and solve it using the Sinkhorn Knopp algorithm. The difference here is that cluster assignments are done in an online fashion, i.e., the cluster assignments are computed only on the current batch. A full-blown discussion of optimal transport and the Sinkhorn Knopp algorithm is out of scope for this report. However, we will provide a high-level idea of the approach adopted in the paper.

We can set up the clustering problem as an optimal transport problem in the following manner -

"Given $N$ image features, the task is to generate a matrix $Q$ that allocates these features into K clusters."

In order to adapt it to an online variant, here is what the authors do -

• After getting the image features ($Z$), we map the features to a set of prototypes ($C$). We do this by passing the features through a linear layer where the number of neurons equals the number of prototypes. We can treat these prototypes as clusters, as shown in Figure 7.
• To maximize the similarity between the image features and the prototypes, we compute the assignments with the Sinkhorn Knopp algorithm, enforcing an equipartition constraint. This constraint ensures that the different image features do not all get mapped to the same prototype. The optimized assignment matrix we get out of Sinkhorn Knopp is denoted by $Q$ throughout the original paper. We adapted the pseudo-code provided by the authors in Appendix A.1 of the SwAV paper. You can find the implementation here.

-> Figure 7: Schematic representation of generating the initial prototypes from image features. <-

• To keep this process online, we keep the optimized $Q$ in its continuous (soft) form instead of rounding it to a discrete assignment as in [7].

Also, remember that these prototypes are trainable. As we operate on mini-batches, it is essential to have good enough prototypes $C$ and then update them with backpropagation. This helps SwAV function well in this online setting.
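To give a feel for how the codes are computed, here is a small NumPy sketch following the iterative row/column normalization idea of the authors' pseudo-code in Appendix A.1. The epsilon and iteration count below are illustrative defaults, not the paper's tuned values:

```python
import numpy as np

def sinkhorn(scores, eps=0.05, n_iters=3):
    """scores: (B, K) dot products between B features and K prototypes."""
    Q = np.exp(scores / eps).T  # (K, B)
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        # rows: spread total mass equally over the K prototypes (equipartition)
        Q /= Q.sum(axis=1, keepdims=True)
        Q /= K
        # columns: each sample's assignment distribution gets mass 1/B
        Q /= Q.sum(axis=0, keepdims=True)
        Q /= B
    Q *= B  # each column (sample) now sums to 1
    return Q.T  # (B, K) soft codes
```

The returned soft codes play the role of $Q$ in the swapped prediction problem: each row is a distribution over prototypes for one sample, while the equipartition step discourages all samples from collapsing onto the same prototype.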

You can check out our model implementation here $\rightarrow$

We refer readers interested in learning more about the Sinkhorn Knopp algorithm to the following resources -

## Swapped Prediction Problem

This is the heart of SwAV. It beautifully ties the three main components of SwAV - Swapping, Assignments, and Views. The “swapped” prediction problem is set up with the help of the following loss function -

$L\left(\mathbf{z}_{t}, \mathbf{z}_{s}\right)=\ell\left(\mathbf{z}_{t}, \mathbf{q}_{s}\right)+\ell\left(\mathbf{z}_{s}, \mathbf{q}_{t}\right)$

$\ell(z, q)$ denotes the cross-entropy loss between the code ($q$) and the softmax of the dot product between the image features ($z$) and all the prototypes ($C$).

One can consider $q$ as the ground-truth label for a particular image feature, and the softmax of the dot product between the image features ($z$) and all the prototypes ($C$) as the predicted output. The goal is to minimize the cross-entropy between the two. The premise is that if two different views of the same image contain similar information, then it should be possible to predict the code of one view from the feature of the other. Notice the subscripts in the loss function. Figure 8 presents this idea pictorially.

-> Figure 8: Setting up the swapped prediction problem between two separate views of the same image. <-
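The loss above can be sketched in a few lines of NumPy. The variable names and temperature are our own illustrative choices; the codes q would come from the Sinkhorn Knopp step:

```python
import numpy as np

def swav_subloss(z, q, prototypes, temperature=0.1):
    """l(z, q): cross-entropy between the code q (N, K) and the softmax of
    the dot product between features z (N, D) and prototypes C (D, K)."""
    scores = (z @ prototypes) / temperature  # (N, K)
    log_p = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.sum(q * log_p, axis=1))

def swapped_loss(z_t, z_s, q_t, q_s, prototypes):
    # L(z_t, z_s) = l(z_t, q_s) + l(z_s, q_t): predict each view's code
    # from the other view's features
    return swav_subloss(z_t, q_s, prototypes) + swav_subloss(z_s, q_t, prototypes)
```

Note the swap: the features of view $t$ are scored against the code of view $s$, and vice versa, which is what forces the two views to carry consistent cluster information.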

## Single Forward Pass in SwAV

Now let us perform a single forward pass in SwAV to grasp all the theories we have discussed so far. We are going to discuss the train_step function that we have implemented.

[You can check out the implementation in this Colab notebook $\rightarrow$](https://colab.research.google.com/github/ayulockin/SwAV-TF/blob/master/Train_SwAV_10_epochs.ipynb)

#### Multi-Crop

We start with Figure 9, where multiple views of the same image are generated using multi-crop augmentation. In our implementation, we are generating two 224x224 resolution views and three 96x96 resolution views per batch.

-> Figure 9: Get multiple views of a batch of images for a forward pass. <-

We thus start by unpacking the views. Note that im1, im2, im3, im4, and im5 can, in principle, arrive in any order of resolutions. We thus create a list crop_sizes that holds this order. We then compute the unique consecutive counts for each resolution. For example, if the current batch's order is [96, 96, 224, 96, 224], then unique_consecutive_count will be [2, 1, 1, 1]. Finally, idx_crops is the cumulative sum of unique_consecutive_count; for our example, it would be [2, 3, 4, 5]. idx_crops is going to be important, as you will soon see.

```python
from itertools import groupby

def train_step(input_views, feature_backbone, projection_prototype,
               optimizer, crops_for_assign, temperature):
    # ============ retrieve input data ... ============
    im1, im2, im3, im4, im5 = input_views
    inputs = [im1, im2, im3, im4, im5]
    batch_size = inputs[0].shape[0]
    # ============ create crop entries with same shape ... ============
    crop_sizes = [inp.shape[1] for inp in inputs]
    unique_consecutive_count = [len(list(g)) for _, g in groupby(crop_sizes)]
    idx_crops = tf.cumsum(unique_consecutive_count)
```


#### Passing the Concatenated Views Through the Networks

We now have the data and idx_crops and can forward pass the data through our models. Figure 10 describes the forward pass through the backbone ResNet50 model. In the SwAV overview section, we gave a high-level introduction to this. Now let us look at this in action through code.

-> Figure 10: Forward pass through backbone ResNet50 and projection_prototype model. <-

The code block below is the continuation of our train_step function. Notice that concat_input is a slice of inputs depending on idx_crops. We start with start_idx as zero for the first iteration, and end_idx is 2 (as per our example). Thus, we concatenate the same-resolution views and forward pass them through the network at once. The concat_input first goes through the feature_backbone model (ResNet50). Here, _embedding is the output of the Global Average Pooling layer of this model. We concatenate these embeddings in the same order as the view resolutions. The shape of embeddings will be [batch_size*5, 2048] (multiplied by 5 since we have 5 views). Finally, all the embeddings are passed through our projection_prototype model. Following our implementation, we get a projection vector of shape [batch_size*5, 128] and a prototype vector of shape [batch_size*5, 15].

In the code listing below, there are multiple tf.stop_gradient calls. These have been placed carefully in order to exclude those computations from gradient tracing.

```python
    # ============ multi-res forward passes ... ============
    start_idx = 0
    with tf.GradientTape() as tape:
        for end_idx in idx_crops:
            # concatenate same-resolution views; tf.stop_gradient keeps the
            # concatenation itself out of the traced computations
            concat_input = tf.stop_gradient(
                tf.concat(inputs[start_idx:end_idx], axis=0))
            _embedding = feature_backbone(concat_input)  # embeddings of same-dim views
            if start_idx == 0:
                embeddings = _embedding
            else:
                # concatenate the embeddings from all the views
                embeddings = tf.concat((embeddings, _embedding), axis=0)
            start_idx = end_idx

        projection, prototype = projection_prototype(embeddings)
```


#### Sinkhorn Knopp and Swapped Prediction

We have finally arrived at the part where we do the online cluster assignment and set up the swapped prediction problem. Figure 11 shows the idea of swapped prediction, where the code of one view of an image is predicted from the representation of another view of the same image. Let us understand this better using the last section of our train_step, where we compute the contrastive loss and perform backpropagation.

-> Figure 11: Cluster assignment followed by swapped prediction. <-

Our initial hypothesis was that the order of views was randomized. The official implementation made sense that way; however, this line from the paper confused us:

Note that we compute codes using only the full resolution crops.

With a randomized order, no piece of code ensured code computation from the full-resolution crops. We sought clarity from the authors by raising this GitHub issue, which you might find insightful. Our hypothesis was wrong: the order of views is constant, with the first two views being high resolution and the rest low resolution.

The default value for crops_for_assign is [0, 1]. This ensures the use of the high-resolution views for code computation. We use it to take a slice from the prototype vector. For the first iteration, out will be of shape [batch_size, 15], belonging to the first view in crop_sizes. We apply the Sinkhorn Knopp algorithm to get soft codes from the initial prototypes. SwAV is trained to predict these codes from different views of the images, as discussed earlier.

```python
        # ============ swav loss ... ============
        loss = 0
        for i, crop_id in enumerate(crops_for_assign):
            with tape.stop_recording():
                out = prototype[batch_size * crop_id: batch_size * (crop_id + 1)]
                # get assignments
                q = sinkhorn(out)

            # cluster assignment prediction
            subloss = 0
            for v in np.delete(np.arange(np.sum(NUM_CROPS)), crop_id):
                p = tf.nn.softmax(prototype[batch_size * v: batch_size * (v + 1)] / temperature)
                subloss -= tf.math.reduce_mean(tf.math.reduce_sum(q * tf.math.log(p), axis=1))
            loss += subloss / tf.cast((tf.reduce_sum(NUM_CROPS) - 1), tf.float32)

        loss /= len(crops_for_assign)
```


We see that the cluster assignment done by the Sinkhorn Knopp algorithm (the sinkhorn() function) has been put under a tape.stop_recording() context. This ensures the computations for cluster assignment do not get traced for gradient updates.

We based our training loop on the original training loop of SwAV.

You can check out our implementation in this Colab notebook $\rightarrow$

This brings us to the most exciting part of our report - experimental results.

## Experimental Results

Before we proceed toward discussing the results we got from our implementation, let us review our experimental setup.

First and foremost, we implemented SwAV in a minimal capacity. Our aim here is to walk the readers through the primary workflow of SwAV and not focus on the secondary bits. Below we list out the significant differences in our implementation -

• The authors maintain a queue when training with small batches, i.e., when the number of prototypes gets bigger than the batch size. We did not use a queue in our minimal implementation.
• We used two crops of 224x224 resolution and three crops of 96x96 resolution. This is slightly different from the proposed multi-crop settings.
• We used 15 prototypes. In the original paper, the authors use 3000 prototypes for the ImageNet dataset. ImageNet has 1000 classes, while the Flowers dataset has 5, so we linearly scaled down our choice for the number of prototypes (5 * (3000/1000)). In practice, choosing a large enough number of prototypes (relative to the dataset) usually works just fine. The authors also show that the number of prototypes has very little impact on the performance of SwAV.
• We used SGD along with a cosine decay schedule with a base learning rate of 0.1. The authors used a combination of warmup and cosine decay for the learning rate schedule.

In order to make our implementation quick, we only tuned a handful of hyperparameters. For a rigorous comparison of the differences between our implementation and the original one, we redirect the reader to this notebook, which shows SwAV training in an end-to-end manner.

It is recommended that you use a Kaggle Kernel to run it. On Colab, you might run into runtime timeout errors. We used a combination of Kaggle Kernel and a V100-powered AI Platform Notebook on GCP.

We used the Flowers dataset in order to demonstrate SwAV. The dataset is available as a TensorFlow Dataset here.


### Linear Evaluation

In linear evaluation, we keep the feature backbone (ResNet50 in our case), trained using a given framework, frozen and learn a linear classifier on top of it. We can implement this in the following way -

```python
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

def get_linear_classifier():
    # input placeholder
    inputs = Input(shape=(224, 224, 3))
    # get the SwAV feature backbone and freeze it
    feature_backbone = architecture.get_resnet_backbone()
    feature_backbone.trainable = False

    x = feature_backbone(inputs, training=False)
    outputs = Dense(5, activation="softmax")(x)
    linear_model = Model(inputs, outputs)

    return linear_model
```


So, the workflow so far has been -

• Train a feature backbone using SwAV with unlabeled data.
• Gather some annotations for the same data and run linear evaluation.

In the literature, the entire training dataset is generally used during linear evaluation.


### Linear evaluation with the supervised counterpart

We make very few changes to our get_linear_classifier() method for this part -

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def get_linear_classifier(trainable=False):
    inputs = layers.Input(shape=(224, 224, 3))
    EXTRACTOR = tf.keras.applications.ResNet50(weights="imagenet", include_top=False,
                                               input_shape=(224, 224, 3))
    EXTRACTOR.trainable = trainable
    x = EXTRACTOR(inputs, training=False)
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dense(5, activation="softmax")(x)
    classifier = models.Model(inputs=inputs, outputs=x)

    return classifier
```


The major difference is we are now using ImageNet weights as the initialization of our network.


### ResNet50 Trained From Scratch

In this experiment, we train a ResNet50 from scratch to study how effective model pre-training can be.


### Fine-Tuning on 10% Labeled Data

For this experiment, we follow a two-stage training process as described in this guide. We first do a round of training with the weights of the feature backbone frozen. This is typically done with a higher learning rate. We then train the entire network with a lower learning rate in order for the pre-trained features to adapt to the downstream task.
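The two-stage schedule described above can be sketched in Keras as follows. The learning rates, optimizer, and helper name are illustrative assumptions, not our exact settings (see the linked notebook for those):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def two_stage_finetune(feature_backbone, train_ds,
                       epochs_frozen=5, epochs_unfrozen=5):
    inputs = layers.Input(shape=(224, 224, 3))
    feature_backbone.trainable = False  # stage 1: frozen backbone
    x = feature_backbone(inputs, training=False)
    outputs = layers.Dense(5, activation="softmax")(x)
    model = models.Model(inputs, outputs)

    # Stage 1: train only the classifier head with a higher learning rate
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    model.fit(train_ds, epochs=epochs_frozen)

    # Stage 2: unfreeze everything and train with a much lower learning rate
    feature_backbone.trainable = True
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    model.fit(train_ds, epochs=epochs_unfrozen)
    return model
```

Note that re-compiling after toggling trainable is required for the change to take effect, and keeping training=False on the backbone call leaves its batch-norm statistics in inference mode during fine-tuning.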

Check out our fine-tuning Colab Notebook $\rightarrow$


In this case, we can see the clear benefits of the SwAV pre-training. This suggests that the supervised ImageNet-based pre-training may not transfer well to a downstream task when there is very little labeled data.

The Colab Notebook for this experiment is available here $\rightarrow$


This concludes our results section. Be sure to check out the Colab Notebooks and try them out yourself if you get stuck making sense of these results.

## Conclusion

Thank you for sticking with us to the end. Self-supervised visual representation learning has started gaining quite a lot of attention from the research community, and we sincerely hope we were able to provide you with some concrete evidence of why. We saw how the self-supervised learning framework provides a systematic way to model unlabeled data and transfer the learned representations to downstream tasks.

So far, the de facto way of transfer learning in computer vision has been the following -

• Gather a vast annotated dataset.
• Formulate a supervised learning problem and train a well-performing model with the dataset.
• Transfer the learned features to downstream tasks.

Self-supervised learning eliminates the annotation requirement from this setup -

• Gather a vast unlabeled dataset.
• Formulate a supervised learning problem from the unlabeled data and train a well-performing model with the dataset.
• Transfer the learned features to downstream tasks.

The authors demonstrate results on downstream tasks where SwAV pre-trained features (on the ImageNet dataset) are used, and the results are excellent - in some cases, even better than the supervised counterparts.

If you found the report useful, we would love to hear from you. Also, please feel free to let us know if you have any improvement pointers to share with us.

## Acknowledgements

Thanks to Mathilde Caron for providing insightful pointers that helped us minimally implement SwAV. We have already pointed out the differences between our implementation and the official one, and we therefore point readers to the official implementation. This report is written with the motivation of providing easy access to this magical work.

Check out the official PyTorch implementation here $\rightarrow$

Thanks to Jiri Simsa of Google for providing us with tips that helped us improve our data input pipeline.

Thanks to the Google Developers Experts program for providing us with GCP credits.

## References

1. Chen, Ting, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. “A Simple Framework for Contrastive Learning of Visual Representations.” ArXiv:2002.05709 [Cs, Stat], June 2020. arXiv.org, http://arxiv.org/abs/2002.05709.
2. Grill, Jean-Bastien, et al. “Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning.” ArXiv:2006.07733 [Cs, Stat], June 2020. arXiv.org, http://arxiv.org/abs/2006.07733.
3. Chen, Xinlei, et al. “Improved Baselines with Momentum Contrastive Learning.” ArXiv:2003.04297 [Cs], Mar. 2020. arXiv.org, http://arxiv.org/abs/2003.04297.
4. Caron, Mathilde, et al. “Unsupervised Learning of Visual Features by Contrasting Cluster Assignments.” ArXiv:2006.09882 [Cs], July 2020. arXiv.org, http://arxiv.org/abs/2006.09882.
5. Jing, Longlong, and Yingli Tian. “Self-Supervised Visual Feature Learning with Deep Neural Networks: A Survey.” ArXiv:1902.06162 [Cs], Feb. 2019. arXiv.org, http://arxiv.org/abs/1902.06162.
6. Caron, Mathilde, et al. “Deep Clustering for Unsupervised Learning of Visual Features.” ArXiv:1807.05520 [Cs], Mar. 2019. arXiv.org, http://arxiv.org/abs/1807.05520.
7. Asano, Yuki Markus, et al. “Self-Labelling via Simultaneous Clustering and Representation Learning.” ArXiv:1911.05371 [Cs], Feb. 2020. arXiv.org, http://arxiv.org/abs/1911.05371.
8. Xie, Qizhe, et al. “Unsupervised Data Augmentation for Consistency Training.” ArXiv:1904.12848 [Cs, Stat], June 2020. arXiv.org, http://arxiv.org/abs/1904.12848.
9. Zhirong Wu, et al. “Unsupervised Feature Learning via Non-Parametric Instance Discrimination.” ArXiv:1805.01978 [cs.CV], May 2018. arXiv.org, https://arxiv.org/abs/1805.01978.
10. Ishan Misra, et al. “Self-Supervised Learning of Pretext-Invariant Representations.” ArXiv:1912.01991 [cs.CV], Dec. 2019. arXiv.org, https://arxiv.org/abs/1912.01991