
Rising to the Reproducibility Challenge with Weights & Biases

How I used Weights & Biases for rigorous reproduction and benchmarking of ML research

Why Reproducibility?

Research is to see what everybody else has seen, and to think what nobody else has thought. - Albert Szent-Györgyi, 1937 Nobel Prize Winner in Physiology

While Albert Szent-Györgyi's quote captures the contribution of novelty to the meaning and purpose of research, it misses one crucial aspect: reproducibility. If the things we've thought cannot be later thought by others, then we've failed to push research forward.

As the field of deep learning matures, the relative importance of novelty will fade, and work will need to prove itself on grounds of transparency and reproducibility. The ML Reproducibility Challenge is an important step in this direction. The challenge rewards participants for doing the hard and unglamorous work of reproducing a paper published at a top AI conference and validating its central claim.

In this post, I will describe the work I did to reproduce a paper from the CVPR 2020 conference: ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks by Wang et al.

Most critically, I focused on establishing fair benchmarking: comparing performance against other related methods in the same domain as rigorously as possible. In the uncharted waters of deep learning, verifying the stability, consistency, versatility, and efficiency of a proposed method can only be done empirically. This is often viewed as incremental or as drudge work, but it is part of the true nature of science.

Why Weights & Biases?

What started as a simple attempt to reproduce a comparatively plain paper turned into an uphill battle, owing to the sheer number of ablation experiments needed to verify and validate the paper's results and claims. In my early days in academic research, I used to keep a notepad file on my laptop where I jotted down the results of every run in every round of experiments, and I lost countless reference runs because of that habit. Given the structure of a Reproducibility Challenge submission, it was imperative to manage and track my experiments efficiently so that every single run was accounted for. I therefore adopted Weights & Biases as my default workspace for all the information and data generated by my reimplementation. Although many might consider Weights & Biases a tool, I think of it more as a workspace because of the fine-tuned, expertly crafted environment it provides.

Succinctly,

Weights & Biases is a catalyst for transparency in research and for the core values of open science: it gives users an efficient, powerful workspace with well-crafted tools that reduce the friction of experimentation, letting you focus on the idea rather than on logging.

Following the Reproducibility Challenge's official guidelines and my prior experience with research documentation, I reproduced the paper step by step while adding my own set of ablation experiments to further test its central claim. At every step, I integrated Weights & Biases into my experimentation flow, which helped me organise all my results and track runs on an interactive web dashboard. On top of this, Weights & Biases let me run extensive hyper-parameter optimisation with the Sweeps feature. Combining all of this, I was able to put together a thoroughly detailed and transparent reproducibility attempt. In the following sections, I dissect each of the features I used from the WandB suite and show how you can replicate the same.



Reproducing a Paper with Weights & Biases

Just Track It!

Since I had previously published work in the same domain as the paper I selected for this challenge, I already knew how I wanted to structure my experimentation. Before jumping into large-scale experiments, it is often a good idea to run comparative analyses on smaller datasets, which can reflect how efficient the method will be on the corresponding larger datasets. To view the codebase for reproducing all the results and experiments described below, please visit my repository on GitHub.

To kickstart the experimentation, I first ran the model proposed in the paper on the small CIFAR-10 image classification dataset and compared it against other models in the same domain. Each model category had 5 individual runs to provide some statistical insight into the consistency and stability of its performance. This is an efficient way of comparing two models and determining whether the difference between them is significant or just random variation; statisticians often use p-values or Cohen's d to quantify the significance of a difference between two sample sets (a small sketch of the latter follows the code snippet below). Weights & Biases makes this easy to do with a few lines of code. For example, I used the run grouping feature of Weights & Biases to visualise the mean and standard deviation across each group of runs. To do the same, use the following code snippet:

import wandb

num_runs = 5  # number of runs in each group

# Loop through the group: each iteration starts a fresh run assigned to the same group
for i in range(num_runs):
    wandb.init(project='My Reproducibility Project', group='My group', reinit=True)
    # <training loop>
    wandb.log({'Val_Acc': val_acc, 'Val_Loss': loss})  # log metrics to the W&B dashboard
    wandb.finish()  # close this run before starting the next one
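
If you also want a quick effect-size check on top of the grouped view, here is a minimal sketch that computes Cohen's d between two model variants; the accuracy arrays below are hypothetical placeholders, not results from my runs:

import numpy as np

# Hypothetical final validation accuracies from 5 runs of two model variants
eca_acc = np.array([92.1, 91.8, 92.4, 91.5, 92.0])
cbam_acc = np.array([91.9, 92.0, 92.1, 91.8, 92.2])

def cohens_d(a, b):
    # Effect size: difference of means divided by the pooled standard deviation
    pooled_std = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled_std

print(f"Cohen's d: {cohens_d(eca_acc, cbam_acc):.3f}")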

For more information, I highly recommend going through the Grouping documentation, along with my code, which was used to obtain the result below:




[W&B panel: grouped CIFAR-10 runs (run set of 26)]


In the graphs above, you can hover over any point on an individual line plot and inspect the mean of the values at that iteration across the grouped runs, along with the maximum and minimum across all runs at that iteration. This is extremely handy: it showed me that although the method proposed in the paper, Efficient Channel Attention (ECA), had the highest accuracy at the final epoch, it was less stable and had a higher standard deviation, which is why CBAM ended up with a higher mean accuracy at the final epoch and a much lower standard deviation. This kind of analysis is crucial in deep learning research, where small margins can often be attributed to pure randomness in how deep neural networks are initialised.



Sweep the Board!

Deep learning research is, in many ways, a recipe made of a multitude of ingredients, which in more scientific jargon are called hyper-parameters. These hyper-parameters can make the difference between a proposed method providing a performance boost over standard baselines or being led to its grave. They are usually selected and fixed based on intuition, prior literature, and hypotheses, with a tincture of luck sprinkled on top. Sweeps by Weights & Biases offer an ideal way to analyse how a proposed method behaves under different hyper-parameters. This is extremely important for understanding the robustness of the method and for finding the best set of hyper-parameters to accompany it.

Since the paper I picked involves a channel attention mechanism used in convolutional neural networks, several hyper-parameters can affect the results to varying degrees: batch size, learning rate, optimiser, filter size, et cetera. I used Sweeps to investigate the effect of varying the batch size and optimiser on the loss when combined with different attention mechanisms. To use Sweeps in your own experiments, three steps are involved:

  1. Create a config .yaml file containing the hyper-parameter settings (a sketch of how the training script consumes these values follows this list). Example -
program: train_cifar.py
method: bayes
metric:
  name: loss
  goal: minimize
parameters:
  att:
    values: ["ECA", "CBAM", "SE", "Triplet"]
  optimizer:
    values: ["adam", "sgd"]
  batch_size:
    values: [64, 128, 256]
  2. Create the sweep and obtain the sweep ID using the following command, where config.yaml is the previously defined config file -
wandb sweep config.yaml
  3. Run the sweep agent using the previously obtained sweep ID -
wandb agent {insert sweep ID}
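
For reference, here is a minimal sketch of the training-script side, under the assumption that train_cifar.py reads the sampled values from wandb.config; build_model, build_optimizer, get_loader, and train_one_epoch are hypothetical stand-ins for your own training code:

import wandb

def train():
    # The sweep agent launches this script; wandb.init() receives the sampled hyper-parameters
    run = wandb.init(project='My Reproducibility Project')
    cfg = wandb.config  # holds att, optimizer, and batch_size from config.yaml

    # Hypothetical helpers standing in for your model, optimiser, and data pipeline
    model = build_model(attention=cfg.att)
    optimizer = build_optimizer(model, name=cfg.optimizer)
    loader = get_loader(batch_size=cfg.batch_size)

    for epoch in range(100):
        loss = train_one_epoch(model, loader, optimizer)
        wandb.log({'loss': loss})  # must match the metric name declared in config.yaml

    run.finish()

if __name__ == '__main__':
    train()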

For more details, please refer to the Sweeps documentation or take a look at the Jupyter notebook I used in my reproduction of ECA-Net to obtain the results below.

Sweeps is the kind of tool you don't realise you need at the start, but once you put it in your research pipeline, it can become the most impactful weapon in your arsenal. As shown in the plot above, I was able to identify the combination of hyper-parameters that obtained the lowest loss for my ResNet model: ECA combined with the SGD optimiser at a batch size of 128 gave the lowest loss value of 0.57. I also observed that most combinations using ECA obtained comparatively lower loss than those using the other attention variants. This was confirmed by the parameter importance chart, where ECA is the parameter with the highest importance and the most negative correlation with loss.




[W&B sweep panel: z61h01i4 (106 runs)]


Media? Media!

Deep learning papers are often identified by a unique graphic, whether a schematic diagram of the proposed method or a visualisation of notable results, with the latter being more prominent for methods in generative modelling, object detection, and instance segmentation. Since the ECA-Net paper has a section dedicated to results obtained with different object detection and instance segmentation models such as Mask R-CNN, Faster R-CNN, and RetinaNet, it was imperative to log some of these results on the dashboard to visually validate the efficiency of the models using ECA.




[W&B media panel: segmentation and bounding box results (run set of 162)]


As shown above, I logged the segmentation maps and bounding box results of a Mask R-CNN with an ImageNet pre-trained ECANet-50 backbone on the MS-COCO dataset. There is an easter egg in the top left of the media panel: clicking the gear icon reveals a slider that lets you visualise the results of the same model at each of the 12 epochs it was trained for, giving you a complete overview of the model's training evolution and stability. We used the MMDetection framework for training on MS-COCO, and you can use the following snippet in your inference pipeline to log results in the same way:

import wandb
from mmdet.apis import init_detector, inference_detector, show_result_pyplot

wandb.init(project='My Reproducibility Project')
for checkpoint in checkpts:  # checkpts lists the per-epoch checkpoint files
    log_img = []
    model = init_detector(config_file, '../work_dirs/mask_rcnn_r50_fpn_1x_coco/' + checkpoint, device='cuda:0')
    for i in img_list:  # img_list contains the paths to the images we ran inference on
        img = '../data/coco/test2017/' + i
        result = inference_detector(model, img)
        x = show_result_pyplot(model, img, result)  # rendered detections for this image
        log_img.append(wandb.Image(x))
    wandb.log({"Segmentation and BBox Results": log_img})

wandb.finish()

For the full inference pipeline we used, you can go through our inference notebook. There are many more features for bounding box and segmentation map logging in Weights & Biases, which you can explore in the documentation.
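
As one example of those features, here is a minimal sketch of W&B's structured bounding box logging via the boxes argument of wandb.Image; the image path, coordinates, classes, and score are hypothetical placeholders:

import wandb

run = wandb.init(project='My Reproducibility Project')

class_labels = {0: "person", 1: "car"}  # hypothetical class map

img = wandb.Image(
    "example.jpg",  # hypothetical image path
    boxes={
        "predictions": {
            "box_data": [
                {
                    # Box coordinates relative to the image size (0-1); hypothetical values
                    "position": {"minX": 0.1, "maxX": 0.4, "minY": 0.2, "maxY": 0.8},
                    "class_id": 0,
                    "box_caption": "person (0.92)",
                    "scores": {"confidence": 0.92},
                }
            ],
            "class_labels": class_labels,
        }
    },
)

wandb.log({"Predictions with boxes": img})
run.finish()

Logging the boxes as structured data (rather than a pre-rendered image) lets you toggle classes and filter by confidence directly in the dashboard.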

To Infinity and Beyond

Weights & Biases is an assembly of tools that make research more reproducible and more fun. The ability to log practically anything and everything at the cost of a few lines of code is one of the most powerful weapons a researcher can wield. There are many more amazing tools that we didn't sail through in this reproducibility journey but that can be extremely useful for researchers. Some of them include:

  1. Artifacts: Probably my favourite of the lot, Artifacts lets you store and version your model checkpoints and datasets. As the cherry on top, Artifacts visualises it all as a lineage graph in the dashboard (a minimal logging sketch follows this list). Visit the docs to get started.

  2. DSViz: Currently in development, DSViz is an amazing tool that breathes fresh air into data visualisation and exploratory data analysis (EDA). DSViz gives you complete coverage of your dataset, lets you inspect the model's evaluation on individual samples, and helps you debug them more efficiently.

  3. Custom Charts: As the name suggests, Custom Charts is a powerful tool that lets you log even the most complicated charts, plots, and graphs. Tailored for power users, it lets you log anything from ROC curves to attention maps; follow this expertly crafted report to get started.
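
As a taste of the first item, here is a minimal Artifacts sketch; the artifact name and checkpoint path are hypothetical:

import wandb

run = wandb.init(project='My Reproducibility Project', job_type='upload-model')

# Version a trained checkpoint as a model artifact (name and path are hypothetical)
artifact = wandb.Artifact('ecanet50-checkpoint', type='model')
artifact.add_file('checkpoints/epoch_12.pth')
run.log_artifact(artifact)

run.finish()

A later run can then pull the same checkpoint with run.use_artifact('ecanet50-checkpoint:latest'), which is also what builds the lineage graph mentioned above.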

The beauty of deep learning lies in exploration: the more you dive into it, the more vibrant it gets. Weights & Biases keeps pace, offering virtually limitless ways to make your deep learning or machine learning research project stand out while remaining completely transparent and reproducible.

Free Lunch

Along with all the tools described above, Weights & Biases has still more to offer. With every run, it automatically logs system and hardware metrics in real time, as shown in the plots below, and lets you analyse model complexity in terms of GPU usage or memory allocated. This keeps comparisons fair and makes the claims of a research project more concrete and better validated.

As a final flourish, you can write reports that include all the graphs and panels from the dashboard in simple markdown. On top of that, you can export a report to LaTeX by clicking the "Download Report" button at the top: Weights & Biases does the heavy lifting and provides a .zip file containing the LaTeX sources, with your markdown converted into the required template for the ML Reproducibility Challenge, which you can then edit further in a LaTeX editor or submit directly.




[W&B panel: system and hardware metrics (run set of 26)]


Verdict

As a researcher, I can admit that there is plenty of room for improvement in academia, and that the field of deep learning and machine learning has yet to fully mature. The ML Reproducibility Challenge is a crucial initiative for keeping fast-paced research published at top conferences in check, especially since much of it may not be immediately accessible to everyone. As we learn, day by day, to understand the potential of deep learning, tools like Weights & Biases serve as an important reminder that research without the right tools is a drink without a glass to contain it: it spills all over. Weights & Biases is an important catalyst and, in my opinion, a no-brainer addition to every researcher's stack.

Thank You!