
How to Build a Robust Medical Model Using Weights & Biases

A tutorial on how to build a cancer detection model while keeping track of your experiments, storing model weights, optimizing hyperparameters, and more in W&B


Introduction

I am going to show you how Weights & Biases (W&B) helped me create a solution that ranked 150th out of 3,308 teams in the SIIM-ISIC Melanoma Classification Kaggle competition.
Figure 1: Rank 150 out of 3,308 total teams
Skin cancer is the most prevalent type of cancer. In the US, melanoma is responsible for 75% of skin cancer deaths, despite being the least common skin cancer. The American Cancer Society estimates that nearly 100,000 new melanoma cases will be diagnosed this year and that more than 7,000 people will die from the disease. As with other cancers, early and accurate detection—potentially aided by data science—can make treatment more effective.
In actual practice, dermatologists evaluate every one of a patient's moles to identify outlier lesions or “ugly ducklings” that are most likely to be melanoma. The problem is that existing AI approaches have not adequately considered this clinical frame of reference.
As part of the Kaggle competition I mentioned above, we were tasked with building a robust medical model that could differentiate malignant (positive for melanoma) images from benign (negative for melanoma) ones.
You can find my complete solution code in this GitHub repository. As part of this report, I am going to show how W&B helped me track all my experiments and in the end create a top 5% solution!

Dataset

The first thing I always do when I start out with a project is to have a cursory look at the dataset. I usually log all the values and images to a Weights and Biases table (like the one below) which also makes it really easy to sort, compare, and play around with the dataset.
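Here is a minimal sketch of how such a table can be logged. It is not the exact code from my repository: the CSV path, image folder, and column names (image_name, benign_malignant, target) are assumptions based on the SIIM-ISIC metadata format.
import pandas as pd
import wandb
from PIL import Image

df = pd.read_csv("train.csv")  # competition metadata CSV; path is an assumption

run = wandb.init(project="melanoma", job_type="eda")
table = wandb.Table(columns=["image", "benign_malignant", "target"])

for _, row in df.head(100).iterrows():                       # log a small sample for browsing
    img = Image.open(f"jpeg/train/{row['image_name']}.jpg")  # resolve the image file for this row
    table.add_data(wandb.Image(img), row["benign_malignant"], row["target"])

run.log({"dataset_preview": table})
run.finish()
Once logged, the table can be sorted and filtered directly in the W&B UI, which is exactly what makes the shared link useful to non-technical collaborators.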



Just by following this one simple step, I ensure that not only can I casually browse through the data, but anyone on my team can too! All I need to do is share the link to this W&B Table with my colleagues or medical specialists (who are generally non-technical folks), and they'll be able to play around with the dataset as well!
Just by looking at the table above, one can notice how the "malignant" lesions stand out as noticeably different from the "benign" ones. This is referred to as the Ugly Duckling sign.
The Ugly Duckling is a warning sign of melanoma. This is a recognition strategy based on the concept that most normal moles on your body resemble one another, while melanomas stand out like ugly ducklings in comparison.

Model Training

After having a look at the dataset, the next step is to start training models.
Since the dataset is highly skewed – around 98% of the total cases are "benign" whereas only about 2% are "malignant" – there are many things we'd like to try to get a high score when it comes to model training:
  1. There are many losses to choose from - is the typical Binary Cross-Entropy Loss going to work or might we need to try focal loss?
  2. Would a weighted loss work better for our case? (A minimal sketch of a weighted focal loss follows this list.)
  3. Are models pre-trained with ImageNet going to be helpful or should we train the models from scratch?
  4. How should we pre-process or resize the images?
  5. Would we need to try some kind of specific preprocessing such as "color constancy" that might give a boost to our scores?
  6. How should we use the metadata such as "sex" and "age" in our models? Would it even be beneficial?
  7. What data augmentations should we add during training to make our models robust?
  8. With hundreds of model architectures to choose from, which ones would work best for Melanoma classification?
  9. Is there another external dataset that we could perhaps pre-train our models on before fine-tuning on melanoma classification?
  10. What learning rate should we use for our model training?
  11. What kind of learning rate schedule would work best?
  12. What training batch size should we use in combination with the learning rate?
  13. Should we use gradient accumulation?
  14. Which parameters correlate best with the final validation "AUC" metric?
  15. What image size should we use for model training?
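To make questions 1 and 2 above concrete, here is a minimal sketch of a weighted (alpha-balanced) focal loss in PyTorch, assuming a single-logit binary output; it is a sketch of the general technique, not the exact loss from my final solution.
import torch
import torch.nn.functional as F

def weighted_focal_loss(logits, targets, alpha=0.75, gamma=2.0):
    """Alpha-balanced binary focal loss. `targets` are floats in {0., 1.};
    alpha up-weights the rare "malignant" class."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    probs = torch.sigmoid(logits)
    p_t = probs * targets + (1 - probs) * (1 - targets)      # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # per-example class weight
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

# Example: 8 random logits against mostly-benign labels
loss = weighted_focal_loss(torch.randn(8), torch.tensor([0., 0., 0., 0., 0., 0., 0., 1.]))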
There is no straight answer to any of those questions except really to "try and find out." That in turn calls for a large number of experiments that we'll have to run if we want our final solution to generalize well! So, how do we keep track of such a long list of experiments?
In my early days of the competition, I tried to keep all the logs in an Excel sheet, as shown below:
Figure 2: Experiment tracking using Microsoft Excel
As you can imagine, this very quickly got out of hand! Also, since everything was manually typed, I made a lot of errors and things got really messy.
So, I guess the question you're asking is "Is there a better way to track your experiments?" And the answer is YES!
Figure 3: Experiment tracking using Weights & Biases
We could use W&B to automatically log metrics and get a beautiful-looking dashboard to boot! The best part? It's all automated! Once I started using W&B for experiment tracking, my code became far more structured, and I could very quickly see which experiments gave the best validation score!
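The instrumentation this requires is minimal. Here is a hedged sketch of a tracked training loop; the config keys mirror the hyperparameters discussed in this report, and the metric values are placeholders rather than real training code or results.
import wandb

run = wandb.init(
    project="melanoma",
    config={
        "arch_name": "efficientnet_b0",
        "loss": "weighted_focal_loss",
        "learning_rate": 1e-4,
        "train_batch_size": 64,
        "epochs": 25,
        "sz": 224,
    },
)

for epoch in range(run.config.epochs):
    train_loss, valid_auc = 0.1, 0.9  # replace with your real training/validation results
    run.log({"epoch": epoch, "train_loss": train_loss, "valid_auc": valid_auc})

run.finish()
Every run logged this way gets its own line on the dashboard, which is what makes the comparison across 66 experiments below possible.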


Since at the end of the day I only care about the experiments that do well on the validation AUC metric (and because each experiment has its own unique name), I can tell at a glance that the experiment ethereal-blaze-106 performs the best of all 66 experiments I ran as part of this demo.
Thanks to W&B, I can now find everything in one place.
Figure 4: Comparing experiments
Just by looking at all the experiments at once, I can tell that the differences between the best-performing ethereal-blaze-106 and the other experiments are:
  1. Use of weighted focal loss
  2. The model was trained for a total of 25 epochs
  3. "EfficientNet-B0" model architecture used for training
  4. Model pre-trained on "ImageNet"
  5. All images were resized to 224x224 using random crop
  6. The use of color constancy gives a massive boost to the AUC score (a sketch of this preprocessing step appears below)
  7. A train batch size of 64 is ideal
  8. 1e-4 is the ideal learning rate
In other words: we're slowly able to find answers to all the different questions we had in the beginning! What's more, we can compare all our experiments from a bird's-eye view in a single table to know which experiments perform best and why!
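As a concrete example of point 6, here is a common "Shades of Gray" color constancy implementation; it is a sketch of the general technique rather than the exact variant used in my pipeline.
import numpy as np

def shades_of_gray(img, power=6):
    """Shades-of-Gray color constancy on an RGB uint8 image of shape (H, W, 3)."""
    img = img.astype(np.float32)
    # Estimate the illuminant per channel with a Minkowski p-norm of the pixel values
    illum = np.power(np.mean(np.power(img, power), axis=(0, 1)), 1.0 / power)
    # Scale so that a neutral (gray) illuminant leaves the image unchanged
    illum = illum / (np.linalg.norm(illum) + 1e-8) * np.sqrt(3.0)
    return np.clip(img / illum, 0, 255).astype(np.uint8)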

How Do We Store Our Model Weights?

From the hundreds (if not thousands) of model checkpoints we save across epochs and experiments, we want to keep the ones that performed best on the validation metric so we can eventually create a solution that is an ensemble of the best-performing models.
So how do we do this? One way would be to save all the models locally, but then the question becomes: how do you know which models are really performing the best?
Another, far superior way would be to use W&B artifacts. As part of the training process, we could just log the best-performing models to W&B. From our docs:
W&B Artifacts was designed to make it effortless to version your datasets and models, regardless of whether you want to store your files with W&B or whether you already have a bucket you want W&B to track. Once you've tracked your datasets or model files, W&B will automatically log each and every modification, giving you a complete and auditable history of changes to your files. This lets you focus on the fun and important parts of evolving your datasets and training your models, while W&B handles the otherwise tedious process of tracking all the details.
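Logging a checkpoint as an artifact takes only a few lines. A minimal sketch follows, with an illustrative artifact name and checkpoint path rather than the exact ones from my repository.
import wandb

run = wandb.init(project="melanoma")

model_artifact = wandb.Artifact("efficientnet_b0-best", type="model")
model_artifact.add_file("checkpoints/best_model.pth")  # checkpoint saved during training
run.log_artifact(model_artifact)
run.finish()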
Since we know ethereal-blaze-106 was the best-performing experiment, and I had already logged all my models during training, I could simply go and find the list of saved models in the W&B ecosystem!
Figure 5: Best model weights as artifacts
You can also find the model weights from all the experiments stored in a single W&B workspace with proper versioning! Isn't this great?
Therefore, I can now just download the best-performing models from the best-performing experiments, create an ensemble, and submit it to the Kaggle competition as my final solution. In fact, this is precisely what I did!
Using a model artifact once it has been stored in the W&B ecosystem is really simple and takes just two lines of code:
# Download a previously logged model artifact (entity/project/artifact names here are placeholders)
artifact = run.use_artifact('user_name/project_name/new_artifact:v1', type='model')
artifact_dir = artifact.download()
And that's it! Now we have downloaded the model locally and can use it for inference. The power of storing model weights as artifacts in W&B is that you can also share them with your colleagues so they can experiment with them too!

Which Hyperparameters are Best?

There are multiple hyperparameters to choose from such as:
  1. train_batch_size: The training batch size
  2. arch_name: The model architecture to be used for training
  3. epochs: Number of epochs to train the model for
  4. learning_rate: The model's learning rate
  5. loss: Which loss function to use for model training
  6. sz: The image size to be used for training
Not only do the individual values of these hyperparameters matter, but is there a certain combination of them that might help our model perform best on the validation metric?
Enter W&B Sweeps!


With W&B Sweeps, you can specify a range of values for each hyperparameter, and W&B will run multiple experiments with various combinations of them, reporting the validation metric for each.
This is really helpful, as it lets us find the hyperparameter combinations that give us the best models.
It is far better than running a sweep manually: before W&B, I used to run multiple experiments by hand with various learning rates or other hyperparameter values, but now I can just launch a W&B sweep, come back after a few hours, and find the best hyperparameter values!
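Here is a minimal sketch of what such a sweep could look like, using the hyperparameter names from the list above. The train() function is a placeholder that would normally build and train the model from wandb.config; the logged metric value here is not a real result.
import wandb

sweep_config = {
    "method": "bayes",                                      # Bayesian search over the space below
    "metric": {"name": "valid_auc", "goal": "maximize"},
    "parameters": {
        "arch_name": {"values": ["efficientnet_b0", "resnet50"]},
        "loss": {"values": ["bce", "weighted_focal_loss"]},
        "learning_rate": {"values": [1e-3, 1e-4, 3e-5]},
        "train_batch_size": {"values": [32, 64]},
        "sz": {"values": [224, 384]},
    },
}

def train():
    with wandb.init() as run:
        cfg = run.config
        # ... build the model from cfg.arch_name, train with cfg.learning_rate, etc. ...
        run.log({"valid_auc": 0.5})  # placeholder; log the real validation AUC here

sweep_id = wandb.sweep(sweep_config, project="melanoma")
wandb.agent(sweep_id, function=train, count=20)             # run 20 sweep trials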

What Did Our Models Really Learn?

Next, I usually like to use the W&B embedding projector to see what my final model really learned.


By logging embeddings to a W&B Table, I can easily use any of the supported dimensionality reduction techniques (PCA, t-SNE, and UMAP) to visualize examples on a 2D plane, as shown above.
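Under the hood, this only requires logging a table that contains an embedding column. A minimal sketch follows, with random vectors standing in for the model's penultimate-layer features and random labels standing in for the benign/malignant targets.
import numpy as np
import wandb

run = wandb.init(project="melanoma", job_type="embeddings")

embeddings = np.random.randn(100, 512)       # stand-in for per-image feature vectors
labels = np.random.randint(0, 2, size=100)   # stand-in for benign (0) / malignant (1) targets

table = wandb.Table(columns=["target", "embedding"])
for emb, label in zip(embeddings, labels):
    table.add_data(int(label), emb.tolist())

run.log({"embeddings": table})
run.finish()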
I recently also wrote a report on "Interpret any PyTorch Model Using W&B Embedding Projector" if you'd like to learn more about the W&B embedding projector!

Conclusion

In this report, I hope I have been able to showcase how W&B helped at every step of my journey to a top-5% solution in the SIIM-ISIC Melanoma Classification Kaggle competition.
To replicate every table, chart, and figure that has been shown in this report, please follow the steps mentioned in this GitHub Repository - https://github.com/amaarora/melanoma_wandb.