How to Prepare the Dataset for Melanoma Classification
As part of this report I showcase how to prepare the dataset for the ISIC Melanoma competition and upload it as a W&B artifact.
Created on March 1|Last edited on April 7
The report How to Build a Robust Medical Model Using Weights & Biases showcases how I integrated the various features of the Weights & Biases (W&B) product in the Melanoma competition to reach the top 5% of the Kaggle leaderboard.
As part of this report, I showcase how to prepare the dataset including all preprocessing code, and upload it as a Weights and Biases artifact.
Downloading the dataset
The first step is to download the dataset. You can do this easily with the Kaggle API if you have a Kaggle API key. If you're not sure how to use the Kaggle API or haven't used it before, refer to the docs here.
We can easily download the dataset by then running this single line of code:
kaggle competitions download -c siim-isic-melanoma-classification
Please note that you should have at least 200 GB of free storage on your workstation to be able to work with this dataset.
Preprocessing the dataset
Generally, medical images come in the DICOM format (file extension .dcm). This is the standard format for sharing medical imaging data such as X-rays and CT scans. The difference between a plain .jpg and a DICOM file is that DICOM stores patient metadata such as age, patient ID, and sex alongside the image.
You can follow the script in this Kaggle kernel - DICOM to JPG/PNG on steroids - to convert DICOM files to the standard .jpg format.
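The kernel linked above handles this conversion at scale; below is a minimal sketch of the same idea, assuming pydicom is available for reading the files (the helper names `dicom_to_jpg` and `scale_to_uint8` are my own, not from the kernel):

```python
import numpy as np
from PIL import Image


def scale_to_uint8(arr):
    """Min-max scale raw pixel data to the 0-255 range JPEG expects."""
    arr = arr.astype('float32')
    arr = (arr - arr.min()) / max(float(arr.max() - arr.min()), 1e-8) * 255.0
    return arr.astype('uint8')


def dicom_to_jpg(dcm_path, jpg_path):
    """Read a DICOM file's pixel data and save it as a JPEG."""
    import pydicom  # imported lazily so scale_to_uint8 works without pydicom
    arr = pydicom.dcmread(dcm_path).pixel_array
    Image.fromarray(scale_to_uint8(arr)).convert('RGB').save(jpg_path)
```

The min-max scaling is needed because DICOM pixel data is often stored as 12- or 16-bit integers, which JPEG cannot represent directly.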
As part of this competition, we are already provided with .jpg images in the /jpeg folder. We will be using that. The data structure for this competition looks like this:
├── jpeg
│   ├── test
│   └── train
├── test
Please note that I already removed the /tfrecords folder, as we won't be needing it for data preparation.
We will apply the function resize_and_mantain below to each of the images. This function resizes an image so that its shorter side equals the target size (256px in our case) while maintaining the aspect ratio. Keeping the aspect ratio is critical because we want to make sure the melanoma lesion images are not squashed or stretched.
```python
import os

import numpy as np
from PIL import Image


def resize_and_mantain(path, output_path, sz: tuple, args):
    fn = os.path.basename(path)
    img = Image.open(path)
    size = sz[0]
    old_size = img.size
    # Scale so that the shorter side equals `size`, keeping the aspect ratio.
    ratio = float(size) / min(old_size)
    new_size = tuple([int(x * ratio) for x in old_size])
    img = img.resize(new_size, resample=Image.BILINEAR)
    if args.cc:
        img = color_constancy(np.array(img))
        img = Image.fromarray(img)
    img.save(os.path.join(output_path, fn))
```
One thing you'll notice in this function is that we optionally apply color_constancy. For an introduction to what color constancy is, please refer to the Wikipedia page here - Color Constancy. I got the idea of applying color constancy to the melanoma images from the research paper - The effect of color constancy algorithms on semantic segmentation of skin lesions. Preprocessing the images with color constancy really gave me an edge in the competition: every experiment I ran on color-constancy-preprocessed images scored better than its counterpart without it.
The code for color constancy looks like below:
```python
import numpy
import cv2


def color_constancy(img, power=6, gamma=None):
    """
    Parameters
    ----------
    img: 3D numpy array
        The original image with format of (h, w, c)
    power: int
        The degree of norm, 6 is used in reference paper
    gamma: float
        The value of gamma correction, 2.2 is used in reference paper
    """
    img_dtype = img.dtype

    if gamma is not None:
        # Apply gamma correction via a per-intensity lookup table.
        img = img.astype('uint8')
        look_up_table = numpy.zeros((256, 1), dtype='uint8')
        for i in range(256):
            look_up_table[i][0] = 255 * pow(i / 255, 1 / gamma)
        img = cv2.LUT(img, look_up_table)

    img = img.astype('float32')
    # Estimate the illuminant via the Minkowski p-norm of each channel.
    img_power = numpy.power(img, power)
    rgb_vec = numpy.power(numpy.mean(img_power, (0, 1)), 1 / power)
    rgb_norm = numpy.sqrt(numpy.sum(numpy.power(rgb_vec, 2.0)))
    rgb_vec = rgb_vec / rgb_norm
    # Rescale channels so the estimated illuminant becomes neutral gray.
    rgb_vec = 1 / (rgb_vec * numpy.sqrt(3))
    img = numpy.multiply(img, rgb_vec)
    return img.astype(img_dtype)
```
Just by applying this one simple function to all the images, we can make sure that all training and validation images look similar, giving the model a better chance of generalizing.
Medical data is generally collected from various sources/hospitals. Since the imaging machines used at different hospitals have different parameter settings, images from different sources can end up looking very different from each other, and the model has a hard time generalizing.
Applying color constancy makes sure that all images, even those from different sources, look similar to each other, making it easier for the model to generalize.
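To make the effect concrete, here is a numpy-only sketch of the same Shades of Gray idea that color_constancy implements (power=6, no gamma correction); the function name and the final rounding are my own additions:

```python
import numpy as np


def shades_of_gray(img, power=6):
    """Minimal Shades of Gray color constancy (no gamma correction)."""
    img = img.astype('float32')
    # The Minkowski p-norm of each channel estimates the illuminant color.
    rgb_vec = np.power(np.mean(np.power(img, power), axis=(0, 1)), 1.0 / power)
    rgb_vec = rgb_vec / np.sqrt(np.sum(rgb_vec ** 2))
    # Rescale channels so the estimated illuminant becomes neutral gray.
    img = img / (rgb_vec * np.sqrt(3))
    return np.clip(np.rint(img), 0, 255).astype('uint8')
```

On an already-neutral image this is a no-op; on an image with a color cast (say, a reddish skin photo), the dominant channel gets scaled down so that images from different cameras end up with comparable color statistics.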
We can now apply the function to each of the images and store them in a new directory by using the simple script below:
```python
import os
import glob
import logging
import argparse

from tqdm import tqdm
from PIL import Image
from joblib import Parallel, delayed

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_folder", default=None, type=str, required=True,
                        help="Input folder where images exist.")
    parser.add_argument("--output_folder", default=None, type=str, required=True,
                        help="Output folder for images.")
    parser.add_argument("--mantain_aspect_ratio", action='store_true', default=False,
                        help="Whether to mantain aspect ratio of images.")
    parser.add_argument("--sz", default=256, type=int,
                        help="Target size of the shorter image side in pixels.")
    parser.add_argument("--cc", default=False, action='store_true',
                        help="Whether to apply color constancy to the images.")
    args = parser.parse_args()

    if args.sz:
        logging.info("Images will be resized to {}".format(args.sz))
        args.sz = int(args.sz)

    images = glob.glob(os.path.join(args.input_folder, '*.jpg'))
    logging.info(
        "Resizing images to mantain aspect ratio in a way that the shorter "
        "side is {}px but images are rectangular.".format(args.sz))

    if not os.path.exists(args.output_folder):
        os.makedirs(args.output_folder)

    logging.info(args)
    # Process images in parallel across 32 worker processes.
    Parallel(n_jobs=32)(
        delayed(resize_and_mantain)(i, args.output_folder, (args.sz, args.sz), args)
        for i in tqdm(images))
```
All information on how to run the preprocessing script is also provided in the Melanoma W&B GitHub repository.
Once we have preprocessed the images, the new folder structure should look like below:
├── jpeg
│   ├── test
│   └── train
├── test
└── usr
    └── resized_train_256_cc
We will be using the resized images under the resized_train_256_cc directory for all our experiments for training the models.
We are now ready to upload the images as a W&B table/artifact.
Why do we need W&B tables?
- Use W&B Tables to log and visualize data and model predictions.
- Interactively explore your data: Compare changes precisely across models, epochs, or individual examples
- Understand higher-level patterns in your data
- Capture and communicate your insights with visual samples
In this section, we will create the training and validation data tables.
But before I show you how to create such a table, can you think of the possible advantages of representing your data as a W&B table?
Well, here are a few that I can think of that really helped me get to the top 5% in the competition:
- Easy browsing through the dataset: By storing the dataset as a W&B table, I can now easily browse through the dataset, sort it by Diagnosis type or even filter images based on patients' sex metadata.
- Easy to share with colleagues: As part of the competition, I had two other teammates. By adding the dataset as a W&B table, I could also easily share my findings with my teammates.
- Everything in one place: It is much easier for us humans to interpret data when everything is in one place. Since as part of the W&B table, I can have images and metadata all in one place - I can now more easily recognize patterns in the dataset that will help further with creating more robust models.
How to store data as W&B table
In this section, I am going to show you how to create a W&B table and store it as part of the project, so that you too can share your medical data with colleagues and enjoy the same advantages (with your own data) that I did in the Melanoma competition.
We can use the simple function below to log data from a directory as a W&B table.
```python
def log_wandb_table(data_dir, df, table_name, n_sample=100):
    wandb_table = wandb.Table(
        columns=['Image Name', 'Image', 'Target', 'Diagnosis', 'Sex'])
    # Sample a subset of rows so the logged table stays a manageable size.
    sample_df = df.sample(n=n_sample).reset_index(drop=True).copy()
    for idx, row in tqdm(sample_df.iterrows(), total=len(sample_df)):
        img_path = os.path.join(data_dir, row.image_name + '.jpg')
        wandb_table.add_data(
            row.image_name,
            wandb.Image(Image.open(img_path)),
            row.benign_malignant,
            row.diagnosis,
            row.sex)
    wandb.run.log({table_name: wandb_table})
```
The function above takes the data directory, the W&B table name, and the training dataframe as inputs. Here's how to log data as a W&B table, step by step:
- Create an empty table with column names: We do this in the first line of code in the function above.
- Sample your dataset if you want to store a part of your dataset.
- Add data to the table: We simply iterate through the dataset and add data to the table for each corresponding column.
A key thing to note here is that we add Images to the table by wrapping the image inside a wandb.Image data type. We can also add as much metadata as we like if we have more columns.
It's really that simple! Just by running the one function above, we can now log the data as a Weights and Biases table. Try it for your own datasets and see how you go!
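For reference, here is the minimal dataframe shape that log_wandb_table expects. The column names come from the competition's train.csv; the sample values below are made up for illustration:

```python
import pandas as pd

# A tiny stand-in for the competition's train.csv. log_wandb_table reads
# the image_name, benign_malignant, diagnosis and sex columns row by row.
train_df = pd.DataFrame({
    'image_name': ['ISIC_0000001', 'ISIC_0000002'],
    'benign_malignant': ['benign', 'malignant'],
    'diagnosis': ['nevus', 'melanoma'],
    'sex': ['female', 'male'],
})

# log_wandb_table samples n_sample rows before logging:
sample_df = train_df.sample(n=1).reset_index(drop=True)

# Inside an active wandb run, the call would then look like:
# log_wandb_table('usr/resized_train_256_cc', train_df, 'train_table', n_sample=2)
```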
It might surprise you to know that you can also log other media types such as audio and video, and later retrieve the dataset in two simple lines of code! All the information is already part of the docs I've linked above. Please do try!
And that's really it - we have now successfully logged our Melanoma images as a Weights and Biases table.
Conclusion
In this report, we started by downloading the dataset. Once we had downloaded it from Kaggle, I showcased how a simple preprocessing technique such as color constancy can really help improve model performance.
Next, I showcased how to log data to a W&B table, which gives you advantages like ease of access, ease of sharing, and ease of finding patterns in the dataset.
If you have any questions - please feel free to reach out to me at aman@wandb.com. Thanks for reading!