Skin Lesion Classification on HAM10000 with HuggingFace using PyTorch and W&B
Explore the use of HuggingFace, PyTorch, and W&B for classifying skin lesions with the HAM10000 dataset. We will build, train, and evaluate models for medical diagnostics!
Welcome to a tutorial on image classification using HuggingFace, PyTorch, and Weights & Biases! We're diving straight into a real-world challenge: classifying skin lesions from the HAM10000 dataset. This dataset is a large collection of dermatoscopic images used widely in machine learning for skin lesion analysis. By the end of this guide, you'll learn how to build, train, and evaluate models that can differentiate between various skin conditions, a task vital in medical diagnostics.

What We'll Cover
What Is Image Classification?
HuggingFace
Examples Of Image Classification with HuggingFace
The Promise of Early Skin Cancer Detection
The HAM10000 Dataset
The Data
The Code
The Swin Transformer
The Training Loop
Recommended Reading
What Is Image Classification?
Image classification is a cornerstone of computer vision, aiming to categorize images into predefined classes. It has broad applications, ranging from social media photo tagging to medical imaging analysis. With the advancement of deep learning, image classification techniques have enjoyed a substantial evolution, especially using convolutional neural networks (CNNs) and more recently, transformer-based models.
HuggingFace offers an extensive range of pre-trained transformer models that are exceptionally adept at various NLP tasks. However, their application isn't limited to just text. In this tutorial, we leverage HuggingFace's Swin Transformer (SwinV2) model for image classification. The Swin Transformer adapts the transformer architecture to vision with a hierarchical design and efficient windowed attention, making it well suited for image-related tasks. We'll explore the inner workings of this model and demonstrate how to utilize it for classifying skin lesions from the HAM10000 dataset, a common benchmark in medical image analysis.
HuggingFace
Hugging Face streamlines the sharing and loading of AI models by offering a centralized hub where researchers can upload and access a vast array of pre-trained models for diverse natural language processing tasks. This platform features user-friendly APIs, including Transformers, enabling easy integration of these models into various projects. It also supports version control and collaboration, which enhances the development and management of models. By providing these capabilities, HuggingFace democratizes access to state-of-the-art AI models, fostering collaboration and innovation within the AI research community.
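To make that concrete, here is a minimal sketch of how little code it takes to run a pretrained image classifier from the Hub via the pipeline API. The default checkpoint is a generic ImageNet classifier, and "lesion.jpg" is a hypothetical local file:

from transformers import pipeline

# Load a pretrained image classifier from the HuggingFace Hub.
# Any image-classification checkpoint on the Hub can be swapped in here.
classifier = pipeline("image-classification")

# "lesion.jpg" is a hypothetical local file path.
predictions = classifier("lesion.jpg")
print(predictions)  # A list of {"label": ..., "score": ...} dictionaries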
Examples Of Image Classification with HuggingFace
Image classification has wide-ranging applications in the medical field. It can be used for detecting abnormalities in radiology images such as X-rays and MRIs, analyzing pathological slides for signs of cancer, classifying skin lesions for dermatological assessments, and interpreting retinal scans to identify eye diseases. And this just scratches the surface of the potential.
The Promise of Early Skin Cancer Detection
Skin cancer, one of the most common cancers globally, has high treatment success rates when detected early. This is where image classification technology excels. It can analyze vast numbers of skin images rapidly and cheaply, identifying potential malignancies that might be missed by the human eye. Such automated systems can serve as invaluable tools for dermatologists, enhancing their diagnostic capabilities and ensuring timely intervention for patients. This technological advancement not only promises to revolutionize skin cancer diagnosis but also underscores the broader potential of AI in medical diagnostics.
The HAM10000 Dataset
The HAM10000 dataset, with its extensive collection of over 10,000 dermatoscopic images, is a powerful resource for automated skin lesion analysis, especially in the context of skin cancer detection. This dataset stands out for its diverse range of skin lesion types, including Melanoma, Melanocytic nevi, Benign keratosis-like lesions, Basal cell carcinoma, Actinic keratoses, Vascular lesions, and Dermatofibroma.
Such variety is crucial not only for training models to distinguish between benign and malignant lesions but also for differentiating among various skin diseases. The breadth of the dataset, encompassing a wide array of common skin cancer types, enhances its utility in training robust models capable of generalizing across different cases.
Additionally, its real-world clinical data reflects the diverse imaging conditions encountered in clinical practice, further aiding in the development of a model that performs consistently across various scenarios. This makes the HAM10000 dataset an excellent choice for projects aimed at the early detection of skin cancers, ultimately contributing to improved treatment outcomes.
The Data
We will first need to obtain the HAM10000 dataset, which can be downloaded here. You will need to download the zip file, extract it, and then further extract the two zip files that contain the training images, along with a third zip file that contains the test images. The folder will also contain two CSV files holding the labels for the train and test sets.
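Before building anything, it's worth a quick look at the label distribution in the metadata file. A small sketch like the one below (assuming the extracted HAM10000_metadata.csv with its dx label column) shows that melanocytic nevi ('nv') heavily dominate the dataset, which is exactly why we'll use weighted sampling later:

import pandas as pd

# Path is an assumption; point it at your extracted metadata file.
df = pd.read_csv("HAM10000_metadata.csv")
print(df["dx"].value_counts())  # 'nv' dominates, motivating the balanced sampler below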
The Code
We will now walk through the code for training our classifier. First, we'll build a torch Dataset that we can iterate over to feed data to the model. I went ahead and moved all of the images for both the training set and the test set into a single folder.
import os
import random

import pandas as pd
from PIL import Image, UnidentifiedImageError
from torch.utils.data import Dataset
from torchvision import transforms

class HAM10000DatasetBalanced(Dataset):
    def __init__(self, csv_file, img_dir, augment=True):
        self.skin_df = pd.read_csv(csv_file)
        self.img_dir = img_dir
        self.augment = augment
        # Augmenting transforms, used for training
        self.transform = transforms.Compose([
            # transforms.Resize(224),
            transforms.RandomRotation(20),      # Random rotation between -20 and 20 degrees
            transforms.RandomHorizontalFlip(),  # Random horizontal flip
            transforms.RandomVerticalFlip(),    # Random vertical flip
            transforms.ToTensor(),              # Convert to a PyTorch tensor
            transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),  # Normalize
        ])
        # Non-augmenting transforms, used for evaluation
        self.base_transform = transforms.Compose([
            # transforms.Resize(224),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
        ])

    def __len__(self):
        return len(self.skin_df)

    def label_to_int(self, label):
        label_dict = {
            'nv': 0,     # Melanocytic nevi
            'mel': 1,    # Melanoma
            'bkl': 2,    # Benign keratosis-like lesions
            'bcc': 3,    # Basal cell carcinoma
            'akiec': 4,  # Actinic keratoses
            'vasc': 5,   # Vascular lesions
            'df': 6,     # Dermatofibroma
        }
        return label_dict.get(label, -1)

    def __getitem__(self, idx):
        attempts = 0
        max_attempts = 100
        while attempts < max_attempts:
            try:
                img_name = os.path.join(self.img_dir, self.skin_df.iloc[idx, 1] + '.jpg')
                image = Image.open(img_name)
                if self.augment:
                    image = self.transform(image)
                else:
                    image = self.base_transform(image)
                label = self.label_to_int(self.skin_df.iloc[idx, 2])
                return image, label
            except (FileNotFoundError, UnidentifiedImageError):
                print(f"Error opening image: {img_name}. Trying another image.")
                attempts += 1
                idx = random.randint(0, len(self.skin_df) - 1)
        raise Exception(f"Failed to load an image after {max_attempts} attempts.")
The HAM10000DatasetBalanced class initializes with a CSV file containing metadata and a directory where images are stored. The class offers optional data augmentation (used only for training and not testing), which includes random rotations, horizontal flips, and vertical flips, alongside standard transformations like tensor conversion and normalization.
Note: I didn't resize the images here; however, you can certainly try resizing them to a smaller resolution to reduce the computational requirements!
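If you do want to trade resolution for speed, a minimal tweak (a sketch, assuming 224x224 inputs) is to enable the commented-out resize step in both transform pipelines; just keep in mind that changing the input resolution also changes the size of the flattened features the classifier head receives later on:

self.transform = transforms.Compose([
    transforms.Resize((224, 224)),  # Downscale before augmenting to cut compute
    transforms.RandomRotation(20),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])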
The label_to_int method maps string labels to integer values corresponding to the different types of skin lesions. This gives each class an integer index, which is exactly the target format PyTorch's cross-entropy loss expects: it consumes class indices directly rather than explicit one-hot vectors.
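A tiny standalone example of that target format (a sketch with random logits, not tied to our model):

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
logits = torch.randn(4, 7)           # A batch of 4 samples, 7 lesion classes
labels = torch.tensor([0, 2, 6, 1])  # Integer class indices, not one-hot vectors
print(criterion(logits, labels).item())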
The __getitem__ method retrieves a specific item from the dataset. It attempts to load an image and its label, applying the appropriate transformations. If the image file is not found or cannot be opened, it retries with a different image up to a maximum number of attempts. If it fails to load any image after these attempts, it raises an exception. This approach ensures robust handling of file-related errors and enhances the dataset's usability for training machine learning models.
Now, we can initialize both our training and test sets using the following code:
from torch.utils.data import DataLoader, WeightedRandomSampler

tr_ds = HAM10000DatasetBalanced(csv_file="/home/brett/Desktop/ham10000/HAM10000_metadata.csv",
                                img_dir="/home/brett/Desktop/ham10000/imgs")
tst_ds = HAM10000DatasetBalanced(csv_file="/home/brett/Desktop/ham10000/ISIC2018_Task3_Test_GroundTruth.csv",
                                 img_dir="/home/brett/Desktop/ham10000/imgs", augment=False)

def make_weights_for_balanced_classes(dataset):
    # Count samples per class, keyed by the integer label so the
    # weight lookup below stays aligned with label_to_int.
    label_ints = dataset.skin_df['dx'].map(dataset.label_to_int)
    class_counts = label_ints.value_counts()
    num_samples = len(dataset)
    class_weights = {label: num_samples / count for label, count in class_counts.items()}
    weights = [class_weights[label] for label in label_ints]
    return weights

# Calculate weights for balanced sampling
weights = make_weights_for_balanced_classes(tr_ds)
sampler = WeightedRandomSampler(weights, len(weights))

# Create your DataLoaders
train_loader = DataLoader(tr_ds, batch_size=12, sampler=sampler)  # Use sampler here
tst_loader = DataLoader(tst_ds, batch_size=12)
Balancing the training data with a WeightedRandomSampler ensures all classes are equally represented during training, improving generalization and preventing bias toward the dominant classes. The test loader remains unbalanced to reflect the real-world class distribution for an accurate evaluation.
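To convince yourself the sampler is doing its job, you can count the labels coming out of a few batches; a rough sketch (assuming the train_loader defined above) should show all seven classes appearing about equally often:

from collections import Counter

counts = Counter()
for i, (images, labels) in enumerate(train_loader):
    counts.update(labels.tolist())
    if i >= 50:  # A few hundred samples is enough for a rough check
        break
print(counts)  # Expect roughly equal counts across the seven classes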
The Swin Transformer
In the world of computer vision, where each model exhibits distinct advantages and limitations, the Swin Transformer is a promising candidate for specific applications like skin lesion classification. This general-purpose computer vision backbone, known for its robust performance across various tasks such as object detection, semantic segmentation, and image classification, integrates key visual priors into the traditional Transformer encoder.
These include hierarchy, locality, and translation invariance, blending the Transformer's inherent modeling capabilities with adaptations suited for visual tasks. Although the Swin Transformer has not yet been benchmarked on the HAM10000 dataset, its potential application in this area is intriguing. For those interested in exploring this model further, a deeper dive into its structure and capabilities can be found in the original Swin Transformer paper.
We can load the pretrained model with HuggingFace, which I will show below. We simply load the model, and add a single linear layer on top of the last hidden state of the model.
import torch.nn as nn
from transformers import Swinv2Model

class SwinV2Classifier(nn.Module):
    def __init__(self, num_classes):
        super(SwinV2Classifier, self).__init__()
        self.model = Swinv2Model.from_pretrained("microsoft/swinv2-tiny-patch4-window8-256")
        # 218880 is the flattened size of last_hidden_state at the native
        # HAM10000 resolution; it changes if you resize the inputs.
        self.classifier = nn.Linear(218880, num_classes)

    def forward(self, x):
        outputs = self.model(x)
        features = outputs.last_hidden_state.flatten(start_dim=1)
        return self.classifier(features)

model = SwinV2Classifier(num_classes=7)
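If you change the input resolution, the hardcoded 218880 will no longer match. One defensive option (a sketch, not part of the original code) is to infer the flattened dimension with a dummy forward pass and use that when building the linear head:

import torch
from transformers import Swinv2Model

# 450x600 is the native HAM10000 image size; adjust if you resize.
backbone = Swinv2Model.from_pretrained("microsoft/swinv2-tiny-patch4-window8-256")
with torch.no_grad():
    dummy = torch.zeros(1, 3, 450, 600)
    feat_dim = backbone(dummy).last_hidden_state.flatten(start_dim=1).shape[1]
print(feat_dim)  # Pass this value as the nn.Linear input size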
The Training Loop
To train our model, we will define a few functions that will be responsible for training and evaluating our model.
import torch

def train(model, train_loader, criterion, optimizer):
    model.train()
    running_loss = 0.0
    correct = 0
    total = 0
    for images, labels in train_loader:
        optimizer.zero_grad()
        images, labels = images.to('cuda'), labels.to('cuda')
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
        running_loss += loss.item()
    train_loss = running_loss / len(train_loader)
    train_accuracy = correct / total
    return train_loss, train_accuracy

def evaluate(model, data_loader, criterion):
    model.eval()
    val_loss = 0.0
    correct = 0
    total = 0
    all_preds = []
    all_labels = []
    with torch.no_grad():
        for images, labels in data_loader:
            images, labels = images.to('cuda'), labels.to('cuda')
            outputs = model(images)
            loss = criterion(outputs, labels)
            val_loss += loss.item()
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
            all_preds.extend(predicted.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())
    accuracy = correct / total
    return val_loss / len(data_loader), accuracy, all_preds, all_labels
Now that these functions are in place, we will initialize our model and write a training loop that trains over the entire dataset and also evaluates performance on the test set every epoch. Since we are mainly looking to evaluate our model against the HAM10000 test set benchmark, I won't use a validation set; however, in a production setting, a portion of the training set is usually set aside for validating the model's performance on data it hasn't seen.
This validation set is used to benchmark the model during the development process without running the risk of overfitting the test set.
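If you do want a validation set, one lightweight option (a sketch, assuming the tr_ds dataset defined earlier) is to carve out a fraction of the training data with random_split. Note that the held-out subset would inherit tr_ds's augmentations, so for a stricter setup you'd build it from a copy of the dataset with augment=False:

import torch
from torch.utils.data import DataLoader, random_split

# Hold out 10% of the training data for validation.
val_size = int(0.1 * len(tr_ds))
train_size = len(tr_ds) - val_size
train_subset, val_subset = random_split(
    tr_ds, [train_size, val_size], generator=torch.Generator().manual_seed(42)
)
val_loader = DataLoader(val_subset, batch_size=12)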
import datetime
import os

import torch
import wandb
from torch.optim import Adam

current_time = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
models_dir = f'./runs/run_{current_time}'
if not os.path.exists(models_dir):
    os.makedirs(models_dir)

model = model.to('cuda')  # Move the model to the GPU to match the loaders above
criterion = nn.CrossEntropyLoss()
epochs = 1000
lr = 0.000003  # low lr
optimizer = Adam(model.parameters(), lr=lr)
best_train_loss = float('inf')
best_val_accuracy = 0.0

wandb.init(project="skin-lesion-classification")

label_dict = {
    'nv': 0,     # Melanocytic nevi
    'mel': 1,    # Melanoma
    'bkl': 2,    # Benign keratosis-like lesions
    'bcc': 3,    # Basal cell carcinoma
    'akiec': 4,  # Actinic keratoses
    'vasc': 5,   # Vascular lesions
    'df': 6,     # Dermatofibroma
}
class_names = [key for key, value in sorted(label_dict.items(), key=lambda item: item[1])]

for epoch in range(epochs):
    train_loss, train_accuracy = train(model, train_loader, criterion, optimizer)
    val_loss, val_accuracy, val_preds, val_labels = evaluate(model, tst_loader, criterion)
    print(f"LR: {lr}, Epoch {epoch+1}/{epochs} - Train loss: {train_loss:.4f}, "
          f"Val loss: {val_loss:.4f}, Val Accuracy: {val_accuracy:.4f}")
    wandb.log({"lr": lr, "epoch": epoch + 1, "train_loss": train_loss,
               "train_accuracy": train_accuracy, "test_loss": val_loss,
               "test_accuracy": val_accuracy})
    wandb.log({"conf_mat_test": wandb.plot.confusion_matrix(
        probs=None, y_true=val_labels, preds=val_preds, class_names=class_names)})
    # Checkpoint on both best training loss and best test accuracy
    if train_loss < best_train_loss:
        best_train_loss = train_loss
        torch.save(model.state_dict(), os.path.join(models_dir, f'best_train_model_lr{lr}.pth'))
    if val_accuracy > best_val_accuracy:
        best_val_accuracy = val_accuracy
        torch.save(model.state_dict(), os.path.join(models_dir, f'best_val_model_lr{lr}.pth'))

wandb.finish()
In this training process, the setup begins by creating a distinct directory for each model run, using the current date and time for differentiation. This approach ensures that each training session's data is stored separately. The model employs Cross-Entropy Loss as its loss function and the Adam optimizer. We use a fairly low learning rate, which seemed to work best for this task.
Each epoch involves training the model with the training dataset and evaluating it against a test dataset. Key metrics like training loss, validation loss, and validation accuracy are monitored and logged after every epoch, providing insights into the model’s performance.
Throughout the training, Weights & Biases is used for logging performance metrics. W&B tracks the progress and performance of the model over time, providing a clear view of how it evolves during training. This visualization is instrumental in identifying trends, patterns, and potential issues in the training process, such as overfitting or underfitting.
During evaluation, predictions and true labels are compared to form a confusion matrix, visualized using Weights & Biases. This matrix helps identify model performance across different classes, highlighting areas like false positives and false negatives. It's crucial for understanding class-specific accuracy and diagnosing classification errors.
Here are the results of my training run, where I logged the training loss and test accuracy for each epoch.
[W&B panels for run expert-cherry-65: training loss and test accuracy per epoch]
Overall, we achieve impressive results compared with those reported for other methods: on the Paperswithcode.com leaderboard for the HAM10000 dataset, our model would rank 5th! I'm excited to see where further advancements in AI models can take us in the field of medical diagnostics. Feel free to leave comments with your insights, questions, or suggestions for future improvements. Also, here is a link to the code on Github.
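For anyone who wants to reuse the saved checkpoints, here is a minimal inference sketch (assuming the SwinV2Classifier class above; the checkpoint path and image file are hypothetical placeholders):

import torch
from PIL import Image
from torchvision import transforms

# Hypothetical path to a checkpoint saved by the training loop above.
checkpoint_path = "./runs/run_20240101_000000/best_val_model_lr3e-06.pth"

model = SwinV2Classifier(num_classes=7)
model.load_state_dict(torch.load(checkpoint_path, map_location="cpu"))
model.eval()

# Same normalization as training; the image should be at the training
# resolution, since the classifier head's input size is fixed.
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = preprocess(Image.open("lesion.jpg")).unsqueeze(0)  # Hypothetical image
with torch.no_grad():
    pred = model(image).argmax(dim=1).item()
print(pred)  # Integer class index, per the label_dict mapping above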
Recommended Reading
The Softmax Function: The Workhorse of Machine Learning Classification
In this article, we explore how to implement the Softmax function in Python, and how to make good use of it — giving some background and context along the way.
A Gentle Introduction to Image Classification
In this article, we explore the complex subject of image recognition, from understanding the basics of CNNs to implementing your own image classification model.
Leveraging Pre-Trained Models for Image Classification
In this article, we fine-tune a pre-trained model on a new classification dataset, to understand how well transfer learning helps the model train on new data.
How to Fine-Tune BERT for Text Classification
A code-first reader-friendly kickstart to finetuning BERT for text classification, tf.data and tf.Hub