Deep Learning and MLOps for Health Care: A look into MedSAM
Introduction: MedSAM's practical applications

In the bustling corridors of a modern hospital, radiologists and clinicians are often seen poring over a myriad of medical images — from X-rays to MRIs — seeking to unravel the mysteries hidden within these complex visuals. Each image is a puzzle, where accurately identifying and delineating regions of interest (ROIs) can be the key to a correct diagnosis, an effective treatment plan, or even a life-saving intervention. This is the world of medical image segmentation, a domain where precision meets the critical needs of patient care.
Traditionally, this segmentation has been a manual and labor-intensive process, requiring hours of expert attention for each image. The advent of semi-automatic and fully automatic segmentation methods offered a glimmer of hope, promising to reduce the time and labor involved. However, these methods, until recently, have been plagued by a lack of versatility and consistency, especially when faced with varying tasks or imaging modalities.
Enter MedSAM, a groundbreaking deep learning model that stands at the forefront of a new era in medical image analysis. MedSAM, or Medical Segment Anything Model, is a universal foundation model designed to bridge the gap in the current landscape of medical image segmentation. It leverages the power of a large-scale dataset encompassing over 1.57 million image-mask pairs, covering an extensive range of more than 30 cancer types and 10 imaging modalities.
What sets MedSAM apart is its exceptional ability to adapt and perform across a diverse spectrum of segmentation tasks, demonstrating superior accuracy and robustness compared to traditional modality-specific models. This leap forward in technology is not just a theoretical advancement; it has tangible, real-world implications. With MedSAM, clinicians can look forward to more accurate diagnoses, personalized treatment plans, and efficient monitoring of diseases — all achieved with unprecedented speed and precision.
The development of MedSAM is inspired by the strides made in natural image segmentation, particularly the Segment Anything Model (SAM) and its counterparts. These models showed remarkable versatility across various segmentation tasks but fell short when applied to the nuanced and complex world of medical images. MedSAM, through extensive fine-tuning and evaluation, overcomes these challenges, offering a refined model specifically tailored for the medical domain.
Through rigorous evaluation on 86 internal and 146 external validation tasks, MedSAM has consistently outperformed state-of-the-art segmentation models, proving its worth as a versatile, high-performing tool in medical image analysis.
A recap on SAM: Segment Anything

Task and Pre-training:
SAM's core is the promptable segmentation task, drawing from NLP's prompt-based learning. It involves returning a valid segmentation mask for various prompts, including points, boxes, masks, or text. The pre-training algorithm adapts interactive segmentation techniques, simulating sequences of prompts for each image and assessing the model's ability to predict valid masks for ambiguous prompts.
Model Architecture:
Image Encoder: SAM employs a Vision Transformer (ViT) pre-trained with Masked Autoencoders (MAE), adapted for high-resolution input processing. It runs once per image to create a detailed image embedding.
Prompt Encoder: This component handles different types of prompts. Sparse prompts (points, boxes, text) use positional encodings and learned embeddings, while dense prompts (masks) are embedded using convolutions. Text prompts are processed with a CLIP-based text encoder.
Mask Decoder: It efficiently maps the image and prompt embeddings to a segmentation mask using a modified Transformer decoder block. This block incorporates prompt self-attention and cross-attention mechanisms and concludes with a dynamic mask prediction head for computing mask probabilities.
Ambiguity Resolution and Efficiency:
SAM can generate multiple valid masks for a single, ambiguous prompt, each with a confidence score. It achieves this through a model modification that allows the prediction of several output masks. The design prioritizes efficiency: with precomputed image embeddings, the prompt encoder and mask decoder can operate in ∼50ms on standard computing environments, facilitating real-time interactions.
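For readers who want to see what this looks like in practice, here is a minimal sketch of prompting SAM with the open-source segment-anything package. The checkpoint path, image, and point prompt below are placeholders, and the snippet assumes a downloaded ViT-B checkpoint.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a pre-trained SAM checkpoint (the ViT-B file name is the one released
# by Meta; the local path is illustrative).
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

# A stand-in RGB image; in practice this would be a loaded scan or photo.
image = np.zeros((512, 512, 3), dtype=np.uint8)
predictor.set_image(image)  # the heavy image encoder runs once here

# A single foreground point is an ambiguous prompt, so we ask for multiple
# candidate masks, each returned with its own confidence score.
masks, scores, logits = predictor.predict(
    point_coords=np.array([[256, 256]]),
    point_labels=np.array([1]),  # 1 = foreground, 0 = background
    multimask_output=True,
)
print(masks.shape, scores)  # up to 3 masks of shape (H, W), one score each
```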
Training and Losses:
SAM's training integrates a blend of geometric and text-based prompts. The model is supervised using a linear combination of focal loss and dice loss, simulating an interactive setup with random prompt sampling. This approach ensures seamless integration into diverse segmentation tasks, allowing SAM to adapt to various practical segmentation scenarios via prompt engineering.
Focal Loss is primarily used to address class imbalance in object detection tasks. It was introduced in the context of training deep neural networks for object detection in scenarios where there is a significant imbalance between the background and foreground classes.
Problem Addressed: In many object detection tasks, the majority of the anchor boxes are negative (representing background), leading to an imbalance between the background and object classes. This imbalance can result in the model being overwhelmed by the sheer number of easy negatives, causing it to perform poorly on detecting actual objects.
How It Works: Focal Loss modifies the standard cross-entropy loss by adding a factor that reduces the loss for well-classified examples. The idea is to focus the model's training on hard negatives. It is computed as:
$$\mathrm{FL}(p_t) = -\alpha_t \, (1 - p_t)^{\gamma} \, \log(p_t)$$
Where:
- $p_t$ is the model's estimated probability for the true class.
- $\alpha_t$ is a balancing parameter.
- $\gamma$ is a focusing parameter that adjusts the rate at which easy examples are down-weighted. The higher the $\gamma$, the more focus on hard, misclassified examples.
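As a concrete reference, here is a minimal PyTorch sketch of binary focal loss; the alpha and gamma defaults are the common values from the original paper, not MedSAM-specific settings.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """logits and targets have the same shape, e.g. (N, H, W); targets are 0/1 floats."""
    # Standard BCE gives -log(p_t) per element.
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-balancing weight
    # Down-weight easy, well-classified examples by (1 - p_t)^gamma.
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()
```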
Dice Loss is a loss function often used for segmentation tasks, particularly in medical image processing. It's effective for data with a high imbalance between the object and background pixels.
Problem Addressed: In segmentation, especially medical imaging, the region of interest (like a tumor in an MRI scan) often occupies a much smaller area compared to the background. Standard loss functions like cross-entropy may not perform well due to this imbalance.
How It Works: Dice Loss is based on the Dice Coefficient, which is a measure of overlap between two samples. It is particularly useful for measuring the similarity between the predicted segmentation and the ground truth. The Dice Coefficient is calculated as:
$$\mathrm{Dice}(X, Y) = \frac{2\,|X \cap Y|}{|X| + |Y|}$$
Where $X$ is the predicted set of pixels and $Y$ is the ground truth. The Dice Loss is then formulated as:
$$\mathcal{L}_{\mathrm{Dice}} = 1 - \mathrm{Dice}(X, Y)$$
This loss function works well for segmentation because it directly maximizes the overlap between the predicted segmentation and the ground truth, making it less sensitive to class imbalance.
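A corresponding soft Dice loss sketch in PyTorch is shown below; the smoothing constant is a common convention to avoid division by zero, not a value taken from MedSAM.

```python
import torch

def dice_loss(logits, targets, smooth=1e-6):
    """logits and targets have shape (N, H, W); targets are 0/1 floats."""
    probs = torch.sigmoid(logits)
    intersection = (probs * targets).sum(dim=(1, 2))
    union = probs.sum(dim=(1, 2)) + targets.sum(dim=(1, 2))
    dice = (2 * intersection + smooth) / (union + smooth)  # per-sample Dice coefficient
    return 1 - dice.mean()
```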
In summary:
- Focal Loss addresses class imbalance by focusing training on difficult, misclassified examples in classification tasks.
- Dice Loss is used in segmentation tasks to maximize the overlap between the prediction and the ground truth in scenarios with high imbalance.
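Since SAM supervises its masks with a linear combination of these two losses, the overall objective can be sketched by reusing the functions above; the 20:1 focal-to-dice weighting is the ratio reported in the SAM paper and is an assumption here, not a documented MedSAM setting.

```python
def segmentation_loss(logits, targets):
    # Weighted sum of the two sketches above (focal_loss and dice_loss).
    return 20.0 * focal_loss(logits, targets) + dice_loss(logits, targets)
```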
Teaching SAM Medicine: MedSAM

TL;DR - Results at a glance:

Dataset Composition:
- MedSAM trained on a dataset of 1,090,486 medical image-mask pairs.
- Covered 15 imaging modalities and over 30 cancer types.
- Predominant modalities: CT, MRI, and endoscopy, with inclusion of ultrasound, pathology, fundus, dermoscopy, mammography, and OCT.
Internal Validation Results:
- Focused on 12 representative segmentation tasks.
- Evaluated using Dice Similarity Coefficient (DSC):
- Median DSC for key tasks: Intracranial hemorrhage CT (94.0%), Glioma MR T1 (94.4%), Pneumothorax CXR (81.5%), Polyp endoscopy (98.4%).
- MedSAM surpassed U-Net models' performance in most tasks.
- Comparable performance with U-Net in tasks with clear boundaries, e.g., skin cancer segmentation (95.2% vs 95.1% for U-Net).
External Validation Results:
- Involved over 30 new segmentation tasks from unseen datasets.
- MedSAM showed superior generalization, outperforming both SAM and U-Net specialist models in diverse tasks.
- Notable DSC improvements: Nasopharynx cancer segmentation (90.3%), outperforming SAM by 53.3% and U-Net by 24.5%.
- MedSAM also outperformed SAM and U-Net by 3-7% on unseen modalities such as abdominal T1 in-phase and out-of-phase MRI.
Quantitative Analysis:
- MedSAM demonstrated higher precision in tumor burden quantification, with a Pearson correlation of 0.99 compared to expert evaluations.
- In prostate segmentation, MedSAM's performance matched or exceeded that of six human experts.
Qualitative Observations:
- MedSAM effectively segmented objects with weak or missing boundaries.
- Superior performance in segmenting challenging targets like liver and cervical cancers.
- Saliency map analysis revealed that MedSAM's features are rich in semantic information relevant to anatomical structures.
What makes MedSAM unique
MedSAM keeps SAM's overall architecture (image encoder, prompt encoder, and mask decoder) but adapts it to the medical domain: the prompt encoder is kept frozen while the image encoder and mask decoder are fine-tuned on the large-scale medical dataset described above. Bounding boxes are used as the prompt, giving clinicians a simple, unambiguous way to indicate the region of interest. The result is a single model that transfers across imaging modalities and segmentation targets without task-specific retraining.
Training our own MedSAM for Breast Cancer Detection
Constructing an end-to-end MLOps pipeline
To fine-tune MedSAM for breast cancer detection, we build an end-to-end MLOps pipeline with Weights & Biases: datasets are versioned as artifacts, training and validation metrics are logged from every run, predictions are inspected in tables, hyperparameters are tuned with sweeps, the best checkpoint is promoted to the model registry, and the whole workflow is orchestrated with Launch and automated with webhooks and GitHub Actions.
Training Dataset
nielsr-breast-cancer-train (dataset artifact, direct lineage view)
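Here is a minimal sketch of how the training data can be versioned and consumed as a W&B artifact; the project name and local directory are placeholders, while the artifact name matches the one shown in the lineage view above.

```python
import wandb

# Log the raw training data once as a versioned dataset artifact.
with wandb.init(project="medsam-breast-cancer", job_type="upload-data") as run:
    artifact = wandb.Artifact("nielsr-breast-cancer-train", type="dataset")
    artifact.add_dir("data/train")  # local folder of images and masks (illustrative path)
    run.log_artifact(artifact)

# Later, a training run consumes the artifact, which records lineage automatically.
with wandb.init(project="medsam-breast-cancer", job_type="train") as run:
    data_dir = run.use_artifact("nielsr-breast-cancer-train:latest").download()
    # ... build the dataloader from data_dir ...
```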
Training Results - Logged Metrics
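A minimal sketch of the logging pattern behind these charts is shown below; the hyperparameters and per-epoch metric values are placeholders for whatever the real training loop produces.

```python
import wandb

run = wandb.init(
    project="medsam-breast-cancer",
    config={"lr": 1e-4, "epochs": 20, "batch_size": 8},  # illustrative hyperparameters
)

for epoch in range(run.config.epochs):
    # ... run one fine-tuning epoch with the combined dice + focal loss here ...
    train_loss, train_dice = 0.42, 0.91  # placeholders for the real epoch metrics
    wandb.log({"train/loss": train_loss, "train/dice": train_dice, "epoch": epoch})

run.finish()
```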
Automatic GPU Tracking
Validation Dataset
nielsr-breast-cancer-val (dataset artifact, direct lineage view)
Validation Results - Logged Metrics
Qualitative Analysis via Tables
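The table panels above can be produced with a sketch like the following; the images and masks here are synthetic stand-ins for real validation cases, and the overlay uses W&B's image-mask logging.

```python
import numpy as np
import wandb

run = wandb.init(project="medsam-breast-cancer", job_type="evaluation")
table = wandb.Table(columns=["case_id", "image", "dice"])

for case_id in range(4):
    image = np.random.randint(0, 255, (256, 256, 3), dtype=np.uint8)  # placeholder scan
    pred = np.random.randint(0, 2, (256, 256))                        # placeholder prediction
    gt = np.random.randint(0, 2, (256, 256))                          # placeholder ground truth
    overlay = wandb.Image(image, masks={
        "prediction": {"mask_data": pred, "class_labels": {1: "tumor"}},
        "ground_truth": {"mask_data": gt, "class_labels": {1: "tumor"}},
    })
    table.add_data(case_id, overlay, 0.90)  # placeholder Dice score

run.log({"qualitative_results": table})
run.finish()
```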
Hyperparameter Optimization via Sweeps
We can use the parallel coordinates plot to compare sweep runs, identify the best-performing configuration, and select that model for downstream tasks or for deployment via the model registry.
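A sweep over the fine-tuning hyperparameters can be defined with a sketch like this one; the search space, metric name, and train() entry point are illustrative.

```python
import wandb

sweep_config = {
    "method": "bayes",
    "metric": {"name": "val/dice", "goal": "maximize"},
    "parameters": {
        "lr": {"min": 1e-5, "max": 1e-3},
        "batch_size": {"values": [4, 8, 16]},
        "weight_decay": {"values": [0.0, 0.01]},
    },
}

def train():
    # Each agent invocation starts a run whose config is filled in by the sweep.
    with wandb.init() as run:
        # ... fine-tune MedSAM with run.config and compute validation Dice here ...
        wandb.log({"val/dice": 0.90})  # placeholder metric

sweep_id = wandb.sweep(sweep_config, project="medsam-breast-cancer")
wandb.agent(sweep_id, function=train, count=20)
```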
Looking into our Model Registry
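Once the best run is identified, its checkpoint can be promoted so that downstream jobs pull a stable, aliased model rather than an ad-hoc file. The sketch below logs the checkpoint as a model artifact and links it into the registry; the checkpoint path and registered-model name are placeholders.

```python
import wandb

with wandb.init(project="medsam-breast-cancer", job_type="register-model") as run:
    model_artifact = wandb.Artifact("medsam-breast-cancer-model", type="model")
    model_artifact.add_file("checkpoints/best_medsam.pth")  # illustrative checkpoint path
    run.log_artifact(model_artifact)
    # Link the artifact into the model registry under a stable name.
    run.link_artifact(model_artifact, "model-registry/MedSAM-Breast-Cancer")
```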
Orchestrating our ML workflow - Launch
W&B Launch lets us package the fine-tuning code as a reusable job, push it to a queue, and have an agent execute it on whatever compute the queue points at (a local GPU machine, Kubernetes, or a cloud provider). This turns the ad-hoc training script into a repeatable, parameterized step that can be re-run whenever the data or hyperparameters change.
Automating our ML workflow - Webhooks and GitHub Actions
With a webhook automation attached to the model registry, promoting a checkpoint to a new alias (for example, production) sends an HTTP request that triggers a GitHub Actions workflow, which can then evaluate, package, or deploy the newly registered model. This closes the loop: a change in the registry automatically kicks off the downstream CI/CD steps without manual intervention.