Choosing a Model for Detecting Surgical Instruments
Faster R-CNN, RetinaNet, YOLOv5, YOLOX and VFNet
Created on November 15 | Last edited on December 13

Pre-detection image courtesy of Dr. J. Aro.
The rapid advancements in AI research have provided practitioners with numerous models to choose from. Sometimes a new 'state of the art' is released before the previous one has been thoroughly understood.
These models are evaluated on benchmarks to enable performance comparisons. However, most benchmark datasets do not represent real-world data, and models may perform differently depending on a particular dataset's characteristics.
In this study, we will look at the performance of a baseline model (Faster R-CNN) and some recent, flashier models (RetinaNet, YOLOv5, YOLOX and VFNet). As a sub-study, we will experiment with two types of backbones for the VFNet model.
We will perform the coding on Colab with GPU and high-RAM settings, using the PyTorch, Fast.ai and IceVision frameworks. The Weights and Biases platform provides an excellent way of monitoring the training, as well as integrating information about the models' hyperparameters and training runs.
Outline
I. Installation and Imports
II. Dataset
 A. Loading
 B. Parsing
 C. Transforms
III. Modelling
 A. Metric
 B. Model Types
  1. Faster R-CNN
  2. RetinaNet
  3. YOLOv5
  4. YOLOX
  5. VFNet
  6. Choosing a model
 C. Model Backbones
  1. Resnet50_fpn_1x
  2. Resnet50_fpn_mstrain_2x
 D. Saving the model
IV. Inference
 A. Local run
 B. Gradio deployment
V. Conclusion
I. Installation and Imports
Here's a bit of what we'll need for our study today:
!wget https://raw.githubusercontent.com/airctic/icevision/master/icevision_install.sh
!bash icevision_install.sh cuda11 master
Let the above installations finish. It will take a couple of minutes.
import IPython
IPython.Application.instance().kernel.do_shutdown(True)
Wait for a few seconds until the kernel has restarted.
from icevision.all import *
Note: IceVision frequently updates its contents. If the above code is not working, check the IceVision forum for the latest installation instructions. It presently supports torch v1.10 and torchvision v0.11.
II. Dataset
I have created the Surgical_instruments dataset and stored it in Google Drive (subsets are available on Github). It is composed of 1,714 train, 474 valid and 21 test images, which have been annotated using Roboflow.
A. Loading
from google.colab import drive
drive.mount('/content/gdrive', force_remount = True)
!ls gdrive/MyDrive/Surgical_instruments/Second_set.v1i.coco
# output: test train valid
It's good practice to check your images before proceeding further. On this set, I found that some scraped images were in grayscale. Grayscale images have a single channel, so their tensor shapes would disrupt the modeling; they need to be converted to the RGB channel configuration.
image_path = Path('gdrive/MyDrive/Surgical_instruments/Second_set.v1i.coco/train')
img_files = get_image_files(image_path)
img = PIL.Image.open(img_files[1501])
img = img.convert('RGB')
img.to_thumb(150,150)

Image courtesy of Dr. V. Quinto
classes = ['Army_navy', 'Bulldog', 'Castroviejo', 'Forceps', 'Frazier', 'Hemostat', 'Iris',
           'Mayo_metz', 'Needle', 'Potts', 'Richardson', 'Scalpel', 'Towel_clip', 'Weitlaner', 'Yankauer']
class_map = ClassMap(classes)  # len = 16
The total number of classes here will be 16: the 15 surgical instrument classes plus the background.
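A quick sanity check of this count (a minimal sketch; it assumes, as is IceVision's default, that ClassMap prepends a 'background' entry):

print(len(class_map))               # expected: 16
print(class_map.get_classes()[:3])  # expected: ['background', 'Army_navy', 'Bulldog']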
B. Parsing
Information such as image filenames, image sizes, bounding box coordinates and annotated classes is collected and organized during the parsing step.
path = Path('gdrive/MyDrive/Surgical_instruments/Second_set.v1i.coco')
train_parser = parsers.COCOBBoxParser(
    annotations_filepath = path/'train/_annotations.coco.json',
    img_dir = path/'train')
valid_parser = parsers.COCOBBoxParser(
    annotations_filepath = path/'valid/_annotations.coco.json',
    img_dir = path/'valid')
whole = SingleSplitSplitter()
train_records, *_ = train_parser.parse(data_splitter = whole)
valid_records, *_ = valid_parser.parse(data_splitter = whole)
By default, the parser splits a dataset into 80% train / 20% valid. If your dataset is already split into separate sets, assign a parser to each one; using SingleSplitSplitter() keeps each set intact.
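For comparison, a short sketch of the default behaviour (reusing the train annotations here purely for illustration): without a data_splitter argument, parse() applies a random ~80/20 split.

# Default parsing (sketch): no data_splitter -> random ~80/20 train/valid split.
demo_parser = parsers.COCOBBoxParser(
    annotations_filepath = path/'train/_annotations.coco.json',
    img_dir = path/'train')
demo_train, demo_valid = demo_parser.parse()
print(len(demo_train), len(demo_valid))  # roughly 80% / 20% of the parsed records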
show_records(train_records[0:3],ncols=3, font_size=30, label_color = '#ffff00')

C. Transforms
presize = 512
image_size = 384
train_tfms = tfms.A.Adapter([*tfms.A.aug_tfms(size=presize), tfms.A.Normalize()])
valid_tfms = tfms.A.Adapter([*tfms.A.resize_and_pad(size=image_size), tfms.A.Normalize()])
- Allowing a relatively large presize enables all or most of the image features to be incorporated in the processing.
- Most models are flexible with image_size. (Side note: keeping image_size divisible by 128 facilitates the appropriate tensor shape conversions for EfficientDet.)
- Transformations use the albumentations library. The transforms include scaling, flips, rotations, RGB shifts, lighting shifts, blurring, cropping and padding.
train_ds = Dataset(train_records, train_tfms)
valid_ds = Dataset(valid_records, valid_tfms)
- The Dataset combines the parsed records with the transforms, which are applied on the fly when items are loaded.
III. Modelling
from icevision.models.checkpoint import *
from fastai.callback.wandb import *
from fastai.callback.tracker import SaveModelCallback
A. Metric
metrics = [COCOMetric(metric_type=COCOMetricType.bbox)]
- The COCOMetric refers to the mean Average Precision (mAP), which takes into account both the predicted bounding box location and the classification of the enclosed object.
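For intuition about what feeds into this metric, here is a small, framework-free sketch of the Intersection over Union (IoU) computation that underlies mAP; the boxes are hypothetical and use the [x_min, y_min, x_max, y_max] format.

def iou(box_a, box_b):
    # Intersection over Union of two boxes in [x_min, y_min, x_max, y_max] format.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A prediction counts as a true positive when its IoU with a same-class
# ground-truth box exceeds a threshold; COCO averages over thresholds 0.50-0.95.
print(iou([10, 10, 60, 60], [20, 20, 70, 70]))  # ~0.47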
B. Model Types
1. Faster R-CNN
Faster R-CNN is a two-step algorithm for object detection. It first considers the whole image and creates anchor (or proposal) boxes. Each proposed box has a probability of containing an object ('objectness') versus being background. The model can then focus on regions of the image with a high objectness value.
After the Region Proposal Network stage, it proceeds to the second step, which involves regression and classification. Regression evaluates the difference between the four coordinates of the proposed box and those of the ground-truth box. In the COCO format, a box is described by the x- and y-coordinates of its upper-left corner (with the image origin at the top-left, (0, 0)), its width and its height. Classification identifies the class of the object enclosed within the bounding box.
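As a small illustration of that coordinate convention (the numbers are hypothetical), converting a COCO-style box to corner coordinates:

# COCO stores a box as [x_min, y_min, width, height], with the image origin
# (0, 0) at the top-left corner.
x_min, y_min, w, h = 120.0, 45.0, 80.0, 200.0   # hypothetical annotation
x_max, y_max = x_min + w, y_min + h              # corner format used internally by many models
print([x_min, y_min, x_max, y_max])              # [120.0, 45.0, 200.0, 245.0]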
model_type = models.torchvision.faster_rcnn
model = model_type.model(num_classes=len(class_map))
train_dl = model_type.train_dl(train_ds, batch_size=16, num_workers=4, shuffle=True)
valid_dl = model_type.valid_dl(valid_ds, batch_size=16, num_workers=4, shuffle=False)
- The DataLoaders iteratively feed batches of data from the PyTorch Dataset created in the previous step. A bigger batch_size lets the learner compare more examples per iteration (16 is usually a good size).
wandb.init(project = 'Surgical_instruments_models_', name = 'Faster_RCNN', reinit = True)
- The Weights and Biases platform includes a dashboard where you can track and record the metrics and hyperparameters of the models being trained.
- 'project' will be the folder where each of the training runs is stored.
- 'name' refers to a logged training run. Use a name that represents the variable you want to compare, so the plots make sense once they are overlaid. The variable we are testing is model performance, so we use the models' names.
- This will output a link where you can access the corresponding dashboard.
learn = model_type.fastai.learner(dls = [train_dl, valid_dl],
                                  model = model,
                                  metrics = metrics,
                                  cbs = [WandbCallback(), SaveModelCallback()])
- 'learn' will be the functional model, incorporating the particular model's configuration, weights (if it is pretrained), the data, metrics and the optional callbacks.
learn.lr_find()
- lr_find() is a Fast.ai utility that runs a mock training on a portion of the data while sweeping the learning rate, plotting the loss associated with each rate. This lets you determine which lr is likely to best optimize the training. Note that the plot varies slightly depending on the portion of data it sees.
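A hedged usage sketch (the exact shape of the returned suggestion object varies across Fast.ai versions):

suggestion = learn.lr_find()   # plots loss vs. learning rate
print(suggestion)              # recent fastai versions return a SuggestedLRs object
# A common heuristic: pick a value roughly an order of magnitude below the
# learning rate at which the loss starts to climb steeply.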
learn.fine_tune(100, 2e-04, freeze_epochs=1)
[W&B panel: run set charts (1 run)]
- The COCOMetric, i.e. the mean Average Precision, for Faster R-CNN was 69.7, and the best validation loss was 0.100. The graphs have been smoothed to provide a better sense of the metric and loss trends.
- The fine_tune() functionality is best used on pretrained models. For one or a few epochs, the weights of the pretrained body are preserved ('frozen') because they are already considered good; only the randomly initialized head is trained on the new data. (A rough sketch of this schedule follows this list.)
- After the designated number of frozen epochs, the hidden layers are 'unfrozen' and their weights are also adjusted according to the new inputs.
- The lr charts correspond to the different parameter groups of the model (hidden layers, neck and head).
- The lr in fine_tune() follows a one-cycle schedule and varies inversely with the momentum.
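Roughly speaking, the frozen/unfrozen schedule above can be sketched as follows (an approximation of what fine_tune does, not Fast.ai's exact implementation):

# Approximate equivalent of learn.fine_tune(100, 2e-04, freeze_epochs=1):
learn.freeze()                                        # train only the randomly initialized head
learn.fit_one_cycle(1, 2e-04)                         # the 'freeze_epochs' phase
learn.unfreeze()                                      # let the pretrained body update as well
learn.fit_one_cycle(100, slice(2e-04 / 100, 2e-04))   # discriminative lrs per parameter group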
model_type.show_results(model, valid_ds)

- At an mAP of 69.7 after 100 epochs of training at lr 2e-04, Faster R-CNN shows a fair detection of the objects.
In a real-world situation, it is crucial for Operating Room personnel to have an exact identification and count of the instruments. A 'fair' detection such as that produced by the Faster R-CNN model is therefore not yet adequate. Various steps can be taken to improve the metric and loss; here we will experiment with varying the model used.
For now, we will focus on detecting instruments that are not densely packed. Counting will not be addressed in this tutorial.
2. RetinaNet
RetinaNet is a single-stage detector. It does not use a region proposal step to generate 'objectness'. Instead, it uses the focal loss to address the imbalance between foreground ('object') and background examples. Backgrounds are treated as easy examples and are down-weighted, which lets the model focus on the harder ('object') examples.
The single-stage model consists of a backbone and two subnetworks for classification and regression. One of the backbones for RetinaNet is resnet50_fpn. ResNet forms the network's body; its skip connections, which periodically pass the identity through, enable the construction of deeper networks with a higher capacity for learning.
The Feature Pyramid Network (FPN) forms the neck of the network, connecting feature maps from different levels of the architecture. This enables the model to detect objects at different scales: small objects are better represented in the earlier, higher-resolution feature maps, and large objects in the later ones.
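For intuition, here is a minimal PyTorch sketch of the focal loss idea for the binary object-vs-background case (RetinaNet's actual implementation lives in mmdetection; the inputs below are hypothetical):

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # Binary focal loss: down-weights easy (mostly background) examples.
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
    p_t = p * targets + (1 - p) * (1 - targets)              # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

logits = torch.tensor([2.5, -3.0, 0.1])    # hypothetical raw scores
targets = torch.tensor([1.0, 0.0, 1.0])    # 1 = object, 0 = background
print(focal_loss(logits, targets))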
model_type = models.mmdet.retinanet
backbone = model_type.backbones.resnet50_fpn_1x
model = model_type.model(backbone=backbone(pretrained=True), num_classes=len(class_map))
- Most of the code for training models with IceVision is similar to the previous example. The differences are in the model source, model name and backbone (and occasionally the need to indicate the image size).
train_dl = model_type.train_dl(train_ds, batch_size=16, num_workers=4, shuffle=True)
valid_dl = model_type.valid_dl(valid_ds, batch_size=16, num_workers=4, shuffle=False)
wandb.init(project = 'Surgical_instruments_models_', name = 'RetinaNet', reinit = True)
- Initialize wandb for every new run. Otherwise, the run will be logged as a continuation of the previous training.
Use the same commands for the learner as previously discussed.
[W&B panel: run set charts (2 runs)]
RetinaNet reached a best mAP of 64.7 after 100 epochs of training at lr 5e-05. It had higher losses compared to the two-stage detector, but followed a generally similar pattern. The output for the validation set is fair for the RetinaNet_resnet50_fpn model.

3. YOLOv5
The You Only Look Once (YOLO) family divides the image into a grid. Each grid cell is responsible for detecting the objects that fall within its borders.
YOLOv5 combines a modified Cross Stage Partial Network (CSPNet) backbone with a Path Aggregation Network (PANet) neck, which enables more efficient computation. It adds mosaic augmentation to encourage more robust learning. Classification and localization are treated as regression problems and their outputs are coupled in a single head. These features make it easy to use and applicable in real-life, mobile settings.
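A conceptual sketch of the grid idea (not YOLOv5's actual code; the numbers are hypothetical):

# Which grid cell 'owns' an object whose box center falls at (cx, cy) on a
# 640x640 input mapped onto a 20x20 prediction grid?
img_size_px, grid = 640, 20
cell = img_size_px / grid                      # 32 px per cell
cx, cy = 300.0, 150.0                          # hypothetical box center
col, row = int(cx // cell), int(cy // cell)
print(row, col)                                # cell (4, 9) is responsible for this object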
model_type = models.ultralytics.yolov5
backbone = model_type.backbones.small  # yolov5s
model = model_type.model(backbone = backbone(pretrained=True),
                         num_classes=len(class_map),
                         img_size = image_size)
[W&B panel: run set charts (2 runs)]
YOLOv5s reached a best mAP of 67.5 after 100 epochs of training at lr 2e-03. Despite its small backbone, it came close to the metric of the two-stage detector at a quarter of the processing time. The detection quality on the validation set is fair to good.

4. YOLOX
YOLOX is anchor-free: it predicts objectness from a single center point per location instead of using anchor boxes, which removes the anchor-related design choices and some of the post-processing that anchor-based models require around non-maximum suppression. It adds mosaic and mix-up augmentation for robustness, and it decouples the localization and classification branches in the network's head.
model_type = models.mmdet.yolox
backbone = model_type.backbones.yolox_tiny_8x8
model = model_type.model(backbone = backbone(pretrained=True), num_classes=len(class_map))
[W&B panel: run set charts (3 runs)]
YOLOX_tiny reached a best mAP of 66.2 after training for 100 epochs at lr 1e-03. Its mAP rose rapidly during the first epochs, then approached the performance of YOLOv5s, while training a fraction slower than YOLOv5. Notably, the losses for the validation set are high. On visual inspection, the quality of detection for the validation set is fair to good.

5. VFNet
The VarifocalNet attempts to address the discrepancy between localization and classification that might arise during the post-processing. This discrepancy is notable in the non-maximum suppression step where well-localized boxes may be eliminated because of a poor classification score.
The IoU-aware classification score is trained with the Varifocal Loss, which down-weights negative (background) examples and focuses on positive, high-quality examples.
It uses a fixed, star-shaped set of nine sampling points to measure the distance to the ground-truth box and refines the bounding box offsets accordingly.
It uses the ResNet-FPN combination for its backbone, similar to RetinaNet.
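For intuition, a rough PyTorch sketch of the Varifocal Loss as described in the paper (the production implementation is in mmdetection; treat this as an approximation with hypothetical inputs):

import torch
import torch.nn.functional as F

def varifocal_loss(pred_logits, target_q, alpha=0.75, gamma=2.0):
    # target_q is the IoU-aware target: the IoU with the matched ground truth
    # for positives, and 0 for background.
    p = torch.sigmoid(pred_logits)
    pos = (target_q > 0).float()
    bce = F.binary_cross_entropy(p, target_q, reduction='none')
    # Positives are weighted by their target quality q; negatives are
    # down-weighted by alpha * p^gamma, as in focal loss but asymmetric.
    weight = pos * target_q + (1 - pos) * alpha * p.detach().pow(gamma)
    return (weight * bce).mean()

pred = torch.tensor([1.2, -2.0, 0.3])   # hypothetical logits
q = torch.tensor([0.8, 0.0, 0.6])       # IoU with matched ground truth (0 = background)
print(varifocal_loss(pred, q))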
model_type = models.mmdet.vfnet
backbone = model_type.backbones.resnet50_fpn_1x
model = model_type.model(backbone=backbone(pretrained=True), num_classes=len(class_map))
[W&B panel: run set charts (3 runs)]
VFNet_resnet50_fpn_1x reached a best mAP of 81.9 after 100 epochs of training at lr 2e-03, performing better than the two-stage Faster R-CNN. With the same backbone as the RetinaNet variant we used, VFNet achieved better metrics; however, it took longer to train and had a higher validation loss. A quick visual check shows that, despite the better metric, the predictions still need improvement.

6. Choosing a model
[W&B panel: run set charts (5 runs)]

Each model has its particularities, and the choice should be guided by the project objectives. The surgical instruments project can accept slightly longer training and inference times if that results in more reliable detections. Among the modern models, VFNet provided a good mAP and validation loss. We will choose VFNet and explore whether another backbone can further improve the output.
C. Model Backbones
1. Resnet50_fpn_1x
This backbone was used in the example above. Its pretrained weights come from one standard training schedule (1x) on the COCO dataset.
2. Resnet50_fpn_mstrain_2x
This pretrained variant was trained with a doubled schedule (2x) and multi-scale training (mstrain), so we expect more refined weights.
model_type = models.mmdet.vfnet
backbone = model_type.backbones.resnet50_fpn_mstrain_2x
model = model_type.model(backbone=backbone(pretrained=True), num_classes=len(class_map))
[W&B panel: run set charts (2 runs)]
The VFNet_resnet50_fpn_mstrain_2x variant reached a best mAP of 82.8 after 100 epochs of training at lr 5e-04. Its mAP and loss were slightly better, and its training run slightly faster, than the 1x variant. Visual checking of predictions on the validation set showed good results.

For prototyping purposes, these would be considered as acceptable baseline results.
D. Saving the model
root_dir = Path('/content/gdrive/My Drive/')
from icevision.models import *
checkpoint_path = root_dir/'Surgical_instruments/Models/VFNet_nov13_mAP81.5.pth'  # 100 epochs
save_icevision_checkpoint(model,
                          model_name='mmdet.vfnet',
                          backbone_name='resnet50_fpn_mstrain_2x',
                          classes = train_parser.class_map.get_classes(),
                          img_size=image_size,
                          filename=str(checkpoint_path),
                          meta={'icevision_version': '0.9.1'})
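To load the model back later, a hedged sketch using IceVision's model_from_checkpoint helper (already available via the icevision.models.checkpoint import above; the returned dictionary keys may vary slightly between IceVision releases):

from icevision.models.checkpoint import model_from_checkpoint

# The checkpoint saved above embeds the model name, backbone, classes and image size,
# so the model can be restored from the file alone.
checkpoint_and_model = model_from_checkpoint(str(checkpoint_path))
model = checkpoint_and_model['model']
model_type = checkpoint_and_model['model_type']
class_map = checkpoint_and_model['class_map']
img_size = checkpoint_and_model['img_size']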
IV. Inference
A. Local run
We will load the test set and perform the preparatory steps, including the inference-time transforms.
path = Path('gdrive/MyDrive/Surgical_instruments/Second_set.v1i.coco')
test_parser = parsers.COCOBBoxParser(
    annotations_filepath = path/'test/_annotations.coco.json',
    img_dir = path/'test')
whole = SingleSplitSplitter()
test_records, *_ = test_parser.parse(data_splitter = whole)
# show_records(test_records[0:3], ncols=3, font_size=30, label_color = '#ffff00')
valid_tfms = tfms.A.Adapter([*tfms.A.resize_and_pad(image_size), tfms.A.Normalize()])
infer_ds = Dataset(test_records, valid_tfms)
We will then get the predictions based on the model that we have chosen.
model_type = models.mmdet.vfnet
backbone = model_type.backbones.resnet50_fpn_mstrain_2x
infer_dl = model_type.infer_dl(infer_ds, batch_size=4, shuffle=False)
preds_saved_test = model_type.predict_from_dl(model, infer_dl, keep_images=True, detection_threshold = 0.5)
- The detection_threshold is the minimum confidence score a prediction must reach to be kept in the output.
show_preds(preds=preds_saved_test[0:], font_size=25, label_color = '#3050ff')

The model performed very well on the held-out test set!
B. Gradio deployment
!echo "- Installing gradio"!pip install gradio -U -qimport icedataimport PIL, requestsimport torchfrom torchvision import transformsimport gradio as grdef show_preds_gradio(input_image, display_label, display_bbox, detection_threshold):if detection_threshold==0: detection_threshold=0.5img = PIL.Image.fromarray(input_image, 'RGB')pred_dict = model_type.end2end_detect(img,valid_tfms,model,class_map=class_map,detection_threshold=detection_threshold,display_label=display_label,display_bbox=display_bbox,return_img=True,font_size=16,label_color="#FF59D6")return pred_dict['img']display_chkbox_label = gr.inputs.Checkbox(label="Label", default=True)display_chkbox_box = gr.inputs.Checkbox(label="Box", default=True)detection_threshold_slider = gr.inputs.Slider(minimum=0, maximum=1, step=0.1,default=0.5, label="Detection Threshold")outputs = gr.outputs.Image(type="pil")gr_interface = gr.Interface(fn=show_preds_gradio,inputs=["image", display_chkbox_label, display_chkbox_box, detection_threshold_slider],outputs=outputs,title='Surgical Instruments Detection and Identification Tool')gr_interface.launch(inline=False, share=True, debug=True)
- This will output a link to the surgical instruments detection and identification application. The link stays live as long as this kernel is running, for a maximum of 72 hours. You can share the link with others and see any errors they flag.
V. Conclusion
Using a pre-annotated dataset containing ~2K images, we were able to test different models and backbones. A VFNet model with a ResNet50 backbone and an FPN neck was then fine-tuned, yielding good detection of sparsely distributed surgical instruments with a mean Average Precision of 82.8.
I hope you had fun!
Maria
Github: https://github.com/yrodriguezmd?tab=repositories. The notebook for this tutorial will be in the folder Deep_Learning_tutorials :)
Tags: Intermediate, Computer Vision, Object Detection, Experiment, R-CNN, VFNet, Github, Plots, Health Care, YOLO