Choosing a Model for Detecting Surgical Instruments

Faster R-CNN, RetinaNet, YOLOv5, YOLOX and VFNet. Made by Maria Rodriguez using Weights & Biases
Pre-detection image courtesy of Dr. J. Aro.
The rapid advancements in AI research have given practitioners numerous models to choose from. Sometimes a new 'state-of-the-art' model is released before the previous one has been thoroughly understood.
These models are tested on benchmark datasets so that their performance can be compared. However, most benchmark datasets do not represent real-world data, and a model may perform differently depending on the characteristics of a particular dataset.
In this study, we will look at the performance of a baseline model (Faster R-CNN) and some more recent, flashier models (RetinaNet, YOLOv5, YOLOX and VFNet). As a sub-study, we will experiment with two types of backbones for the VFNet model.
We will code on Colab with GPU and high-RAM settings, using the PyTorch and IceVision frameworks. The Weights & Biases platform provides an excellent way of monitoring training and of recording information about each model's hyperparameters and training run.


I. Installation and Imports

Here's a bit of what we'll need for our study today:
!wget!bash cuda11 master
Let the above installations finish. It will take a couple of minutes.
import IPython
IPython.Application.instance().kernel.do_shutdown(True)
Wait for a few seconds until the kernel has restarted.
from icevision.all import *
Note: IceVision is updated frequently. If the code above does not work, check the IceVision forum for the latest installation instructions. It presently supports torch v1.10 and torchvision v0.11.

II. Dataset

I have created the Surgical_instruments dataset and stored it in Google Drive (subsets are available on Github). It is composed of 1,714 train, 474 valid and 21 test images which have been annotated using Roboflow.

A. Loading

from google.colab import drive
drive.mount('/content/gdrive', force_remount = True)
If you are using a dataset from Github, use !git clone . See an example here.
!ls gdrive/MyDrive/Surgical_instruments/Second_set.v1i.coco # output: test train valid
It's good practice to check your images before proceeding. In this set, I found that some scraped images were in grayscale. A grayscale image has a single channel, so its tensor shape differs from that of a three-channel RGB image and would disrupt the modeling. Such images need to be converted to the RGB channel configuration.
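image_path = Path('gdrive/MyDrive/Surgical_instruments/Second_set.v1i.coco/train')
img_files = get_image_files(image_path)
img = PIL.Image.open(img_files[1501]) # loader call assumed here; any PIL image opener would do
img = img.convert('RGB') # force 3 channels
img.to_thumb(150,150)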
Image courtesy of Dr. V. Quinto
classes = ['Army_navy', 'Bulldog', 'Castroviejo', 'Forceps', 'Frazier', 'Hemostat', 'Iris', 'Mayo_metz', 'Needle', 'Potts', 'Richardson', 'Scalpel', 'Towel_clip', 'Weitlaner', 'Yankauer']
class_map = ClassMap(classes) # len = 16
The total number of classes here will be 16: the 15 surgical instruments classes plus the background.

B. Parsing

Information such as image filenames, image sizes, bounding box sizes and annotated classes will be collected and sorted through the parsing method.
path = Path('gdrive/MyDrive/Surgical_instruments/Second_set.v1i.coco')
train_parser = parsers.COCOBBoxParser(
    annotations_filepath = path/'train/_annotations.coco.json',
    img_dir = path/'train')
valid_parser = parsers.COCOBBoxParser(
    annotations_filepath = path/'valid/_annotations.coco.json',
    img_dir = path/'valid')
whole = SingleSplitSplitter()
train_records, *_ = train_parser.parse(data_splitter = whole)
valid_records, *_ = valid_parser.parse(data_splitter = whole)
The parser defaults to splitting a dataset into 80% train / 20% valid. If your dataset is already split into separate sets, assign a parser to each one. Using SingleSplitSplitter() will keep each set intact.
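To make the difference between the two splitting behaviors concrete, here is a minimal pure-Python sketch. It does not call IceVision; `random_split` and `single_split` are hypothetical helpers written only for illustration, and the fixed seed is an assumption made so the example is repeatable.

```python
import random

def random_split(record_ids, probs=(0.8, 0.2), seed=42):
    """Shuffle ids and cut them into consecutive chunks sized by `probs`."""
    ids = list(record_ids)
    random.Random(seed).shuffle(ids)
    splits, start = [], 0
    for i, p in enumerate(probs):
        # The last split takes whatever remains, so no id is dropped.
        end = len(ids) if i == len(probs) - 1 else start + round(p * len(ids))
        splits.append(ids[start:end])
        start = end
    return splits

def single_split(record_ids):
    """Keep every id in one split, as when the dataset is already divided."""
    return [list(record_ids)]

ids = range(10)
train_ids, valid_ids = random_split(ids)
(all_ids,) = single_split(ids)
print(len(train_ids), len(valid_ids), len(all_ids))
```

Both helpers return a list of splits. In the same spirit, the tutorial unpacks the parser output with `train_records, *_`, keeping the first split and discarding any others.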
show_records(train_records[0:3],ncols=3, font_size=30, label_color = '#ffff00')

C. Transforms

presize = 512
image_size = 384
train_tfms = tfms.A.Adapter([*tfms.A.aug_tfms(size=presize), tfms.A.Normalize()])
valid_tfms = tfms.A.Adapter([*tfms.A.resize_and_pad(size=image_size), tfms.A.Normalize()])
train_ds = Dataset(train_records, train_tfms)
valid_ds = Dataset(valid_records, valid_tfms)

III. Modelling

import wandb
from icevision.models.checkpoint import *
from fastai.callback.wandb import *
from fastai.callback.tracker import SaveModelCallback

A. Metric

metrics = [COCOMetric(metric_type=COCOMetricType.bbox)]

B. Model Types

1. Faster R-CNN

Faster R-CNN is a two-stage algorithm for object detection. In the first stage, a Region Proposal Network takes the whole image into account and generates anchor-based proposal boxes. Each proposed box has a probability of containing an object ('objectness') rather than background. The model can then focus on regions of the image with a high objectness value.
The second stage performs regression and classification on the proposals. Regression evaluates the difference between the 4 coordinates of the proposed box and those of the ground-truth box. In the COCO format, a box is described by the x- and y-coordinates of its upper-left corner (measured from the image origin (0, 0) at the upper left), the box width and the box height. Classification identifies the class of the object enclosed within the bounding box.
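As a rough illustration of the box bookkeeping described above, the sketch below converts COCO-format boxes to corner format, lists the raw coordinate differences between a proposal and its ground truth, and measures their overlap. The two boxes are made-up values, and the helper names are hypothetical. Real Faster R-CNN implementations regress normalized offsets rather than these raw differences, so treat this as a simplification.

```python
def coco_to_corners(box):
    """Convert a COCO box [x_min, y_min, width, height] to [x1, y1, x2, y2]."""
    x, y, w, h = box
    return [x, y, x + w, y + h]

def iou(box_a, box_b):
    """Intersection over Union of two corner-format boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

proposal = [50, 40, 100, 80]      # hypothetical proposal, COCO format
ground_truth = [60, 50, 100, 80]  # hypothetical annotation, COCO format

# Raw per-coordinate differences the regression step would need to correct.
deltas = [g - p for p, g in zip(proposal, ground_truth)]
overlap = iou(coco_to_corners(proposal), coco_to_corners(ground_truth))
print(deltas, round(overlap, 3))
```

The same IoU measure underlies the COCO mAP metric used in this study: a prediction counts as correct only if its overlap with a ground-truth box exceeds a threshold.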
model_type = models.torchvision.faster_rcnn
model = model_type.model(num_classes=len(class_map))
train_dl = model_type.train_dl(train_ds, batch_size=16, num_workers=4, shuffle=True)
valid_dl = model_type.valid_dl(valid_ds, batch_size=16, num_workers=4, shuffle=False)
wandb.init(project = 'Surgical_instruments_models_', name = 'Faster_RCNN', reinit = True)
learn = model_type.fastai.learner(dls = [train_dl, valid_dl], model = model, metrics = metrics, cbs = [WandbCallback(), SaveModelCallback()])
learn.fine_tune(100, 2e-04, freeze_epochs =1)
In a real-world situation, it is crucial for Operating Room personnel to have exact identification and counts of the instruments. A merely 'fair' detection, such as that resulting from the Faster R-CNN model, is therefore not yet adequate. Various steps can be taken to improve the metric and loss. Here, we will experiment with varying the model used. For now, we will focus on detecting instruments that are not densely packed. Counts will not be addressed in this tutorial.

2. RetinaNet

RetinaNet is a single-stage detector. It does not use a region proposal step to generate 'objectness'. Instead, it uses the focal loss to address the imbalance between foreground ('object') and background examples. Background examples are mostly easy and are down-weighted, which lets the model focus on the harder ('object') examples.
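The down-weighting can be seen in a small numeric sketch of the binary focal loss, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t). The values alpha = 0.25 and gamma = 2 are the defaults reported in the RetinaNet paper. The two example predictions are invented for illustration, and this is not the code IceVision or MMDetection runs internally.

```python
import math

def cross_entropy(p, y):
    """Plain binary cross-entropy for one prediction."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one prediction.

    p: predicted foreground probability; y: 1 for object, 0 for background.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t) ** gamma shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy_background = (0.05, 0)  # correctly and confidently called background
hard_object = (0.30, 1)      # an object the model is unsure about

for name, (p, y) in [("easy background", easy_background),
                     ("hard object", hard_object)]:
    print(name, round(cross_entropy(p, y), 4), round(focal_loss(p, y), 6))
```

Compared with cross-entropy, the easy background's loss shrinks by orders of magnitude while the hard object's loss stays substantial, so the many background locations no longer dominate the total.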
The single-stage model consists of a backbone and two subnetworks, one for classification and one for regression. One of the backbones available for RetinaNet is resnet50_fpn. The ResNet makes up the network's feature-extracting hidden layers. Its skip connections, which periodically add a block's input (the identity function) to its output, enable the construction of deeper networks with a higher capacity for learning.
The Feature Pyramid Network (FPN) forms the neck of the network. It connects feature maps from different depths of the backbone, which enables the model to detect objects at different scales. Small objects are better represented in the earlier (bottom) feature maps, and large objects in the later (top) ones.
model_type = models.mmdet.retinanet
backbone = model_type.backbones.resnet50_fpn_1x
model = model_type.model(backbone=backbone(pretrained=True), num_classes=len(class_map))
train_dl = model_type.train_dl(train_ds, batch_size=16, num_workers=4, shuffle=True)
valid_dl = model_type.valid_dl(valid_ds, batch_size=16, num_workers=4, shuffle=False)
wandb.init(project = 'Surgical_instruments_models_', name = 'RetinaNet', reinit = True)
Use the same commands for learner as previously discussed.
RetinaNet reached a best mAP of 64.7 after 100 epochs of training at lr 5e-05. It had higher losses than the 2-stage detector, and in general followed a similar pattern. The output for the validation set is fair for the RetinaNet_resnet50_fpn model.

3. YOLOv5

The You Only Look Once (YOLO) family divides the image into a grid. Each grid cell is responsible for detecting objects within its borders.
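A simplified way to picture this assignment is to find the grid cell that contains the center of a box. The sketch below assumes a single grid and a made-up box center. The helper `responsible_cell` is hypothetical, and YOLOv5 itself works with several grid scales and anchors, which this deliberately ignores.

```python
def responsible_cell(box_center, image_size, grid_size):
    """Return (row, col) of the grid cell containing the box center."""
    cx, cy = box_center
    width, height = image_size
    # Clamp so a center lying exactly on the right or bottom edge stays in range.
    col = min(int(cx / width * grid_size), grid_size - 1)
    row = min(int(cy / height * grid_size), grid_size - 1)
    return row, col

# Hypothetical 384x384 image split into a 12x12 grid (32-pixel cells).
print(responsible_cell((100, 250), (384, 384), 12))
```

With 32-pixel cells, a center at x = 100, y = 250 falls in column 3 and row 7 of the grid.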
YOLOv5 combines a modified Cross Stage Partial Network (CSPNet) backbone and a Path Aggregation Network (PANet) neck which enables a more efficient computation. It includes an additional mosaic augmentation to stimulate more robust learning. The classification and localization of objects undergo regression processing and the output is coupled in a single head. These features enable ease-of-use and real-life mobile applicability.
model_type = models.ultralytics.yolov5
backbone = model_type.backbones.small # yolov5s
model = model_type.model(backbone = backbone(pretrained=True), num_classes=len(class_map), img_size = image_size)
The YOLOv5s reached a best mAP of 67.5 after 100 epochs of training at lr 2e-03. This model with a small backbone almost approached the metric of a 2-stage detector, at a quarter of the processing time. The detection for the validation set is fair-good.


4. YOLOX

YOLOX bases objectness on a single center point per box instead of using anchor boxes. This relieves some of the non-maximum suppression post-processing that anchor-reliant models need. It adds mosaic and mix-up augmentation for robustness, and it separates the localization and classification branches in the network's head.
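For readers unfamiliar with the suppression step mentioned above, here is a minimal greedy sketch in plain Python. The boxes and scores are invented, the 0.5 overlap threshold is an assumption, and production libraries use optimized implementations rather than this loop.

```python
def iou(a, b):
    """Intersection over Union of two [x1, y1, x2, y2] boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score, keep a box only if it
    does not overlap an already kept box by more than the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep

# Two near-duplicate detections of one instrument, plus a separate one.
boxes = [[10, 10, 110, 110], [12, 12, 112, 112], [200, 200, 260, 260]]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))
```

The second box overlaps the first almost entirely and has a lower score, so only the indices of the first and third boxes are kept. The more candidate boxes a detector emits per object, the more work this step has to do.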
model_type = models.mmdet.yolox
backbone = model_type.backbones.yolox_tiny_8x8
model = model_type.model(backbone = backbone(pretrained=True), num_classes=len(class_map))
YOLOX_tiny reached a best mAP of 66.2 after training for 100 epochs at lr 1e-03. It had a rapid increase in mAP during the first epochs, then started to approach the performance of YOLOv5s. It was a fraction slower than YOLOv5. Notably, the losses for the validation set are high. By visual checking, the quality of detection for the validation set is fair-good.

5. VFNet

VarifocalNet (VFNet) attempts to address the discrepancy between localization and classification that can arise during post-processing. This discrepancy is notable in the non-maximum suppression step, where well-localized boxes may be eliminated because of a poor classification score.
Its IoU-aware classification score is trained with the Varifocal Loss, which down-weights negative examples (background) and focuses on the positive, high-quality examples.
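The asymmetry of the Varifocal Loss can be sketched numerically. For a positive example the target q is the IoU between the predicted box and the ground truth, and the loss is a binary cross-entropy weighted by q. For a negative example (q = 0) the loss is scaled by alpha * p^gamma. The values alpha = 0.75 and gamma = 2 are the defaults I recall from the VarifocalNet paper and should be treated as assumptions. The example scores are invented, and this is not the MMDetection implementation.

```python
import math

def varifocal_loss(p, q, alpha=0.75, gamma=2.0):
    """Varifocal loss for one prediction.

    p: predicted IoU-aware classification score, strictly between 0 and 1.
    q: target score, i.e. IoU with the ground truth for a positive example,
       or 0 for a background example.
    """
    if q > 0:
        # Positives: cross-entropy against q, weighted by the quality q itself.
        return -q * (q * math.log(p) + (1.0 - q) * math.log(1.0 - p))
    # Negatives: scaled by p ** gamma, so easy backgrounds contribute little.
    return -alpha * p ** gamma * math.log(1.0 - p)

high_quality = varifocal_loss(p=0.5, q=0.9)     # well-localized positive
low_quality = varifocal_loss(p=0.5, q=0.3)      # poorly localized positive
easy_negative = varifocal_loss(p=0.05, q=0.0)   # background scored low
print(round(high_quality, 4), round(low_quality, 4), round(easy_negative, 6))
```

At the same predicted score, the well-localized positive carries a larger loss than the poorly localized one, while the easy background is almost ignored. This is how training attention shifts toward high-quality positives.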
It represents each candidate box with features sampled at a fixed 9-point star-shaped pattern, and uses them to refine the box offsets toward the ground truth.
Like RetinaNet, it uses the ResNet-FPN combination for its backbone.
model_type = models.mmdet.vfnet
backbone = model_type.backbones.resnet50_fpn_1x
model = model_type.model(backbone=backbone(pretrained=True), num_classes=len(class_map))
VFNet_resnet50_fpn_1x reached a best mAP of 81.9 after 100 epochs of training at lr 2e-03. It performed better than the 2-stage detector Faster R-CNN. With a backbone similar to that of the RetinaNet we used, VFNet had better metrics. However, it took longer to train and had a higher validation set loss. A quick visual check shows that, despite the better metric, the prediction outcomes still need improvement.

6. Choosing a model

Each model has its particularities. Choosing a model should be guided by the project objectives. The surgical instruments project could accept a slightly longer training and inference time if that results in reliable detections. Among the modern models, VFNet provided a good mAP and validation loss. We will choose VFNet and explore whether another backbone can further improve the output.

C. Model Backbones

VFNet has various types of resnet_fpn backbones.

1. Resnet50_fpn_1x

This backbone was used in the example above. The '1x' suffix refers to the training schedule of the pretrained weights: one standard cycle of training on the COCO dataset, which in MMDetection's convention is 12 epochs.

2. Resnet50_fpn_mstrain_2x

This pretrained subtype was trained with multi-scale images ('mstrain') on a 2x schedule, that is, 2 cycles of training, so we expect more refined weights.
model_type = models.mmdet.vfnet
backbone = model_type.backbones.resnet50_fpn_mstrain_2x
model = model_type.model(backbone=backbone(pretrained=True), num_classes=len(class_map))
The VFNet_resnet50_fpn_mstrain_2x variant reached a best mAP of 82.8 after 100 epochs of training at lr 5e-04. The mAP and loss metrics were slightly better and the training run was slightly faster than the 1x variant. Visual checking of predictions for the validation set showed good results.
For prototyping purposes, these would be considered as acceptable baseline results.

D. Saving the model

from icevision.models import *
root_dir = Path('/content/gdrive/My Drive/')
checkpoint_path = root_dir/'Surgical_instruments/Models/VFNet_nov13_mAP81.5.pth' # 100 epochs
save_icevision_checkpoint(model,
    model_name='mmdet.vfnet',
    backbone_name='resnet50_fpn_mstrain_2x',
    classes = train_parser.class_map.get_classes(),
    img_size=image_size,
    filename=str(checkpoint_path),
    meta={'icevision_version': '0.9.1'})

IV. Inference

A. Local run

We will load the test set, perform the preparatory steps including the limited transforms.
path = Path('gdrive/MyDrive/Surgical_instruments/Second_set.v1i.coco')
test_parser = parsers.COCOBBoxParser(
    annotations_filepath = path/'test/_annotations.coco.json',
    img_dir = path/'test')
whole = SingleSplitSplitter()
test_records, *_ = test_parser.parse(data_splitter = whole)
#show_records(test_records[0:3],ncols=3, font_size=30, label_color = '#ffff00')
valid_tfms = tfms.A.Adapter([*tfms.A.resize_and_pad(image_size), tfms.A.Normalize()])
infer_ds = Dataset(test_records, valid_tfms)
We will then get the predictions based on the model that we have chosen.
model_type = models.mmdet.vfnet
backbone = model_type.backbones.resnet50_fpn_mstrain_2x
infer_dl = model_type.infer_dl(infer_ds, batch_size=4, shuffle=False)
preds_saved_test = model_type.predict_from_dl(model, infer_dl, keep_images=True, detection_threshold = 0.5)
show_preds(preds=preds_saved_test[0:], font_size=25, label_color = '#3050ff')
The model performed very well on the held-out test set!

B. Gradio deployment

!echo "- Installing gradio"
!pip install gradio -U -q

import icedata
import PIL, requests
import torch
from torchvision import transforms
import gradio as gr

def show_preds_gradio(input_image, display_label, display_bbox, detection_threshold):
    # A slider value of 0 falls back to the default threshold.
    if detection_threshold==0: detection_threshold=0.5
    img = PIL.Image.fromarray(input_image, 'RGB')
    pred_dict = model_type.end2end_detect(img, valid_tfms, model, class_map=class_map,
        detection_threshold=detection_threshold, display_label=display_label,
        display_bbox=display_bbox, return_img=True, font_size=16, label_color="#FF59D6")
    return pred_dict['img']

# Interface controls
display_chkbox_label = gr.inputs.Checkbox(label="Label", default=True)
display_chkbox_box = gr.inputs.Checkbox(label="Box", default=True)
detection_threshold_slider = gr.inputs.Slider(minimum=0, maximum=1, step=0.1, default=0.5, label="Detection Threshold")
outputs = gr.outputs.Image(type="pil")

gr_interface = gr.Interface(fn=show_preds_gradio,
    inputs=["image", display_chkbox_label, display_chkbox_box, detection_threshold_slider],
    outputs=outputs,
    title='Surgical Instruments Detection and Identification Tool')
gr_interface.launch(inline=False, share=True, debug=True)

V. Conclusion

Using a pre-annotated dataset containing ~2K images, we were able to test different models and backbones. A VFNet model with a ResNet50 backbone and an FPN neck was then fine-tuned, generating good detection of sparsely distributed surgical instruments with a mean Average Precision of 82.8.
I hope you had fun!
Github: The notebook for this tutorial will be in the folder Deep_Learning_tutorials :)