
Exploring Bounding Boxes for Object Detection With Weights & Biases

In this article, we take a look at how to log and explore bounding boxes with Weights & Biases
If you're training models for object detection, you can interactively visualize bounding boxes in Weights & Biases. This short demo focuses on driving scenes, testing a YoloV3 net pretrained on MSCOCO on images from the Berkeley Deep Drive 100K dataset. The API for logging bounding boxes is flexible and intuitive. Below, I explain the interaction controls for this tool and a few ways you might use it to analyze your models.
This approach can help with object detection on many other kinds of images, from microscope slides to X-rays to satellite imagery and beyond. You can read more about understanding driving scenes in this report and more about Lyft's self-driving car dataset in this report.

High-Level View: Many Examples On Validation Data


[Media panel: BDD100K Validation Images 30-55]


Logging a few validation images per run gives you an overall sense of the model's performance. This model does pretty well, especially considering it was pretrained on MSCOCO and applied to Berkeley Deep Drive without finetuning.
You can use the controls to turn each class label visualization on or off and focus on the most relevant predictions. One issue with the current model is that it often labels larger cars (like vans or SUVs) as both "truck" and "car"—this might be easier to notice if you toggle the red "truck" and blue "car" labels on and off.
Another pattern is that in some images, detection boxes are systematically lower than the ground truth. If you toggle the blue "car" label, you may notice that the bounding box sometimes leaves out the top parts of the car. This may be an issue with tuning anchor boxes on MSCOCO versus BDD. Reducing the stride between anchor boxes, increasing their total number, or learning a set specific to the BDD data and aspect ratio may help.
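To make that last suggestion concrete, one common way to learn anchors for a new dataset is to cluster the ground-truth box shapes with k-means, as popularized by the YOLO papers. The sketch below is not part of this report's code; it assumes a gt_wh array of (width, height) pairs extracted from BDD annotations, in image-fraction units.

from sklearn.cluster import KMeans

def learn_anchors(gt_wh, n_anchors=9):
    # cluster the (width, height) box shapes; the cluster centers become anchor priors
    km = KMeans(n_clusters=n_anchors, random_state=0).fit(gt_wh)
    # sort by area so anchors can be assigned to detection scales from small to large
    return sorted(km.cluster_centers_.tolist(), key=lambda wh: wh[0] * wh[1])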

Zooming In: Different Classes in a Specific Model




[Full-screen media panel: BDD100K Validation Images 30-55]

Setting up a full-screen media panel lets you see the details of individual photos (scroll down in the panel for more). Now it's easier to read the numerical scores, confirm that smaller or more distant objects tend to have lower confidence scores, and notice what kinds of mistakes the model tends to make, like missing the vans in the first example.
On the bright side, when focusing on the third photo, we can see that the overlapping truck/car detection might not be a concern: the truck confidence score is only 48.96, while for the car (which we can read if we turn off the red "truck" label) it's 96.4. Also, in the very last image, we can find a bizarre false positive "train" on the left-hand side.

Controls

If you click on the Settings icon in the top left corner of a media panel, you will see this pop-up for interacting with the images:


  • metric filtering: for each box, you can log numerical metrics, such as an overall confidence score, IOU, accuracy, or some other indicator of box quality. For each metric, you can specify a viewable range as a filter on the boxes: for example, only show boxes with "score" > 90 to see the high-confidence labels, or only show boxes with "score" < 50 to check whether some false negatives are actually low-confidence positives. This can be paired with class/label selection to see how your detection quality compares across different classes (the logging sketch after this list shows how such scores are attached to boxes).
  • class/label selection: if you click on a class label like "car", you can toggle it on and off, independently across the layer types. This helps you focus your analysis on different classes, and the on/off flip is easier to notice visually than if the annotations were permanently saved on top of the image.
  • layer type selection: toggle the eye icon to the left of the layer name to turn the layer on or off. If you log the "ground truth" bounding boxes alongside your model's predictions, this will let you compare selectively across the two.
  • class search: to see a class label that doesn't fit in the menu, you can type it in the search bar to the right of the layer name. For these models, the full label set is: 'car', 'truck', 'person', 'traffic light', 'stop sign', 'bus', 'bicycle', 'motorbike', 'parking meter', 'bench', 'fire hydrant', 'aeroplane' (yes, the fancy spelling), 'boat', and 'train'. These are hand-picked labels from MSCOCO that I thought would be most likely to show up in driving street scenes (if any planes do show up, they are most likely too small).
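To make these controls concrete, here is a minimal logging sketch (not the exact code from this report, which appears in the next section). It attaches two layers, "predictions" and "ground_truth", to one image and logs per-box metrics that the metric filter can act on; the file path, coordinates, labels, and scores are made-up placeholders.

import wandb

wandb.init(project="bounding-box-controls-demo")  # hypothetical project name

class_labels = {0: "car", 1: "truck"}
image = wandb.Image(
    "sample_frame.jpg",  # hypothetical path to a validation image
    boxes={
        # predicted boxes: filterable via the "score" and "IOU" metrics logged below
        "predictions": {
            "box_data": [{
                # default image-fraction domain: floats between 0 and 1
                "position": {"minX": 0.1, "maxX": 0.3, "minY": 0.5, "maxY": 0.8},
                "class_id": 0,
                "box_caption": "car (92.0)",
                "scores": {"score": 92.0, "IOU": 0.71},
            }],
            "class_labels": class_labels,
        },
        # ground-truth boxes: toggled independently with the layer's eye icon
        "ground_truth": {
            "box_data": [{
                "position": {"minX": 0.12, "maxX": 0.32, "minY": 0.48, "maxY": 0.82},
                "class_id": 0,
                "box_caption": "car",
            }],
            "class_labels": class_labels,
        },
    },
)
wandb.log({"driving_scene": image})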

Code

You can find the full API documentation here. It enables flexible logging in many configurations:
  • coordinate domain: set "domain" : "pixel" if your box coordinates and dimensions are integers corresponding to pixel values, versus the image-fraction domain (the default) if your box coordinates and dimensions are floats representing distances as fractions of the image, normalized to 1.0
  • box representation: each box can be expressed as the four corner values (minX, minY, maxX, maxY) or as the center coordinate plus the width and height of the box (middle, width, height); both conventions are sketched right after this list
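Here is a small illustrative sketch of the two conventions; the coordinate values are made up.

# pixel domain: integer coordinates measured in pixels, corners as min/max values
pixel_box = {
    "position": {"minX": 140, "maxX": 350, "minY": 210, "maxY": 380},
    "domain": "pixel",
    "class_id": 0,
}

# default image-fraction domain: floats in [0, 1], center point plus width and height
fractional_box = {
    "position": {"middle": [0.38, 0.46], "width": 0.33, "height": 0.27},
    "class_id": 0,
}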
Here is the logging code I use in this report, given the filename of the input validation image, its width & height, and three lists returned by my pretrained YoloV3 model:
  • v_boxes: predicted bounding boxes for a given input image
  • v_labels: the corresponding predicted labels as strings
  • v_scores: the confidence score for each box (which I could choose to threshold in this code)
import wandb
# load_img comes from the Keras image utilities used by this YoloV3 pipeline
from keras.preprocessing.image import load_img

# this is the order in which my classes will be displayed
display_ids = {"car": 0, "truck": 1, "person": 2, "traffic light": 3, "stop sign": 4,
               "bus": 5, "bicycle": 6, "motorbike": 7, "parking meter": 8, "bench": 9,
               "fire hydrant": 10, "aeroplane": 11, "boat": 12, "train": 13}

# this is a reverse map of the integer class id to the string class label
class_id_to_label = {int(v): k for k, v in display_ids.items()}

def bounding_boxes(filename, v_boxes, v_labels, v_scores, log_width, log_height):
    # load raw input photo
    raw_image = load_img(filename, target_size=(log_height, log_width))
    all_boxes = []
    # plot each bounding box for this image
    for b_i, box in enumerate(v_boxes):
        # get coordinates and labels
        box_data = {
            "position": {
                "minX": box.xmin,
                "maxX": box.xmax,
                "minY": box.ymin,
                "maxY": box.ymax},
            "class_id": display_ids[v_labels[b_i]],
            # optionally caption each box with its class and score
            "box_caption": "%s (%.3f)" % (v_labels[b_i], v_scores[b_i]),
            "domain": "pixel",
            "scores": {"score": v_scores[b_i]}}
        all_boxes.append(box_data)

    # log to wandb: raw image, predictions, and dictionary of class labels for each class id
    box_image = wandb.Image(raw_image,
                            boxes={"predictions": {"box_data": all_boxes,
                                                   "class_labels": class_id_to_label}})
    return box_image
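The returned wandb.Image objects are then passed to wandb.log inside an active run. Below is a hedged usage sketch; the project name, file list, model wrapper, and 416x416 image size are hypothetical placeholders rather than part of the original code.

import wandb

wandb.init(project="yolo-bdd-demo")  # hypothetical project name
box_images = []
for filename in val_filenames:  # assumed list of validation image paths
    # run_yolo is an assumed wrapper around the pretrained YoloV3 model that
    # returns the three lists described above
    v_boxes, v_labels, v_scores = run_yolo(filename)
    box_images.append(bounding_boxes(filename, v_boxes, v_labels, v_scores, 416, 416))
# logging a list of wandb.Image objects under one key groups them into a single media panel
wandb.log({"bounding_box_predictions": box_images})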
