How Nanit Improves and Develops Models

The following case study outlines a workflow for developing and improving models, from the Proof of Concept phase to the deployment phase.
Krisha Mehta

Introduction

At Nanit, we develop a smart baby monitoring system. The system consists of a camera that helps track and guide the baby's sleep cycle, and breathing motion monitoring (BMM) that tracks the baby's breathing motion.

Overview

These are the steps we used to improve our existing object detector.
1. Evaluated the current baseline:
We started by challenging the existing evaluation metric and performing an up-to-date evaluation of the baseline model.
We previously used an open-source version of Mask R-CNN but have since switched to the TensorFlow Object Detection API, which is more robust and less prone to implementation bugs.
2. Created a PoC (Proof of Concept):
We built a PoC model to assess the potential for improvement over the baseline model, starting with a small amount of data in order to quickly get an initial model and gauge that potential.
The PoC consists of a spatial post-processing phase applied to the output detections of an object detection model.
3. Trained on the full dataset:
Since the PoC showed that there is potential for improvement, we trained on all of the data (including hyper-parameter tuning).
Overall, we trained two models: a lightweight model and a heavyweight model.
4. Compared models:
We compared the lightweight and heavyweight models in terms of run-time and expected performance using Evaluation Metric #2, an internally defined metric.

Dataset

For the initial PoC, we used a small dataset.
Later, we used a dataset one order of magnitude larger than the PoC dataset with a data distribution identical to the production environment.

Evaluate the Baseline

We started by evaluating our baseline with the existing evaluation metric and continued by analyzing several evaluation metrics and choosing the best model selection metric for us.
Choosing the best metric depends on the cost of each type of mistake. For example, if a false positive is more costly than a false negative, we would weight precision more heavily than recall, as in the sketch below.
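For illustration, one common way to encode such a preference is an F-beta score, which weights precision against recall; this is only a sketch of the trade-off, not our internal metric:

```python
# Minimal sketch of weighing precision against recall with an F-beta score.
# Illustrative only; this is not Evaluation Metric #2 used in the project.

def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Compute precision and recall from raw detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score: beta < 1 favors precision, beta > 1 favors recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)

# Example: false positives are more costly, so weigh precision higher (beta < 1).
p, r = precision_recall(tp=80, fp=5, fn=20)
print(f"precision={p:.2f} recall={r:.2f} f0.5={f_beta(p, r, beta=0.5):.2f}")
```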

Model Selection Metric

While evaluating the baseline model, we compared the performance with two evaluation metrics, denoted as Evaluation Metric #1 and Evaluation Metric #2.
The chosen metric for evaluating performance throughout the project (from the current baseline until the chosen model) was Evaluation Metric #2.
For each comparison, we built a Precision-Recall curve based on our chosen evaluation metric and the detection threshold of the object detector.
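A rough sketch of how such a curve can be built by sweeping the detection threshold is shown below; it assumes each detection already carries a confidence score and a true/false-positive label under the chosen metric, which is an assumption for illustration only:

```python
# Sketch of building a Precision-Recall curve by sweeping the detection
# threshold. Assumes each detection carries a confidence score and a flag
# saying whether it matched a ground-truth object under the chosen metric;
# how that matching is done is up to the metric itself.

def pr_curve(detections, num_gt, thresholds):
    """detections: list of (score, is_true_positive); num_gt: total ground-truth objects."""
    curve = []
    for t in thresholds:
        kept = [is_tp for score, is_tp in detections if score >= t]
        tp = sum(kept)
        fp = len(kept) - tp
        precision = tp / (tp + fp) if kept else 1.0
        recall = tp / num_gt if num_gt else 0.0
        curve.append((t, precision, recall))
    return curve

detections = [(0.95, True), (0.90, True), (0.75, False), (0.60, True), (0.40, False)]
for t, p, r in pr_curve(detections, num_gt=4, thresholds=[0.3, 0.5, 0.7, 0.9]):
    print(f"threshold={t:.1f} precision={p:.2f} recall={r:.2f}")
```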

Create a Proof of Concept (PoC)

Our objective differs slightly from the COCO evaluation metrics calculated in TensorFlow's Object Detection API. Despite the differences, there is a correlation between the COCO metric mAP@0.5 and our metrics.
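For reference, mAP@0.5 counts a detection as correct when its intersection-over-union (IoU) with a ground-truth box is at least 0.5. The minimal IoU computation below is illustrative, assuming boxes in [x_min, y_min, x_max, y_max] format, and is not our evaluation code:

```python
# Minimal IoU computation, the building block behind COCO's mAP@0.5:
# a detection counts as a true positive when its IoU with an unmatched
# ground-truth box is at least 0.5. Boxes are assumed to be
# [x_min, y_min, x_max, y_max] in pixels.

def iou(box_a, box_b) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

print(iou([0, 0, 100, 100], [50, 0, 150, 100]))  # ~0.33
```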

Results Analysis

Post-Processing Phase
We applied a post-processing phase to the output detections of the detector. The graph below shows the Precision-Recall curves calculated for different values of the post-processing tolerance parameter, illustrating the effect of that parameter.
From the graph, we can see that Tolerance #1 is better at all working points of the Precision-Recall curve.
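The details of our post-processing are not covered here; purely as a hypothetical illustration, a spatial filter with a tolerance parameter could look like the following sketch (the region-of-interest rule is an assumption, not our actual logic):

```python
# Hypothetical illustration of a spatial post-processing step with a
# tolerance parameter. The project's actual post-processing is not described
# in detail here; this sketch simply drops detections whose box center falls
# farther than `tolerance` pixels from a region-of-interest center.

def post_process(detections, roi_center, tolerance: float):
    """detections: list of dicts with a 'box' key ([x_min, y_min, x_max, y_max])."""
    kept = []
    for det in detections:
        x1, y1, x2, y2 = det["box"]
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        dist = ((cx - roi_center[0]) ** 2 + (cy - roi_center[1]) ** 2) ** 0.5
        if dist <= tolerance:
            kept.append(det)
    return kept

detections = [{"box": [90, 90, 110, 110]}, {"box": [300, 300, 340, 340]}]
print(len(post_process(detections, roi_center=(100, 100), tolerance=50)))   # 1
print(len(post_process(detections, roi_center=(100, 100), tolerance=400)))  # 2
```

Sweeping the tolerance value and recomputing the Precision-Recall curve for each setting produces curves like the ones compared above.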
PoC vs. Baseline
The final step was to compare the performance (Precision-Recall curves) between the baseline model and the PoC model.
From the graph, we can see that the PoC outperformed the current baseline at all working points, meaning that for every chosen working point in the baseline, we can find a better working point in the PoC.
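One way to verify such a claim programmatically is to check that, for every baseline working point, the PoC curve offers a point with at least the same recall and at least the same precision; the sketch below assumes each curve is given as a list of (recall, precision) points:

```python
# Rough check that one Precision-Recall curve dominates another: for every
# baseline working point there must be a candidate point with at least the
# same recall and at least the same precision. Curves are lists of
# (recall, precision) points.

def dominates(candidate, baseline) -> bool:
    for base_recall, base_precision in baseline:
        best = max(
            (p for r, p in candidate if r >= base_recall),
            default=float("-inf"),
        )
        if best < base_precision:
            return False
    return True

baseline = [(0.6, 0.90), (0.7, 0.85), (0.8, 0.70)]
poc = [(0.6, 0.95), (0.7, 0.92), (0.8, 0.80), (0.85, 0.75)]
print(dominates(poc, baseline))  # True
```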

Train on the Full Dataset

In the graphs below we display all of the training experiments we've done for the full dataset and the PoC.
Different experiments under the same model have different hyper-parameter configurations.
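The specific hyper-parameters we tuned are not listed here; as an illustration only, such a sweep can be organized as one named experiment per hyper-parameter combination (the values below are hypothetical):

```python
# Illustrative only: the actual hyper-parameters tuned in the project are
# not listed here. Each combination becomes one named training experiment.

from itertools import product

MODELS = ["lightweight", "heavyweight"]
LEARNING_RATES = [1e-3, 3e-4]   # hypothetical values
BATCH_SIZES = [8, 16]           # hypothetical values

experiments = [
    {"name": f"{m}_lr{lr}_bs{bs}", "model": m, "learning_rate": lr, "batch_size": bs}
    for m, lr, bs in product(MODELS, LEARNING_RATES, BATCH_SIZES)
]
for exp in experiments:
    print(exp["name"])  # each entry corresponds to one training run
```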
Training Conclusions
We can see from the graphs that the heavyweight experiments showed the best performance in terms of both training losses and the COCO mAP@IoU=0.5 evaluation metric.
The lightweight experiments showed slightly lower performance than the heavyweight experiments (92.8% vs. 94.7%).

Compare Models

Performance Evaluation
From the graphs below we can see that over the validation set, the heavyweight model outperformed the lightweight model at all working points.
Run-time Performance
Both models (heavyweight and lightweight) were exported and measured in a "production-like" environment with the following configurations:
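The exact export and hardware configurations are not reproduced here. Purely as an illustration of how such run-time measurements can be taken, the sketch below times repeated inferences after a warm-up phase; `run_inference`, the input shape, and the iteration counts are placeholders rather than our actual benchmark code:

```python
# Illustrative latency benchmark, not the project's actual measurement code.
# `run_inference` stands in for a call into the exported model; the input
# shape and iteration counts are placeholders.

import time
import numpy as np

def benchmark(run_inference, input_batch, warmup: int = 10, iterations: int = 100):
    """Return mean and p95 latency in milliseconds over `iterations` runs."""
    for _ in range(warmup):  # warm-up runs exclude one-time setup costs
        run_inference(input_batch)
    timings = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_inference(input_batch)
        timings.append((time.perf_counter() - start) * 1000.0)
    timings.sort()
    mean_ms = sum(timings) / len(timings)
    p95_ms = timings[int(0.95 * len(timings)) - 1]
    return mean_ms, p95_ms

# Dummy stand-in for an exported detector, with a fake input frame.
fake_frame = np.zeros((1, 640, 480, 3), dtype=np.uint8)
mean_ms, p95_ms = benchmark(lambda batch: batch.sum(), fake_frame)
print(f"mean={mean_ms:.2f} ms  p95={p95_ms:.2f} ms")
```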
Conclusions:

Project Summary

The project's purpose was to improve the current baseline in terms of performance. We started with the existing baseline model and its existing performance. Then we modified the evaluation metric to better reflect what we want to measure.
We proposed a PoC that showed great improvement potential. Once the PoC demonstrated that potential, we ran a full training process with several models and selected the one that suited us best.