Augmentations During Inference Experiments
Results
The inference augmentation parameters for each experiment can be found here.
The experiment results can be found in the fourth page of this spreadsheet.
1. Introduction
According to the research paper "CNN Training with Twenty Samples for Crack Detection via Data Augmentation", augmenting an image before inference can improve recall. These experiments aim to determine which augmentations result in the best model performance. The augmentations tested are image scaling, horizontal and vertical flips, and colour jitter. Each parameter is altered individually to isolate its effect on model performance.
A total of 22 experiments were conducted.
2. Inference Settings
The model used for inference is the best-performing model from the previous Fod Augmentation Experiments, specifically the model from Experiment 38.
In every experiment, the model runs inference on the Toyota-Twill test set.
The test images are resized to 480x480 and processed with a batch size of 16.
The confidence and IoU thresholds used are:
- conf_thresh = 0.3974
- iou_thresh = 0.5393
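As a rough illustration of what these two thresholds usually control in a detection pipeline, the sketch below drops detections under the confidence threshold and then applies greedy non-maximum suppression with the IoU threshold. This is an assumption about how the thresholds are applied; the source does not describe the post-processing, and the helper names `iou` and `filter_detections` are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_detections(dets, conf_thresh=0.3974, iou_thresh=0.5393):
    """Assumed usage: confidence filter followed by greedy NMS.

    dets is a list of (box, score) pairs, where box is (x1, y1, x2, y2).
    """
    # Keep only detections at or above the confidence threshold,
    # highest score first.
    candidates = sorted(
        (d for d in dets if d[1] >= conf_thresh),
        key=lambda d: d[1],
        reverse=True,
    )
    kept = []
    for box, score in candidates:
        # Suppress a box that overlaps an already-kept box too strongly.
        if all(iou(box, k[0]) <= iou_thresh for k in kept):
            kept.append((box, score))
    return kept
```

With these defaults, a detection scored 0.2 is discarded outright, and a lower-scored box that overlaps a kept box with IoU above 0.5393 is suppressed.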
3. Baseline Performance
The model was first tested on the Toyota-Twill test set, without any inference augmentations, to determine the baseline performance.
The baseline metrics are:
- Precision = 0.961
- Recall = 0.862
- mAP_0.5 = 0.935
- mAP_0.5:0.95 = 0.627
- F1-score = 0.909
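The F1-score follows from the precision and recall above as their harmonic mean. A minimal check, using only the baseline numbers listed here (the function name `f1_score` is illustrative):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


baseline_f1 = f1_score(0.961, 0.862)
print(round(baseline_f1, 3))  # 0.909
```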
4. Experiments
A total of 22 experiments were run to determine the effects of augmentations during inference. The augmentations tested are image scaling, horizontal and vertical flips, and colour jitter.
The specific inference augmentation settings for each experiment can be found here.
The experiment results are summarized in page 4 of this spreadsheet.
4.1 Changing Individual Augment Parameters
Experiments which improved performance
- Experiment 2 (decreasing image scale to 75%) improved F1-score by 1.69%.
  - Improved precision by 0.62%
  - Improved recall by 2.67%
  - Improved mAP_0.5 by 1.18%
  - Improved mAP_0.5:0.95 by 0.48%
- Experiment 18 (decreasing image scale to 87.5%) improved F1-score by 0.87%.
  - Decreased precision by 0.21%
  - Improved recall by 1.86%
  - Improved mAP_0.5 by 0.32%
  - Improved mAP_0.5:0.95 by 0.48%
- Experiment 3 (up/down image flip) improved F1-score by 1.40%.
  - Improved precision by 1.15%
  - Improved recall by 1.62%
  - Improved mAP_0.5 by 1.18%
  - Improved mAP_0.5:0.95 by 0.16%
- Experiment 4 (left/right image flip) improved F1-score by 0.83%.
  - Decreased precision by 3.30%
  - Improved recall by 4.87%
  - Improved mAP_0.5 by 1.28%
  - Decreased mAP_0.5:0.95 by 3.19%
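The percentages above appear to be relative changes against the baseline in Section 3 rather than percentage-point differences. This is an inference from the numbers, not something the source states. The sketch below applies the reported precision and recall changes to the baseline and recomputes the F1 change; under that assumption, the results land within rounding of the reported F1 changes. The helper names are hypothetical.

```python
BASELINE_PRECISION = 0.961
BASELINE_RECALL = 0.862


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def f1_change_pct(precision_change_pct, recall_change_pct):
    """Relative F1 change (%) implied by relative precision/recall changes (%).

    Assumes every percentage is relative to the baseline metric.
    """
    p = BASELINE_PRECISION * (1 + precision_change_pct / 100)
    r = BASELINE_RECALL * (1 + recall_change_pct / 100)
    base = f1_score(BASELINE_PRECISION, BASELINE_RECALL)
    return (f1_score(p, r) - base) / base * 100


# (precision change %, recall change %, reported F1 change %) from the list above
reported = {
    2: (0.62, 2.67, 1.69),
    18: (-0.21, 1.86, 0.87),
    3: (1.15, 1.62, 1.40),
    4: (-3.30, 4.87, 0.83),
}

for exp, (dp, dr, df1) in reported.items():
    print(f"Experiment {exp}: recomputed {f1_change_pct(dp, dr):+.2f}%, reported {df1:+.2f}%")
```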
Experiments which decreased performance
- Increasing the image scale above 100% decreased performance (Experiment 1).
- Decreasing the image scale below 75% of the original size decreased performance (Experiments 16 and 17).
- Any colour jitter applied to the image during inference decreased performance.
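When an image is flipped or rescaled before inference, the predicted boxes come back in the coordinates of the augmented image and have to be mapped back to the original. The following NumPy sketch shows that mapping for the three geometric augmentations tested here. It is only an illustration: the source does not describe how the augmented predictions were produced or merged, and all function names are hypothetical. Boxes are assumed to be `(x1, y1, x2, y2)` in pixels.

```python
import numpy as np


def flip_lr(image):
    """Left/right flip of an (H, W, C) image."""
    return image[:, ::-1]


def flip_ud(image):
    """Up/down flip of an (H, W, C) image."""
    return image[::-1, :]


def unflip_boxes_lr(boxes, width):
    """Map boxes predicted on a left/right-flipped image back to the original."""
    out = boxes.copy()
    out[:, 0] = width - boxes[:, 2]
    out[:, 2] = width - boxes[:, 0]
    return out


def unflip_boxes_ud(boxes, height):
    """Map boxes predicted on an up/down-flipped image back to the original."""
    out = boxes.copy()
    out[:, 1] = height - boxes[:, 3]
    out[:, 3] = height - boxes[:, 1]
    return out


def unscale_boxes(boxes, scale):
    """Map boxes predicted on an image resized by `scale` back to the original."""
    return boxes / scale


# Example: one box predicted on a left/right-flipped 480x480 image.
pred = np.array([[10.0, 20.0, 110.0, 220.0]])
print(unflip_boxes_lr(pred, width=480))
```

Colour jitter changes only pixel values, so it needs no coordinate mapping.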
4.2 Combining the Best Augments
Experiments that improved F1-score were chosen to test their additive effects.
The chosen experiments for further testing are:
- Experiment 2 (decreasing image scale to 75%)
- Experiment 3 (up/down image flip)
- Experiment 4 (left/right image flip)
From these experiments, a total of 4 more experiments were conducted:
- Experiment 19 (Experiments 2 + 3 + 4) decreased F1-score by 0.46%.
  - Decreased precision by 3.85%
  - Improved recall by 2.78%
  - Improved mAP_0.5 by 1.28%
  - Decreased mAP_0.5:0.95 by 1.44%
- Experiment 20 (Experiments 2 + 3) improved F1-score by 0.45%.
  - Decreased precision by 0.83%
  - Improved recall by 1.62%
  - Improved mAP_0.5 by 1.50%
  - Decreased mAP_0.5:0.95 by 0.16%
- Experiment 21 (Experiments 2 + 4) improved F1-score by 0.16%.
  - Decreased precision by 3.75%
  - Improved recall by 3.94%
  - Improved mAP_0.5 by 0.96%
  - Decreased mAP_0.5:0.95 by 1.91%
- Experiment 22 (Experiments 3 + 4) improved F1-score by 0.56%.
  - Decreased precision by 3.85%
  - Improved recall by 4.87%
  - Improved mAP_0.5 by 1.82%
  - Decreased mAP_0.5:0.95 by 1.60%
Although three of the four combined experiments improved F1-score over the baseline, every combination scored a lower F1 gain than each of its component augmentations did on its own.
All of the combined augmentations reduced precision.
However, consistent with the research paper, all of them improved recall over the baseline.
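The comparison between combined and individual augmentations can be checked directly from the F1 changes reported in Sections 4.1 and 4.2. The short script below uses only those reported numbers; the data layout and the function name are illustrative.

```python
# F1-score change (%) of the individual augmentations (Section 4.1).
INDIVIDUAL_F1_CHANGE = {2: 1.69, 3: 1.40, 4: 0.83}

# Combined experiments (Section 4.2): component experiments and F1 change (%).
COMBINED = {
    19: ((2, 3, 4), -0.46),
    20: ((2, 3), 0.45),
    21: ((2, 4), 0.16),
    22: ((3, 4), 0.56),
}


def below_every_component(components, f1_change):
    """True if the combination gained less F1 than each component alone."""
    return all(f1_change < INDIVIDUAL_F1_CHANGE[c] for c in components)


for exp, (components, f1_change) in COMBINED.items():
    weakest = min(INDIVIDUAL_F1_CHANGE[c] for c in components)
    print(
        f"Experiment {exp}: {f1_change:+.2f}% combined, "
        f"weakest component alone {weakest:+.2f}%, "
        f"below every component: {below_every_component(components, f1_change)}"
    )
```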