
BDD 1K Semantic Segmentation Final Report

In this report, we discuss the data split method chosen and the metric used for evaluation. Then, we evaluate the best model on the test set and link it to the model registry with a staging tag.

Data Split Validation

When splitting data that contains temporal dependencies, we need to make sure that all images from the same run/instance end up in the same split in order to avoid data leakage. In our case, this means that all images from the same drive should be in the same split. We recover this grouping by splitting each image file name on '-': the first part is a unique identifier for the drive. We then use this identifier while splitting the dataset into train, validation, and test.
To perform the split, we use StratifiedGroupKFold with 10 folds. We stratify on the bicycle column since it is the class with the fewest instances; StratifiedGroupKFold keeps the percentage of bicycle-positive instances approximately the same across folds, which counters the class imbalance problem. A sketch of this procedure is shown below.
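The following Python sketch illustrates how such a split could be produced with scikit-learn's StratifiedGroupKFold. The DataFrame name, the column names ('File_Name', 'bicycle'), and the random seed are assumptions for illustration, not necessarily the exact code used in this project.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedGroupKFold

# Assumed input: a pandas DataFrame `df` with one row per image, a 'File_Name'
# column, and a binary 'bicycle' column (1 if the image contains a bicycle).
df["drive_id"] = df["File_Name"].str.split("-").str[0]  # group = drive identifier

sgkf = StratifiedGroupKFold(n_splits=10, shuffle=True, random_state=0)
fold_of = np.empty(len(df), dtype=int)
for fold, (_, idx) in enumerate(sgkf.split(df, df["bicycle"], groups=df["drive_id"])):
    fold_of[idx] = fold

# Folds 0-7 -> train (80%), fold 8 -> validation (10%), fold 9 -> test (10%).
df["Split"] = np.select([fold_of <= 7, fold_of == 8], ["train", "valid"], default="test")
```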

[Table: per-split statistics for the train, valid, and test splits, including the number of images, the per-class instance counts (background, road, traffic light, traffic sign, person, vehicle, bicycle), and the bicycle-positive ratio (bicycle.sum / bicycle.count).]
From the table above, we can see that the data is split into train, validation, and test at an 80%/10%/10% ratio. We can also see that the ratio of bicycle-positive instances in each split is approximately 0.06%.

Metric Selection

The metric we select for evaluation is mIoU (mean Intersection over Union). Since we have trained the models for only 10 epochs (due to limited computational resources), we do not set any minimum class-wise threshold; we only consider the overall model performance through the mIoU metric.
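For reference, the sketch below shows one generic way to compute mIoU from a confusion matrix (rows = ground truth, columns = predictions); it is an illustration of the metric, not the exact evaluation code used here.

```python
import numpy as np

def mean_iou(cm: np.ndarray) -> float:
    """Mean IoU from a (num_classes x num_classes) confusion matrix."""
    tp = np.diag(cm).astype(float)            # true positives per class
    fp = cm.sum(axis=0) - tp                  # false positives per class
    fn = cm.sum(axis=1) - tp                  # false negatives per class
    iou = tp / np.maximum(tp + fp + fn, 1e-9) # per-class IoU, guarding empty classes
    return float(iou.mean())                  # unweighted mean over classes
```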

Evaluation on Hold-Out Set

The optimized model from run 'jumping-pyramid-129' is linked to the model registry and tagged for staging; a hedged sketch of that step is shown below. We then evaluate this model on the test set and report the results in the panels that follow.
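The sketch uses the W&B public API; the entity, project, artifact, and registry-collection names are placeholders rather than the actual values used in this project.

```python
import wandb

# Placeholder paths; substitute the actual entity/project/artifact names.
api = wandb.Api()
model_artifact = api.artifact("entity/project/jumping-pyramid-129-model:latest")
model_artifact.link("model-registry/BDD-1K-semantic-segmentation", aliases=["staging"])
```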

[Panels: confusion matrices for 'jumping-pyramid-129' on the validation and test sets.]

From the confusion matrices above, we see a similar trend between the validation set and the test set. This means that the two splits are similar in nature, which is what we aimed for with the StratifiedGroupKFold split.
We can further see that the classes with more instances, such as background and road, perform very well, with accuracies of 0.88 and 0.73, respectively. The classes with few instances do not perform well, which indicates the need for longer training.