Comparison of different metrics
This report compares the performance of the initial pipeline and the new pipeline for the image detection approach. The data comes from the 5K-Compliance problem of the Zalo AI Challenge 2021, where we need to predict whether people are wearing masks and standing far enough from each other.
Parts of the history records were retrieved on different dates rather than in a single day, so they are not fully reliable; still, they give a useful perspective on how these approaches differ from each other.
Accuracy
For the first approach, I initially intended to use K-Fold validation to obtain more reliable experiment results. Why bother using it?
A model improves its performance based on the loss computed on its data. In many scenarios the sample distribution varies, which means there is always some difference between the training data and real-world data. For image detection, or even in general, if we train the model long enough it will adapt itself to the dataset, and the longer the run, the larger the impact of that difference from real-world patterns becomes.
This may lead to overfitting.
Therefore, if we split the data into training and testing sets, we ensure that after training there is always a held-out test set, representative of real-world data and not yet seen by the model, on which to evaluate its performance. If the model is working poorly, we can notice it during this evaluation step. With that in mind, I decided to use Stratified Cross Validation, an extension of K-Fold validation which ensures an equal distribution of the target classes (binary in this case) across the splits.
HOWEVER, training on a large dataset like 5K-Compliance requires too many GPU resources. I also tried switching to TPU, but it does not support some parts of my implementation. The GPUs provided by Google Colab, or even Kaggle (about 6 GB more RAM), could not handle the computation, and the runs always ran out of memory (even 2 folds did not fit). As a result, I decided not to use the K-Fold method, even though it is actually very efficient and works well: the first fold that I luckily captured had better overall accuracy and loss. So, if anyone intends to work on image detection problems, I would highly recommend this method for better performance and validation.
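To make that recommendation concrete, here is a minimal sketch of how Stratified K-Fold training could be wired up with scikit-learn and Keras. The tiny model and the random placeholder arrays are stand-ins of my own, not the actual 5K-Compliance pipeline or data.

```python
import numpy as np
import tensorflow as tf
from sklearn.model_selection import StratifiedKFold

def build_model():
    # Tiny stand-in classifier; the real pipeline uses a much larger network.
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(64, 64, 3)),
        tf.keras.layers.Conv2D(8, 3, activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])

# Placeholder data standing in for the 5K-Compliance images and mask labels.
images = np.random.rand(100, 64, 64, 3).astype("float32")
mask_labels = np.random.randint(0, 2, size=100)

# Stratified splits keep the class balance of the binary target in every fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = []
for fold, (train_idx, val_idx) in enumerate(skf.split(images, mask_labels)):
    model = build_model()
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(images[train_idx], mask_labels[train_idx],
              validation_data=(images[val_idx], mask_labels[val_idx]),
              epochs=3, batch_size=32, verbose=0)
    _, acc = model.evaluate(images[val_idx], mask_labels[val_idx], verbose=0)
    fold_scores.append(acc)
    print(f"Fold {fold + 1}: val_accuracy = {acc:.3f}")

print("Mean val_accuracy over folds:", float(np.mean(fold_scores)))
```

Each fold trains a fresh model on its training split and scores it on the held-out split, so the averaged score always reflects data the model has never seen.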
Back to the results: when you zoom into the last few epochs of the graph, the first approach (1stA) has the best training accuracy on the mask task, around 0.9, followed by the first approach with 2 folds (1stA-2F), and lastly the second approach (2ndA). Nevertheless, as mentioned about the benefit of K-Fold validation, the validation accuracy of 1stA-2F is the highest at 0.842, followed by 0.825 for 1stA and 0.81 for 2ndA. So while 1stA has the highest training accuracy, it shows the largest gap between training and validation accuracy among the three runs, which I read as a sign of overfitting.
What about the distancing task?
[Run set: 17 runs]
Similar to the mask task, 1stA has the highest training accuracy yet the lowest validation accuracy. 2ndA has the smallest training accuracy of the runs, but the gap between its training and validation accuracy is extremely small, which suggests the model was performing well and was able to avoid overfitting.
1stA-2F does not appear in the distance detection task. As mentioned, it is unfortunate that K-Fold validation does not fit within the constraints of the environment I used, so I stopped using it after training the mask task model (with the first approach).
Conclusion: The training accuracy of 1stA is very high, but it faces overfitting, which shows up as dramatically lower accuracy at the validation/testing step. 2ndA has slightly lower accuracy than 1stA, but its performance on the validation/test data is consistent and could be improved further with some adjustments (epochs, batch_size, optimizer, regularization), of the kind sketched below.
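As a rough illustration of those knobs (and not the actual 2ndA architecture), the hypothetical snippet below shows where each adjustment would typically live in a Keras setup: regularization inside the layers, the optimizer in compile(), and epochs/batch_size in fit().

```python
import tensorflow as tf

# Small hypothetical classifier used only to point at the tunable pieces.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64, 64, 3)),
    tf.keras.layers.Conv2D(16, 3, activation="relu",
                           kernel_regularizer=tf.keras.regularizers.l2(1e-4)),  # weight regularization
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.3),                                               # dropout regularization
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # optimizer choice / learning rate
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

# epochs and batch_size are tuned in fit(); early stopping guards against overfitting.
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=30, batch_size=64,
#           callbacks=[tf.keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True)])
```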
Loss
In general, the training losses of the mask and distancing tasks follow a decreasing trend, except for 1stA on the mask task, where the loss increases slightly (and would probably increase further with more epochs).
Looking at 1stA (blue), the mask task has a lower loss than the distance task, which is consistent with the hypothesis from my notebook (the distance labels have nearly twice as many missing values as the mask labels, so the distance task may perform worse than the mask task).
For 2ndA (red), we see the same pattern as 1stA: the mask task has a lower loss than the distance task. 1stA-2F has loss values approximately equal to those of 2ndA.
[Run set: 17 runs]
Now, let's compare the loss and validation loss in the more complex graph above. Here we skip 1stA-2F, since it was no longer used once we moved on to the distance task.
Mask task: Looking at the dot-dashed and dotted lines (loss and val_loss for the mask task), 1stA has a validation loss higher than its training loss. Meanwhile, 2ndA has a validation loss significantly lower than its training loss; in the last epoch the training loss shows a sudden increase and sits slightly above the validation loss.
Distance task: Looking at the solid and dashed lines (loss and val_loss for the distance task), 1stA (blue) also has a validation loss higher than its training loss. Meanwhile, 2ndA has a validation loss lower than its training loss in almost all epochs, except for the last epochs where the two values intersect and are roughly equal.
Conclusion: On the loss metric, 1stA dominates the later approach during training, but it faces overfitting, which is visible at the validation/testing step. 2ndA has a slightly higher loss than the former approach yet still performs very well on unseen data, avoiding overfitting and proving its effectiveness despite the more complex pipeline.
F1 Score
Besides the common metrics, I also used the F1 score to evaluate my models, since the problem involves binary classification.
F1 Score: the harmonic mean of the precision and recall values.
Precision: the share of the predicted positive cases which are correct.
Recall: the share of the actual positive cases which we predict correctly.
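To make the definitions concrete, here is a toy computation of precision, recall, and F1 on made-up binary labels (not values from the actual runs):

```python
import numpy as np

# Hypothetical ground-truth labels and model predictions for illustration only.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))   # predicted positive and actually positive
fp = np.sum((y_pred == 1) & (y_true == 0))   # predicted positive but actually negative
fn = np.sum((y_pred == 0) & (y_true == 1))   # missed positives

precision = tp / (tp + fp)                   # share of predicted positives that are correct
recall = tp / (tp + fn)                      # share of actual positives that are recovered
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}")
```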
[Run set: 17 runs]
The closer the F1 score is to 1, the better the model is at predicting the right label (binary classification). Here the F1 score for the mask task is better than for the distance task. Comparing the two approaches, the first approach has higher training F1 scores for both tasks, which is consistent with the accuracy and loss metrics examined above.
[Run set: 17 runs]
Mask task: Looking at the dot-dashed and dotted lines (f1 and val_f1 for the mask task), 1stA has a validation F1 score lower than its training F1 score. Meanwhile, 2ndA has validation F1 values significantly higher than its training values; in the last epoch the training score drops suddenly, leaving the validation figure still higher.
Distance task: Looking at the solid and dashed lines (f1 and val_f1 for the distance task), 1stA (blue) likewise has validation F1 scores lower than its training ones. Meanwhile, 2ndA has its validation F1 higher than its training F1 in almost all epochs, though it occasionally fluctuates to the point of dipping below the training values.
Conclusion: On the F1 metric, 1stA dominates the later approach during training; however, as with the previous metrics, it may face overfitting, which is visible at the validation/testing step. 2ndA has slightly lower F1 scores than the former approach yet still performs very well on unseen data.
5K Model
In this section, since only the second approach has a third model in its pipeline, I will draw only one graph to show the metrics after finalizing and running the last part of the 2ndA pipeline.
[Run set: 17 runs]
In the end, the 5K model has a decent accuracy, which demonstrates the effectiveness of the 2ndA pipeline. The val_accuracy of the 5K task is slightly higher than the train accuracy, but the gap between them is small. In the Modeling section of the original report on my work, I described the data augmentation that is part of the model structure and generates new, different examples from the training data (random flip, random zoom, random width, ...). If we use data augmentation to "noisify" the training data, it makes sense that we get better accuracy on the validation set, because the validation set is effectively an easier dataset. Similarly, the validation loss is also lower than the training loss in the 5K model.
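For reference, here is a minimal sketch of that kind of in-model augmentation using Keras preprocessing layers; the exact layers and factors in my pipeline may differ, and the backbone below is only a stand-in.

```python
import tensorflow as tf

# Augmentation block matching the operations named above (factors are illustrative).
data_augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),   # random flip
    tf.keras.layers.RandomZoom(0.1),            # random zoom
    tf.keras.layers.RandomWidth(0.1),           # random width
])

# These layers are active only during training; at validation/inference time they
# pass images through unchanged, which is one reason validation metrics can look
# better than training metrics.
inputs = tf.keras.Input(shape=(224, 224, 3))
x = data_augmentation(inputs)
x = tf.keras.layers.Conv2D(16, 3, activation="relu")(x)   # stand-in for the real backbone
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)
```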
Test (self-made)
If you read my notebooks, you will see that in the initial approach I ran some testing steps for the mask and distance tasks, and in the second approach I did the same for all three models. Unfortunately, I cannot use the test set provided by the organizer, since the challenge has closed its portal and I can no longer submit my work.
For the first approach, I ran the test myself rather than through Keras' built-in model.evaluate method, so some metrics are not included here. But as I said, the results should be treated as references only.
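For illustration, the hypothetical snippet below contrasts a hand-rolled accuracy check built on model.predict with Keras' model.evaluate; the model and test arrays are placeholders rather than my actual test split.

```python
import numpy as np
import tensorflow as tf

# Tiny placeholder model and test data (the real ones come from the pipelines above).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64, 64, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

x_test = np.random.rand(32, 64, 64, 3).astype("float32")
y_test = np.random.randint(0, 2, size=32)

# Manual accuracy: threshold the predicted probabilities at 0.5 and compare.
probs = model.predict(x_test, verbose=0).ravel()
manual_acc = float(np.mean((probs > 0.5).astype(int) == y_test))

# Built-in path: model.evaluate returns the loss plus the compiled metrics.
loss, keras_acc = model.evaluate(x_test, y_test, verbose=0)

print(f"manual accuracy = {manual_acc:.3f}, keras accuracy = {keras_acc:.3f}")
```

Both paths should agree on accuracy; the manual route simply skips the extra metrics that evaluate would report.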

Figure 1. The metrics for testing data
Ignoring the F1 score and loss, it can be seen that the accuracies for the mask task are approximately equal across the two approaches, but for the distance task 2ndA performs better, since the missing values are handled in its pipeline. The accuracy of the 5K model is lower than that of the mask task and higher than that of the distance task, so we can roughly treat it as an average of the accuracies of the previous models.
This is the end of the results section. If you navigated here from the slides, please make sure to read my report as well, since it provides more detailed information and I believe it will help you understand the models used here. Thank you for reading!