
Model Evaluation and Error Analysis

A final evaluation and error analysis of our user-complaint text classifier built with a 1D CNN.


Validation of the registered model after tuning

It is good practice to check that the model uploaded and registered in our model repository, in this case W&B, is the right one and works as expected. First, we download the registered model, evaluate it on the validation dataset, and then check that the performance metrics match the ones we obtained in the fine-tuning stage.
Using the wandb API, this process is very simple: we just need to open the best run in the Sweep job and take the model with the highest val_accuracy, which we registered as the best one. In the notebook, we download the artifact and repeat the evaluation process we ran previously.
Now we can compare both evaluations. The next table shows the classification report for the registered model, and we can confirm it matches the one we obtained in the Sweep job. Our registered model is the right one and is working fine!
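The steps look roughly like this (a minimal sketch, assuming a model artifact named "complaints-classifier:latest", an illustrative project name, and pre-tokenized validation tensors X_val / y_val; the real names live in the notebook):

```python
import numpy as np
import wandb
from tensorflow import keras
from sklearn.metrics import classification_report

run = wandb.init(project="complaint-classifier", job_type="registered_model_validation")

# Download the registered model artifact from W&B
artifact = run.use_artifact("complaints-classifier:latest", type="model")
model_dir = artifact.download()

# Load the model (assuming the artifact contains a Keras SavedModel directory)
model = keras.models.load_model(model_dir)

# Repeat the evaluation we ran during the sweep
y_pred = np.argmax(model.predict(X_val), axis=1)
print(classification_report(y_val, y_pred))

run.finish()
```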

[Classification report for the registered model: precision, recall, f1-score, and support for each category]

Analyze some training considerations



Balanced data partitioning

In our first report, we performed an EDA of our dataset and observed that it was imbalanced. To avoid the negative impact of that imbalance, we balanced the dataset by reducing the number of examples in three categories. In the end, every category contains about 10,000-15,000 samples, and this dataset was split into training, validation, and test sets. All three splits share the same class distribution, with a very similar number of samples per category.
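A minimal sketch of such a stratified split, assuming the balanced data is in a pandas DataFrame df with illustrative "text" and "category" columns:

```python
from sklearn.model_selection import train_test_split

# 80% train, 10% validation, 10% test, each preserving the class proportions
train_df, temp_df = train_test_split(
    df, test_size=0.2, stratify=df["category"], random_state=42
)
val_df, test_df = train_test_split(
    temp_df, test_size=0.5, stratify=temp_df["category"], random_state=42
)

# Verify that every split keeps roughly the same class distribution
for name, split in [("train", train_df), ("val", val_df), ("test", test_df)]:
    print(name, split["category"].value_counts(normalize=True).round(3).to_dict())
```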


Select the evaluation metric

This step is crucial in a classification problem if you want a model you can rely on. Since the dataset is balanced, we selected accuracy as our metric: it is very simple to compute and easily understood by stakeholders. You do not need to explain the maths behind it; non-technical audiences know exactly what it means.
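As a quick illustration (reusing the hypothetical y_val and y_pred arrays from the validation check above):

```python
from sklearn.metrics import accuracy_score

# Accuracy is simply the fraction of correct predictions
val_accuracy = accuracy_score(y_val, y_pred)
print(f"Validation accuracy: {val_accuracy:.4f}")
```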

Error analysis on the validation dataset

When you want to deepen your understanding of how your model performs, it is worth spending some time reviewing the prediction failures in order to identify weaknesses.
In our scenario, we are interested in the distribution of the failures:


[Figure: count of errors by real product category and by predicted category]
The first bar plot shows the count of errors by real product category. Errors for category "2" (about 450) are higher than for the others, but categories "0" and "1" are also hard for our model to predict. The second bar plot, the count of errors by predicted category, confirms that the first three classes contain most of the errors.
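A sketch of how these error counts can be computed, assuming the y_val and y_pred arrays from the evaluation above:

```python
import pandas as pd

errors = pd.DataFrame({"real": y_val, "predicted": y_pred})
errors = errors[errors["real"] != errors["predicted"]]

# Count of errors by real product category and by predicted category
print(errors["real"].value_counts().sort_index())
print(errors["predicted"].value_counts().sort_index())
```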
Let's check the errors for category "2". We can compare the predicted probabilities for every class on these errors and maybe identify some relevant behavior.

[Figure: predicted class probabilities for the category "2" errors]
It seems that many products of category "2" are predicted as category "1": almost half of them. Let's try to find a reason for this confusion. A simple explanation might be that both categories of complaints include similar words.
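This comparison can be sketched as follows, assuming y_val, y_pred, and the softmax outputs probs from the evaluation step (all names are illustrative):

```python
import numpy as np

# Misclassified samples whose real category is "2"
mask = (np.asarray(y_val) == 2) & (np.asarray(y_pred) != 2)
probs_cat2_errors = probs[mask]

# Mean predicted probability per class for these errors
print(np.round(probs_cat2_errors.mean(axis=0), 3))

# How often each class was chosen instead of "2"
print(np.bincount(np.asarray(y_pred)[mask], minlength=probs.shape[1]))
```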
In the next figures, we count the most common words in both categories:

[Figures: most common words in categories "1" and "2"]
As we suspected, most of the common words are shared by both categories (4 of 5). This can confuse the predictor; we should try to get more samples from these categories, containing more distinctive words, to fine-tune our model.
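A sketch of the word-frequency check, assuming the validation samples live in a DataFrame val_df with "text" and "category" columns and using a naive whitespace tokenization (the real pipeline would reuse the training tokenizer and stop-word list):

```python
from collections import Counter

def top_words(texts, n=5):
    # Naive tokenization: lowercase and split on whitespace
    counter = Counter()
    for text in texts:
        counter.update(str(text).lower().split())
    return counter.most_common(n)

print(top_words(val_df[val_df["category"] == 1]["text"]))
print(top_words(val_df[val_df["category"] == 2]["text"]))
```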


Evaluate the model on the test dataset

Finally, once we have a winning model, we should measure its performance on an unseen dataset, which gives us a better intuition of the results to expect in production.
We execute a new run marked with the job type "test_evaluation", which will include the figures, metrics, and evaluation table for the test dataset.
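A minimal sketch of that run, assuming the registered model is loaded as in the first snippet and the test set is tokenized into X_test / y_test:

```python
import numpy as np
import wandb
from sklearn.metrics import accuracy_score, classification_report

run = wandb.init(project="complaint-classifier", job_type="test_evaluation")

# Predict on the held-out test set and log the headline metric
y_pred_test = np.argmax(model.predict(X_test), axis=1)
run.log({"test_accuracy": accuracy_score(y_test, y_pred_test)})

print(classification_report(y_test, y_pred_test))
```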


[Classification report for the test dataset: precision, recall, f1-score, and support for each category]
The accuracy on the test dataset is 0.8418, slightly lower than the validation accuracy (0.853).
Let's print the confusion matrix for the test data:


[Figure: confusion matrix for the test dataset]
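For reference, a sketch of how this matrix can be generated and logged, continuing the same "test_evaluation" run and reusing y_test and y_pred_test from the previous snippet:

```python
import matplotlib.pyplot as plt
import wandb
from sklearn.metrics import ConfusionMatrixDisplay

fig, ax = plt.subplots(figsize=(8, 8))
ConfusionMatrixDisplay.from_predictions(y_test, y_pred_test, ax=ax)
run.log({"test_confusion_matrix": wandb.Image(fig)})
run.finish()
```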
We can confirm that the model performs as expected, in line with the results of the evaluation stage.