NLU training report
Introduction
In this report, we evaluate the performance of an NLU system trained to classify consumer messages into general intents such as FindRestaurants, GetWeather, CheckBalance, LookupMusic, etc. The data has been extracted from the schema_guided_dstc8 dataset, publicly available on HuggingFace Datasets.
Model training performance
Deep learning models consume large amounts of data and, even after a preprocessing pipeline, we cannot be fully certain about data quality and distribution. Because of that, a common strategy to check model consistency is Stratified K-Fold cross-validation: data is partitioned into K equally sized folds, and in each iteration one fold is reserved for validation while the rest are used for training. In our example, we have performed a Stratified 5-Fold strategy (80% train / 20% validation), maintaining the same proportion of phrases per intent in each bucket.
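As a minimal sketch of this setup, the folds can be generated with scikit-learn's StratifiedKFold; the `texts` and `intents` arrays below are hypothetical stand-ins for the phrases and intent labels extracted from the dataset:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical data: in the report this would be the consumer phrases
# and intent labels extracted from schema_guided_dstc8.
texts = np.array([f"utterance {i}" for i in range(100)])
intents = np.array(["FindRestaurants", "GetWeather", "CheckBalance", "LookupMusic"] * 25)

# Stratified 5-Fold: each iteration uses 80% for training and 20% for
# validation, keeping the same proportion of phrases per intent.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, val_idx) in enumerate(skf.split(texts, intents)):
    X_train, y_train = texts[train_idx], intents[train_idx]
    X_val, y_val = texts[val_idx], intents[val_idx]
    # The NLU model of each fold would be trained on this split.
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} validation phrases")
```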
[Run set: 6 runs]
Out-of-Fold analysis
The main advantage of this approach is that we can estimate how our model would perform on the whole dataset, since every phrase has been part of the validation set in exactly one of the folds. By gathering all of those validation predictions, we can build the so-called Out-of-Fold (OOF) metrics, which capture the behaviour of our model across all the scenarios present in our data.
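As a rough sketch of how those predictions can be assembled, reusing `texts`, `intents` and `skf` from the previous snippet; the TF-IDF + logistic regression pipeline is only a stand-in for the actual NLU model:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline

# Stand-in classifier; the report's actual NLU model would go here.
def make_model():
    return make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

oof_preds = np.empty(len(texts), dtype=object)
for train_idx, val_idx in skf.split(texts, intents):
    model = make_model().fit(texts[train_idx], intents[train_idx])
    # Each phrase is validated in exactly one fold, so these
    # predictions jointly cover the whole dataset.
    oof_preds[val_idx] = model.predict(texts[val_idx])

# OOF metrics: computed once over all out-of-fold predictions.
print(classification_report(intents, oof_preds))
```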
For a fine-grained view of model errors through the OOF predictions, a confusion matrix is provided as well:
[Run set (6 runs): confusion matrix panel]
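One way such a matrix could be computed and logged, assuming the `oof_preds` from the sketch above (the project name is hypothetical), is via scikit-learn together with W&B's built-in confusion matrix plot:

```python
import wandb
from sklearn.metrics import confusion_matrix

labels = sorted(set(intents))
# Raw counts for a quick textual check.
print(confusion_matrix(intents, oof_preds, labels=labels))

# Interactive confusion matrix panel for the report
# ("nlu-report" is a hypothetical project name).
run = wandb.init(project="nlu-report")
run.log({"oof_confusion_matrix": wandb.plot.confusion_matrix(
    y_true=list(intents), preds=list(oof_preds), class_names=labels)})
run.finish()
```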
Hands-on data
However, this is "just" a numerical outcome of model performance. Sometimes it is convenient to take a look at the data to see whether there is any reason for those errors beyond model inaccuracy. For that reason, we provide an example of each error type contained in the previous confusion matrix; i.e., there is one row in the table per non-zero value off the diagonal of the matrix.
[Run set (6 runs): error examples table]
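A short sketch of how that table could be built, picking the first misclassified phrase for each (true intent, predicted intent) pair and logging it as a `wandb.Table` (project name again hypothetical):

```python
import wandb

# One example phrase per non-zero off-diagonal cell of the confusion
# matrix, i.e. per (true intent, predicted intent) error pair.
rows, seen = [], set()
for text, true, pred in zip(texts, intents, oof_preds):
    if true != pred and (true, pred) not in seen:
        seen.add((true, pred))
        rows.append([text, true, pred])

# Log as a W&B table so each error type can be inspected by hand.
run = wandb.init(project="nlu-report")
run.log({"error_examples": wandb.Table(
    columns=["phrase", "true_intent", "predicted_intent"], data=rows)})
run.finish()
```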