
Sentiment Analysis on Goodreads Reviews: Part 3

In this article, we look at a stratified group split of the Goodreads dataset and determine how successful our model is at predicting review scores from text.
Created on March 11 | Last edited on May 19
This project is a community submission from a practitioner who took our free MLOps course. It's a great preview of what you can expect to learn in the course and is the third and final installment in a three-part series about this particular project. You'll find the other reports below.
💡
This is the final report in a series of three inspired by the Weights & Biases course "Effective MLOps: Model Development." In this article we'll revisit some of our earlier assumptions about the proper way to analyze the Kaggle Goodreads dataset.
Specifically, we will improve our previous data split by using a stratified group split of the data and retrain our HuggingFace transformer models on this new split dataset. We will then take the model which performs best on the validation set and see how it performs on the test set. We will conclude this report by trying to understand where our model is failing or underperforming on the test data. If you've missed our previous reports, you can find them below.


Data Split and Metrics

As detailed in the previous two reports, we downsampled the original Kaggle Goodreads dataset (see here) such that all ratings appeared an equal number of times. The original dataset consists of 900,000 reviews written by 12,188 users while the downsampled dataset consists of 171,312 reviews written by 11,164 users. Since each rating appears an equal number of times in our downsampled dataset, we can use accuracy as a metric for our model's performance.
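As a minimal sketch of this balancing step (not necessarily the exact code used), each rating class can be sampled down to the size of the rarest class. Here `df` is assumed to be the full Kaggle Goodreads dataframe with a rating column taking values 0-5, and the seed is illustrative:

import pandas as pd

# Sample each rating class down to the size of the rarest class so every
# rating appears equally often in the downsampled dataset.
min_count = df["rating"].value_counts().min()
balanced = (
    df.groupby("rating", group_keys=False)
      .sample(n=min_count, random_state=42)  # illustrative seed
)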
The choice of how to split the data between the train, validation, and test sets can be a difficult problem. Previously we split our data such that the train, validation, and test set each contained different books. Specifically, in the Goodreads dataset each review comes with an associated book_id and we used the GroupShuffleSplit function from scikit-learn so that each split contained unique values of book_id.
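For reference, a hedged sketch of this kind of group-based split with GroupShuffleSplit looks like the following (the test_size and seed are illustrative, and `df` is again the downsampled review dataframe):

from sklearn.model_selection import GroupShuffleSplit

# GroupShuffleSplit keeps all reviews that share a book_id in the same split.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(df, groups=df["book_id"]))
train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]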
Using book_id is an imperfect choice because a single book may be associated with multiple book_ids, e.g. if Goodreads assigns different editions of the same book different book_ids. However, since the Kaggle Goodreads dataset does not contain the actual titles of the books, this is the best we can do.
One potential issue we missed in the earlier reports is that some users on Goodreads are more active than others. The most active reviewer in our (downsampled) dataset wrote a total of 499 reviews, while the least active reviewers wrote a single review. Below we include a table listing the number of reviews written by each reviewer, from most active user to least active:


A more detailed breakdown of the user statistics is shown below:
>>> user_review_counts.describe()

count 11164.000000
mean 15.345038
std 22.940507
min 1.000000
25% 3.000000
50% 8.000000
75% 19.000000
max 499.000000
Name: user_id, dtype: float64
We see that the average user wrote around 15 reviews and that 75% of reviewers wrote 19 reviews or fewer. However, there is a large standard deviation (approximately 22.94) due to the users who wrote over 100 reviews. Although most people wrote a fairly small number of reviews, the most active users can have an outsized effect on the dataset: the top 10% of the most active reviewers produced 43.22% of all reviews. This can be an issue if we want our model to learn the general characteristics of positive and negative reviews rather than the potentially unique characteristics of a particular class of users.
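These statistics can be reproduced with a few lines of pandas; the sketch below assumes the downsampled reviews are in a dataframe `df` with a user_id column:

# value_counts() returns counts sorted from most to least active user.
user_review_counts = df["user_id"].value_counts()
print(user_review_counts.describe())

# Share of all reviews written by the top 10% most active reviewers.
top_10pct = int(0.10 * len(user_review_counts))
share = user_review_counts.iloc[:top_10pct].sum() / user_review_counts.sum()
print(f"Top 10% of reviewers wrote {share:.2%} of all reviews")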
To take this imbalance among the users into account, we will split the data using the StratifiedGroupKFold function from scikit-learn. According to the documentation, this function "attempts to create folds which preserve the percentage of samples for each class as much as possible given the constraint of non-overlapping groups between splits." Although this function is designed for cross-validation, we can apply it twice to produce an approximate 60-20-20 train/valid/test split, where we use book_id to group the data and user_id to assign classes.
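A minimal sketch of this two-pass split is shown below (the seed and exact fold counts are illustrative, not necessarily those used for the results in this report):

from sklearn.model_selection import StratifiedGroupKFold

def stratified_group_split(df, seed=42):
    # Grouping by book_id keeps all reviews of a book in one split, while
    # stratifying on user_id spreads each user's reviews across the splits
    # in roughly fixed proportions.
    # First pass: carve off ~20% of the reviews as the test set.
    outer = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=seed)
    trainval_idx, test_idx = next(
        outer.split(df, y=df["user_id"], groups=df["book_id"])
    )
    trainval, test = df.iloc[trainval_idx], df.iloc[test_idx]

    # Second pass: split the remaining ~80% into ~75/25, i.e. ~60/20 overall.
    inner = StratifiedGroupKFold(n_splits=4, shuffle=True, random_state=seed)
    train_idx, valid_idx = next(
        inner.split(trainval, y=trainval["user_id"], groups=trainval["book_id"])
    )
    return trainval.iloc[train_idx], trainval.iloc[valid_idx], test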
As before, this results in a train/valid/test split where each dataset contains distinct values of book_id. The new feature is that any reviewer who wrote 5 or more reviews now has approximately 60% of their reviews assigned to the training set and 20% assigned to each of the validation and test sets. The 3,802 users who wrote 4 reviews or fewer have their reviews split randomly between the datasets, subject to the requirement that the train/valid/test sets contain distinct values of book_id.
Overall, StratifiedGroupKFold is able to perform this split, subject to the above constraints, quite accurately. The training set consists of 102,787 reviews, the validation set of 34,262 reviews, and the test set of 34,263 reviews, and each split contains distinct values of book_id. At this point, with our improved dataset split, we can again train and evaluate our model.

Training

Following our previous report, we will train a BERT-tiny model on our dataset. We use BERT-tiny because it is relatively cheap to train: a single training run takes approximately 10 minutes in Google Colab Pro. Below we present a parallel coordinates plot showing how the hyperparameters affect the final accuracy on the validation set:



As a reminder: the number of gradient accumulation steps determines how many batches of data we run forward (accumulating their gradients) before performing an optimizer update. In addition, the learning rate is sampled from a log-uniform distribution, and the warmup steps are the number of steps the HuggingFace Trainer uses to increase the learning rate from 0 to our set value.
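For reference, a hypothetical W&B sweep configuration for these three hyperparameters might look like the sketch below; the ranges, values, and project name are illustrative rather than the exact settings behind the plots in this report:

import wandb

sweep_config = {
    "method": "random",
    "metric": {"name": "eval/accuracy", "goal": "maximize"},
    "parameters": {
        "learning_rate": {
            # sampled log-uniformly, as described above
            "distribution": "log_uniform_values",
            "min": 1e-5,
            "max": 1e-3,
        },
        "gradient_accumulation_steps": {"values": [1, 2, 4, 8]},
        "warmup_steps": {"values": [0, 100, 500]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="goodreads-sentiment")
# wandb.agent(sweep_id, function=train_fn, count=20)
# where train_fn is a (hypothetical) function that builds and runs a Trainer.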
Overall the results are not too different from the previous report, except that here our worst-performing model does much worse (16.4% accuracy as opposed to 32% previously). The best-performing models are comparable: the highest accuracy here is 52.97%, while in the previous report it was 52.81%, which is not a significant difference.
Below we summarize the importance and correlation of each hyperparameter with the accuracy on the validation set. One difference in comparison to the previous report is that the learning rate is now anti-correlated with the accuracy on the validation set, while previously we saw a positive correlation. We are not sure why the correlation changes so significantly, but we suspect the discrepancy comes from not having performed enough training runs rather than from the difference in how the data was split.



Finally, below we plot the accuracy and loss of the models on the validation set as a function of the number of gradient-descent steps. For the most part, the best-performing models behave comparably and start to plateau around 1.5-2k training steps.



Evaluation and Analysis

We can now take the model which performs best on the validation set and see how well it performs on the test set. For completeness, below we give the accuracy of this model on the train, validation, and test sets:


We see that the accuracy of our model on the test set is comparable to the accuracy on the validation set; the difference between the two results is approximately 0.19%. This suggests that the validation set and the test set are (approximately) drawn from the same distribution.
However, it is slightly surprising that the accuracy on the test set is higher than the accuracy on the validation set. We would expect the accuracy on the validation set to be higher because, by performing a hyperparameter sweep, we are effectively tuning the hyperparameters to maximize the accuracy on the validation set, while there is no analogous tuning for the test set. We believe the two accuracies are comparable because we performed a relatively small hyperparameter sweep (we tuned only three hyperparameters across 20 runs), so we are likely not overfitting to the validation set.
Next, we can look at how our model performs in more detail by forming a confusion matrix. A confusion matrix is a table where the i-th row corresponds to the true label, the j-th column corresponds to the predicted label, and the number at index (i,j) is the number of times the model predicted rating j for a review with a true rating of i. Below we present three confusion matrices corresponding to the train, validation, and test sets:


Most of the results of these confusion matrices make intuitive sense: we see that the largest entries tend to lie either on the diagonal or adjacent to it. The diagonal elements correspond to correct predictions, so we see that our model is performing well on all three datasets. The elements adjacent to the diagonal correspond to the model confusing adjacent ratings, e.g. predicting a rating of 2-stars when the true rating was 3-stars and vice versa. This type of error is to be expected because different people have different standards for rating a book, and the distinction between assigning a 2- or 3-star rating may be somewhat arbitrary and depend on the personal tastes of the reviewer. Given this ambiguity in how ratings are assigned, it is not surprising that matrix elements near (but not on) the diagonal are large.
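For completeness, the confusion matrices and the per-class F1-scores discussed next can be computed with scikit-learn; the sketch below uses placeholder arrays in place of the true and predicted ratings on a given split:

import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

y_true = np.array([0, 1, 2, 3, 4, 5, 5, 0])  # placeholder true ratings
y_pred = np.array([5, 1, 2, 2, 4, 5, 4, 0])  # placeholder predicted ratings

labels = [0, 1, 2, 3, 4, 5]
cm = confusion_matrix(y_true, y_pred, labels=labels)  # rows: true, columns: predicted
per_class_f1 = f1_score(y_true, y_pred, labels=labels, average=None)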
We can also note that the model is better at predicting 1- and 5-star reviews in comparison to reviews assigned a rating between 2- and 4-stars (we will come back to 0-star reviews momentarily). To see this more clearly, we present the F1-score on the test set for each rating below:


We suspect the model has a higher F1-score for 1- and 5-star reviews because they are less ambiguous: a user who assigned a book 5-stars clearly enjoyed it, while a user who rated a book 1-star did not like it at all. By contrast, a user who assigned a book a rating between 2- and 4-stars likely has both positive and negative things to say about the book, making the review more ambiguous. The model therefore needs to learn how to properly weigh both the positive and negative statements in order to predict the rating. Although the attention mechanism helps transformer models learn to understand ambiguous texts, this is still a difficult classification problem.
Finally, and most interestingly, our model tends to make large errors for reviews whose true rating is 0. Although the F1-score for 0-star reviews is higher than the F1-score for 2-, 3-, and 4-star reviews, our model often predicts that 0-star reviews should actually have a rating of 4- or 5-stars. In the confusion matrix for the test set, our model predicts that 734 reviews with a true rating of 0 should actually have a rating of 5, whereas it predicts that only 150 reviews with a true rating of 1 should be assigned a rating of 5. This discrepancy was also observed in the first report, where a TextBlob sentiment classifier found that 0-star reviews tended to be more positive than 1-star reviews (see the final plot before the acknowledgements). To understand where these large errors are coming from, below we present a table of reviews where the true rating is 0 but the model predicts the rating should be 5:


Surprisingly, many of these reviews are actually very positive! For example, the second reviewer talks about how they loved the book and all its details, yet according to Goodreads they assigned the book a rating of 0. Clearly the issue is that different users assign a book 0-stars for different reasons: some may assign 0-stars because they truly did not like the book, while others may simply be leaving the rating blank. The problem is that the Goodreads dataset does not distinguish between these two cases, so the "true" label given in the dataset does not always correspond to the actual sentiment of the review.
How should we handle this error in the data itself? One option is to try to clean our data by removing all 0-star reviews where the actual review is positive. To do this, we can flag any review where the true label is 0 but the predicted label from the neural net is 4 or 5. To avoid flagging too many reviews, we can require that a review is flagged only if the model is sufficiently confident that the review should be assigned a rating of 4 or 5 (i.e. the predicted probability is above a certain threshold).
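As a concrete (hypothetical) version of this rule, the flagging step might look like the following sketch, where `logits` is assumed to be the (n_reviews, 6) classifier output on the 0-star reviews and the 0.9 threshold is illustrative:

import numpy as np

def flag_positive_zero_star(logits: np.ndarray, threshold: float = 0.9) -> np.ndarray:
    # Softmax over the 6 rating classes.
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    # Flag a review when the combined probability of a 4- or 5-star rating
    # exceeds the threshold.
    return probs[:, 4] + probs[:, 5] > threshold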
If a review is flagged, we can either have a human decide whether to remove it or have it removed automatically by our model. The benefit of removing reviews automatically is that it saves time, but the downside is that we may accidentally remove reviews that should be kept in the dataset. In addition, if we use a neural net to decide which reviews to keep or remove, we may accidentally overfit to our test set.
Another, potentially simpler, option is to simply remove all reviews where the assigned rating is 0 and instead study a 5-class classification problem. By removing reviews where the "true" label is not reflective of the actual text, we will likely attain better accuracies for the remaining reviews.
In addition, we do not lose much information by going from a 6-class to a 5-class classification problem: in both cases our model is learning to make fine-grained distinctions between reviews based solely on their text. Of course, the downside to this approach is that we are potentially throwing away useful data which could be used to train our model.

Summary

To wrap up, in this report we redid our train/valid/test split of the (downsampled) Kaggle Goodreads dataset to take into account how active different users are on Goodreads.com. We then performed a hyperparameter sweep for the BERT-tiny model and selected the model which performed best on the validation set. Next, we evaluated this model on the test set and observed that it achieved a final accuracy of around 53%.
Using this trained model, we then studied in more detail how well it performed on the train, valid, and test sets. From the confusion matrices, we found that the model was the most accurate at predicting 1- and 5-star reviews, while it had more difficulty with reviews assigned a rating between 2- and 4-stars.
Using the confusion matrix, we also noticed that our model was often predicting that reviews assigned a rating of 0-stars by the user should actually have a rating of 4- or 5-stars. By inspecting our dataset we found the cause of this effect: many users wrote very positive reviews for a book but then simply did not assign a rating to the book. Therefore, our model had trouble making predictions for these reviews because the vast majority of users who gave a rating of 0-stars wrote negative reviews.
Of course, there is clearly a lot of analysis that remains to be done. The simplest extension of our analysis would be to train a larger model on the full Goodreads dataset. By increasing the amount of data, the size of the model, and the amount of compute, we can expect to achieve a higher accuracy on our 6-class classification problem. In addition, more work needs to be done on cleaning the dataset.
If we want to build a model that accurately predicts the sentiment of a review solely from its text, we need to ensure that the ratings in our training data actually reflect the sentiment of the review. We mentioned two ways to clean our dataset to remove these mislabelled reviews, and it would be interesting to study how cleaning the data affects the performance of the transformer models studied in this report.

Acknowledgements

We would like to thank Prajjwal Bhargava for making his implementation of BERT-tiny available on HuggingFace, see here, and Kayvane Shakerifar for making public his nicely written code on combining HuggingFace models and WandB, see here.