
Sentiment Analysis on Goodreads Reviews: Part 1

Analyzing and prepping a Goodreads dataset for modeling
Created on February 3 | Last edited on April 18
This project is a community submission from a practitioner who took our free MLOps course. It's a good preview of the things you'll learn there and is the first installment in a three-part series about this particular project.

Introduction

In this report we'll study EDA and baseline models for a dataset of Goodreads reviews.
Specifically, we'll study the Kaggle Goodreads dataset, which can be found here. As someone who has used this website to organize which books I wanted to read next and to find recommendations for new books, I was curious to see how we could use NLP to understand the Goodreads community. As a website dedicated solely to books, it is likely that reviews on Goodreads.com are qualitatively different from those on other websites like Amazon. In this report we will only look at how to perform sentiment analysis on Goodreads reviews, but I think it is an interesting question to understand how reviews differ across websites and communities.
The main goal of this report is to see how Huggingface transformer models can be used in combination with Weights & Biases to perform sentiment analysis. Specifically, we will try to predict the rating of a book (which can range from 0 to 5) using solely the text of the review. Transformer models are a versatile class of neural networks whose development has led to remarkable progress in computer vision and natural language processing. The Huggingface transformers library is a simple, elegant library which allows users to apply state-of-the-art transformer models and upload their own fine-tuned models. We will show that, with minimal data cleaning, we can get up to 60% accuracy on this six-class classification problem using a relatively small transformer model.

Data Analysis

Cleaning and Splitting the Data

We will start by discussing some general features of the Kaggle Goodreads dataset. First, note that this dataset is very large: it consists of 900,000 distinct reviews written for 25,474 different books by 12,188 different users. The user ratings are also not balanced and are heavily skewed towards higher values:



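To make the setup concrete, here is a minimal sketch of how the dataset can be loaded with pandas and the rating distribution plotted. The file name `goodreads_train.csv` is an assumption; replace it with the path to the Kaggle CSV on your machine.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file name -- point this at the downloaded Kaggle CSV.
df = pd.read_csv("goodreads_train.csv")

print(df.shape)                 # ~900,000 reviews
print(df["book_id"].nunique())  # ~25,474 distinct books
print(df["user_id"].nunique())  # ~12,188 distinct users

# Bar chart of the rating distribution (0-5), skewed towards 4s and 5s.
df["rating"].value_counts().sort_index().plot(kind="bar")
plt.xlabel("rating")
plt.ylabel("number of reviews")
plt.show()
```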
There are a few options for handling this class imbalance: we can use metrics designed for unbalanced datasets (e.g. the F1 score), we can upsample the less frequent classes, or we can downsample the more frequent classes. Given that the smallest class, which corresponds to books with a rating of "1", has 28,718 entries, we think it is safe to downsample the other classes so that we have an equal number of reviews for every rating.
Before doing this, we will do some mild data cleaning. The first thing we will do is lower-case all the reviews and remove all extra whitespace, newline characters, and tab characters. After doing this, we will drop all duplicates in the 'review_text' column. In the table below we list the 100 most common reviews:


The most common reviews tend to be short and promise that a full review is forthcoming. Dropping the duplicated reviews ensures there is no accidental overlap between the valid/test sets and the training set (i.e. we want to avoid data leakage) and also ensures our model has access to more non-trivial training data. After removing duplicate reviews, the total number of reviews shrinks from 900,000 to 889,610.
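A minimal sketch of this cleaning and deduplication step, assuming the reviews live in a pandas DataFrame `df` with a 'review_text' column as above:

```python
import re

def clean_review(text: str) -> str:
    """Lower-case and collapse all whitespace (spaces, tabs, newlines)."""
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()

df["review_text"] = df["review_text"].astype(str).map(clean_review)

# Drop exact duplicate reviews so the same text cannot end up in both
# the training split and the validation/test splits.
df = df.drop_duplicates(subset="review_text").reset_index(drop=True)
print(len(df))  # ~889,610 reviews remain
```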
We can now downsample our data so that all ratings are equally represented. Our dataset now has 171,312 reviews which are equally distributed among the six classes (see the plot below). In addition, the dataset now covers 23,410 distinct books and 11,164 distinct users.


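The downsampling itself only takes a couple of lines with pandas; this sketch assumes the deduplicated DataFrame `df` from above:

```python
# Downsample every rating to the size of the smallest class so that
# all six ratings are equally represented.
n_smallest = df["rating"].value_counts().min()

balanced = (
    df.groupby("rating")
      .sample(n=n_smallest, random_state=42)
      .reset_index(drop=True)
)
print(len(balanced))  # ~171,312 reviews, equally split across the six ratings
```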
At this point we can now split the data into train/valid/test splits. We will choose our split such that the train, valid, and test splits contain distinct book_ids. We do this to prevent data leakage, i.e. we want to avoid the situation where we accidentally pick a model which memorizes the average rating of a given book without learning the content of the review itself. For example, a model may simply learn that a review with "Anna Karenina" in the text tends to receive high ratings and if "Anna Karenina" appears frequently in the validation set we may overestimate the actual accuracy of the model on genuinely new data.
In practice, we perform the train/valid/test split conditioned on the value of 'book_id' using the GroupShuffleSplit function from sklearn. If we ask this function to produce a 60-20-20 split subject to the above conditions, it cannot produce the split exactly, but it comes very close: the actual split is approximately 59-20.5-20.5. In the following section we will train Huggingface models on the training set and measure their performance on the validation set.
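A sketch of this grouped split, assuming the balanced DataFrame from the previous step. Since GroupShuffleSplit only produces two groups at a time, we apply it twice:

```python
from sklearn.model_selection import GroupShuffleSplit

# First split off ~60% of books for training...
gss = GroupShuffleSplit(n_splits=1, train_size=0.6, random_state=42)
train_idx, rest_idx = next(gss.split(balanced, groups=balanced["book_id"]))
train_df, rest_df = balanced.iloc[train_idx], balanced.iloc[rest_idx]

# ...then split the remainder 50/50 into validation and test,
# again grouping by book_id so no book appears in more than one split.
gss2 = GroupShuffleSplit(n_splits=1, train_size=0.5, random_state=42)
valid_idx, test_idx = next(gss2.split(rest_df, groups=rest_df["book_id"]))
valid_df, test_df = rest_df.iloc[valid_idx], rest_df.iloc[test_idx]

# Because whole books are kept together, the achieved split is only
# approximately 60-20-20 (roughly 59-20.5-20.5 in practice).
assert set(train_df["book_id"]).isdisjoint(valid_df["book_id"])
assert set(train_df["book_id"]).isdisjoint(test_df["book_id"])
```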

Baseline Models

Distilbert

Here we take our baseline model to be distilbert-base-uncased. We train the model for a maximum of 3 epochs and use a batch size of 32 for the train, validation, and test sets. The initial learning rate is 5e-5 and we take 500 warmup steps. We use half precision in order to reduce the overall memory requirements. For the loss function we use cross-entropy, and we use the validation set to determine when (and whether) to perform early stopping. Finally, we use accuracy on the test set to measure the overall performance of the model.
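A sketch of this training setup using the Huggingface Trainer API. Here `train_ds` and `valid_ds` are assumed to be already-tokenized datasets with a "labels" column, and the logging/saving/early-stopping settings are illustrative rather than the exact configuration behind the runs below:

```python
import numpy as np
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=6)

def compute_metrics(eval_pred):
    """Accuracy on the evaluation set."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": (preds == labels).mean()}

args = TrainingArguments(
    output_dir="goodreads-distilbert",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    learning_rate=5e-5,
    warmup_steps=500,
    fp16=True,                    # half precision to reduce memory usage
    evaluation_strategy="steps",
    save_strategy="steps",
    load_best_model_at_end=True,  # required for early stopping on the eval loss
    report_to="wandb",            # log metrics to Weights & Biases
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,       # assumed: tokenized training split
    eval_dataset=valid_ds,        # assumed: tokenized validation split
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # illustrative patience
)
trainer.train()
```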
The plot for the training and evaluation loss is shown below. The training loss decreases monotonically, but relatively slowly, for most steps, with sharper drops near the beginning and end of training. The drop near the end coincides with the model starting to overfit, as can be seen from the increase in the evaluation loss.



Finally, the accuracy on the validation set is:


The final accuracy on the validation dataset is around 60%, which is significantly higher than random guessing, which would give 1/6 ≈ 16.7%.
We have avoided evaluating the accuracy of the model on the test dataset because we intend to run further experiments where we scan over different hyperparameters. For this reason, we will avoid looking at the test dataset until we have tuned all parameters.

BERT-tiny

In the previous section we saw that DistilBERT gave nice results on the training and validation sets. However, DistilBERT also took a relatively long time to run: in total it took around 4.7 hours to train the model on Google Colab. While this is not terrible if we only run the model once, it becomes a problem if we want to run sweeps over various learning rates (unless we upgrade our GPU or distribute the training across multiple GPUs). Instead, we will try an even smaller model.
In total, DistilBERT has 66M parameters. For context, the original BERT-Large has 340M parameters and RoBERTa has 355M parameters. While DistilBERT is 40% smaller than the original BERT model, it is still fairly large. Therefore, in this section we will use BERT-tiny, which has only 4.4M parameters. BERT-tiny is roughly 1/15th the size of DistilBERT and takes only 18 minutes to train when we run 6 complete epochs. The cost of this reduced runtime is that the final accuracy is lower, at around 50%. The loss curves for the training and validation sets, as well as the accuracy on the validation set, are provided below:


We see that after 7-8k training steps the model has reached its best performance, and beyond that point it starts to overfit.
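For reference, switching from DistilBERT to BERT-tiny only requires swapping the checkpoint name in the training sketch above (the prajjwal1/bert-tiny checkpoint is the one credited in the acknowledgements) and increasing the number of epochs:

```python
# Same pipeline, smaller checkpoint: community BERT-tiny weights (~4.4M parameters).
checkpoint = "prajjwal1/bert-tiny"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=6)

# Train for up to 6 epochs; everything else can stay as in the DistilBERT run,
# then build a new Trainer exactly as before.
args.num_train_epochs = 6
```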

Extra EDA

One benefit of using transformers is that they require less data cleaning than simpler models, such as logistic regression or vanilla RNNs. Normally in NLP one has to perform operations such as removing stopwords and stemming or lemmatizing the remaining words. Removing stopwords, such as "the" or "a", is useful because these words are not informative about the actual content of the text, and we do not want our model to infer accidental correlations between the presence of these words and the rating. It is also useful to stem or lemmatize the text so that the model can exploit the fact that the words "write" and "writing" carry essentially the same content.
With modern transformer architectures, such cleaning is not needed and can sometimes even be harmful. With transformers we do not need to remove stopwords because of the attention mechanism: the architecture learns by itself to pay less attention to words that are uninformative. In addition, we do not need to stem or lemmatize because the model uses a subword (WordPiece) tokenizer. This tokenizer automatically learns to split words at the sub-word level, so "writing" can be represented as "writ" + "##ing", where "##" is a special prefix used to denote that a token continues the previous word.
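A quick way to see this in action is to run the tokenizer directly; the exact splits depend on the learned vocabulary, so treat the outputs below as illustrative:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Common words usually survive as a single token.
print(tokenizer.tokenize("writing"))

# Rarer or inflected words get split into sub-word pieces; the "##" prefix
# marks a piece that continues the previous token. The exact split depends
# on the vocabulary, e.g. something like ['rer', '##ead', '##ing'].
print(tokenizer.tokenize("rereading"))
```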
All that being said, there are cases where cleaning the text is still important. In particular, it is useful to clean the data when performing exploratory data analysis (EDA) so that we can glean insights directly from the text which may not be obvious from just analyzing the output of the neural net. In this section we will perform further EDA to get a better understanding of the Goodreads dataset. Specifically, we will look at the train split of the downsampled dataset. At the moment this data analysis is not being used directly to perform sentiment analysis, but it can in principle be included in future analyses.
The first thing we will analyze is the number of reviews of a given length. A line plot is given below where we plot reviews up to length 350, with each rating plotted separately. Overall, it looks like there is a strong correlation between a review being short and it being given a rating of '0'. However, once we go to longer reviews, i.e. around length 200, it becomes impossible to distinguish the different curves.



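A sketch of how such a plot can be produced; here review length is measured in words, which is an assumption about how the lengths above were computed:

```python
import matplotlib.pyplot as plt

# Count reviews of each length (in words), per rating, up to length 350.
train_df["review_len"] = train_df["review_text"].str.split().str.len()

for rating, grp in train_df.groupby("rating"):
    counts = grp["review_len"].value_counts().sort_index()
    counts[counts.index <= 350].plot(label=f"rating {rating}")

plt.xlabel("review length (words)")
plt.ylabel("number of reviews")
plt.legend()
plt.show()
```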
To study the same problem from a different angle we can also make box plots of review length versus rating (note: to see the axes, this report must be read with a white background):

We see that reviews corresponding to a rating of 0 do tend to be shorter, but given how many outliers there are, this type of plot does not appear to be very useful. Finally, we can look at a correlation matrix to see how the rating, review length, and average word length (within a given review) are correlated:


Overall we see relatively weak correlations, with the strongest correlation being between the rating and the length of the text.
Next, we will look at common words which appear in the reviews. Before doing this we perform standard text cleaning: removing stop words, expanding contractions, removing punctuation and HTML tags, and stemming the text. For stemming we use the PorterStemmer provided by NLTK. In general it is better to lemmatize the text instead of stemming it, since lemmatization takes into account the context of a word while stemming truncates the word independent of context. However, since we are only performing basic EDA, we find that stemming is sufficient here.
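A sketch of this EDA cleaning step with NLTK, assuming the training split `train_df` from earlier; contraction expansion is omitted for brevity (a package such as `contractions` could handle it):

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def clean_for_eda(text: str) -> str:
    """Strip HTML tags and punctuation, drop stopwords, and stem the rest."""
    text = re.sub(r"<.*?>", " ", text)             # remove HTML tags
    text = re.sub(r"[^a-z\s]", " ", text.lower())  # remove punctuation/digits
    tokens = [stemmer.stem(tok) for tok in text.split() if tok not in stop_words]
    return " ".join(tokens)

train_df["eda_text"] = train_df["review_text"].map(clean_for_eda)
```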
To start, we can make a word cloud to see what are the most common words which appear among all the reviews (regardless of rating):

Many of the most common words here will not come as a surprise: the words "think", "feel", "love", "thought", etc. appear often since people are writing their personal opinions, and words like "novel", "story", and "main character" are expected on a site about books. The phrases "view spoiler" and "hide spoiler" also appear often, as a warning to readers of the reviews.
To get a better sense of how words differ across reviews, we will look at the most common trigrams for all reviews of a given rating. The bar plots for trigrams are shown below. One thing we can immediately note is that "exchange honest review" and "netgalley exchange honest" appear frequently for all ratings. It appears that many reviewers on goodreads.com receive a free copy of the book if they promise to leave an honest review. Beyond that, we can see how the text of the review correlates with the rating. For example, reviews with a low rating, i.e. less than or equal to 2, tend to involve phrases such as "really wanted like", "really want read", and "even get started". These phrases indicate the reviewer wanted to like the novel, but ultimately did not. Other phrases, such as "blah blah blah", clearly signal the reviewer did not like the book. On the other hand, books with ratings of 4 or 5 involve trigrams such as "love love love", "really enjoyed book", and "looking forward next", which are all clearly positive phrases.


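The trigram counts behind plots like these can be computed with scikit-learn's CountVectorizer; this sketch assumes the cleaned `eda_text` column from the previous snippet:

```python
from sklearn.feature_extraction.text import CountVectorizer

def top_trigrams(texts, n=15):
    """Return the n most frequent trigrams in a collection of cleaned reviews."""
    vec = CountVectorizer(ngram_range=(3, 3))
    counts = vec.fit_transform(texts).sum(axis=0).A1
    vocab = vec.get_feature_names_out()
    order = counts.argsort()[::-1][:n]
    return [(vocab[i], int(counts[i])) for i in order]

# Print the five most common trigrams for each rating.
for rating, grp in train_df.groupby("rating"):
    print(rating, top_trigrams(grp["eda_text"])[:5])
```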
Since the phrase "in exchange for an honest review" appeared in so many reviews, I was curious to see whether receiving the book for free correlated with the rating given to it. To test this, I counted the number of times this phrase appeared across all reviews of a given rating. In the plot below we observe that this phrase appears more frequently in books that were given a higher rating. Note that this analysis was carried out on the downsampled training dataset, so all ratings appear an approximately equal number of times and this pattern cannot be attributed to higher ratings appearing more often than lower ratings.



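A sketch of this counting step, using the stemmed, stopword-free form of the phrase ("exchange honest review") on the cleaned text column from earlier:

```python
# Count how many reviews of each rating contain the (cleaned) phrase.
phrase = "exchange honest review"
counts_by_rating = (
    train_df.assign(has_phrase=train_df["eda_text"].str.contains(phrase, regex=False))
            .groupby("rating")["has_phrase"]
            .sum()
)
print(counts_by_rating)
```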
Finally, we can use TextBlob to perform some basic sentiment analysis on the reviews. Below we show a box plot for the polarity of a review versus rating. Here polarity lies in the range [-1,1], where a higher number corresponds to a more positive review. With the exception of reviews where the rating is '0', there appears to be a relatively weak correlation between the rating given and the positivity of the review (at least according to the model used by TextBlob).



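A sketch of the TextBlob computation behind a plot like this, again assuming the training DataFrame from earlier:

```python
import matplotlib.pyplot as plt
from textblob import TextBlob

# Polarity lies in [-1, 1]; higher values correspond to more positive wording.
train_df["polarity"] = train_df["review_text"].map(
    lambda text: TextBlob(text).sentiment.polarity
)

# Box plot of polarity per rating.
train_df.boxplot(column="polarity", by="rating")
plt.xlabel("rating")
plt.ylabel("TextBlob polarity")
plt.show()
```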
Acknowledgements

We would like to thank Prajjwal Bhargava for making his implementation of BERT-tiny available on Huggingface, see here, and Kayvane Shakerifar for making public his nicely written code on combining Huggingface models and WandB, see here.
We'll be back next week with parts 2 & 3.