
Sentiment Analysis on Goodreads Reviews: Part 2

In this article, we offer a further exploration of Goodreads reviews. This is the second in a three-part series from a community member who took the W&B MLOps course.
Created on March 7 | Last edited on May 19
This project is a community submission from a practitioner who took our free MLOps course. It's a great preview of what you can expect to learn in the course and is the second installment in a three-part series about this particular project. You'll find the other report below.
In this article we continue the analysis of the Kaggle Goodreads dataset that we began in this previous article.
Here's what we'll be covering:

Table of Contents

Refactoring
Results
Summary
Acknowledgements

As a reminder, the original dataset consists of 900k reviews written by 12,188 distinct users about 25,474 books. Each review also carries a rating between 0 and 5, and the goal of this project is to predict the rating given the text of the review. This is a classic multi-class sentiment analysis problem which can be studied using transformer architectures.
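As a point of reference, here is a minimal sketch of how such a six-class model can be set up with the HuggingFace transformers library. The review text is a made-up example; the actual training pipeline is described in the previous report and below.

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load BERT-tiny with a 6-way classification head, one class per rating (0-5).
model_name = "prajjwal1/bert-tiny"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=6)

# Tokenize a single review and predict its rating.
inputs = tokenizer("I could not put this book down!", truncation=True, return_tensors="pt")
predicted_rating = model(**inputs).logits.argmax(dim=-1).item()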
The original dataset was skewed towards higher ratings (ratings of 4 or 5 were more common than ratings of 0 or 1). To fix this, we downsampled our dataset so that there were an equal number of reviews for each rating.
After performing some mild data cleaning, our downsampled dataset had 171,312 reviews (corresponding to 28,552 reviews per rating) written by 11,164 users about 23,410 books. Finally, we split the data into train/valid/test sets using a 60-20-20 split. We performed this split such that the train, validation, and test sets do not share any values of "book_id", to avoid any data leakage. More details and some preliminary exploratory data analysis can be found in the previous report.
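There are several ways to implement such a grouped split. The sketch below uses scikit-learn's GroupShuffleSplit and assumes reviews_df is the downsampled dataframe; the actual split lives in the split_and_log function described later.

import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def group_split(df: pd.DataFrame, test_size: float, seed: int = 42):
    """Split df so that no book_id appears on both sides of the split."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(df, groups=df["book_id"]))
    return df.iloc[train_idx], df.iloc[test_idx]

# 60-20-20: carve off the test set first, then split the remainder into train/valid.
train_valid_df, test_df = group_split(reviews_df, test_size=0.20)
train_df, valid_df = group_split(train_valid_df, test_size=0.25)  # 0.25 of 80% is 20%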

Refactoring

In the previous piece, we used two models, DistilBERT and BERT-tiny, to analyze the Goodreads dataset. We ran one experiment for each model, using the recommended hyperparameters, and carried out most of the analysis in a Jupyter notebook. This week we refactored and expanded that code so that most of it lives in Python files, and we swept over different values of the hyperparameters. By cleaning up our previous code we can now more easily generalize to other datasets and models.
In this section we will briefly explain the files and code used for this week's analysis. As in the previous week's work, we use the files params.py and requirements.txt to store the default training parameters and the dependencies required to run the code. For example, the first few lines of our params.py file are:
from ml_collections import config_dict
default_cfg = config_dict.ConfigDict()

# WANDB BASE PARAMETERS
default_cfg.PROJECT_NAME = "mlops-course-assgn2"

#WANDB JOB TYPES
default_cfg.RAW_DATA_JOB_TYPE='fetch_raw_data'
default_cfg.DATA_PROCESSING_JOB_TYPE='process-data'
default_cfg.SPLIT_DATA_JOB_TYPE='split-data'
default_cfg.MODEL_TRAINING_JOB_TYPE='model-training'
default_cfg.MODEL_INFERENCE_JOB_TYPE='model-inference'

# WANDB ARTIFACT TYPES
default_cfg.DATASET_TYPE='dataset'
default_cfg.MODEL_TYPE='model'
default_cfg.MODEL_TRAINING_JOB_TYPE='model_training'
Similarly, our requirements.txt file contains:
datasets==2.10.1
evaluate==0.4.0
imblearn==0.0
ml_collections==0.1.1
numpy==1.21.5
pandas==1.3.5
scikit_learn==1.2.1
torch==1.13.0
transformers==4.26.1
wandb==0.12.0
tokenizers
sentencepiece
At this point we should mention that, to get our code to work, we had to pin an older version of W&B (0.12.0) in order to export our results from HuggingFace to Weights & Biases.
The three new files are train.py, process.py and sweep.yaml. The train.py file contains essentially the same logic as our previous training code, which was written directly in a Jupyter notebook. The main difference is that we can now run the training file directly from the command line, and we have also added the option to override the default parameters using argparse.
A few lines of that code:
import argparse

def parse_args():
    "Overriding default arguments for model"
    # default_cfg is the ConfigDict defined in params.py
    argparser = argparse.ArgumentParser(
        description="Process base parameters and hyperparameters"
    )
    argparser.add_argument(
        "--MODEL_NAME",
        type=str,
        default=default_cfg.MODEL_NAME,
        help="Model architecture to use"
    )
    argparser.add_argument(
        "--NUM_EPOCHS",
        type=int,
        default=default_cfg.NUM_EPOCHS,
        help="Number of training epochs"
    )
    argparser.add_argument(
        "--TRAIN_BATCH_SIZE",
        type=int,
        default=default_cfg.TRAIN_BATCH_SIZE,
        help="Train batch size"
    )
    argparser.add_argument(
        "--VALID_BATCH_SIZE",
        type=int,
        default=default_cfg.VALID_BATCH_SIZE,
        help="Validation batch size"
    )
    return argparser.parse_args()
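The parsed arguments are then merged back into the default config before training starts. The exact wiring in our train.py is slightly more involved, but a minimal sketch of the idea looks like this (the train() entry point named here is illustrative):

def main():
    # Override the defaults from params.py with any command-line arguments,
    # then hand the merged config to the training routine.
    args = parse_args()
    default_cfg.update(vars(args))
    train(default_cfg)  # illustrative name for the training routine in train.py

if __name__ == "__main__":
    main()

With this in place, a run with non-default settings is as simple as python train.py --NUM_EPOCHS 5 --TRAIN_BATCH_SIZE 32.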
The process.py file contains three functions which process the dataset.
The first one, log_raw_data, simply downloads the Kaggle Goodreads dataset to the local directory and saves it as an artifact. Note that for this function to work, the user needs to use the Kaggle API and have their token saved in the "/root/.kaggle" directory.
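A condensed sketch of what this function does is shown below; the Kaggle dataset slug is a placeholder, and the artifact name is illustrative rather than the exact one used in our code.

import wandb
import kaggle  # authenticates with the token stored under /root/.kaggle

def log_raw_data(cfg):
    """Download the Goodreads reviews dataset and log it as a W&B artifact."""
    # "<owner>/<goodreads-dataset>" is a placeholder for the actual Kaggle slug.
    kaggle.api.dataset_download_files("<owner>/<goodreads-dataset>", path="data/raw", unzip=True)

    with wandb.init(project=cfg.PROJECT_NAME, job_type=cfg.RAW_DATA_JOB_TYPE) as run:
        raw_data = wandb.Artifact("goodreads_raw", type=cfg.DATASET_TYPE)
        raw_data.add_dir("data/raw")
        run.log_artifact(raw_data)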
The second function, downsample_and_log, downsamples the data so that all ratings appear an equal number of times. It also adds two new columns to the Pandas dataframe, full_length and mean_word_length: the former is the length of a given review and the latter is the average length of a word in the review. These features are not used directly in this analysis, but they may be of interest in future applications. After adding these new features, the dataset is logged as an artifact.
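The downsampling itself is short to write with pandas. The sketch below assumes the review text and rating live in columns named review_text and rating, and measures review length in characters.

def downsample(df):
    """Downsample so that every rating has as many reviews as the rarest rating."""
    n_per_rating = df["rating"].value_counts().min()
    balanced = df.groupby("rating").sample(n=n_per_rating, random_state=42)

    # Extra features: review length (in characters) and average word length.
    balanced["full_length"] = balanced["review_text"].str.len()
    balanced["mean_word_length"] = balanced["review_text"].str.split().apply(
        lambda words: sum(len(w) for w in words) / max(len(words), 1)
    )
    return balanced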
Finally, the split_and_log function splits the Pandas dataframe into train, validation, and test sets, converts each of them into a HuggingFace Dataset, and then tokenizes each dataset using the tokenizer associated with the model being studied. After this is done, each dataset is logged as a separate artifact. When the user runs process.py, these three functions are called sequentially.
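The conversion and tokenization step looks roughly like the following sketch; the column names and the 512-token truncation length are assumptions rather than the exact values in our code.

from datasets import Dataset
from transformers import AutoTokenizer

def tokenize_split(split_df, model_name, max_length=512):
    """Convert a pandas split into a tokenized HuggingFace Dataset."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    ds = Dataset.from_pandas(split_df.rename(columns={"rating": "labels"}))
    return ds.map(
        lambda batch: tokenizer(batch["review_text"], truncation=True, max_length=max_length),
        batched=True,
    )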
The final file used is sweep.yaml, which configures the sweep over different choices of the hyperparameters. We use a random sweep since it is relatively cheap to perform and also explores more of the hyperparameter space than a simple grid search. The sweep.yaml file is shown below:
program: train.py
method: random
project: mlops-course-assgn2
description: "Random sweep for BERTtiny."
metric:
  name: eval/accuracy
  goal: maximize
early_terminate:
  type: hyperband
  min_iter: 5
parameters:
  TRAIN_BATCH_SIZE:
    value: 32
  VALID_BATCH_SIZE:
    value: 32
  WARMUP_STEPS:
    values: [256, 512, 1024]
  LEARNING_RATE:
    distribution: 'log_uniform_values'
    min: 1e-5
    max: 1e-2
  GRADIENT_ACCUMULATION_STEPS:
    values: [1, 4, 16, 32]
  MODEL_NAME:
    values:
      - 'prajjwal1/bert-tiny'
Here we configure the sweep so that we maximize the accuracy on the validation dataset (which in the above file is referred to as "eval/accuracy"). We have fixed the train and validation batch sizes to 32 because that was the largest we could fit on a Google Colab GPU. To compensate for this we introduced the key GRADIENT_ACCUMULATION_STEPS, which determines how many mini-batches our model should evaluate before performing an optimization step. By increasing the number of accumulation steps we effectively increase our batch size without having to use a more powerful GPU. We have also varied the number of warm-up steps and sampled the learning rate from a log-uniform distribution over the range [1e-5, 1e-2]. Finally, the only model we use is BERT-tiny, as implemented by Prajjwal Bhargava here.
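For context, here is a sketch of how these swept values plug into the HuggingFace trainer inside train.py; the exact arguments in our code may differ slightly, and cfg stands for the merged config described above.

from transformers import TrainingArguments

# With a per-device batch size of 32 and, say, 16 accumulation steps, the
# effective batch size becomes 32 * 16 = 512 without any extra GPU memory.
training_args = TrainingArguments(
    output_dir="outputs",
    report_to="wandb",  # stream metrics to Weights & Biases
    per_device_train_batch_size=cfg.TRAIN_BATCH_SIZE,
    per_device_eval_batch_size=cfg.VALID_BATCH_SIZE,
    gradient_accumulation_steps=cfg.GRADIENT_ACCUMULATION_STEPS,
    warmup_steps=cfg.WARMUP_STEPS,
    learning_rate=cfg.LEARNING_RATE,
    num_train_epochs=cfg.NUM_EPOCHS,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)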

Results

In this section we will summarize the main results from running the above hyperparameter sweep for 20 runs. Below we use a parallel coordinates plot to visualize how the accuracy on the validation set depends on the number of gradient accumulation steps, the learning rate, and the number of warm-up steps. We see that our best model has an accuracy of 52.81% while the worst performing models have an accuracy of just 32.18%. As a reminder, random guessing would give an accuracy of approximately 16.67%.



Below we additionally plot the model accuracy against the time at which each run was created, along with a table summarizing the relative importance of each hyperparameter. From the first plot we see that most of our models have an accuracy between 45% and 52%, and some of our best performing models were created early on in the sweep.
The second plot summarizes how our hyperparameters affect the accuracy on the validation set. The "correlation" column captures the linear relationship between each parameter and the final performance, while the "importance" column uses a random forest (where the hyperparameters are the inputs and the evaluation metric is the target) to measure the importance of each parameter. The benefit of this "importance" measure is that it can capture non-linear effects which are missed by computing the correlation alone.
From this table we see that the number of gradient accumulation steps is correlated with the final accuracy, but that the learning rate appears to have the highest importance. Given that the correlation between the learning rate and the final accuracy is relatively small (it is close to 0.1), it is difficult to say whether increasing the learning rate is actually correlated with better performance.
To be more confident we would need to run more experiments and understand any nonlinear effects or possible interaction effects with other hyperparameters.
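W&B computes the importance column for us, but conceptually it amounts to something like the following sketch, which pulls the sweep runs through the public API; the entity name is a placeholder.

import pandas as pd
import wandb
from sklearn.ensemble import RandomForestRegressor

# Fit a random forest that predicts each run's final validation accuracy from
# its hyperparameters, then read off the feature importances.
hypers = ["LEARNING_RATE", "WARMUP_STEPS", "GRADIENT_ACCUMULATION_STEPS"]
api = wandb.Api()
runs = api.runs("<entity>/mlops-course-assgn2")  # <entity> is a placeholder

rows = [
    {**{h: run.config.get(h) for h in hypers}, "accuracy": run.summary.get("eval/accuracy")}
    for run in runs
]
df = pd.DataFrame(rows).dropna()
forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(df[hypers], df["accuracy"])
print(dict(zip(hypers, forest.feature_importances_)))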


[Panels from sweep dqin7gdg: validation accuracy vs. run creation time, and the hyperparameter importance table]

Finally, below we plot the accuracy and loss of our model on the validation set as a function of step time:

[Panels from sweep dqin7gdg: validation accuracy and loss vs. step]


Because we used early stopping both in the definition of our sweep and in our HuggingFace trainer, some runs ended earlier than others. Early stopping is convenient for preventing runs from going on too long and spending GPU-time on a model which is not performing well. Nevertheless, it may be interesting to revisit these experiments and let the models run for longer (e.g. in HuggingFace we set early_stopping_patience=3, but we could also sweep over this parameter).
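For reference, the early-stopping piece of the trainer setup looks roughly like the sketch below; model, the tokenized datasets, and compute_metrics are the objects from the earlier sketches and are assumed rather than quoted from our code.

from transformers import EarlyStoppingCallback, Trainer

# Training stops once eval/accuracy fails to improve for 3 consecutive
# evaluations; sweeping over this would just mean reading the patience
# value from the config instead of hard-coding it.
trainer = Trainer(
    model=model,
    args=training_args,  # see the TrainingArguments sketch above
    train_dataset=tokenized_train,
    eval_dataset=tokenized_valid,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()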
Finally, we will compare how well our models performed against the models trained in the previous week. As a reminder, below we present the accuracy vs gradient step-time found in the previous report:



Somewhat remarkably, performing the sweep did little to improve the accuracy of the BERT-tiny model. After performing the sweep the best model had an accuracy of 52.87% on the validation set, while previously the best performing BERT-tiny model had an accuracy of 52.84%. We also see that performing a hyperparameter search on BERT-tiny did not close the gap with DistilBERT, which had an accuracy of around 60% on the validation data. This performance gap is not surprising: the BERT-tiny model has 2.2M parameters while DistilBERT has 66M parameters.
The benefit of working with BERT-tiny is that it is faster to train and run, so we can afford many hyperparameter searches to optimize the model. The catch is that performing all 20 runs took around 5 hours, while a single DistilBERT run took around 6 hours and still achieved a higher accuracy of 60%. Of course, the downside of working with DistilBERT is that it is impractical to run many hyperparameter sweeps when each run takes 5-6 hours. To perform a 20-run sweep with DistilBERT it would help to have more and/or more powerful GPUs (in Google Colab we have access to a single GPU).

Summary

To summarize, in this report we have briefly explained how we refactored our code and performed a hyperparameter sweep. We found that, in practice, performing a hyperparameter sweep for BERT-tiny did not significantly improve the performance of our model. There are clearly many ways we could try to improve our model, for example by working with a larger dataset, working with a larger model, or by further pre-training.
We will leave most of this analysis to future work. In next week's assignment we will revisit some of our assumptions on how we split the data, go into more detail on how our model actually performs on the validation dataset, and finally study how our models perform on the test data.

Acknowledgements

We would like to thank Prajjwal Bhargava for making his implementation of BERT-tiny available on HuggingFace, see here, and Kayvane Shakerifar for his nicely written code on combining HuggingFace models and W&B, see here.

