Sentiment Analysis on Amazon Reviews Dataset
A sentiment analysis project built with W&B for the Effective MLOps: Model Development course
1. Business / Research Understanding Phase
2. Data Preparation/Understanding Phase
3. Exploratory Data Analysis Phase
4. Setup Phase
5. Modeling Phase
6. Evaluation Phase
 - Fine-tuning hyperparameters
 - Accuracy
7. Deployment Phase
In this project, we follow the Data Science Methodology (DSM).
1. Business / Research Understanding Phase
2. Data Preparation/Understanding Phase
The Amazon product reviews dataset contains reviews in English, Japanese, German, French, Chinese, and Spanish, collected between November 1, 2015, and November 1, 2019. Each record in the dataset contains the review text, the review title, the star rating, an anonymized reviewer ID, an anonymized product ID, and the coarse-grained product category (e.g., 'books', 'appliances'). The corpus is balanced across stars, so each star rating constitutes 20% of the reviews in each language.
For each language, there are 200,000, 5,000, and 5,000 reviews in the training, development, and test sets respectively. The maximum number of reviews per reviewer is 20 and the maximum number of reviews per product is 20. All reviews are truncated after 2,000 characters, and all reviews are at least 20 characters long.
- In this project we use English-language reviews, so we have 200,000 records for training, 5,000 for validation, and 5,000 for the test set.
- There are two tables available for EDA: "eda_table" which contains the entire dataset of 210,000 records, and "eda_table_sample" which is a sample of 10,000 records from the dataset.
- For this project, we use the "review_body" column as the input text and the "stars" column as the label, which has five classes (a minimal loading sketch follows below).
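The English split can be loaded along these lines with the Hugging Face datasets library. This is a minimal sketch assuming the amazon_reviews_multi dataset on the Hub; the exact loading code used in this project may differ:

```python
from datasets import load_dataset

# Load the English configuration of the multilingual Amazon reviews corpus
# (dataset name and column names are assumptions based on the description above).
dataset = load_dataset("amazon_reviews_multi", "en")

print(dataset)                              # train: 200,000 / validation: 5,000 / test: 5,000
print(dataset["train"][0]["review_body"])   # input text
print(dataset["train"][0]["stars"])         # label: 1-5 stars
```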
3. Exploratory Data Analysis Phase
4. Setup Phase
In this phase, we use Transformer models, e.g. BERT-based models like DistilBERT.
We use "review_body" as input text and the "stars" column for labels which is five classes so, we tokenize texts and feed them to the model.
5. Modeling Phase
Improve your baseline model
In the baseline model (run magic-elevator-6), the validation accuracy is 0.56. To improve on the baseline, we can run the model with the same training config but for more epochs, e.g. run cool-sweep-8.
In this phase, we use Weights & Biases Sweeps to tune hyperparameters. In this example, we will explore different combinations of batch_size, learning_rate, and epochs using a grid search. Due to the long training time required for the BERT-based model, we will train it on only 500 batches and evaluate it on 100 batches (for batch_size=8, 4,000 training samples).
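A sweep configuration for such a grid search could look like the sketch below. The value lists, the metric name, and the `train` function body are illustrative assumptions, not the exact setup used in this report:

```python
import wandb

# Grid search over batch size, learning rate, and number of epochs.
sweep_config = {
    "method": "grid",
    "metric": {"name": "validation/accuracy", "goal": "maximize"},
    "parameters": {
        "batch_size": {"values": [8, 16, 32]},
        "learning_rate": {"values": [1e-5, 5e-5, 1e-4]},
        "epochs": {"values": [1, 2]},
    },
}

def train():
    # Placeholder: read the hyperparameters for this sweep run from wandb.config
    # and run the actual fine-tuning loop (not shown here).
    with wandb.init() as run:
        config = run.config
        ...

sweep_id = wandb.sweep(sweep_config, project="amazon-reviews-sentiment")
wandb.agent(sweep_id, function=train)
```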
Exploring Hyperparameter Combinations With Sweeps
In this case, the optimal values were batch_size=8 and learning_rate=1e-4. For better results, we can fine-tune the model for more epochs.
6. Evaluation Phase
In this example, the validation set contains 5,000 samples and the test set contains 5,000 samples, with 1,000 samples per class in each split, so the data is balanced and we can use accuracy as the evaluation metric.
For the final model, we can use optimal hyperparameters and fine-tune the model for more epochs.
Fine-tuning hyperparameters
The following hyperparameters were used during training (see the sketch after this list):
- learning_rate: 2e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 5
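A minimal Hugging Face TrainingArguments sketch that mirrors these settings, assuming the Trainer API was used (the output directory and evaluation/logging options are illustrative; the Adam settings listed above match the Trainer's default optimizer, so they are not set explicitly):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="amazon-reviews-sentiment",  # assumed output path
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    seed=42,
    lr_scheduler_type="linear",
    evaluation_strategy="epoch",
    report_to="wandb",                      # log metrics to Weights & Biases
)
```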
Accuracy
The fine-tuned model was evaluated on the validation and test sets of amazon_reviews_multi.
- Accuracy (exact) is the exact match of the number of stars.
- Accuracy (off-by-1) is the percentage of reviews where the number of stars the model predicts differs by a maximum of 1 from the number given by the human reviewer.
| Split | Accuracy (exact) | Accuracy (off-by-1) |
|---|---|---|
| Dev set | 56.96% | 85.50% |
| Test set | 57.36% | 85.58% |
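Both metrics can be computed as in the sketch below, where `predictions` and `references` are assumed to be integer star ratings:

```python
import numpy as np

def accuracy_exact(predictions, references):
    """Fraction of reviews where the predicted star rating matches exactly."""
    predictions, references = np.asarray(predictions), np.asarray(references)
    return (predictions == references).mean()

def accuracy_off_by_1(predictions, references):
    """Fraction of reviews where the prediction differs by at most 1 star."""
    predictions, references = np.asarray(predictions), np.asarray(references)
    return (np.abs(predictions - references) <= 1).mean()
```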
7. Deployment Phase
You can deploy the model to Hugging Face Spaces.
Clone your Space, copy src/app.py and src/requirements.txt from the GitHub repository into it, then commit and push the files:
!git clone https://huggingface.co/spaces/<username>/<space-name>
%cd <space-name>
!git add app.py
!git add requirements.txt
!git commit -m "Add application file"
!git push
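For reference, the app.py of such a Space typically looks like the hypothetical sketch below, using Gradio and the transformers pipeline; the actual src/app.py in the repository may differ, and the model ID is a placeholder:

```python
import gradio as gr
from transformers import pipeline

# Hypothetical sketch; replace the placeholder with your fine-tuned model ID.
classifier = pipeline("text-classification", model="<username>/<model-name>")

def predict(review: str):
    # Return the predicted star rating and its confidence for the given review text.
    result = classifier(review)[0]
    return {result["label"]: result["score"]}

demo = gr.Interface(
    fn=predict,
    inputs=gr.Textbox(lines=5, label="Review text"),
    outputs=gr.Label(label="Predicted stars"),
)
demo.launch()
```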