Multilabel Emotion Classification with Weights & Biases and Hugging Face
In this report, we will use the GoEmotions dataset, containing 58K labeled Reddit posts across 28 emotion classes, to build and evaluate multi-label classification models with Hugging Face and PyTorch.
Created on October 2 | Last edited on January 31

Image courtesy of istock
Table of Contents
- Introduction and Objectives
- The Data!
- Data Exploration
- Preprocessing
- Project Kick-off
- Hyperparameter Tuning with Sweeps
- Training Setup
- Results and Evaluation
- Training results
- Hyperparameter Importance and Sweeps
- Problematic Classes by AUC Score
Introduction and Objectives
In Natural Language Processing, applications for classifying human emotion or response abound: determining how people feel about a product, a situation, or a response, and more generally any scenario where that reaction can inform a decision or an action. However, traditional classification setups, where the predicted class ranges from negative to neutral to positive, can leave much of the nuance of human emotion and related behavior unaccounted for.

Multi-label text classification comes to the rescue here: every applicable label, from none to one to many, is a possible output.
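To make the "none to one to many" idea concrete, here is a minimal sketch of how multi-label predictions are commonly read off a model's raw scores. It assumes each label gets its own sigmoid probability and a fixed 0.5 threshold; the four-label subset, the logit values, and the `predict_labels` helper are illustrative assumptions, not taken from the model trained in this report.

```python
import math

# Hypothetical subset of labels; GoEmotions has 28 in total.
LABELS = ["amusement", "approval", "gratitude", "sadness"]


def sigmoid(x: float) -> float:
    """Map a raw score (logit) to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(logits, threshold=0.5):
    """Return every label whose sigmoid probability reaches the threshold."""
    return [
        label
        for label, logit in zip(LABELS, logits)
        if sigmoid(logit) >= threshold
    ]


# Each label is scored independently, so zero, one, or many can be selected.
none_selected = predict_labels([-2.0, -1.5, -3.0, -0.7])
one_selected = predict_labels([-2.0, 1.2, -3.0, -0.7])
many_selected = predict_labels([2.1, 0.4, 1.7, -0.7])

print(none_selected, one_selected, many_selected)
```

Unlike a softmax, which forces the probabilities to compete and sum to one, independent sigmoids let several labels pass the threshold at once, or none at all.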
We will build a PyTorch trainer using the Transformers library's SqueezeBERT model, complete with its default tokenizer, which is fast and furious. The SqueezeBERT model was proposed in "SqueezeBERT: What can computer vision teach NLP about efficient neural networks?" by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer. It’s a bidirectional transformer similar to the BERT model, and it provides a fast and effective path to evaluating multiple multi-label classifiers using Weights & Biases.
The Data!
The GoEmotions dataset, courtesy of Hugging Face's datasets Hub, contains "58k carefully curated Reddit comments labeled for 27 emotion categories or Neutral." It is rare to find substantial labeled text datasets with more than a few emotions included, let alone 27 + neutral. These labels offer a route to much more intricate classification experimentation, including examples in classes from amusement, approval, and caring to disgust, grief, and sadness.
The dataset features an ID column, the comment text, and one or more label indices from 0 to 27 per record. In this work we use the smaller, simplified version of the dataset with predefined train/validation/test splits, with the following distribution:
- Training: 43410 records
- Validation: 5426 records
- Test: 5427 records
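As a quick arithmetic check on the split sizes listed above, the short sketch below sums the three counts and computes each split's share of the total. The variable names are illustrative; only the record counts come from this report.

```python
# Record counts for the simplified GoEmotions splits, as listed above.
splits = {"train": 43410, "validation": 5426, "test": 5427}

total = sum(splits.values())

# Share of each split, rounded to three decimal places.
fractions = {name: round(count / total, 3) for name, count in splits.items()}

print(total)
print(fractions)
```

The counts sum to 54,263 records, and the split works out to roughly 80/10/10.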
Data Exploration
The table below contains a sample of 1000 records from the GoEmotions training dataset. Fair warning: it's Reddit, so here be colorful language and perspectives!
Preprocessing
Labels are one-hot encoded prior to modeling: each of the 28 label values becomes its own binary column, set to 1 if that label applies to a record and 0 otherwise. Because a record can carry several labels, more than one column can be 1 in the same row. The result looks like the following:
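As a minimal sketch of this encoding step, the snippet below turns a list of label indices into a 28-element binary vector. The `multi_hot` helper and the example label IDs are assumptions for illustration; the report's actual preprocessing code is not shown here.

```python
NUM_LABELS = 28  # 27 emotion categories plus neutral


def multi_hot(label_ids, num_labels=NUM_LABELS):
    """Build a binary vector with a 1 at every index listed in label_ids."""
    vector = [0] * num_labels
    for label_id in label_ids:
        if not 0 <= label_id < num_labels:
            raise ValueError(f"label id {label_id} is out of range")
        vector[label_id] = 1
    return vector


# A record tagged with two labels gets two 1s in its row.
encoded = multi_hot([2, 27])
# A record with no labels is all zeros.
empty = multi_hot([])

print(encoded)
```

A fixed-width binary target like this is the shape expected by per-label losses such as binary cross-entropy.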

Project Kick-off
```python
import wandb

wandb.login()
wandb_project = "HuggingFace_multilabel_emotion_classification"
```
Hyperparameter Tuning with Sweeps
One of the most time-consuming and guesswork-riddled parts of training machine learning models is determining the optimal hyperparameter configuration. Grid, random, and Bayesian search all require research, familiarity with the algorithm, and the time and compute resources to test multiple configurations before landing on the right setup for a particular problem.
Weights & Biases sweeps provide an elegant solution, allowing ease of experimentation across configurations.
```python
sweep_config = {
    'method': 'random',  # options: grid, random, bayes
    'metric': {
        'name': 'auc_score',
        'goal': 'maximize',
    },
    'parameters': {
        'learning_rate': {'values': [5e-5, 3e-5]},
        'batch_size': {'values': [32, 64]},
        'epochs': {'value': 10},
        'dropout': {'values': [0.3, 0.4, 0.5]},
        'tokenizer_max_len': {'value': 40},
    },
}

sweep_id = wandb.sweep(sweep_config, project='HuggingFace_multilabel_emotion_classification')
```
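To get a feel for the size of this search space, the sketch below enumerates every combination of the three parameters that vary in the sweep configuration above (epochs and tokenizer_max_len are fixed to single values). The `search_space` dictionary simply mirrors those values; enumerating them by hand like this is only an illustration and is not something the sweep itself requires.

```python
from itertools import product

# The parameters that take multiple values in the sweep configuration.
search_space = {
    "learning_rate": [5e-5, 3e-5],
    "batch_size": [32, 64],
    "dropout": [0.3, 0.4, 0.5],
}

# Cartesian product of all value lists: one dict per distinct configuration.
combinations = [
    dict(zip(search_space, values))
    for values in product(*search_space.values())
]
n_combinations = len(combinations)

print(n_combinations)
print(combinations[0])
```

There are 2 x 2 x 3 = 12 distinct configurations, so a random sweep limited to six runs can cover at most half of them.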
Training Setup
The wandb.init() statement initializes the run; each execution of the trainer function is one run. We pass the sweep configuration to the trainer function, which uses it to set hyperparameters such as batch_size, dropout, and epochs. Another wandb call is wandb.watch(), which tracks the model's gradients so we can see how they evolve during training. This adds explainability wherever possible.
The next occurrence of wandb is wandb.log(), which records the metrics we want to follow as training progresses. In our case, we log the epoch, the training and validation loss, and the AUC score.
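The AUC score logged here is read from an "auc_micro" key in the training code, which suggests a micro-averaged ROC AUC. The sketch below shows one way such a score can be computed, assuming the usual definition: flatten all (label, score) pairs across classes, then measure how often a positive outranks a negative. The `roc_auc` and `micro_auc` helpers and the toy data are assumptions for illustration; the report's own log_metrics function is not shown and may differ.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def micro_auc(label_matrix, score_matrix):
    """Micro-average: pool every class's labels and scores, then score once."""
    flat_labels = [y for row in label_matrix for y in row]
    flat_scores = [s for row in score_matrix for s in row]
    return roc_auc(flat_labels, flat_scores)


# Toy example: 3 records, 3 classes (rows are records, columns are classes).
labels = [[1, 0, 0], [0, 1, 1], [0, 0, 1]]
scores = [[0.9, 0.2, 0.1], [0.3, 0.8, 0.4], [0.2, 0.6, 0.7]]

auc_micro = micro_auc(labels, scores)
print(auc_micro)
```

Because micro-averaging pools all classes, frequent labels dominate the score, which is one reason a per-class breakdown is still worth looking at.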
Finally, to start the sweep over different hyperparameter combinations, we call wandb.agent(), passing the sweep_id for the configuration and the trainer function. We also pass count=6 to limit the number of runs we want to perform. Once training begins, we can monitor training results in real time in our W&B dashboard:
```python
import torch
import torch.nn as nn
import wandb
from tqdm import tqdm

# build_dataset, build_dataloader, ret_model, ret_optimizer, ret_scheduler,
# train_fn, eval_fn, and log_metrics are helper functions not shown in this report.


def trainer(config=None):
    with wandb.init(config=config):
        # Hyperparameters for this run, supplied by the sweep controller.
        config = wandb.config

        train_dataset, valid_dataset = build_dataset(config.tokenizer_max_len)
        train_data_loader, valid_data_loader = build_dataloader(
            train_dataset, valid_dataset, config.batch_size
        )
        print("Length of Train Dataloader: ", len(train_data_loader))
        print("Length of Valid Dataloader: ", len(valid_data_loader))

        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

        # Total number of training steps across all epochs, used by the scheduler.
        n_train_steps = int(len(train_dataset) / config.batch_size * config.epochs)

        model = ret_model(n_train_steps, config.dropout)
        optimizer = ret_optimizer(model)
        scheduler = ret_scheduler(optimizer, n_train_steps)
        model.to(device)
        model = nn.DataParallel(model)
        wandb.watch(model)  # track gradients during training

        n_epochs = config.epochs
        best_val_loss = float("inf")
        for epoch in tqdm(range(n_epochs)):
            train_loss = train_fn(train_data_loader, model, optimizer, device, scheduler)
            eval_loss, preds, labels = eval_fn(valid_data_loader, model, device)

            auc_score = log_metrics(preds, labels)["auc_micro"]
            print("AUC score: ", auc_score)

            avg_train_loss = train_loss / len(train_data_loader)
            avg_val_loss = eval_loss / len(valid_data_loader)
            wandb.log({
                "epoch": epoch + 1,
                "train_loss": avg_train_loss,
                "val_loss": avg_val_loss,
                "auc_score": auc_score,
            })
            print("Average Train loss: ", avg_train_loss)
            print("Average Valid loss: ", avg_val_loss)

            # Keep only the checkpoint with the lowest validation loss so far.
            if avg_val_loss < best_val_loss:
                best_val_loss = avg_val_loss
                torch.save(model.state_dict(), "./best_model.pt")
                print("Model saved as current val_loss is: ", best_val_loss)


wandb.agent(sweep_id, function=trainer, count=6)
```
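The trainer derives the scheduler's step count from the training set size, the batch size, and the number of epochs. The sketch below works through that arithmetic for the two batch sizes in the sweep. It assumes the data loader keeps the final partial batch and that the scheduler steps once per batch; the helper names are hypothetical, and only the record count, batch sizes, and epoch count come from this report.

```python
import math


def steps_per_epoch(n_examples, batch_size, drop_last=False):
    """Number of batches one pass over the data produces."""
    if drop_last:
        return n_examples // batch_size
    return math.ceil(n_examples / batch_size)


def total_training_steps(n_examples, batch_size, epochs):
    """Batches per epoch times the number of epochs."""
    return steps_per_epoch(n_examples, batch_size) * epochs


N_TRAIN = 43410  # training records in the simplified dataset
EPOCHS = 10      # fixed in the sweep configuration

steps_bs32 = total_training_steps(N_TRAIN, 32, EPOCHS)
steps_bs64 = total_training_steps(N_TRAIN, 64, EPOCHS)

# The trainer's estimate divides first and truncates once at the end.
estimate_bs32 = int(N_TRAIN / 32 * EPOCHS)

print(steps_bs32, steps_bs64, estimate_bs32)
```

Under these assumptions the truncated estimate (13,565) falls a few steps short of the number of batches actually processed (13,570) at batch size 32, a difference small enough that it is unlikely to matter for a learning-rate schedule.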
Results and Evaluation
We can see that the first run, radiant-sweep-1, performed best in terms of both AUC score and validation loss, though its training loss was slightly higher across steps.
Training results
Hyperparameter Importance and Sweeps
Below we see which of our hyperparameters were the best predictors of, and highly correlated to, desirable values of AUC score.
Correlations capture the linear relationship between an individual hyperparameter and a metric value. Dropout had a negative correlation with AUC, while batch size and learning rate both had positive correlations.
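For readers who want to see what such a correlation measures, here is a minimal sketch of the Pearson coefficient between one hyperparameter and a metric. The dropout and AUC values are made up for illustration and are not the results of the sweep in this report; the `pearson` helper is likewise only a sketch and makes no claim about how the W&B dashboard computes its figures.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical runs: one dropout value and one final AUC score per run.
dropout = [0.3, 0.3, 0.4, 0.4, 0.5, 0.5]
auc = [0.95, 0.94, 0.93, 0.94, 0.91, 0.92]

r = pearson(dropout, auc)
print(round(r, 3))
```

A value near -1 means AUC tends to fall as dropout rises, and a value near +1 means the two rise together. With only a handful of runs, a coefficient like this is best read as a hint rather than a firm conclusion.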
Problematic Classes by AUC Score
Below we have custom charts for different classes whose scores were middling at best. Examples from these classes help demonstrate why these may be more challenging to predict accurately.
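One way to surface weak classes is to compute AUC separately for each label column and sort the results from lowest to highest. The sketch below does this on a tiny, made-up example. The class subset, labels, and scores are hypothetical and do not reflect which classes were actually problematic in this report, and the helper functions are assumptions rather than the code behind the charts.

```python
def roc_auc(labels, scores):
    """Fraction of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def rank_classes_by_auc(label_matrix, score_matrix, class_names):
    """Return (class, AUC) pairs sorted from weakest to strongest."""
    results = []
    for j, name in enumerate(class_names):
        column_labels = [row[j] for row in label_matrix]
        column_scores = [row[j] for row in score_matrix]
        results.append((name, roc_auc(column_labels, column_scores)))
    return sorted(results, key=lambda pair: pair[1])


# Hypothetical data: 5 records, 3 classes (rows are records).
class_names = ["gratitude", "grief", "realization"]
labels = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 1], [0, 1, 0]]
scores = [
    [0.9, 0.2, 0.60],
    [0.1, 0.7, 0.50],
    [0.2, 0.4, 0.65],
    [0.8, 0.1, 0.55],
    [0.3, 0.3, 0.70],
]

ranked = rank_classes_by_auc(labels, scores, class_names)
for name, auc in ranked:
    print(f"{name}: {auc:.3f}")
```

An AUC near 0.5 means the model ranks positives for that class no better than chance, which makes those classes natural candidates for closer inspection.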
Here we explore results from these categories.
Here we examine classes that humans have a tough time distinguishing, to see how the models handled their classification. This could give us an idea of classes that may need to be combined or re-evaluated.