
Multilabel Emotion Classification with Weights & Biases and Hugging Face

In this report, we will use the GoEmotions dataset, containing 58K labeled Reddit posts across 28 emotion classes, to build and evaluate multi-label classification models with Hugging Face and PyTorch.
Created on October 2|Last edited on January 31
Image courtesy of istock




Introduction and Objectives

In Natural Language Processing, applications for classifying human emotion or response abound: determining how people feel about a product, a situation, or a response, and more generally any scenario where that reaction can inform a decision or an action. However, traditional classification setups, where the predicted class ranges from negative to neutral to positive, can leave much of the nuance of human emotion and related behavior unaccounted for.
Image courtesy of Arghyadeep Das (original article)
Multi-label text classification comes to the rescue here: every applicable label, from none to one to many, is a possible output.
We will build a PyTorch trainer around the Transformers library's SqueezeBERT model, complete with its default tokenizer, which is fast and furious. The SqueezeBERT model was proposed in "SqueezeBERT: What can computer vision teach NLP about efficient neural networks?" by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, Kurt W. Keutzer (original paper here). It’s a bidirectional transformer similar to the BERT model, and it provides a fast and effective path to evaluating multiple multi-label classifiers using Weights & Biases.
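To make the "none to one to many" idea concrete, here is a minimal sketch of how multi-label outputs are commonly decoded: each label gets its own sigmoid probability, and every label that clears a threshold is kept. The three-label subset, the logits, and the 0.5 threshold are illustrative assumptions, not values from this report.

```python
import math

# Hypothetical label subset and raw model scores (logits), for illustration only.
LABELS = ["amusement", "approval", "sadness"]


def sigmoid(x: float) -> float:
    """Map a raw score to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode(logits, threshold=0.5):
    """Return every label whose own probability clears the threshold."""
    return [name for name, z in zip(LABELS, logits) if sigmoid(z) >= threshold]


print(decode([-2.0, -1.5, -3.0]))  # no label clears the threshold
print(decode([2.0, -1.0, -3.0]))   # exactly one label
print(decode([1.2, 0.8, -3.0]))    # two labels at once
```

Because each label is scored independently, the probabilities do not need to sum to one, unlike a softmax over mutually exclusive classes.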

The Data!

The GoEmotions dataset, courtesy of Hugging Face's datasets Hub, contains "58k carefully curated Reddit comments labeled for 27 emotion categories or Neutral." It is rare to find substantial labeled text datasets with more than a few emotions included, let alone 27 + neutral. These labels offer a route to much more intricate classification experimentation, including examples in classes from amusement, approval, and caring to disgust, grief, and sadness.
The dataset features an ID column, the comment text, and one or more labels from 0-27 per comment. In this work we use the smaller, simplified version of the dataset with predefined train/validation/test splits, with the following distribution:
  • Training: 43410 records
  • Validation: 5426 records
  • Test: 5427 records
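As a quick check on the split sizes listed above, the short sketch below adds them up and computes each split's share. It uses only the three record counts stated in this report.

```python
# Record counts for the simplified GoEmotions splits, as listed above.
splits = {"train": 43410, "validation": 5426, "test": 5427}

total = sum(splits.values())
fractions = {name: round(count / total, 3) for name, count in splits.items()}

print(total)      # total records across the three splits
print(fractions)  # share of each split, rounded to three decimals
```

The splits work out to roughly 80/10/10, and the total is somewhat below the 58K figure quoted for the full dataset, which fits with this being the smaller, simplified version.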

Data Exploration

The table below contains a sample of 1000 records from the GoEmotions training dataset. Fair warning: it's Reddit, so here be colorful language and perspectives!



[Interactive table: sample of GoEmotions training records, with columns ID, Text, Labels, and Label_Name]

Preprocessing

Labels are one-hot (more precisely, multi-hot) encoded prior to modeling: each of the 28 label values becomes its own binary column, set to 1 if that label applies to a comment and 0 otherwise. The result looks like the following:
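As a minimal sketch of this encoding (not the report's own preprocessing code), the hypothetical helper `multi_hot` below turns a list of label ids into a 28-element binary vector. The example label ids are chosen arbitrarily.

```python
NUM_LABELS = 28  # 27 emotion categories plus neutral, indexed 0-27


def multi_hot(label_ids, num_labels=NUM_LABELS):
    """Return a binary vector with a 1 at every index listed in label_ids."""
    vector = [0] * num_labels
    for label_id in label_ids:
        vector[label_id] = 1
    return vector


# A comment tagged with two labels gets two 1s; all other columns stay 0.
row = multi_hot([2, 27])
print(row)
```

A vector in this form can be used directly as the target for a per-label binary loss.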


Project Kick-off

import wandb

wandb.login()
wandb_project = "HuggingFace_multilabel_emotion_classification"

Hyperparameter Tuning with Sweeps

One of the most time-consuming and guesswork-riddled parts of training machine learning models is determining the optimal hyperparameter configuration. Grid, random, and Bayesian search all require research, algorithm familiarity, and the time and compute resources to test multiple configurations before landing on the right setup for a particular problem.
Weights & Biases sweeps provide an elegant solution, allowing ease of experimentation across configurations.
sweep_config = {
    'method': 'random',  # one of: grid, random, bayes
    'metric': {
        'name': 'auc_score',
        'goal': 'maximize'
    },
    'parameters': {
        # Searched values
        'learning_rate': {
            'values': [5e-5, 3e-5]
        },
        'batch_size': {
            'values': [32, 64]
        },
        'dropout': {
            'values': [0.3, 0.4, 0.5]
        },
        # Fixed values
        'epochs': {'value': 10},
        'tokenizer_max_len': {'value': 40},
    }
}

sweep_id = wandb.sweep(sweep_config, project='HuggingFace_multilabel_emotion_classification')
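To get a feel for the size of this search space, the sketch below enumerates every combination of the three searched parameters from the configuration above. It runs locally and does not call W&B; the variable names are illustrative.

```python
from itertools import product

# The searched values from the sweep configuration above (fixed values omitted).
search_space = {
    "learning_rate": [5e-5, 3e-5],
    "batch_size": [32, 64],
    "dropout": [0.3, 0.4, 0.5],
}

# Cartesian product of all value lists: one dict per candidate configuration.
combos = [dict(zip(search_space, values)) for values in product(*search_space.values())]

print(len(combos))
print(combos[0])
```

With 2 x 2 x 3 = 12 possible configurations, a random sweep capped at a handful of runs only samples part of the grid.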

Training Setup

The wandb.init() statement initializes the run; each execution of the trainer function is one run. We pass the sweep config to the trainer function, which uses it to set hyperparameters such as batch_size, dropout, and epochs. Another wandb call, wandb.watch(), tracks the model's gradients so we can see how they evolve during training, which adds some explainability.
Next, wandb.log() records the metrics we want to follow as training progresses. In our case, we log the epoch, the training and validation loss, and the AUC score.
Finally, to start the sweep over different hyperparameter combinations, we call wandb.agent(), passing the sweep_id and the trainer function. We also pass count=6 to limit the number of runs. Once training begins, we can monitor results in real time in our W&B dashboard:
import torch
import torch.nn as nn
from tqdm import tqdm

# build_dataset, build_dataloader, ret_model, ret_optimizer, ret_scheduler,
# train_fn, eval_fn, and log_metrics are project helpers defined elsewhere.

def trainer(config=None):
    with wandb.init(config=config):
        # Inside a sweep, wandb.config holds the hyperparameters chosen for this run.
        config = wandb.config

        train_dataset, valid_dataset = build_dataset(config.tokenizer_max_len)
        train_data_loader, valid_data_loader = build_dataloader(train_dataset, valid_dataset, config.batch_size)
        print("Length of Train Dataloader: ", len(train_data_loader))
        print("Length of Valid Dataloader: ", len(valid_data_loader))

        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

        # Total number of training steps across all epochs.
        n_train_steps = int(len(train_dataset) / config.batch_size * config.epochs)

        model = ret_model(n_train_steps, config.dropout)
        optimizer = ret_optimizer(model)
        scheduler = ret_scheduler(optimizer, n_train_steps)
        model.to(device)
        model = nn.DataParallel(model)
        wandb.watch(model)  # track gradients during training
        n_epochs = config.epochs

        best_val_loss = float("inf")
        for epoch in tqdm(range(n_epochs)):
            train_loss = train_fn(train_data_loader, model, optimizer, device, scheduler)
            eval_loss, preds, labels = eval_fn(valid_data_loader, model, device)
            auc_score = log_metrics(preds, labels)["auc_micro"]
            print("AUC score: ", auc_score)
            avg_train_loss, avg_val_loss = train_loss / len(train_data_loader), eval_loss / len(valid_data_loader)
            wandb.log({
                "epoch": epoch + 1,
                "train_loss": avg_train_loss,
                "val_loss": avg_val_loss,
                "auc_score": auc_score,
            })
            print("Average Train loss: ", avg_train_loss)
            print("Average Valid loss: ", avg_val_loss)

            # Keep only the checkpoint with the lowest validation loss so far.
            # The model is wrapped in nn.DataParallel, so saved keys carry a "module." prefix.
            if avg_val_loss < best_val_loss:
                best_val_loss = avg_val_loss
                torch.save(model.state_dict(), "./best_model.pt")
                print("Model saved as current val_loss is: ", best_val_loss)

# Launch the sweep agent; count caps the number of runs.
wandb.agent(sweep_id, function=trainer, count=6)
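The step count handed to the scheduler in the code above is simple arithmetic, and it can be compared with the number of batches a data loader would actually yield. The sketch below does this for the 43,410 training records and 10 epochs used in this report. It assumes the loader keeps the final partial batch (PyTorch's default `drop_last=False`) and that the scheduler is stepped once per batch, neither of which is shown in the report; both function names are hypothetical.

```python
import math


def scheduler_steps(n_records, batch_size, epochs):
    """Step count as computed in the trainer: records / batch size * epochs, truncated."""
    return int(n_records / batch_size * epochs)


def loader_steps(n_records, batch_size, epochs):
    """Batches actually yielded if the last partial batch of each epoch is kept."""
    return math.ceil(n_records / batch_size) * epochs


for batch_size in (32, 64):
    print(batch_size, scheduler_steps(43410, batch_size, 10), loader_steps(43410, batch_size, 10))
```

Under those assumptions the two counts differ by only a few steps, which matters little here but is worth knowing if the schedule is expected to end exactly on the final batch.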

Results and Evaluation

We can see that the first run, radiant-sweep-1, performed best both in terms of AUC score and validation loss, with slightly higher loss scores across steps in training.

Training results





Hyperparameter Importance and Sweeps

Below we see which of our hyperparameters were the best predictors of, and most highly correlated with, desirable values of AUC score.
Correlations capture the linear relationship between an individual hyperparameter and a metric value. Dropout had a negative correlation with AUC, while batch size and learning rate both had a positive correlation.
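For readers who want to see what such a linear correlation measures, here is a minimal sketch of the Pearson coefficient. It assumes the panel's correlation is the usual Pearson coefficient. The numbers are made up purely for illustration and are NOT results from this report's sweep.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up toy values: a metric that tends to fall as dropout rises.
toy_dropout = [0.3, 0.4, 0.5, 0.3, 0.4, 0.5]
toy_metric = [0.90, 0.88, 0.86, 0.91, 0.87, 0.85]

print(pearson(toy_dropout, toy_metric))  # negative: higher dropout, lower metric
```

A value near -1 or +1 indicates a strong linear relationship, while a value near 0 says only that no linear relationship is visible, not that the hyperparameter is unimportant.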




Problematic Classes by AUC Score

Below we have custom charts for different classes whose scores were middling at best. Examples from these classes help demonstrate why these may be more challenging to predict accurately.
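The report's `log_metrics` helper is not shown, so the following is only a sketch of how per-class AUC and the micro-averaged AUC could be computed. AUC is taken here as the fraction of (positive, negative) pairs that the scores rank correctly, with ties counting as half. The micro average flattens all label columns into one binary problem, which is the convention scikit-learn uses for `average="micro"`. The toy labels and scores are invented for illustration.

```python
def binary_auc(y_true, y_score):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        return None  # undefined when a class has only positives or only negatives
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def per_class_auc(labels, scores):
    """One AUC per label column."""
    n_classes = len(labels[0])
    return [
        binary_auc([row[c] for row in labels], [row[c] for row in scores])
        for c in range(n_classes)
    ]


def micro_auc(labels, scores):
    """Flatten every (example, label) cell into a single binary problem."""
    flat_true = [v for row in labels for v in row]
    flat_score = [v for row in scores for v in row]
    return binary_auc(flat_true, flat_score)


# Invented toy data: 4 examples, 2 labels.
toy_labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
toy_scores = [[0.9, 0.2], [0.3, 0.4], [0.8, 0.7], [0.1, 0.6]]

print(per_class_auc(toy_labels, toy_scores))
print(micro_auc(toy_labels, toy_scores))
```

Looking at the per-class values alongside the micro average helps spot classes that a strong overall score can hide.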



Here we explore results from these categories.



Here we examine classes that humans have a tough time distinguishing, to see how the models handled their classification. This could give us an idea of classes that may need to be combined or re-evaluated.
