How to Fine-Tune Hugging Face Transformers with Weights & Biases

In this report, we will learn how to easily fine-tune a HuggingFace Transformer on a custom dataset. Made by Ayush Thakur using Weights & Biases

🤗 Introduction

In this report, we will take a quick look at the features of the HuggingFace Transformers library. The library provides easy-to-use APIs to download, train, and run inference with state-of-the-art pre-trained models for Natural Language Understanding (NLU) and Natural Language Generation (NLG) tasks. Some of these tasks are sentiment analysis, question answering, and text summarization. You can get a quick summary of the common NLP tasks supported by HuggingFace here.
Today, we're going to fine-tune a DistilBERT transformer for sentiment analysis (binary classification) on the IMDB dataset. If you want to follow along, check out the linked Colab notebook.

⏳ Installation and Imports

For this tutorial, we will need HuggingFace (surprise!) & Weights and Biases.
# Install HuggingFace
!pip install transformers -q
(We will soon look at HuggingFace related imports and what they mean.)
We'll also be using Weights and Biases to automatically log losses, evaluation metrics, model topology, and gradients (for Trainer only). When we say it's easy to install Weights and Biases, we're telling the truth:
# Install Weights and Biases
!pip install wandb -q

# Import wandb
import wandb

# Login with your authentication key
wandb.login()

# setup wandb environment variables
%env WANDB_ENTITY=your-username/your-team-name
%env WANDB_PROJECT=your-project-name
Now that we are ready with the required installations, let's see how easy it is to fine-tune a HuggingFace Transformer on any dataset for any task.

🔧 The Ease of Data Preprocessing with 🤗

In this section, we will see how easy it is to preprocess data for training or inference. The main tool for this is a tokenizer, which is in charge of preparing the inputs for the model. The library contains tokenizers for all of its models, or we can use the AutoTokenizer class (more on this later).

A Word on Tokenizers

Tokenizing a text means splitting it into words or subwords, which are then converted to IDs through a look-up table. But splitting a text into smaller chunks is a task that is harder than it looks. Let's look at the sentence "Don't you love Weights and Biases for experiment tracking?". We can split the sentence by spaces, which would give:
["Don't", "you", "love", "Weights" , "and" , "Biases", "for", "experiment", "tracking?"]
This looks sensible, but if we look at the token "tracking?", we notice that the punctuation is attached to it, which might confuse the model. And "Don't" stands for "do not", so it is better tokenized as ["Do", "n't"]. This is where things start to get complicated, and it is part of the reason each model has its own tokenizer.
That's why we need to import the correct tokenizer for the model of our choice. Check out this well-written summary of tokenizers.
The conversion of tokens to IDs through a look-up table depends on the vocabulary (the set of all unique words and tokens used), which in turn depends on the dataset, the task, and the resulting pre-trained model. A HuggingFace tokenizer automatically downloads the vocabulary used when pretraining or fine-tuning the given model, so we do not need to build our own vocabulary from the dataset for fine-tuning.
We can build the tokenizer using the tokenizer class associated with the model we would like to fine-tune on our custom dataset, or directly with the AutoTokenizer class. The AutoTokenizer.from_pretrained method takes the name of the model and builds the appropriate tokenizer.
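For example, here is a minimal sketch (using the distilbert-base-uncased checkpoint we fine-tune below) of building a tokenizer with AutoTokenizer and inspecting what it produces:

# Build the tokenizer from the checkpoint name; the matching vocabulary is downloaded automatically
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

sentence = "Don't you love Weights and Biases for experiment tracking?"
tokens = tokenizer.tokenize(sentence)          # split into subword tokens
ids = tokenizer.convert_tokens_to_ids(tokens)  # look-up table: token -> ID
encoding = tokenizer(sentence)                 # full encoding, with special tokens added
print(tokens)
print(encoding['input_ids'])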

Download and Prepare Dataset

In this tutorial, we're using the IMDB dataset. You can use any other dataset but the general steps here will remain the same.
!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz
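Each split in the extracted archive contains pos and neg subfolders of text files. The read_imdb_split helper used below is not part of the transformers library; here is a minimal sketch of it, along the lines of the HuggingFace custom-datasets tutorial:

# Assumed helper: read review texts and 0/1 labels from the neg/pos subfolders of a split
from pathlib import Path

def read_imdb_split(split_dir):
    split_dir = Path(split_dir)
    texts, labels = [], []
    for label_dir in ["neg", "pos"]:
        for text_file in (split_dir / label_dir).iterdir():
            texts.append(text_file.read_text(encoding="utf-8"))
            labels.append(0 if label_dir == "neg" else 1)
    return texts, labels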
train_texts, train_labels = read_imdb_split('aclImdb/train')
test_texts, test_labels = read_imdb_split('aclImdb/test')
# Hold out 20% of the training set for validation
from sklearn.model_selection import train_test_split

train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)
from transformers import DistilBertTokenizerFast

MODEL_NAME = 'distilbert-base-uncased'
tokenizer = DistilBertTokenizerFast.from_pretrained(MODEL_NAME)
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)
import tensorflow as tf

train_dataset = tf.data.Dataset.from_tensor_slices((dict(train_encodings), train_labels))
val_dataset = tf.data.Dataset.from_tensor_slices((dict(val_encodings), val_labels))
The exact steps might change depending on the task and the dataset, but the overall methodology for preparing your dataset remains the same. You can learn more about preprocessing the data here.

🎨 HuggingFace Transformer Models

The HuggingFace Transformer models are compatible with native PyTorch and TensorFlow 2.x. Models are standard torch.nn.Module or tf.keras.Model instances, depending on the prefix of the model class name: if it begins with TF, it's a tf.keras.Model. Note that tokenizers are framework agnostic. Check out the summary of models available in HuggingFace Transformers.
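To make the naming convention concrete, here is a small sketch (assuming both TensorFlow and PyTorch are installed) showing that the TF-prefixed class is a Keras model while its counterpart is a torch module:

# TF-prefixed classes are tf.keras.Model subclasses; the others are torch.nn.Module subclasses
import tensorflow as tf
import torch
from transformers import TFDistilBertForSequenceClassification, DistilBertForSequenceClassification

tf_model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
pt_model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')

print(isinstance(tf_model, tf.keras.Model))   # True
print(isinstance(pt_model, torch.nn.Module))  # True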
The easiest way to download a pre-trained Transformer model is to use the appropriate AutoModel class (TFAutoModelForSequenceClassification in our case). The from_pretrained method loads a model either from a local file or directory, or from a pre-trained model configuration provided by HuggingFace. You can find the list of pre-trained models here.
# Import required model class
from transformers import TFDistilBertForSequenceClassification

# Download pre-trained model
model = TFDistilBertForSequenceClassification.from_pretrained(MODEL_NAME)
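Equivalently, you can let the Auto class pick the right architecture from the checkpoint name; a minimal sketch:

# Same model via the Auto class; the checkpoint name determines the architecture
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained(MODEL_NAME)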
If your classification task has a different number of classes, pass num_labels when loading the model:
model = TFDistilBertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=3)
If you also want the model to return all hidden states and all attention weights, enable those outputs:
model = TFDistilBertForSequenceClassification.from_pretrained(MODEL_NAME, output_hidden_states=True, output_attentions=True)
And if you want to change the architecture itself and train from scratch, instantiate the model from a custom configuration instead of a pre-trained checkpoint:
from transformers import DistilBertConfig

config = DistilBertConfig(n_heads=8, dim=512, hidden_dim=4*512)
model = TFDistilBertForSequenceClassification(config)

🎺 Feature-complete Trainer/TFTrainer

You can fine-tune a HuggingFace Transformer using both native PyTorch and TensorFlow 2. HuggingFace provides a simple but feature-complete training and evaluation interface through Trainer()/TFTrainer().
We can train, fine-tune, and evaluate any HuggingFace Transformers model with a wide range of training options and built-in features like metric logging, gradient accumulation, and mixed precision. The trainer can also be used with distributed training strategies, and even on TPUs.

Training Arguments

Before instantiating Trainer/TFTrainer, we need to create a TrainingArguments/TFTrainingArguments object to access all the points of customization during training.
from transformers import TFTrainingArguments

training_args = TFTrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
)
Some notable arguments are:
- output_dir: the directory where checkpoints and outputs are written.
- num_train_epochs: the total number of training epochs.
- per_device_train_batch_size / per_device_eval_batch_size: the batch size per GPU/TPU core/CPU for training and evaluation.
- warmup_steps: the number of steps over which the learning rate is linearly warmed up.
- weight_decay: the amount of weight decay to apply.
- logging_dir: the directory for TensorBoard logs.
- logging_steps: how often, in steps, metrics are logged.
You can learn more about the TFTrainingArguments options here.
If you are using PyTorch Datasets and DataLoaders, use TrainingArguments instead; you can learn more about its arguments here. Note that there are some additional features you can use with TrainingArguments, such as early stopping and label smoothing.
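As a hedged sketch of those PyTorch-only extras (the argument and callback names below, such as label_smoothing_factor and EarlyStoppingCallback, assume a reasonably recent transformers version, and pt_model, pt_train_dataset, and pt_val_dataset stand in for PyTorch counterparts of the objects built above):

# Sketch only: early stopping and label smoothing with the PyTorch Trainer
from transformers import TrainingArguments, Trainer, EarlyStoppingCallback

pt_training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    evaluation_strategy='epoch',       # evaluate at the end of every epoch
    save_strategy='epoch',             # checkpoint cadence must match for load_best_model_at_end
    load_best_model_at_end=True,       # required by EarlyStoppingCallback
    metric_for_best_model='eval_loss',
    label_smoothing_factor=0.1,        # label smoothing
)

pt_trainer = Trainer(
    model=pt_model,                    # a PyTorch model, e.g. DistilBertForSequenceClassification
    args=pt_training_args,
    train_dataset=pt_train_dataset,    # torch.utils.data.Dataset
    eval_dataset=pt_val_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)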

Trainer

Trainer/TFTrainer contains the basic training loop supporting the features mentioned above. The interface is easy to use and helps you set up a decent baseline; you can always fall back to native PyTorch or TensorFlow to build a custom training loop (we sketch the TensorFlow route at the end of this section).
from transformers import TFTrainer

trainer = TFTrainer(
    model=model,                  # the TFDistilBertForSequenceClassification instance from above
    args=training_args,           # the TFTrainingArguments defined above
    train_dataset=train_dataset,  # training data as a tf.data.Dataset
    eval_dataset=val_dataset,     # validation data as a tf.data.Dataset
)
trainer.train() is used to train the model, while trainer.evaluate() is used to evaluate it.
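For example:

# Fine-tune, then evaluate on the validation split
trainer.train()
eval_metrics = trainer.evaluate()
print(eval_metrics)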
If you have Weights and Biases installed, the trainer will automatically log the metrics to your W&B project dashboard.
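If you would rather skip TFTrainer altogether, as mentioned earlier you can fine-tune with native Keras instead. A rough sketch (exact behaviour varies across transformers versions, so treat it as a starting point):

# Native TF 2 / Keras alternative to TFTrainer (sketch; version-dependent details)
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss)

model.fit(
    train_dataset.shuffle(1000).batch(16),
    validation_data=val_dataset.batch(64),
    epochs=3,
)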

🎇 Results

💭 Conclusion and Resources

I hope you found this report helpful. I encourage you to fine-tune a HuggingFace Transformer on a dataset of your choice.
Here are some other reports on HuggingFace Transformers:
Report Gallery