Hyperparameter Optimization for Hugging Face Transformers

This report explains three strategies for hyperparameter optimization for Hugging Face Transformers. Made by Ayush Chaurasia using Weights & Biases


Training an NLP model from scratch can take hundreds of hours. It is much easier to start from a pre-trained model and fine-tune it for a specific task. Using the Hugging Face transformers library, we can quickly load a pre-trained NLP model with several extra layers and run a few fine-tuning epochs on that task. Ray Tune provides high-level abstractions for performing scalable hyperparameter tuning using state-of-the-art (SOTA) tuning algorithms.

Our Hyperparameter Tuning Experiment

In this report, we compare three optimization strategies (Grid Search, Bayesian Optimization, and Population Based Training) to see which one produces the most accurate model in the shortest amount of time.
We use a standard uncased BERT model from Hugging Face transformers and fine-tune it on the RTE dataset from the SuperGLUE benchmark. We will see that the hyperparameters we choose can have a significant impact on final model performance.
We tune the model using the three methods covered in the sections below: Grid Search (our baseline), Bayesian Search with Asynchronous HyperOpt, and Population Based Training.

Ray Tune and W&B

There are many advantages to using Ray Tune with W&B:

Enable W&B tracking

There are two ways of tracking progress through W&B using Tune.
config = {
    ....
    "wandb": {
        "project": "Project_name",
        "api_key": "YOUR_WANDB_API_KEY",  # your W&B API key
        # additional wandb.init() parameters
    }
}

Grid Search (Baseline):

To set up a baseline, we perform a grid search to find the best set of hyperparameters across the space described by the paper authors.
{
    "per_gpu_batch_size": [16, 32],
    "learning_rate": [2e-5, 3e-5, 5e-5],
    "num_epochs": [2, 3, 4]
}
Let us now take a look at the metric visualizations to compare the performance.
analysis = tune.run(
    ...
    # Tune uses this information to parallelize the tuning operation.
    # Replace num_gpu and num_cpu with the resources to give each trial.
    resources_per_trial={"gpu": num_gpu, "cpu": num_cpu},
    config=config,
    ...
)
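To make the size of this baseline concrete, the grid above can be expanded by hand. The following is a minimal sketch in plain Python; the helper name `expand_grid` is hypothetical and is not part of Ray Tune, which expresses the same idea by wrapping each list in `tune.grid_search`.

```python
from itertools import product

# Grid taken from the search space above.
grid = {
    "per_gpu_batch_size": [16, 32],
    "learning_rate": [2e-5, 3e-5, 5e-5],
    "num_epochs": [2, 3, 4],
}


def expand_grid(space):
    """Return one config dict per combination of the listed values."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


configs = expand_grid(grid)
print(len(configs))  # 2 * 3 * 3 = 18 trials
```

Every added hyperparameter multiplies the number of trials, which is why grid search only serves as a baseline on a small space like this one.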

Bayesian Search with Asynchronous HyperOpt

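In Bayesian search, we fit a probabilistic surrogate model that predicts the performance (i.e., the loss) of a given set of hyperparameters, and we use its predictions to choose which hyperparameters to try next. A Gaussian Process is a common choice of surrogate; HyperOpt, the search library named in this section's title, is built around the Tree-structured Parzen Estimator instead. We also combine the search with an early stopping algorithm, Asynchronous HyperBand, which stops poorly performing trials early to avoid wasting resources.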
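The early stopping idea can be illustrated with a toy rule. This is a simplified sketch of asynchronous successive halving, not the scheduler that Ray Tune ships: the class name `AsyncHalvingStopper`, the rung positions, and the reduction factor are all assumptions made for illustration, and a higher metric is assumed to be better.

```python
from collections import defaultdict


class AsyncHalvingStopper:
    """Toy asynchronous successive-halving rule (higher metric is better)."""

    def __init__(self, rungs=(1, 2, 4), reduction_factor=2):
        self.rungs = set(rungs)                # steps at which trials are compared
        self.reduction_factor = reduction_factor
        self.recorded = defaultdict(list)      # rung -> metrics reported so far

    def should_stop(self, step, metric):
        """Record `metric` at `step` and return True if the trial should stop."""
        if step not in self.rungs:
            return False
        scores = self.recorded[step]
        scores.append(metric)
        # Keep the trial only if it ranks in the top 1/reduction_factor of
        # everything reported at this rung so far. No waiting for other trials.
        keep = max(1, len(scores) // self.reduction_factor)
        cutoff = sorted(scores, reverse=True)[keep - 1]
        return metric < cutoff


stopper = AsyncHalvingStopper(rungs=(1,), reduction_factor=2)
decisions = [stopper.should_stop(1, acc) for acc in (0.60, 0.55, 0.70, 0.65)]
print(decisions)  # [False, True, False, False]
```

The decision for each trial uses only the results reported so far, which is what makes the rule asynchronous: a trial never waits for the rest of its cohort to reach the same step.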
For this experiment, we also search over weight_decay and warmup_steps, and extend our search space:
{
    "per_gpu_batch_size": (16, 64),
    "weight_decay": (0, 0.3),
    "learning_rate": (1e-5, 5e-5),
    "warmup_steps": (0, 500),
    "num_epochs": (2, 5)
}
We run a total of 60 trials, with 15 of these used for initial random searches.
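The initial random phase can be pictured as drawing 15 configurations from the ranges above before the model-guided search takes over. The sketch below uses plain uniform sampling from the standard library as a stand-in. Which parameters are integers, and the choice of uniform distributions, are assumptions; the distributions used in the original experiment are not stated here.

```python
import random

# (low, high) ranges taken from the search space above.
space = {
    "per_gpu_batch_size": (16, 64),
    "weight_decay": (0, 0.3),
    "learning_rate": (1e-5, 5e-5),
    "warmup_steps": (0, 500),
    "num_epochs": (2, 5),
}
# Assumption: these three are sampled as integers, the rest as floats.
INTEGER_KEYS = {"per_gpu_batch_size", "warmup_steps", "num_epochs"}


def sample_config(rng):
    """Draw one configuration uniformly at random from `space`."""
    config = {}
    for name, (low, high) in space.items():
        if name in INTEGER_KEYS:
            config[name] = rng.randint(low, high)  # both ends inclusive
        else:
            config[name] = rng.uniform(low, high)
    return config


rng = random.Random(0)  # fixed seed so the draw is repeatable
initial_configs = [sample_config(rng) for _ in range(15)]
print(len(initial_configs))  # 15
```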
Let us now look at the results.

Population Based Training

Population Based Training uses guided hyperparameter search but does not need to restart training for new hyperparameter configurations. Instead of discarding poorly performing trials, we exploit well-performing runs by copying their network weights and hyperparameters, and then explore new hyperparameter configurations while continuing to train.
The basic idea behind the algorithm, in layman's terms:
This is the search space that we will use:
{
    "per_gpu_batch_size": [16, 32, 64],
    "weight_decay": (0, 0.3),
    "learning_rate": (1e-5, 5e-5),
    "num_epochs": [2, 3, 4, 5]
}
We run only 8 trials, far fewer than for Bayesian Optimization, because poorly performing trials are not stopped; instead, they copy from the good ones.
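The exploit-and-explore step described above can be sketched in a few lines of plain Python. This is an illustrative toy, not the scheduler used in the experiment: the function name `exploit_and_explore`, the 25% quantile, and the 0.8/1.2 perturbation factors are assumptions, and only the two continuous hyperparameters are perturbed.

```python
import copy
import random

# Continuous ranges taken from the search space above.
BOUNDS = {"learning_rate": (1e-5, 5e-5), "weight_decay": (0.0, 0.3)}


def exploit_and_explore(population, rng, quantile=0.25, factors=(0.8, 1.2)):
    """One PBT step over members with "score", "weights" and "hparams" keys."""
    ranked = sorted(population, key=lambda m: m["score"], reverse=True)
    cut = max(1, int(len(ranked) * quantile))
    top, bottom = ranked[:cut], ranked[-cut:]
    for member in bottom:
        donor = rng.choice(top)
        # Exploit: copy weights and hyperparameters from a good trial.
        member["weights"] = copy.deepcopy(donor["weights"])
        member["hparams"] = dict(donor["hparams"])
        # Explore: perturb the copied values, clamped to the search range.
        for name, (low, high) in BOUNDS.items():
            value = member["hparams"][name] * rng.choice(factors)
            member["hparams"][name] = min(max(value, low), high)
    return population


rng = random.Random(0)
# Eight stand-in trials; "weights" is a placeholder for real model weights.
population = [
    {
        "score": 0.60 + 0.01 * i,
        "weights": [f"weights-{i}"],
        "hparams": {"learning_rate": 1e-5 + 5e-6 * i, "weight_decay": 0.03 * i},
    }
    for i in range(8)
]
exploit_and_explore(population, rng)
print(population[0]["weights"])  # copied from one of the two best trials
```

With 8 trials and a 25% quantile, the two weakest trials are overwritten at each step while the other six keep training unchanged, so no trial has to be restarted from scratch.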
Let's look at the results.

The Winner: Population Based Training of Our Hugging Face Model

The key takeaway here is that Population Based Training is the most effective approach to hyperparameter optimization of our Hugging Face transformer model. However, we uncovered a few other insights about hyperparameter tuning for NLP models that might be of broader interest:

A Few Final Thoughts

Some of the critical points to notice in these experiments: