Hyperparameter Optimization for HuggingFace Transformers
This article explains three strategies for hyperparameter optimization for HuggingFace Transformers, using W&B to track our experiments.
Training an NLP model from scratch takes hundreds of hours. Instead, it is much easier to use a pre-trained model and fine-tune it for a specific task.
Using the HuggingFace transformers library, we can quickly load a pre-trained NLP model with several extra layers and run a few fine-tuning epochs on a specific task. Ray Tune provides high-level abstractions for performing scalable hyperparameter tuning using SOTA tuning algorithms.
In this article, we compare 3 different optimization strategies — Grid Search, Bayesian Optimization, and Population-Based Training — to see which one results in a more accurate model in the shortest amount of time.
We use a standard uncased BERT model from Hugging Face transformers, and we fine-tune it on the RTE dataset from the SuperGLUE benchmark. We will see that the hyperparameters we choose can have a significant impact on our final model performance.
Table of Contents
Our Hyperparameter Tuning Experiment
Grid Search (Baseline)
Bayesian Search with Asynchronous HyperOpt
Population-Based Training
A Few Final Thoughts
Our Hyperparameter Tuning Experiment
In this article, we'll experiment with tuning the model using the following tools and methods:
Ray Tune and W&B
- Ray Tune provides scalable implementations of state-of-the-art hyperparameter tuning algorithms
- Experiments can be scaled easily from a notebook to GPU-powered servers without any change in code
- Experiments can be parallelized across GPUs in 2 lines of code
Enable W&B tracking
There are two ways of tracking progress through W&B using Tune.
- You can pass WandbLogger as a logger when calling tune.run. This tracks all the metrics reported to Tune.
- You can use the @wandb_mixin function decorator and invoke wandb.log to track your desired metrics. Tune initializes the W&B run using the information passed in the config dictionary.
config = {
    # ... your training hyperparameters ...
    "wandb": {
        "project": "Project_name",
        "api_key": "<your W&B API key>",
        # additional wandb.init() parameters
    }
}
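For the second approach, a minimal sketch of a @wandb_mixin trainable might look like the following (the function name train_fn and the logged metric are illustrative, not from the article):

import wandb
from ray import tune
from ray.tune.integration.wandb import wandb_mixin

@wandb_mixin
def train_fn(config):
    for epoch in range(config["num_epochs"]):
        loss = 1.0 / (epoch + 1)   # placeholder for a real training epoch
        wandb.log({"loss": loss})  # tracked in W&B
        tune.report(loss=loss)     # reported to Tune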
Grid Search (Baseline)
To set up a baseline, we will perform a grid search over the hyperparameter space recommended by the BERT authors:
{"per_gpu_batch_size": [16, 32],"learning_rate": [2e-5, 3e-5, 5e-5],"num_epochs": [2, 3, 4]}
We launch the trials with tune.run, telling Tune how many CPUs and GPUs each trial may use:

analysis = tune.run(
    ...,
    resources_per_trial={
        "gpu": num_gpus,  # GPUs per trial
        "cpu": num_cpus,  # CPUs per trial
    },  # Tune will use this information to parallelize the tuning operation
    config=config,
    ...,
)

Let us now take a look at the metric visualizations to compare the performance.
[W&B panel: Run set (19 runs)]
Bayesian Search with Asynchronous HyperOpt
In Bayesian search, we fit a Gaussian Process model that predicts the performance (i.e., the loss) of hyperparameter configurations and use it to choose which configurations to try next. We also combine this with an early-stopping algorithm, Asynchronous HyperBand, which stops badly performing trials early to avoid wasting resources.

For this experiment, we also search over weight_decay and warmup_steps, and extend our search space:
{"per_gpu_batch_size": (16, 64),"weight_decay": (0, 0.3),"learning_rate": (1e-5, 5e-5),"warmup_steps": (0, 500),"num_epochs": (2, 5)}
We run a total of 60 trials, the first 15 of which are random configurations used to seed the optimizer.
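A minimal sketch of this setup in Ray Tune is shown below. The trainable train_rte and the metric name eval_acc are stand-in names, and the import paths correspond to older Ray releases:

from ray import tune
from ray.tune.schedulers import ASHAScheduler
from ray.tune.suggest.hyperopt import HyperOptSearch

config = {
    "per_gpu_batch_size": tune.randint(16, 65),   # upper bound exclusive
    "weight_decay": tune.uniform(0.0, 0.3),
    "learning_rate": tune.loguniform(1e-5, 5e-5),
    "warmup_steps": tune.randint(0, 501),
    "num_epochs": tune.randint(2, 6),
}

search_alg = HyperOptSearch(metric="eval_acc", mode="max", n_initial_points=15)
scheduler = ASHAScheduler(metric="eval_acc", mode="max", max_t=5, grace_period=1)

analysis = tune.run(
    train_rte,                    # your trainable function
    config=config,
    search_alg=search_alg,
    scheduler=scheduler,
    num_samples=60,               # 60 trials in total
    resources_per_trial={"gpu": 1, "cpu": 4},
)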
Let us now look at the results.
[W&B panel: Run set (61 runs)]
Population-Based Training
Population-based training uses guided hyperparameter search but does not need to restart training for new hyperparameter configurations. Instead of discarding bad-performing trials, we exploit good-performing runs by copying their network weights and hyperparameters and then explore new hyperparameter configurations while continuing to train.

The basic idea behind the algorithm in layman's terms:
- Run the hyperparameter optimization process for a number of sampled configurations for a given time step (or number of iterations) T.
- After every T iterations, compare the runs; copy the weights of the well-performing runs to the badly performing ones, and perturb the latter's hyperparameter values toward those of the well-performing runs.
- Terminate the worst-performing runs.

Although the algorithm's idea seems simple, a lot of complex optimization math goes into building this from scratch. Tune provides a scalable and easy-to-use implementation of the SOTA PBT algorithm.
This is the search space that we will use:
{"per_gpu_batch_size": [16, 32, 64],"weight_decay": (0, 0.3),"learning_rate": (1e-5, 5e-5),"num_epochs": [2, 3, 4, 5]}
We run only 8 trials, far fewer than with Bayesian Optimization: instead of stopping bad trials, PBT overwrites them with copies of the good ones. A configuration sketch follows below.
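Here is a minimal sketch of this setup (again with the stand-in names train_rte and eval_acc):

import random

from ray import tune
from ray.tune.schedulers import PopulationBasedTraining

# Initial values for each trial are sampled from this space.
config = {
    "per_gpu_batch_size": tune.choice([16, 32, 64]),
    "weight_decay": tune.uniform(0.0, 0.3),
    "learning_rate": tune.uniform(1e-5, 5e-5),
    "num_epochs": tune.choice([2, 3, 4, 5]),
}

pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    metric="eval_acc",
    mode="max",
    perturbation_interval=1,      # exploit/explore after every epoch
    hyperparam_mutations={
        "per_gpu_batch_size": [16, 32, 64],
        "weight_decay": lambda: random.uniform(0.0, 0.3),
        "learning_rate": lambda: random.uniform(1e-5, 5e-5),
        "num_epochs": [2, 3, 4, 5],
    },
)

analysis = tune.run(
    train_rte,
    config=config,
    scheduler=pbt,
    num_samples=8,                # population of 8 trials
    resources_per_trial={"gpu": 1, "cpu": 4},
)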
Let's look at the results.
[W&B panel: Run set (9 runs)]
[W&B panel: Run set (88 runs)]
The Winner: Population-Based Training of Our Hugging Face Model
The key takeaway here is that Population-Based Training is the most effective approach to hyperparameter optimization for our Hugging Face transformer model. However, we uncovered a few other insights about hyperparameter tuning for NLP models that might be of broader interest:
- Avoiding local minima with Bayesian Optimization: When using a Bayesian Optimization method, it is essential to provide an initial set of “random guesses”. Intuitively, this provides a more informative prior for the Bayesian Optimization to start with. Otherwise, the Optimizer can be myopic and overfit to a small number of samples.
- Cutting down iteration time is super important: Always ensure that you utilize all of your machine's computing resources. Anything that can be run in parallel should be run in parallel.
- Tweaking the perturbation/mutation interval for PBT: With PBT, an important consideration is the perturbation interval, or how frequently we want to exploit and explore our hyperparameters. For our experiments, we performed this mutation after every epoch. However, doing this too frequently is counterproductive since model performance is noisy if only trained for a few batch steps.
- Random seeds also factor into our accuracy results. In addition to tuning the hyperparameters above, it might also be worth sweeping over different random seeds to find the best model. A two-step approach could work best here: first, use an early-stopping algorithm to train over many different seeds; then, keeping only the best-performing seeds, use Population-Based Training to tune the other hyperparameters.
A Few Final Thoughts
Some critical points to note from these experiments:
- All the experiments were parallelized across 8 GPUs by using Tune.
- These experiments can be scaled up or down without changing the code.
- All the important metrics, findings, and even this report live in one place and can be shared easily.
- These findings can be used to accurately quantify the resources saved by choosing a suitable search method.
- This overall structure makes teams more productive.