Hyperparameter Optimization for HuggingFace Transformers
This article explains three strategies for hyperparameter optimization for HuggingFace Transformers, using W&B to track our experiments.
Training an NLP model from scratch takes hundreds of hours. Instead, it is much easier to use a pre-trained model and fine-tune it for a specific task.
Using the HuggingFace transformers library, we can quickly load a pre-trained NLP model with several extra layers and run a few fine-tuning epochs on a specific task. Ray Tune provides high-level abstractions for performing scalable hyperparameter tuning using SOTA tuning algorithms.
In this article, we compare 3 different optimization strategies — Grid Search, Bayesian Optimization, and Population-Based Training — to see which one results in a more accurate model in the shortest amount of time.
We use a standard uncased BERT model from Hugging Face transformers, and we fine-tune it on the RTE dataset from the SuperGLUE benchmark. We will see that the hyperparameters we choose can have a significant impact on our final model performance.
Table of Contents
Our Hyperparameter Tuning Experiment
Grid Search (Baseline)
Bayesian Search with Asynchronous HyperOpt
Population-Based Training
A Few Final Thoughts
Our Hyperparameter Tuning Experiment
In this article, we'll experiment with tuning the model using the following tools and methods:
Ray Tune and W&B
- Ray Tune provides scalable implementations of state-of-the-art hyperparameter tuning algorithms
- Experiments can be scaled easily from a notebook to GPU-powered servers without any change in code
- Experiments can be parallelized across GPUs in 2 lines of code
Enable W&B tracking
There are two ways of tracking progress through W&B using Tune.
- You can pass WandbLogger as a logger when calling tune.run. This tracks all the metrics reported to Tune.
- You can use the @wandb_mixin function decorator and invoke wandb.log to track your desired metrics. Tune initializes the W&B run using the information passed in the config dictionary.
config = {
    # ... your training hyperparameters ...
    "wandb": {
        "project": "Project_name",
        "api_key": "<your W&B API key>",
        # additional wandb.init() parameters
    }
}
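For the second approach, a minimal sketch of a @wandb_mixin trainable might look like the following (the function name train_fn and the logged metric are illustrative, not from the article):

import wandb
from ray import tune
from ray.tune.integration.wandb import wandb_mixin

@wandb_mixin
def train_fn(config):
    for epoch in range(config["num_epochs"]):
        loss = 1.0 / (epoch + 1)   # placeholder for a real training epoch
        wandb.log({"loss": loss})  # tracked in W&B
        tune.report(loss=loss)     # reported to Tune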
Grid Search (Baseline)
To set up a baseline, we will perform a grid search over the hyperparameter space recommended by the BERT authors:
{"per_gpu_batch_size": [16, 32],"learning_rate": [2e-5, 3e-5, 5e-5],"num_epochs": [2, 3, 4]}
We launch the trials with tune.run, telling Tune how many CPUs and GPUs each trial may use:

analysis = tune.run(
    ...,
    resources_per_trial={
        "gpu": num_gpus,  # GPUs per trial
        "cpu": num_cpus,  # CPUs per trial
    },  # Tune will use this information to parallelize the tuning operation
    config=config,
    ...,
)

Let us now take a look at the metric visualizations to compare the performance.
[W&B panel: Run set (19 runs)]
Bayesian Search with Asynchronous HyperOpt
In Bayesian search, we fit a Gaussian Process model that predicts the performance (i.e., the loss) of hyperparameter configurations and use it to choose which configurations to try next. We also combine this with an early-stopping algorithm, Asynchronous HyperBand, which stops badly performing trials early to avoid wasting resources.

For this experiment, we also search over weight_decay and warmup_steps, and extend our search space:
{"per_gpu_batch_size": (16, 64),"weight_decay": (0, 0.3),"learning_rate": (1e-5, 5e-5),"warmup_steps": (0, 500),"num_epochs": (2, 5)}
We run a total of 60 trials, the first 15 of which are random configurations used to seed the optimizer.
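A minimal sketch of this setup in Ray Tune is shown below. The trainable train_rte and the metric name eval_acc are stand-in names, and the import paths correspond to older Ray releases:

from ray import tune
from ray.tune.schedulers import ASHAScheduler
from ray.tune.suggest.hyperopt import HyperOptSearch

config = {
    "per_gpu_batch_size": tune.randint(16, 65),   # upper bound exclusive
    "weight_decay": tune.uniform(0.0, 0.3),
    "learning_rate": tune.loguniform(1e-5, 5e-5),
    "warmup_steps": tune.randint(0, 501),
    "num_epochs": tune.randint(2, 6),
}

search_alg = HyperOptSearch(metric="eval_acc", mode="max", n_initial_points=15)
scheduler = ASHAScheduler(metric="eval_acc", mode="max", max_t=5, grace_period=1)

analysis = tune.run(
    train_rte,                    # your trainable function
    config=config,
    search_alg=search_alg,
    scheduler=scheduler,
    num_samples=60,               # 60 trials in total
    resources_per_trial={"gpu": 1, "cpu": 4},
)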
Let us now look at the results.
[W&B panel: Run set (61 runs)]
Population-Based Training
Population-based training uses guided hyperparameter search but does not need to restart training for new hyperparameter configurations. Instead of discarding bad-performing trials, we exploit good-performing runs by copying their network weights and hyperparameters and then explore new hyperparameter configurations while continuing to train.

The basic idea behind the algorithm in layman's terms:
- Run the hyperparameter optimization process for a number of sampled configurations for a given time step (or number of iterations) T.
- After every T iterations, compare the runs; copy the weights of the well-performing runs to the badly performing ones, and perturb the latter's hyperparameter values toward those of the well-performing runs.
- Terminate the worst-performing runs.

Although the algorithm's idea seems simple, a lot of complex optimization math goes into building this from scratch. Tune provides a scalable and easy-to-use implementation of the SOTA PBT algorithm.
This is the search space that we will use:
{"per_gpu_batch_size": [16, 32, 64],"weight_decay": (0, 0.3),"learning_rate": (1e-5, 5e-5),"num_epochs": [2, 3, 4, 5]}
We run only 8 trials, far fewer than with Bayesian Optimization: instead of stopping bad trials, PBT overwrites them with copies of the good ones. A configuration sketch follows below.
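Here is a minimal sketch of this setup (again with the stand-in names train_rte and eval_acc):

import random

from ray import tune
from ray.tune.schedulers import PopulationBasedTraining

# Initial values for each trial are sampled from this space.
config = {
    "per_gpu_batch_size": tune.choice([16, 32, 64]),
    "weight_decay": tune.uniform(0.0, 0.3),
    "learning_rate": tune.uniform(1e-5, 5e-5),
    "num_epochs": tune.choice([2, 3, 4, 5]),
}

pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    metric="eval_acc",
    mode="max",
    perturbation_interval=1,      # exploit/explore after every epoch
    hyperparam_mutations={
        "per_gpu_batch_size": [16, 32, 64],
        "weight_decay": lambda: random.uniform(0.0, 0.3),
        "learning_rate": lambda: random.uniform(1e-5, 5e-5),
        "num_epochs": [2, 3, 4, 5],
    },
)

analysis = tune.run(
    train_rte,
    config=config,
    scheduler=pbt,
    num_samples=8,                # population of 8 trials
    resources_per_trial={"gpu": 1, "cpu": 4},
)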
Let's look at the results.
[W&B panel: Run set (9 runs)]
[W&B panel: Run set (88 runs)]
The Winner: Population-Based Training of Our Hugging Face Model
The key takeaway here is that Population-Based Training is the most effective approach to hyperparameter optimization for our Hugging Face transformer model. However, we uncovered a few other insights about hyperparameter tuning for NLP models that might be of broader interest:
- Avoiding local minima with Bayesian Optimization: When using a Bayesian Optimization method, it is essential to provide an initial set of “random guesses”. Intuitively, this provides a more informative prior for the Bayesian Optimization to start with. Otherwise, the Optimizer can be myopic and overfit to a small number of samples.
- Cutting down iteration time is super important: Always ensure that you utilize all of your machine's computing resources. Anything that can be run in parallel should be run in parallel.
- Tweaking the perturbation/mutation interval for PBT: With PBT, an important consideration is the perturbation interval, or how frequently we want to exploit and explore our hyperparameters. For our experiments, we performed this mutation after every epoch. However, doing this too frequently is counterproductive since model performance is noisy if only trained for a few batch steps.
- Random seeds also factor into our accuracy results. In addition to tuning the hyperparameters above, it might also be worth sweeping over different random seeds to find the best model. A two-step approach could work best here: first, use an early-stopping algorithm to train over many different seeds; then, keeping only the best-performing seeds, use Population-Based Training to tune the other hyperparameters.
A Few Final Thoughts
Some critical points to note from these experiments:
- All the experiments were parallelized across 8 GPUs by using Tune.
- These experiments can be scaled up or down without changing the code.
- All the important metrics, findings, and even this report live in one place and can be shared easily.
- These findings can be used to accurately quantify the resources saved by choosing a suitable search method.
- This overall structure makes teams more productive.