Hyperparameter Search for HuggingFace Transformer Models
In this article, we will explore how to perform hyperparameter search for pre-trained HuggingFace transformer models, making use of Weights & Biases Sweeps.
In this blog post we will learn how to leverage Weights & Biases (W&B) Sweeps 🧹 to perform hyperparameter search for HuggingFace transformer models.
Then, we'll compare some of the best hyperparameter combinations with the default values provided by HuggingFace to evaluate the benefits of running a hyperparameter search.
As for the transformer model itself, we'll use a pre-trained Vision Transformer model to perform image classification on a task-specific dataset.
Let's get going.
Table of Contents
What Are Vision Transformers (ViT)?
Setting Up Our Experiment
Loading The Data
Loading the Pre-Trained Model
Exploring Hyperparameter Combinations With Sweeps
Related Articles
What Are Vision Transformers (ViT)?
The Vision Transformer (ViT) is a transformer model first proposed in "An Image is Worth 16x16 Words", a research paper published in 2020 by the Google Brain team, which introduced a new way to pre-process images.
With Vision Transformers, images are broken down into a sequence of patches to mimic the pre-processing pipeline typical of text-based tasks where sentences are broken down into tokens. In doing so, the researchers could leverage a standard transformer encoder as the main backbone of the architecture.

Overview of the ViT architecture. Source: Google AI blog
The Pre-Trained Model
For the purpose of this blog post, we will use a pre-trained model available on HuggingFace, named google/vit-base-patch16-224-in21k.
The checkpoint was generated by pre-training a ViT model on ImageNet-21k, which contains 14 million images and 21,843 classes. The images were resized to 224x224 and, during the pre-processing step, each image was converted into a sequence of 16x16-pixel patches.
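To get a feel for what that patching step means in practice, here is a minimal, illustrative sketch (not part of the pipeline we build below) that splits a dummy 224x224 image into 16x16 patches with plain PyTorch; with these sizes we end up with a sequence of 14x14 = 196 patches per image.

import torch

# dummy RGB image: (channels, height, width)
image = torch.randn(3, 224, 224)
patch_size = 16

# slide a non-overlapping 16x16 window over height and width
patches = image.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
# -> shape (3, 14, 14, 16, 16): a 14x14 grid of 16x16 patches

# flatten the grid into a sequence, one row per patch
patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3 * patch_size * patch_size)
print(patches.shape)  # torch.Size([196, 768]): 196 "tokens", each of dimension 768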
The Data
We'll use the snacks dataset, which is also available on HuggingFace (and yes, the dataset is exactly what it sounds like). It contains a total of 6,745 images across 20 classes, sourced from the Google Open Images dataset.
Setting Up Our Experiment
To perform this analysis we will essentially rely on three libraries: HuggingFace's datasets and transformers and, of course, W&B's wandb. Let's install those quickly:
Please note: the underlying assumption here is that we are running the code snippets in a notebook-like environment.
# pip install libraries
!pip install datasets -Uqq
!pip install transformers[sentencepiece] -Uqq
!pip install -qq wandb --upgrade
Next, we'll import the W&B library and provide our personal token to associate the experiments with our account. We also set two environment variables: with WANDB_PROJECT we specify the project's name, and with WANDB_LOG_MODEL we instruct the W&B experiment tracking API to save the model configuration file and parameters as an artifact at the end of each experiment.
import wandb

wandb.login()

%env WANDB_PROJECT=vit_snacks_sweeps
%env WANDB_LOG_MODEL=true
Loading The Data
We'll load the data using load_dataset from the datasets library and pass the name of the dataset we want as an argument. For this example, we pass Matthijs/snacks to load the snacks dataset.
The data is represented as a series of nested dictionaries. At the top level, we find three splits that represent our train, validation, and test sets; at the lower level, within each split, we have two features: image and label.
We extract the labels through the features attribute because we will need them later to correctly initialize the ViT model.
from datasets import load_dataset

datasets = load_dataset('Matthijs/snacks')
labels = datasets['train'].features['label']
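As a quick sanity check, we can inspect this nested structure directly. The snippet below is just an illustrative exploration and is not required by the rest of the pipeline.

# inspect the splits and their features
print(datasets)                    # DatasetDict with 'train', 'validation' and 'test' splits
print(datasets['train'].features)  # {'image': Image(...), 'label': ClassLabel(...)}
print(labels.num_classes)          # 20 classes
print(labels.names)                # the class names we'll reuse to initialize the model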
With W&B we can even log a random sample of images and create an interactive visualization. No more static matplotlib plots!
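The original report renders this as an interactive panel. As a rough sketch of how such a sample could be logged with a wandb.Table of wandb.Image objects (the job type and sample size here are arbitrary choices, not part of the original code):

import random

# the project name is picked up from the WANDB_PROJECT environment variable set earlier
with wandb.init(job_type='data_visualization') as run:
    table = wandb.Table(columns=['image', 'label'])
    # log a small random sample of training images together with their class names
    for idx in random.sample(range(len(datasets['train'])), 16):
        example = datasets['train'][idx]
        table.add_data(wandb.Image(example['image']), labels.names[example['label']])
    run.log({'sample_images': table})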
Loading the Image Feature Extractor
Before being passed to a model, images require some sort of pre-processing. A "feature extractor" is just fancy terminology for that pre-processing pipeline, which usually consists of a resizing and a normalization step.
The ViTFeatureExtractor class has a from_pretrained method that allows us to easily load any pre-processing pipeline we want.
The one we are using will resize the images to 224x224 and normalize the pixel values across channels.
from transformers import ViTFeatureExtractor

checkpoint = 'google/vit-base-patch16-224-in21k'
feature_extractor = ViTFeatureExtractor.from_pretrained(checkpoint)
Each pre-trained ViT model comes with its specific feature extractor. That's why it's extremely important, especially when experimenting with different pre-trained models, to make sure to use the same checkpoint when instantiating the feature extractor and model.
Adding Data Augmentation
Next, we enhance the default feature extractor with data augmentation.
What Is Data Augmentation?
Data augmentation refers to all kinds of data transformations that allow us to slightly alter the images at each epoch. That could mean flipping an image, zooming in or out, skewing it, etc.
In doing so, we expose the model to a larger variety of images which often makes the model more robust to unseen samples.
Back To Adding Augmentation
In the section above we underlined the importance of using the right feature extractor with the right model, but now we are already changing it. It might seem contradictory, but as with almost anything in life, there are always exceptions and nuances to consider.
In fact, the model doesn't care about which transformations we apply to the images. The model only cares about the final image size and the pixels' mean and standard deviation. That's why, as long as we resize and normalize the images according to the values defined in the feature extractor, we are free to apply whatever transformation we consider most suitable for the problem at hand.
In this case, we add a set of transformations from the torchvision library, but a similar logic can be applied to almost any data augmentation library. The reason we define two distinct pre-processing pipelines is that we want to show the model a slightly different training set at each epoch, but at evaluation time we want to assess the model's performance against the same set of images.
That's why the pre-processing pipeline for the training set contains random transformations whose parameters change at each iteration, whereas the one for the validation and test sets contains static transformations.
To pre-process the images according to the feature extractor parameters, we can grab the correct values through the size, image_mean and image_std attributes, and use them when defining our custom pipeline.
# data augmentation transformations
from torchvision.transforms import (
    Compose,
    Normalize,
    Resize,
    RandomResizedCrop,
    RandomHorizontalFlip,
    RandomAdjustSharpness,
    ToTensor,
    ToPILImage
)

# train
train_aug_transforms = Compose([
    RandomResizedCrop(size=feature_extractor.size),
    RandomHorizontalFlip(p=0.5),
    RandomAdjustSharpness(sharpness_factor=5, p=0.5),
    ToTensor(),
    Normalize(mean=feature_extractor.image_mean, std=feature_extractor.image_std),
])

# validation/test
valid_aug_transforms = Compose([
    Resize(size=(feature_extractor.size, feature_extractor.size)),
    ToTensor(),
    Normalize(mean=feature_extractor.image_mean, std=feature_extractor.image_std),
])
For each transformation pipeline, we define a corresponding function that we pass as an argument to the set_transform method, so that images are efficiently pre-processed on-the-fly during training.
def apply_train_aug_transforms(examples):
    examples['pixel_values'] = [train_aug_transforms(img.convert('RGB')) for img in examples['image']]
    return examples

def apply_valid_aug_transforms(examples):
    examples['pixel_values'] = [valid_aug_transforms(img.convert('RGB')) for img in examples['image']]
    return examples

datasets['train'].set_transform(apply_train_aug_transforms)
datasets['validation'].set_transform(apply_valid_aug_transforms)
datasets['test'].set_transform(apply_valid_aug_transforms)
Finally, we rename the target feature from label to labels because by default HuggingFace's models look for this variable name to compute the loss.
datasets_processed = datasets.rename_column('label', 'labels')
Loading the Pre-Trained Model
Loading a pre-trained model is straightforward. We just need to initialize a ViTForImageClassification object and provide the name of a pre-trained checkpoint to the from_pretrained method.
We also need to set the number of classes of our dataset. In this way, the original model head is replaced with a new one that fits the dataset at hand. A model head is a linear layer on top of the encoder that projects the final embeddings onto the right vector space, which for this task has one dimension for each label.
We enclose this step inside a function because to properly perform a hyperparameter search we need to re-initialize the model at each new run.
from transformers import ViTForImageClassification

def model_init():
    vit_model = ViTForImageClassification.from_pretrained(
        checkpoint,
        num_labels=labels.num_classes,
        id2label={index: label for index, label in enumerate(labels.names)},
        label2id={label: index for index, label in enumerate(labels.names)}
    )
    return vit_model
Exploring Hyperparameter Combinations With Sweeps
Weights & Biases Sweeps requires a configuration that defines, among other things, the hyperparameters to explore, their range of values, and the search strategy. In a notebook, these values are stored in a nested dictionary.
In this example, we will explore different combinations of batch_size, learning_rate and weight_decay using a random search. We will evaluate each combination for a single epoch.
The reason we don't explore the number of epochs is that later we will fine-tune a model for 5 epochs using some of the best combinations of values found with Sweeps as well as the default hyperparameters provided by HuggingFace. In this way, we will be able to assess, to a certain extent, the benefits of running a hyperparameter search for a ViT model.
# method
sweep_config = {'method': 'random'}

# hyperparameters
parameters_dict = {
    'epochs': {
        'value': 1
    },
    'batch_size': {
        'values': [8, 16, 32, 64]
    },
    'learning_rate': {
        'distribution': 'log_uniform_values',
        'min': 1e-5,
        'max': 1e-3
    },
    'weight_decay': {
        'values': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
    },
}

sweep_config['parameters'] = parameters_dict
With our Sweep configuration ready, we call wandb.sweep to initialize the hyperparameter search. wandb.sweep takes as input the sweep_config and the project name, and it returns a Sweep ID.
sweep_id = wandb.sweep(sweep_config, project='vit-snacks-sweeps')
The next step involves defining a function containing a training loop that sets the hyperparameters based on a Sweep configuration.
The config will be passed to wandb.init and then the hyperparameter values will be delivered to the training loop, which in HuggingFace is defined by a TrainingArguments and a Trainer object.
The train method will start the training process.
By default, HuggingFace's Trainer returns only the losses. That's why we need to define an ad-hoc function if we want to compute the accuracy and other metrics during training.
# define function to compute metrics
from datasets import load_metric
import numpy as np

def compute_metrics_fn(eval_preds):
    metrics = dict()

    accuracy_metric = load_metric('accuracy')
    precision_metric = load_metric('precision')
    recall_metric = load_metric('recall')
    f1_metric = load_metric('f1')

    logits = eval_preds.predictions
    labels = eval_preds.label_ids
    preds = np.argmax(logits, axis=-1)

    metrics.update(accuracy_metric.compute(predictions=preds, references=labels))
    metrics.update(precision_metric.compute(predictions=preds, references=labels, average='weighted'))
    metrics.update(recall_metric.compute(predictions=preds, references=labels, average='weighted'))
    metrics.update(f1_metric.compute(predictions=preds, references=labels, average='weighted'))

    return metrics
Then, we also need to define an ad-hoc collate function to form the batches.
import torch

def collate_fn(examples):
    pixel_values = torch.stack([example['pixel_values'] for example in examples])
    labels = torch.tensor([example['labels'] for example in examples])
    return {'pixel_values': pixel_values, 'labels': labels}
Finally, we put everything together:
from transformers import TrainingArguments, Trainer

def train(config=None):
    with wandb.init(config=config):
        # set sweep configuration
        config = wandb.config

        # set training arguments
        training_args = TrainingArguments(
            output_dir='vit-sweeps',
            report_to='wandb',  # turn on Weights & Biases logging
            num_train_epochs=config.epochs,
            learning_rate=config.learning_rate,
            weight_decay=config.weight_decay,
            per_device_train_batch_size=config.batch_size,
            per_device_eval_batch_size=16,
            save_strategy='epoch',
            evaluation_strategy='epoch',
            logging_strategy='epoch',
            load_best_model_at_end=True,
            remove_unused_columns=False,
            fp16=True
        )

        # define training loop
        trainer = Trainer(
            model_init=model_init,
            args=training_args,
            data_collator=collate_fn,
            train_dataset=datasets_processed['train'],
            eval_dataset=datasets_processed['validation'],
            compute_metrics=compute_metrics_fn
        )

        # start training loop
        trainer.train()
To actually start the hyperparameter search we call wandb.agent which takes as input the Sweep ID, the train function and the number of experiments we want to run.
After running this cell, we can relax, enjoy a cup of coffee ☕ or chai 🍵 if you prefer and go outside for a walk 🌄. In the meantime, W&B Sweeps will take care of everything.
wandb.agent(sweep_id, train, count=20)
Analyzing Sweeps Results
This is the part where W&B experiment tracking and visualization tools have no competition 🥇
This first plot shows the experiments over time and can help us adjust the number of experiments we want to run. Sometimes, just a handful of runs might be more than enough to find a good set of hyperparameter values.
In this case, the optimal values were found in the 19th experiment, but the first 4 runs were already pretty good. In fact, almost all of the runs performed well, so there was no need to continue the search.
Our second plot is a parallel coordinates plot and it's extremely useful to compare different combinations of values visually.
It provides an additional level of granularity compared to the previous plot, allowing us to see which sets of values were associated with the best and worst runs.
For instance, we can see that the two worst experiments were characterized by a learning_rate lower than 2e-5 and a weight_decay larger than 0.25.
This last plot shows the importance of each hyperparameter with respect to a target variable, which in this case is the validation accuracy. We can see that the most important hyperparameter is the learning_rate, and it's positively correlated with the validation accuracy. The second most important is the weight_decay, whose correlation with the accuracy is negative.
This plot essentially confirms what we noticed by analyzing the previous plot: a low learning_rate and a high weight_decay don't work well together.
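If you prefer working programmatically, the same information can also be pulled with the wandb public API. The snippet below is only a sketch: the entity placeholder needs to be replaced with your own, and the metric name assumes the 'eval/...' convention used by the Trainer's W&B integration.

api = wandb.Api()
sweep = api.sweep(f'<your-entity>/vit-snacks-sweeps/{sweep_id}')

# rank the sweep's runs by validation accuracy and inspect the best one
runs = sorted(sweep.runs, key=lambda run: run.summary.get('eval/accuracy', 0), reverse=True)
best_run = runs[0]

print(best_run.name)
print({k: best_run.config[k] for k in ('learning_rate', 'weight_decay', 'batch_size')})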
Comparing Hyperparameter Combinations vs HuggingFace's Default Values
This last section compares some of the hyperparameter combinations found with Sweeps against the default values provided by HuggingFace, which suggest a batch_size of 8, a learning_rate of 5e-5 and no weight_decay.
The experiment was structured in the following way: for each set of values, a ViT model was fine-tuned for 5 epochs and the results were averaged over three runs.
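For reference, a single run with the HuggingFace defaults can be launched by reusing the objects defined earlier. This is only a sketch of one such baseline run, not the exact script used for the comparison; the output directory and run name are arbitrary, and the project name is again picked up from the WANDB_PROJECT environment variable.

# baseline: HuggingFace default hyperparameters, fine-tuned for 5 epochs
baseline_args = TrainingArguments(
    output_dir='vit-baseline',
    report_to='wandb',
    num_train_epochs=5,
    learning_rate=5e-5,             # default learning rate
    weight_decay=0.0,               # no weight decay
    per_device_train_batch_size=8,  # default batch size
    per_device_eval_batch_size=16,
    save_strategy='epoch',
    evaluation_strategy='epoch',
    logging_strategy='epoch',
    load_best_model_at_end=True,
    remove_unused_columns=False,
    fp16=True
)

baseline_trainer = Trainer(
    model_init=model_init,
    args=baseline_args,
    data_collator=collate_fn,
    train_dataset=datasets_processed['train'],
    eval_dataset=datasets_processed['validation'],
    compute_metrics=compute_metrics_fn
)

with wandb.init(name='baseline-hf-defaults'):
    baseline_trainer.train()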
The table below shows the final results of this analysis.
On average, the best-performing set of hyperparameters was found with Sweeps and achieved a validation accuracy of 95.22%. As a comparison, the default set of hyperparameters, which acts as a baseline, achieved a validation accuracy of 94.76%.
Not a massive difference, but an improvement is always an improvement 🤟
As before, with a parallel coordinates plot, we can take a better look at the different sets of hyperparameters and their performances. In this case, it doesn't represent individual runs, but the average accuracy over the three runs.
Finally, this bar plot displays the train run time. As we might have expected, the smaller the batch size, the longer the train time. However, for this example, the difference is relatively small.
Conclusion
In this article, we learned how to integrate W&B Sweeps with HuggingFace's transformer library. We also compared various runs to understand the benefits of running a hyperparameter search.
We took full advantage of W&B's tools to track our experiments and visualize the results. In doing so, we discovered that, despite the difference being minimal, two sets of hyperparameters found with Sweeps outperformed the default values suggested by HuggingFace.
Obviously, this analysis just scratches the surface of hyperparameter tuning and has many limitations. For example, it would be interesting to consider a larger number of hyperparameters, a wider range of values and alternative search strategies.
That's it; thanks for reading and I hope you found this article useful!
Related Articles
How To Fine-Tune Hugging Face Transformers on a Custom Dataset
In this article, we will learn how to easily fine-tune a HuggingFace Transformer on a custom dataset with Weights & Biases.
An Example of Transformer Reinforcement Learning
In this article, we take a look at the logged metrics and gradients from a GPT-2 experiment that is tasked with writing favorable reviews for movies.
A Step-by-Step Guide to Tracking HuggingFace Model Performance
This article provides a quick tutorial for training Natural Language Processing (NLP) models with HuggingFace and visualizing their performance with W&B.
Fighting Plastic Pollution in Oceans with Deep Learning