
Does Model Size Matter? A Comparison of BERT and DistilBERT

This article provides a comparison of DistilBERT and BERT from Hugging Face, using hyperparameter sweeps from Weights & Biases.
Created on May 11 | Last edited on October 13

1. Getting Started

If you haven't already, check out my tutorial on training a model using Hugging Face and Weights & Biases. We'll be building on that knowledge today.
This tutorial will cover two models – BERT and DistilBERT – and explain how to conduct a hyperparameter search using Sweeps. We're going to aim to answer two questions:
  1. How does DistilBERT compare in performance to the larger BERT?
  2. Should BERT and DistilBERT be fine-tuned with different hyperparameters?

BERT and DistilBERT


BERT is a powerful language model that was released by Google in October 2018. BERT blew several important language benchmarks out of the water. Since its release, transformer-based models like BERT have become "state-of-the-art" in Natural Language Processing (NLP).
BERT is very powerful, but also very large; its base model contains over 100 million parameters. In this post, we'll compare it with DistilBERT, a slimmed-down version of BERT trained by scientists at Hugging Face.

Getting Started

This tutorial includes the code required for conducting a hyperparameter sweep of BERT and DistilBERT on your own. Both BERT and DistilBERT have pre-trained versions that can be loaded from the Hugging Face transformers GitHub repository. The repository also contains code for fine-tuning the models for various NLP tasks, including all of the tasks from the GLUE benchmark. We're going to conduct the hyperparameter search using Weights & Biases Sweeps, so we'll have to install the W&B Python client as well.
So we need to install both Python libraries, download the GLUE data, and download the fine-tuning script:
pip install git+https://github.com/huggingface/transformers.git
pip install wandb -qq
wget https://raw.githubusercontent.com/huggingface/transformers/master/utils/download_glue_data.py -qq
python download_glue_data.py
wget https://raw.githubusercontent.com/huggingface/transformers/master/examples/text-classification/run_glue.py -qq
Our folder now contains the script run_glue.py which we can run to fine-tune either BERT or DistilBERT. However, we're not going to run it directly. Once we set up Sweeps, wandb will automatically run run_glue.py over and over again with different sets of hyperparameters. So next, we need to set up Sweeps.
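If you're curious what a single fine-tuning run looks like, you could invoke the script by hand with something like the command below. The hyperparameter values here are only illustrative placeholders, since the Sweep will be choosing them for us:
# one manual fine-tuning run (hyperparameter values chosen arbitrarily for illustration)
python run_glue.py \
  --model_name_or_path distilbert-base-uncased \
  --task_name RTE \
  --data_dir glue_data/RTE \
  --output_dir /tmp/RTE/ \
  --max_seq_length 128 \
  --per_gpu_train_batch_size 32 \
  --learning_rate 3e-5 \
  --num_train_epochs 4 \
  --do_train \
  --do_eval \
  --overwrite_output_dir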

2. Setting up the Sweep

Models and Dataset

Before creating our sweep.yaml, we need to choose the hyperparameters we want to try out. We're going to write a list of options for each hyperparameter and let Sweeps try out every possible combination. We'll do this for both BERT and DistilBERT.
| Model | Num. Parameters (millions) | Inference Time (ms) |
| --- | --- | --- |
| BERT | 110 | 668 |
| DistilBERT | 66 | 410 |

Let's run this first Sweep on a single task from GLUE. I'm going to choose RTE (Recognizing Textual Entailment), simply because it's one of the smaller datasets in GLUE.
| Task | Train set size | Test set size |
| --- | --- | --- |
| RTE | 2.5k | 3k |


Defining the Search Space

The BERT authors recommend fine-tuning for 4 epochs over the following hyperparameter options:
  • batch sizes: 8, 16, 32, 64, 128
  • learning rates: 3e-4, 1e-4, 5e-5, 3e-5
We'll run our Sweep across all combinations of these hyperparameters for each model. That'll take a total of (5 batch sizes) * (4 learning rates) * (2 models) = 40 runs for a grid search on RTE.

Starting the Sweep

Creating a Sweep takes three steps:
  1. Configuration: Define the parameters of the Sweep in sweep.yaml
  2. Initialization: Create the Sweep using wandb sweep sweep.yaml
  3. Execution: Run the Sweep on one or more machines using wandb agent

Configuring the Sweep

We need to create a YAML file that tells Sweeps to execute run_glue.py with the proper hyperparameters. To view a full list of the script's hyperparameters, run python run_glue.py --help. For more detail on how to write the Sweeps YAML configuration file, see the Sweeps documentation.
Here's the YAML file. I'll explain the important parts below. You can also check the Sweeps documentation for more information about any of these variables.
program: run_glue.py
command:
  - ${env}
  - ${interpreter}
  - ${program}
  - "--do_train"
  - "--do_eval"
  - "--evaluate_during_training"
  - "--overwrite_output_dir"
  - ${args}
method: grid
metric:
  name: eval_acc
  goal: maximize
parameters:
  #
  # parameters to be optimized over
  #
  learning_rate:
    values: [3e-4, 1e-4, 5e-5, 3e-5]
  per_gpu_train_batch_size:
    values: [8, 16, 32, 64, 128]
  model_name_or_path:
    values: ["distilbert-base-uncased", "bert-base-uncased"]
  #
  # fixed parameters
  #
  task_name:
    value: RTE
  data_dir:
    value: glue_data/RTE
  output_dir:
    value: /tmp/RTE/
  max_seq_length:
    value: 128
  num_train_epochs:
    value: 4
  logging_steps:
    value: 50
  weight_decay:
    value: .01
  • command:: tells W&B how to launch each run of the Sweep. With this entry, every run executes something like python run_glue.py --do_train --do_eval --evaluate_during_training --overwrite_output_dir ${args}, where ${args} is replaced by the sampled hyperparameters (see the expanded example after this list).
  • method:: we want a grid search: try every possible combination of the parameter values listed below. Other options are random and bayes.
  • metric:: this lets us compare runs and select the ones with the maximum eval_acc.
  • parameters:: these are the parameters of the Sweep. Sweeps samples values from these and passes them as command-line arguments to run_glue.py. We're sweeping over a variety of learning rates and batch sizes for two different models (DistilBERT and BERT). The remaining parameters (task_name, data_dir, output_dir, max_seq_length, num_train_epochs, logging_steps, weight_decay) have a fixed value for every run.
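To make the command: entry concrete, here's roughly what one expanded command might look like after the agent substitutes ${args} with a single sampled combination (the values shown are just one cell of the grid; W&B passes sweep parameters as --key=value flags):
python run_glue.py --do_train --do_eval --evaluate_during_training --overwrite_output_dir \
  --model_name_or_path=distilbert-base-uncased --task_name=RTE --data_dir=glue_data/RTE \
  --output_dir=/tmp/RTE/ --max_seq_length=128 --num_train_epochs=4 \
  --per_gpu_train_batch_size=32 --learning_rate=3e-5 --logging_steps=50 --weight_decay=.01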

Initializing the Sweep

We've created our Sweep configuration and saved it to sweep.yaml. We can start the Sweep by running
wandb sweep sweep.yaml
This will tell us the ID of the sweep, which we'll use in the next step to start running it.

Executing the Sweep

This is the simplest part of all. The output of the Sweep initialization will give you a command, like
wandb agent jxmorris12/huggingface-tutorial/p4zq81qh
Run this command to start the Sweep. W&B will automatically launch run after run, each with a different set of hyperparameters from the search space. You can run the Sweep on as many machines as you'd like, too! (I ran mine from a single Colab notebook, since I don't have any spare GPUs lying around...)
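If you're also running from Colab, the agent command just goes in a notebook cell, prefixed with ! so it runs in the shell. This is only a sketch; the optional --count flag caps how many runs a single agent will execute, which is handy if your Colab session might time out before the Sweep finishes:
# log in once, then start an agent that pulls hyperparameter combinations from the Sweep
!wandb login
!wandb agent --count 10 jxmorris12/huggingface-tutorial/p4zq81qh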

3. Analyzing Sweep Results

After letting wandb agent work its magic for a while, all 40 runs of the Sweep are complete, and the results are in. Let's remind ourselves of the two questions we asked at the beginning of the post:
  1. How does the performance of DistilBERT compare to BERT?
  2. Should BERT and DistilBERT be fine-tuned with the same hyperparameters?

DistilBERT vs. BERT

Let's take a global look at the results. We can create a W&B Parallel Coordinates chart that compares the eval accuracy across runs from BERT and DistilBERT. We can also create a regular line chart to show each model's accuracy over time.



Sweep Results (40 runs)

This confirms what the Parameter Importance plot already told us: low learning rates are good, batch size doesn't matter very much. We'd need more evidence to confirm, but I'd say that a smaller batch size is preferable in this case, too.
Looking at the graph, the highest learning rate we tried, 3e-4, failed to train the model to greater than 50% accuracy. Unlike most entailment datasets, RTE only has two classes ("entailment" and "not entailment"), so the models trained with a learning rate of 3e-4 did no better than random guessing. This is likely because the gradient exploded during training.

4. BERT: Optimal Hyperparameters

Now that we know the rough effect and relative importance of our hyperparameters, let's do the last thing we set out to do: determine a good set of fine-tuning hyperparameters that we can recommend for each model. We can do this by plotting the effect of batch size and learning rate on accuracy again, but this time looking only at runs that fine-tuned BERT.


BERT Runs (20 runs)


5. DistilBERT: Optimal Hyperparameters

We can repeat the same experiments from Section 4 to determine the recommended learning rate and batch size for fine-tuning DistilBERT.



DistilBERT Runs (20 runs)


6. Conclusion

We trained 40 models to compare fine-tuning BERT and DistilBERT. Along the way, we learned how to conduct Sweeps and visualize different metrics using Weights & Biases. We trained some state-of-the-art models on the Recognizing Textual Entailment task and showed that BERT and DistilBERT perform best with different hyperparameters. Now that you know the power of transformers and wandb together, go out and train great NLP models!
