Does Model Size Matter? A Comparison of BERT and DistilBERT

Comparing DistilBERT and BERT from HuggingFace, using hyperparameter sweeps from Weights & Biases. Made by Jack Morris using Weights & Biases

1. Introduction and Getting Started

If you haven't already, check out my tutorial on training a model using HuggingFace and Weights & Biases. We'll be building on that knowledge today. This tutorial will cover two models – BERT and DistilBERT – and explain how to conduct a hyperparameter search using Sweeps. We're going to aim to answer two questions:

  1. How does DistilBERT compare in performance to the larger BERT?
  2. Should BERT and DistilBERT be fine-tuned with different hyperparameters?

BERT and DistilBERT

BERT is a powerful language model that was released by Google in October 2018. BERT blew several important language benchmarks out of the water. Since its release, transformer-based models like BERT have become "state-of-the-art" in NLP.

BERT is very powerful, but also very large; its base model contains about 110 million parameters. DistilBERT is a slimmed-down version of BERT, trained by scientists at HuggingFace, with about 66 million parameters.

Getting Started

This tutorial includes the code required for conducting a hyperparameter sweep of BERT and DistilBERT on your own. Both BERT and DistilBERT have pre-trained versions that can be loaded from the HuggingFace transformers GitHub repository. The repository also contains code for fine-tuning the models for various NLP tasks, including all of the tasks from the GLUE benchmark. We're going to conduct the hyperparameter search using Weights & Biases Sweeps, so we'll have to install the W&B Python client as well.

So we need to install both Python libraries, download the GLUE data, and download the fine-tuning script:

pip install git+
pip install wandb -qq
wget -qq
wget -qq

Our folder now contains the fine-tuning script, which we can run to fine-tune either BERT or DistilBERT. However, we're not going to run it directly. Once we set up Sweeps, wandb will automatically run it over and over again with different sets of hyperparameters. So next, we need to set up Sweeps.

2. Setting up the Sweep

Models and Dataset

Before creating our sweep.yaml, we need to choose the hyperparameters we want to try out. We're going to write a list of options for each hyperparameter and let Sweeps try out every possible combination. We'll do this for both BERT and DistilBERT.

Model        Num. Parameters (millions)   Inference Time (ms)
BERT         110                          668
DistilBERT   66                           410
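
To put the table in relative terms, here is a small sketch that computes the size and speed ratios from the figures above. The dictionary layout and variable names are just for illustration.

```python
# Figures copied from the table above.
models = {
    "BERT": {"params_millions": 110, "inference_ms": 668},
    "DistilBERT": {"params_millions": 66, "inference_ms": 410},
}

bert, distil = models["BERT"], models["DistilBERT"]

# Fraction of BERT's parameters that DistilBERT keeps.
param_ratio = distil["params_millions"] / bert["params_millions"]
# How many times faster DistilBERT's inference is.
speedup = bert["inference_ms"] / distil["inference_ms"]

print(f"DistilBERT keeps {param_ratio:.0%} of BERT's parameters")
print(f"DistilBERT inference is {speedup:.2f}x faster")
```

By these numbers, DistilBERT keeps 60% of BERT's parameters and runs inference roughly 1.6 times faster.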

Let's run this first Sweep on a single task from GLUE. I'm going to choose the RTE (Recognizing Textual Entailment) task, simply because the dataset is one of the smaller datasets in GLUE.


Task   Train set size   Test set size
RTE    2.5k             3k

Defining the Search Space

The BERT authors recommend fine-tuning for 4 epochs over the following hyperparameter options: batch sizes of 8, 16, 32, 64, and 128, and learning rates of 3e-4, 1e-4, 5e-5, and 3e-5.

We'll run our Sweep across all combinations of these hyperparameters for each model. That'll take a total of (5 batch sizes) * (4 learning rates) * (2 models) = 40 runs for a grid search on RTE.
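
To make the count concrete, here is a minimal sketch that enumerates the grid in plain Python, using the values from the sweep configuration in this post. It only illustrates what "every possible combination" means; it is not how Sweeps is implemented, and the dictionary keys are placeholder names.

```python
from itertools import product

batch_sizes = [8, 16, 32, 64, 128]
learning_rates = [3e-4, 1e-4, 5e-5, 3e-5]
model_names = ["distilbert-base-uncased", "bert-base-uncased"]

# A grid search visits every combination exactly once.
grid = [
    {"model": m, "batch_size": b, "learning_rate": lr}
    for m, b, lr in product(model_names, batch_sizes, learning_rates)
]

print(len(grid))  # 40
```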

Starting the Sweep

Creating a Sweep takes three steps:

  1. Configuration: Define the parameters of the Sweep in sweep.yaml
  2. Initialization: Create the Sweep using wandb sweep sweep.yaml
  3. Execution: Run the Sweep on one or more machines using wandb agent

Configuring the Sweep

We need to create a YAML file that tells Sweeps to execute the fine-tuning script with the proper hyperparameters. To view a full list of hyperparameters, run the script with --help. For more on how to write the Sweeps YAML configuration file, see the W&B Sweeps documentation.

Here's the YAML file. I'll explain the important parts below. You can also check the Sweeps documentation for more information about any of these variables.

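# program: set this to the fine-tuning script downloaded above
command:
  - ${env}
  - ${interpreter}
  - ${program}
  - "--do_train"
  - "--do_eval"
  - "--evaluate_during_training"
  - "--overwrite_output_dir"
  - ${args}
method: grid
metric:
  name: eval_acc
  goal: maximize
parameters:
  # NOTE: the parameter names below are assumed to match the script's
  # command-line flags; check them against its --help output.
  # parameters to be optimized over
  learning_rate:
    values: [3e-4, 1e-4, 5e-5, 3e-5]
  per_gpu_train_batch_size:
    values: [8, 16, 32, 64, 128]
  model_name_or_path:
    values: ["distilbert-base-uncased", "bert-base-uncased"]
  # fixed parameters
  task_name:
    value: RTE
  data_dir:
    value: glue_data/RTE
  output_dir:
    value: /tmp/RTE/
  max_seq_length:
    value: 128
  num_train_epochs:
    value: 4
  logging_steps:
    value: 50
  weight_decay:
    value: .01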
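
To see how this configuration turns into an actual command, it helps to know that the ${args} placeholder expands to one --name=value flag per parameter. The sketch below mimics that expansion for a single run. It is an illustration only, not W&B's implementation: the build_command helper, the script name finetune.py, and the parameter names are all hypothetical.

```python
def build_command(interpreter, program, fixed_flags, params):
    """Assemble the argument list for one run of the sweep."""
    # One --name=value flag per swept or fixed parameter.
    args = [f"--{name}={value}" for name, value in params.items()]
    return [interpreter, program, *fixed_flags, *args]


# Hypothetical script name and parameter names, for illustration only.
cmd = build_command(
    "python",
    "finetune.py",
    ["--do_train", "--do_eval"],
    {"learning_rate": 3e-5, "batch_size": 16},
)
print(" ".join(cmd))
```

Each run of the Sweep gets a different params dictionary, while the fixed flags stay the same.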

Initializing the Sweep

We've created our Sweep configuration and saved it to sweep.yaml. We can start the Sweep by running

wandb sweep sweep.yaml

This will tell us the ID of the sweep, which we'll use in the next step to start running it.

Executing the Sweep

This is the simplest part of all. The output of the Sweep initialization will give you a command, like

wandb agent jxmorris12/huggingface-tutorial/p4zq81qh

Run this command to start the Sweep. W&B will automatically run the fine-tuning script over and over, each time with a different set of hyperparameters from the search space. You can run the Sweep on as many machines as you'd like, too! (I ran mine from a single Colab notebook, since I don't have any spare GPUs lying around...)

3. Analyzing Sweep Results

After letting wandb agent work its magic for a while, all 40 runs of the Sweep are complete, and the results are in. Let's remind ourselves of the two questions we asked at the beginning of the post:

  1. How does the performance of DistilBERT compare to BERT?
  2. Should BERT and DistilBERT be fine-tuned with the same hyperparameters?

DistilBERT vs. BERT

Let's take a global look at the results. We can create a W&B Parallel Coordinates chart that compares the eval accuracy between runs from BERT and DistilBERT. We can also create a regular line chart to show the accuracy over time for each model name.

This confirms what the Parameter Importance plot already told us: low learning rates are good, and batch size doesn't matter very much. We'd need more evidence to confirm, but I'd say that a smaller batch size is preferable in this case, too.

Looking at the graph, the highest learning rate we tried, 3e-4, failed to train the model to greater than 50% accuracy. Unlike most entailment tasks, RTE only has two classes ("entailment" and "not entailment"). This means that the model trained with a learning rate of 3e-4 (0.0003) did no better than random guessing. This is likely because the gradients exploded during training.

4. BERT: Optimal Hyperparameters

Now that we know the rough effect and relative importance of our hyperparameters, let's do the last thing we set out to do: determine a good set of fine-tuning hyperparameters that we can recommend for each model. We can do this by plotting the effect of batch size and learning rate on accuracy again, only this time, we'll only look at runs that fine-tuned BERT.
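
If you export the run results from W&B, the same per-model comparison can be done in a few lines of code. The sketch below is one way to pick the best run for each model; the function name and the field names ("model", "batch_size", "learning_rate", "eval_acc") are assumptions for illustration, not a W&B export format.

```python
def best_run_per_model(runs):
    """Return the highest-accuracy run for each model name.

    `runs` is a list of dicts with the (hypothetical) keys
    "model", "batch_size", "learning_rate", and "eval_acc".
    """
    best = {}
    for run in runs:
        current = best.get(run["model"])
        # Keep this run if it is the first or the most accurate so far.
        if current is None or run["eval_acc"] > current["eval_acc"]:
            best[run["model"]] = run
    return best
```

Calling best_run_per_model on the 40 runs would return one entry per model, each holding the batch size and learning rate of its most accurate run.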

5. DistilBERT: Optimal Hyperparameters

We can repeat the same experiments from Section 4 to determine the recommended learning rate and batch size for fine-tuning DistilBERT.

6. Conclusion

We trained 40 models to compare fine-tuning BERT and DistilBERT. Along the way, we learned how to conduct Sweeps and visualize different metrics using Weights & Biases. We trained some state-of-the-art models on the Recognizing Textual Entailment task and showed how BERT and DistilBERT perform better with different hyperparameters. Now that you know the power of transformers and wandb together, go out and train great NLP models!