A Step-by-Step Guide to Tracking Hugging Face Model Performance

A quick tutorial for training NLP models with HuggingFace and visualizing their performance with Weights & Biases. Made by Jack Morris using W&B

A: Setup

This tutorial explains how to train a model (specifically, an NLP classifier) using the Weights & Biases and HuggingFace transformers Python packages.

Run the Google Colab Notebook →

1. Installation

Both packages are available on PyPI. Here we install transformers from source (straight from its GitHub repository) and wandb from PyPI:

pip install git+https://github.com/huggingface/transformers.git
pip install wandb
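To confirm that both installs succeeded before going further, a quick check along these lines may help. This is a minimal sketch using only the standard library; the helper name `installed` is made up for this example and is not part of either package.

```python
import importlib.util


def installed(packages):
    """Return {package_name: True/False} without actually importing the packages."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}


if __name__ == "__main__":
    # Report whether each package from the installation step can be found.
    for name, ok in installed(["transformers", "wandb"]).items():
        print(f"{name}: {'found' if ok else 'missing'}")
```

If either package shows up as missing, re-run the corresponding pip command in the same environment that your python command uses.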

2. Connecting Weights & Biases

Open a Python shell (type python in the console) and run the following:

import wandb
wandb.login()

That's it! wandb.login() asks for your API key the first time it runs. After that, the wandb package will log runs to our Weights & Biases account, so when we train our model we can open app.wandb.ai and watch training statistics while the run is still in progress.

3. Downloading Scripts

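We're going to use two scripts from the transformers repository in this tutorial: download_glue_data.py, which downloads the data, and run_glue.py, which trains the model. We can grab a script from the web using wget (the -qq flag makes wget run "quietly", without printing any progress output). The command below fetches download_glue_data.py; run_glue.py lives in the repository's examples directory and needs to be downloaded the same way before training: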

wget https://raw.githubusercontent.com/huggingface/transformers/master/utils/download_glue_data.py -qq
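If wget isn't available (on Windows, for instance), the same download can be done from Python. The following is a rough sketch using only the standard library; `local_name` and `fetch` are hypothetical helper names, and the URL is the one from the wget command above.

```python
import os
import urllib.request
from urllib.parse import urlparse

# Same URL as the wget command in this section.
SCRIPT_URL = (
    "https://raw.githubusercontent.com/huggingface/transformers/"
    "master/utils/download_glue_data.py"
)


def local_name(url):
    """Derive the local file name from the last path component of the URL."""
    return os.path.basename(urlparse(url).path)


def fetch(url, dest_dir="."):
    """Download url into dest_dir and return the path of the saved file."""
    dest = os.path.join(dest_dir, local_name(url))
    urllib.request.urlretrieve(url, dest)
    return dest
```

Calling fetch(SCRIPT_URL) should leave download_glue_data.py in the current directory, just as wget does.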

4. Downloading Data

We just downloaded download_glue_data.py, a script provided by transformers that downloads and processes datasets from the GLUE Benchmark. Run this script to download all of the GLUE datasets:

python download_glue_data.py
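Once the download finishes, it can be worth checking which tasks actually ended up on disk. The sketch below assumes the script writes one sub-directory per task under glue_data, each containing a train.tsv file; that layout is an assumption about the script's output, and `tasks_with_training_data` is a made-up helper name.

```python
import os


def tasks_with_training_data(glue_dir="glue_data"):
    """List task directories under glue_dir that contain a train.tsv file.

    Assumes a glue_data/<TASK>/train.tsv layout (an assumption, see above).
    """
    if not os.path.isdir(glue_dir):
        return []
    found = []
    for name in sorted(os.listdir(glue_dir)):
        if os.path.isfile(os.path.join(glue_dir, name, "train.tsv")):
            found.append(name)
    return found


if __name__ == "__main__":
    print(tasks_with_training_data())
```

An empty list suggests the download did not complete or was written to a different directory.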

B: The GLUE Benchmark

By now, you're probably curious what task and dataset we're actually going to be training our model on. Out of the box, transformers provides great support for the General Language Understanding Evaluation (GLUE) benchmark.


GLUE is really just a collection of nine datasets and tasks for training NLP models. If you train a good NLP model, it should be able to generalize well to all nine of these tasks. There's even a public leaderboard that shows the models that have achieved top performance across all the GLUE tasks.

Here's a quick description of each task:

  1. Grammatical acceptability with CoLA
  2. Sentiment analysis with SST-2
  3. Paraphrase identification with MRPC
  4. Semantic similarity with STS-B
  5. Duplicate question detection with Quora Question Pairs (QQP)
  6. Textual entailment with MNLI-matched and MNLI-mismatched (https://cims.nyu.edu/~sbowman/multinli/)
  7. Recognizing Textual Entailment with RTE
  8. Question answering with QNLI (derived from SQuAD)
  9. Pronoun disambiguation with WNLI (derived from the Winograd Schema Challenge dataset)

C: Preparing for Training

Now, as you may have guessed, it's time to run run_glue.py and actually train the model. This script will take care of everything for us: processing the data, training the model, and even logging results to Weights & Biases. Before running it, we have two more things to decide: the dataset and the model.

Choosing our model: DistilBERT

transformers provides lots of state-of-the-art NLP models that we can use for training, including BERT, XLNet, RoBERTa, and T5 (see the repository for a full list). Hugging Face also provides a model hub where community members can share their models. If you train a model that achieves a competitive score on the GLUE benchmark, you should share it on the model hub!

It's up to you which model you choose to train. For this tutorial, we're going to use DistilBERT. DistilBERT is a Transformer that's 40% smaller than BERT but retains about 97% of BERT's performance on language understanding benchmarks.
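To see where the "40% smaller" figure comes from, here is the arithmetic as a short sketch. The parameter counts (roughly 66 million for DistilBERT and 110 million for BERT-base) are the commonly reported numbers and are not stated elsewhere in this tutorial; treat them as approximate.

```python
# Approximate, commonly reported parameter counts (assumed for illustration).
BERT_BASE_PARAMS = 110_000_000
DISTILBERT_PARAMS = 66_000_000


def relative_reduction(original, smaller):
    """Fraction by which `smaller` is reduced relative to `original`."""
    return 1 - smaller / original


if __name__ == "__main__":
    reduction = relative_reduction(BERT_BASE_PARAMS, DISTILBERT_PARAMS)
    print(f"DistilBERT is about {reduction:.0%} smaller than BERT-base")
```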

Choosing our dataset: CoLA

Our training script (run_glue.py) supports all of the GLUE tasks, and we downloaded data for all of them in Section A. We're going to use the CoLA (Corpus of Linguistic Acceptability) dataset, but we could fine-tune our model on any of the nine datasets we've already downloaded. Feel free to choose another dataset (or model!) when you test out the training script.
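One thing worth knowing about CoLA is that GLUE scores it with the Matthews correlation coefficient (MCC) rather than plain accuracy, so the number you'll see for this task ranges from -1 to 1 instead of 0 to 1. The function below is a small pure-Python sketch of that metric for binary labels, written for illustration; it is not the code the training script itself uses.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1).

    Returns 0.0 when the denominator is zero, e.g. when the model
    predicts the same class for every example.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

A model that labels every sentence "acceptable" can still reach decent accuracy on an imbalanced dataset, but its MCC stays at 0, which is why this metric is a better fit for the task.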

D: Training the Model

In Section C, we decided to fine-tune DistilBERT on the CoLA dataset. We're going to supply these two pieces of information as arguments to run_glue.py:

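# W&B project that the run will be logged under
export WANDB_PROJECT=distilbert
# Directory created by download_glue_data.py
export GLUE_DIR=glue_data
# Must match the task's directory name inside $GLUE_DIR
export TASK_NAME=CoLA

python run_glue.py \
  --model_name_or_path distilbert-base-uncased \
  --task_name $TASK_NAME \
  --data_dir $GLUE_DIR/$TASK_NAME \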
  --do_train \
  --do_eval \
  --evaluate_during_training \
  --max_seq_length 128 \
  --per_gpu_train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 6 \
  --output_dir /tmp/$TASK_NAME/ \
  --overwrite_output_dir \
  --logging_steps 50

I've just chosen default hyperparameters for fine-tuning (a learning rate of $2 \times 10^{-5}$, for example) and provided some other command-line arguments. (If you're unsure what an argument does, you can always run python run_glue.py --help.)
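If you plan to try several tasks or models, it may be convenient to launch the script from Python instead of retyping the shell command. The sketch below rebuilds the same argument list with the task name passed in once, so the data directory and output directory always agree with it. The helper names `build_run_glue_args` and `launch` are invented for this example, the flags are copied from the command above, and the "CoLA" default assumes the data directory is named that way under glue_data.

```python
import os
import subprocess
import sys


def build_run_glue_args(task_name="CoLA", glue_dir="glue_data",
                        model="distilbert-base-uncased"):
    """Build the run_glue.py command line used in this tutorial as a list."""
    return [
        sys.executable, "run_glue.py",
        "--model_name_or_path", model,
        "--task_name", task_name,
        "--data_dir", os.path.join(glue_dir, task_name),
        "--do_train",
        "--do_eval",
        "--evaluate_during_training",
        "--max_seq_length", "128",
        "--per_gpu_train_batch_size", "32",
        "--learning_rate", "2e-5",
        "--num_train_epochs", "6",
        "--output_dir", os.path.join("/tmp", task_name) + "/",
        "--overwrite_output_dir",
        "--logging_steps", "50",
    ]


def launch(args, project="distilbert"):
    """Run the command with WANDB_PROJECT set, raising if the script fails."""
    env = dict(os.environ, WANDB_PROJECT=project)
    return subprocess.run(args, env=env, check=True)
```

For example, launch(build_run_glue_args()) starts the CoLA run, and passing a different task_name or model switches the experiment without touching the other arguments.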