Overview

Human language is ambiguous to machines

Consider these two sentences, which only differ by one word:

Hopefully it's clear that in sentence A, the word "them" must refer to the "humans", while in sentence B, "them" refers to the "deep nets". This task is straightforward for humans but challenging for machines: disambiguating the pronoun "them" requires knowing more about the real world (the relationship between humans and deep nets) than the sentence itself conveys. Until machine learning models can parse these kinds of sentences as reliably as humans do, we won't really be able to communicate with them in natural language, no matter what new website, app, or device we are using.

Language understanding benchmarks for measuring progress

The Winograd Schema Challenge remains one of the main benchmarks in this space. It is named for Terry Winograd, the Stanford professor who posed the original example sentence in his foundational 1972 work on natural language understanding, and was later formalized as a challenge by Hector Levesque and colleagues. Specifically, Winograd schema sentences like the example above are part of the General Language Understanding Evaluation, or GLUE, benchmark for text-based deep learning models. In GLUE, the task is called Winograd Natural Language Inference, or WNLI. The human baseline on this task is 95.9, while the best trained models still score 94.5, so there is room for improvement. That makes WNLI the second hardest of the nine GLUE subtasks; on six of the other tasks, deep learning models have already surpassed the human baseline.
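
If you want to inspect the task data directly, the GLUE version of WNLI is available through the Hugging Face datasets library. The snippet below is a minimal sketch of loading it, not necessarily the preprocessing used for the runs in this report:

```python
# Minimal sketch: peek at the WNLI split of GLUE via the datasets library.
from datasets import load_dataset

wnli = load_dataset("glue", "wnli")
print(wnli["train"][0])
# Each example pairs an original sentence (sentence1) with a candidate
# reading of the pronoun (sentence2), plus a binary label for whether
# that reading is entailed.
```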

HuggingFace Transformers are a great place to start

The HuggingFace Transformers repository makes it very easy to work with a variety of advanced natural language models and try them on these benchmarks. In this report, I'll show how to use Transformers with Weights & Biases to start tuning models for a natural language task like correctly disambiguating Winograd schemas.
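
For context, the Transformers-side setup for W&B logging is only a couple of lines. This is a minimal sketch assuming a recent transformers release; the project name is a placeholder:

```python
# Minimal sketch: stream Trainer metrics to Weights & Biases.
# Assumes wandb and a recent transformers release are installed;
# "wnli-transformers" is a placeholder project name.
import os

os.environ["WANDB_PROJECT"] = "wnli-transformers"

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="wnli-runs",
    report_to="wandb",   # send training loss and eval metrics to W&B
    logging_steps=50,
)
```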

First pass: Fine-tuning BERT from example baseline

Short manual hyperparameter exploration

I started with the hyperparameter settings in the example provided with Transformers. Based on the loss and validation accuracy (eval_acc) curves plotted in W&B after each run, I adjusted my model to improve performance from a baseline eval_acc of 0.127 to 0.535 in fewer than 20 experiments.
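
To make those knobs concrete, here is a sketch of the kind of fine-tuning script behind these runs, with the three hyperparameters I varied (max sequence length, epochs, learning rate) pulled out as variables. The specific values, checkpoint name, and batch size are illustrative, not the exact settings of any run below:

```python
# Hedged sketch of fine-tuning BERT on WNLI; values are illustrative.
import numpy as np
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "bert-base-cased"  # baseline checkpoint (assumption)
MAX_LEN = 128                   # max sequence length (max_len in the charts)
EPOCHS = 3                      # training epochs (E in the charts)
LR = 2e-5                       # learning rate (LR in the charts)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

wnli = load_dataset("glue", "wnli")

def tokenize(batch):
    # WNLI is a sentence-pair task, so both sentences go to the tokenizer.
    return tokenizer(
        batch["sentence1"],
        batch["sentence2"],
        truncation=True,
        max_length=MAX_LEN,
        padding="max_length",
    )

encoded = wnli.map(tokenize, batched=True)

def compute_metrics(eval_pred):
    # The Trainer prefixes metric names with "eval_", so "acc" shows up
    # as eval_acc in the W&B charts.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"acc": float((preds == labels).mean())}

args = TrainingArguments(
    output_dir="wnli-bert",
    num_train_epochs=EPOCHS,
    learning_rate=LR,
    per_device_train_batch_size=32,
    report_to="wandb",  # log the loss and eval_acc curves to W&B
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    compute_metrics=compute_metrics,
)
trainer.train()
trainer.evaluate()
```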

Below, you can see the training loss and validation accuracy curves plotted over time. The starting baseline is in black, and the rest of the runs are colored in rainbow order from red to purple based on their creation order: my earlier experiments are reds/oranges, and the later experiments are blues/purples. The legend shows the maximum sequence length (max_len), training epochs (E), and learning rate (LR) for each run. You can also expand the "BERT variants" run set at the end of this section to see more details about each run.

Second pass: Comparing different base models

How useful is it to fine-tune yet another BERT model?

BERT improves substantially with some fine-tuning, and we could probably tune it further, especially with additional data or a hyperparameter sweep (sketched below). However, starting with a base model that has already proven successful on this task is likely to be faster and ultimately more performant than fine-tuning an arbitrary model.
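
If you did want to push further on BERT itself, a W&B sweep automates the search. Below is a hedged sketch: the search ranges are illustrative, finetune_wnli is a hypothetical wrapper around the Trainer code above, and the metric name should match whatever your runs actually log:

```python
# Hedged sketch of a W&B hyperparameter sweep over the same knobs.
import wandb

sweep_config = {
    "method": "bayes",
    # Metric key as it appears in W&B for your runs (assumption here).
    "metric": {"name": "eval/acc", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"min": 1e-6, "max": 5e-5},
        "num_train_epochs": {"values": [3, 5, 10]},
        "max_seq_length": {"values": [128, 256]},
    },
}

def run_trial():
    with wandb.init() as run:
        cfg = run.config
        # finetune_wnli is a hypothetical helper wrapping the Trainer
        # setup sketched earlier in this report.
        finetune_wnli(
            max_len=cfg.max_seq_length,
            epochs=cfg.num_train_epochs,
            lr=cfg.learning_rate,
        )

sweep_id = wandb.sweep(sweep_config, project="wnli-transformers")
wandb.agent(sweep_id, function=run_trial, count=20)
```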

Fortunately, the HuggingFace model zoo has a variety of pretrained models:
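
As a hedged illustration, the checkpoint names below are common Hub models of each family, not necessarily the exact variants compared in this report; swapping the base architecture is essentially a one-line change:

```python
# Hedged illustration: try several model families on the same task.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

for checkpoint in [
    "bert-base-cased",
    "roberta-base",
    "albert-base-v1",
    "xlnet-base-cased",
    "distilbert-base-cased",
]:
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
    # ...then fine-tune exactly as before; only the checkpoint name changes.
```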

Here I train a few variants of each type of model. Once again, the starting model is plotted in black, and the rainbow gradient indicates the order in which the other experiments ran. You can select which model type to view by checking/unchecking the individual boxes next to the run set tabs at the bottom of this section. This makes it easy to compare different variants of the same model. You could also compare across model types, though in this section that would lead to an overwhelming number of lines.

Transformers TL;DR: Tune less, they're already powerful

Concrete examples from best model

I logged the actual predictions and confidence scores for one of the best models from this exploration, an ALBERT v1 variant with a low learning rate. The first chart shows the top 10 errors by prediction confidence, and the second shows the top 10 most confident correct answers. Logging a text table via wandb.Table() makes it easy to record these predictions for every run. You can use the arrows in the top right corner to browse all 10 examples in each chart and formulate your own hypotheses about what's easy and hard for these models; I'll share mine at the end of this section.
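
For reference, here is roughly how such a table can be logged. The column names and the confidence computation are assumptions, and validation_probs stands in for whatever your evaluation loop produces:

```python
# Hedged sketch: log per-example predictions and confidences to a W&B table.
import wandb

wandb.init(project="wnli-transformers")  # placeholder project name

columns = ["sentence1", "sentence2", "label", "prediction", "confidence"]
pred_table = wandb.Table(columns=columns)

# validation_probs: softmaxed model outputs for the validation set
# (hypothetical variable produced by your evaluation loop).
for example, probs in zip(wnli["validation"], validation_probs):
    pred = int(probs.argmax())
    pred_table.add_data(
        example["sentence1"],
        example["sentence2"],
        example["label"],
        pred,
        float(probs[pred]),
    )

wandb.log({"validation_predictions": pred_table})
```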

Model exploration summary
