Overview

Human language is ambiguous to machines

Consider these two sentences, which differ by only one word:
Hopefully it's clear that in sentence A, the word "them" must refer to the "humans", while in sentence B, "them" refers to "deep nets". This task is straightforward for humans but challenging for machines: disambiguating the pronoun "them" requires knowing more about the real world, namely the relationship between humans and deep nets, than the sentence itself states. Until machine learning models can parse these kinds of sentences as reliably as humans do, we won't truly be able to communicate with them in natural language, no matter what new website, app, or device we are using.

Language understanding benchmarks for measuring progress

The Winograd Schema Challenge, named for Terry Winograd, a Stanford professor who did foundational work on natural language understanding and introduced the original example in 1972, remains one of the main benchmarks in this space. Specifically, Winograd Schema sentences like the example above are part of the General Language Understanding Evaluation, or GLUE, benchmark for text-based deep learning models. In GLUE, the task is called Winograd Natural Language Inference, or WNLI. The human baseline on this task is 95.9, while the best trained models are still at 94.5, so there is room for improvement. WNLI is the second hardest of the nine GLUE subtasks, and on six of the others, deep learning models have already surpassed the human baseline.

HuggingFace Transformers is a great place to start

The HuggingFace Transformers repository makes it very easy to work with a variety of advanced natural language models and try them on these benchmarks. In this report, I'll show how to use Transformers with Weights & Biases to start tuning models for a natural language task like correctly disambiguating Winograd schemas.
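
To make this concrete, here is a minimal sketch of that setup: fine-tuning a sequence classification checkpoint on WNLI with the Trainer API and streaming metrics to W&B. The checkpoint name, hyperparameter values, and output directory are illustrative placeholders, not the exact configuration behind the runs below (which use the GLUE example script that ships with Transformers).

```python
# Minimal sketch: fine-tune a pretrained checkpoint on WNLI and report to W&B.
# All names and values here are placeholders for illustration.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "bert-base-uncased"  # any sequence-classification checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# WNLI: sentence pairs with a binary entailment label (see the format notes below).
wnli = load_dataset("glue", "wnli")

def tokenize(batch):
    return tokenizer(batch["sentence1"], batch["sentence2"],
                     truncation=True, max_length=128)

wnli = wnli.map(tokenize, batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    return {"acc": (logits.argmax(axis=-1) == labels).mean()}  # logged as eval_acc

args = TrainingArguments(
    output_dir="wnli-bert",
    report_to="wandb",            # stream loss and eval_acc to Weights & Biases
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

trainer = Trainer(model=model, args=args,
                  train_dataset=wnli["train"],
                  eval_dataset=wnli["validation"],
                  tokenizer=tokenizer,
                  compute_metrics=compute_metrics)
trainer.train()
```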

First pass: Fine-tuning BERT baseline

Short manual hyperparameter exploration

I started with the hyperparameter settings in the example provided with Transformers. Based on the loss and validation accuracy (eval_acc) curves plotted in W&B after each run, I adjusted my model to improve performance from a baseline eval_acc of 0.127 to 0.535 in fewer than 20 experiments.
Below, you can see the training loss and validation accuracy curves plotted over time. The starting baseline is in black, and the rest of the runs are colored in rainbow order from red to purple based on their creation order: my earlier experiments are reds/oranges, and the later experiments are blues/purples. The legend shows the maximum sequence length (max_len), training epochs (E), and learning rate (LR) for each run. You can also expand the "BERT variants" run set at the end of this section to see more details about each run.
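
For reference, this manual exploration amounts to relaunching the Transformers GLUE example script with different values for those three knobs. The sketch below shows one way to script that; the flag names match recent versions of run_glue.py, and the grid of values is illustrative, not the exact set of hand-picked runs plotted above.

```python
# Illustrative sketch: launch the run_glue.py example once per configuration.
# Flag names reflect recent versions of the script; the values are placeholders.
import itertools
import subprocess

max_lens = [80, 128, 256]
epoch_counts = [3, 5]
learning_rates = [2e-5, 5e-5, 1e-4]

for max_len, epochs, lr in itertools.product(max_lens, epoch_counts, learning_rates):
    subprocess.run([
        "python", "run_glue.py",
        "--model_name_or_path", "bert-base-uncased",
        "--task_name", "wnli",
        "--do_train", "--do_eval",
        "--max_seq_length", str(max_len),
        "--num_train_epochs", str(epochs),
        "--learning_rate", str(lr),
        # a descriptive output dir doubles as a readable run-naming scheme
        "--output_dir", f"runs/bert_L{max_len}_E{epochs}_LR{lr}",
    ], check=True)
```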

Second pass: Comparing different base models

How useful is it to fine-tune yet another BERT model?

BERT improves substantially with some fine-tuning, and we could probably tune it further, especially with additional data or a hyperparameter sweep. However, starting from a base model that has already proven successful on this task is likely to be faster, and ultimately more performant, than fine-tuning an arbitrary model.
Fortunately, the HuggingFace model zoo offers a variety of pretrained models to choose from.
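For example, swapping the base model in and out is a one-line change with the Auto classes. The checkpoint names below are common choices from the model hub, not necessarily the exact set behind the run sets in this section (which include BERT and ALBERT variants).

```python
# Load several candidate base models for comparison; checkpoint names are examples.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

candidate_checkpoints = [
    "bert-base-uncased",
    "roberta-base",
    "distilbert-base-uncased",
    "albert-base-v1",
]

for checkpoint in candidate_checkpoints:
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
    # ...fine-tune on WNLI as in the earlier sketch, logging one W&B run per checkpoint
```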
Here I train a few variants of each type of model. Once again, the starting model is plotted in black, and the rainbow gradient indicates the order in which other experiments ran. You can select which model type to view by checking/unchecking the individual boxes next to the run set tabs at the bottom of this section. This makes it easy to compare different variants of the same model. You could also compare across model types, though in this section that will lead to an overwhelming number of lines.

Observations

An important caveat for this task, at least as it is implemented in the GLUE benchmark that Transformers uses, is that the train/dev split for Winograd NLI is "somewhat adversarial": the same sentence can appear in its two different forms across train and dev, so a model that has memorized training sentences rather than learned to resolve the pronoun may perform poorly in evaluation. Also, there are only 71 examples in the evaluation set.

Transformers TL;DR: Tune less, they're already powerful

Concrete examples from best model

I logged the actual predictions and confidence scores for one of the best models from this exploration, an Albert-v1 variant with a low learning rate. In the first chart, I show the top 10 errors by prediction confidence score, and in the second, the top 10 most confident correct answers. A text table created with wandb.Table() lets you easily log these predictions for all runs. You can use the arrows in the top right corner to browse all 10 examples in each section and formulate your own hypotheses about what's easy and hard for these models; I will share mine at the end of this section.
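
Here is a rough sketch of that logging step, reusing the fine-tuned model, tokenizer, and WNLI validation split from the earlier sketches; the project name and column layout are placeholders rather than the exact schema behind the charts above.

```python
# Sketch: log per-example predictions and confidence scores as a wandb.Table.
# Assumes `model`, `tokenizer`, and `wnli` from the earlier fine-tuning sketch.
import torch
import wandb

wandb.init(project="wnli-exploration")  # project name is a placeholder

table = wandb.Table(columns=["sentence1", "sentence2", "label", "prediction", "confidence"])

device = next(model.parameters()).device
model.eval()
with torch.no_grad():
    for example in wnli["validation"]:
        inputs = tokenizer(example["sentence1"], example["sentence2"],
                           return_tensors="pt", truncation=True).to(device)
        probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
        prediction = int(probs.argmax())
        table.add_data(example["sentence1"], example["sentence2"],
                       example["label"], prediction, float(probs[prediction]))

wandb.log({"validation_predictions": table})
```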

Each concept has many possible Winograd schemas with binary output labels

The exact formulation of the Winograd Schemas differs a bit from my example at the start of this report to allow for a binary prediction. In this format, my example would read as follows:
For each conceptual example, there are many possible reformulated sentences that yield different labels, depending on the choice of verbs and the order of the words: the subject could be the first word in both sentences, or not, and the subject could be the referent of the pronoun, or not. That gives at least four possible formulations of each concept, not counting the verb meanings themselves, which could double the count to eight. One challenging aspect of WNLI is that the same conceptual example can appear in different formulations across the train/dev split, which makes it easy to overfit and learn the wrong pattern, focusing on the word tokens or syntax patterns instead of the actual meaning.
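
To see this format directly, you can load WNLI with the datasets library and inspect a row: sentence1 holds the original sentence, sentence2 the rewritten version with the pronoun replaced by a candidate referent, and the binary label records whether that substitution preserves the meaning.

```python
# Peek at the WNLI format from the GLUE benchmark.
from datasets import load_dataset

wnli = load_dataset("glue", "wnli")
print(wnli["validation"].num_rows)  # 71 evaluation examples, as noted above

example = wnli["train"][0]
print(example["sentence1"])  # original sentence containing the pronoun
print(example["sentence2"])  # rewrite with the pronoun replaced by a candidate referent
print(example["label"])      # 1 if the rewrite is entailed, 0 otherwise
```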

Observations

Model exploration summary

The parallel coordinates panel and parameter importance panel below summarize and confirm some of the conclusions described above. While I've conducted this short exploration by hand, choosing each experiment configuration to get a feel for the models, running a Hyperparameter Sweep would greatly accelerate the process—and generate these powerful visualizations automatically.
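
As a rough illustration, a sweep over the same knobs can be configured in a few lines. The metric name assumes eval_acc is being logged, the search ranges are illustrative, and train() is a placeholder for whatever function wraps a single fine-tuning run.

```python
# Illustrative W&B sweep configuration over the hyperparameters explored by hand.
import wandb

def train():
    # Placeholder: in practice, read wandb.config (learning_rate, num_train_epochs,
    # max_seq_length), run one fine-tuning pass as in the earlier sketch,
    # and log the resulting eval_acc.
    with wandb.init() as run:
        run.log({"eval_acc": 0.0})  # replace with the real evaluation result

sweep_config = {
    "method": "bayes",
    "metric": {"name": "eval_acc", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"min": 1e-6, "max": 1e-4},
        "num_train_epochs": {"values": [3, 5, 10]},
        "max_seq_length": {"values": [80, 128, 256]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="wnli-exploration")
wandb.agent(sweep_id, function=train, count=20)
```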

Observations