Human language is ambiguous to machines
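Consider these two sentences, which only differ by one word:
A. "Humans train deep nets to help them in the real world."
B. "Humans train deep nets to use them in the real world."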
Hopefully it's clear that in sentence A, the word "them" must refer to the "humans", while in sentence B, "them" refers to "deep nets". This task is straightforward for humans but challenging for machines. Disambiguating the pronoun "them" requires knowing more about the real world (here, the relationship between humans and deep nets) than is entailed in the sentence. Until machine learning models can parse these types of sentences as well as humans do, we won't really be able to communicate with them in natural language, no matter what new website, app, or device we are using.
Language understanding benchmarks for measuring progress
Named after Terry Winograd, a Stanford professor who did foundational work on natural language understanding and published the original example sentence in 1972, the Winograd Schema Challenge remains one of the main benchmarks in this space. Specifically, Winograd Schema sentences like the example above are part of the General Language Understanding Evaluation, or GLUE Benchmark, for text-based deep learning models. In GLUE, the task is called Winograd Natural Language Inference, or WNLI. The baseline human performance on this task is 95.9, but the best trained models are still at 94.5, so there is room for improvement. This is the second hardest of the nine subtasks in GLUE, and in six other tasks, deep learning models have already surpassed the human baseline.
HuggingFace Transformers are a great place to start
The HuggingFace Transformers repository makes it very easy to work with a variety of advanced natural language models and try them on these benchmarks. In this report, I'll show how to use Transformers with Weights & Biases to start tuning models for a natural language task like correctly disambiguating Winograd schemas.
First pass: Fine-tuning BERT baseline
Short manual hyperparameter exploration
I started with the hyperparameter settings in the example provided with Transformers. Based on the loss and validation accuracy (eval_acc) curves plotted in W&B after each run, I adjusted my model to improve performance from a baseline eval_acc of 0.127 to 0.535 in fewer than 20 experiments.
Below, you can see the training loss and validation accuracy curves plotted over time. The starting baseline is in black, and the rest of the runs are colored in rainbow order from red to purple based on their creation order: my earlier experiments are reds/oranges, and the later experiments are blues/purples. The legend shows the maximum sequence length (max_len), training epochs (E), and learning rate (LR) for each run. You can also expand the "BERT variants" run set at the end of this section to see more details about each run.
Second pass: Comparing different base models
How useful is it to fine-tune yet another BERT model?
BERT improves substantially with some fine-tuning, and we could probably tune it further, especially with additional data or a hyperparameter sweep. However, starting with a base model that has already proven successful on this task is likely to be faster and ultimately more performant than fine-tuning an arbitrary model.
Albert, a lighter (parameter-reduced) version of BERT with a self-supervised loss that helps with inter-sentence coherence and especially tasks with multi-sentence inputs. This is the base model for the top entry in the regular GLUE leaderboard, and it ties several models for best performance on WNLI at 94.5 (surprisingly better than the T5 score from SuperGLUE). You can read more in Lan et al. 2019 from Google Research. Interestingly, in the source repo, WNLI is left out of the benchmarked performance on GLUE, and the improvement on MNLI is fairly small.
XLNet, an improvement on BERT that trains to predict all words (instead of just the 15% of masked words) in random order instead of sequential order. This might give XLNet an advantage in understanding dependencies between different words in a sentence. Though XLNet generally outperforms BERT on natural language inference tasks, results for WNLI are not reported in the source repository.
Here I train a few variants of each type of model. Once again, the starting model is plotted in black, and the rainbow gradient indicates the order in which other experiments ran. You can select which model type to view by checking/unchecking the individual boxes next to the run set tabs at the bottom of this section. This makes it easy to compare different variants of the same model. You could also compare across model types, though in this section that will lead to an overwhelming number of lines.
An important caveat for this task, at least as it is implemented in the GLUE Benchmark that Transformers uses, is that the train/dev split for Winograd NLI is "somewhat adversarial": the same sentence can appear in its two different forms across train/dev, meaning that if a model has overfit to the training set, it may perform poorly in evaluation. Also, there are only 71 examples in the evaluation set.
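To get a feel for how much noise a 71-example evaluation set introduces, here is a rough back-of-the-envelope sketch. It assumes each example is an independent draw, which is a simplification given the paired sentences described above, and it uses the normal approximation to the binomial. The function name is my own and is not part of any library.

```python
import math

EVAL_SIZE = 71  # number of WNLI evaluation examples, as stated above


def accuracy_standard_error(accuracy: float, n: int = EVAL_SIZE) -> float:
    """Binomial standard error of an accuracy measured on n examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


# One example flipping from wrong to right moves eval_acc by 1/71.
step = 1.0 / EVAL_SIZE

# Standard error at the best hand-tuned BERT eval_acc reported earlier (0.535).
se = accuracy_standard_error(0.535)

print(f"one example changes accuracy by {step:.4f}")
print(f"standard error at eval_acc=0.535 is {se:.4f}")
```

Under these assumptions a single example is worth about 1.4 accuracy points and the standard error is close to 6 points, so small differences between runs should be read with caution.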
overfitting happens very quickly: across these experiments, eval_acc is often noisy and tends to _decrease_ over training. Setting the learning rate lower helps, but the default learning rate schedule for Transformers decays this already low learning rate further over time, making it tricky to continue improving. The evaluation loss often diverges (see chart above), but it's possible to find a combination of hyperparameters that causes the evaluation loss to converge slowly instead.
XLNet seems to learn the most: training loss doesn't decrease very much in any of these runs, but the XLNet models on average have the most canonical training curves _and_ evaluation loss curves, paired with reasonable accuracies
Roberta performs reasonably: it is the easiest of the models to tune manually, though it overfits quickly, judging by the evaluation loss curves
Albert is too complicated: this base requires more degrees of freedom or a hyperparameter sweep, not quick exploration by hand. Results seemed the least consistent here, especially because the later/supposedly better base model v2 seems to perform worse on this task than v1.
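The learning rate decay mentioned above can be illustrated with a small sketch. It assumes the linear warmup-then-decay shape that, as far as I know, the Transformers example scripts use by default. The function below mimics that shape in plain Python rather than calling the library, and the step count and starting rate are hypothetical.

```python
def linear_decay_lr(step: int, total_steps: int, base_lr: float, warmup_steps: int = 0) -> float:
    """Learning rate at `step` under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # After warmup, decay linearly so the rate reaches 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical run: 100 optimizer steps starting from a learning rate of 1e-5.
base_lr = 1e-5
schedule = [linear_decay_lr(s, total_steps=100, base_lr=base_lr) for s in range(101)]

print(schedule[0], schedule[50], schedule[100])
```

Halfway through such a run the rate has already halved, which is one way a low starting learning rate can end up too small to keep improving.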
Transformers TL;DR: Tune less, they're already powerful
Concrete examples from best model
I logged the actual predictions and confidence scores for one of the best models from this exploration, an Albert-v1 variant with a low learning rate.
In the first chart, I show the top 10 errors by prediction confidence score, and in the second, the top 10 most confident correct answers. A TextTable via wandb.Table() lets you easily log these predictions for all runs. You can use the arrows in the top right corner to browse all 10 examples in each section to formulate your own hypotheses about what's easy and hard for these models, and I will share mine at the end of this section.
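As a minimal sketch of how the two tables above could be assembled, the snippet below splits predictions into errors and correct answers and keeps the most confident ones. The helper name, the column names, and the sample rows are hypothetical placeholders, not actual model output. The wandb call is left as a comment so the example runs without a W&B account.

```python
from typing import List, Tuple

# (sentence pair, true label, predicted label, confidence in the prediction)
Prediction = Tuple[str, int, int, float]


def top_k_by_confidence(preds: List[Prediction], correct: bool, k: int = 10) -> List[list]:
    """Return table rows for the k most confident correct (or incorrect) predictions."""
    selected = [p for p in preds if (p[1] == p[2]) == correct]
    selected.sort(key=lambda p: p[3], reverse=True)
    return [list(p) for p in selected[:k]]


# Hypothetical predictions, for illustration only.
preds: List[Prediction] = [
    ("example pair 1", 1, 0, 0.91),
    ("example pair 2", 0, 0, 0.88),
    ("example pair 3", 1, 1, 0.62),
    ("example pair 4", 0, 1, 0.97),
]

columns = ["text", "label", "prediction", "confidence"]
error_rows = top_k_by_confidence(preds, correct=False)
correct_rows = top_k_by_confidence(preds, correct=True)

# With wandb installed and a run started via wandb.init(), the rows could be logged as:
#   wandb.log({"top_errors": wandb.Table(columns=columns, data=error_rows)})
print(error_rows)
print(correct_rows)
```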
Each concept has many possible Winograd schemas with binary output labels
The exact formulation of the Winograd Schemas differs a bit from my example at the start of this report to allow for a binary prediction. In this format, my example would read as follows:
"Humans train deep nets to help them in the real world. Deep nets help humans." Label: 1 (True, the second sentence follows from, or is entailed by, the first)
"Humans train deep nets to use them in the real world. Deep nets use humans." Label: 0 (False, the second sentence doesn't follow)
For each conceptual example, there are many possible reformulated sentences that yield different labels, depending on the choice of verbs and the order of the words. For example, the subject could be the first word in both sentences, or not, and the subject could be the referent of the pronoun, or not. This makes at least four possible formulations of each concept, not counting the verb meanings themselves, which could increase the possible reformulations to eight. One challenging aspect of WNLI is that the same conceptual example can appear in different formulations across the train/dev dataset split. This makes it easy to overfit and learn the wrong pattern, focusing on the word tokens or syntax patterns instead of the actual meaning.
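The binary format described above can be sketched as a small data structure. The class and field names here are my own choice for illustration and may not match the column names used by GLUE or Transformers. The two examples are the ones given above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class WnliExample:
    premise: str     # sentence containing the ambiguous pronoun
    hypothesis: str  # candidate reading of that sentence
    label: int       # 1 = hypothesis follows from the premise, 0 = it does not


examples: List[WnliExample] = [
    WnliExample("Humans train deep nets to help them in the real world.", "Deep nets help humans.", 1),
    WnliExample("Humans train deep nets to use them in the real world.", "Deep nets use humans.", 0),
]


def accuracy(predictions: List[int], data: List[WnliExample]) -> float:
    """Fraction of examples whose predicted label matches the gold label."""
    hits = sum(int(p == ex.label) for p, ex in zip(predictions, data))
    return hits / len(data)


# A classifier that always answers 1 gets exactly one of these two examples right.
always_one = accuracy([1, 1], examples)
print(always_one)
```

Because the two premises differ by a single word but carry opposite labels, a model that keys on surface tokens rather than meaning cannot get both right.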
the data is a bit noisy/some labels are debatable: Some of the examples may be mislabeled in the dev set, or perhaps more ambiguous to humans than an author thought. "They broadcast an announcement, but a subway came into the station and I couldn't hear over it. I couldn't hear the subway." and "Fred is the only man still alive who remembers my great-grandfather. He is a remarkable man. Fred is a remarkable man." are both marked as "True, this follows", but I don't think this is super clear in either case.
weak understanding of animal/entity relationships: Some of the harder examples involve two animals, like a duck and a minnow, or a cat and a mouse, where a human would know the correct answer based on how these animals relate in the real world (cats hunt mice, ducks probably eat minnows--perhaps this is an issue for the "police" and "gang members" example too). Explicitly increasing the knowledge representation of the model for certain entity types may help here, as seems to be the approach with Ernie.
syntax may be easier to learn than semantics: Most of the highest-confidence correct answers have a repeating pattern, where the subject is the first word in each sentence, the final clause is slightly longer and phrases identically in both sentences, and the label is "0": for example, "Sam took French classes from Adam, because he was known to speak it fluently. Sam was known to speak it fluently." When the first word in the second sentence is different and is not the subject, such as with the two examples involving Beth and Sally in the first table, the model seems to get the answer wrong.
proper nouns are a confound: "Madonna" is a fairly rare word, and two of the highest confidence correct answers involve Madonna. Several formulations of the Madonna example appear in the train set. Perhaps the model gets more Madonna examples correct because it overfits to that rare token, or perhaps to the baseline rate of the labels for sentences that include Madonna. However, in train, Madonna appears in 2 examples with label 0 and 4 with label 1, which doesn't support the overfitting hypothesis. This also brings up a broader question of which facts about the world we expect a model to know or care about it knowing: the names of specific celebrities? whether particular names refer to celebrities? just the concept of celebrities?
next steps: more rigorous data analysis: It would be useful to build a frequency histogram of all the proper names and other subjects that appear in WNLI, as well as the different possible formulations of an example, and balance across all of these. This would help disentangle whether a model is learning accidental patterns in the construction of the examples, or actual meaning.
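As a starting point for the frequency analysis suggested above, here is a minimal sketch that counts label frequencies per proper name. The helper name and the list of tracked names are assumptions made for illustration, and the two sample rows are the sentences quoted earlier in this section rather than the full dataset.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def label_counts_by_name(examples: Iterable[Tuple[str, int]], names: List[str]) -> Dict[str, Counter]:
    """For each tracked name, count the labels of the examples that mention it."""
    counts: Dict[str, Counter] = {name: Counter() for name in names}
    for text, label in examples:
        words = set(re.findall(r"[A-Za-z]+", text))  # each example counts once per name
        for name in names:
            if name in words:
                counts[name][label] += 1
    return counts


# Two examples quoted above, with the labels described in the text.
sample = [
    ("Sam took French classes from Adam, because he was known to speak it fluently. "
     "Sam was known to speak it fluently.", 0),
    ("Fred is the only man still alive who remembers my great-grandfather. "
     "He is a remarkable man. Fred is a remarkable man.", 1),
]

counts = label_counts_by_name(sample, ["Sam", "Adam", "Fred", "Madonna"])
for name, label_counts in counts.items():
    print(name, dict(label_counts))
```

Run over the full train and dev sets, the same counts would show whether a name such as Madonna is skewed toward one label.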
Model exploration summary
The parallel coordinates panel and parameter importance panel below summarize and confirm some of the conclusions described above. While I've conducted this short exploration by hand, choosing each experiment configuration to get a feel for the models, running a Hyperparameter Sweep would greatly accelerate the process and generate these powerful visualizations automatically.
lower learning rates are better: these models are already well-trained and they easily overfit the small dataset (note that "start_lr" and "learning_rate" are two column names for the same hyperparameter—"learning rate" is also a default variable that changes over time as a Transformer trains)
shorter sequence length is better: would need to dig into concrete examples, but they tend to be short
larger batch size is better: I did not focus on this hyperparameter, but increasing it seems to help performance—though this too is inconclusive, as larger batch size will lead to fewer log steps and less overfitting
more epochs are slightly better: again, because the models overfit easily, this is inconclusive
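For reference, a Hyperparameter Sweep over the settings discussed in this summary might be configured roughly as follows. This is a sketch: the parameter names and candidate values are hypothetical and would need to match whatever the training script actually reads, and only the eval_acc metric name comes from this report.

```python
from math import prod

# Hypothetical sweep configuration in the dictionary format that W&B sweeps accept.
sweep_config = {
    "method": "random",  # sample combinations at random
    "metric": {"name": "eval_acc", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"values": [1e-6, 5e-6, 1e-5, 2e-5]},
        "max_seq_length": {"values": [64, 128, 256]},
        "train_batch_size": {"values": [8, 16, 32]},
        "num_train_epochs": {"values": [3, 5, 10]},
    },
}


def grid_size(config: dict) -> int:
    """Number of distinct combinations if every listed value were tried."""
    return prod(len(p["values"]) for p in config["parameters"].values())


print(grid_size(sweep_config))

# With wandb installed, the sweep could then be registered and run with:
#   sweep_id = wandb.sweep(sweep_config, project="my-project")  # project name is a placeholder
#   wandb.agent(sweep_id, function=train)  # `train` is your own training function
```

Even this small search space has over a hundred combinations, which is why random sampling is used here instead of an exhaustive grid.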