Who Is Them? Text Disambiguation With Transformers
In this article, we look at how to use Hugging Face to explore models for natural language understanding
In this article, we take a look at how to use Hugging Face to explore models for natural language understanding, giving a number of examples, and using Weights & Biases to record the results of our experiments.
Table of Contents
- Overview
- First pass: Fine-tuning BERT baseline
- Second Pass: Comparing Different Base Models
- Transformers TL;DR: Tune Less, They’re Already Powerful
- Concrete Examples from Best Model
- Model exploration summary
Overview
Human Language Is Ambiguous to Machines
Consider these two sentences, which only differ by one word:
- A: "Humans train deep nets to help them in the real world."
- B: "Humans train deep nets to use them in the real world."
Hopefully, it's clear that in sentence A, the word "them" must refer to the "humans", while in sentence B, "them" refers to "deep nets". This task is straightforward for humans but challenging for machines. Disambiguating the pronoun "them" requires knowing more about the real world—the relationship between humans and deep nets—than is entailed in the sentence. Until machine learning models can parse these types of sentences as well as humans do, we won't really be able to communicate with them in natural language, no matter what new website, app, or device we are using.
Language Understanding Benchmarks for Measuring Progress
The Winograd Schema Challenge is named for Terry Winograd, a Stanford professor who did foundational work on natural language understanding and posed the original example sentence in 1972, and it remains one of the main benchmarks in this space. Specifically, Winograd Schema sentences like the example above are part of the General Language Understanding Evaluation, or GLUE Benchmark, for text-based deep learning models. In GLUE, the task is called Winograd Natural Language Inference, or WNLI. The baseline human performance on this task is 95.9, while the best models are still at 94.5, so there is room for improvement. This is the second hardest of the nine subtasks in GLUE; in six of the other tasks, deep learning models have already surpassed the human baseline.
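To make the task concrete before any training, it helps to look at a few raw WNLI sentence pairs. The report below relies on run_glue.py to download and process the GLUE data, but if you just want to peek at the examples, the Hugging Face datasets library works well; the snippet below is a minimal sketch under that assumption.

```python
# Minimal sketch (assumption: the `datasets` library is installed; the report
# itself fetches GLUE data through the run_glue.py workflow instead).
from datasets import load_dataset

wnli = load_dataset("glue", "wnli")
print(wnli)  # the validation split holds the 71 evaluation examples discussed later

example = wnli["train"][0]
print(example["sentence1"])  # sentence containing the ambiguous pronoun
print(example["sentence2"])  # candidate resolution of the pronoun
print(example["label"])      # 1 if sentence2 follows from sentence1, else 0
```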
Hugging Face Transformers Are a Great Place to Start
The Hugging Face Transformers repository makes it very easy to work with a variety of advanced natural language models and try them on these benchmarks. In this report, I'll show how to use Transformers with Weights & Biases to start tuning models for a natural language task like correctly disambiguating Winograd schemas.
First pass: Fine-tuning BERT baseline
Short manual hyperparameter exploration
I started with the hyperparameter settings in the example provided with Transformers. Based on the loss and validation accuracy (eval_acc) curves plotted in W&B after each run, I adjusted my model to improve performance from a baseline eval_acc of 0.127 to 0.535 in fewer than 20 experiments.
Below, you can see the training loss and validation accuracy curves plotted over time. The starting baseline is in black, and the rest of the runs are colored in rainbow order from red to purple based on their creation order: my earlier experiments are reds/oranges, and the later experiments are blues/purples. The legend shows the maximum sequence length (max_len), training epochs (E), and learning rate (LR) for each run. You can also expand the "BERT variants" run set at the end of this section to see more details about each run.
[Run set panel: BERT variants, 14 runs]
Helpful Modifications
- increase logging frequency: this doesn't affect the model performance, but it yields more detailed/less noisy plots (compare the red zigzag with the peach curve).
- lower the initial learning rate: my initial runs didn't seem to learn much; instead, eval_acc dropped substantially over training. Lowering the learning rate helped stabilize the accuracy and even increase it slightly.
- lower maximum sequence length: The starting window size of 128 seemed too large for the space of Winograd Schema sentences (perhaps optimized for other tasks in the benchmarks). Lowering to 64, 32, and finally 16 kept improving eval_acc. At 8, I saw eval_acc drop again.
- more epochs: increasing the number of epochs to 10 and then 20 gave me a better sense of model performance without wasting too much time on extra epochs after eval_acc stopped improving. (Note that the combination of training batch size (per_gpu_train_batch_size) and number of epochs (num_train_epochs) affects how many training steps we effectively see, hence the endpoints of the curves above differ.)
Example Script to Run Training
It's incredibly easy to get started with this code. You can also follow this excellent report for step-by-step instructions on setting up Transformers. One crucial flag to modify is logging_steps, so that your model logs to Weights & Biases more frequently than the high default setting—otherwise, especially early in development, you may think you're not logging any results.
```bash
export GLUE_DIR=/path/to/glue/data
export TASK_NAME=WNLI

python run_glue.py \
  --model_type bert \
  --model_name_or_path bert-base-cased \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --evaluate_during_training \
  --data_dir $GLUE_DIR/$TASK_NAME \
  --max_seq_length 128 \
  --per_gpu_train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir /tmp/$TASK_NAME/ \
  --overwrite_output_dir \
  --logging_steps 10
```
Second Pass: Comparing Different Base Models
How Useful Is It To Fine-Tune Yet Another BERT Model?
BERT improves substantially with some fine-tuning, and we could probably tune it further, especially with additional data or a hyperparameter sweep. However, starting with a base model that has already proven successful on this task is likely to be faster and ultimately more performant than fine-tuning an arbitrary model. Three candidates stood out; a short sketch of how the base checkpoint gets swapped follows the list below.
- Roberta, a more robustly optimized variant of BERT, performs well on WNLI according to a model card in the Transformers repo, especially with pretraining on extra data
- Albert, a lighter (parameter-reduced) version of BERT with a self-supervised loss that helps with inter-sentence coherence and especially tasks with multi-sentence inputs. This is the base model for the top entry in the regular GLUE leaderboard, and it ties several models for best performance on WNLI at 94.5 (surprisingly better than the T5 score from SuperGLUE). You can read more in Lan et al 2019 from Google Research. Interestingly, in the source repo, WNLI is left out of the benchmarked performance on GLUE, and the improvement on MNLI is fairly small.
- XLNet, an improvement on BERT that trains to predict all words (instead of just the 15% of masked words) in random order instead of sequential order. This might give XLNet an advantage in understanding dependencies between different words in a sentence. Though XLNet generally outperforms BERT on natural language inference tasks, results for WNLI are not reported in the source repository.
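In run_glue.py, switching between these bases is just a matter of changing the --model_type and --model_name_or_path flags. For intuition, here is a rough sketch of what that swap amounts to in the Transformers API: the same two-label classification head is attached to a different pretrained encoder. The checkpoint names are standard hub identifiers; everything else in the training setup stays the same.

```python
# Rough sketch of swapping the base model; the fine-tuning procedure is unchanged.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoints = {
    "bert":    "bert-base-cased",
    "roberta": "roberta-base",
    "albert":  "albert-base-v1",   # or "albert-base-v2"
    "xlnet":   "xlnet-base-cased",
}

name = checkpoints["albert"]
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)  # WNLI is binary
```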
Here I train a few variants of each type of model. Once again, the starting model is plotted in black, and the rainbow gradient indicates the order in which other experiments ran. You can select which model type to view by checking/unchecking the individual boxes next to the run set tabs at the bottom of this section. This makes it easy to compare different variants of the same model. You could also compare across model types, though in this section that will lead to an overwhelming number of lines.
[Run set panels: Roberta and the other base-model variants]
Observations
An important caveat for this task, at least as it is implemented in the GLUE Benchmark that Transformers uses, is that the train/dev split for Winograd NLI is "somewhat adversarial": the same sentence can appear in its two different forms across train/dev, meaning that if a model has overfit to the training set, it may perform poorly in evaluation. Also, there are only 71 examples in the evaluation set.
- overfitting happens very quickly: across these experiments, eval_acc is often noisy and defaults to _decreasing_ over training. Setting the learning rate lower helps, except that the default learning rate schedule in Transformers decreases this already low learning rate further over time, making it tricky to continue improving (a sketch of a flatter schedule follows this list). The evaluation loss often diverges (see chart above), but it's possible to find a combination of hyperparameters that causes the evaluation loss to converge slowly instead.
- XLNet seems to learn the most: training loss doesn't decrease very much in any of these runs, but the XLNet models on average have the most canonical training curves _and_ evaluation loss curves, paired with reasonable accuracies
- Roberta performs reasonably: the easiest of the models to tune manually, though it overfits quickly if you look at the evaluation loss curves
- Albert is too complicated: this base requires more degrees of freedom or a hyperparameter sweep, not quick exploration by hand. Results seemed the least consistent here, especially because the later/supposedly better base model v2 seems to perform worse on this task than v1.
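One knob I did not get to: run_glue.py pairs its optimizer with a linear-decay learning rate schedule, which is what shrinks an already low learning rate further over training. A flatter schedule is one possible remedy. The sketch below shows the relevant pieces from Transformers' optimization utilities; wiring this into the example script would require editing it, so treat this as illustrative rather than something I ran.

```python
# Illustrative only: hold the learning rate constant after a short warmup
# instead of letting it decay linearly to zero.
import torch
from transformers import AutoModelForSequenceClassification, get_constant_schedule_with_warmup

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)  # low starting learning rate (illustrative value)
scheduler = get_constant_schedule_with_warmup(optimizer, num_warmup_steps=50)

# inside the training loop, after each optimizer.step():
#     scheduler.step()  # keeps the learning rate flat after warmup
```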
Transformers TL;DR: Tune Less, They’re Already Powerful
[Run set panel: Transformers by base model, 72 runs]
Albert-v1 Is Best on Average, XLNet Has Highest Accuracy
Though this averaging process is approximate, Albert-v1 seems to perform best across different tuning efforts. The best validation accuracy over all runs was 0.59, from an XLNet model, while the best BERT model topped out at 0.55. To quantify the value of exploring different Transformer bases rather than only fine-tuning BERT: varying the base model boosted the top accuracy by 7% and the average accuracy by 16%. You can see the details of each run by expanding the "Transformer variants sorted by eval_acc" tab at the bottom of the last section, the "Model exploration summary".
Next steps: T5, Ernie
Other models that would be awesome to try but are not yet configured for the AutoModel class in Transformers include:
- T5, the Text-to-Text Transfer Transformer from Raffel et al at Google, is the state-of-the-art in transfer learning for NLP and scores the highest overall on the SuperGLUE leaderboard, an improvement on the GLUE leaderboard referenced earlier.
- Ernie, the Enhanced Representation through Knowledge Integration model from Sun et al 2019, which enriches its language learning with entity-level and phrase-level masking for concepts composed of multiple words. This improves performance on natural language inference tasks and ties for first on Winograd NLI in the GLUE leaderboard.
Perhaps there is an opportunity to configure the AutoModelForSequenceClassification class in Transformers to leverage these models.
Other Base Models to Explore in Transformers
The other base model types that would be easy to try—config changes only, no code!—with the [existing Transformers integration for GLUE](https://github.com/huggingface/transformers/tree/master/examples/text-classification):
- Camembert
- XLMRoberta
- Bart
- Flaubert
- XLM
Concrete Examples from Best Model
I logged the actual predictions and confidence scores for one of the best models from this exploration, an Albert-v1 variant with a low learning rate.
In the first chart, I show the top 10 errors by prediction confidence score, and in the second, the top 10 most confident correct answers. A text table logged via wandb.Table() lets you easily record these predictions for all runs (a logging sketch follows the panel below). You can use the arrows in the top right corner to browse all 10 examples in each section and formulate your own hypotheses about what's easy and hard for these models; I will share mine at the end of this section.
[Run panel: Albert-v1-best-eval run]
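For reference, here is a minimal sketch of how such a prediction table can be logged. The project name, column names, and the two placeholder rows are illustrative; in practice the rows come from running the fine-tuned model over the WNLI dev set.

```python
# Hedged sketch: log per-example predictions and confidence scores to W&B.
import wandb

run = wandb.init(project="wnli-exploration")  # project name is a placeholder

table = wandb.Table(columns=["sentence_pair", "label", "prediction", "confidence"])

# placeholder rows; real rows come from the model's dev-set predictions
examples = [
    ("Humans train deep nets to help them ... Deep nets help humans.", 1, 1, 0.83),
    ("Humans train deep nets to use them ... Deep nets use humans.",   0, 1, 0.61),
]
for sentence_pair, label, prediction, confidence in examples:
    table.add_data(sentence_pair, label, prediction, confidence)

wandb.log({"dev_predictions": table})
run.finish()
```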
Each concept has many possible Winograd schemas with binary output labels
The exact formulation of the Winograd Schemas differs a bit from my example at the start of this report to allow for a binary prediction. In this format, my example would read as follows:
- "Humans train deep nets to help them in the real world. Deep nets help humans." Label: 1 (True, the second sentence follows from, or is entailed by, the first)
- "Humans train deep nets to use them in the real world. Deep nets use humans." Label: 0 (False, the second sentence doesn't follow)
For each conceptual example, there are many possible reformulated sentences that yield different labels, depending on the choice of verbs and the order of the words: the subject could be the first word in both sentences, or not, and the subject could be the referent of the pronoun, or not. That makes at least four possible formulations of each concept, not counting the verb meanings themselves, which could increase the possible reformulations to eight. One challenging aspect of WNLI is that the same conceptual example can appear in different formulations across the train/dev dataset split. This makes it easy to overfit and learn the wrong pattern, focusing on the word tokens or syntax patterns instead of the actual meaning.
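To make the binary format concrete, here is a hedged sketch of inference with a fine-tuned sequence classifier on one of these reformulated pairs, assuming a recent version of Transformers; the checkpoint path is a placeholder for the output_dir of the training script above.

```python
# Hedged sketch: score whether the second sentence is entailed by the first.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "/tmp/WNLI"  # placeholder: the output_dir from the training script
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
model.eval()

sentence1 = "Humans train deep nets to help them in the real world."
sentence2 = "Deep nets help humans."

inputs = tokenizer(sentence1, sentence2, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.softmax(logits, dim=-1)[0]
prediction = int(probs.argmax())       # 1 = entailed, 0 = not entailed
confidence = float(probs[prediction])
print(prediction, round(confidence, 3))
```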
Observations
- the data is a bit noisy/some labels are debatable: Some of the examples may be mislabeled in the dev set, or perhaps more ambiguous to humans than an author thought. "They broadcast an announcement, but a subway came into the station and I couldn't hear over it. I couldn't hear the subway." and "Fred is the only man still alive who remembers my great-grandfather. He is a remarkable man. Fred is a remarkable man." are both marked as "True, this follows", but I don't think this is super clear in either case.
- weak understanding of animal/entity relationships: Some of the harder examples involve two animals, like a duck and a minnow, or a cat and a mouse, where a human would know the correct answer based on how these animals relate in the real world (cats hunt mice, ducks probably eat minnows; perhaps this is an issue for the "police" and "gang members" example too). Explicitly increasing the knowledge representation of the model for certain entity types may help here, as seems to be the approach with Ernie.
- syntax may be easier to learn than semantics: Most of the highest-confidence correct answers have a repeating pattern, where the subject is the first word in each sentence, the final clause is slightly longer and phrased identically in both sentences, and the label is "0": for example, "Sam took French classes from Adam, because he was known to speak it fluently. Sam was known to speak it fluently." When the first word in the second sentence is different and is not the subject, as with the two examples involving Beth and Sally in the first table, the model seems to get the answer wrong.
- proper nouns are a confound: "Madonna" is a fairly rare word, and two of the highest-confidence correct answers involve Madonna. Several formulations of the Madonna example appear in the train set. Perhaps the model gets more Madonna examples correct because it overfits to that rare token, or to the baseline rate of the labels for sentences that include Madonna. However, in train, Madonna appears in 2 examples with label 0 and 4 with label 1, which doesn't support the overfitting hypothesis. This also brings up a broader question of which facts about the world we expect a model to know, or care whether it knows: the names of specific celebrities? Whether particular names refer to celebrities? Just the concept of celebrities?
- next steps: more rigorous data analysis: It would be useful to build a frequency histogram of all the proper names and other subjects that appear in WNLI, as well as of the different possible formulations of each example, and to balance across all of these (a rough sketch of such a count follows below). This would help disentangle whether a model is learning accidental patterns in the construction of the examples or actual meaning.
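As a starting point, here is a rough sketch of that kind of count: it tallies capitalized tokens in the WNLI training sentences as a crude proxy for proper names. It assumes the datasets library and makes no attempt at real tokenization or named-entity recognition.

```python
# Rough sketch: count capitalized tokens (a crude proxy for proper names)
# in the WNLI training set, to spot tokens a model might latch onto.
from collections import Counter
from datasets import load_dataset

train = load_dataset("glue", "wnli", split="train")

counts = Counter()
for row in train:
    # skip the first word, which is capitalized regardless of being a name
    for word in row["sentence1"].split()[1:]:
        token = word.strip('.,"\'')
        if token[:1].isupper():
            counts[token] += 1

print(counts.most_common(20))
```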
Model exploration summary
The parallel coordinates panel and parameter importance panel below summarize and confirm some of the conclusions described above. While I've conducted this short exploration by hand, choosing each experiment configuration to get a feel for the models, running a Hyperparameter Sweep would greatly accelerate the process—and generate these powerful visualizations automatically. A minimal sweep configuration sketch follows the observations below.
Observations
- lower learning rates are better: these models are already well-trained and they easily overfit the small dataset (note that "start_lr" and "learning_rate" are two column names for the same hyperparameter—"learning rate" is also a default variable that changes over time as a Transformer trains)
- shorter sequence length is better: I would need to dig into concrete examples, but the Winograd sentences tend to be short
- larger batch size is better: I did not focus on this hyperparameter, but increasing it seems to help performance—though this too is inconclusive, as a larger batch size also means fewer training steps and therefore less overfitting
- more epochs are slightly better: again, because the models overfit easily, this is inconclusive
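These observations map naturally onto a sweep. The sketch below is a minimal W&B sweep configuration over the hyperparameters discussed above; the parameter names mirror the run_glue.py flags, the ranges are starting points rather than tuned values, and the fixed flags from the training command (task name, data directory, and so on) would still need to be supplied, for example through the sweep's command section.

```python
# Minimal sketch of a W&B sweep over the hyperparameters discussed above.
import wandb

sweep_config = {
    "program": "run_glue.py",
    "method": "bayes",
    "metric": {"name": "eval_acc", "goal": "maximize"},
    "parameters": {
        "learning_rate":            {"min": 1e-6, "max": 5e-5},
        "max_seq_length":           {"values": [16, 32, 64]},
        "num_train_epochs":         {"values": [10, 20, 30]},
        "per_gpu_train_batch_size": {"values": [16, 32, 64]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="wnli-exploration")  # project name is a placeholder
# launch one or more agents with: wandb agent <sweep_id>
```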
[Run set panel: Transformer variants sorted by eval_acc, 73 runs]