Transformer models and transfer learning methods continue to propel the field of Natural Language Processing forwards at a tremendous pace. However, state-of-the-art performance too often comes at the cost of (a lot of) complex code.
Simple Transformers strips away that complexity and lets you get down to what matters – training and experimenting with Transformer model architectures. It helps you bypass the complicated setup, boilerplate code, and general unpleasantness by letting you initialize a model in one line, train it in the next, and evaluate it with a third.
In this report, I build on the simpletransformers repo and explore some of the most common applications of deep NLP, including tasks from the GLUE benchmark, along with recipes for training SOTA transformer models to perform them. I've used the DistilBERT transformer model for all the tasks, as it is computationally less expensive. I also extensively explore optimizing DistilBERT hyperparameters with Weights & Biases Sweeps.
Simple Transformers comes with native support for model performance tracking using Weights & Biases.
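As a minimal sketch of how that tracking plugs in (the project name below is a placeholder, not something from this report), any Simple Transformers model accepts a wandb_project entry in its args, and training metrics are then streamed to that Weights & Biases project:

from simpletransformers.classification import ClassificationModel

# Passing `wandb_project` in args is enough to log training metrics to W&B.
# 'nlp-viz-demo' is a placeholder project name.
model = ClassificationModel(
    'distilbert', 'distilbert-base-uncased',
    args={'wandb_project': 'nlp-viz-demo', 'num_train_epochs': 1},
)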
ELECTRA is a new method for self-supervised language representation learning. It can be used to pre-train transformer networks using relatively little compute. ELECTRA models are trained to distinguish "real" input tokens vs "fake" input tokens generated by another neural network, similar to the discriminator of a GAN. At small scale, ELECTRA achieves strong results even when trained on a single GPU. At large scale, ELECTRA achieves state-of-the-art results on the SQuAD 2.0 dataset.
The major advantage of the ELECTRA training process is its compute efficiency: it makes it possible to train competitive models on a single GPU, while also being more accurate than traditional pre-training methods for the same amount of compute.
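The pre-training procedure itself is beyond the scope of this report, but a pre-trained ELECTRA discriminator drops straight into the same Simple Transformers workflow used below. A hedged sketch (the checkpoint google/electra-small-discriminator and the toy data are my own choices for illustration, not from the report):

import pandas as pd
from simpletransformers.classification import ClassificationModel

# Toy sentiment data purely for illustration: (text, label) pairs.
train_df = pd.DataFrame(
    [["this movie was great", 1], ["this movie was terrible", 0]],
    columns=["text", "labels"],
)

# Fine-tune the small ELECTRA discriminator checkpoint released by Google.
model = ClassificationModel(
    'electra', 'google/electra-small-discriminator',
    args={'num_train_epochs': 1, 'overwrite_output_dir': True},
    use_cuda=False,  # set to True if a GPU is available
)
model.train_model(train_df)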
To demonstrate multilabel classification we will use the Jigsaw Toxic Comments dataset from Kaggle. Simple Transformers requires a labels column containing a multi-hot encoded list of labels for each example, as well as a text column containing the example text.
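A rough sketch of that preprocessing step is shown below. The column names (comment_text plus the six toxicity labels) are my reading of the Kaggle CSV and should be checked against the downloaded file:

import pandas as pd

# Assumed Jigsaw Toxic Comments columns; verify against the actual train.csv.
label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

df = pd.read_csv('train.csv')
train_df = pd.DataFrame()
train_df['text'] = df['comment_text']
# Multi-hot encode the six labels: each row becomes a list like [0, 1, 0, 0, 1, 0].
train_df['labels'] = df[label_cols].values.tolist()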
from simpletransformers.classification import MultiLabelClassificationModel

# distilbert-base-uncased with a 6-way multilabel head (one output per toxicity type).
model = MultiLabelClassificationModel(
    'distilbert', 'distilbert-base-uncased', num_labels=6,
    args={'train_batch_size': 2, 'gradient_accumulation_steps': 16,
          'learning_rate': 3e-5, 'num_train_epochs': 3, 'max_seq_length': 512},
)
This creates a MultiLabelClassificationModel that can be used for training, evaluating, and predicting on multilabel classification tasks. The first parameter is the model_type, the second is the model_name, and the third is the number of labels in the data.
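From there, training, evaluation, and prediction are a single call each. A minimal sketch, assuming the train_df prepared above and an eval_df held out with the same two columns:

# Fine-tune on the multi-hot encoded DataFrame (columns: text, labels).
model.train_model(train_df)

# Evaluate on a held-out DataFrame with the same columns.
result, model_outputs, wrong_predictions = model.eval_model(eval_df)

# Predict label probabilities for new comments.
predictions, raw_outputs = model.predict(["You are a wonderful person."])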
Named-entity recognition (NER) (also known as entity identification, entity chunking and entity extraction) is a sub-task of information extraction that seeks to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc. To demonstrate Named Entity Recognition, we’ll be using the CoNLL Dataset.
We'll create an NERModel that can be used for training, evaluation, and prediction on NER tasks (a complete training sketch follows the default args below). The NERModel object takes the following parameters:
model_type: The type of model (bert, roberta, distilbert, etc.)
model_name: Default Transformer model name or path to a directory containing Transformer model files (pytorch_model.bin).
labels (optional): A list of all Named Entity labels. If not given, ["O", "B-MISC", "I-MISC", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"] will be used.
args (optional): Default args will be used if this parameter is not provided. If provided, it should be a dict containing the args that should be changed in the default args.
use_cuda (optional): Use GPU if available. Setting to False will force the model to use the CPU only.
We use the following default args for the simpletransformers NERModel:
{
    "output_dir": "outputs/", "cache_dir": "cache_dir/",
    "fp16": True, "fp16_opt_level": "O1",
    "max_seq_length": 128, "train_batch_size": 8, "eval_batch_size": 8,
    "gradient_accumulation_steps": 1, "num_train_epochs": 1,
    "weight_decay": 0, "learning_rate": 4e-5, "adam_epsilon": 1e-8,
    "warmup_ratio": 0.06, "warmup_steps": 0, "max_grad_norm": 1.0,
    "logging_steps": 50, "save_steps": 2000,
    "overwrite_output_dir": False, "reprocess_input_data": False,
    "evaluate_during_training": False,
    "process_count": cpu_count() - 2 if cpu_count() > 2 else 1,
    "n_gpu": 1,
}
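Putting this together, here's a minimal sketch of fine-tuning DistilBERT for NER. The file names train.txt and eval.txt are placeholders for data in CoNLL format (one "token label" pair per line, sentences separated by blank lines); simpletransformers also accepts a DataFrame with sentence_id, words, and labels columns:

from simpletransformers.ner import NERModel

# DistilBERT NER model; any args given here override the defaults shown above.
model = NERModel(
    'distilbert', 'distilbert-base-cased',
    args={'num_train_epochs': 3, 'overwrite_output_dir': True},
)

# Placeholder paths to CoNLL-formatted train and eval files.
model.train_model('train.txt')
result, model_outputs, preds_list = model.eval_model('eval.txt')

# Tag the entities in a new sentence.
predictions, raw_outputs = model.predict(["George Washington went to Washington"])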
Question answering (QA) is a computer science discipline in the field of information retrieval and natural language processing (NLP), which is concerned with building systems that automatically answer questions posed by humans in a natural language.
We'll use the Stanford Question Answering Dataset (SQuAD 2.0) for training and evaluating our model. SQuAD is a reading comprehension dataset and a standard benchmark for QA models. The dataset is publicly available, and QNLI, one of the GLUE benchmark tasks, is derived from it.
The dataset consists of multiple dictionaries. Each such dictionary contains two attributes:
context: The paragraph or text from which the question is asked.
qas: A list of questions and answers.
Questions and answers are represented as dictionaries. Each dictionary in qas has the following components.
id: (string) A unique ID for the question. Should be unique across the entire dataset.
question: (string) A question.
is_impossible: (bool) True if the question cannot be answered from the given context.
answers: (list) The list of correct answers to the question.
A single answer is represented by a dictionary with the following attributes.
text: (string) The answer to the question. Must be a substring of the context.
answer_start: (int) Starting index of the answer in the context.
Next, we'll create a QuestionAnsweringModel object and set the hyperparameters for fine-tuning the model. Just as before, the first parameter is the model_type and the second is the model_name.
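Below is a minimal sketch covering both the data format just described and the model setup. The single hand-written training record and the hyperparameter values are illustrative only; in practice you would pass the full SQuAD 2.0 data (train_model also accepts a path to a SQuAD-format JSON file):

from simpletransformers.question_answering import QuestionAnsweringModel

# One training record in the format described above (context + qas).
train_data = [
    {
        "context": "Simple Transformers is a library built on top of Hugging Face Transformers.",
        "qas": [
            {
                "id": "00001",
                "question": "What is Simple Transformers built on?",
                "is_impossible": False,
                "answers": [{"text": "Hugging Face Transformers", "answer_start": 49}],
            }
        ],
    }
]

# DistilBERT question answering model; hyperparameters here are placeholders.
model = QuestionAnsweringModel(
    'distilbert', 'distilbert-base-uncased',
    args={'num_train_epochs': 2, 'max_seq_length': 384, 'overwrite_output_dir': True},
)

# A path to a SQuAD-format JSON file (e.g. 'train-v2.0.json') also works here.
model.train_model(train_data)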
In this report, we've trained and visualized models that perform some of the most important deep NLP tasks using simpletransformers, a high-level wrapper around the Hugging Face Transformers library. Simple Transformers combines the accessible pre-trained transformer models provided by Hugging Face with its own powerful training scripts, which makes training a SOTA model a piece of cake.