Using SimpleTransformers for Common NLP Applications

Explore Language Modeling, Named Entity Recognition, and Question Answering with the Simple Transformers library. Made by Ayush Chaurasia using Weights & Biases


Transformer models and transfer learning methods continue to propel the field of Natural Language Processing forwards at a tremendous pace. However, state-of-the-art performance too often comes at the cost of (a lot of) complex code.

Simple Transformers avoids all the complexity and lets you get down to what matters – model training and experimenting with the Transformer model architecture. It helps you bypass all the complicated setups, boilerplate code, and all the other general unpleasantness by initializing a model in one line, training in the next, and evaluating with the third.

In this report, I build on the Simple Transformers repo and explore some of the most common applications of deep NLP, including tasks from the GLUE benchmark, along with recipes for training SOTA transformer models to perform these tasks. I've used the DistilBERT transformer model for all the tasks, as it is less expensive computationally. I also extensively explore optimizing your DistilBERT hyperparameters with Sweeps.

Simple Transformers comes with native support for tracking model performance using Weights & Biases.
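As a rough sketch of how tracking is typically switched on: the library reads a `wandb_project` key from the `args` dictionary passed to a model. The helper name `with_wandb` and the project name below are made up for illustration.

```python
def with_wandb(args, project):
    """Return a copy of `args` with W&B logging pointed at `project`.

    `wandb_project` is the args key Simple Transformers looks for;
    the helper itself is only an illustrative convenience.
    """
    tracked = dict(args)  # copy so the caller's dict is left untouched
    tracked["wandb_project"] = project
    return tracked


base_args = {"num_train_epochs": 1, "learning_rate": 4e-5}
# "simpletransformers-demo" is a hypothetical project name.
tracked_args = with_wandb(base_args, "simpletransformers-demo")
print(tracked_args)
```

The resulting dictionary can then be passed as `args=` when constructing any of the models used in this report.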

Full code walkthrough on Colab →

Language Modeling


ELECTRA is a new method for self-supervised language representation learning. It can be used to pre-train transformer networks using relatively little compute. ELECTRA models are trained to distinguish "real" input tokens vs "fake" input tokens generated by another neural network, similar to the discriminator of a GAN. At small scale, ELECTRA achieves strong results even when trained on a single GPU. At large scale, ELECTRA achieves state-of-the-art results on the SQuAD 2.0 dataset.


The major advantage of the ELECTRA training process is that it not only makes pre-training feasible on a single GPU, but also yields more accurate models than traditional pre-training methods for the same amount of compute.
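To make the "real" vs "fake" idea concrete, here is a minimal, library-free sketch of the labels the ELECTRA discriminator is trained to predict. The function name and the toy sentence are illustrative assumptions; in real training the corrupted tokens come from a small generator network, not from a hand-written list.

```python
def replaced_token_labels(original_tokens, corrupted_tokens):
    """Label each position 1 if the token was replaced ("fake"), else 0 ("real")."""
    if len(original_tokens) != len(corrupted_tokens):
        raise ValueError("sequences must have the same length")
    return [int(o != c) for o, c in zip(original_tokens, corrupted_tokens)]


original = ["the", "chef", "cooked", "the", "meal"]
# Pretend a generator filled the masked position with "ate" instead of "cooked".
corrupted = ["the", "chef", "ate", "the", "meal"]

labels = replaced_token_labels(original, corrupted)
print(labels)  # [0, 0, 1, 0, 0]
```

Because every position gets a label, the discriminator learns from all tokens in the input rather than only from the masked ones.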



MultiLabel Classification


To demonstrate multilabel classification, we will use the Jigsaw Toxic Comments dataset from Kaggle. Simple Transformers requires a column labels, which contains multi-hot encoded lists of labels, as well as a column text, which contains the text of each example.
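The following sketch shows one way to build those two columns from Jigsaw-style rows. It assumes the Kaggle CSV column names (`comment_text` plus six 0/1 label columns); the sample comments are invented, and plain dictionaries are used so the snippet runs without pandas.

```python
# The six label columns of the Jigsaw Toxic Comments CSV.
LABEL_COLUMNS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def to_text_and_labels(rows):
    """Convert Jigsaw-style rows into dicts with `text` and multi-hot `labels`."""
    return [
        {"text": row["comment_text"], "labels": [int(row[col]) for col in LABEL_COLUMNS]}
        for row in rows
    ]


# Invented sample rows, standing in for rows read from the CSV file.
rows = [
    {"comment_text": "Thanks for the helpful edit.", "toxic": 0, "severe_toxic": 0,
     "obscene": 0, "threat": 0, "insult": 0, "identity_hate": 0},
    {"comment_text": "You are a useless idiot.", "toxic": 1, "severe_toxic": 0,
     "obscene": 0, "threat": 0, "insult": 1, "identity_hate": 0},
]

prepared = to_text_and_labels(rows)
print(prepared[1]["labels"])  # [1, 0, 0, 0, 1, 0]
```

Passing `prepared` to `pandas.DataFrame` gives a frame with the `text` and `labels` columns described above.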

from simpletransformers.classification import MultiLabelClassificationModel

model = MultiLabelClassificationModel(
    'distilbert', 'distilbert-base-uncased', num_labels=6,
    args={'train_batch_size': 2, 'gradient_accumulation_steps': 16, 'learning_rate': 3e-5,
          'num_train_epochs': 3, 'max_seq_length': 512},
)

This creates a MultiLabelClassificationModel that can be used for training, evaluating, and predicting on multilabel classification tasks. The first parameter is the model_type, the second is the model_name, and the third is the number of labels in the data.
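A small sketch of what these arguments imply and how the model might then be used. With gradient accumulation, the optimizer steps once every `gradient_accumulation_steps` batches, so the effective batch size is the product of the two settings. The helper names are illustrative, and `train_and_evaluate` assumes `simpletransformers` is installed and that `train_df` and `eval_df` are DataFrames with `text` and `labels` columns.

```python
args = {
    "train_batch_size": 2,
    "gradient_accumulation_steps": 16,
    "learning_rate": 3e-5,
    "num_train_epochs": 3,
    "max_seq_length": 512,
}


def effective_batch_size(args):
    """Number of examples contributing to each optimizer step."""
    return args["train_batch_size"] * args.get("gradient_accumulation_steps", 1)


print(effective_batch_size(args))  # 32


def train_and_evaluate(train_df, eval_df):
    """Train and evaluate; the import is deferred so the rest runs without the library."""
    from simpletransformers.classification import MultiLabelClassificationModel

    model = MultiLabelClassificationModel(
        "distilbert", "distilbert-base-uncased", num_labels=6, args=args
    )
    model.train_model(train_df)
    result, model_outputs, wrong_predictions = model.eval_model(eval_df)
    return model, result
```

A small per-step batch combined with accumulation keeps memory use low at a sequence length of 512 while still averaging gradients over 32 examples.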


Named Entity Recognition



Named-entity recognition (NER) (also known as entity identification, entity chunking and entity extraction) is a sub-task of information extraction that seeks to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc. To demonstrate Named Entity Recognition, we’ll be using the CoNLL Dataset.
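As an illustration of the data involved, the sketch below parses CoNLL-style text (one token per line, blank lines between sentences, the entity tag in the last column) into `(sentence_id, word, label)` rows. Simple Transformers expects NER training data with `sentence_id`, `words`, and `labels` columns; the parser name and the sample text are assumptions made for this example.

```python
def parse_conll(text):
    """Turn CoNLL-style text into (sentence_id, word, label) rows."""
    rows = []
    sentence_id = 0
    seen_token = False  # True once the current sentence has at least one token
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            # A blank line (or document marker) closes the current sentence.
            if seen_token:
                sentence_id += 1
                seen_token = False
            continue
        fields = line.split()
        rows.append((sentence_id, fields[0], fields[-1]))  # word first, tag last
        seen_token = True
    return rows


sample = """
-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""

rows = parse_conll(sample)
print(rows[0])   # (0, 'EU', 'B-ORG')
print(rows[-1])  # (1, 'Blackburn', 'I-PER')
```

Building a DataFrame from `rows` with the column names `sentence_id`, `words`, and `labels` yields the layout the library expects.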


We'll create a NERModel that can be used for training, evaluation, and prediction in NER tasks. The NERModel object takes in the following parameters:

We use the following default args for the simpletransformers NERModel:

    "output_dir": "outputs/", "cache_dir": "cache_dir/", "fp16": True,
    "fp16_opt_level": "O1", "max_seq_length": 128, "train_batch_size": 8,
    "gradient_accumulation_steps": 1, "eval_batch_size": 8, "num_train_epochs": 1,
    "weight_decay": 0, "learning_rate": 4e-5, "adam_epsilon": 1e-8,
    "warmup_ratio": 0.06, "warmup_steps": 0, "max_grad_norm": 1.0,
    "logging_steps": 50, "save_steps": 2000, "overwrite_output_dir": False,
    "reprocess_input_data": False, "evaluate_during_training": False,
    "process_count": cpu_count() - 2 if cpu_count() > 2 else 1,
    "n_gpu": 1,
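Any of these defaults can be overridden by passing an `args` dictionary when the model is created. The sketch below mimics that merge on a subset of the defaults so the effect is visible without the library; `merged_args`, `default_process_count`, and `build_ner_model` are illustrative names, and the `distilbert-base-cased` checkpoint is an assumption rather than the one used in the report.

```python
from multiprocessing import cpu_count

# A subset of the defaults listed above, enough to show the merge.
DEFAULTS_SUBSET = {
    "max_seq_length": 128,
    "train_batch_size": 8,
    "num_train_epochs": 1,
    "learning_rate": 4e-5,
    "overwrite_output_dir": False,
}


def default_process_count():
    """Leave two cores free when possible, as in the defaults above."""
    n = cpu_count()
    return n - 2 if n > 2 else 1


def merged_args(overrides):
    """Defaults updated with user overrides (overrides win on conflicts)."""
    merged = dict(DEFAULTS_SUBSET)
    merged.update(overrides)
    return merged


def build_ner_model(labels, overrides):
    """Create the model; the import is deferred so the rest runs without the library."""
    from simpletransformers.ner import NERModel

    return NERModel("distilbert", "distilbert-base-cased", labels=labels, args=overrides)


print(merged_args({"num_train_epochs": 3, "overwrite_output_dir": True}))
```

Only the keys you pass are changed; everything else keeps its default value.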


Question Answering


Question answering (QA) is a computer science discipline in the field of information retrieval and natural language processing (NLP), which is concerned with building systems that automatically answer questions posed by humans in a natural language.


We'll use the Stanford Question Answering Dataset (SQuAD 2.0) for training and evaluating our model. SQuAD is a reading comprehension dataset and a standard benchmark for QA models. The dataset is publicly available, and a task derived from SQuAD (QNLI) is part of the GLUE benchmark.

The dataset consists of multiple dictionaries. Each such dictionary contains two attributes –

Questions and answers are represented as dictionaries. Each dictionary in qas has the following components.

A single answer is represented by a dictionary with the following attributes.
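The nesting is easier to see in a small example. This sketch assumes the SQuAD-style field names (`context`, `qas`, `id`, `question`, `is_impossible`, `answers`, `text`, `answer_start`); the passage, questions, and the `answers_are_aligned` helper are invented for illustration.

```python
context = "Simple Transformers is a wrapper around the Hugging Face Transformers library."
answer_text = "Hugging Face Transformers"

train_data = [
    {
        "context": context,  # the passage the questions refer to
        "qas": [             # the questions and answers for this passage
            {
                "id": "00001",
                "question": "Which library does Simple Transformers wrap?",
                "is_impossible": False,
                "answers": [
                    # answer_start is the character offset of the answer in the context
                    {"text": answer_text, "answer_start": context.index(answer_text)}
                ],
            },
            {
                "id": "00002",
                "question": "In which year was the library released?",
                "is_impossible": True,  # SQuAD 2.0 includes unanswerable questions
                "answers": [],
            },
        ],
    }
]


def answers_are_aligned(data):
    """Check that every answer text really sits at its stated offset."""
    for entry in data:
        for qa in entry["qas"]:
            for ans in qa["answers"]:
                start = ans["answer_start"]
                if entry["context"][start:start + len(ans["text"])] != ans["text"]:
                    return False
    return True


print(answers_are_aligned(train_data))  # True
```

A check like `answers_are_aligned` is a cheap way to catch offset errors before training, since a misaligned `answer_start` silently corrupts the training targets.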

The Question Answering Model

Next, we'll create a QuestionAnsweringModel object and set the hyperparameters for fine-tuning the model. Just as before, the first parameter is the model_type and the second is the model_name.
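A hedged sketch of that step follows. The hyperparameter values are illustrative rather than the ones used in this report, the helper names are made up, and `train_and_answer` assumes `simpletransformers` is installed and that `train_data` is in the SQuAD-style format.

```python
# Illustrative values only; tune them for your own run.
qa_args = {
    "learning_rate": 3e-5,
    "num_train_epochs": 2,
    "max_seq_length": 384,
    "doc_stride": 128,
    "overwrite_output_dir": True,
}


def make_prediction_input(context, questions):
    """Wrap a passage and its questions in the context/qas layout used for prediction."""
    return [
        {
            "context": context,
            "qas": [{"id": str(i), "question": q} for i, q in enumerate(questions)],
        }
    ]


def train_and_answer(train_data, context, questions):
    """Fine-tune and predict; the import is deferred so the rest runs without the library."""
    from simpletransformers.question_answering import QuestionAnsweringModel

    model = QuestionAnsweringModel("distilbert", "distilbert-base-uncased", args=qa_args)
    model.train_model(train_data)
    return model.predict(make_prediction_input(context, questions))


to_predict = make_prediction_input(
    "DistilBERT is a smaller, faster version of BERT.",
    ["What is DistilBERT a smaller version of?"],
)
print(to_predict[0]["qas"][0])
```

Prediction inputs reuse the same `context`/`qas` layout as the training data, just without the answers.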



In this report, we've trained and visualized models to perform some of the most important deep NLP tasks using Simple Transformers, a high-level wrapper around the Hugging Face Transformers library. Simple Transformers combines the accessible transformer models provided by Hugging Face with its own powerful training scripts, which makes training a SOTA model a piece of cake.