Exploring Ways to Use Pretrained Transformers in biome.text

In this report, we explore how Hugging Face transformers can be used in the biome.text NLP library. We compare using a transformer as an embedding layer, using the <CLS> token directly, and using a scalar mix of all layers on a text classification task. We also show the results of a quick HPO done with biome.text, in which we optimize a few training parameters.
Ignacio Talavera Cepeda

Introduction

In recent years we have experienced a shift towards transfer learning as the standard approach to solving NLP problems. Before, models were usually trained entirely from scratch, utilizing at most pretrained word embeddings; nowadays it is very common to start with a large pretrained language model as the backbone of a system and to set a task-specific head on top of it. This new paradigm has made it easier to find state-of-the-art architectures for a great variety of NLP tasks.

Almost all current language models are based on the transformer architecture. The awesome Hugging Face Transformers library provides access to hundreds of such pretrained language models, including state-of-the-art models such as the famous BERT, as well as community-driven models that often cover a specific language or resource requirement.

biome.text is a practical and easy-to-use NLP Python library built on top of the impressive AllenNLP platform. It makes the introduction of Hugging Face transformers into your custom NLP model very easy and allows for an agile workflow, even with state-of-the-art technologies. In this report we explore a few ways to use transformers in biome.text to solve a text classification task, in which we try to predict the category of papers in the arXiv Dataset.

This report goes along with our transformer tutorial and can be seen as supplementary material to it.

External Links

If this is the first time you have heard about "Transformers" not referring to giant robots, here is a small list of resources you might want to have a look at first:

Comparing Different Usages

In this section we compare different approaches to introducing pretrained transformers into your custom NLP model.

Using Transformers as an Embedding Layer

biome.text makes it easy to combine different text features, such as word or character embeddings. In the same way, we can treat the output of the pretrained transformer as contextualized word embeddings. However, this requires a small trick behind the curtain: since pretrained transformers usually work with word pieces, we need to add up the embeddings of the word pieces to obtain an encoding at the word level. This also means that we do not make use of any special tokens that are introduced during the pretraining of the transformer.
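
To illustrate this trick outside of biome.text, here is a minimal sketch of how word-piece embeddings can be summed into word-level embeddings. It uses the Hugging Face fast tokenizer's word IDs for the word boundaries (biome.text itself relies on a spaCy tokenizer for this), so take it as an illustration of the idea rather than the library's actual implementation:

import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative sketch only; this is not biome.text's internal code.
tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
model = AutoModel.from_pretrained("distilroberta-base")

text = "Pretrained transformers as contextualized word embeddings"
# special tokens are omitted here to keep the word-piece -> word mapping simple
encoding = tokenizer(text, return_tensors="pt", add_special_tokens=False)
with torch.no_grad():
    # shape: (num_word_pieces, hidden_size)
    piece_embeddings = model(**encoding).last_hidden_state[0]

# word_ids() maps each word piece to the index of the word it belongs to
word_ids = encoding.word_ids()
num_words = max(word_ids) + 1
word_embeddings = torch.zeros(num_words, piece_embeddings.size(-1))
for piece_idx, word_idx in enumerate(word_ids):
    # sum the embeddings of all word pieces belonging to the same word
    word_embeddings[word_idx] += piece_embeddings[piece_idx]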

In our first experiment we combine this approach with a GRU pooler that takes the transformer's contextualized word embeddings as input (corresponding to tokens obtained with a spaCy tokenizer), followed by a linear layer that matches the number of labels in our classification task. In biome.text, this architecture is expressed by the following configuration:

conf1 = {
    "name": "arxiv_categories_classification",
    # word-level tokenization with spaCy's English tokenizer
    "tokenizer": {"lang": "en"},
    "features": {
        "transformers": {
            "model_name": "distilroberta-base",  # any model from the Hugging Face hub
            "trainable": True,  # fine-tune the transformer weights during training
            "max_length": 512,
        }
    },
    "head": {
        "type": "TextClassification",
        "labels": categories_list,
        # GRU pooler on top of the word-level embeddings
        "pooler": {
            "type": "gru",
            "num_layers": 1,
            "hidden_size": 128,
            "bidirectional": True,
        },
    },
}

We roughly optimize the learning rate by hand, choosing values around the learning rate of 5e-5 commonly recommended for transformers: [1e-4, 5e-5, 2e-5, 1e-5, 5e-6]
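
In code, such a manual sweep simply means training the same pipeline once per candidate value. The sketch below assumes biome.text's Pipeline.from_config and TrainerConfiguration APIs roughly as shown in the library's tutorials, plus hypothetical train_ds and valid_ds datasets; treat it as an outline rather than a copy-paste recipe:

from biome.text import Pipeline, TrainerConfiguration

# candidate values around the commonly recommended 5e-5
learning_rates = [1e-4, 5e-5, 2e-5, 1e-5, 5e-6]

for lr in learning_rates:
    pipeline = Pipeline.from_config(conf1)
    trainer = TrainerConfiguration(
        optimizer={"type": "adamw", "lr": lr},
        batch_size=8,   # hypothetical value
        num_epochs=3,   # hypothetical value
    )
    pipeline.train(
        output=f"output_lr_{lr}",
        training=train_ds,    # hypothetical training Dataset
        validation=valid_ds,  # hypothetical validation Dataset
        trainer=trainer,
    )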


Using the Special <CLS> Token

The default approach when using pretrained transformers for classification tasks is to use the special <CLS> token. This token is artificially introduced at the beginning of each input and is pretrained to represent the class of the input. The advantage of this approach is that we can use a pretrained pooler and only need to add a linear layer that matches the number of labels. The disadvantage is that we cannot easily throw other text features into the mix.

This approach is expressed by the following configuration:

conf2 = {
    "name": "arxiv_categories_classification",
    "features": {
        "transformers": {
            "model_name": "distilroberta-base",
            "trainable": True,
            "max_length": 512,
        }
    },
    "head": {
        "type": "TextClassification",
        "labels": categories_list,
        "pooler": {
            "type": "bert_pooler",
            "pretrained_model": "distilroberta-base",
            "requires_grad": True,
            "dropout": 0.1,
        },
    },
}

Note that if you only specify the "transformers" feature and omit the "tokenizer" key, biome.text will automatically use the transformer's word-piece embeddings as input for the head.

Again, we roughly optimize the learning rate for this approach by hand.


Using a Scalar Mix of the <CLS> Token

As a last experiment, we use not only the last transformer layer, but a scalar mix of the <CLS> token across all the transformer layers. With respect to the second configuration, we only have to set last_layer_only to False:

conf2_last_layer_false = {
    "name": "arxiv_categories_classification",
    "features": {
        "transformers": {
            "model_name": "distilroberta-base",
            "trainable": True,
            "max_length": 512,
            "last_layer_only": False,
        }
    },
    "head": {
        "type": "TextClassification",
        "labels": categories_list,
        "pooler": {
            "type": "bert_pooler",
            "pretrained_model": "distilroberta-base",
            "requires_grad": True,
            "dropout": 0.1,
        },
    },
}
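
For intuition, a scalar mix is simply a weighted sum of the per-layer representations, where the weights (and a global scaling factor) are learned during fine-tuning. Below is a minimal PyTorch sketch of the idea; it is not AllenNLP's exact ScalarMix implementation, just a stripped-down version of what setting last_layer_only to False enables:

import torch
from torch import nn

class SimpleScalarMix(nn.Module):
    """Learnable weighted sum over the outputs of all transformer layers."""

    def __init__(self, num_layers: int):
        super().__init__()
        self.scalars = nn.Parameter(torch.zeros(num_layers))  # one weight per layer
        self.gamma = nn.Parameter(torch.ones(1))  # global scaling factor

    def forward(self, layer_outputs: list) -> torch.Tensor:
        # layer_outputs: list of tensors, each of shape (batch, seq_len, hidden_size)
        weights = torch.softmax(self.scalars, dim=0)
        mixed = sum(w * layer for w, layer in zip(weights, layer_outputs))
        return self.gamma * mixed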

Comparison and Conclusion

When comparing all three approaches, we see that the first approach yields significantly worse results than the second and third. It seems that adding up word pieces and introducing an additional GRU pooler does not match the performance of operating at the word-piece level and taking advantage of the pretrained <CLS> token. The usage of a scalar mix, however, seems to be a viable option, even though it does not lead to an improvement and seems to converge more slowly.


In summary, for classification tasks you should take advantage of the pretrained <CLS> token and avoid summing up the word pieces to match word tokens. However, summing up word pieces in order to combine the transformers feature with other text features is still a path we have to explore in future work.

Hyperparameter Optimization With Pretrained Transformers

biome.text uses the amazing Ray Tune library to perform efficient hyperparameter optimizations (HPO) in a few lines of code. For a detailed walkthrough of this feature, we refer the reader to our dedicated tutorial.

Hyperparameter optimization with large pretrained transformers is computationally expensive, even with dedicated hardware such as GPUs or TPUs. For our little experiment, we therefore select a subset of our training data to be able to visit more configurations of our search space in a reasonable amount of time, while still providing reliable results. The best configuration will then be used to train the model on our entire training set.
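
As a rough sketch of this subsampling step: assuming the data is available as a Hugging Face datasets.Dataset (biome.text's own Dataset class is a thin wrapper around it), a random subset could be drawn like this; the file name and the subset size of 1000 examples are hypothetical choices:

from datasets import load_dataset

# hypothetical loading step; the arXiv data could come from any local JSON/CSV file
full_train = load_dataset("json", data_files="arxiv_train.json", split="train")

# shuffle once with a fixed seed and keep a small subset for the HPO runs
hpo_train = full_train.shuffle(seed=42).select(range(1000))

A subset of roughly this size would be consistent with the 167 steps per epoch at batch size 6 that appear in the trainer configuration below.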

Configuring a Random Search

In this experiment we use random search to optimize a few of our training parameters. The hyperparameters we are about to tune are the learning rate, the weight decay, the number of warmup steps, and the random seed.

Following Ray Tune's search space API, our random search is expressed by the following trainer configuration in biome.text:

trainer_config = {
    "optimizer": {
        "type": "adamw",
        "lr": tune.loguniform(1e-5, 1e-4),  # learning rate, sampled on a log scale
        "weight_decay": tune.loguniform(1e-3, 0.1),  # weight decay, sampled on a log scale
    },
    "learning_rate_scheduler": {
        "type": "linear_with_warmup",
        "num_epochs": 5,
        "num_steps_per_epoch": 167,
        "warmup_steps": tune.choice(list(range(101))),  # between 0 and 100 warmup steps
    },
    "batch_size": 6,
    "num_epochs": 5,
    "random_seed": tune.choice(list(range(100))),  # random seed between 0 and 99
}
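
To actually launch the random search, biome.text wires a pipeline configuration together with this trainer configuration into Ray Tune. The sketch below assumes a TuneExperiment helper along the lines of the one used in our HPO tutorial; the exact class name and arguments may differ between biome.text versions, so treat it as an outline only:

from ray import tune
from biome.text.hpo import TuneExperiment  # assumed location of the helper

random_search = TuneExperiment(
    pipeline_config=conf2,          # the <CLS>-based configuration from above
    trainer_config=trainer_config,  # the search space defined above
    train_dataset=hpo_train_ds,     # hypothetical HPO training subset
    valid_dataset=hpo_valid_ds,     # hypothetical validation set
    num_samples=30,                 # number of trials, as in the results below
)

analysis = tune.run(random_search, metric="validation_loss", mode="min")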

Random search is one of the most straightforward, yet powerful types of hyperparameter search. Even though biome.text and Ray Tune allow for a wide variety of HPO algorithms, for this tutorial we stick to random search and leave more sophisticated algorithms for future work.

Note that the search spaces of the learning rate and the weight decay are given on a logarithmic scale. This will be important when interpreting the results below.
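
Concretely, a log-uniform search space means that every order of magnitude is sampled with equal probability, rather than favoring the upper end of the range. The following NumPy snippet (independent of Ray Tune's internals) shows what sampling tune.loguniform(1e-5, 1e-4) roughly corresponds to:

import numpy as np

rng = np.random.default_rng(0)

# uniform in log10-space between 1e-5 and 1e-4, i.e. between 10**-5 and 10**-4
samples = 10 ** rng.uniform(np.log10(1e-5), np.log10(1e-4), size=5)
print(samples)  # five learning-rate candidates spread evenly across the decade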

HPO Results

In the diagram below you can see the results of 30 trials. The color scale represents the validation loss; that is, well-performing trials will be bluish, while badly performing trials will lean towards yellow.

[Chart: hyperparameter values of the 30 trials, colored by validation loss]

This type of graph is very useful to analyze the importance of the different hyperparameters and to discover tendencies and ranges that seem to work best. We omit the random_seed parameter in the diagram since we do not expect to see any tendencies with respect to it.

We can see that the best performing learning rates cluster around 5e-5, which seems to confirm the well-chosen default value in the Hugging Face Transformers library. We can also see that too much weight decay seems to hurt performance, as do too few warmup steps. The best configuration obtained by the random search is the following:

For a showcase of the final model trained on the entire training dataset, as well as its evaluation, please check out our transformer tutorial.