
Exploring Ways To Use Pretrained Transformers in biome.text

In this article, we explore how HuggingFace transformers can be used in the biome.text NLP library, comparing their use as an embedding layer, via the CLS token, and via a scalar mix of all layers.
Created on October 15|Last edited on November 17
In recent years, we have experienced a shift towards transfer learning as the standard approach to solving NLP problems. Before, models were usually trained entirely from scratch, utilizing at most pretrained word embeddings; now it is very common to start with a large pretrained language model as the backbone of a system and to put a task-specific head on top of it. This new paradigm has made it easier to find state-of-the-art architectures for a great variety of NLP tasks.
Almost all current language models are based on the transformer architecture. The awesome HuggingFace Transformers library provides access to hundreds of such pretrained language models, including state-of-the-art models such as the famous BERT, as well as community-driven models that often cover a specific language or resource requirement.
biome.text is a practical and easy-to-use NLP Python library built on top of the impressive AllenNLP platform. It makes the introduction of HuggingFace transformers into your custom NLP model very easy and allows for an agile workflow even with state-of-the-art technologies. In this article, we explore a few ways to use transformers in biome.text by solving a text classification task in which we try to predict the category of papers in the arXiv Dataset.
This article goes along with our transformer tutorial and can be seen as supplementary material to it.
If this is the first time you hear about "Transformers" in a context other than giant robots, you might want to have a look at an introduction to the transformer architecture first.

Comparing Different Usages

In this section, we will compare different approaches to introducing pretrained transformers into your custom NLP model.

Using Transformers as an Embedding Layer

biome.text makes it easy to combine different text features, such as word or character embeddings. In the same way, we can treat the output of the pretrained transformer as contextualized word embeddings. However, this requires a small trick behind the curtain: since pretrained transformers usually work with word pieces, we need to add up the embeddings of the word pieces to obtain an encoding at the word level. This also means that we do not make use of any special tokens that are often introduced in the pretraining of the transformers.
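To make this trick more concrete, here is a minimal sketch in plain PyTorch (not the actual biome.text internals) of how the word-piece vectors of a single word could be summed into one word-level embedding; the example word and shapes are purely illustrative:
import torch

# Illustrative example: one word that the transformer tokenizer splits into three
# word pieces, e.g. "unbelievable" -> ["un", "believ", "able"], each embedded as a
# 768-dimensional vector by the transformer.
word_piece_embeddings = torch.randn(3, 768)

# Summing the word-piece vectors yields a single word-level embedding that lines up
# with the spaCy tokens and could be combined with other word-level features.
word_embedding = word_piece_embeddings.sum(dim=0)
print(word_embedding.shape)  # torch.Size([768])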
In our first experiment, we combine this approach with a GRU pooler that takes as input the contextualized word embeddings of the transformer (corresponding to tokens obtained with a spaCy tokenizer) and ends with a linear layer that matches the number of labels in our classification task. In biome.text this architecture is expressed by the following configuration:
conf1 = {
    "name": "arxiv_categories_classification",
    "tokenizer": {"lang": "en"},
    "features": {
        "transformers": {
            "model_name": "distilroberta-base",
            "trainable": True,
            "max_length": 512,
        }
    },
    "head": {
        "type": "TextClassification",
        "labels": categories_list,
        "pooler": {
            "type": "gru",
            "num_layers": 1,
            "hidden_size": 128,
            "bidirectional": True,
        },
    },
}
We roughly optimize the learning rate by hand, choosing values around the commonly recommended learning rate of 5e-5 for transformers: [1e-4, 5e-5, 2e-5, 1e-5, 5e-6].
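The sweep itself can be run with a simple loop. The sketch below roughly follows the biome.text 2.x training API from its tutorials (Pipeline.from_config, TrainerConfiguration, Pipeline.train); the exact import paths and argument names may differ between versions, and train_ds and valid_ds are placeholders for the biome.text Datasets holding the arXiv data:
from biome.text import Pipeline
from biome.text.configuration import TrainerConfiguration

# train_ds and valid_ds are placeholders for the arXiv training and validation datasets
for lr in [1e-4, 5e-5, 2e-5, 1e-5, 5e-6]:
    pl = Pipeline.from_config(conf1)
    trainer = TrainerConfiguration(
        optimizer={"type": "adamw", "lr": lr},
        batch_size=8,
        num_epochs=3,
    )
    pl.train(
        output=f"output_lr_{lr}",
        training=train_ds,
        validation=valid_ds,
        trainer=trainer,
    )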


[Run set: 5 runs]


Using the Special <CLS> Token

The default way of using pretrained transformers for classification tasks is through the special <CLS> token. This token is artificially introduced at the beginning of each input and is pretrained to represent the class of the input. The advantage of this approach is that we can use a pretrained pooler and only need to add a linear layer that matches the number of labels. The disadvantage is that we cannot easily throw other text features into the mix.
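As a standalone illustration with the Hugging Face transformers library (outside of biome.text), the pooled representation essentially comes from the hidden state of the first special token; note that for RoBERTa-style models such as distilroberta-base this token is written <s> rather than [CLS]:
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
model = AutoModel.from_pretrained("distilroberta-base")

inputs = tokenizer("A paper about graph neural networks", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The hidden state of the first special token is what a "bert_pooler" feeds through
# a pretrained dense + tanh layer before the final linear classification layer.
cls_embedding = outputs.last_hidden_state[:, 0, :]  # shape: (1, 768)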
This approach is expressed by the following configuration:
conf2 = {
    "name": "arxiv_categories_classification",
    "features": {
        "transformers": {
            "model_name": "distilroberta-base",
            "trainable": True,
            "max_length": 512,
        }
    },
    "head": {
        "type": "TextClassification",
        "labels": categories_list,
        "pooler": {
            "type": "bert_pooler",
            "pretrained_model": "distilroberta-base",
            "requires_grad": True,
            "dropout": 0.1,
        },
    },
}
Note that if you only specify the "transformers" feature and omit the "tokenizer" key, biome.text will automatically choose a word piece embedding as input for the head.
Again, we roughly optimize the learning rate for this approach by hand.

[Run set: 3 runs]


Using a Scalar Mix of the <CLS> Token

As a last experiment, we use not only the last transformer layer but a scalar mix of the <CLS> token from all the transformer layers. Compared to the second configuration, we only have to set last_layer_only to False:
conf2_last_layer_false = {
    "name": "arxiv_categories_classification",
    "features": {
        "transformers": {
            "model_name": "distilroberta-base",
            "trainable": True,
            "max_length": 512,
            "last_layer_only": False,
        }
    },
    "head": {
        "type": "TextClassification",
        "labels": categories_list,
        "pooler": {
            "type": "bert_pooler",
            "pretrained_model": "distilroberta-base",
            "requires_grad": True,
            "dropout": 0.1,
        },
    },
}
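Under the hood, a scalar mix is a softmax-weighted sum of the per-layer representations with learnable weights and a global scaling factor, in the spirit of AllenNLP's ScalarMix module. The following minimal sketch (plain PyTorch, not the actual implementation) shows the idea for the <CLS> vectors of distilroberta-base, which exposes seven hidden states (the embedding layer plus six transformer layers):
import torch
import torch.nn as nn

class SimpleScalarMix(nn.Module):
    """Softmax-weighted sum of per-layer vectors with a learnable global scale."""

    def __init__(self, num_layers):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))  # one scalar per layer
        self.gamma = nn.Parameter(torch.ones(1))               # global scaling factor

    def forward(self, layer_outputs):
        normed = torch.softmax(self.weights, dim=0)
        mixed = sum(w * layer for w, layer in zip(normed, layer_outputs))
        return self.gamma * mixed

# Illustrative input: the <CLS> vector from each of the seven hidden states
cls_per_layer = [torch.randn(1, 768) for _ in range(7)]
mix = SimpleScalarMix(num_layers=7)
print(mix(cls_per_layer).shape)  # torch.Size([1, 768])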

Comparison and Conclusion

When comparing all three approaches, we see that the first approach yields significantly worse results than the second and third. It seems that adding up word pieces and introducing an additional GRU pooler does not match the performance of operating at the word-piece level and taking advantage of the pretrained <CLS> token. The usage of a scalar mix, however, seems to be a viable option, even though it does not lead to an improvement and seems to converge more slowly.



[Run set: 3 runs]

In summary, for classification tasks you should take advantage of the pretrained <CLS> token and avoid summing up the word pieces to match word tokens. However, summing up word pieces in order to combine the transformers feature with other text features is still a path we want to explore in future work.


Hyperparameter Optimization With Pretrained Transformers

biome.text uses the amazing Ray Tune library to perform efficient hyperparameter optimizations (HPO) in a few lines of code. For a detailed walkthrough of this feature, we refer the reader to our dedicated tutorial.
Hyperparameter optimization with large pretrained transformers is computationally expensive, even with dedicated hardware such as GPUs or TPUs. For our little experiment, we therefore select a subset of our training data so that we can visit more configurations of our search space in a reasonable amount of time while still obtaining reliable results. The best configuration will then be used to train the model on our entire training set.
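As an example of how such a subset could be drawn, here is a small sketch using the Hugging Face datasets library; the dataset contents and sizes are purely illustrative stand-ins for the actual arXiv data:
from datasets import Dataset

# Purely illustrative stand-in for the full arXiv training set
full_train = Dataset.from_dict(
    {"text": ["a paper abstract"] * 10_000, "label": ["cs.LG"] * 10_000}
)

# Shuffle with a fixed seed and keep a small subset for the hyperparameter search
hpo_train = full_train.shuffle(seed=42).select(range(2_000))
print(len(hpo_train))  # 2000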
In this experiment, we will use random search to optimize a few of our training parameters. The hyperparameters we are about to tune are the following:
  • The learning rate (even though we found a pretty good one in the experiments above).
  • The weight decay of our adamw optimizer.
  • The warmup steps.
  • The random seed used to initialize the weights of our network.
Following Ray Tune's search space API, our random search configuration is expressed in the following configuration in biome.text:
trainer_config = {
    "optimizer": {
        "type": "adamw",
        "lr": tune.loguniform(1e-5, 1e-4),
        "weight_decay": tune.loguniform(1e-3, 0.1),
    },
    "learning_rate_scheduler": {
        "type": "linear_with_warmup",
        "num_epochs": 5,
        "num_steps_per_epoch": 167,
        "warmup_steps": tune.choice(list(range(101))),
    },
    "batch_size": 6,
    "num_epochs": 5,
    "random_seed": tune.choice(list(range(100))),
}
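biome.text's HPO support wires this search space into Ray Tune for us (see the dedicated tutorial), but conceptually the random search boils down to something like the following sketch, which uses the classic tune.run/tune.report API; train_arxiv is a hypothetical trainable that in the real experiment would build and train the pipeline with the sampled hyperparameters:
from ray import tune

def train_arxiv(config):
    # Hypothetical trainable: in the real experiment this would build the biome.text
    # pipeline, train it with the sampled hyperparameters, and report the validation
    # loss. Here we only sketch the reporting interface with a dummy value.
    dummy_valid_loss = config["optimizer"]["lr"] * 100
    tune.report(valid_loss=dummy_valid_loss)

analysis = tune.run(
    train_arxiv,
    config=trainer_config,  # the search space defined above
    num_samples=30,         # 30 random trials, as in the results below
    metric="valid_loss",
    mode="min",
)
print(analysis.best_config)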
Random search is one of the most straightforward yet powerful types of hyperparameter search. Even though biome.text and Ray Tune allow for a wide variety of HPO algorithms, for this tutorial we will stick to random search only and leave more sophisticated algorithms for future work.
Note that the search spaces of the learning rate and the weight decay are given on a logarithmic scale. This will be important for interpreting the results below.

HPO Results

In the diagram below, you can see the results of 30 trials. The color scale represents the validation loss; that is, well-performing trials appear bluish, while badly-performing trials lean towards yellow.

[Run set: 30 runs]

This type of graph is very useful for analyzing the importance of the different hyperparameters and for discovering tendencies and ranges that seem to work best. We omit the random_seed parameter from the diagram since we do not expect to see any tendencies with respect to it.
We can see that the best-performing learning rates cluster around 5e-5, which seems to confirm the well-chosen default value in the Hugging Face Transformers library. We can also see that too much weight decay seems to hurt the performance, as do too few warmup steps. The best configuration obtained by the random search is the following:
  • Learning rate = 0.0000453
  • Warmup steps = 45
  • Weight decay = 0.003197
For a showcase of the final model trained on the entire training dataset, as well as its evaluation, please check out our transformer tutorial.
