How W&B Helped Graphcore Optimize GroupBERT to Run Faster on IPUs
Learn how W&B helped the team at Graphcore train a new BERT model in 40% less time
Created on April 8|Last edited on April 27
This post was written by the team at Graphcore Research. You can head here to learn more about the company or here to read about some of the research they've been featured in.
Introduction
Over the last few years, Transformer language models have helped usher in a new generation of intelligent virtual digital assistants (VDAs), conversational user interfaces, and automated content generation. Being able to train these models faster on domain-specific datasets is vital for building robust products in industries undergoing massive AI transformation, such as banking, finance, healthcare, and the legal field.
BERT has been at the center of this Transformer revolution. It has established itself as one of the most popular and versatile language models in use today, achieving state-of-the-art results and inspiring myriad adjacent Transformer models like FinBERT, RoBERTa, ALBERT, and the model we’ll be digging into today, GroupBERT.
GroupBERT is a recent BERT-based model pioneered by the team at Graphcore that introduces an enhanced Transformer structure built from efficient grouped matrix multiplications and convolutions, together with low-arithmetic-intensity operations that rely on very fast memory access. Building blocks like these are computationally efficient, and because they can take advantage of high memory bandwidth, they are uniquely well suited to Graphcore’s Intelligence Processing Unit (IPU).
The IPU’s 260 Tb/s of memory bandwidth means GroupBERT can effectively halve the number of parameters and train twice as fast, while retaining the same level of pre-training accuracy.
This substantial improvement means faster development times for IPU users to streamline functions such as translation, summarization, and advanced text analytics.
Below, we’ll explain how Graphcore Research used W&B to accelerate the experimentation process of optimizing GroupBERT. Notably, to achieve much faster training of GroupBERT it was beneficial to combine the model with the LAMB optimizer (You et al., 2020), and Weights & Biases was a central element in making the transition smooth.
Moving from Adam to LAMB

Fig 1: GroupBERT architecture
The original BERT was developed using the Adam optimizer, running at a fairly low global batch size of 500 sentences. The LAMB optimizer adjusts the learning rate per tensor, which allows the global batch size to be increased to roughly 65,000 sentences. A large batch size reduces the number of weight updates, which in turn reduces the time needed to train a neural network. Moreover, large batch sizes create additional flexibility for data parallelism, as well as allowing pipeline parallelism to be used more effectively.
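The per-tensor adaptation at the heart of LAMB can be sketched in a few lines. This is a minimal NumPy illustration of the trust-ratio scaling only, not Graphcore's implementation; `adam_update` stands in for the usual Adam moment-based update for one weight tensor:

```python
import numpy as np

def lamb_scaled_update(weights, adam_update, lr=0.002):
    """Scale a per-tensor update by the LAMB trust ratio.

    The trust ratio ||w|| / ||u|| keeps the effective step size
    proportional to the magnitude of each weight tensor, which is
    what makes very large batch sizes trainable.
    """
    w_norm = np.linalg.norm(weights)
    u_norm = np.linalg.norm(adam_update)
    trust_ratio = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    return weights - lr * trust_ratio * adam_update

w = np.array([3.0, 4.0])   # ||w|| = 5
u = np.array([0.0, 1.0])   # ||u|| = 1
new_w = lamb_scaled_update(w, u, lr=0.1)
# trust ratio = 5, so the step is 0.1 * 5 * u = [0.0, 0.5]
```

Because the ratio is computed separately for every tensor, layers with large weights take proportionally larger steps, which is what lets the global learning rate stay high at large batch sizes.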
Additionally, a major inconvenience of BERT is its unusual experimental setup. Put simply, it’s not as straightforward as most machine learning models: training consists of three distinct optimization stages, namely phase one, phase two, and fine-tuning.
Phases one and two use the same source data but feed it into the underlying encoder at different sequence lengths: short sequences in phase one for most of the training, then a brief period of training at a long sequence length in phase two. This avoids unnecessary computation in the quadratically scaling attention module, but complicates experimentation, as the two pre-training phases use different hyperparameters.
Initially, GroupBERT, like BERT, was developed using Adam. However, large-batch training offers further reductions in time-to-train and additional data parallelism, so we set out to switch GroupBERT from Adam to LAMB. The move was made frictionless by relying heavily on the Weights & Biases dashboard and sweep functionality. Instead of tracking everything in one messy spreadsheet and juggling logs across multiple remote machines, the hyperparameter search was straightforward: experiment tracking was automated and maintained on W&B servers, which helped mitigate the difficulty of managing the switches between the various phases of training.
How we got there using W&B
The first step of the migration was to establish a strong baseline to match. For the two pre-training phases, our key metric was MLM accuracy. The end target was to achieve the same SQuAD v1.1 F1 score of 90.4 for GroupBERT.
GroupBERT applies LayerNorm inside the residual module, before the convolutions and matmuls (pre-norm configuration), whereas BERT applies LayerNorm after the skip connection (post-norm configuration). This change allows GroupBERT to use significantly higher learning rates than the original BERT model. As a starting point, we used the BERT pre-training hyperparameters and set out to increase the peak learning rate:


Fig 2: Fine-tuning the sweeps
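The pre-norm vs. post-norm distinction mentioned above can be sketched with a toy residual block. This is NumPy pseudocode assuming a generic sublayer `f` and a simplified, bias-free `layer_norm`; it is an illustration of norm placement, not the GroupBERT source:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Simplified LayerNorm: normalize to zero mean, unit variance
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def postnorm_block(x, f):
    # BERT: LayerNorm is applied after the skip connection
    return layer_norm(x + f(x))

def prenorm_block(x, f):
    # GroupBERT: LayerNorm sits inside the residual branch, so the
    # skip path stays an identity, which is more stable at high LR
    return x + f(layer_norm(x))

x = np.array([1.0, 2.0, 3.0, 4.0])
sublayer = lambda v: 0.5 * v   # stand-in for a conv / matmul sublayer
out_pre = prenorm_block(x, sublayer)
out_post = postnorm_block(x, sublayer)
```

Note how the pre-norm output preserves the unnormalized skip path, while the post-norm output is forced back through normalization on every block.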
Specifying the above sweep is extremely user-friendly, taking only around 20 lines of YAML config. Sweeping the learning rate for pre-training runs is a simple procedure, mainly because only one hyperparameter needs to be explored, which makes grid search an ideal candidate. Since each pre-training run incurs a high computational cost, the granularity of the sweep is determined by the computational budget allocated for the exercise.
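A grid-search sweep over the peak learning rate can be expressed in a handful of lines. The sketch below shows the equivalent sweep definition as a Python dict (it can also be written as YAML); the script name, metric name, and learning-rate grid are illustrative placeholders, not the exact config Graphcore used:

```python
# Equivalent to a short sweep YAML; pass the dict to wandb.sweep().
sweep_config = {
    "method": "grid",   # one hyperparameter, so grid search is ideal
    "metric": {"name": "mlm_accuracy", "goal": "maximize"},
    "parameters": {
        "learning_rate": {
            "values": [1e-4, 2e-4, 4e-4, 6e-4, 8e-4, 1e-3],
        },
    },
    "program": "pretrain_groupbert.py",   # hypothetical training script
}

# sweep_id = wandb.sweep(sweep_config, project="groupbert-lamb")
# Each machine then joins with:  wandb agent <entity>/<project>/<sweep_id>
```

Each agent that joins the sweep pulls the next untried learning rate from the grid until the space is exhausted.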
What’s especially important and convenient about W&B sweeps is the distributed agent functionality. It allows us to perform training runs on multiple machines, even in different datacentres, at the same time, and automatically syncs all results to the Weights & Biases server. The sweep agents keep launching runs on their dedicated machines until the hyperparameter space is exhausted. This is a very important feature that lets us maintain 100% utilization of on-prem hardware without manually starting and scheduling jobs on multiple machines.
When the sweep runs are finished, all run logs and data are available by default on the W&B dashboard, where they are easy to interpret. The plot shows that the learning rate was thoroughly explored, up to the point of the model diverging. By diving into the sweep table, the user can see all logged metrics and pick the best-performing run.
With minimal changes, the same config also works for sweeping phase two of the GroupBERT optimizer switch: the only changes needed are pointing at the initialization checkpoint of the best phase-one run and specifying a config with a longer sequence length.
Fine-tuning Sweeps
After finding the best pre-training configuration that maximizes the learning rate (while maintaining stable convergence), the last step is to verify robust downstream performance. We used the SQuAD v1.1 dataset for this purpose.
A change of training optimizer necessitates a new hyperparameter search for fine-tuning, as convergence dynamics differ between optimizers. That means the optimal fine-tuning hyperparameters could shift.
What makes fine-tuning sweeps more challenging is the curse of dimensionality: when thoroughly exploring multiple hyperparameters, the number of combinations for grid search grows exponentially, making this simple and effective search algorithm intractable.
Luckily, Weights & Biases features not one but two sweep methods designed for exactly this situation. One is random search, which samples an arbitrarily large space at random. Given enough runs, this method will find near-optimal solutions, but in practice it can be costly and still requires an element of luck.
The second option is running a Bayesian optimization sweep, where successive runs inform the search algorithm about areas of the search space with higher reward, so less time is wasted exploring parts of the hyperparameter space that yield low performance.
We decided to explore learning rates from 1e-5 up to 1e-3, batch sizes between 20 and 100, and training durations between 1 and 4 epochs. We evaluated the hyperparameters discovered by both random search and Bayesian search to compare the quality of the two sweep methods.
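Using the ranges above, the fine-tuning sweep might be specified as follows. This is a hedged sketch with illustrative metric and parameter names; swapping `"bayes"` for `"random"` is the only change needed to run the random-search variant:

```python
# Bayesian fine-tuning sweep over the three ranges quoted in the text.
finetune_sweep = {
    "method": "bayes",   # use "random" for the random-search comparison
    "metric": {"name": "squad_f1", "goal": "maximize"},
    "parameters": {
        # learning rate is best searched on a log scale
        "learning_rate": {
            "distribution": "log_uniform_values",
            "min": 1e-5,
            "max": 1e-3,
        },
        "batch_size": {"min": 20, "max": 100},
        "epochs": {"min": 1, "max": 4},
    },
}

# sweep_id = wandb.sweep(finetune_sweep, project="groupbert-finetune")
```

With `method: bayes`, each finished run's `squad_f1` feeds back into the surrogate model that proposes the next configuration.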

Fig 3: Visualization of Bayesian sweep run data
After an extensive sweep of 500 runs each, both methods identified the same optimal hyperparameters. The sweeps yielded a configuration that exactly matched the original target of a 90.4 F1 score on SQuAD, while reducing the time-to-train from 53 to 31 hours.
Moreover, Bayesian search discovered two sets of equally performant fine-tuning hyperparameters, since it is often possible to scale batch size and learning rate together. More importantly, Bayesian search identified two of the top five runs within its first 65 runs, while random search took 330 runs to find the optimal hyperparameters.
Finally, using another W&B feature that improves the team's workflow, all these results were compiled into a report and shared with the rest of the team. This way, everybody has the opportunity to examine the runs more closely and shape their intuition first-hand by looking at the dashboards and visualized run data.

Conclusion
The Graphcore Research team are definitely W&B power users, but the above shows how they approach optimizing models for the IPU and how W&B makes it easier to get better results. The key areas Graphcore Research highlighted are:
- Easy to use: W&B lets you dedicate more time to experimentation, rather than writing boilerplate code
- Central storage space: no more lost runs and diagrams, everything is stored in one place
- Resource utilization: 100% utilization of on-prem resources is achievable for significant periods of time, even when running many fragmented and short jobs
- Sharing: all data is accessible to the entire team, and W&B Reports help present it in a digestible way
Lastly, if you'd like to run the GroupBERT code, register with spell.ml for free access to IPUs and click the link below:
Up Next
To learn more about GroupBERT, visit the GroupBERT blog or read the original paper. We have made the code available in our Applications Repo so you can try training models for your own application.