The Next Generation of Transformers: Leaving BERT Behind With DeBERTa
This article gives an intro to DeBERTa and its improvements over BERT and RoBERTa, looking at the attention mechanism and the mask decoder module of DeBERTa.
DeBERTa is shaping up to be the next generation of the BERT-style self-attention transformer models. For the first time, a single model surpassed the human baseline on the SuperGLUE natural language understanding benchmark and topped its leaderboard.
In this post, we will deep dive into DeBERTa and answer some questions like:
- What actually is DeBERTa?
- How is it different from BERT and RoBERTa?
- What is disentangled attention?
- Where would you use DeBERTa, and what does its future look like?
If you'd like to do your own experiments on DeBERTa or, really, any other machine learning project or research, Weights & Biases is free to try and easy to get up and running on.
Table of Contents
What Is DeBERTa?
The Self-Attention in BERT and RoBERTa
DeBERTa: Decoding-Enhanced BERT With Disentangled Attention
When You Would Use DeBERTa
DeBERTa vs RoBERTa (or DeBERTa At Work)
What's Next?
Summary
Let's dive in!
What Is DeBERTa?
Before we actually dive deeper into disentangled attention, let's glance quickly over how things were done in the era of BERT and RoBERTa.
The Self-Attention in BERT and RoBERTa
The paper "Attention Is All You Need" of course kicked off the transformer revolution, but the general flow for models like these is something:
- Tokenization: A tokenizer breaks unstructured data and natural language text into chunks of information that can be considered discrete elements. Those tokens are frequently words, characters, or parts of a word. So for "washing," you could have the entire word, each letter, or "wash" and "ing." All of these are examples of tokens.
- Since in transformers all scores are computed in parallel, it does not have any context of the positions of each word/token. To solve this, position embeddings are added to the word embeddings before sending them to the self-attention layers. Why? Because word order and position can deeply affect a word or token's meaning. ("I want a burrito so bad" vs. "This burrito is so bad," for example.)
- The final input comprises the element-wise addition of the context embeddings as well as the position embeddings and is sent to the self-attention block.
- The inputs are transformed into the Query, Key, and Value vectors via linear transformations by their corresponding trainable matrices $W_q$, $W_k$, and $W_v$, and the attention scores are calculated as follows:
$$
Q = H W_q, \qquad K = H W_k, \qquad V = H W_v
$$
$$
A = \frac{Q K^{\top}}{\sqrt{d}}, \qquad H_o = \mathrm{softmax}(A)\, V
$$
where $H$ is the input (context + position embeddings), $W_q$, $W_k$, and $W_v$ are the transformation matrices for the Query, Key, and Value respectively, $d$ is the number of dimensions in the output hidden state, and $H_o$ is the output of the attention layer.
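To make the flow above concrete, here is a minimal, single-head sketch of that computation in PyTorch. The shapes and tensor names are illustrative assumptions, not taken from any particular implementation:

```python
# A minimal, single-head sketch of the self-attention described above.
import math
import torch

d_model, seq_len = 64, 8
H = torch.randn(seq_len, d_model)          # input: content + position embeddings

W_q = torch.randn(d_model, d_model)        # trainable projection matrices
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

Q, K, V = H @ W_q, H @ W_k, H @ W_v        # linear transformations
A = Q @ K.T / math.sqrt(d_model)           # scaled attention scores
H_out = torch.softmax(A, dim=-1) @ V       # attention-weighted sum of values
print(H_out.shape)                         # (seq_len, d_model)
```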
DeBERTa: Decoding-Enhanced BERT With Disentangled Attention
To differentiate and improve upon BERT and RoBERTa, the authors of DeBERTa introduced two key mechanisms: disentangled attention and an enhanced mask decoder.
Let's look at both of these novel components in more detail and see how they differ from the attention mechanism in BERT and RoBERTa.
Disentangled Attention
The proposed disentangled attention mechanism differs from all existing approaches in that each input word is represented using two separate vectors that encode a word’s content and position, respectively. The key difference is that instead of adding the two vectors we treat them separately throughout the network.
These relative position embeddings are shared across all the transformer layers as inputs, and the self-attention layer not only applies the query and key transformations to the content embeddings but also has separate transformation matrices for these positional embeddings.
A given token $i$ is represented by two main components, $H_i$ and $P_{i|j}$, where $H_i$ is the content vector and $P_{i|j}$ is the relative position vector of $i$ with respect to token $j$. The attention weight of a word pair $(i, j)$ can then be computed as a sum of four attention scores using disentangled matrices on their contents and positions:
$$
A_{i,j} = \{H_i, P_{i|j}\} \times \{H_j, P_{j|i}\}^{\top} = H_i H_j^{\top} + H_i P_{j|i}^{\top} + P_{i|j} H_j^{\top} + P_{i|j} P_{j|i}^{\top}
$$
that is, content-to-content, content-to-position, position-to-content, and position-to-position.
- The content-to-content portion essentially maps the semantic relation between the meanings of the two words i and j.
- The content-to-position portion maps how the content of word i relates to the relative position vectors of the words around it.
- The position-to-content portion maps how the relative position of word i relates to the content of word j.
- Finally, the position-to-position part would map the relation between the position vectors of word i and word j. Since we are using relative position embeddings everywhere, the position-to-position part becomes moot and is hence ignored going forward.
Since content-to-content and content-to-position attention have been covered previously by other researchers, the authors explicitly argue that position-to-content attention is also important, because modeling relative positions is incomplete using content-to-position attention alone.

$$
\tilde{A}_{i,j} = \underbrace{Q^{c}_{i} {K^{c}_{j}}^{\top}}_{\text{content-to-content}} + \underbrace{Q^{c}_{i} {K^{r}_{\delta(i,j)}}^{\top}}_{\text{content-to-position}} + \underbrace{K^{c}_{j} {Q^{r}_{\delta(j,i)}}^{\top}}_{\text{position-to-content}}
$$
Here $Q^{c} = H W_{q,c}$, $K^{c} = H W_{k,c}$ and $Q^{r} = P W_{q,r}$, $K^{r} = P W_{k,r}$ are the Query and Key transformations for the content and relative position input vectors respectively, and $\delta(i,j)$ represents the relative distance between the tokens $i$ and $j$. The attention output is $H_{o} = \mathrm{softmax}\!\left(\tilde{A}/\sqrt{3d}\right) V^{c}$.
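As a rough illustration, here is a simplified sketch of how those three attention terms could be computed. It assumes a single head, a shared relative position table of size 2k, and simplified indexing compared to the official DeBERTa implementation:

```python
# Simplified, single-head sketch of disentangled attention (illustrative only).
import math
import torch

d_model, seq_len, k = 64, 8, 4
H = torch.randn(seq_len, d_model)        # content embeddings
P = torch.randn(2 * k, d_model)          # shared relative position embeddings

# separate projection matrices for content and position (assumption: no bias)
W_qc = torch.randn(d_model, d_model)
W_kc = torch.randn(d_model, d_model)
W_v  = torch.randn(d_model, d_model)
W_qr = torch.randn(d_model, d_model)
W_kr = torch.randn(d_model, d_model)

Qc, Kc, Vc = H @ W_qc, H @ W_kc, H @ W_v
Qr, Kr = P @ W_qr, P @ W_kr

# delta[i, j]: relative distance i - j, shifted and clipped into [0, 2k)
idx = torch.arange(seq_len)
delta = torch.clamp(idx[:, None] - idx[None, :] + k, 0, 2 * k - 1)

c2c = Qc @ Kc.T                                 # content-to-content
c2p = torch.gather(Qc @ Kr.T, 1, delta)         # content-to-position
p2c = torch.gather(Kc @ Qr.T, 1, delta).T       # position-to-content

A = (c2c + c2p + p2c) / math.sqrt(3 * d_model)  # scaled by sqrt(3d): three terms
H_out = torch.softmax(A, dim=-1) @ Vc
print(H_out.shape)                              # (seq_len, d_model)
```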
Enhanced Masked Decoder (EMD)
Following in the footsteps of BERT, DeBERTa uses the masked language modeling (MLM) task for its pre-training. Since the model has access only to relative positions and not to absolute positions, ambiguous cases can be a problem. For example, in "a new store opened beside the new mall," the masked words "store" and "mall" have similar relative positions to the word "new," so absolute positions are needed to tell them apart. The absolute positions therefore need to complement the relative position embeddings for the model to fully capture positional information.
BERT ingests absolute positions in the input layer itself. In DeBERTa, to give the model some context of the actual positions, the authors add them right after all the transformer layers but before the softmax layer for masked token prediction. In this way, DeBERTa captures relative positions in all the transformer layers and uses absolute positions only as complementary information when decoding the masked words. Thus, the name enhanced masked decoder.
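Here is a rough, runnable sketch of where that injection happens, using standard PyTorch layers as stand-ins for DeBERTa's disentangled-attention blocks (all shapes and module choices are assumptions for illustration, not the official implementation):

```python
# Conceptual sketch of the enhanced masked decoder's ordering.
import torch
import torch.nn as nn

vocab_size, d_model, max_len, seq_len = 1000, 64, 128, 16

tok_emb = nn.Embedding(vocab_size, d_model)          # content embeddings only
abs_pos_emb = nn.Embedding(max_len, d_model)         # absolute position embeddings
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
mlm_head = nn.Linear(d_model, vocab_size)

ids = torch.randint(0, vocab_size, (1, seq_len))
hidden = encoder(tok_emb(ids))                       # layers see no absolute positions
positions = torch.arange(seq_len).unsqueeze(0)
hidden = hidden + abs_pos_emb(positions)             # absolute positions injected here,
logits = mlm_head(hidden)                            # right before masked-token prediction
print(logits.shape)                                  # (1, seq_len, vocab_size)
```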

When You Would Use DeBERTa
With relative position bias in the mix, the authors chose to truncate the maximum relative distance to $k$. As a result, in each layer any token can attend directly to at most $2(k-1)$ tokens and itself. By further stacking $L$ transformer layers, each token in layer $L$ can then implicitly attend to at most $2(k-1) \cdot L$ tokens and itself. This theoretically enables the DeBERTa Large model ($k=512$, $L=24$) to handle a maximum sequence length of $2 \times 511 \times 24 = 24{,}528$ tokens. This makes DeBERTa a great choice for long sequences as opposed to RoBERTa.
Apart from long sequences, both the base and large DeBERTa models have proven to be better than RoBERTa on various downstream NLP tasks. Here are some results from the paper.


DeBERTa vs RoBERTa (or DeBERTa At Work)
For this section, we're going to take DeBERTa for a spin and compare it with RoBERTa on a text classification task for long sequences.
To see the power of disentangled attention in action, the dataset we're using is mainly composed of text with 250 to 1,284 words per sample, which results in token counts well beyond 512 (the maximum sequence length of RoBERTa and other BERT-like architectures). The dataset is taken from US Consumer Finance Complaints and can be found here (Kaggle datasets deliver again!).
The idea here is to compare the performance of DeBERTa and RoBERTa with basic training and the same hyperparameters. Do the improvements mentioned in the paper translate into better performance for DeBERTa over RoBERTa?
Now, please note: the entire dataset has 555k samples with 18 columns (!), but for this experiment we keep only the text and label columns and the samples with more than 250 words. This leaves 17,142 samples, which we further split into an 80% train set and a 20% validation set.
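For reference, the filtering and split might look something like the sketch below. The column names ("consumer_complaint_narrative", "product") and file name are assumptions about the Kaggle CSV; see the linked notebook for the exact preprocessing.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# assumed column names from the Kaggle US Consumer Finance Complaints CSV
df = pd.read_csv("consumer_complaints.csv")
df = df[["consumer_complaint_narrative", "product"]].dropna()
df = df.rename(columns={"consumer_complaint_narrative": "text", "product": "label"})

# keep only long complaints (> 250 words) to stress the 512-token limit
df = df[df["text"].str.split().str.len() > 250]

# 80/20 train/validation split
train_df, val_df = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df["label"])
print(len(train_df), len(val_df))
```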
Here's our distribution:
We do encounter some class imbalance here, but let's just ignore it for this comparison.
What we do see: for the same hyperparameters and just very basic training, DeBERTa outperforms RoBERTa in terms of both F1 score and accuracy on the validation set.
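As a rough sketch of how such a head-to-head run could be set up with Hugging Face Transformers: the checkpoint names are real, but the helper function, hyperparameters, and the assumption that train_ds and val_ds are 🤗 datasets with "text" and integer-encoded "label" columns are all illustrative, not the exact code from the linked notebook.

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

def finetune(checkpoint, train_ds, val_ds, num_labels):
    """Fine-tune one checkpoint with the same basic hyperparameters."""
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=num_labels)

    def tokenize(batch):
        # 512 keeps the comparison identical; DeBERTa's relative positions
        # would allow longer inputs if memory permits
        return tokenizer(batch["text"], truncation=True,
                         max_length=512, padding="max_length")

    train_ds = train_ds.map(tokenize, batched=True)
    val_ds = val_ds.map(tokenize, batched=True)

    args = TrainingArguments(
        output_dir=f"out-{checkpoint.split('/')[-1]}",
        num_train_epochs=3,
        per_device_train_batch_size=8,
        learning_rate=2e-5,
        report_to="wandb",          # log both runs to Weights & Biases
    )
    # add a compute_metrics fn for F1/accuracy as in the linked notebook
    trainer = Trainer(model=model, args=args,
                      train_dataset=train_ds, eval_dataset=val_ds)
    trainer.train()
    return trainer.evaluate()

# same setup for both models; only the checkpoint name changes
# metrics_roberta = finetune("roberta-base", train_ds, val_ds, num_labels=n_classes)
# metrics_deberta = finetune("microsoft/deberta-base", train_ds, val_ds, num_labels=n_classes)
```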
What's Next?
DeBERTa v2
Like many high-performing franchises, DeBERTa has a sequel. DeBERTa v2 consists of 48 layers with a hidden size of 1,536 and 24 attention heads, resulting in the 1.5B-parameter model used for the SuperGLUE single-model submission, which beat the human baseline for the first time.
The v2 model is part of the original paper, with certain scaling-up enhancements over the base and large models. Some of the changes in v2 are:
- v2 utilizes a SentencePiece tokenizer and has a vocabulary of 128,000 tokens built using the training data.
- nGiE (nGram Induced Input Encoding): The DeBERTa-v2 model uses an additional convolution layer along with the first transformer layer to induce n-gram knowledge of sub-word encodings and their outputs are summed up before feeding to the next Transformer layer. This enables the model to better learn the local dependency of input tokens.
- Sharing of position projection matrices with content projection matrices in all attention layers to reduce parameters
- Log bucket for relative positions: The DeBERTa-v2 model uses a log bucket to encode relative positions, similar to T5 (a rough sketch follows below).
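Here is a hedged sketch of the log-bucket idea referenced in the last item: nearby positions keep their exact offsets, while larger distances are compressed logarithmically into a fixed number of buckets. The constants and edge-case handling in the actual DeBERTa-v2 code may differ.

```python
# Illustrative log-bucket encoding for relative positions (T5-style idea).
import math

def log_bucket(relative_pos, bucket_size=32, max_distance=128):
    sign = 1 if relative_pos > 0 else -1
    mid = bucket_size // 2
    if abs(relative_pos) < mid:
        return relative_pos                      # small distances stay exact
    # larger distances are squeezed logarithmically into the remaining buckets
    log_ratio = math.log(abs(relative_pos) / mid) / math.log(max_distance / mid)
    bucket = mid + int(log_ratio * (mid - 1))
    return sign * min(bucket, bucket_size - 1)

print([log_bucket(d) for d in (1, 8, 16, 40, 100, 500)])
# distances grow fast, bucket indices grow slowly: [1, 8, 16, 22, 29, 31]
```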
DeBERTa v3
Following the ELECTRA-style training paradigm, DeBERTa v3 replaces the masked language modeling (MLM) pre-training task with the more nuanced replaced token detection (RTD) task.
RTD trains the model to distinguish between "real" and "fake" input tokens. Instead of just corrupting the input by replacing tokens with "[MASK]" as in BERT, RTD corrupts the input by replacing some input tokens with incorrect, but somewhat plausible, fakes. Built on top of the DeBERTa v2 enhancements, the DeBERTa v3 base model comes with 12 layers and a hidden size of 768.
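To make RTD concrete, here is a toy illustration of the per-token labels the discriminator is trained on (the example sentence and the swapped token are made up for illustration):

```python
# Toy RTD labels: 1 marks a token replaced by a plausible fake.
original  = ["the", "chef", "cooked", "the", "meal"]
corrupted = ["the", "chef", "ate",    "the", "meal"]   # generator swapped "cooked" -> "ate"
labels = [int(o != c) for o, c in zip(original, corrupted)]
print(labels)   # [0, 0, 1, 0, 0]
```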
Summary
In this post, we went over what DeBERTa has to offer and how the new ideas it presents have given the research a new direction. We saw in depth how the proposed disentangled attention differs from the classic self-attention in BERT and RoBERTa, and we also saw how absolute position embeddings are utilized in the MLM decoder head to further propel the model's performance.
- Check out the annotated DeBERTa paper and many more such annotations at: https://github.com/au1206/paper_annotations
- You can find the entire code in the form of a Kaggle notebook here: https://www.kaggle.com/code/au1206/long-sequence-text-classification/notebook?scriptVersionId=109910854
- You can find more details for each model on this WandB dashboard: https://wandb.ai/akshayuppal12/long_sequence-text_classification?workspace=user-akshayuppal12
Related Reading:
An Introduction to BERT And How To Use It
In this article, we will explore the architecture behind Google's revolutionary BERT model and implement it practically through the Hugging Face framework.
How W&B Helped Graphcore Optimize GroupBERT to Run Faster on IPUs
Learn how W&B helped the team at Graphcore train a new BERT model in 40% less time
How to Fine-Tune BERT for Text Classification
A code-first reader-friendly kickstart to finetuning BERT for text classification, tf.data and tf.Hub