An Introduction to BERT And How To Use It
In this article, we will explore the architecture behind Google’s revolutionary BERT model and put it to work in practice with the Hugging Face Transformers library.
Before we dive into our article on BERT, let's first look at what we'll be covering.
Table Of Contents
- Natural Language Processing
- What is BERT?
- How does BERT work?
- What Machine Learning Tasks Is BERT Used For?
- BERT-based Models
- Example: Sentiment Analysis with BERT using Python
- Summary
Natural Language Processing
Natural Language Processing (NLP) is a branch of machine learning that focuses on the interaction of computers with human language to perform specific tasks including language generation, summarization, translation, and sentiment analysis.
The field exploded with the introduction of transformers in the legendary paper, "Attention Is All You Need."

Transformers and Successors, Image by Author
One of its successors, BERT, revolutionized the way we deal with words and linguistic text. The recent surge of state-of-the-art results in NLP is in many ways built on BERT, so understanding the architecture is essential for leveraging these models in our projects!
In this post, we will explore the architecture behind BERT as well as its pre-training tasks, Masked Language Modeling and Next Sentence Prediction. We will look at practical applications of BERT and BERT-based models, and get some hands-on experience working with BERT on the IMDB Movie Reviews dataset. We will then take a quick glance at the recent surge of models and end this article with links to a few important resources.
This article is going to be quite a read as I'm attempting to explain the basics of BERT and illustrate it with code. I recommend grabbing some snacks and a large cup of water.
If you find some parts of this article leave you with questions, please reach out in the comments section. Hold on to your screens, I will see you on the other side!
And, of course, if you want to run the code in this report, follow along via the Google Colab:
What is BERT?
Bidirectional Encoder Representations from Transformers, better known as BERT, is a model introduced in a revolutionary paper by Google that raised the state-of-the-art performance on various NLP tasks and was the stepping stone for many other revolutionary architectures.
And BERT really was revolutionary. To understand why, we need to compare it to the pre-training approaches that came before it: OpenAI’s GPT and Allen AI’s ELMo.
GPT and ELMo were state-of-the-art (SOTA) models with similar pre-training objectives. Their main drawback is that they used unidirectional language models to learn general language representations, which restricts the potential of these models. The authors of the BERT paper put it more precisely: “current techniques restrict the power of the pre-trained representations… the major limitation is that standard language models are unidirectional."
Unidirectional means that the context of a word is determined either by the word preceding it (left context) or by the word following it (right context) but not both. BERT is bidirectional, meaning both the words following it and preceding it determine the context of the word. This additional context is one of the things that makes BERT exceptionally powerful.
BERT was trained on a large dataset (you'll hear BERT called a large language model or LLM quite frequently) and as such has general language representation. All we need to do to create SOTA models of our own is to add one more output layer to BERT that tailors it to our specific task. This is referred to as Transfer Learning and BERT is a unified architecture that can be shared across various NLP tasks.
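To make that concrete, here is a minimal sketch, using the Hugging Face transformers library, of what "adding one more output layer" looks like in practice: we load the pre-trained bert-base-uncased weights and attach a fresh two-label classification head (the sentence and the two-label setup are just placeholders for illustration).

from transformers import BertTokenizer, BertForSequenceClassification

# Load the pre-trained BERT body and attach a new, randomly initialised
# classification head with two labels. Only this head starts from scratch;
# the rest of the weights carry over BERT's general language representation.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("BERT makes transfer learning straightforward.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([1, 2]) -- one score per label

Fine-tuning then updates these weights (or just the new head) on labeled data for our specific task.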
How does BERT work?
In this section, we will get in-depth with the model architecture and learn how BERT works. For us to understand the BERT architecture, we need to first understand Transformers and their Attention mechanism.
Transformers
As noted above, transformers were introduced in the legendary 2017 paper, "Attention Is All You Need." The paper moved away from Recurrent Neural Network (RNN) models and introduced a new architecture, the Transformer. You can read DaleOnAI’s blog post for a good introduction to Transformers and the three takeaways below.
More in-depth resources are linked in the recommended reading section.
There are three main takeaways:
- Positional Encoding: Word-order information is moved from the neural network into the data itself. For example, “Optimus is on a walk” is positionally encoded as [("Optimus",1),("is",2),("on",3),("a",4),("walk",5)]. Note: the authors used sinusoidal functions, not plain integers, to produce these encodings.
- Attention: Attention is the mechanism that chooses the “word” the model should be "attending to." I like to visualize the attention mechanism as the glasses which focus on the important word in the current context.
- Self-Attention: Allows input words to essentially interact with each other and determine which word they should pay more “attention” to.
Another important feature of the original Transformer is that it stacks six encoder layers and six decoder layers. Encoders, as the name suggests, encode the input, and decoders decode that encoded representation into the output. A minimal sketch of the self-attention mechanism is shown below.
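To build intuition for self-attention, here is an illustrative, simplified sketch of single-head scaled dot-product self-attention in PyTorch. It is a toy version (one head, random weights, no masking or learned modules), not the exact implementation from the paper.

import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # x has shape (seq_len, d_model); single head, no masking
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / (k.shape[-1] ** 0.5)   # how strongly each token attends to every other token
    weights = F.softmax(scores, dim=-1)        # attention weights for each token sum to 1
    return weights @ v                         # weighted mix of the value vectors

d_model = 8
x = torch.randn(5, d_model)                    # 5 token embeddings, e.g. "Optimus is on a walk"
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([5, 8])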
BERT
BERT is built entirely on Transformer’s encoders. Take a look at the BERT architecture as illustrated in the paper:

BERT Architecture, Image by Authors of BERT paper
We will look at the BERT framework for training and the input/output representation in detail in the upcoming sections; here is a brief overview.
There are two steps to the BERT framework:
- Pre-training: The model is trained on unlabelled data over two unsupervised pre-training tasks (Masked Language Modeling and Next Sentence Prediction)
- Fine-tuning: The BERT model is initialized with pre-trained parameters and these parameters are fine-tuned using labeled data from downstream tasks like Text Similarity, Question Answer pairs, Classification, etc.
What made BERT unique was that it offered a "unified architecture across different tasks."
The authors trained two BERT models, BERT_BASE and BERT_LARGE. BERT_BASE (12 layers, roughly 110M parameters) is of a similar size to OpenAI’s GPT for comparison purposes. BERT_LARGE (24 layers, roughly 340M parameters) is a monstrous model which pushed the state of the art even further across different tasks.
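If you want to see the size difference yourself, a quick way is to compare the published configurations of the two checkpoints on the Hugging Face Hub:

from transformers import BertConfig

base = BertConfig.from_pretrained("bert-base-uncased")
large = BertConfig.from_pretrained("bert-large-uncased")

# Layers, hidden size, attention heads
print(base.num_hidden_layers, base.hidden_size, base.num_attention_heads)     # 12, 768, 12
print(large.num_hidden_layers, large.hidden_size, large.num_attention_heads)  # 24, 1024, 16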
Input / Output Representation
BERT handles a wide variety of downstream tasks. To accommodate this, input representation needed to be similar for a single sentence or a pair of sentences.
Important Note: I am going to stick with the definition of sentence and sequence given by the authors. A sentence is an “arbitrary span of contiguous text”, it can refer to multiple linguistic sentences packed together or an individual linguistic sentence. A sequence is the input token sequence fed to BERT. With the definitions out of the way, let us jump back into our adventure in the BERT-land.
To make our lives easier, I am going to explain this concept with an example and some visualizations. Consider this:
Sentence 1: “Napoleon revolutionized military organization.”
Sentence 2: “Napoleon has a legacy.”
Sentence 3: “Legacy still lives.”
The input representation process consists of tokenization and embedding. There are two main special tokens: [CLS] and [SEP].
The [CLS] token is a special classifier token added at the start of every sequence. The [SEP] token marks the separation between sentence pairs. For sentence pairs, we also add a learned segment embedding indicating which sentence each token comes from (Sentence A or Sentence B).
From there, every word is split into sub-word chunks present in the tokenizer's vocabulary (tokens). These tokens are then converted into embeddings (vectors).
Visually, for a single sentence:

Single Sentence Embedding, Image by Author
For a sentence pair,

Two Sentence Embedding, Image by Author
There are a few important things to note here: [CLS] is mapped to token ID 101 and [SEP] to token ID 102 by the bert-base-uncased tokenizer.
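As a quick sanity check, a small sketch with the bert-base-uncased tokenizer shows both the special token IDs and the Sentence A/Sentence B segment IDs:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Single sentence: [CLS] ... [SEP]
single = tokenizer("Napoleon revolutionized military organization.")
print(single["input_ids"])      # starts with 101 ([CLS]) and ends with 102 ([SEP])

# Sentence pair: [CLS] A [SEP] B [SEP], with token_type_ids marking A (0) and B (1)
pair = tokenizer("Napoleon has a legacy.", "Legacy still lives.")
print(pair["input_ids"])
print(pair["token_type_ids"])   # 0s for Sentence A, 1s for Sentence B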
The next step in training BERT is Pre-training and the first step of that Pre-training is Masked Language Modeling (MLM).
Masked Language Modeling
If we simply introduced bidirectional conditioning, each word could indirectly "see itself" through the multi-layered context, which would make prediction trivial and defeat the purpose of pre-training.
To prevent this from happening, we use masked language modeling (MLM).
In MLM, we randomly mask some percentage of the input tokens and train the model to predict them. The authors landed on selecting 15% of the input tokens; most of these are replaced with the [MASK] token (in the paper, 80% of the selected tokens become [MASK], 10% are replaced with a random token, and 10% are left unchanged).
The main point of importance is that we only predict the masked words; we do not reconstruct the entire input.
Here's how we do that with Python:
from transformers import BertTokenizer, BertForMaskedLM
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

# Our three example sentences, packed into one "sentence" (arbitrary span of text)
text = ("Napoleon revolutionised military organisation. "
        "Napoleon has legacy. "
        "Legacy still lives.")

rep = tokenizer(text, return_tensors="pt")
print("Before Masking", rep.input_ids)

# Randomly select ~15% of the tokens, skipping [CLS] (101) and [SEP] (102)
rand = torch.rand(rep.input_ids.shape)
mask_arr = (rand < 0.15) * (rep.input_ids != 101) * (rep.input_ids != 102)
selection = torch.flatten(mask_arr[0].nonzero()).tolist()

# Replace the selected positions with the [MASK] token ID (103)
rep.input_ids[0, selection] = 103
after_masking = rep.input_ids
print("After Masking", after_masking)
I will quickly run through the code: we import BertTokenizer to tokenize our text and BertForMaskedLM to perform MLM. We print the input_ids before and after masking.
Note: We actively avoid masking tokens corresponding to 101 ([CLS]) and 102([SEP]).
The main takeaway from running this code is that before masking we don’t observe 103, which is the token ID for [MASK]. In after_masking, we do observe 103 in some positions. This is MLM.
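The snippet above only performs the masking. To see the other half of MLM, here is a small, self-contained sketch that asks BertForMaskedLM to fill in a masked word; this is illustrative inference with the pre-trained model, not the actual pre-training loop, and the example sentence is just a placeholder.

import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

text = "Napoleon revolutionised military [MASK]."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits              # shape (1, seq_len, vocab_size)

# Find the [MASK] position and take the highest-scoring vocabulary entry there
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))            # BERT's guess for the masked word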
Next Sentence Prediction
For performing tasks like Question Answering and Natural Language Inference, BERT needs to understand the relationship between sentences. To train BERT in understanding this relationship, we prepare two sentences (A and B).
Half the time, Sentence B is the sentence that follows Sentence A in a paragraph; this is labeled IsNext (0 in the code). The other half of the time, Sentence B is chosen at random and does not follow Sentence A; this is labeled NotNext (1 in the code).
from transformers import BertTokenizer, BertForNextSentencePrediction
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')

SentenceA = "Napoleon has legacy"
SentenceB = "Legacy still lives"

# Tokenize the pair; token_type_ids will mark which tokens belong to which sentence
rep = tokenizer(SentenceA, SentenceB, return_tensors="pt")
print(rep)
Printing rep (the representation), we can identify the origin of each token from the token_type_ids: 0 marks tokens from Sentence A and 1 marks tokens from Sentence B.
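To close the loop, here is a short sketch of actually running the pair through BertForNextSentencePrediction and reading off the prediction; index 0 scores IsNext and index 1 scores NotNext.

import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

rep = tokenizer("Napoleon has legacy", "Legacy still lives", return_tensors="pt")
with torch.no_grad():
    logits = model(**rep).logits                 # shape (1, 2)

# Index 0 scores "IsNext", index 1 scores "NotNext"
prediction = torch.argmax(logits, dim=-1).item()
print("IsNext" if prediction == 0 else "NotNext")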
With pre-training out of the way, let us move on to prediction.
Text Prediction
For classification tasks, the output corresponding to the [CLS] token is fed into an additional output layer. This is shown visually in the next section.
For sentence-pair and span-level tasks, the two sentences are packed into one sequence separated by [SEP], and the relevant token outputs (for example, the start and end of an answer span) are fed into an additional output layer.
After having treaded through the mechanics behind BERT, we can relax for a bit. We are going to look at the uses of BERT in Machine Learning tasks.
What Machine Learning Tasks Is BERT Used For?
BERT requires just one additional output layer (in most cases) to perform most machine learning tasks. Let us look at a few examples.
1. Classification

Image by Authors of BERT paper
As we can observe from the image, classification happens by taking the output corresponding to the [CLS] token as the input to the classification layer. Classification can involve many labels; one example is news classification into categories such as ‘tech’, ‘business’, etc.
2. Question Answering

Image by Authors of BERT paper
Question-answering models answer a given question from some provided context. They can pick the answer from paragraphs, options, etc.
Fine-tuning BERT on the Stanford Question Answering Dataset (SQuAD v2.0) allows the model to perform this task. The question and the paragraph are packed into a single sequence separated by [SEP]: the tokens before [SEP] form the question and those after it form the paragraph. The model then predicts the start and end positions of the answer span within the paragraph.
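For a quick illustration, the sketch below uses the Hugging Face pipeline API with bert-large-uncased-whole-word-masking-finetuned-squad, one publicly available BERT checkpoint fine-tuned on SQuAD (any similar checkpoint would work; the question and context are placeholders).

from transformers import pipeline

# A BERT checkpoint fine-tuned on SQuAD; this name is one public example, not the only option
qa = pipeline(
    "question-answering",
    model="bert-large-uncased-whole-word-masking-finetuned-squad",
)

context = "Napoleon revolutionized military organization. Napoleon has a legacy. Legacy still lives."
result = qa(question="What did Napoleon revolutionize?", context=context)
print(result["answer"], result["score"])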
3. Named Entity Recognition
Named Entity Recognition involves extracting meaningful entities from sentences and categorizing them into groups like Person, Organisation, etc. The main difference in this task is that we take the output from every token position, not just [CLS].
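As a sketch, the pipeline API makes this straightforward; dslim/bert-base-NER is one publicly available BERT checkpoint fine-tuned for NER (assumption: it is still hosted on the Hugging Face Hub), and the example sentence is a placeholder.

from transformers import pipeline

# A BERT checkpoint fine-tuned for NER; grouping merges sub-word pieces into whole entities
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

for entity in ner("Napoleon served in the French Army."):
    print(entity["word"], entity["entity_group"], round(entity["score"], 3))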
4. Sentiment Analysis
Sentiment Analysis is the task in which a model identifies human emotion from a given text; this includes anger, sadness, happiness, etc. This is exactly what we are going to do in our Sentiment Analysis with BERT section.
5. Text Summarisation
In text summarisation, our model condenses large volumes of text (usually documents) into shorter summaries. There are two types: extractive summarisation (selecting the most important sentences from the document) and abstractive summarisation (generating a new summary in the model's own words). Look into the BERT-based summarisation model BERTSUM for an in-depth architecture for this task.
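As a rough illustration of extractive summarisation with BERT, the third-party bert-extractive-summarizer package clusters BERT sentence embeddings and keeps the most representative sentences. This is not part of the core transformers library, and the API shown is an assumption based on the package at the time of writing; the text is a placeholder.

# pip install bert-extractive-summarizer   (third-party package, not part of transformers)
from summarizer import Summarizer

text = (
    "Napoleon revolutionized military organization. "
    "His reforms reshaped armies across Europe. "
    "Napoleon has a legacy. Legacy still lives."
)

model = Summarizer()            # BERT-based extractive summarizer
print(model(text, ratio=0.5))   # keep roughly half of the sentences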
BERT-based Models
1. FinBERT
FinBERT is BERT trained on financial data specifically for sentiment analysis. It can understand a lot of “investment jargon” and correctly identify the sentiment. The financial data it was trained on includes text from financial news services and the FiQA dataset.
2. RoBERTa
RoBERTa is an acronym for Robustly optimized BERT approach. It has the same architecture as BERT but improves on the training recipe: better training methodology, more computing power, and roughly ten times more data. It also introduced dynamic masking, where the masked positions change during training.
3. DeBERTa
DeBERTa is an acronym for Decoding-enhanced BERT with Disentangled attention. It builds on RoBERTa with a disentangled attention mechanism and an enhanced mask decoder, while training on half the data. It surpassed human performance on the SuperGLUE benchmark.
4. DistilBERT
DistilBERT is a distilled version of BERT that is roughly 40% smaller and 60% faster while retaining about 97% of BERT's language-understanding performance. It uses knowledge distillation, which trains a much smaller student model to approximate the original BERT.
This is much more sustainable to train, but it still requires a lot of resources.
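As a quick sketch of that size difference, we can compare parameter counts of the two base checkpoints directly (the exact numbers depend on the checkpoints, so treat the printed values as approximate):

from transformers import BertModel, DistilBertModel

# Rough size comparison between BERT base and its distilled counterpart
bert = BertModel.from_pretrained("bert-base-uncased")
distilbert = DistilBertModel.from_pretrained("distilbert-base-uncased")

print(f"BERT parameters:       {bert.num_parameters():,}")
print(f"DistilBERT parameters: {distilbert.num_parameters():,}")
# DistilBERT is roughly 40% smaller, which is what makes it cheaper to fine-tune.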
Example: Sentiment Analysis with BERT using Python
In this section, we will perform Sentiment Analysis with BERT on the IMDB Movie Reviews dataset. The original dataset consists of 50,000 reviews. I have extracted a much smaller dataset (with 135 reviews), which can be downloaded using the link in the code block.
Remember, you can also follow along in this Colab.
Importing Dependencies
import torch
import pandas as pd
import numpy as np
from transformers import BertTokenizer, BertForSequenceClassification
Downloading The IMDB Dataset
The link below will download the shorter version of the original IMDB Movie Reviews dataset. The original dataset contains 50,000 reviews, in the shorter one, we only have 135 reviews.
df = pd.read_csv('https://gist.githubusercontent.com/Mukilan-Krishnakumar/e998ecf27d11b84fe6225db11c239bc6/raw/74dbac2b992235e555df9a0a4e4d7271680e7e45/imdb_movie_reviews.csv')
df = df.drop('sentiment', axis=1)
Model Building and Evaluation
We are going to use nlptown/bert-base-multilingual-uncased-sentiment, a BERT model fine-tuned for sentiment analysis. We define a custom function called sentiment_movie_score, which goes through the data row by row, performs sentiment analysis, and returns a score between 1 (negative) and 5 (positive).
tokenizer = BertTokenizer.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')
model = BertForSequenceClassification.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')

def sentiment_movie_score(movie_review):
    token = tokenizer.encode(movie_review, return_tensors='pt')
    result = model(token)
    # argmax over the five sentiment classes; +1 maps it to a 1-5 score
    return int(torch.argmax(result.logits)) + 1

# Truncate each review to its first 512 characters to stay under BERT's sequence limit
df['sentiment'] = df['text'].apply(lambda x: sentiment_movie_score(x[:512]))
Visualizing The Results
We initialize a W&B run with BERT_Sentiment_Analysis as the project name, log our DataFrame as a wandb.Table, and finish the run.
!pip install wandb
import wandb

wandb.init(project="BERT_Sentiment_Analysis")
wandb.run.log({"Sentiment Analysis of IMDB Movie Reviews": wandb.Table(dataframe=df)})
wandb.run.finish()
Using wandb.Table, we log our data. We can then visualize or query it. This process is explained in much more depth in the documentation.
Summary
In this article, we sought to introduce you to the BERT architecture, Masked Language Modeling, Next Sentence Prediction, the BERT framework, BERT-based models, and the machine learning tasks and applications of BERT.
We also got hands-on experience building a sentiment analysis model on the IMDB Movie Reviews dataset.
The field of NLP is constantly changing and evolving, and we have the opportunity to experience the golden age of natural language processing unfolding right before our eyes. Every year new models, new architectures, and updates of previous architectures are implemented.
For example, new models built on Wav2Vec 2.0 became state of the art for speech, including W2v-BERT and XLS-R. These ‘universal models’ can generalize well to new tasks in a given domain. Multitask and meta-learning approaches have also seen tremendous growth in recent years, with models such as T0, FLAN, and ExT5.
All of the models mentioned above are built on top of the transformer architecture. There is a newer architecture, Perceiver, which is similar to Transformers but scales well to high-dimensional data.
The future of NLP is evolving and vibrant. We can expect many more models to achieve State-of-the-Art by building on the transformer architecture, and new architectures will also be introduced (the viability of these models can only be understood in the future). The field of NLP offers tremendous opportunities and I hope you can take part in this “Golden Age”.
Recommended Reading
Related Articles
How to Fine-Tune BERT for Text Classification
A code-first reader-friendly kickstart to finetuning BERT for text classification, tf.data and tf.Hub
Does Model Size Matter? A Comparison of BERT and DistilBERT
This article provides a comparison of DistilBERT and BERT from Hugging Face, using hyperparameter sweeps from Weights & Biases.
How W&B Helped Graphcore Optimize GroupBERT to Run Faster on IPUs
Learn how W&B helped the team at Graphcore train a new BERT model in 40% less time
How Graphcore Is Supporting the Next Generation of Large Models With the Help of W&B
This article explains how Weights & Biases helped Graphcore drive 50 to 100 times more experiments on their advanced IPU hardware.