
Sentiment Classification using Bi-LSTM with Attention

This article provides a guide to using Bi-LSTMs and attention mechanisms for sentiment classification, including a hands-on implementation with code examples.
Created on June 27|Last edited on July 7

Introduction

This article provides a comprehensive guide to using Bi-LSTMs and attention mechanisms for sentiment classification. It includes a hands-on implementation of a Bi-LSTM model for tweet classification and demonstrates how to enhance the model with a soft attention mechanism.
The post concludes with a comparison of the two models, highlighting the improved performance achieved by incorporating attention. The complete code is available in the linked GitHub repository.
As part of this blog post, we will perform tweet classification using a Bi-LSTM, working with the Cyberbullying Classification dataset from Kaggle.
Let's get going!

Introduction to RNNs, LSTMs & Bi-LSTMs

Recurrent Neural Networks (RNNs)

Recurrent Neural Networks (RNNs) are a type of neural network that have been successful in modeling sequential data. They have the unique feature of maintaining an internal state that can remember information about previous inputs in the sequence, making them particularly suited for tasks such as language modeling and time-series prediction.
RNNs were the go-to network choice for many tasks until LSTMs & Transformers came along. Check out this blog by none other than Andrej Karpathy, titled “The Unreasonable Effectiveness of Recurrent Neural Networks”.
However, RNNs suffer from the vanishing gradient problem, which makes it difficult for them to learn long-range dependencies in the data.
To further understand why, let’s borrow a representation of RNNs from one of the best blogs on RNNs & LSTMs, Christopher Olah’s “Understanding LSTM Networks”. One could visualise RNNs as below:

Here $X_0$, $X_1$, $X_2$ are the first, second & third words respectively. $X_t$ represents the word at position $t$, so, as we can see in the image above, RNNs consume the sentence word by word while updating the hidden state $h_t$ at every step.
But as we go further into the sequence, and the gap between a word and the earlier context it depends on grows, RNNs become unable to learn to connect the information.

From Christopher Olah’s blog:
In theory, RNNs are absolutely capable of handling such “long-term dependencies.” A human could carefully pick parameters for them to solve toy problems of this form. Sadly, in practice, RNNs don’t seem to be able to learn them.
Suppose we have a sentence: "I am feeling very good today because the weather is nice." We want to predict the sentiment of this sentence.
A standard RNN would process this sentence one word at a time, maintaining an internal state that captures the information about the words it has seen so far. For example, when it gets to the word "good", the RNN's state should contain information about the preceding words "I am feeling very". However, due to the vanishing gradient problem, RNNs tend to forget information about earlier words as the sentence gets longer. So, by the time it gets to "nice", it might have forgotten much of the context, which could lead to less accurate predictions.
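To make this concrete, here is a minimal PyTorch sketch, using a toy vocabulary and made-up dimensions purely for illustration, of an RNN consuming the example sentence one word at a time:

import torch
import torch.nn as nn

# Illustrative only: a toy vocabulary and embedding, not the model we build later.
sentence = "I am feeling very good today because the weather is nice".split()
vocab = {word: idx for idx, word in enumerate(sentence)}

embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)
rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)

token_ids = torch.tensor([[vocab[w] for w in sentence]])  # [1, seq_len]
embedded = embedding(token_ids)                           # [1, seq_len, 8]

outputs, h_n = rnn(embedded)
print(outputs.shape)  # [1, 11, 16] -> one hidden state h_t per word
print(h_n.shape)      # [1, 1, 16]  -> the final hidden state after "nice"

By the final step, everything the network knows about the earlier words has to be packed into that single final hidden state, which is exactly where long sentences become a problem.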

LSTMs

Long Short-Term Memory (LSTM) networks, a type of RNN, were designed to overcome this vanishing gradient problem. They introduce a memory cell that can maintain its state over time and gating units that regulate the flow of information into and out of the cell. This architecture allows LSTMs to learn longer sequences than traditional RNNs.
Again, borrowing from Christopher Olah’s blog:
All recurrent neural networks have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module will have a very simple structure, such as a single tanh layer.

LSTMs also have this chain like structure, but the repeating module has a different structure. Instead of having a single neural network layer, there are four, interacting in a very special way.

As part of this blog post, we won’t be going through the inner workings of LSTMs, but I would refer the readers to Christopher Olah’s blog.
However, the main difference between RNNs and LSTMs is in this inner structure of repeating chains. The main idea is that LSTMs have logic gates that allow the network to keep or forget information as needed. Thus, they are able to maintain longer-term dependencies.
For the same example sentence, "I am feeling very good today because the weather is nice."
When the LSTM processes the word "good," it not only updates its internal state but also updates a separate memory cell. As mentioned before, the information in this cell is regulated by gating units, which decide what information to keep or forget. This allows the LSTM to maintain a longer context, so when it gets to "nice", it still remembers the earlier parts of the sentence, leading to a more accurate prediction.
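The difference is visible even at the level of the PyTorch API: an LSTM carries a cell state alongside the hidden state, and it is this cell state that the gates read from and write to. A minimal sketch, reusing the made-up dimensions from the RNN example above:

import torch
import torch.nn as nn

# Same toy setup as the RNN sketch; dimensions are illustrative assumptions.
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

embedded = torch.randn(1, 11, 8)  # stand-in for the embedded 11-word sentence

outputs, (h_n, c_n) = lstm(embedded)
print(outputs.shape)  # [1, 11, 16] -> hidden state at every step
print(h_n.shape)      # [1, 1, 16]  -> final hidden state
print(c_n.shape)      # [1, 1, 16]  -> final cell state, the gated "memory" of the LSTM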

Bi-LSTMs

Bi-directional LSTMs (Bi-LSTMs) extend the idea of LSTMs by having two LSTMs in each layer. One LSTM processes the sequence from left to right (forward), and the other from right to left (backward). The outputs of both LSTMs are then concatenated. This allows the network to have access to past (from the forward LSTM) and future (from the backward LSTM) contexts at the same time, which can be very useful in many tasks.
For the same example sentence, "I am feeling very good today because the weather is nice" -one LSTM goes from "I" to "nice", and the other goes from "nice" to "I". The final representation for each word is then the concatenation of the two LSTM outputs. For example, the representation for "good" would contain information about both the preceding words "I am feeling very" and the following words "today because the weather is nice". This gives the Bi-LSTM a more complete understanding of the sentence, which can lead to even more accurate predictions.

Introduction to Attention Mechanisms

What is Attention?

"Attention" is inspired by the human visual attention mechanism, which allows us to focus on a part of a visual scene while perceiving the rest of it in lower resolution. Similarly, in neural networks, attention allows the model to focus on certain parts of the input when producing an output.
Let's take our example sentence: "I am feeling very good today because the weather is nice". In a standard RNN or LSTM, each word in the sentence would contribute equally to the final prediction. However, with attention, the model can learn to focus more on the words "good" and "nice" when predicting the sentiment of the sentence, as these words are more indicative of the sentiment than the other words.

How does Attention work?

In the simplest form, an attention mechanism scores each item in the input sequence, and these scores are then used to weight the contribution of each item to the output. The scores are based on the current output and each item in the input sequence, allowing the model to focus on relevant inputs.
For instance, in our example sentence, the attention mechanism might assign higher scores to the words "good" and "nice", causing these words to have a greater impact on the final sentiment prediction.
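A toy example makes this weighting explicit. The scores below are hand-picked purely for illustration (a trained model would learn to produce them); the point is that a softmax turns raw scores into weights that sum to one, and those weights decide how much each word contributes to the output:

import torch
import torch.nn.functional as F

words = "I am feeling very good today because the weather is nice".split()

# Hand-picked, purely illustrative relevance scores (higher = more relevant to sentiment).
scores = torch.tensor([0.1, 0.1, 0.3, 0.2, 2.0, 0.1, 0.1, 0.1, 0.5, 0.1, 1.8])

weights = F.softmax(scores, dim=0)  # normalise so the weights sum to 1
for word, weight in zip(words, weights):
    print(f"{word:>8}: {weight:.3f}")  # "good" and "nice" end up with the largest weights

# In a real model, each word also has a vector representation; the attention output is
# the weighted sum of those vectors.
word_vectors = torch.randn(len(words), 16)                  # stand-in representations
context = (weights.unsqueeze(1) * word_vectors).sum(dim=0)  # [16]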

Types of Attention

There are several types of attention mechanisms, including additive attention, multiplicative attention, and scaled dot-product attention (used in the Transformer model). Each of these mechanisms uses a different method to calculate the attention scores.
In transformer-based Large Language Models such as ChatGPT, Claude, Bard, and LLaMA, scaled dot-product attention is used along with tips & tricks like FlashAttention, ALiBi, and more.
In this blog post, we will be exploring the use of additive attention in the context of a Bi-LSTM model for tweet classification.
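To make the distinction concrete, here is a minimal sketch, with made-up shapes unrelated to the model we build later, of how additive (Bahdanau-style) attention and scaled dot-product attention compute their scores before the softmax:

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative shapes: one query vector attending over 11 encoder states.
hidden_dim = 16
query = torch.randn(1, hidden_dim)        # e.g. a decoder / classification state
keys = torch.randn(1, 11, hidden_dim)     # e.g. per-word encoder states

# Additive (Bahdanau-style) attention: scores come from a small feed-forward network.
W_q = nn.Linear(hidden_dim, hidden_dim, bias=False)
W_k = nn.Linear(hidden_dim, hidden_dim, bias=False)
v = nn.Linear(hidden_dim, 1, bias=False)
additive_scores = v(torch.tanh(W_q(query).unsqueeze(1) + W_k(keys))).squeeze(-1)  # [1, 11]

# Scaled dot-product attention (as used in Transformers): scores are dot products.
dot_scores = (keys @ query.unsqueeze(-1)).squeeze(-1) / math.sqrt(hidden_dim)     # [1, 11]

additive_weights = F.softmax(additive_scores, dim=-1)
dot_weights = F.softmax(dot_scores, dim=-1)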

Bi-LSTM for Tweet Classification with complete training & evaluation code + W&B for experiment tracking

As part of this section, we will perform Tweet Classification using a Bi-LSTM in PyTorch. All code has been shared publicly for you to reproduce in this repository.
But before we can train our model, let's first look at the dataset we will be working with. I have uploaded the dataset as a W&B table for you to experiment with.


The table above is our training set; you can access the training and test sets as W&B tables here.
As part of this tweet classification task, we take the text column as input and the sentiment column as the label.

Data Preparation

Since we can’t directly feed raw text to neural networks, we will create a class called TweetDataset that takes in the text as input, tokenizes it using BertTokenizer, and returns a dictionary.
from transformers import BertTokenizer
from torch.utils.data import Dataset
import torch
import pandas as pd


class TweetDataset(Dataset):
    def __init__(self, filename, maxlen):
        # Store the contents of the file in a pandas dataframe
        self.df = pd.read_csv(filename, delimiter=",")

        self.classes = ["negative", "neutral", "positive"]
        # Initialize the BERT tokenizer
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

        self.maxlen = maxlen

    def __len__(self):
        return len(self.df)

    def __getitem__(self, index):
        # Selecting the sentence and label at the specified index in the data frame
        sentence = self.df.loc[index, "text"]
        label = torch.tensor(
            self.classes.index(self.df.loc[index, "sentiment"]), dtype=torch.long
        )

        # Preprocessing the text to be suitable for BERT
        encoded = self.tokenizer(
            sentence,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
            max_length=self.maxlen,
        )
        input_ids = encoded["input_ids"][0]
        attention_mask = encoded["attention_mask"][0]

        return {
            "sentence": sentence,
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "label": label,
        }
This is what the output looks like for an example sentence “Shanghai is also really exciting (precisely -- skyscrapers galore). Good tweeps in China: (SH) (BJ).”:
{'sentence': 'Shanghai is also really exciting (precisely -- skyscrapers galore). Good tweeps in China: (SH) (BJ).',
'input_ids': tensor([ 101, 8344, 2003, 2036, 2428, 10990, 1006, 10785, 1011, 1011,
24581, 2015, 14891, 5686, 1007, 1012, 2204, 1056, 28394, 4523,
1999, 2859, 1024, 1006, 14021, 1007, 1006, 1038, 3501, 1007,
1012, 102, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0]),
'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
'label': tensor(1)}
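To feed this dataset to a model, we wrap it in a PyTorch DataLoader. A minimal sketch, assuming a hypothetical train.csv, maxlen=48, and a batch size of 32 (illustrative values, not necessarily what the repository uses):

from torch.utils.data import DataLoader

# Hypothetical file path and hyperparameters, chosen here just for illustration.
train_dataset = TweetDataset(filename="train.csv", maxlen=48)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

batch = next(iter(train_loader))
print(batch["input_ids"].shape)       # [32, 48]
print(batch["attention_mask"].shape)  # [32, 48]
print(batch["label"].shape)           # [32]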

Sentiment Classifier

Now that we have a dataset that outputs token IDs, we are ready to create our Bi-LSTM. PyTorch already implements the nn.LSTM class for us; to make it a Bi-LSTM, all we need to do is pass bidirectional=True.
import torch
import torch.nn as nn


class SentimentClassifier(nn.Module):
    def __init__(
        self,
        tokenizer,
        embedding_dim=128,
        hidden_dim=256,
        output_dim=3,
        n_layers=2,
        bidirectional=True,
        dropout=0.1,
    ):
        super().__init__()

        # Embedding layer
        self.embedding = nn.Embedding(len(tokenizer.get_vocab()), embedding_dim)

        # LSTM layer
        self.lstm = nn.LSTM(
            embedding_dim,
            hidden_dim,
            num_layers=n_layers,
            bidirectional=bidirectional,
            dropout=dropout,
            batch_first=True,
        )

        # Dense layer
        self.fc = nn.Sequential(
            nn.Linear(hidden_dim * 2, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim),
        )
        # Activation function
        self.act = nn.Sigmoid()

    def forward(self, text):
        # text = [batch size, sent_length]
        embedded = self.embedding(text)
        _, (hidden, _) = self.lstm(embedded)
        # hidden = [num layers * num directions, batch size, hid dim]

        # concat the final forward and backward hidden state
        hidden = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)

        # hidden = [batch size, hid dim * num directions]
        dense_outputs = self.fc(hidden)

        # Final activation function
        outputs = self.act(dense_outputs)

        return outputs
The model takes in the outputs of the tokenizer and converts them to embeddings using self.embedding, an nn.Embedding layer that maps from the vocabulary size to embedding_dim. Thus, each token gets converted to an embedding_dim-long vector.
Once we have that, we pass the embedded tokens to self.lstm, which is a Bi-LSTM because bidirectional=True. Finally, we take the outputs and concatenate the forward and backward hidden states. (Can you think of why?)
We concatenate the forward and backward hidden states because a Bi-LSTM consists of two LSTMs, one per direction. To make a prediction, as mentioned in the section before, the outputs of both LSTMs are concatenated, which gives the network access to past (from the forward LSTM) and future (from the backward LSTM) context at the same time.
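A quick shape check, assuming a batch of 8 sentences of 48 tokens and the default hyperparameters above, shows exactly what gets concatenated:

import torch
import torch.nn as nn

# Illustrative shape check; batch size and sequence length are assumptions.
lstm = nn.LSTM(128, 256, num_layers=2, bidirectional=True, batch_first=True)
embedded = torch.randn(8, 48, 128)   # [batch, seq_len, embedding_dim]

_, (hidden, _) = lstm(embedded)
print(hidden.shape)                  # [4, 8, 256] = [layers * directions, batch, hid dim]

# hidden[-2] is the last layer's forward state, hidden[-1] the last layer's backward state
final = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)
print(final.shape)                   # [8, 512] -> fed into the dense classifier head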


Model Training

Next up, we need to define a training function that takes the data and the model and trains it. We will use nn.CrossEntropyLoss as our loss function because we are performing multi-class text classification.

from tqdm import tqdm


def train_one_epoch(model, data_loader, criterion, optimizer, device):
    model.train()
    epoch_loss = 0
    epoch_accuracy = 0

    for i, batch in tqdm(enumerate(data_loader), total=len(data_loader)):
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["label"]

        # Move tensors to GPU
        input_ids = input_ids.to(device)
        attention_mask = attention_mask.to(device)
        labels = labels.to(device)

        # Forward pass
        outputs = model(input_ids)
        accuracy = (outputs.argmax(dim=1) == labels).float().mean()
        # Compute loss
        loss = criterion(outputs, labels)

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()
        epoch_accuracy += accuracy.item()
    return epoch_loss / len(data_loader), epoch_accuracy / len(data_loader)
The training function above takes the tokenizer outputs (input_ids) and passes them to the SentimentClassifier model to get outputs of shape [batch_size, 3], because we have 3 labels: "positive", "neutral", & "negative".
We keep track of loss and accuracy and return the mean of both metrics, which becomes our epoch_loss and epoch_accuracy.

Model Evaluation

In the same way, we define an evaluation function that takes in the model and the test data loader.
def evaluate_one_epoch(model, data_loader, criterion, device):
    model.eval()
    epoch_loss = 0
    epoch_accuracy = 0

    with torch.no_grad():
        for i, batch in tqdm(enumerate(data_loader), total=len(data_loader)):
            input_ids = batch["input_ids"]
            attention_mask = batch["attention_mask"]
            labels = batch["label"]

            # Move tensors to GPU
            input_ids = input_ids.to(device)
            attention_mask = attention_mask.to(device)
            labels = labels.to(device)

            # Forward pass
            outputs = model(input_ids)
            accuracy = (outputs.argmax(dim=1) == labels).float().mean()

            # Compute loss
            loss = criterion(outputs, labels)

            epoch_loss += loss.item()
            epoch_accuracy += accuracy.item()

    return epoch_loss / len(data_loader), epoch_accuracy / len(data_loader)
The function above takes in the test data loader and calculates epoch loss and accuracy, which are then returned.
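Putting the pieces together, a training driver might look like the sketch below. This is an assumed outline with illustrative file names, hyperparameters, and project name, not the exact contents of main.py in the repository:

import wandb
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

# Assumed wiring: file names, hyperparameters, and the W&B project name are illustrative only.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

train_dataset = TweetDataset("train.csv", maxlen=48)
valid_dataset = TweetDataset("test.csv", maxlen=48)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=32)

model = SentimentClassifier(train_dataset.tokenizer).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

wandb.init(project="attention_lstm")
for epoch in range(5):
    train_loss, train_acc = train_one_epoch(model, train_loader, criterion, optimizer, device)
    valid_loss, valid_acc = evaluate_one_epoch(model, valid_loader, criterion, device)
    wandb.log({
        "train/loss": train_loss, "train/accuracy": train_acc,
        "valid/loss": valid_loss, "valid/accuracy": valid_acc,
    })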
To train and evaluate the model on your own machine, simply run the following commands:
git clone https://github.com/amaarora/attention_lstm
cd attention_lstm/src/
python main.py
This will kick off a training run on the Tweet Sentiment dataset and also create a Weights & Biases dashboard.


As can be seen above, we can track the validation loss and accuracy for our Sentiment Classifier which uses a Bi-LSTM. The model achieves 72% accuracy on the validation dataset.

Adding Soft Attention to Bi-LSTM for Tweet Classification

How can we further improve the performance of this model? Let’s add attention. Remember, attention will help the model to focus on the right parts of the sentence.
For example, in the sentence “I really really like the song Love Story by Taylor Swift”, the model would want to pay more attention to the word “like” when classifying the sentiment of this sentence. Here is what our TweetDataset returns for it:
{'sentence': 'I really really like the song Love Story by Taylor Swift',
'input_ids': tensor([ 101, 1045, 2428, 2428, 2066, 1996, 2299, 2293, 2466, 2011, 4202, 9170,
102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
'label': tensor(2)}
So how can we achieve this? Simple: we allow the model to assign a weight to each token. By doing so, the model can automatically give more weight to the tokens for the word “like”, which should help increase the performance of the model.

Sentiment Classifier with Attention

Let’s now define our model that has the capability to attend to the right parts of the sentence through an attention mechanism.
First, let’s write a class to define the attention mechanism.
import torch.nn as nn
import torch.nn.functional as F


class Attention(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.attn = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, lstm_output):
        # lstm_output = [batch size, seq_len, hidden_dim]
        attention_scores = self.attn(lstm_output)
        # attention_scores = [batch size, seq_len, 1]
        attention_scores = attention_scores.squeeze(2)
        # attention_scores = [batch size, seq_len]
        return F.softmax(attention_scores, dim=1)
Above, we have a simple nn.Linear layer of shape [hidden_dim, 1]. Since each token coming out of the Bi-LSTM is represented by a 512-length vector (hidden_dim * 2, because of the two directions), we take in this vector and convert it to a single score. Thereby, each token gets a separate score, which becomes its attention score. This will allow the model to give high scores to words such as “like” when classifying “I really really like the song Love Story by Taylor Swift”.
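A quick sanity check, with assumed shapes matching the Bi-LSTM output, shows that the module turns per-token vectors into a distribution over the sequence:

import torch

# Assumed shapes: batch of 4 sentences, 48 tokens each, Bi-LSTM output of size 256 * 2.
attention = Attention(hidden_dim=512)
dummy_lstm_output = torch.randn(4, 48, 512)

weights = attention(dummy_lstm_output)
print(weights.shape)       # [4, 48] -> one weight per token
print(weights.sum(dim=1))  # each row sums to 1 thanks to the softmax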
Let’s now update our Sentiment Classifier model from before to add an attention mechanism to it.
class SentimentClassifierWithSoftAttention(nn.Module):
    def __init__(
        self,
        tokenizer,
        embedding_dim=128,
        hidden_dim=256,
        output_dim=3,
        n_layers=2,
        bidirectional=True,
        dropout=0.1,
    ):
        super().__init__()

        # Embedding layer
        self.embedding = nn.Embedding(len(tokenizer.get_vocab()), embedding_dim)

        # Attention over the per-token Bi-LSTM outputs
        self.attention = Attention(hidden_dim * 2 if bidirectional else hidden_dim)

        # LSTM layer
        self.lstm = nn.LSTM(
            embedding_dim,
            hidden_dim,
            num_layers=n_layers,
            bidirectional=bidirectional,
            dropout=dropout,
            batch_first=True,
        )

        # Dense layer
        self.fc = nn.Sequential(
            nn.Linear(hidden_dim * 2, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim),
        )
        # Activation function
        self.act = nn.Sigmoid()

    def forward(self, text):
        # text = [batch size, sent_length]
        embedded = self.embedding(text)
        lstm_output, _ = self.lstm(embedded)
        # lstm_output = [batch size, seq_len, hidden_dim * num_directions]

        attention_weights = self.attention(lstm_output)
        # attention_weights = [batch size, seq_len]
        attention_weights = attention_weights.unsqueeze(2)
        weighted = lstm_output * attention_weights
        # weighted = [batch size, seq_len, hidden_dim * num_directions]

        weighted_sum = weighted.sum(dim=1)
        # weighted_sum = [batch size, hidden_dim * num_directions]

        dense_outputs = self.fc(weighted_sum)
        # dense_outputs = [batch size, output_dim]

        # Final activation function
        outputs = self.act(dense_outputs)

        return outputs
Most of the code is the same as before, except that we now take the lstm_output and create a score for each token in the sentence using attention: attention_weights = self.attention(lstm_output). This way, each token gets a score. Finally, we multiply the lstm_output with the attention_weights to get weighted outputs, allowing the model to give higher weights to more relevant tokens and lower weights to less relevant ones.
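If you want to inspect which tokens the model attends to, one option is a small variation of the forward method that also returns the attention weights. This is not part of the repository's model, just an illustrative tweak:

# Illustrative variation, not the repository's implementation: add this method to
# SentimentClassifierWithSoftAttention to return the weights alongside the predictions.
def forward_with_weights(self, text):
    embedded = self.embedding(text)
    lstm_output, _ = self.lstm(embedded)

    attention_weights = self.attention(lstm_output)          # [batch size, seq_len]
    weighted_sum = (lstm_output * attention_weights.unsqueeze(2)).sum(dim=1)

    outputs = self.act(self.fc(weighted_sum))
    return outputs, attention_weights

Pairing each returned weight with its token via tokenizer.convert_ids_to_tokens then makes it easy to check whether words like “like” indeed receive the largest weights.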
We can train and evaluate this model using the training and evaluation script as before.
To train and evaluate on your own machine, run the following commands:
git clone https://github.com/amaarora/attention_lstm
cd attention_lstm/src/
python main.py
But, in this case, make sure that config.yml has model_name: SentimentClassifierWithSoftAttention.
Finally, we can compare the two models since we used Weights & Biases for experiment tracking. From the dashboard we saw before, our Sentiment Classifier model with attention achieves a higher validation accuracy and a much lower training loss. Thus, adding attention helped the model fit the dataset better, thereby improving its performance.
Remember, this improvement came from allowing the model to give more weight to relevant tokens when classifying tweets. From the “How does Attention work?” section of this blog post:
”In the simplest form, an attention mechanism scores each item in the input sequence, and these scores are then used to weight the contribution of each item to the output.”

Conclusion

In this blog post, we have covered the basics of RNNs, LSTMs, and Bi-LSTMs, and introduced the concept of attention mechanisms. We have implemented a Bi-LSTM model for tweet classification and enhanced it with soft attention. We have also provided complete training and evaluation code and used the WandB library for experiment tracking, which can be found in this GitHub repository.
We hope that this post has provided you with a good understanding of how to use LSTMs and attention mechanisms for text classification tasks. As always, we encourage you to experiment with the code and try out different architectures and attention mechanisms to see what works best for your specific task. Happy coding! 🙂
