Training Transformers for Text Classification on HuggingFace

In this report we train a transformer for text classification from scratch and look at why self-attention plays a crucial role in how transformers handle sequential tasks.
Created on June 23|Last edited on May 30

Introduction

The transformer was first introduced in the research paper Attention Is All You Need.
Since then, many variants have appeared, e.g. Linformer, Longformer and Sparse Transformer. Here we focus on the vanilla transformer and implement text/document classification from scratch.
If you want to understand the theory behind transformers in depth, I recommend reading this blog by Jay Alammar.

1. Dataset

Here we use the Deceptive Opinion Spam Corpus, where the task is to predict whether a hotel review is deceptive or truthful.
We only need two columns of the dataset: text and deceptive.
  • Preprocessing

Let's split the dataset into training and validation sets and then batch the text reviews using torchtext's BucketIterator.
# Load dataset
import pandas as pd
import numpy as np
import torch
from sklearn.model_selection import train_test_split
# legacy torchtext API; on torchtext 0.9-0.11 use `from torchtext.legacy import data` instead
from torchtext import data

deceptive = pd.read_csv('deceptive_data.csv')[:400]
text = deceptive['text'].values
deceptive['labels'] = deceptive['deceptive'].map({'deceptive':1,'truthful':0})
labels = deceptive['labels'].values

x_train,x_test,y_train,y_test = train_test_split(text,labels,test_size=0.3)
# now generate separate csvs for training and validation data
# note: the column named 'deceptive' holds the review text, 'labels' holds the 0/1 label
train_csv = pd.DataFrame({'deceptive':x_train,'labels':y_train})
test_csv = pd.DataFrame({'deceptive':x_test,'labels':y_test})

train_csv.to_csv('train_deceptive.csv',index=None)
test_csv.to_csv('val_deceptive.csv',index=None)

# create datafield for text and label
TEXT = data.Field(sequential=True,include_lengths=True,batch_first=True,lower=True)
LABEL = data.Field(sequential=False)

# skip_header=True keeps the csv header row from being read as a training example
train,val = data.TabularDataset.splits(format='csv',skip_header=True,fields=[('deceptive',TEXT),('labels',LABEL)],path='./',train='train_deceptive.csv',validation='val_deceptive.csv')
# build the vocabularies, then create a BucketIterator for the train and val datasets
TEXT.build_vocab(train,max_size=17494)
LABEL.build_vocab(train)
# sort_key must refer to the text field, which is named 'deceptive' here
train_iter,val_iter = data.BucketIterator.splits(datasets=(train,val),batch_size=64,device=torch.device('cpu'),sort=False,sort_key = lambda x:len(x.deceptive),sort_within_batch=False,repeat=False)
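The idea behind bucketing can be sketched in plain Python: reviews of similar length are grouped into the same batch, so each batch needs very little padding. This is only an illustration with made-up token lists and a hypothetical helper named bucket_batches. torchtext's BucketIterator does more than this (shuffling, tensors, devices).

```python
def bucket_batches(sequences, batch_size, pad_token="<pad>"):
    """Group sequences of similar length, then pad each batch to its own max length."""
    ordered = sorted(sequences, key=len)
    batches = []
    for start in range(0, len(ordered), batch_size):
        batch = ordered[start:start + batch_size]
        width = max(len(seq) for seq in batch)
        # pad only up to the longest sequence in this batch, not in the whole dataset
        batches.append([seq + [pad_token] * (width - len(seq)) for seq in batch])
    return batches


# made-up, already tokenized reviews
reviews = [
    ["great", "hotel"],
    ["the", "room", "was", "clean", "and", "quiet"],
    ["nice"],
    ["staff", "were", "friendly", "and", "helpful"],
]
batches = bucket_batches(reviews, batch_size=2)
for batch in batches:
    print(batch)
```

With these four reviews, bucketing needs 2 padding tokens in total, while batching them in their original order would need 8.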

# Tokenization
import re
import pandas as pd
from tensorflow.keras.preprocessing import text

train_df = pd.read_csv('train_deceptive.csv')
train_data = list(train_df['deceptive'].values)
labels = list(train_df['labels'].values) # stays aligned with train_data

print(train_data[0])
# strip digits from every review before fitting the tokenizer
final_data = []
for review in train_data: # a loop variable named `data` would shadow the torchtext module
    new = re.sub('[0-9]','',str(review))
    final_data.append(new)
print(final_data[0])

tokenize = text.Tokenizer()
tokenize.fit_on_texts(texts = (final_data))
index_data = tokenize.texts_to_sequences(final_data)

# dictionary mapping each word to its integer index
word_index = tokenize.word_index
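To make the word-to-index step less of a black box, here is a small stdlib-only sketch in the spirit of what the Keras tokenizer does: count words, give the most frequent word index 1, and replace each word by its index. The helper names build_word_index and texts_to_sequences and the sample sentences are made up for illustration. The real Tokenizer has more options (filters, oov_token, num_words).

```python
import re
from collections import Counter


def build_word_index(texts):
    """Map each word to an integer; frequent words get small indices, 0 is left free for padding."""
    counts = Counter(w for t in texts for w in re.findall(r"[a-z']+", t.lower()))
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(texts, word_index):
    """Replace each known word by its index; unknown words are dropped."""
    return [
        [word_index[w] for w in re.findall(r"[a-z']+", t.lower()) if w in word_index]
        for t in texts
    ]


sample = ["The room was clean", "The staff was rude", "Clean room, friendly staff"]
word_index = build_word_index(sample)
sequences = texts_to_sequences(sample, word_index)
print(word_index)
print(sequences)
```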

2. Self-attention Mechanism

The self-attention mechanism was introduced to address the problems encoder-decoder models have with long sequences. It lets every input position interact with every other position, so that each output is a weighted combination of all inputs, and all outputs are computed in parallel. This is how the model picks up context across long sequences. A detailed explanation of self-attention can be found here.
# Self Attention Architecture
import torch
import torch.nn as nn
import torch.nn.functional as F

class selfattention(nn.Module):
    def __init__(self,k,heads=9):
        super().__init__()
        self.k , self.heads = k,heads
        # linear maps for keys, queries and values; every head gets its own k-dimensional projection
        self.tokeys = nn.Linear(k,k*heads,bias=False)
        self.toqueries = nn.Linear(k,k*heads,bias=False)
        self.tovalues = nn.Linear(k,k*heads,bias=False)
        # merges the outputs of all heads back into k dimensions
        self.unified = nn.Linear(k*heads,k,bias=False)
    def forward(self,x):
        b,t,k = x.size() # b is the batch size, t the sequence length and k the embedding dimension
        h = self.heads
        # project the input and split the result into h heads
        keys = self.tokeys(x).view(b,t,h,k)
        queries = self.toqueries(x).view(b,t,h,k)
        values = self.tovalues(x).view(b,t,h,k)
        # fold the heads into the batch dimension so that one bmm call handles all heads
        keys = keys.transpose(1,2).contiguous().view(b*h,t,k)
        queries = queries.transpose(1,2).contiguous().view(b*h,t,k)
        values = values.transpose(1,2).contiguous().view(b*h,t,k)
        # dot product of queries and keys, scaled by sqrt(k) as in the paper so the softmax does not saturate
        dot = torch.bmm(queries,keys.transpose(1,2)) / (k ** 0.5)
        dot = F.softmax(dot,dim=2) # each row becomes attention weights that sum to 1
        out = torch.bmm(dot,values).view(b,h,t,k)
        # move the heads back next to the feature dimension and unify them into k dimensions
        out = out.transpose(1,2).contiguous().view(b,t,h*k)
        out = self.unified(out)
        return out
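A tiny worked example may help to see what the attention weights are. The sketch below is a simplification, not the class above: it uses a single head, no learned projections (queries, keys and values are all the input itself) and plain Python lists so that every step is visible. The division by the square root of k follows the scaled dot-product formulation of Attention Is All You Need. The input vectors are made up.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head attention where queries, keys and values are all x itself."""
    k = len(x[0])
    scale = math.sqrt(k)
    # weights[i][j]: how much token i attends to token j
    weights = [
        softmax([sum(qi * kj for qi, kj in zip(q, key)) / scale for key in x])
        for q in x
    ]
    # each output vector is a weighted average of all input vectors
    out = [[sum(w * v[d] for w, v in zip(row, x)) for d in range(k)] for row in weights]
    return weights, out


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three made-up token vectors
weights, out = self_attention(x)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights sums to 1, and a token attends more strongly to tokens whose vectors point in a similar direction.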


3. Transformer Block and Model Architecture

Next we pass the inputs through the self-attention layer, add the result back to the input (a residual connection) and apply layer normalization, followed by dropout to reduce overfitting. A small feed-forward network of Linear and ReLU layers then adds non-linearity, again followed by a residual connection, layer normalization and dropout. All of these operations are combined in the transformer block shown below.
# Now build architecture for transformer block
class transformer(nn.Module):
    def __init__(self,k,heads=9):
        super().__init__()
        self.attention = selfattention(k,heads=heads) # pass the number of heads on to the attention layer
        # normalization layers for the outputs of the attention and feed-forward sublayers
        self.norm1 = nn.LayerNorm(k)
        self.norm2 = nn.LayerNorm(k)
        # fully connected multi layer perceptron with a hidden layer of size 4*k
        self.ff = nn.Sequential(nn.Linear(k,4*k),nn.ReLU(),nn.Linear(4*k,k))
        self.drop = nn.Dropout(0.5)
    def forward(self,x):
        attention = self.attention(x)
        x = self.norm1(attention+x) # residual connection, then layer normalization
        x = self.drop(x) # dropout after normalization (drops some activations to reduce overfitting)
        perceptron = self.ff(x)
        x = self.norm2(perceptron + x)
        x = self.drop(x)
        return x
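The "add, then normalize" step in the block can be written out by hand for a single token vector. This is a minimal sketch under simplifying assumptions: it leaves out the learnable gain and bias of nn.LayerNorm (which start at 1 and 0), and the attention output used here is made up.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance across its features."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization: norm(sublayer(x) + x)."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])


x = [1.0, 2.0, 3.0, 4.0]
attended = [0.5, -0.5, 0.5, -0.5]  # made-up attention output for this token
y = add_and_norm(x, attended)
print([round(v, 3) for v in y])
```

The residual connection keeps the original signal available to later layers, and the normalization keeps the scale of the activations stable from block to block.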
Now the embedding layers and the transformer blocks are integrated in the model class as shown below.
# now add classification and add embedding layer initially to it.
# REMEMBER INPUT SHOULD BE INDEXES OF WORDS
class classify(nn.Module):
    def __init__(self,k,seq_length,num_tokens,depth,num_classes,max_pool=True,heads=9):
        super().__init__()
        self.num_tokens = num_tokens
        self.maxpool = max_pool
        # token embedding: expects word indexes produced by the vocabulary
        self.tokenemb = nn.Embedding(embedding_dim=k,num_embeddings=num_tokens)
        # learned position embedding (not used in forward; kept as an alternative to the sinusoidal encoding)
        self.posemb = nn.Embedding(embedding_dim=k, num_embeddings=seq_length)
        tfblocks = []
        for i in range(depth):
            tfblocks.append(transformer(k,heads=heads))
        # stack the transformer blocks
        self.transform = nn.Sequential(*tfblocks)
        # linear layer mapping to the desired number of classes, i.e. 2 for binary classification
        self.prob = nn.Linear(k,num_classes)
        self.drop = nn.Dropout(0.5)
    def forward(self,x,y):
        tokens = self.tokenemb(x)
        b,t,k = tokens.size()
        # sinusoidal position encoding, computed for all positions at once (assumes k is even)
        pos = torch.arange(t,dtype=torch.float,device=tokens.device).unsqueeze(1)
        freq = torch.pow(10000.0,torch.arange(0,k,2,dtype=torch.float,device=tokens.device)/k)
        pe = torch.zeros(t,k,device=tokens.device)
        pe[:,0::2] = torch.sin(pos/freq)
        pe[:,1::2] = torch.cos(pos/freq)
        #positions = self.posemb(torch.arange(t, device=tokens.device))[None, :, :].expand(b, t, k)
        x = tokens + pe # add position information to the token embeddings; pe broadcasts over the batch
        x = self.drop(x)
        x = self.transform(x)
        x = x.max(dim=1)[0] if self.maxpool else x.mean(dim=1)
        logits = self.prob(x)
        # cross entropy applies log-softmax itself, so it must receive the raw logits
        loss = F.cross_entropy(logits,y)
        return loss,F.softmax(logits,dim=1)

# Initializing classify model for binary classification
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
classifier = classify(128,100,17496,12,2)
classifier.to(device)
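The position information that forward adds to the token embeddings can also be looked at in isolation. The sketch below builds a small table following the sinusoidal formula from Attention Is All You Need (base 10000, with dimensions 2j and 2j+1 sharing one frequency). The function name positional_encoding and the tiny sizes are chosen only for illustration.

```python
import math


def positional_encoding(seq_length, k):
    """Return a seq_length x k table: sine on even dimensions, cosine on odd ones."""
    table = []
    for pos in range(seq_length):
        row = []
        for i in range(k):
            # dimensions 2j and 2j+1 share the exponent 2j/k
            angle = pos / (10000 ** ((i - i % 2) / k))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = positional_encoding(seq_length=4, k=6)
print([round(v, 3) for v in pe[1]])
```

Every position gets a distinct pattern of values, which is what lets the otherwise order-agnostic attention layers tell word positions apart.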


4. Training and Evaluation

Now it's time to train the model and save a checkpoint after each epoch. The training loss is logged to Weights & Biases every 2 steps through the wandb API.
# declare optimizer
from torch.optim import Adam
optimizer = Adam(classifier.parameters(),lr=0.03,amsgrad=True)
# set scheduler for learning rate, for that calculate total steps
# notebook-only shell command to install the dependency
!pip install transformers
from transformers import get_linear_schedule_with_warmup
epochs = 2 # should match the number of epochs in the training loop, otherwise the learning rate never finishes decaying
total_steps = len(train_iter) * epochs
# the scheduler lowers the learning rate linearly from its initial value to 0 over total_steps
scheduler = get_linear_schedule_with_warmup(optimizer,num_warmup_steps=0,num_training_steps=total_steps)
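What a linear schedule with warmup does to the learning rate can be sketched without any library. The function below is a hand-written approximation of the multiplier such a schedule applies to the base learning rate. The name linear_schedule_factor and the step counts are made up for the example.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # linear ramp from 0 up to 1 during warmup
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    # linear decay from 1 down to 0 over the rest of training
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 0.03
total_steps = 10  # made-up number of optimizer steps
lrs = [base_lr * linear_schedule_factor(s, 0, total_steps) for s in range(total_steps + 1)]
print([round(lr, 4) for lr in lrs])
```

Note that if num_training_steps is much larger than the number of steps actually run, the learning rate barely decays during training.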

# Training
import numpy as np
import wandb
from sklearn.metrics import f1_score
# assumes wandb.init(...) has already been called to start a run
epochs = 2
max_len = 50 # reviews are truncated to their first 50 tokens
final_loss = []
output = []
testing_accuracy = []
final_test_loss=[]
wandb.watch(classifier,log='all')
for epoc in range(epochs):
    print('epoch-',epoc)
    total_loss = 0
    classifier.train()
    for step,batch in enumerate(train_iter):
        optimizer.zero_grad()
        # batch.deceptive is a (token_ids, lengths) pair; truncate sequences longer than max_len
        batch_train = batch.deceptive[0][:,:max_len].to(device)
        # LABEL's vocab puts <unk> at index 0, so the two classes are 1 and 2; subtract 1 to get 0 and 1
        loss,outputs = classifier(batch_train,(batch.labels-1).to(device))
        if step%2==0:
            wandb.log({"training_loss":loss.item()})

        loss.backward()
        total_loss+=loss.item()
        torch.nn.utils.clip_grad_norm_(classifier.parameters(),1.0)
        optimizer.step()
        scheduler.step()
    avg_loss = total_loss/len(train_iter)
    print('train_loss',avg_loss)
    final_loss.append(avg_loss)
    # Validating
    classifier.eval()
    test_accuracy = 0
    test_loss=0
    for step,batch_t in enumerate(val_iter):
        with torch.no_grad():
            batch_t_test = batch_t.deceptive[0][:,:max_len].to(device)
            labels_t = (batch_t.labels-1).to(device)
            outpu = classifier(batch_t_test,labels_t)
            predictions = outpu[1]
            # no backward pass or zero_grad needed during evaluation
            # note: despite the variable name, this is the F1 score, averaged over batches
            test_accuracy += f1_score(y_pred=np.argmax(predictions.cpu().numpy(),axis=1),y_true=labels_t.cpu().numpy())
            test_loss+=outpu[0].item()
            output.append(predictions)
    avg_test_loss = test_loss/len(val_iter)
    print('val_loss',avg_test_loss)
    final_test_loss.append(avg_test_loss)
    avg_accuracy = test_accuracy/len(val_iter)
    # the same path is overwritten every epoch, so only the latest checkpoint is kept
    torch.save({'epoch':epoc,'state_dict':classifier.state_dict(),'accuracy':avg_accuracy},'/output/checkpoints.pth.tar')
    testing_accuracy.append(avg_accuracy)
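One detail of the evaluation is worth a quick illustration: the loop averages the F1 score of each validation batch, which is generally not the same number as the F1 score computed once over all validation predictions. The stdlib-only sketch below uses a hand-written f1_binary helper and two made-up batches to show the difference.

```python
def f1_binary(y_true, y_pred):
    """F1 score for the positive class (label 1); returns 0.0 when it is undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# two made-up validation batches: (true labels, predicted labels)
batches = [([1, 1, 1, 0], [1, 1, 1, 0]), ([0, 0, 0, 1], [1, 0, 0, 0])]

mean_of_batch_f1 = sum(f1_binary(t, p) for t, p in batches) / len(batches)
all_true = [t for ts, _ in batches for t in ts]
all_pred = [p for _, ps in batches for p in ps]
global_f1 = f1_binary(all_true, all_pred)
print(mean_of_batch_f1, global_f1)
```

If a single validation number is needed, collecting all predictions and labels first and scoring them once is usually the more stable choice, especially with small batches.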



That's it for this report. We saw how a transformer can be trained from scratch for text classification. Transformers can be trained for many other tasks as well, such as named entity recognition and text summarization. Reports on these topics can be found here -
