Training Transformers for Text Classification on HuggingFace
Here we will train a transformer for text classification from scratch and see how self-attention plays a crucial role in how transformers handle sequential tasks.
Created on June 23|Last edited on May 30
Introduction
Since the transformer was first introduced, many variants have appeared, such as Linformer, Longformer, and Sparse Transformer. Here we will focus on the vanilla transformer and implement text/document classification from scratch.
If you want to understand the theoretical concepts behind transformers in depth, I recommend reading this blog by Jay Alammar.
Try out in Colab
1. Dataset
Here we will use the Deceptive Opinion Spam Corpus, where the task is to predict whether a hotel review is deceptive or not.
We will consider two columns: text and deceptive.
Preprocessing
Let's divide the dataset into training and validation sets and then preprocess the review text using torchtext's BucketIterator.
# Load dataset
import re
import numpy as np
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from torchtext import data  # legacy Field API; in torchtext 0.9 to 0.11 it lives in torchtext.legacy

deceptive = pd.read_csv('deceptive_data.csv')[:400]
text = deceptive['text'].values
deceptive['labels'] = deceptive['deceptive'].map({'deceptive': 1, 'truthful': 0})
labels = deceptive['labels'].values
x_train, x_test, y_train, y_test = train_test_split(text, labels, test_size=0.3)

# Generate separate CSV files for the training and validation data
train_csv = pd.DataFrame({'deceptive': x_train, 'labels': y_train})
test_csv = pd.DataFrame({'deceptive': x_test, 'labels': y_test})
train_csv.to_csv('train_deceptive.csv', index=None)
test_csv.to_csv('val_deceptive.csv', index=None)

# Create data fields for the text and the label
TEXT = data.Field(sequential=True, include_lengths=True, batch_first=True, lower=True)
LABEL = data.Field(sequential=False)
train, val = data.TabularDataset.splits(
    path='./', format='csv',
    train='train_deceptive.csv', validation='val_deceptive.csv',
    fields=[('deceptive', TEXT), ('labels', LABEL)],
    skip_header=True)  # the CSV files start with a header row, which is not a review

# Build the vocabularies and the BucketIterators for the train and val datasets
TEXT.build_vocab(train, max_size=17494)
LABEL.build_vocab(train)
train_iter, val_iter = data.BucketIterator.splits(
    datasets=(train, val), batch_size=64, device=torch.device('cpu'),
    sort=False, sort_key=lambda x: len(x.deceptive),  # the text field is named 'deceptive'
    sort_within_batch=False, repeat=False)

# Tokenization
train_data = pd.read_csv('train_deceptive.csv')['deceptive'].values
labels = list(pd.read_csv('train_deceptive.csv')['labels'].values)
x_train, x_test, y_train, y_test = train_test_split(train_data, labels, test_size=0.2)
train_data = list(np.array(train_data))
labels = list(y_train)
test_data = train_data
print(test_data[0])

# Strip digits from every review
final_data = []
for review in test_data:  # named 'review' so the torchtext 'data' module is not shadowed
    new = re.sub('[0-9]', '', str(review))
    final_data.append(new)
print(final_data[0])

# Map every word to an integer index with the Keras tokenizer
from tensorflow.keras.preprocessing import text, sequence
tokenize = text.Tokenizer()
tokenize.fit_on_texts(texts=final_data)
index_data = tokenize.texts_to_sequences(final_data)
word_index = tokenize.word_index
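To make the tokenization step more concrete, here is a minimal plain-Python sketch of what building a word index and converting texts to index sequences involves. It is only an illustration under simplifying assumptions: it splits on whitespace, whereas the Keras tokenizer also filters punctuation, and the helper names `build_word_index` and `texts_to_sequences` as well as the toy reviews are hypothetical.

```python
import re
from collections import Counter


def build_word_index(texts):
    """Map each word to an integer index, most frequent word first (indices start at 1)."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # most_common() orders words by descending frequency
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(texts, word_index):
    """Replace every known word with its index; unknown words are skipped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]


# Hypothetical toy reviews, with digits stripped as in the preprocessing step
reviews = ["The room was clean", "The staff was rude", "Room 12 was noisy"]
cleaned = [re.sub(r"[0-9]", "", r) for r in reviews]
word_index = build_word_index(cleaned)
sequences = texts_to_sequences(cleaned, word_index)
print(word_index)
print(sequences)
```

The model never sees raw words, only these integer indexes, which is why the embedding layer later needs to know the vocabulary size.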
2. Self-attention Mechanism
The self-attention mechanism was introduced to solve the issues that encoder-decoder models face with long sequences. It lets every input in a sequence interact with all the other inputs and computes the outputs for all positions concurrently, so that context can be gathered from across a long sequence. A detailed explanation of self-attention can be found here.
# Self-attention architecture
import torch
import torch.nn.functional as F
from torch import nn

class selfattention(nn.Module):
    def __init__(self, k, heads=9):
        super().__init__()
        self.k, self.heads = k, heads
        # Projections for keys, queries and values; every head gets its own
        # k-dimensional projection, which makes this a multi-head attention layer
        self.tokeys = nn.Linear(k, k * heads, bias=False)
        self.toqueries = nn.Linear(k, k * heads, bias=False)
        self.tovalues = nn.Linear(k, k * heads, bias=False)
        # Projects the concatenated heads back to k dimensions
        self.unified = nn.Linear(k * heads, k, bias=False)

    def forward(self, x):
        # b is the batch size, t is the sequence length and k is the embedding dimension
        b, t, k = x.size()
        h = self.heads
        # Project the input and split out the head dimension: (b, t, h, k)
        keys = self.tokeys(x).view(b, t, h, k)
        queries = self.toqueries(x).view(b, t, h, k)
        values = self.tovalues(x).view(b, t, h, k)
        # Fold the heads into the batch dimension so that a single bmm call
        # handles all heads at once: (b*h, t, k). The copy is expensive but necessary.
        keys = keys.transpose(1, 2).contiguous().view(b * h, t, k)
        queries = queries.transpose(1, 2).contiguous().view(b * h, t, k)
        values = values.transpose(1, 2).contiguous().view(b * h, t, k)
        # Attention weights from the dot product of queries and keys.
        # Note: the scores are not scaled by sqrt(k) here, unlike the standard formulation.
        dot = torch.bmm(queries, keys.transpose(1, 2))
        dot = F.softmax(dot, dim=2)
        out = torch.bmm(dot, values).view(b, h, t, k)
        # Move the heads back next to the embedding dimension and unify them
        # into k dimensions, the same size as the input
        out = out.transpose(1, 2).contiguous().view(b, t, h * k)
        out = self.unified(out)
        return out
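The core of self-attention (dot products, softmax, weighted sum) can also be traced by hand on a toy sequence. The sketch below is a single-head, plain-Python illustration and not the PyTorch class: it assumes queries, keys and values are all equal to the input (no learned projections), and it scales the scores by sqrt(k) as in the original Transformer paper. The function names and the toy input are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head self-attention with queries = keys = values = x.

    x is a list of t vectors, each of dimension k. The learned projections
    (tokeys, toqueries, tovalues) are omitted to keep the sketch small.
    """
    k = len(x[0])
    # Raw scores: dot product of every vector with every other vector,
    # scaled by sqrt(k) (the class above does not apply this scaling).
    scores = [[sum(a * b for a, b in zip(q, key)) / math.sqrt(k) for key in x]
              for q in x]
    weights = [softmax(row) for row in scores]  # each row sums to 1
    # Each output vector is a weighted average of all input vectors.
    out = [[sum(w * v[d] for w, v in zip(row, x)) for d in range(k)]
           for row in weights]
    return weights, out


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy sequence: t=3, k=2
weights, out = self_attention(x)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` shows how much one position attends to every other position; the output has the same shape as the input, which is what allows attention layers to be stacked.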
3. Transformer Block and Model Architecture
Now we will pass the inputs to the self-attention class, then apply layer normalization followed by dropout to prevent overfitting, and finally a feed-forward network of Linear and ReLU layers to add non-linearity. All these operations are integrated in the transformer block shown below.
# Architecture of the transformer block
class transformer(nn.Module):
    def __init__(self, k, heads=9):
        super().__init__()
        # Multi-head self-attention layer defined above
        self.attention = selfattention(k, heads=heads)
        # Normalization layers for the outputs of the attention and feed-forward layers
        self.norm1 = nn.LayerNorm(k)
        self.norm2 = nn.LayerNorm(k)
        # Fully connected multi-layer perceptron applied to the hidden states
        self.ff = nn.Sequential(nn.Linear(k, 4 * k), nn.ReLU(), nn.Linear(4 * k, k))
        self.drop = nn.Dropout(0.5)

    def forward(self, x):
        attention = self.attention(x)
        x = self.norm1(attention + x)  # residual connection, then layer normalization
        x = self.drop(x)  # dropout after normalization (drops some activations to prevent overfitting)
        perceptron = self.ff(x)
        x = self.norm2(perceptron + x)
        x = self.drop(x)
        return x
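To illustrate what the residual connection followed by layer normalization does to a single token vector, here is a small plain-Python sketch. It assumes the behaviour of layer normalization without its learnable scale and shift, and the helper names and the toy vectors are hypothetical.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalize one vector to zero mean and (approximately) unit variance.

    The learnable scale and shift that nn.LayerNorm adds are left out.
    """
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)  # biased variance
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def residual_then_norm(x, sublayer_out):
    """Mirror `self.norm1(attention + x)` for a single token vector."""
    summed = [a + b for a, b in zip(x, sublayer_out)]
    return layer_norm(summed)


x = [1.0, 2.0, 3.0, 4.0]                # hypothetical token embedding
attention_out = [0.5, -0.5, 0.5, -0.5]  # hypothetical attention output
y = residual_then_norm(x, attention_out)
print([round(v, 3) for v in y])
```

Normalizing each token vector independently keeps the activations in a stable range no matter how many blocks are stacked.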
Now the embedding layers and the transformer blocks (which contain the self-attention layer) are integrated in the model class as shown below.
# Classification model: embedding layer, transformer blocks and a linear classification head.
# REMEMBER: THE INPUT SHOULD BE WORD INDEXES
import numpy as np
import torch
import torch.nn.functional as F
from torch import nn

class classify(nn.Module):
    def __init__(self, k, seq_length, num_tokens, depth, num_classes, max_pool=True, heads=9):
        super().__init__()
        self.num_tokens = num_tokens
        self.maxpool = max_pool
        # Token embedding; expects the word indexes produced by the tokenizer
        self.tokenemb = nn.Embedding(embedding_dim=k, num_embeddings=num_tokens)
        # Learned position embedding (not used in forward, which adds sinusoidal encodings instead)
        self.posemb = nn.Embedding(embedding_dim=k, num_embeddings=seq_length)
        tfblocks = []
        for i in range(depth):
            tfblocks.append(transformer(k, heads=heads))
        # Stack the transformer blocks sequentially
        self.transform = nn.Sequential(*tfblocks)
        # Linear layer that maps to the desired number of classes, i.e. 2 for binary classification
        self.prob = nn.Linear(k, num_classes)
        self.drop = nn.Dropout(0.5)

    def forward(self, x, y):
        tokens = self.tokenemb(x)
        b, t, k = tokens.size()
        # Add sinusoidal position encodings element by element; this Python loop is slow
        # for large batches. The original Transformer paper uses base 10000 rather than 1000.
        for batch in range(b):
            for pos in range(t):
                for i in range(k):
                    if i % 2 == 0:
                        tokens[batch][pos][i] += np.sin(pos / (1000 ** (2 * i / k)))
                    else:
                        tokens[batch][pos][i] += np.cos(pos / (1000 ** (((2 * i) + 1) / k)))
        # positions = self.posemb(torch.arange(t, device=torch.device('cuda')))[None, :, :].expand(b, t, k)
        x = tokens  # token embeddings with the position information added
        x = self.drop(x)
        x = self.transform(x)
        # Pool over the sequence dimension to get one vector per review
        x = x.max(dim=1)[0] if self.maxpool else x.mean(dim=1)
        logits = self.prob(x)
        # CrossEntropyLoss applies log-softmax internally, so it receives the raw logits
        loss = nn.CrossEntropyLoss()(logits, y)
        probs = F.softmax(logits, dim=1)
        return loss, probs

# Initialize the classify model for binary classification
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
classifier = classify(128, 100, 17496, 12, 2)
classifier.to(device)
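For comparison with the position loop in the model above, the sinusoidal position encoding from the original Transformer paper can be sketched as a table that is computed once. This is a plain-Python illustration of the standard formulation (base 10000, with each sine/cosine pair sharing one frequency), which differs slightly from the code above; the function name `positional_encoding` is hypothetical.

```python
import math


def positional_encoding(seq_length, k, base=10000.0):
    """Return a seq_length x k table of sinusoidal position encodings."""
    table = []
    for pos in range(seq_length):
        row = []
        for i in range(k):
            # Dimensions 2j and 2j+1 share the same frequency.
            angle = pos / (base ** ((2 * (i // 2)) / k))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = positional_encoding(seq_length=50, k=8)
print([round(v, 3) for v in pe[0]])  # position 0: sin terms are 0, cos terms are 1
print([round(v, 3) for v in pe[1]])
```

Because the table depends only on the position and the dimension, it could be built once and added to the token embeddings of every batch instead of being recomputed element by element in each forward pass.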
4. Training and Evaluation
Now it's time to train the model and save a checkpoint after each epoch. The training loss is logged every 2 steps through the wandb API.
# Declare the optimizer
import numpy as np
import torch
import wandb
from sklearn.metrics import f1_score
from torch.optim import Adam
# Requires the transformers package (pip install transformers)
from transformers import get_linear_schedule_with_warmup

optimizer = Adam(classifier.parameters(), lr=0.03, amsgrad=True)

# The linear schedule needs the total number of steps. It is computed here for 100 epochs,
# while the loop below runs only 2, so the learning rate decays only slightly during this run.
total_steps = len(train_iter) * 100
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)

# Training
epochs = 2
max_len = 50  # reviews are truncated to their first 50 tokens
final_loss = []
output = []
testing_accuracy = []  # stores the average per-batch F1 score of each epoch
final_test_loss = []
wandb.init()  # start a run before logging
wandb.watch(classifier, log='all')
for epoc in range(epochs):
    print('epoch-', epoc)
    total_loss = 0
    classifier.train()
    for step, batch in enumerate(train_iter):
        optimizer.zero_grad()
        # batch.deceptive is a (token_ids, lengths) pair because include_lengths=True
        inputs = batch.deceptive[0][:, :max_len].to(device)
        # Subtract 1 because index 0 of the LABEL vocabulary is reserved for <unk>
        targets = (batch.labels - 1).to(device)
        loss, outputs = classifier(inputs, targets)
        if step % 2 == 0:
            wandb.log({"training_loss": loss.item()})
        loss.backward()
        total_loss += loss.item()
        torch.nn.utils.clip_grad_norm_(classifier.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
    avg_loss = total_loss / len(train_iter)
    print('train_loss', avg_loss)
    final_loss.append(avg_loss)

    # Validation: no backward pass or zero_grad is needed here
    classifier.eval()
    test_accuracy = 0
    test_loss = 0
    for step, batch_t in enumerate(val_iter):
        with torch.no_grad():
            inputs = batch_t.deceptive[0][:, :max_len].to(device)
            targets = (batch_t.labels - 1).to(device)
            loss, predictions = classifier(inputs, targets)
        test_accuracy += f1_score(y_pred=np.argmax(predictions.cpu().numpy(), axis=1),
                                  y_true=targets.cpu().numpy())
        test_loss += loss.item()
        output.append(predictions)
    avg_test_loss = test_loss / len(val_iter)
    print('val_loss', avg_test_loss)
    final_test_loss.append(avg_test_loss)
    avg_accuracy = test_accuracy / len(val_iter)
    # Save a checkpoint; the same file is overwritten after every epoch
    torch.save({'epoch': epoc, 'state_dict': classifier.state_dict(), 'accuracy': avg_accuracy},
               '/output/checkpoints.pth.tar')
    testing_accuracy.append(avg_accuracy)
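To see how a linear schedule with warmup changes the learning rate over training, the multiplier it applies can be sketched in plain Python. This is an illustration of the general technique, not the transformers library code itself; the function name `linear_schedule_multiplier` is hypothetical, and the value of 5 steps per epoch is an assumption based on 280 training reviews and a batch size of 64.

```python
def linear_schedule_multiplier(step, num_warmup_steps, num_training_steps):
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear warmup from 0 to 1
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 0.03
steps_per_epoch = 5  # assumed: 280 training reviews / batch size 64, rounded up
total_steps = steps_per_epoch * 100

for step in (0, 10, 250, 500):
    lr = base_lr * linear_schedule_multiplier(step, 0, total_steps)
    print(step, round(lr, 5))
```

With the schedule sized for 100 epochs, the 10 steps of a 2-epoch run only reduce the learning rate by about 2%, so the number of training steps passed to the scheduler should match the number of steps the loop actually runs if a full decay is intended.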
[W&B chart panel: run fast-shadow-34]
So, that's it for this report. We saw how transformers can be trained from scratch for text classification. Transformers can also be trained for many other tasks, such as Named Entity Recognition and Text Summarization. Reports on these topics can be found here.