
RNN - Training a Seq2seq Model and Fine-tuning a Pretrained GPT-2 Model

Created on May 8|Last edited on August 17

Introduction

In this assignment we experimented with the Dakshina dataset released by Google. This dataset contains pairs of the following form: x, y
ajanabee, अजनबी
i.e., a word in the native script and its corresponding transliteration in the Latin script (the way we type while chatting with our friends on WhatsApp etc.).
We are given many pairs of the form (x_i, y_i)_{i=1}^{n}.
Our goal is to train a model y = f(x) which takes as input a romanized string (ghar) and produces the corresponding word in Devanagari (घर).
This is the problem of mapping a sequence of characters in one language to a sequence of characters in another language.
In other words, this is a scaled down version of the problem of translation where the goal is to translate a sequence of words in one language to a sequence of words in another language (as opposed to sequence of characters here).
Now let us start answering the questions based on the code we developed, using wandb as a tool that automatically generates plots which help us draw conclusions.

Question 1



The RNN custom model, its computational complexity and number of trainable parameters:

An RNN based seq2seq model which contains the following layers has been built:
(i) input layer for character embeddings.
(ii) one encoder RNN which sequentially encodes the input character sequence (Latin)
(iii) one decoder RNN which takes the last state of the encoder as input and produces one output character at a time (Devanagari).
The code is flexible such that the dimension of the input character embeddings, the hidden states of the encoder and decoder, the cell type (RNN, LSTM, GRU) and the number of layers in the encoder and decoder can all be changed.
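A minimal Keras sketch of such a configurable encoder-decoder is given below. The function and argument names (build_seq2seq, embed_dim, latent_dim, etc.) and the default sizes are illustrative assumptions, not our exact implementation.

```python
# Minimal sketch of a configurable encoder-decoder, assuming TensorFlow/Keras.
# The default sizes and names below are placeholders, not our exact settings.
import tensorflow as tf
from tensorflow.keras import layers

def build_seq2seq(src_vocab, tgt_vocab, embed_dim=32, latent_dim=256,
                  cell="LSTM", enc_layers=1, dec_layers=1, dropout=0.2):
    Cell = {"LSTM": layers.LSTM, "GRU": layers.GRU, "RNN": layers.SimpleRNN}[cell]

    # Encoder: embed the Latin characters, then stack recurrent layers.
    enc_in = layers.Input(shape=(None,), name="encoder_input")
    x = layers.Embedding(src_vocab, embed_dim)(enc_in)
    enc_states = None
    for i in range(enc_layers):
        rnn = Cell(latent_dim, return_sequences=(i < enc_layers - 1),
                   return_state=True, dropout=dropout)
        outputs = rnn(x)
        x, enc_states = outputs[0], list(outputs[1:])

    # Decoder: embed the Devanagari characters, initialised with the encoder state.
    dec_in = layers.Input(shape=(None,), name="decoder_input")
    y = layers.Embedding(tgt_vocab, embed_dim)(dec_in)
    for _ in range(dec_layers):
        rnn = Cell(latent_dim, return_sequences=True,
                   return_state=True, dropout=dropout)
        y = rnn(y, initial_state=enc_states)[0]

    out = layers.Dense(tgt_vocab, activation="softmax")(y)
    model = tf.keras.Model([enc_in, dec_in], out)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```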


(a) Total number of computations done by our network

To make our calculations easier, we split the calculation into two parts, one for the encoder and one for the decoder, and then add up the two results to arrive at the total number of computations done by the network.



Encoder RNN

We have assumed the following while doing the computations below.

  • The input embedding size is m
  • The encoder and decoder have 1 layer each
  • The hidden cell state is k for both the encoder and decoder
  • The length of the input and output sequences is the same, i.e., T, and the vocabulary sizes of the source and target languages are |V_i| and |V_0| respectively

s_i = σ(U_{k × m} x_{m × 1} + W_{k × k} s_{(i-1), k × 1} + b_{k × 1})
Let the input at any time step t be e(t) of size |V_i| × 1.
The embedding matrix for the input is represented by C_{m × |V_i|}.
The number of computations required to calculate s_i:
Number of multiplications required = k × m + k × k
Number of additions required = 3 × k
Number of computations for the embedding = m × |V_i|
So, the total computations to calculate s_i = (k × m + k × k) + (3 × k) + (m × |V_i|)
Total computations done by the encoder for all time steps = T(km + k^2 + m|V_i| + 3k)

Decoder RNN

The same assumptions made for the encoder apply to the decoder as well.


s_i = σ(U_{k × m} x_{m × 1} + W_{k × k} s_{(i-1), k × 1} + b_{k × 1})
y_i = O(V_{k × k} s_{i, k × 1} + c_{k × 1})
Let the embedding matrix be represented by C_{m × |V_0|}.
The number of computations required to calculate s_i:
Number of multiplications required = k × m + k × k
Number of additions required = 3 × k
Number of computations for the embedding = m × |V_0|
So, the total computations to calculate s_i = (k × m + k × k) + (3 × k) + (m × |V_0|)
The number of computations required to calculate y_i:
Number of multiplications required = k × k
Number of additions required = 2 × k
So, the total computations to calculate y_i = (k × k) + (2 × k)
Considering all the time steps, the total computations done by the decoder = T(km + 2k^2 + 5k + m|V_0|)

Input Embedding Layer

Total number of additions = m(|V_i| - 1)
Total number of multiplications = m|V_i|
Total number of computations in the input embedding layer = m(|V_i| - 1) + m|V_i|

Adding the total computations of the encoder RNN, the decoder RNN and the input embedding layer:
Total computations done by our network = T(2km + 3k^2 + 8k + m(|V_i| + |V_0|)) + m(|V_i| - 1) + m|V_i|
Non-linear operations such as the activation functions have been ignored while obtaining the above result.
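As a quick sanity check, the sketch below plugs illustrative values into the formula above; m, k, T and the vocabulary sizes used here are assumptions, not the values from our runs.

```python
# Worked example for the computation count derived above (illustrative values only).
m, k, T = 32, 256, 20     # embedding size, hidden size, sequence length
V_i, V_o = 27, 67         # assumed source / target character vocabulary sizes

total = T * (2*k*m + 3*k**2 + 8*k + m * (V_i + V_o)) + m * (V_i - 1) + m * V_i
print(f"approximate computations per sequence: {total:,}")
```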

(b) Total number of parameters in our network

The same assumptions listed in part (a) hold here as well.


Now let us jump into the calculation part for our total number of parameters.


The total number of parameters in the encoder is k^2 + km + k, the total number of parameters in the decoder is 2k^2 + km + 2k, and that of the dense layer is km + m.
Adding all of the above, the total number of parameters in our network is 3k^2 + 3km + 3k + m.
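The same closed form can be evaluated for any chosen sizes; the values of k and m below are illustrative assumptions.

```python
# Evaluating the closed-form parameter count derived above (illustrative sizes).
m, k = 32, 256
params = 3*k**2 + 3*k*m + 3*k + m
print(f"parameters of the 1-layer vanilla RNN seq2seq: {params:,}")
```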

Question 2

We then trained our model on the Hindi lexicon from the Dakshina dataset.

With a low number of epochs, we first ran a hyperparameter sweep over randomly chosen hyperparameters. This helped us understand how the accuracy is affected by the different hyperparameters. Initially, the sweep used zero dropout.

We then increased the number of epochs to 15 and narrowed down the hyperparameter space to concretize our learning from the first sweep. After this we added various dropout values, which helped us identify the best configurations.

We also checked whether increasing the latent dimension to 512 gives good results or not, so this high value of the latent dimension was tested.



Accuracy vs Created Plot

Plots showing the minimum, maximum and average y values are shown individually, where y denotes the validation accuracy.

[Run sets: Bayesian Sweep without attention 1 (58), Bayesian Sweep without attention 2 (80)]


Correlation Summary Table

The table is arranged so that the correlation of each hyperparameter with validation accuracy can be read off in order of parameter importance.

[Run sets: Bayesian Sweep without attention 1 (58), Bayesian Sweep without attention 2 (0)]



Parallel Coordinates plot


[Run sets: Bayesian Sweep without attention 1 (58), Bayesian Sweep without attention 2 (0)]

We tried different input embedding sizes and different numbers of encoder and decoder layers.
We used dropout, applied only to the non-recurrent connections, that is, only to the feedforward connections; it is applied to overcome overfitting.
We also varied other interesting parameters such as the batch size, number of epochs, latent dimensions and cell types (LSTM, RNN, GRU).

The following are the hyperparameters that were used; a sketch of the corresponding sweep configuration follows the list.

  • Cell Type - ['LSTM', 'GRU', 'RNN']
  • Number of layers in encoder = [1,2,3,4,5]
  • Number of layers in decoder = [1,2,3,4]
  • Dropout = [0,0.1,0.2,0.3, 0.4]
  • Latent dimensions = [64, 256,512, 1024, 2048]
  • Batch size = [32, 64, 100]
  • Embedded dimension = [5, 8, 10, 16, 32]
  • Epochs = [15, 20]
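A sketch of how this search space can be expressed as a wandb Bayesian sweep is shown below; the parameter key names, the project name and the metric name val_accuracy are assumptions standing in for the ones used in our actual sweep.

```python
import wandb

# Sketch of the sweep configuration; key names and "val_accuracy" are placeholders.
sweep_config = {
    "method": "bayes",
    "metric": {"name": "val_accuracy", "goal": "maximize"},
    "parameters": {
        "cell_type":  {"values": ["LSTM", "GRU", "RNN"]},
        "enc_layers": {"values": [1, 2, 3, 4, 5]},
        "dec_layers": {"values": [1, 2, 3, 4]},
        "dropout":    {"values": [0, 0.1, 0.2, 0.3, 0.4]},
        "latent_dim": {"values": [64, 256, 512, 1024, 2048]},
        "batch_size": {"values": [32, 64, 100]},
        "embed_dim":  {"values": [5, 8, 10, 16, 32]},
        "epochs":     {"values": [15, 20]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="rnn-transliteration")
# wandb.agent(sweep_id, function=train, count=50)  # train() is our training entry point
```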



Strategy used to reduce the number of runs without sacrificing accuracy

Reducing the number of runs while still achieving the same high accuracy is a challenging task. It is made easier by a utility from TensorFlow Keras known as EarlyStopping: when there is no improvement in validation accuracy, early stopping terminates an unnecessarily long run, thus reducing the total training time while achieving approximately the same accuracy. A minimal sketch is given below.
However, there is a drawback to this method: the model does not make use of all the available training data. It may be desirable to avoid overfitting and yet train on all possible data, especially on problems where the amount of training data is very limited. In our case, however, there is sufficient data available, so we can use this method to reduce the number of runs without hurting accuracy and without losing data.
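A minimal sketch of the EarlyStopping callback described above, assuming a compiled Keras model; the patience value and the variable names are illustrative.

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop a run when validation accuracy stops improving and keep the best weights.
# patience=3 and the model/data variables below are illustrative assumptions.
early_stop = EarlyStopping(monitor="val_accuracy", patience=3, mode="max",
                           restore_best_weights=True)

# model.fit(train_x, train_y, validation_data=(val_x, val_y),
#           epochs=20, batch_size=64, callbacks=[early_stop])
```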


Question 3

Plot showing how validation accuracy varies for different hyperparameter configurations


[Run sets: Bayesian Sweep without attention 1 (58), Bayesian Sweep without attention 2 (0)]


Plot showing training loss variations for different hyperparameter configurations

Observation 1 - Cell type Comparison

From the automatically generated graphs in wandb, our analysis shows that GRU uses less memory and is faster than LSTM; however, LSTM is more accurate on datasets with longer sequences. LSTMs are sometimes called fancy RNNs. Vanilla RNNs have no cell state; they only have hidden states, which serve as the memory of the RNN. LSTMs have both a cell state and a hidden state, and thus perform better. This can also be noticed from our wandb plots shown below.

Observation 2 - Importance of the number of encoder layers

When the number of encoder layers is 3, we get good results; the accuracy decreases when the number of layers is decreased, and the same happens when it is increased further. So we conclude that a moderate number of encoder layers serves best. In other words, as we increase the number of encoder layers, the gap between the maximum and minimum accuracy decreases; however, increasing the number of layers from 3 to 5 gives no significant gain in maximum validation accuracy.

Observation 3 - Importance of the number of decoder layers

Having a higher number of decoder layers gives better validation accuracy. As we increase the number of decoder layers, the gap between the maximum and minimum accuracy decreases. Unlike the encoder, where increasing the number of layers from 2 to 4 gives no significant gain in maximum validation accuracy, in the decoder the validation accuracy keeps increasing until 4 layers are used.

Observation 4 - Importance of dropout layers.

Increasing the amount of dropout gives higher validation accuracy and helps the model generalize better. Dropout values between 10 and 30 percent perform better than no dropout. As noted earlier, dropout is applied only to the non-recurrent connections, as shown in the sketch below.

[Run sets: Bayesian Sweep without attention 1 (58), Bayesian Sweep without attention 2 (0)]
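In Keras this distinction is exposed directly on the recurrent layers: the dropout argument masks only the non-recurrent (input) connections, while recurrent_dropout would also mask the state-to-state connections. A one-line sketch with an illustrative size:

```python
from tensorflow.keras import layers

# dropout acts on the feedforward (input) connections only; recurrent_dropout
# is left at 0 so the recurrent connections are untouched, as described above.
lstm = layers.LSTM(256, return_sequences=True, dropout=0.3, recurrent_dropout=0.0)
```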


Observation 5 - Importance of batch size

When the batch size is increased, the model converges more slowly in terms of epochs, as can be visualized in the plots generated by wandb. However, each epoch is processed faster, so the overall training time decreases. This gives an intuition to find a balance between the two by choosing an optimal batch size.

Observation 6 - Importance of embedding dimensions

Increasing the embedding dimension increases the validation accuracy. These embeddings overcome the limitations of traditional encoding methods and can also be used as input to other models or for visualization. For the vanilla sequence-to-sequence model, 512 hidden units are sufficient; increasing beyond that does not give better results.

Observation 7 - Importance of data augmentation

The dataset is sufficiently large, so there is no need for data augmentation, i.e., no need to artificially increase the amount of data by generating new data points from existing ones. We explicitly create a Python iterator using iter and consume its elements using next, as sketched below.
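A minimal sketch of this explicit iterator pattern, assuming the training pairs are already loaded as (Latin, Devanagari) tuples; the sample data is illustrative.

```python
# Explicit iterator over training pairs; the sample pairs are illustrative only.
pairs = [("ghar", "घर"), ("ajanabee", "अजनबी")]

it = iter(pairs)               # create the iterator explicitly
latin, devanagari = next(it)   # consume one element at a time
print(latin, devanagari)       # ghar घर
```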


Question 4

(a) Test accuracy - 35%

Best Model Configuration
Cell Type : LSTM
Embedding Size : 32
Hidden Units : 256
Number of encoder layer : 4
Number of decoder layer : 4
Dropout : 0.3
Epochs : 20
Batch Size : 64
The above configuration was used to get the test accuracy of 35%, where a prediction is counted as correct only if the output exactly matches the reference word.

(b) Sample inputs from the test data and predictions made by our best model

All the predictions on the test set are added in the folder Predictions_Vanilla in our GitHub project. This folder contains two main files, success.txt and failure.txt, which contain the correct and incorrect predictions of the seq2seq model without attention.
Correct Predictions
Sample outputs for Correct Predictions
Incorrect Predictions
Sample outputs for Incorrect Predictions

(c) Comment on the errors made by our model

Comments on Vowels vs Consonants
Most of the vowels are correctly predicted even in the wrongly predicted words; however, the model makes more errors on the consonants. This means the model works better on vowels than on consonants, as can be seen from the sample incorrect predictions numbered 1, 2 and 4.
Comments on longer sequence vs shorter sequence
Comparing the correctly and incorrectly predicted words, it is clear that the model does not make errors based on the length of the word; that is, the number of characters in a word has no noticeable effect on the errors.
Comments on confusion between similar characters
  • The model is getting confused between ड and द.
  • The model is also getting confused between ग and ज.
  • Sometimes, it is replacing टि with सि.
  • It is also swapping ट and त sometimes.
  • There are some other confusions also in our model. But these 4 are the major confusions.

Other Insightful Observations
While checking the prediction list, we saw that our model sometimes makes small errors such as inserting an extra ॅ. Even though this is a small error, the word is still counted as incorrectly predicted.

Question 5

After adding the attention layer to the basic seq2seq model, we performed hyperparameter tuning.
We tried combinations of the following hyperparameters; a sketch of how the attention layer is wired into the model is given after the list.
Batch size = [64, 128, 256]
Number of encoder/decoder layers = 1
Dropout = [0, 0.1, 0.2, 0.3, 0.4]
Latent dimensions = [64, 128, 256, 512, 1024]
Embedded dimension = [4, 8, 16, 20, 32, 64, 256, 512]
Cell Type = ['LSTM', 'GRU', 'RNN']
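A minimal sketch of one way the attention layer can be wired between the encoder and decoder outputs, using Keras' built-in dot-product (Luong-style) Attention layer; the tensor names continue the earlier architecture sketch and are assumptions, not our exact code.

```python
from tensorflow.keras import layers

# enc_seq: encoder outputs for all time steps, shape (batch, T_src, latent_dim)
# dec_seq: decoder outputs for all time steps, shape (batch, T_tgt, latent_dim)
# Both are produced by recurrent layers built with return_sequences=True.
def add_attention(enc_seq, dec_seq, tgt_vocab):
    # Dot-product attention: decoder states attend over the encoder states.
    context = layers.Attention()([dec_seq, enc_seq])
    # Concatenate the context vectors with the decoder outputs before projecting.
    merged = layers.Concatenate(axis=-1)([context, dec_seq])
    return layers.Dense(tgt_vocab, activation="softmax")(merged)
```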

Accuracy vs Created Plot


[Run sets: Bayesian Sweep with attention 1 (25), Bayesian Sweep with attention 2 (7)]


Parallel Coordinates Plot


[Run sets: Bayesian Sweep with attention 1 (25), Bayesian Sweep with attention 2 (7)]


Correlation Summary Table



[Run sets: Bayesian Sweep with attention 1 (25), Bayesian Sweep with attention 2 (7)]


(b) Test Accuracy

We got a test accuracy of 41%. This is almost 6 percentage points higher than the seq2seq model without attention. All the predictions on the test set are added in the folder Predictions_attention in our GitHub project; the same can be found by clicking here. This folder contains two main files, success.txt and failure.txt, which contain the correct and incorrect predictions of the seq2seq model with attention.
Best configuration
Batch Size = 64
Dropout = 0.2
Embedding Dimension = 512
Epochs = 20
Hidden Units = 1024
Cell Type = LSTM

(c) Which works better ?

Yes, the attention model works better than the vanilla model. The following are some of the errors that this model corrected, with the inferences shown below.
Improvements when attention is introduced
Swapping of ट and त is reduced. The model now correctly distinguishes ड from द, and the confusion between ग and ज is also reduced. In several places it correctly uses टि and सि wherever necessary.
This can be viewed from the following table
Analysis of errors corrected by the attention-based model compared to the seq2seq model


(d) Attention maps


[Run set: 10 runs]


We can see some interesting patterns from the above heatmaps.
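A sketch of how such a heatmap can be drawn from the attention weights of a single decoded word; the attn matrix and the character lists below are illustrative placeholders (a Devanagari-capable font may be needed for the labels).

```python
import matplotlib.pyplot as plt
import numpy as np

# attn[i, j] = attention weight of output character i over input character j.
# The weights and characters below are illustrative only.
attn = np.random.rand(3, 4)
src_chars = list("ghar")
tgt_chars = list("घर") + ["<eos>"]

fig, ax = plt.subplots()
ax.imshow(attn, cmap="viridis")
ax.set_xticks(range(len(src_chars)))
ax.set_xticklabels(src_chars)
ax.set_yticks(range(len(tgt_chars)))
ax.set_yticklabels(tgt_chars)
ax.set_xlabel("input (Latin)")
ax.set_ylabel("output (Devanagari)")
plt.show()
```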

Question 6

It is recommended to read the README.md file before using the repository. It should be noted that the connection strength is shown in the HTML file when the cursor is hovered over the output Hindi words; a rough sketch of the idea is given below.
The code for the HTML file can be found by clicking here
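A rough sketch of the idea behind the visualisation: each output character is written as a span carrying its attention weights, which a small JavaScript hover handler (not shown) can use to shade the input characters. All names and data below are illustrative, not the actual code linked above.

```python
import numpy as np

# Illustrative data: attention weights of each output character over the input.
src = list("ghar")
tgt = list("घर")
attn = np.random.rand(len(tgt), len(src))

spans = []
for i, ch in enumerate(tgt):
    weights = ",".join(f"{w:.2f}" for w in attn[i])
    # data-weights is meant to be read by a hover handler that shades the inputs
    spans.append(f'<span class="out" data-weights="{weights}">{ch}</span>')

html = ('<div class="inp">' + "".join(f"<span>{c}</span>" for c in src) + "</div>"
        '<div class="out-row">' + "".join(spans) + "</div>")
with open("connectivity.html", "w", encoding="utf-8") as f:
    f.write(html)
```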


[Run: html]


Question 7

GitHub Link :
This repository contains the code for Assignment 2 of the course CS6910 - Fundamentals of Deep Learning at IIT Madras, taught by Prof Mitesh Khapra.
The overall structure of our assignment work is laid out clearly in the GitHub repository; the same can be viewed by clicking the following link for an image description of our work.
All the work pertaining to the vanilla sequence-to-sequence model is found in the following link.
All the work pertaining to the sequence-to-sequence model with attention is found in the following link.


Question 8

We finetuned the GPT-2 model to generate lyrics for English songs. The referenced blog was used as a guide for this part.
The blog shows how to finetune GPT-2 to generate headlines for financial articles; instead of headlines, we used song lyrics from the given datasets for training.
Generated lyrics
[["I Love Deeplearning\n\nA Real Thing, Stupid\n\nThat's Why We Have to Be Here\n\nThat's Why We Need You\n\nDeeplearning\n\n[iStock]\n\nStefan Shewerman\n\n\n[li-5]Autumn and She Came Into My Life<|endoftext|>"],
["I Love Deeplearning\n\nGetting to know your neural network\n\nLearning how to spot unwanted attention\n\nSolving problems\n\nWhat does the brain do when it's too busy to do its job?\n\nThe beauty of deep learning\n\nHow can you teach deep learning to improve your life\n\nWhat is deep learning?\n\nHow do you get deep learning to work\n\nHow do you get deep learning to run your business\n\nWhat happens when you make an error?\n\nHow do<|endoftext|>"],
['I Love Deeplearning" - more than a thousand blogs and six video tutorials will change your life.\n\nHere are ten lessons you can take to get started with Deeplearning.\n\n10 Simple Training Tips for Successful Training\n\nOver the years, we\'ve all heard of being taught that every time you train, your brain will adapt. You\'ll need to slowly build up the capacity and training in order to move your brain around. The problem is, learning to keep up with the flow of information, train<|endoftext|>'],
['I Love Deeplearning in Windows 8 and 7\n\nAre you a Microsoft.com subscriber?\n\nFollow me on Twitter and Google+<|endoftext|>'],
['I Love Deeplearning is an excellent example of how self-driving cars could revolutionize how we manage our lives. It uses artificial intelligence to design and build autonomous vehicles and systems that work as a test bed for artificial intelligence in real time.\n\n1. The AlphaGo computer comes with a huge price tag. The amount of money needed to build a machine learning engine in just 5 years is in the billions.\n\n2. This is the fourth prediction in the AlphaGo AlphaGo contest. The outcome of this<|endoftext|>']]
The following were the hyperparameters used for finetuning; a condensed sketch of the corresponding training setup follows the list.
  • batch size = 16
  • epochs = 20
  • learning rate = 2e-5
  • maximum sequence length = 400
  • warmup steps = 200
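A condensed sketch of a finetuning setup with these hyperparameters, assuming the Hugging Face transformers Trainer API; lyrics.txt, the output directory and the use of TextDataset are placeholders for our actual data pipeline (the referenced blog uses its own training loop).

```python
from transformers import (GPT2LMHeadModel, GPT2Tokenizer, TextDataset,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# lyrics.txt is a placeholder path; block_size mirrors the maximum sequence length above.
train_dataset = TextDataset(tokenizer=tokenizer, file_path="lyrics.txt", block_size=400)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="gpt2-lyrics",          # placeholder output directory
    num_train_epochs=20,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    warmup_steps=200,
)

Trainer(model=model, args=args, data_collator=collator,
        train_dataset=train_dataset).train()
```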

Observations
GPT-2 generates synthetic text samples in response to being primed with an arbitrary input. The model is chameleon-like: it adapts to the style and content of the conditioning text. This allows the user to generate realistic and coherent continuations on a topic of their choosing.
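For completeness, a small sketch of how such continuations can be sampled from the finetuned model; the prompt, the sampling settings and the gpt2-lyrics checkpoint directory are illustrative assumptions.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# "gpt2-lyrics" is the placeholder output directory from the finetuning sketch above.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2-lyrics")

inputs = tokenizer("I Love Deeplearning", return_tensors="pt")
outputs = model.generate(**inputs, max_length=100, do_sample=True,
                         top_k=50, top_p=0.95,
                         pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```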

Self Declaration

Team Members

We, Vishnu K T and Nagarajan Chandru, hereby declare that all the work done above has equal contribution from both team mates.

OE21S024 - 50 % contribution CS20M041 - 50 % contribution

The work includes the following -

  • All the report works
  • Coding and developing model
  • Development of Google colab notebooks
  • Hyperparameter sweeps (wandb)
  • GitHub Maintenance
  • Analysis of results and corrections