Assignment 3
Use recurrent neural networks to build a transliteration system.
The two contributors to the code are:
Prithaj Banerjee - CS21S045
Kondapalli Jayavardhan - CS21S011
Instructions
- The goal of this assignment is fourfold: (i) learn how to model sequence to sequence learning problems using Recurrent Neural Networks (ii) compare different cells such as vanilla RNN, LSTM and GRU (iii) understand how attention networks overcome the limitations of vanilla seq2seq models (iv) visualise the interactions between different components in an RNN based model.
- We strongly recommend that you work on this assignment in a team of size 2. Both the members of the team are expected to work together (in a subsequent viva both members will be expected to answer questions, explain the code, etc).
- Collaborations and discussions with other groups are strictly prohibited.
- You must use Python (numpy and pandas) for your implementation.
- You can use any and all packages from keras, pytorch, tensorflow
- You can run the code in a jupyter notebook on colab by enabling GPUs.
- You have to generate the report in the same format as shown below using wandb.ai. You can start by cloning this report using the clone option above. Most of the plots that we have asked for below can be (automatically) generated using the apis provided by wandb.ai. You will upload a link to this report on gradescope.
- You also need to provide a link to your github code as shown below. Follow good software engineering practices and set up a github repo for the project on Day 1. Please do not write all code on your local machine and push everything to github on the last day. The commits in github should reflect how the code has evolved during the course of the assignment.
- You have to check moodle regularly for updates regarding the assignment.
Problem Statement
In this assignment you will experiment with the Dakshina dataset released by Google. This dataset contains pairs of the following form:
ajanabee अजनबी
i.e., a word in the native script and its corresponding transliteration in the Latin script (the way we type while chatting with our friends on WhatsApp etc). Given many such pairs your goal is to train a model which takes as input a romanized string (ghar) and produces the corresponding word in Devanagari (घर).
As you would realise this is the problem of mapping a sequence of characters in one language to a sequence of characters in another language. Notice that this is a scaled down version of the problem of translation where the goal is to translate a sequence of words in one language to a sequence of words in another language (as opposed to sequence of characters here).
Question 1 (15 Marks)
Build a RNN based seq2seq model which contains the following layers: (i) input layer for character embeddings (ii) one encoder RNN which sequentially encodes the input character sequence (Latin) (iii) one decoder RNN which takes the last state of the encoder as input and produces one output character at a time (Devanagari).
The code should be flexible such that the dimension of the input character embeddings, the hidden states of the encoders and decoders, the cell (RNN, LSTM, GRU) and the number of layers in the encoder and decoder can be changed.
(a) What is the total number of computations done by your network? (assume that the input embedding size is m, encoder and decoder have 1 layer each, the hidden cell state is k for both the encoder and decoder, the length of the input and output sequence is the same, i.e., T, and the size of the vocabulary is the same for the source and target language, i.e., V)
(b) What is the total number of parameters in your network? (assume that the input embedding size is m, encoder and decoder have 1 layer each, the hidden cell state is k for both the encoder and decoder, the length of the input and output sequence is the same, i.e., T, and the size of the vocabulary is the same for the source and target language, i.e., V)
Answer:
We built a seq2seq model with the configuration mentioned above and made the parameters flexible to change.
The link for the notebook is: https://github.com/Doeschate/CS6910_Assignment3/blob/main/Assignment3_Q1.ipynb
The link for the python code, which can be run with command line arguments, is: https://github.com/Doeschate/CS6910_Assignment3/blob/main/Assignment3_Q1_CommandLine.py
These files contain the model-building code and print the model summary; training and testing are done in separate files.
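For reference, here is a minimal sketch of how the cell type, the number of layers and the dimensions can be kept configurable. The names (CELLS, build_encoder) are illustrative, not the exact ones used in the notebooks linked above.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Map the cell-type string to the corresponding Keras layer class.
CELLS = {'rnn': layers.SimpleRNN, 'gru': layers.GRU, 'lstm': layers.LSTM}

def build_encoder(vocab_size, embed_dim, latent_dims, cell_type='lstm'):
    """Stack len(latent_dims) recurrent layers and return the final states."""
    Cell = CELLS[cell_type]
    inputs = layers.Input(shape=(None,))
    x = layers.Embedding(vocab_size, embed_dim)(inputs)
    states = []
    for units in latent_dims:
        outputs = Cell(units, return_sequences=True, return_state=True)(x)
        x, states = outputs[0], list(outputs[1:])  # LSTM: [seq, h, c]; RNN/GRU: [seq, h]
    return tf.keras.Model(inputs, [x] + states)

# e.g. encoder = build_encoder(vocab_size=30, embed_dim=64,
#                              latent_dims=[256, 128], cell_type='gru')
```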
PARAMETERS
Given:
- Input embedding size = m
- Hidden cell state size = k
- Input/output sequence length = T
- Vocabulary size = V
RNN:

| Layer | Number of parameters |
| --- | --- |
| Encoder embedding layer | Vm |
| Decoder embedding layer | Vm |
| Encoder layer | h' = activation(Ux + Wh + b), with U of size k×m, W of size k×k and b of size k×1, i.e., km + k^2 + k parameters |
| Decoder layer | Same as the encoder: km + k^2 + k parameters |
| Dense layer | softmax(V'h + b), with V' of size V×k and b of size V×1, i.e., Vk + V parameters |
| Total parameters | 2Vm + 2(km + k^2 + k) + (Vk + V) |
LSTM:

| Layer | Number of parameters |
| --- | --- |
| Encoder embedding layer | Vm |
| Decoder embedding layer | Vm |
| Encoder layer | An LSTM cell has 4 feedforward components (input, forget and output gates plus the candidate cell input), i.e., 4(km + k^2 + k) parameters |
| Decoder layer | Same as the encoder: 4(km + k^2 + k) parameters |
| Dense layer | softmax(V'h + b), with V' of size V×k and b of size V×1, i.e., Vk + V parameters |
| Total parameters | 2Vm + 2·4(km + k^2 + k) + (Vk + V) |
GRU:

| Layer | Number of parameters |
| --- | --- |
| Encoder embedding layer | Vm |
| Decoder embedding layer | Vm |
| Encoder layer | A GRU cell has 3 feedforward components (reset and update gates plus the candidate state), i.e., 3(km + k^2 + k) parameters |
| Decoder layer | Same as the encoder: 3(km + k^2 + k) parameters |
| Dense layer | softmax(V'h + b), with V' of size V×k and b of size V×1, i.e., Vk + V parameters |
| Total parameters | 2Vm + 2·3(km + k^2 + k) + (Vk + V) |
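As a sanity check, the parameter formulas above can be compared with Keras' own count. Below is a minimal sketch with arbitrarily assumed sizes; swapping SimpleRNN for LSTM multiplies the recurrent parameters by 4 as derived above. (One caveat: Keras' GRU defaults to reset_after=True, which adds an extra 3k bias parameters beyond the 3(km + k^2 + k) derived here.)

```python
import tensorflow as tf
from tensorflow.keras import layers

V, m, k = 30, 16, 32  # vocabulary, embedding size, hidden state size (assumed)

enc_in = layers.Input(shape=(None,))
dec_in = layers.Input(shape=(None,))
enc_emb = layers.Embedding(V, m)(enc_in)                         # Vm parameters
_, enc_state = layers.SimpleRNN(k, return_state=True)(enc_emb)   # km + k^2 + k
dec_emb = layers.Embedding(V, m)(dec_in)                         # Vm parameters
dec_seq = layers.SimpleRNN(k, return_sequences=True)(dec_emb,
                                                     initial_state=enc_state)  # km + k^2 + k
dec_out = layers.Dense(V, activation="softmax")(dec_seq)         # Vk + V

model = tf.keras.Model([enc_in, dec_in], dec_out)
assert model.count_params() == 2*V*m + 2*(k*m + k*k + k) + (V*k + V)  # 5086 here
```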
COMPUTATIONS
Since the input and output sequences both have length T, the encoder and the decoder each iterate T times; the total computation is therefore T times the computation of a single step.
RNN:

| Layer | Computations |
| --- | --- |
| Encoder embedding layer | VmT |
| Decoder embedding layer | VmT |
| Encoder layer | h' = activation(Ux + Wh + b): k^2 computations for Wh, km for Ux, k for adding the bias and k for the activation, i.e., (km + k^2 + 2k)T |
| Decoder layer | Same as the encoder: (km + k^2 + 2k)T |
| Dense layer | softmax(V'h + b): Vk computations for V'h, V for adding the bias and V for the softmax, i.e., (Vk + 2V)T |
| Total computations | 2VmT + 2(km + k^2 + 2k)T + (Vk + 2V)T |
LSTM:

| Layer | Computations |
| --- | --- |
| Encoder embedding layer | VmT |
| Decoder embedding layer | VmT |
| Encoder layer | An LSTM cell has 4 feedforward components (input, forget and output gates plus the candidate cell input), plus roughly 4k element-wise operations to combine the gates and states, i.e., [4(km + k^2 + 2k) + 4k]T |
| Decoder layer | Same as the encoder: [4(km + k^2 + 2k) + 4k]T |
| Dense layer | softmax(V'h + b): Vk computations for V'h, V for adding the bias and V for the softmax, i.e., (Vk + 2V)T |
| Total computations | 2VmT + 2[4(km + k^2 + 2k) + 4k]T + (Vk + 2V)T |
GRU:

| Layer | Computations |
| --- | --- |
| Encoder embedding layer | VmT |
| Decoder embedding layer | VmT |
| Encoder layer | A GRU cell has 3 feedforward components (reset and update gates plus the candidate state), plus roughly 3k element-wise operations to combine them, i.e., [3(km + k^2 + 2k) + 3k]T |
| Decoder layer | Same as the encoder: [3(km + k^2 + 2k) + 3k]T |
| Dense layer | softmax(V'h + b): Vk computations for V'h, V for adding the bias and V for the softmax, i.e., (Vk + 2V)T |
| Total computations | 2VmT + 2[3(km + k^2 + 2k) + 3k]T + (Vk + 2V)T |
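To make the three totals easy to compare, here is a small helper (our own illustrative function, not code from the notebooks) that evaluates the formulas above for given V, m, k and T:

```python
def total_computations(V, m, k, T, cell='rnn'):
    """Evaluate the per-cell computation formulas derived above."""
    gates = {'rnn': 1, 'gru': 3, 'lstm': 4}[cell]  # feedforward components per cell
    extra = 0 if cell == 'rnn' else gates * k      # element-wise gate combinations
    per_step = gates * (k*m + k*k + 2*k) + extra   # one encoder (or decoder) step
    return 2*V*m*T + 2*per_step*T + (V*k + 2*V)*T  # embeddings + enc/dec + dense

# e.g. total_computations(64, 32, 128, 20, cell='lstm')
```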
Question 2 (10 Marks)
You will now train your model using any one language from the Dakshina dataset (I would suggest pick a language that you can read so that it is easy to analyse the errors). Use the standard train, dev, test set from the folder dakshina_dataset_v1.0/hi/lexicons/ (replace hi by the language of your choice)
Using the sweep feature in wandb find the best hyperparameter configuration. Here are some suggestions but you are free to decide which hyperparameters you want to explore
- input embedding size: 16, 32, 64, 256, ...
- number of encoder layers: 1, 2, 3
- number of decoder layers: 1, 2, 3
- hidden layer size: 16, 32, 64, 256, ...
- cell type: RNN, GRU, LSTM
- dropout: 20%, 30% (btw, where will you add dropout? you should read up a bit on this)
- beam search in decoder with different beam sizes:
Based on your sweep please paste the following plots which are automatically generated by wandb:
- accuracy v/s created plot (I would like to see the number of experiments you ran to get the best configuration).
- parallel co-ordinates plot
- correlation summary table (to see the correlation of each hyperparameter with the loss/accuracy)
Also write down the hyperparameters and their values that you swept over. Smart strategies to reduce the number of runs while still achieving a high accuracy would be appreciated. Write down any unique strategy that you tried for efficiently searching the hyperparameters.
Answer:
The code for this question is https://github.com/Doeschate/CS6910_Assignment3/blob/main/Assignment3_Q2.ipynb
- The Validation Accuracy v/s created and Training Accuracy v/s created plots shown below give the validation and training accuracy achieved by each experiment, ordered by creation date. Validation accuracy is computed on the dev set in the dakshina_dataset_v1.0/hi/lexicons/ folder and comes out at around 60%; training is done on the train set, and training accuracy exceeds 90%.
- The parallel coordinates plot, attached just below the Accuracy v/s created plots, shows each hyperparameter configuration against the accuracy it achieved.
- The correlation summary table, plotted below the parallel coordinates plot, summarises the correlation of each hyperparameter with the validation accuracy.
- Validation Accuracy and Training Accuracy are also plotted below for the different experiments.
Validation Accuracy v/s created plot
Training Accuracy v/s created plots
Parallel coordinates plot
Correlation summary table and Validation Accuracy, Training Accuracy
We tried the following hyperparameters in the sweep (values taken from the sweep configuration below):
- Input embedding size (encoder_embedding_size): 32, 64, 128, 256, 512
- Number of encoder/decoder layers and their sizes (encoder_decoder_latentDims): [256], [512], [64,128], [256,512], [256,128], [64,128,256], [128,128,128], [512,256,128], [512,512,512], [1024,512,256], [512,256,128,64]
- Dropout (dropout): 0, 0.1, 0.3, 0.5
- Beam search decoder (beam_width): 1, 2, 3, 5
- Number of epochs (epochs): 4, 6, 8, 10, 15, 20, 25, 30, 35, 40, 50
- Cell type (cell_type): lstm, gru, rnn
- Learning rate (learning_rate): 0.0001, 0.0005, 0.001, 0.002, 0.003, 0.005
- Batch size (batch_size): 32, 64, 128, 256, 512
- Optimizer (optimiser): adam, rmsprop, sgd
- Decoder embedding size (decoder_embedding_size): 32, 64, 128, 256, 512
Explanation of where and why dropout is added:
We used dropout inside the recurrent (RNN/GRU/LSTM) cells of both the encoder and the decoder. Dropout is used here mainly to curb overfitting: recurrent cells such as LSTMs are quite prone to overfitting during training, so adding dropout regularises the model and improves test accuracy.
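Concretely, in Keras this placement corresponds to the dropout arguments of the recurrent layers. A hedged illustration (the exact values come from the sweep):

```python
from tensorflow.keras.layers import LSTM

# `dropout` drops units on the input transformation of the cell;
# `recurrent_dropout` drops units on the hidden-to-hidden (recurrent) transformation.
layer = LSTM(256, dropout=0.3, recurrent_dropout=0.3, return_sequences=True)
```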
The sweep config file that we used for hyperparameter tuning is attached below:
```python
# Different sweep configurations
sweep_config = {"method": "bayes"}
metric = {"name": "val_accuracy", "goal": "maximize"}
sweep_config['metric'] = metric
parameter_dict = {
    'encoder_embedding_size': {'values': [32, 64, 128, 256, 512]},
    'decoder_embedding_size': {'values': [32, 64, 128, 256, 512]},
    'cell_type': {'values': ['lstm', 'gru', 'rnn']},
    'encoder_decoder_latentDims': {'values': [[256], [512], [64, 128], [256, 512],
                                              [256, 128], [64, 128, 256],
                                              [128, 128, 128], [512, 256, 128],
                                              [512, 512, 512], [1024, 512, 256],
                                              [512, 256, 128, 64]]},
    'dropout': {'values': [0, 0.1, 0.3, 0.5]},
    'optimiser': {'values': ['adam', 'rmsprop', 'sgd']},
    'learning_rate': {'values': [0.0001, 0.001, 0.002, 0.003, 0.005, 0.0005]},
    'epochs': {'values': [4, 6, 8, 10, 15, 20, 25, 30, 35, 40, 50]},
    'beam_width': {'values': [1, 2, 3, 5]},
    'batch_size': {'values': [32, 64, 128, 256, 512]},
}
sweep_config['parameters'] = parameter_dict
```
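The sweep is then launched in the standard wandb way. In this sketch, train_fn is a placeholder for our actual training function and the project name is assumed:

```python
import wandb

def train_fn():
    # Placeholder for the actual training function: it reads wandb.config,
    # builds the model with those hyperparameters, trains, and logs val_accuracy.
    pass

sweep_id = wandb.sweep(sweep_config, project="CS6910_Assignment3")  # project name assumed
wandb.agent(sweep_id, function=train_fn, count=50)  # run 50 Bayes-selected configurations
```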
Strategy for reducing the total number of experiments:
Since there are many hyperparameters, each with several values, tuning over every combination would require an exponential number of runs. We therefore applied some strategies to reduce the number of runs:
- Bayes search strategy: wandb provides several search strategies over hyperparameters: 1) random search, 2) grid search and 3) Bayes search. We chose Bayes search, which fits a Gaussian model of the objective and picks the next configuration to maximise the probability of improvement; here we maximise validation accuracy.
- Same encoder-decoder size: to reduce the number of parameters in the sweep, we used the same layer sizes for the encoder and the decoder.
- Beam width: to keep decoding computationally light, we tried beam widths of 1, 2 and 3 and got our desired results (a sketch of the decoder-side beam search appears at the end of this answer).
- Encoder embedding size: since the number of unique tokens in the languages we used (English → Hindi) is not that large, we restricted the input and output embedding sizes to a maximum of 512.
Moreover, we first ran a few sweeps and saw that some parameter values were clearly better than the rest, so we restricted later sweeps to those values while searching over the remaining hyperparameters; for example, 'gru' and 'rmsprop' consistently gave the best values.
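For completeness, here is a minimal sketch of the decoder-side beam search referred to above. step_fn is a hypothetical single-step decoder that returns log-probabilities over the vocabulary and the updated state; it stands in for the inference decoder in the notebook.

```python
import numpy as np

def beam_search(step_fn, init_state, sos, eos, beam_width=3, max_len=30):
    """Keep the `beam_width` highest-scoring partial sequences at each step."""
    beams = [(0.0, [sos], init_state)]                      # (log-prob, tokens, state)
    for _ in range(max_len):
        candidates = []
        for score, seq, state in beams:
            if seq[-1] == eos:                              # finished beams carry over
                candidates.append((score, seq, state))
                continue
            log_probs, new_state = step_fn(seq[-1], state)
            for tok in np.argsort(log_probs)[-beam_width:]:  # expand the top-k tokens
                candidates.append((score + log_probs[tok], seq + [int(tok)], new_state))
        beams = sorted(candidates, key=lambda b: b[0], reverse=True)[:beam_width]
    return beams[0][1]                                      # best-scoring sequence
```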
Question 3 (15 Marks)
Based on the above plots write down some insightful observations. For example,
- RNN based model takes longer time to converge than GRU or LSTM
- using smaller sizes for the hidden layer does not give good results
- dropout leads to better performance
(Note: I don't know if any of the above statements is true. I just wrote some random comments that came to my mind)
Of course, each inference should be backed by appropriate evidence.
Answer:
Some insightful observations based on the above plots:
By observing the parallel coordinates plot and the correlation summary table, we arrived at the following inferences.
- Some comment on how increasing/decreasing/not changing the number of encoder and decoder latent dimensions helps/does not help: in the sweep hyperparameter encoder_decoder_latentDims we used different size configurations: [256], [512], [64,128], [256,512], [256,128], [64,128,256], [128,128,128], [512,256,128], [512,512,512], [1024,512,256], [512,256,128,64]. Here [64,128,256] means the first encoder cell has latent dimension 64, the second 128 and the third 256; since we take the encoder and decoder to be the same size, the decoder follows the same pattern. We tried increasing sizes across layers ([64,128], [256,512], [64,128,256]), decreasing sizes ([512,256,128], [1024,512,256], [512,256,128,64]) and constant sizes ([128,128,128], [512,512,512]). In the parallel coordinates plot we observed that decreasing sizes such as [256,128] give better accuracy than increasing the size in later layers as in [64,128,256]; even keeping the sizes the same in each layer ([128,128,128], [512,512,512]) beats increasing them. We got the best accuracy with the configuration [256,128].
- Some comment on how increasing/decreasing the size of the input embeddings helps/does not help: for the sweep hyperparameter encoder_embedding_size we swept over 32, 64, 128, 256 and 512. We got our best test results with larger sizes such as 512 or 256, whereas small values such as 32 gave poor accuracy. The correlation summary table also shows that encoder_embedding_size is strongly positively correlated with accuracy.
- Some comment on how increasing/decreasing the size of the LSTM hidden state in the encoder/decoder helps/does not help: over the same encoder_decoder_latentDims configurations listed above, larger hidden states in the encoder and decoder gave better results, and the correlation summary table shows a positive correlation, so higher values give better results. As noted above, decreasing sizes across layers such as [256,128] beat increasing ones such as [64,128,256], and [256,128] gave the best accuracy. We also found that LSTM and GRU cells give better results than vanilla RNN cells.
- Some comparison between LSTM, GRU, RNN: GRU and LSTM models converge faster than simple RNN models, and overall GRU converged fastest of the three. This is also clear from the correlation table: LSTM and GRU are positively correlated with accuracy, whereas RNN is negatively correlated. Our best results came from the GRU cell.
- Comments on whether dropout helps or not: dropout is used in the recurrent cells (LSTM, GRU or RNN) to curb overfitting. We swept over dropout values 0, 0.1, 0.3 and 0.5 and found that accuracy increases when dropout is used; our best model was trained with dropout 0.3. The correlation table confirms that dropout is positively correlated with accuracy.
- Some comment on whether beam search helped/did not help: the positive correlation with beam width indicates that a larger beam width improves accuracy, as one would expect: instead of relying purely on the topmost prediction, we keep a set of top-k predictions, where k is the beam width. Large beam widths, however, come with a computational overhead.
- Different number of epochs: we tried epochs of 4, 6, 8, 10, 15, 20, 25, 30, 35, 40 and 50 and found that the number of epochs plays an important role in accuracy. With too few epochs (around 10 or fewer) the model is undertrained and accuracy is poor, while with too many (e.g., 50) the model overfits and accuracy starts decreasing. Values around 15-20 work well for this model; our best model was trained for 25 epochs.
- Some comment on the use of different learning rates: we swept learning_rate over 0.0001, 0.0005, 0.001, 0.002, 0.003 and 0.005. We observed that the larger learning rates in this range give better accuracy; for example, 0.002 gives better accuracy than 0.0005. Our best model was trained with learning rate 0.002, and the summary table shows the learning rate is strongly positively correlated with accuracy.
- Some comments on the use of different batch sizes: accuracy also varies with batch size, so we tried batch sizes 32, 64, 128, 256 and 512. Our best model was trained with a batch size of 32. The correlation table shows batch size matters and is negatively correlated with accuracy, so small batch sizes gave better accuracy.
- Comments on the use of different learning algorithms: we tried three optimizers in the sweep: adam, rmsprop and sgd. rmsprop gave better accuracy than sgd, most of the good runs used rmsprop, and our best model was trained with it.
Question 4 (10 Marks)
You will now apply your best model on the test data (You shouldn't have used test data so far. All the above experiments should have been done using train and val data only).
(a) Use the best model from your sweep and report the accuracy on the test set (the output is correct only if it exactly matches the reference output).
(b) Provide sample inputs from the test data and predictions made by your best model (more marks for presenting this grid creatively). Also upload all the predictions on the test set in a folder predictions_vanilla on your github project.
(c) Comment on the errors made by your model (simple insightful bullet points)
- The model makes more errors on consonants than vowels
- The model makes more errors on longer sequences
- I am thinking confusion matrix but may be it's just me!
- ...
Answer:
The code for this question is https://github.com/Doeschate/CS6910_Assignment3/blob/main/Assignment3_Q4.ipynb
Accuracy
These are the parameters for our best model:
```python
# Best configuration model, with attention OFF.
# If you want to test the model without attention, use this configuration.
latent_dims = [128, 128, 128]
embed_dims = [128, 256]
attention = False
recurrent_layer = 'gru'
dropout = 0.1
batch_size = 64        # Batch size for training.
epochs = 20            # Number of epochs to train for.
learning_rate = 0.002
optimiser = 'rmsprop'
beam_width = 3
wandb_params = False
```
Test accuracy without attention
The evaluation script prints the running exact-match accuracy as it progresses through the test set; the first and last lines of the log are shown below.

```
Test accuracy : 100.0, Completion = 0.0
Test accuracy : 31.683168316831683, Completion = 0.02221235006663705
...
Test accuracy : 46.23417018440347, Completion = 0.9995557529986673
Test accuracy : 46.24611283873834
```
The test accuracy achieved on the test data was 46.24%.
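For clarity, the exact-match metric behind these numbers is simply the following (a restatement of the rule in (a), not code from the notebook):

```python
def exact_match_accuracy(predictions, references):
    """An output counts as correct only if it exactly matches the reference."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)
```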
Sample Outputs:
Sample inputs taken from the test data, together with the predictions made by our best model, are shown below.

Sample test data, predicted hindi and actual hindi
Predictions_Vanilla
All the predictions on the test set are provided in predictions_vanilla.csv file inside predictions_vanilla folder on our github project.
The github link for the file is https://github.com/Doeschate/CS6910_Assignment3/blob/main/predictions_vanilla/predictions_vanilla.csv
Some comment about errors in vowels v/s consonants:
From the generated predictions in predictions_vanilla.csv, it is observed that vowels are more error-prone than consonants. Consonants are predicted correctly most of the time, as there is less variation in the pronunciation of a consonant than of a vowel. Moreover, a long vowel is sometimes mispredicted as a short one: for example, ank is predicted as आंक whereas the ground truth is अंक, so the vowel is transliterated incorrectly.
Some comment about errors on longer sequences v/s shorter sequences:
It is observed that longer sequences are more error-prone than shorter ones, as a longer sequence can be pronounced in more varied ways depending on where the emphasis falls within the word. There can also be confusion between short and long vowel sequences: the model can predict a short vowel in place of a long one, or vice versa. For example, for the word joki the model predicted जोकी in place of जोकि; here the model put too much emphasis on the vowel 'i'.
Some comment about confusion between similar characters:
Sometimes similar characters in English are pronounced differently in Hindi, and these similar characters lead to confusion for any transliteration model. For example, in the word taste our model predicted 't' as 'त' (giving तस्ते), whereas the correct character here is 'ट' (टेस्ट); 't' can sound as either 'त' or 'ट' depending on the context. Similarly, the same English letter d can map to different consonants such as द and ड.
Other insightful observation:
A few other insightful observations:
- el is predicted as इल, but the actual ground truth is ऐल: the letter 'e' is pronounced here as 'ऐ' rather than 'इ'. Because the same letter is used in different ways, the model is confused about which one applies.
- Sometimes different Devanagari characters can represent the same English characters; for example, the model predicted रैस where the ground truth is रई, as in रईस.
- Another important observation is that the model was unable to distinguish the similar-sounding Devanagari characters घ and ग.
- Similar-sounding words like टब and तब are not distinguished well by the model.
- For tarko the model predicted तर्को in place of तर्कों, so the dot (ं) that marks the plural is missed.
- The English letter 'g' can be pronounced as 'ज' or 'ग' depending on the context. In the word drager our model failed to detect the difference and predicted ड्रेगर in place of ड्रेजर.
Question 5 (20 Marks)
Now add an attention network to your basic sequence to sequence model and train the model again. For the sake of simplicity you can use a single layered encoder and a single layered decoder (if you want you can use multiple layers also). Please answer the following questions:
(a) Did you tune the hyperparameters again? If yes please paste appropriate plots below.
(b) Evaluate your best model on the test set and report the accuracy. Also upload all the predictions on the test set in a folder predictions_attention on your github project.
(c) Does the attention based model perform better than the vanilla model? If so, can you check some of the errors that this model corrected and note down your inferences (i.e., outputs which were predicted incorrectly by your best seq2seq model are predicted correctly by this model)
(d) In a 3 x 3 grid paste the attention heatmaps for 10 inputs from your test data (read up on what are attention heatmaps).
Answer:
The code for this question is https://github.com/Doeschate/CS6910_Assignment3/blob/main/Assignment3_Q5.ipynb
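As background, here is a minimal sketch of the attention computation added to the decoder. A Luong-style dot-product variant is shown purely as an illustration; the exact layer used is in the notebook linked above.

```python
import tensorflow as tf

def attention_context(decoder_state, encoder_outputs):
    """decoder_state: (batch, k); encoder_outputs: (batch, T, k)."""
    scores = tf.einsum('bk,btk->bt', decoder_state, encoder_outputs)  # alignment scores
    weights = tf.nn.softmax(scores, axis=-1)                          # distribution over the T inputs
    context = tf.einsum('bt,btk->bk', weights, encoder_outputs)       # weighted encoder summary
    return context, weights  # `weights` is what the attention heatmaps in (d) visualise
```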
- The Validation Accuracy v/s created plot shown below gives the validation accuracy achieved by each experiment with attention, ordered by creation date. Validation accuracy is computed on the dev set in the dakshina_dataset_v1.0/hi/lexicons/ folder; the best run reaches 64.8% (see below). Training is done on the train set, and training accuracy exceeds 90%.
- The parallel coordinates plot, attached just below the Accuracy v/s created plot, shows each hyperparameter configuration against the accuracy it achieved.
- The correlation summary table, plotted below the parallel coordinates plot, summarises the correlation of each hyperparameter with the validation accuracy.
- Validation Accuracy and Training Accuracy are also plotted below for the different experiments.
Accuracy v/s Created plot
Parallel coordinates plot
Correlation summary table and Validation Accuracy, Training Accuracy
Accuracy improvement over the vanilla seq2seq model:
Code for testing the accuracy with attention is https://github.com/Doeschate/CS6910_Assignment3/blob/main/Assignment3_Q5_Test.ipynb

Validation Accuracy with attention 64.8%
Validation accuracy with attention is 64.8%, around 4% higher than the validation accuracy without attention, which is around 60%.
Test accuracy with attention
The running test accuracy with attention is logged the same way as before; the first and last lines are shown below.

```
Test accuracy with Attention: 100.0, Completion = 0.0
Test accuracy with Attention: 40.59405940594059, Completion = 0.02221235006663705
...
Test accuracy with Attention: 52.0328815818707, Completion = 0.9995557529986673
Test accuracy : 52.04353620613061
```
Test accuracy : 52.04%
There is a considerable improvement in test accuracy, from 46.24% without attention to 52.04% with attention, i.e., an improvement of more than 5 percentage points on the test data.
Analysis of the errors corrected by the attention based model compared to the seq2seq:
All the predictions on the test set with attention are provided in the predictions_attention.csv file inside the predictions_attention folder on our github project.
Sample Outputs:
Sample inputs taken from the test data, together with the predictions made by our best model, are shown below.

- In Q4 we showed that the model without attention makes more errors on vowels than consonants; with attention, this vowel error is reduced to some extent, since specific attention is paid to the pronunciation of each character. For example, without attention ank is predicted as आंक, whereas with attention it is correctly predicted as अंक.
- el was previously predicted as इल by the seq2seq model, owing to the different pronunciations of the letter e in different contexts; with attention it is predicted as एल, which is much closer to the ground truth ऐल.
- In the previous model there was some confusion between long and short vowels. For the model with attention this is largely resolved, as in the case of 'e' above.
- The problem the seq2seq model had in predicting similar-sounding characters like ग and घ is also slightly reduced here.
- The English letter 'g' can be pronounced as 'ज' or 'ग' depending on the context. In the word drager our previous model failed to detect the difference and predicted ड्रेगर in place of ड्रेजर, whereas the attention-based model correctly predicted ड्रेजर.
Attention maps plot
The attention heatmap is plotted and attached below:



Interesting patterns on attention maps:
Light-shaded cells correspond to the highest probabilities (the lighter, the higher), while dark regions have low probability, with pitch black being the lowest.
The most interesting pattern is that cells near the diagonal of the heat map are light, with high probabilities, whereas regions away from the diagonal, towards the corners, are pitch black with very low probability. So corresponding English and Devanagari characters are much more likely to contribute to each other than characters that are far apart.
For example, for the character at position i in the Devanagari output, the English characters at positions (i-ε, i+ε) are very likely to contribute the most, whereas characters far from position i hardly contribute at all.
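A hedged sketch of how such heatmaps can be rendered with matplotlib, where weights is the (output_len, input_len) attention matrix for one word (e.g., as returned by an attention layer like the one sketched earlier):

```python
import matplotlib.pyplot as plt

def plot_attention(weights, input_chars, output_chars):
    fig, ax = plt.subplots()
    ax.imshow(weights, cmap='viridis')   # lighter cell = higher attention probability
    ax.set_xticks(range(len(input_chars)))
    ax.set_xticklabels(input_chars)
    ax.set_yticks(range(len(output_chars)))
    ax.set_yticklabels(output_chars)
    ax.set_xlabel('Input (Latin)')
    ax.set_ylabel('Output (Devanagari)')
    plt.show()
```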
Question 6 (20 Marks)
This is a challenge question and most of you will find it hard.
I like the visualisation in the figure captioned "Connectivity" in this article. Make a similar visualisation for your model. Please look at this blog for some starter code. The goal is to figure out the following: when the model is decoding the i-th character in the output, which is the input character that it is looking at?
Have fun!
Answer:
The visualization code for this question https://github.com/Doeschate/CS6910_Assignment3/blob/main/Assignment3-Q6-VisualizationCode.html
The goal of this question is to figure out the following: when the model is decoding the i-th character in the output, which input character is it looking at?
We used attention, so only the letters that influence the pronunciation of a particular Devanagari letter are given high attention. Below we show, for the i-th Devanagari character, which English letters have the most impact. The different shades of blue show different levels of influence, light being low and dark being high.
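In code terms, the character being "looked at" can be read off the attention matrix; an illustrative helper (assuming weights has shape (output_len, input_len)):

```python
import numpy as np

def most_attended(weights, input_chars):
    """For each output position, the input character with the highest attention."""
    return [input_chars[j] for j in np.argmax(weights, axis=1)]
```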
The connectivity visualization for some of the words from test set is given below:
1. ank अंक:
In the English word ank, 'an' contributes to अ; ं is influenced by 'an', with 'n' being the most influential; and 'nk' helps to generate क.

2. umesh उम्मेश:
As in the previous example, the contribution of each English character to its Devanagari counterpart is shown.

3. warriors वारियर्स:
Here an interesting pattern is observed, w helps to generate both व and ा .

4. aastin आस्तिन:
Here, a is contributing to both आ and स .

5. kalika कालिका:
Here ि is generated by a combination of a,l and i.

6. gin गिन:
As one would expect, n contributes the most towards न.

7. ail एल:
This image is self-explanatory, with a mapping to ए.

8. africa अफ्रिका:
As in the previous example, the contribution of each English character to its Devanagari counterpart is shown.

9. orion ओरियन:
As in the previous example, the contribution of each English character to its Devanagari counterpart is shown.

10. taste तस्ते:
As in the previous example, the contribution of each English character to its Devanagari counterpart is shown.

Question 7 (10 Marks)
Paste a link to your github code for Part A
- We will check for coding style, clarity in using functions and a README file with clear instructions on training and evaluating the model (the 10 marks will be based on this).
- We will also run a plagiarism check to ensure that the code is not copied (0 marks in the assignment if we find that the code is plagiarised).
- We will check the number of commits made by the two team members and then give marks accordingly. For example, if we see 70% of the commits were made by one team member then that member will get more marks in the assignment (note that this contribution will decide the marks split for the entire assignment and not just this question).
- We will also check if the training and test splits have been used properly. You will get 0 marks on the assignment if we find any cheating (e.g., adding test data to training data) to get higher accuracy.
Answer:
A README.md file is also provided in the repository, with instructions on training and evaluating the model.
Question 8 (0 Marks)
Note that this question does not carry any marks and will not be graded. This is only for students who are looking for a challenge and want to get something more out of the course.
Your task is to finetune the GPT2 model to generate lyrics for English songs. You can refer to this blog and follow the steps there. This blog shows how to finetune the GPT2 model to generate headlines for financial articles. Instead of headlines you will use lyrics so you may find the following datasets useful for training: dataset1, dataset2
At test time you will give it a prompt: "I love Deep Learning" and it should complete the song based on this prompt :-) Paste the generated song in a block below!
Answer:
The GPT2 model is finetuned to generate lyrics for English songs.
The song generated by the finetuned GPT2 model is as follows:
"I love Deep Learning"
-- By GPT2
I Love Deep Learning, and learning to fly
Dreams Come to you When you're feeling strong
Dreams Come to you When you're weak
Dreams Come to you When you're calm
Dreams Come to you When you're tired
When I take your hand, say "Lead me on, lead me on"
I love you so much, that's why I won't let you go
Dreams Come to you, help me find you somewhere
When you're with me, I'm proud to say hello
Dreams Come to you When you're high on the mountain,
Say "Go on, go on" I love you so, that's why I won't let you go
Dreams Come to you, you are my dream
Dreams Come to you, you are my dream
The github link for the code is https://github.com/Doeschate/CS6910_Assignment3/blob/main/Assignment3_Q8.ipynb
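For reference, here is a minimal sketch of prompting a finetuned GPT2 with Hugging Face transformers; the checkpoint path ./gpt2-lyrics is an assumption, not the actual path used in the notebook.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('./gpt2-lyrics')  # assumed checkpoint path
model = GPT2LMHeadModel.from_pretrained('./gpt2-lyrics')

ids = tokenizer.encode("I love Deep Learning", return_tensors='pt')
out = model.generate(ids, max_length=200, do_sample=True, top_k=50, top_p=0.95,
                     pad_token_id=tokenizer.eos_token_id)  # sampled continuation
print(tokenizer.decode(out[0], skip_special_tokens=True))
```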
Self Declaration
List down the contributions of the two team members:
For example,
CS21S011: (50% contribution)
- Code for data loading
- Calculating number of Computations
- Designing and implementing code for model Encoder
- Code for Beam Decoder
- Designing and implementing code for Decoder Inference
- Ran Sweep for best model without attention
- Generating Attention map and analysing error
- Visualization of our model
CS21S045: (50% contribution)
- Code for data preprocessing
- Calculating number of parameters
- Designing and implementing code for model Decoder
- Code for testing Accuracy
- Designing and implementing code for Encoder Inference
- Ran Sweep for best model with attention
- Generating hindi output for the test data and comparing and analysing errors by our model
- Generating "I love Deep Learning" song
We, Prithaj Banerjee and Kondapalli Jayavardhan, swear on our honour that the above declaration is correct.
Note: Your marks in the assignment will be in proportion to the above declaration. Honesty will be rewarded (Help is always given in CS6910 to those who deserve it!).
This is an opportunity for you to come clean. If one of the team members has not contributed then it should come out clearly from the above declaration. There will be a viva after the submission. If your performance in the viva does not match the declaration above then both the team members will be penalised (50% of the marks earned in the assignment).