
CS6910 - Assignment 3

Use recurrent neural networks to build a transliteration system.

Instructions

  • The goal of this assignment is threefold: (i) learn how to model sequence-to-sequence learning problems using Recurrent Neural Networks (ii) compare different cells such as vanilla RNN, LSTM and GRU (iii) understand how attention networks overcome the limitations of vanilla seq2seq models
  • Discussions with other students are encouraged.
  • You must use Python for your implementation.
  • You can use any and all packages from PyTorch or PyTorch-Lightning. NO OTHER DL library, such as TensorFlow or Keras, is allowed.
  • Please confirm with the TAs before using any new external library. BTW, you may want to explore PyTorch-Lightning, as it includes fp16 mixed-precision training, wandb integration, and many other black boxes that eliminate the need for boilerplate code. Also, do look out for PyTorch 2.0.
  • You can run the code in a jupyter notebook on Colab/Kaggle by enabling GPUs.
  • You have to generate the report in the same format as shown below using wandb.ai. You can start by cloning this report using the clone option above. Most of the plots we have asked for below can be (automatically) generated using the APIs provided by wandb.ai. You will upload a link to this report on Gradescope.
  • You must also provide a link to your GitHub code, as shown below. Follow good software engineering practices and set up a GitHub repo for the project on Day 1. Please do not write all code on your local machine and push everything to GitHub on the last day. The commits in GitHub should reflect how the code has evolved during the course of the assignment.
  • You have to check Moodle regularly for updates regarding the assignment.

Problem Statement

In this assignment, you will experiment with a sample of the Aksharantar dataset released by AI4Bharat. This dataset contains pairs of the following form:

x, y

ajanabee,अजनबी

i.e., a word in the native script and its corresponding transliteration in the Latin script (how we type while chatting with our friends on WhatsApp etc.). Given many such $(x_i, y_i)_{i=1}^{n}$ pairs, your goal is to train a model $y = \hat{f}(x)$ which takes as input a romanized string (ghar) and produces the corresponding word in Devanagari (घर).

As you would realize, this is the problem of mapping a sequence of characters in one language to a sequence of characters in another. Notice that this is a scaled-down version of the problem of translation where the goal is to translate a sequence of words in one language to a sequence of words in another language (as opposed to a sequence of characters here).

Read this blog to understand how to build neural sequence-to-sequence models.

Question 1 (15 Marks)

Build an RNN-based seq2seq model which contains the following layers: (i) an input layer for character embeddings, (ii) one encoder RNN which sequentially encodes the input character sequence (Latin), and (iii) one decoder RNN which takes the last state of the encoder as input and produces one output character at a time (Devanagari).

The code should be flexible such that the dimension of the input character embeddings, the hidden states of the encoders and decoders, the cell (RNN, LSTM, GRU), and the number of layers in the encoder and decoder can be changed.
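For concreteness, here is a minimal sketch of such a configurable pair (class and argument names are my own, and it assumes the encoder and decoder use the same cell type so the hidden state can be passed across directly):

```python
import torch.nn as nn

CELLS = {"rnn": nn.RNN, "lstm": nn.LSTM, "gru": nn.GRU}

class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, cell="gru", num_layers=1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = CELLS[cell](embed_size, hidden_size,
                               num_layers=num_layers, batch_first=True)

    def forward(self, src):                  # src: (batch, T_src) character ids
        embedded = self.embedding(src)       # (batch, T_src, embed_size)
        outputs, state = self.rnn(embedded)  # final state seeds the decoder
        return outputs, state

class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, cell="gru", num_layers=1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = CELLS[cell](embed_size, hidden_size,
                               num_layers=num_layers, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, token, state):         # one step: token is (batch, 1)
        output, state = self.rnn(self.embedding(token), state)
        return self.out(output), state       # logits over target characters
```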

(a) What is the total number of computations done by your network? (Assume that the input embedding size is $m$, the encoder and decoder have 1 layer each, the hidden cell state is $k$ for both the encoder and decoder, the length of the input and output sequence is the same, i.e., $T$, and the size of the vocabulary is the same for the source and target language, i.e., $V$.)

(b) What is the total number of parameters in your network? (Assume that the input embedding size is $m$, the encoder and decoder have 1 layer each, the hidden cell state is $k$ for both the encoder and decoder, the length of the input and output sequence is the same, i.e., $T$, and the size of the vocabulary is the same for the source and target language, i.e., $V$.)

Ans)

a) I'll assume a dual-RNN setup (both the encoder and the decoder are RNNs).

Encoder:

  • The input is a one-hot vector for the corresponding input letter. Thus one input letter would need a matrix multiplication of $(1 \times V)$ times $(V \times m)$. But this doesn't actually require a matrix multiplication, as the corresponding row of the embedding matrix can be accessed with an O(1) lookup. Thus, for an input sequence of length $T$, O(T) time is required.
  • The equations for the RNN are:

    $s_t = \text{activation}(U \cdot x_t + W \cdot s_{t-1} + b)$

    $y_t = \text{activation}(V \cdot s_t + c)$

    Here, multiplying a matrix of size $m \times n$ with a vector of size $n \times 1$ requires $n$ multiplications and $n - 1$ additions for each row.

    1. $U \cdot x_t : (k \times m) \times (m \times 1) = k(2m - 1)$
    2. $W \cdot s_{t-1} : (k \times k) \times (k \times 1) = k(2k - 1)$
    3. $V \cdot s_t : (m \times k) \times (k \times 1) = m(2k - 1)$
    4. Addition of $b$, $c$: $k + m$
    5. Activation functions: $k + m$

    Hence, the total number of calculations over a sequence of length $T$ is

    $(2k^2 + 4mk + m) \cdot T$

Decoder:

  • The embedding layer in the decoder has the same size as the encoder's, since the output sequence length ($T$) and the target vocabulary size ($V$) are the same. Hence the lookup cost stays O(T).
  • The decoder equations are the same as the encoder's, so (assuming the same hidden state size) the recurrence again costs $(2k^2 + 4mk + m) \cdot T$.
  • The dense layer has output size $V$, with equation $y = \text{softmax}(Wx + b)$. One forward pass of this layer costs $V(2m - 1)$ for the matrix-vector product, plus $V$ for the addition of $b$ and $V$ for the softmax, totalling $2mV + V$; over the $T$ decoder steps this amounts to $(2mV + V) \cdot T$ computations.

Total number of computations:

$= O(T) + (2k^2 + 4mk + m) \cdot T + O(T) + (2k^2 + 4mk + m) \cdot T + (2mV + V) \cdot T$

$= (4k^2 + 2m(4k + 1 + V) + V) \cdot T + O(T)$
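As a quick numeric check, plugging illustrative sizes into the formula (these values are hypothetical, not taken from the sweep):

```python
# Hypothetical sizes: embedding m, hidden k, vocabulary V, sequence length T.
m, k, V, T = 64, 256, 65, 20

encoder_ops = (2 * k**2 + 4 * m * k + m) * T  # encoder recurrence, per the count above
decoder_ops = (2 * k**2 + 4 * m * k + m) * T  # identical decoder recurrence
dense_ops = (2 * m * V + V) * T               # output projection + softmax

# Embedding lookups contribute only the O(T) term on top of this.
print(encoder_ops + decoder_ops + dense_ops)  # 8,034,580 for these sizes
```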

b) The number of parameters in the network:

  • Embedding: a matrix of size $V \times m$, hence $Vm$ parameters.

  • RNN layer: matrices $U$, $W$, $V$ and biases $b$, $c$ with sizes:

    1. $U : k \times m$
    2. $W : k \times k$
    3. $V : m \times k$
    4. $b : k \times 1$
    5. $c : m \times 1$

    where $m$ is the size of the embedding layer output and $k$ is the hidden state size. Thus the total number of parameters is

    $km + k^2 + mk + k + m = k^2 + 2km + k + m$

  • Dense layer: the weights are of size:

    1. $W : m \times V$
    2. $b : 1 \times V$

    Hence $Vm + V$ parameters.

Total parameters of the network = encoder parameters + decoder parameters + dense layer

$= Vm + k^2 + 2km + k + m + Vm + k^2 + 2km + k + m + Vm + V$

$= 2(k^2 + 2km + k + m) + 3Vm + V$
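As a sanity check on the formula, here is a small sketch that builds the derived cell explicitly from nn.Linear layers and counts its parameters (note that PyTorch's built-in nn.RNN keeps two bias vectors, bias_ih and bias_hh, so its count would differ by an extra k; the explicit version below matches the derivation, with illustrative sizes):

```python
import torch.nn as nn

V_, m_, k_ = 65, 64, 256   # illustrative vocabulary / embedding / hidden sizes

class HandRNNCell(nn.Module):
    """One step matching the equations above:
    s_t = tanh(U x_t + W s_{t-1} + b),  y_t = tanh(V s_t + c)."""
    def __init__(self, m, k):
        super().__init__()
        self.U = nn.Linear(m, k, bias=True)   # U (k x m) and b (k)
        self.W = nn.Linear(k, k, bias=False)  # W (k x k)
        self.V = nn.Linear(k, m, bias=True)   # V (m x k) and c (m)

cell = HandRNNCell(m_, k_)
rnn_params = sum(p.numel() for p in cell.parameters())
assert rnn_params == k_**2 + 2 * k_ * m_ + k_ + m_  # matches the derivation

embed_params = V_ * m_           # one embedding table each for encoder and decoder
dense_params = m_ * V_ + V_      # output projection W and bias b
total = 2 * rnn_params + 2 * embed_params + dense_params
print(total)                     # = 2(k^2 + 2km + k + m) + 3Vm + V
```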

Question 2 (10 Marks)

You will now train your model using any one language from the Aksharantar dataset (I would suggest picking a language that you can read so that it is easy to analyze the errors). Use the standard train, dev, test set from the folder aksharantar_sampled/hin (replace hin with the language of your choice).

BTW, you should read up on how NLG models operate in inference mode, and how it is different from training. This blog might help you with it.
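In training, the decoder can be fed the ground-truth previous character (teacher forcing); in inference mode it must feed back its own predictions one step at a time. A minimal greedy-decoding sketch, assuming the encoder/decoder interface from the Question 1 sketch and your own SOS/EOS token ids:

```python
import torch

@torch.no_grad()
def greedy_decode(encoder, decoder, src, sos_idx, eos_idx, max_len=30):
    """Autoregressive inference: feed the model's own last prediction back in,
    unlike training, where the ground-truth previous character is available."""
    _, state = encoder(src)                    # src: (1, T_src) character ids
    token = torch.tensor([[sos_idx]])          # start-of-sequence token
    result = []
    for _ in range(max_len):
        logits, state = decoder(token, state)  # one decoder step at a time
        token = logits.argmax(dim=-1)          # greedy: most likely character
        if token.item() == eos_idx:
            break
        result.append(token.item())
    return result                              # predicted character ids
```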

Using the sweep feature in wandb, find the best hyperparameter configuration. Here are some suggestions, but you are free to decide which hyperparameters you want to explore:

  • input embedding size: 16, 32, 64, 256, ...
  • number of encoder layers: 1, 2, 3
  • number of decoder layers: 1, 2, 3
  • hidden layer size: 16, 32, 64, 256, ...
  • cell type: RNN, GRU, LSTM
  • bidirectional: Yes, No
  • dropout: 0.2, 0.3 (btw, where will you add dropout? You should read up a bit on this)
  • beam search in decoder with different beam sizes (a minimal sketch of beam decoding follows this list)
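A minimal sketch of the beam-search option from the list above (it assumes the single-step decoder interface from the earlier sketch and ranks hypotheses by summed log-probability; length normalization is a common refinement left out here):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def beam_decode(decoder, state, sos_idx, eos_idx, beam_size=3, max_len=30):
    """Keep the beam_size best partial hypotheses by summed log-probability."""
    beams = [([sos_idx], 0.0, state, False)]  # (tokens, logp, state, done)
    for _ in range(max_len):
        candidates = []
        for tokens, logp, st, done in beams:
            if done:                          # finished hypotheses carry over
                candidates.append((tokens, logp, st, True))
                continue
            logits, new_st = decoder(torch.tensor([[tokens[-1]]]), st)
            log_probs = F.log_softmax(logits[0, -1], dim=-1)
            top_lp, top_idx = log_probs.topk(beam_size)
            for lp, idx in zip(top_lp.tolist(), top_idx.tolist()):
                candidates.append((tokens + [idx], logp + lp, new_st,
                                   idx == eos_idx))
        # keep only the best beam_size hypotheses
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(done for *_, done in beams):
            break
    return beams[0][0][1:]                    # best hypothesis, minus SOS
```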

Based on your sweep, please paste the following plots, which are automatically generated by wandb:

  • accuracy vs. created plot (I would like to see the number of experiments you ran to get the best configuration)
  • parallel coordinates plot
  • correlation summary table (to see the correlation of each hyperparameter with the loss/accuracy)

Also, write down the hyperparameters and their values that you swept over. Smart strategies to reduce the number of runs while still achieving a high accuracy would be appreciated. Write down any unique strategy that you tried for efficiently searching the hyperparameters.

Ans)

I used the 'bayes' method to pick the next hyperparameter config in the sweep. I also implemented a simple version of early stopping, which halted training as soon as the validation accuracy dropped from one epoch to the next. I used the following strategy for the parameter search.

  • Use a wide range of hyperparameters to test which cell types (RNN, LSTM, GRU) worked best.

Hyperparameter configs:

    decoder_num_layers:
      values: [1, 2, 3]
    decoder_type:
      values: [gru, lstm, rnn]
    embedding_size:
      values: [64, 128]
    encoder_num_layers:
      values: [1, 2, 3]
    encoder_type:
      values: [gru, lstm, rnn]
    epochs:
      values: [10]
    hidden_size:
      values: [128, 256, 384, 512]
    learning_rate:
      values: [0.0001, 0.0006, 0.001, 0.0003]
    optimizer:
      values: [adam, sgd]
    teacher_forcing_ratio:
      values: [0.5, 0.6, 0.7, 0.8, 0.9]
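A minimal sketch of how this sweep and the early stopping described above can be wired together (run_epoch is a placeholder for your own train-plus-validate step; the parameter grid is abbreviated here to keep the sketch short):

```python
import wandb

# Abbreviated bayes sweep config; the full parameter grid is listed above.
sweep_config = {
    "method": "bayes",
    "metric": {"name": "val_accuracy", "goal": "maximize"},
    "parameters": {
        "hidden_size": {"values": [128, 256, 384, 512]},
        "learning_rate": {"values": [0.0001, 0.0003, 0.0006, 0.001]},
        "epochs": {"values": [10]},
    },
}

def train():
    wandb.init()                       # the agent injects one config per run
    cfg = wandb.config
    best_val_acc = 0.0
    for epoch in range(cfg.epochs):
        val_acc = run_epoch(cfg)       # placeholder: your train+validate step
        wandb.log({"val_accuracy": val_acc, "epoch": epoch})
        if val_acc < best_val_acc:     # simple early stopping: halt as soon
            break                      # as validation accuracy drops
        best_val_acc = val_acc

sweep_id = wandb.sweep(sweep_config, project="CS6910_Assignment3")
wandb.agent(sweep_id, function=train, count=30)
```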




[wandb run sets: Initial Test (6 runs), RNN (3 runs), GRUs (7 runs), LSTMs (8 runs)]
Dropout was added in two places:

  1. In the hidden layers of the cells. This didn't result in any improvement (the green sweep has dropout_p = 0, compared to the rest).
  2. In the embedding layer of the cells. This resulted in some improvement (the blue sweep in the second plot has dropout_p = 0.2).
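A sketch of the two placements (dropout on the embedding output is applied manually, while nn.LSTM's own dropout argument only applies between stacked layers):

```python
import torch.nn as nn

class DecoderWithDropout(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size,
                 num_layers=2, dropout_p=0.2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.embed_dropout = nn.Dropout(dropout_p)  # placement 2: embeddings
        self.rnn = nn.LSTM(embed_size, hidden_size, num_layers=num_layers,
                           dropout=dropout_p,       # placement 1: between layers
                           batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, token, state):
        embedded = self.embed_dropout(self.embedding(token))
        output, state = self.rnn(embedded, state)
        return self.out(output), state
```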



[wandb run sets: Cell Dropout (5 runs), Embedding Dropout (3 runs), Final Run (11 runs), Run set (37 runs)]


Question 3 (15 Marks)

Based on the above plots write down some insightful observations. For example,

  • RNN based model takes longer time to converge than GRU or LSTM
  • using smaller sizes for the hidden layer does not give good results
  • dropout leads to better performance

(Note: I don't know if any of the above statements are true. I just wrote some random comments that came to my mind)

Of course, each inference should be backed by appropriate evidence.


  • RNN-based models perform much worse than GRUs and LSTMs, as their sweeps show.
  • Validation accuracy does not depend strongly on the encoder/decoder hidden sizes or layer counts. This suggests a larger model would only overfit; hidden sizes of 256 and 384 seem ideal, while smaller sizes underfit and learn features slowly.
  • Learning rates above 0.001 lead to non-convergence, while rates as low as 0.0001 converge very slowly. Hence 0.0003 and 0.0006 are ideal here.
  • Embedding size is less important than hidden size.
  • Dropout in the embedding layer performed better than dropout in the hidden layers; embedding dropout with probability 0.2 performed best.
  • Validation accuracy has little correlation with the number of layers in the encoder/decoder.
  • The closer the teacher forcing ratio is to 1, the worse the performance due to "exposure bias", but a very low ratio makes learning very slow; values below 0.7 led to poor training speed.


Question 4 (10 Marks)

You will now apply your best model on the test data (You shouldn't have used test data so far. All the above experiments should have been done using train and val data only).

(a) Use the best model from your sweep and report the accuracy on the test set (the output is correct only if it exactly matches the reference output).

(b) Provide sample inputs from the test data and predictions made by your best model (more marks for presenting this grid creatively). Also, upload all the predictions on the test set in a folder predictions_vanilla on your GitHub project.

(c) Comment on the errors made by your model (simple insightful bullet points)

  • The model makes more errors on consonants than vowels
  • The model makes more errors on longer sequences
  • I am thinking confusion matrix, but maybe it's just me!
  • ...

Ans)

The best model on the basis of validation accuracy:

Encoder embedding_size : 64

Encoder hidden_size: 384

Encoder num_layers: 2

Encoder type: lstm

dropout_p : 0

Decoder embedding_size: 64

Decoder hidden_size: 384

Decoder num_layers: 2

Decoder type: lstm

Teacher Forcing Ratio : 0.8

Optimizer : Adam

Learning Rate : 0.0003

For testing, I used the trained model from my sweeps with the above hyperparameters, which achieved a validation accuracy of 0.4456.

I got an accuracy of 0.4019 on the test dataset.
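For reference, the exact-match accuracy used here is simply the fraction of test words whose entire predicted string equals the reference. A minimal sketch, assuming predictions and references are lists of strings:

```python
def exact_match_accuracy(predictions, references):
    """A word counts as correct only if the whole string matches exactly."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)
```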

A set of 10 (input, expected, predicted) results from the test data:

  1. heetler, हिटलर, हेटलर
  2. kaushik, कौशिक, कौशिक
  3. bombay, बॉम्बे, बोमबे
  4. loid, लॉइड, लिड
  5. shikshanvishayak, शिक्षणविषयक, शिक्षणविषयक
  6. purushottam, पुरुषोत्तम, पुरुषोत्त्म
  7. shrivas, श्रीवस, श्रीवास
  8. palale, पळाले, पळले
  9. baabhlichi, बाभळीची, बाभलीची
  10. pitali, पितळी, पिटली

From the above examples it is apparent that the model captures the correct phonetic sound of the words but cannot predict the exact emphasis.

The model is unable to differentiate between the long 'aa' and short 'a' sounds, because in English a single 'a' can have multiple pronunciations. This can be seen in the 8th example, where the first 'a' is not emphasized while the second one is. The same issue occurs with other vowels.

Also, in words with half letters (like म् in बॉम्बे), the model is not able to differentiate between the full letter and the half one.

The Marathi letter ळ is often confused with ल, as seen in the last example, since both have L as their English counterpart.



Question 5 (20 Marks)

Add an attention network to your base sequence-to-sequence model and train the model again. For the sake of simplicity, you can use a single-layered encoder and a single-layered decoder (if you want, you can use multiple layers also). Please answer the following questions:

(a) Did you tune the hyperparameters again? If yes, please paste the appropriate plots below.

(b) Evaluate your best model on the test set and report the accuracy. Also, upload all the predictions on the test set in a folder predictions_attention on your GitHub project.

(c) Does the attention-based model perform better than the vanilla model? If so, can you check some of the errors that this model corrected and note down your inferences (i.e., outputs that were predicted incorrectly by your best seq2seq model are predicted correctly by this model)

(d) In a $3 \times 3$ grid, paste the attention heatmaps for 10 inputs from your test data (read up on what attention heatmaps are).

Ans)

I ran the attention network and tested different teacher forcing values (0.6, 0.7, 0.8) on the best model. The best result was for 0.7.
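For reference, a minimal sketch of the attention step added to the decoder; I use a Bahdanau-style additive score here as one common choice (the assignment does not fix a particular variant), and the returned weights are what the heatmaps below visualize:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """score(s, h_j) = v^T tanh(W_s s + W_h h_j); weights = softmax over j."""
    def __init__(self, hidden_size):
        super().__init__()
        self.W_s = nn.Linear(hidden_size, hidden_size, bias=False)
        self.W_h = nn.Linear(hidden_size, hidden_size, bias=False)
        self.v = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, dec_state, enc_outputs):
        # dec_state: (batch, hidden); enc_outputs: (batch, T_src, hidden)
        scores = self.v(torch.tanh(
            self.W_s(dec_state).unsqueeze(1) + self.W_h(enc_outputs)))
        weights = F.softmax(scores.squeeze(-1), dim=-1)          # (batch, T_src)
        context = torch.bmm(weights.unsqueeze(1), enc_outputs)   # weighted sum
        return context.squeeze(1), weights  # weights also give the heatmaps
```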




[Run set: 3 runs]


The test accuracy on this model is 0.3757. Sadly, using attention didn't improve performance in my testing.

  1. heetler, हिटलर, हीटलर
  2. kaushik, कौशिक, कौशिक
  3. bombay, बॉम्बे, बॉम्बे
  4. loid, लॉइड, लोइड
  5. shikshanvishayak, शिक्षणविषयक, शिक्षणविषयक
  6. purushottam, पुरुषोत्तम, पुरुषोत्तम
  7. shrivas, श्रीवस, श्रीवास
  8. palale, पळाले, पळले
  9. baabhlichi, बाभळीची, बाभलीची
  10. pitali, पितळी, पिटळी

The model got worse at differentiating between the long 'aa' and short 'a' sounds, but it fixed some other vowel-related problems.

It also fixed the half-letter case (like म् in बॉम्बे).

Heatmap Results:
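The heatmaps can be produced directly from the attention weights returned by the decoder. A minimal matplotlib sketch, assuming weights is a (T_out, T_in) array collected during decoding (a Devanagari-capable font may be needed for the y-axis labels):

```python
import matplotlib.pyplot as plt

def plot_attention(ax, weights, src_chars, pred_chars):
    """One heatmap: rows are output (Devanagari) steps, columns are input chars."""
    ax.imshow(weights, cmap="viridis")
    ax.set_xticks(range(len(src_chars)))
    ax.set_xticklabels(src_chars)
    ax.set_yticks(range(len(pred_chars)))
    ax.set_yticklabels(pred_chars)

# The 3x3 grid asked for in (d); fill each axis from one test example.
fig, axes = plt.subplots(3, 3, figsize=(12, 12))
```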




Question 7 (10 Marks)

Paste a link to your GitHub repository.

Example: https://github.com/Aytien/CS6910_Assignment3;

  • We will check for coding style, clarity in using functions, and a README file with clear instructions on training and evaluating the model (the 10 marks will be based on this).

  • We will also run a plagiarism check to ensure that the code is not copied (0 marks in the assignment if we find that the code is plagiarised).

  • We will also check if the training and test splits have been used properly. You will get 0 marks on the assignment if we find any cheating (e.g., adding test data to training data) to get higher accuracy.

Self-Declaration

I, Aditya Mahesh Patil (Roll no: CS20B004), swear on my honour that I have written the code and the report by myself and have not copied it from the internet or other students. I have used the blog linked in the report as a reference for writing my code.