CS6910: Assignment 3
Instructions
- The goal of this assignment is threefold: (i) learn how to model sequence-to-sequence learning problems using Recurrent Neural Networks (ii) compare different cells such as vanilla RNN, LSTM and GRU (iii) understand how attention networks overcome the limitations of vanilla seq2seq models
- Discussions with other students are encouraged.
- You must use Python for your implementation.
- You can use any and all packages from PyTorch or PyTorch-Lightning. NO OTHER DL library, such as TensorFlow or Keras, is allowed.
- Please confirm with the TAs before using any new external library. BTW, you may want to explore PyTorch-Lightning as it includes fp16 mixed-precision training, wandb integration, and many other black boxes eliminating the need for boilerplate code. Also, do look out for PyTorch 2.0.
- You can run the code in a Jupyter notebook on Colab/Kaggle by enabling GPUs.
- You have to generate the report in the same format as shown below using wandb.ai. You can start by cloning this report using the clone option above. Most of the plots we have asked for below can be (automatically) generated using the APIs provided by wandb.ai. You will upload a link to this report on Gradescope.
- You must also provide a link to your GitHub code, as shown below. Follow good software engineering practices and set up a GitHub repo for the project on Day 1. Please do not write all code on your local machine and push everything to GitHub on the last day. The commits in GitHub should reflect how the code has evolved during the course of the assignment.
- You have to check Moodle regularly for updates regarding the assignment.
Problem Statement
In this assignment, you will experiment with a sample of the Aksharantar dataset released by AI4Bharat. This dataset contains pairs of the following form:
x, y
ajanabee,अजनबी
i.e., a word in the native script and its corresponding transliteration in the Latin script (how we type while chatting with our friends on WhatsApp etc.). Given many such $(x_i, y_i)_{i=1}^{n}$ pairs, your goal is to train a model $y = \hat{f}(x)$ which takes as input a romanized string (ghar) and produces the corresponding word in Devanagari (घर).
As you would realize, this is the problem of mapping a sequence of characters in one language to a sequence of characters in another. Notice that this is a scaled-down version of the problem of translation where the goal is to translate a sequence of words in one language to a sequence of words in another language (as opposed to a sequence of characters here).
Read this blog to understand how to build neural sequence-to-sequence models.
Question-1
Build an RNN-based seq2seq model which contains the following layers: (i) input layer for character embeddings (ii) one encoder RNN which sequentially encodes the input character sequence (Latin) (iii) one decoder RNN which takes the last state of the encoder as input and produces one output character at a time (Devanagari).
The code should be flexible such that the dimension of the input character embeddings, the hidden states of the encoders and decoders, the cell (RNN, LSTM, GRU), and the number of layers in the encoder and decoder can be changed.
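A minimal sketch of such a configurable encoder-decoder is shown below (class and argument names are hypothetical, not the exact code in the repo; it assumes batch-first tensors and a character-level vocabulary):

```python
import torch.nn as nn

CELLS = {"RNN": nn.RNN, "GRU": nn.GRU, "LSTM": nn.LSTM}

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_size, num_layers, cell="LSTM", dropout=0.0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = CELLS[cell](emb_dim, hidden_size, num_layers,
                               dropout=dropout if num_layers > 1 else 0.0,
                               batch_first=True)

    def forward(self, src):                      # src: (batch, src_len) of character indices
        embedded = self.embedding(src)           # (batch, src_len, emb_dim)
        outputs, hidden = self.rnn(embedded)     # hidden is a tuple (h, c) for LSTM
        return outputs, hidden

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_size, num_layers, cell="LSTM", dropout=0.0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = CELLS[cell](emb_dim, hidden_size, num_layers,
                               dropout=dropout if num_layers > 1 else 0.0,
                               batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, token, hidden):            # token: (batch, 1), one decoding step at a time
        embedded = self.embedding(token)         # (batch, 1, emb_dim)
        output, hidden = self.rnn(embedded, hidden)
        return self.out(output.squeeze(1)), hidden   # logits over target characters, new state
```

The decoder is initialized with the last state of the encoder and called one step at a time during generation.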
(a) What is the total number of computations done by your network? (Assume that the input embedding size is $m$, the encoder and decoder have 1 layer each, the hidden state size is $k$ for both the encoder and decoder, the length of the input and output sequence is the same, i.e., $T$, and the size of the vocabulary is the same for the source and target language, i.e., $V$.)
(b) What is the total number of parameters in your network? (Same assumptions as in part (a): embedding size $m$, 1 encoder and 1 decoder layer, hidden size $k$, sequence length $T$, vocabulary size $V$.)
Answer to part (a)
Matrix & vector dimensions:
- $I_i \Rightarrow (V, 1)$ (one-hot input character)
- $E \Rightarrow (m, V)$ (embedding matrix)
- $U \Rightarrow (k, m)$
- $W \Rightarrow (k, k)$
- $s_{i-1} \Rightarrow (k, 1)$
- $b \Rightarrow (k, 1)$
- $V' \Rightarrow (k, k)$ for the encoder, $V' \Rightarrow (V, k)$ for the decoder
- $c \Rightarrow (k, 1)$ for the encoder, $c \Rightarrow (V, 1)$ for the decoder
- $x_i \Rightarrow (m, 1)$
Formulae:
$x_i = E I_i$
$s_i = \sigma(U x_i + W s_{i-1} + b)$
$y_i = \mathrm{softmax}(V' s_i + c)$
Computations done in the encoder:
- $s_i = \sigma(U x_i + W s_{i-1} + b)$: multiplications = $O(km + k^2)$, additions = $O(2k)$, total = $O(km + k^2 + 2k)$
- $x_i = E I_i$: $O(mV)$
- $\sigma(\cdot)$: $O(k)$

Total computations for the encoder per step = $O(mk + k^2 + 3k + mV)$
Total computations for the encoder over $T$ steps = $O(T(mk + k^2 + 3k + mV))$
Computations done in the decoder:
- $s_i = \sigma(U x_i + W s_{i-1} + b)$: multiplications = $O(km + k^2)$, additions = $O(2k)$, total = $O(km + k^2 + 2k)$, the same as the encoder
- $x_i = E I_i$: $O(mV)$
- $y_i = V' s_i + c$: $O(Vk + V)$
- $\mathrm{softmax}$: $O(V)$

Total computations for the decoder per step = $O(mk + k^2 + mV + Vk + 2V + 2k)$
Total computations for the decoder over $T$ steps = $O(T(mk + k^2 + mV + Vk + 2V + 2k))$
Total computations (encoder + decoder) = $O(T(2mk + 2mV + 2k^2 + Vk + 2V + 5k))$
Answer to part (b)
Matrix & vector dimensions for the encoder and decoder:
Encoder:
- $E_1 \Rightarrow (m, V)$ (embedding matrix)
- $U \Rightarrow (k, m)$
- $W \Rightarrow (k, k)$
- $b \Rightarrow (k, 1)$
- $V' \Rightarrow (k, k)$
- $c \Rightarrow (k, 1)$

Total parameters for the encoder = $O(mV + km + 2k^2 + 2k)$
Decoder:
- $E_2 \Rightarrow (m, V)$
- $U \Rightarrow (k, m)$
- $W \Rightarrow (k, k)$
- $b \Rightarrow (k, 1)$
- $V' \Rightarrow (V, k)$
- $c \Rightarrow (V, 1)$

Total parameters for the decoder = $O(mV + km + k^2 + Vk + V + k)$
Total number of parameters = $O(2mV + 2km + 3k^2 + 3k + Vk + V)$
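The closed-form count can be evaluated numerically as a sanity check; a small sketch with hypothetical sizes is given below. (Note that PyTorch's nn.RNN keeps two bias vectors per layer, so a direct sum over `model.parameters()` of an actual model would come out slightly larger than this figure, and the derivation above also assumes the encoder has its own output projection $V', c$.)

```python
def seq2seq_param_count(m: int, k: int, V: int) -> int:
    """Parameter count from part (b) for a 1-layer encoder and 1-layer decoder."""
    encoder = m * V + k * m + k * k + k + k * k + k   # E1, U, W, b, V', c
    decoder = m * V + k * m + k * k + k + V * k + V   # E2, U, W, b, V', c
    return encoder + decoder                          # = 2mV + 2km + 3k^2 + 3k + Vk + V

# Example with hypothetical sizes: embedding 64, hidden 128, vocabulary of 30 characters.
print(seq2seq_param_count(m=64, k=128, V=30))
```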
Question-2
The hyperparameter configuration explored was:
- Embedding dimension: 64, 128, 256, 512
- Number of encoder & decoder layers: 1, 2, 3, 4
- Hidden layer size: 32, 64, 256, 512
- Cell type: RNN, GRU, LSTM
- Dropout: 0, 0.1, 0.27, 0.4
- Learning rate of optimizer: 0.001, 0.0001, 0.00001
- Epochs: 10, 15, 20
- Batch size: 16, 32, 64
A teacher forcing ratio of 0.5 was used during training.
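A minimal sketch of how the 0.5 teacher forcing ratio enters the training-time decoding loop (variable names are hypothetical; it assumes the Decoder sketched in Question 1 and batch-first target tensors starting with an `<sos>` token):

```python
import random
import torch

def decode_with_teacher_forcing(decoder, hidden, targets, teacher_forcing_ratio=0.5):
    """targets: (batch, tgt_len) of target character indices, first column is <sos>."""
    batch_size, tgt_len = targets.shape
    token = targets[:, 0].unsqueeze(1)                    # start with the <sos> column
    all_logits = []
    for t in range(1, tgt_len):
        logits, hidden = decoder(token, hidden)           # logits: (batch, vocab)
        all_logits.append(logits)
        if random.random() < teacher_forcing_ratio:
            token = targets[:, t].unsqueeze(1)            # feed the ground-truth character
        else:
            token = logits.argmax(dim=1, keepdim=True)    # feed the model's own prediction
    return torch.stack(all_logits, dim=1)                 # (batch, tgt_len - 1, vocab)
```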
For the hyperparameter search, I used a Bayesian sweep with early stopping, in which we can specify a maximum number of runs. The algorithm then executes a predefined number of trials, each with a proposed set of hyperparameters. As the sweep runs, it updates a probabilistic model of the objective function, using the results of each trial to improve its estimates of which hyperparameters are likely to perform well. The early stopping feature allows the search to terminate runs early when a hyperparameter setting does not lead to good results.
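A sweep configuration along these lines could look as follows; this is a hedged sketch rather than the exact config used, and the metric name (`val_accuracy`) and project name are assumptions:

```python
import wandb

sweep_config = {
    "method": "bayes",                                            # Bayesian optimization over the search space
    "metric": {"name": "val_accuracy", "goal": "maximize"},
    "early_terminate": {"type": "hyperband", "min_iter": 3},      # stop unpromising runs early
    "parameters": {
        "embedding_dim": {"values": [64, 128, 256, 512]},
        "num_layers":    {"values": [1, 2, 3, 4]},
        "hidden_size":   {"values": [32, 64, 256, 512]},
        "cell_type":     {"values": ["RNN", "GRU", "LSTM"]},
        "dropout":       {"values": [0, 0.1, 0.27, 0.4]},
        "learning_rate": {"values": [0.001, 0.0001, 0.00001]},
        "epochs":        {"values": [10, 15, 20]},
        "batch_size":    {"values": [16, 32, 64]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="CS6910_Assignment-3")
# wandb.agent(sweep_id, function=train, count=50)   # train() is the training entry point
```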
The following plots were generated from the sweep:
- Accuracy vs. created plot
- Parallel coordinates plot
- Correlation summary table
Question-3
Insightful observations, in the context of the above plots:
- LSTM and GRU cells perform better than vanilla RNN cells in terms of accuracy and loss; LSTM performed the best in this scenario.
- RNN takes much longer to train compared to GRU and LSTM.
- Hidden layer size shows a positive correlation with accuracy: a larger hidden size gives higher accuracy.
- A dropout value of 0.27 seems most appropriate in this case; a value of 0 leads to overfitting. The same value was used for the normal and the recurrent dropout.
- A higher input embedding size helps performance by letting the model represent the input characters more distinctly. Values like 128 and 256 are optimal embedding sizes, and embedding size shows a positive correlation with accuracy.
- A learning rate of 0.001 for the optimizer is the best choice here. When the learning rate is set to 0.00001, the model does not perform well.
- The model performs best with 2-3 encoder/decoder layers. A slight negative correlation is seen between the number of encoder-decoder layers and accuracy.
- 15-20 epochs are sufficient to train the model; training longer leads to overfitting. 15 epochs works best.
Question-4
(a) The best model from the sweep and the accuracy on the test set.
The hyperparameter configuration:
- Hidden layer size: 512
- Number of encoder & decoder layers: 2
- Cell type: LSTM
- Learning rate: 0.0001
- Batch size: 64
- Epochs: 25
- Dropout: 0.27
- Embedding dimension: 256
The exact string match accuracy obtained on the test set was 35.57%.
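Exact string match accuracy here means the fraction of test words whose predicted Devanagari string matches the reference exactly; a minimal sketch (hypothetical function name):

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions that match the reference transliteration exactly."""
    assert len(predictions) == len(references)
    correct = sum(pred == ref for pred, ref in zip(predictions, references))
    return correct / len(references)

# Example: only the first pair matches exactly.
print(exact_match_accuracy(["घर", "अजनबि"], ["घर", "अजनबी"]))   # 0.5
```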
(b) Sample inputs from the test data and predictions made by the best model have been uploaded. All the predictions on the test set have also been uploaded to the GitHub repository.
(c) Errors made by the model
- The model can't distinguish between थरमैक्स and थर्मेेक्स: it thinks the character is a matra, and it seems to make this mistake in other words as well.
- The model seems to make more mistakes where there are more matras.
- The model makes more mistakes on longer sequences, since longer-term dependencies are harder to capture.
- Different sounds in Hindi get mapped to the same letters in English because of the smaller English alphabet, and the model gets confused when it has to predict these characters differently.
- "aa" and "a" sounds are often transliterated the same way, and the model also gets confused in these cases.
- Since one vowel in English can map to multiple sounds in Hindi, the model is more prone to errors on vowels than on consonants.
- The model is fairly accurate on the beginning of a word, and the performance drops towards the end.
Question-5
Add an attention network to your base sequence-to-sequence model and train the model again. The model has 1 encoder and 1 decoder layer; a sketch of the attention decoder is given below.
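A minimal sketch of a concatenation-score attention decoder along these lines (names are hypothetical, and this is not necessarily the exact variant used in the repo, which follows the PyTorch seq2seq tutorial cited in the references):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionDecoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_size, cell="LSTM", dropout=0.1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.attn = nn.Linear(2 * hidden_size, 1)     # score over (decoder state, encoder output) pairs
        rnn_cls = {"RNN": nn.RNN, "GRU": nn.GRU, "LSTM": nn.LSTM}[cell]
        self.rnn = rnn_cls(emb_dim + hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, token, hidden, encoder_outputs):
        # token: (batch, 1); encoder_outputs: (batch, src_len, hidden_size)
        embedded = self.dropout(self.embedding(token))            # (batch, 1, emb_dim)
        # use the top-layer decoder state as the attention query
        query = hidden[0][-1] if isinstance(hidden, tuple) else hidden[-1]
        query = query.unsqueeze(1).expand(-1, encoder_outputs.size(1), -1)
        scores = self.attn(torch.cat([query, encoder_outputs], dim=2))   # (batch, src_len, 1)
        weights = F.softmax(scores, dim=1)                        # attention over source positions
        context = (weights * encoder_outputs).sum(dim=1, keepdim=True)   # (batch, 1, hidden_size)
        output, hidden = self.rnn(torch.cat([embedded, context], dim=2), hidden)
        return self.out(output.squeeze(1)), hidden, weights.squeeze(2)   # logits, state, attention weights
```

The returned attention weights (one vector over input positions per output character) are what get stacked into the heatmaps shown in part (d).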
(a) Hyperparameters were tuned again.
The hyperparameter configuration explored was:
- Embedding dimension: 64, 128, 256, 512
- Hidden layer size: 32, 64, 256, 512
- Cell type: RNN, GRU, LSTM
- Dropout: 0, 0.1, 0.27, 0.4
- Learning rate of optimizer: 0.001, 0.0001, 0.00001
- Epochs: 10, 15, 20
- Batch size: 16, 32, 64
(b) The hyperparameters for the best model:
- Hidden layer size: 512
- Cell type: LSTM
- Learning rate: 0.0001
- Embedding dimension: 128
- Batch size: 64
- Epochs: 25
- Dropout: 0.1
The exact string match accuracy obtained was 40.72%.
(c) Does the attention-based model perform better than the vanilla model? If so, can you check some of the errors that this model corrected and note down your inferences?
Yes, the attention model performs better than the vanilla model.
- The model is now able to differentiate between the 'ra' matra and the full character, as seen in "hstntrneey", and it is also able to differentiate between the 'na' and 'ana' characters, which the model without attention could not do.
- The model now pays attention to the relevant input characters at every timestep rather than looking at the input in one go through the final encoder state, which helps overcome the problem of remembering long inputs.
- In general, the attention model predicts matras and vowels better than the vanilla model. It still has some trouble differentiating between "badi" and "chote" matras, as seen in "chutile" and "bichbchv".
- Higher dropout is not preferred with attention-based models.
- The attention-based model still has difficulty predicting some Hindi consonants, as seen in "ftim", "fode", etc.
(d) In a 3 × 3 grid, paste the attention heatmaps for 10 inputs from your test data (read up on what attention heatmaps are).
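Each heatmap is built from the per-step attention weights returned by the decoder; a minimal matplotlib sketch with hypothetical names, assuming `attention` is an array of shape (output_len, input_len) collected during decoding:

```python
import matplotlib.pyplot as plt

def plot_attention_heatmap(attention, input_chars, output_chars, ax=None):
    """attention: array of shape (len(output_chars), len(input_chars)),
    one row of attention weights per generated output character."""
    ax = ax or plt.gca()
    im = ax.imshow(attention, cmap="viridis", aspect="auto")
    ax.set_xticks(range(len(input_chars)))
    ax.set_xticklabels(input_chars)
    ax.set_yticks(range(len(output_chars)))
    ax.set_yticklabels(output_chars)        # Devanagari labels need a font that supports them
    ax.set_xlabel("Input (Latin) characters")
    ax.set_ylabel("Output (Devanagari) characters")
    return im

# For the grid: fig, axes = plt.subplots(3, 3, figsize=(12, 12)) and one heatmap per axis.
```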
Question 7 (10 Marks)
GitHub Link: https://github.com/shashwat-3004/CS6910_Assignment-3
Self-Declaration
I, Shashwat Patel (Roll no: MM19B053), swear on my honour that I have written the code and the report by myself.
Some references I used:
- https://nbviewer.org/github/ethen8181/machine-learning/blob/master/deep_learning/seq2seq/1_torch_seq2seq_intro.ipynb (how to code an encoder-decoder in PyTorch)
- https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html (how to employ an attention mechanism in a seq2seq model)