CS6910: Assignment 3
Instructions
- The goal of this assignment is threefold: (i) learn how to model sequence-to-sequence learning problems using Recurrent Neural Networks (ii) compare different cells such as vanilla RNN, LSTM and GRU (iii) understand how attention networks overcome the limitations of vanilla seq2seq models
- Discussions with other students are encouraged.
- You must use Python for your implementation.
- You can use any and all packages from PyTorch or PyTorch-Lightning. NO OTHER DL library, such as TensorFlow or Keras, is allowed.
- Please confirm with the TAs before using any new external library. BTW, you may want to explore PyTorch-Lightning as it includes fp16 mixed-precision training, wandb integration, and many other black boxes eliminating the need for boilerplate code. Also, do look out for PyTorch 2.0.
- You can run the code in a Jupyter notebook on Colab/Kaggle by enabling GPUs.
- You have to generate the report in the same format as shown below using wandb.ai. You can start by cloning this report using the clone option above. Most of the plots we have asked for below can be (automatically) generated using the APIs provided by wandb.ai. You will upload a link to this report on Gradescope.
- You must also provide a link to your GitHub code, as shown below. Follow good software engineering practices and set up a GitHub repo for the project on Day 1. Please do not write all code on your local machine and push everything to GitHub on the last day. The commits in GitHub should reflect how the code has evolved during the course of the assignment.
- You have to check Moodle regularly for updates regarding the assignment.
Problem Statement
In this assignment, you will experiment with a sample of the Aksharantar dataset released by AI4Bharat. This dataset contains pairs of the following form:
x, y
ajanabee,अजनबी
i.e., a word in the native script and its corresponding transliteration in the Latin script (how we type while chatting with our friends on WhatsApp etc.). Given many such $(x_i, y_i)_{i=1}^{n}$ pairs, your goal is to train a model $y = \hat{f}(x)$ which takes as input a romanized string (ghar) and produces the corresponding word in Devanagari (घर).
As you would realize, this is the problem of mapping a sequence of characters in one language to a sequence of characters in another. Notice that this is a scaled-down version of the problem of translation where the goal is to translate a sequence of words in one language to a sequence of words in another language (as opposed to a sequence of characters here).
Read this blog to understand how to build neural sequence-to-sequence models.
Question-1
Build an RNN-based seq2seq model which contains the following layers: (i) input layer for character embeddings (ii) one encoder RNN which sequentially encodes the input character sequence (Latin) (iii) one decoder RNN which takes the last state of the encoder as input and produces one output character at a time (Devanagari).
The code should be flexible such that the dimension of the input character embeddings, the hidden states of the encoders and decoders, the cell (RNN, LSTM, GRU), and the number of layers in the encoder and decoder can be changed.
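A minimal sketch of such a configurable encoder-decoder is shown below (class and argument names are hypothetical, not the exact code in the repo; it assumes batch-first tensors and a character-level vocabulary):

```python
import torch.nn as nn

CELLS = {"RNN": nn.RNN, "GRU": nn.GRU, "LSTM": nn.LSTM}

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_size, num_layers, cell="LSTM", dropout=0.0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = CELLS[cell](emb_dim, hidden_size, num_layers,
                               dropout=dropout if num_layers > 1 else 0.0,
                               batch_first=True)

    def forward(self, src):                      # src: (batch, src_len) of character indices
        embedded = self.embedding(src)           # (batch, src_len, emb_dim)
        outputs, hidden = self.rnn(embedded)     # hidden is a tuple (h, c) for LSTM
        return outputs, hidden

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_size, num_layers, cell="LSTM", dropout=0.0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = CELLS[cell](emb_dim, hidden_size, num_layers,
                               dropout=dropout if num_layers > 1 else 0.0,
                               batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, token, hidden):            # token: (batch, 1), one decoding step at a time
        embedded = self.embedding(token)         # (batch, 1, emb_dim)
        output, hidden = self.rnn(embedded, hidden)
        return self.out(output.squeeze(1)), hidden   # logits over target characters, new state
```

The decoder is initialized with the last state of the encoder and called one step at a time during generation.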
(a) What is the total number of computations done by your network? (Assume that the input embedding size is $m$, the encoder and decoder have 1 layer each, the hidden state size is $k$ for both the encoder and decoder, the length of the input and output sequence is the same, i.e., $T$, and the size of the vocabulary is the same for the source and target language, i.e., $V$.)
(b) What is the total number of parameters in your network? (Same assumptions as in part (a): embedding size $m$, 1 encoder and 1 decoder layer, hidden size $k$, sequence length $T$, vocabulary size $V$.)
Answer to part (a)
Matrix & vector dimensions:
- $I_i \Rightarrow (V, 1)$ (one-hot input character)
- $E \Rightarrow (m, V)$ (embedding matrix)
- $U \Rightarrow (k, m)$
- $W \Rightarrow (k, k)$
- $s_{i-1} \Rightarrow (k, 1)$
- $b \Rightarrow (k, 1)$
- $V' \Rightarrow (k, k)$ for the encoder, $V' \Rightarrow (V, k)$ for the decoder
- $c \Rightarrow (k, 1)$ for the encoder, $c \Rightarrow (V, 1)$ for the decoder
- $x_i \Rightarrow (m, 1)$
Formulae:
$x_i = E I_i$
$s_i = \sigma(U x_i + W s_{i-1} + b)$
$y_i = \mathrm{softmax}(V' s_i + c)$
Computations done in the encoder:
- $s_i = \sigma(U x_i + W s_{i-1} + b)$: multiplications = $O(km + k^2)$, additions = $O(2k)$, total = $O(km + k^2 + 2k)$
- $x_i = E I_i$: $O(mV)$
- $\sigma(\cdot)$: $O(k)$

Total computations for the encoder per step = $O(mk + k^2 + 3k + mV)$
Total computations for the encoder over $T$ steps = $O(T(mk + k^2 + 3k + mV))$
Computations done in the decoder:
- $s_i = \sigma(U x_i + W s_{i-1} + b)$: multiplications = $O(km + k^2)$, additions = $O(2k)$, total = $O(km + k^2 + 2k)$, the same as the encoder
- $x_i = E I_i$: $O(mV)$
- $y_i = V' s_i + c$: $O(Vk + V)$
- $\mathrm{softmax}$: $O(V)$

Total computations for the decoder per step = $O(mk + k^2 + mV + Vk + 2V + 2k)$
Total computations for the decoder over $T$ steps = $O(T(mk + k^2 + mV + Vk + 2V + 2k))$
Total computations (encoder + decoder) = $O(T(2mk + 2mV + 2k^2 + Vk + 2V + 5k))$
Answer to part (b)
Matrix & vector dimensions for the encoder and decoder:
Encoder:
- $E_1 \Rightarrow (m, V)$ (embedding matrix)
- $U \Rightarrow (k, m)$
- $W \Rightarrow (k, k)$
- $b \Rightarrow (k, 1)$
- $V' \Rightarrow (k, k)$
- $c \Rightarrow (k, 1)$

Total parameters for the encoder = $O(mV + km + 2k^2 + 2k)$
Decoder:
- $E_2 \Rightarrow (m, V)$
- $U \Rightarrow (k, m)$
- $W \Rightarrow (k, k)$
- $b \Rightarrow (k, 1)$
- $V' \Rightarrow (V, k)$
- $c \Rightarrow (V, 1)$

Total parameters for the decoder = $O(mV + km + k^2 + Vk + V + k)$
Total number of parameters = $O(2mV + 2km + 3k^2 + 3k + Vk + V)$
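The closed-form count can be evaluated numerically as a sanity check; a small sketch with hypothetical sizes is given below. (Note that PyTorch's nn.RNN keeps two bias vectors per layer, so a direct sum over `model.parameters()` of an actual model would come out slightly larger than this figure, and the derivation above also assumes the encoder has its own output projection $V', c$.)

```python
def seq2seq_param_count(m: int, k: int, V: int) -> int:
    """Parameter count from part (b) for a 1-layer encoder and 1-layer decoder."""
    encoder = m * V + k * m + k * k + k + k * k + k   # E1, U, W, b, V', c
    decoder = m * V + k * m + k * k + k + V * k + V   # E2, U, W, b, V', c
    return encoder + decoder                          # = 2mV + 2km + 3k^2 + 3k + Vk + V

# Example with hypothetical sizes: embedding 64, hidden 128, vocabulary of 30 characters.
print(seq2seq_param_count(m=64, k=128, V=30))
```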
Question-2
The hyperparameter configuration explored was:
- Embedding dimension: 64, 128, 256, 512
- Number of encoder & decoder layers: 1, 2, 3, 4
- Hidden layer size: 32, 64, 256, 512
- Cell type: RNN, GRU, LSTM
- Dropout: 0, 0.1, 0.27, 0.4
- Learning rate of optimizer: 0.001, 0.0001, 0.00001
- Epochs: 10, 15, 20
- Batch size: 16, 32, 64
A teacher forcing ratio of 0.5 was used during training.
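A minimal sketch of how the 0.5 teacher forcing ratio enters the training-time decoding loop (variable names are hypothetical; it assumes the Decoder sketched in Question 1 and batch-first target tensors starting with an `<sos>` token):

```python
import random
import torch

def decode_with_teacher_forcing(decoder, hidden, targets, teacher_forcing_ratio=0.5):
    """targets: (batch, tgt_len) of target character indices, first column is <sos>."""
    batch_size, tgt_len = targets.shape
    token = targets[:, 0].unsqueeze(1)                    # start with the <sos> column
    all_logits = []
    for t in range(1, tgt_len):
        logits, hidden = decoder(token, hidden)           # logits: (batch, vocab)
        all_logits.append(logits)
        if random.random() < teacher_forcing_ratio:
            token = targets[:, t].unsqueeze(1)            # feed the ground-truth character
        else:
            token = logits.argmax(dim=1, keepdim=True)    # feed the model's own prediction
    return torch.stack(all_logits, dim=1)                 # (batch, tgt_len - 1, vocab)
```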
For the hyperparameter search, I used a Bayesian sweep with early stopping, in which we can specify a maximum number of runs. The algorithm then executes a predefined number of trials, each with a proposed set of hyperparameters. As the sweep runs, it updates a probabilistic model of the objective function, using the results of each trial to improve its estimates of which hyperparameters are likely to perform well. The early stopping feature allows the search to terminate runs early when a hyperparameter setting does not lead to good results.
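A sweep configuration along these lines could look as follows; this is a hedged sketch rather than the exact config used, and the metric name (`val_accuracy`) and project name are assumptions:

```python
import wandb

sweep_config = {
    "method": "bayes",                                            # Bayesian optimization over the search space
    "metric": {"name": "val_accuracy", "goal": "maximize"},
    "early_terminate": {"type": "hyperband", "min_iter": 3},      # stop unpromising runs early
    "parameters": {
        "embedding_dim": {"values": [64, 128, 256, 512]},
        "num_layers":    {"values": [1, 2, 3, 4]},
        "hidden_size":   {"values": [32, 64, 256, 512]},
        "cell_type":     {"values": ["RNN", "GRU", "LSTM"]},
        "dropout":       {"values": [0, 0.1, 0.27, 0.4]},
        "learning_rate": {"values": [0.001, 0.0001, 0.00001]},
        "epochs":        {"values": [10, 15, 20]},
        "batch_size":    {"values": [16, 32, 64]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="CS6910_Assignment-3")
# wandb.agent(sweep_id, function=train, count=50)   # train() is the training entry point
```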
The following plots were generated from the sweep:
- Accuracy vs. created plot
- Parallel coordinates plot
- Correlation summary table
Question-3
Insightful observations, in the context of the above plots:
- LSTM and GRU cells perform better than vanilla RNN cells in terms of accuracy and loss; LSTM performed the best in this scenario.
- RNN takes much longer to train compared to GRU and LSTM.
- Hidden layer size shows a positive correlation with accuracy: a larger hidden size gives higher accuracy.
- A dropout value of 0.27 seems most appropriate in this case; a value of 0 leads to overfitting. The same value was used for the normal and the recurrent dropout.
- A higher input embedding size helps performance by letting the model represent the input characters more distinctly. Values like 128 and 256 are optimal embedding sizes, and embedding size shows a positive correlation with accuracy.
- A learning rate of 0.001 for the optimizer is the best choice here. When the learning rate is set to 0.00001, the model does not perform well.
- The model performs best with 2-3 encoder/decoder layers. A slight negative correlation is seen between the number of encoder-decoder layers and accuracy.
- 15-20 epochs are sufficient to train the model; training longer leads to overfitting. 15 epochs works best.
Question-4
(a) The best model from the sweep and the accuracy on the test set.
The hyperparameter configuration:
- Hidden layer size: 512
- Number of encoder & decoder layers: 2
- Cell type: LSTM
- Learning rate: 0.0001
- Batch size: 64
- Epochs: 25
- Dropout: 0.27
- Embedding dimension: 256
The exact string match accuracy obtained on the test set was 35.57%.
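Exact string match accuracy here means the fraction of test words whose predicted Devanagari string matches the reference exactly; a minimal sketch (hypothetical function name):

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions that match the reference transliteration exactly."""
    assert len(predictions) == len(references)
    correct = sum(pred == ref for pred, ref in zip(predictions, references))
    return correct / len(references)

# Example: only the first pair matches exactly.
print(exact_match_accuracy(["घर", "अजनबि"], ["घर", "अजनबी"]))   # 0.5
```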
(b) Sample inputs from the test data and predictions made by the best model have been uploaded. All the predictions on the test set have also been uploaded to the GitHub repository.
(c) Errors made by the model
- The model can't distinguish between थरमैक्स and थर्मेेक्स: it thinks the character is a matra, and it seems to make this mistake in other words as well.
- The model seems to make more mistakes where there are more matras.
- The model makes more mistakes on longer sequences, since longer-term dependencies are harder to capture.
- Different sounds in Hindi get mapped to the same letters in English because of the smaller English alphabet, and the model gets confused when it has to predict these characters differently.
- "aa" and "a" sounds are often transliterated the same way, and the model also gets confused in these cases.
- Since one vowel in English can map to multiple sounds in Hindi, the model is more prone to errors on vowels than on consonants.
- The model is fairly accurate on the beginning of a word, and the performance drops towards the end.
Question-5
Add an attention network to your base sequence-to-sequence model and train the model again. The model has 1 encoder and 1 decoder layer; a sketch of the attention decoder is given below.
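A minimal sketch of a concatenation-score attention decoder along these lines (names are hypothetical, and this is not necessarily the exact variant used in the repo, which follows the PyTorch seq2seq tutorial cited in the references):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionDecoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_size, cell="LSTM", dropout=0.1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.attn = nn.Linear(2 * hidden_size, 1)     # score over (decoder state, encoder output) pairs
        rnn_cls = {"RNN": nn.RNN, "GRU": nn.GRU, "LSTM": nn.LSTM}[cell]
        self.rnn = rnn_cls(emb_dim + hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, token, hidden, encoder_outputs):
        # token: (batch, 1); encoder_outputs: (batch, src_len, hidden_size)
        embedded = self.dropout(self.embedding(token))            # (batch, 1, emb_dim)
        # use the top-layer decoder state as the attention query
        query = hidden[0][-1] if isinstance(hidden, tuple) else hidden[-1]
        query = query.unsqueeze(1).expand(-1, encoder_outputs.size(1), -1)
        scores = self.attn(torch.cat([query, encoder_outputs], dim=2))   # (batch, src_len, 1)
        weights = F.softmax(scores, dim=1)                        # attention over source positions
        context = (weights * encoder_outputs).sum(dim=1, keepdim=True)   # (batch, 1, hidden_size)
        output, hidden = self.rnn(torch.cat([embedded, context], dim=2), hidden)
        return self.out(output.squeeze(1)), hidden, weights.squeeze(2)   # logits, state, attention weights
```

The returned attention weights (one vector over input positions per output character) are what get stacked into the heatmaps shown in part (d).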
(a) Hyperparameters were tuned again.
The hyperparameter configuration explored was:
- Embedding dimension: 64, 128, 256, 512
- Hidden layer size: 32, 64, 256, 512
- Cell type: RNN, GRU, LSTM
- Dropout: 0, 0.1, 0.27, 0.4
- Learning rate of optimizer: 0.001, 0.0001, 0.00001
- Epochs: 10, 15, 20
- Batch size: 16, 32, 64
(b) The hyperparameters for the best model:
- Hidden layer size: 512
- Cell type: LSTM
- Learning rate: 0.0001
- Embedding dimension: 128
- Batch size: 64
- Epochs: 25
- Dropout: 0.1
The exact string match accuracy obtained was 40.72%.
(c) Does the attention-based model perform better than the vanilla model? If so, can you check some of the errors that this model corrected and note down your inferences?
Yes, the attention model performs better than the vanilla model.
- The model is now able to differentiate between the 'ra' matra and the full character, as seen in "hstntrneey", and it is also able to differentiate between the 'na' and 'ana' characters, which the model without attention could not do.
- The model now pays attention to the relevant input characters at every timestep rather than looking at the input in one go through the final encoder state, which helps overcome the problem of remembering long inputs.
- In general, the attention model predicts matras and vowels better than the vanilla model. It still has some trouble differentiating between "badi" and "chote" matras, as seen in "chutile" and "bichbchv".
- Higher dropout is not preferred with attention-based models.
- The attention-based model still has difficulty predicting some Hindi consonants, as seen in "ftim", "fode", etc.
(d) In a 3 × 3 grid, paste the attention heatmaps for 10 inputs from your test data (read up on what attention heatmaps are).
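Each heatmap is built from the per-step attention weights returned by the decoder; a minimal matplotlib sketch with hypothetical names, assuming `attention` is an array of shape (output_len, input_len) collected during decoding:

```python
import matplotlib.pyplot as plt

def plot_attention_heatmap(attention, input_chars, output_chars, ax=None):
    """attention: array of shape (len(output_chars), len(input_chars)),
    one row of attention weights per generated output character."""
    ax = ax or plt.gca()
    im = ax.imshow(attention, cmap="viridis", aspect="auto")
    ax.set_xticks(range(len(input_chars)))
    ax.set_xticklabels(input_chars)
    ax.set_yticks(range(len(output_chars)))
    ax.set_yticklabels(output_chars)        # Devanagari labels need a font that supports them
    ax.set_xlabel("Input (Latin) characters")
    ax.set_ylabel("Output (Devanagari) characters")
    return im

# For the grid: fig, axes = plt.subplots(3, 3, figsize=(12, 12)) and one heatmap per axis.
```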
Question 7 (10 Marks)
GitHub Link: https://github.com/shashwat-3004/CS6910_Assignment-3
Self-Declaration
I, Shashwat Patel (Roll no: MM19B053), swear on my honour that I have written the code and the report by myself.
Some references I used:
- https://nbviewer.org/github/ethen8181/machine-learning/blob/master/deep_learning/seq2seq/1_torch_seq2seq_intro.ipynb (how to code an encoder-decoder in PyTorch)
- https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html (how to employ an attention mechanism in a seq2seq model)