CS6910 - Assignment 3
Problem Statement
In this assignment, we experiment with a sample of the Aksharantar dataset released by AI4Bharat. This dataset contains pairs of the following form:
x, y
ajanabee,अजनबी
i.e., a word in the native script and its corresponding transliteration in the Latin script (how we type while chatting with our friends on WhatsApp etc.). Given many such $(x_i, y_i)_{i=1}^{n}$ pairs, our goal is to train a model $y = \hat{f}(x)$ which takes as input a romanized string (ghar) and produces the corresponding word in Devanagari (घर).
This is the problem of mapping a sequence of characters in one language to a sequence of characters in another. Notice that this is a scaled-down version of the problem of translation where the goal is to translate a sequence of words in one language to a sequence of words in another language (as opposed to a sequence of characters here).
Question 1 (15 Marks)
An RNN-based seq2seq model has been built, consisting of the following layers: (i) an input layer for character embeddings, (ii) an encoder RNN which sequentially encodes the input character sequence (Latin), and (iii) a decoder RNN which takes the last state of the encoder as input and produces one output character at a time (Devanagari).
The code is flexible such that the dimension of the input character embeddings, the hidden states of the encoders and decoders, the cell (RNN, LSTM, GRU), and the number of layers in the encoder and decoder can be changed.
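For concreteness, a minimal PyTorch sketch of such a configurable encoder and decoder is shown below. This is an illustrative skeleton under assumed names (Encoder, Decoder, cell_type, etc.), not the exact code in the repository.

```python
import torch
import torch.nn as nn

_CELLS = {"RNN": nn.RNN, "GRU": nn.GRU, "LSTM": nn.LSTM}

class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers,
                 cell_type="LSTM", dropout=0.0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = _CELLS[cell_type](embed_size, hidden_size, num_layers,
                                     dropout=dropout, batch_first=True)

    def forward(self, src):                  # src: (batch, T) of character ids
        embedded = self.embedding(src)       # (batch, T, embed_size)
        outputs, state = self.rnn(embedded)  # state is h_n (or (h_n, c_n) for LSTM)
        return outputs, state

class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers,
                 cell_type="LSTM", dropout=0.0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = _CELLS[cell_type](embed_size, hidden_size, num_layers,
                                     dropout=dropout, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, prev_char, state):     # prev_char: (batch, 1), state from the encoder
        embedded = self.embedding(prev_char)
        output, state = self.rnn(embedded, state)
        return self.out(output), state       # logits over the target vocabulary
```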
(a) What is the total number of computations done by the network? (assume that the input embedding size is $m$, the encoder and decoder have 1 layer each, the hidden cell state size is $k$ for both the encoder and decoder, the length of the input and output sequence is the same, i.e., $T$, and the size of the vocabulary is the same for the source and target language, i.e., $V$)
Let the equations corresponding to the computations performed by the network be as follows:
Encoder:
$x_{enc,i} = L_{enc} e_{enc,i}$
$s_{enc,i} = \sigma(U_{enc} x_{enc,i} + W_{enc} s_{enc,i-1} + b_{enc})$
Decoder:
$x_{dec,i} = L_{dec} e_{dec,i}$
$s_{dec,i} = \sigma(U_{dec} x_{dec,i} + W_{dec} s_{dec,i-1} + b_{dec})$
$y_{dec,i} = \text{softmax}(V_{dec} s_{dec,i} + c_{dec})$
In the above equations, $e_{enc,i}$ and $e_{dec,i}$ refer to the one-hot encodings of the characters in the source language and the target language respectively. $L$ refers to the lookup table employed by nn.Embedding to convert these one-hot encodings into the required embeddings. We also note that $i \in \{1, 2, ..., T\}$ for the encoder as well as the decoder, with $s_{enc,0}$ initialized to the zero vector and $s_{dec,0}$ set to the final encoder state $s_{enc,T}$.
The sizes of the tensors used in the above computations are as follows:
$e_{enc,i}, e_{dec,i}$: $V \times 1$
$L_{enc}, L_{dec}$: $m \times V$
$x_{enc,i}, x_{dec,i}$: $m \times 1$
$U_{enc}, U_{dec}$: $k \times m$
$s_{enc,i}, s_{dec,i}$: $k \times 1$
$W_{enc}, W_{dec}$: $k \times k$
$b_{enc}, b_{dec}$: $k \times 1$
$V_{dec}$: $V \times k$
$c_{dec}$: $V \times 1$
$y_{dec,i}$: $V \times 1$
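To make these shapes concrete, the short sketch below carries out one recurrent step and the output projection with randomly initialized tensors of the sizes listed above (the values of $m$, $k$ and $V$ are placeholders, and the weights are random rather than trained):

```python
import torch

m, k, V = 64, 128, 32                      # embedding size, hidden size, vocabulary size (illustrative)

L_enc = torch.randn(m, V); U_enc = torch.randn(k, m)
W_enc = torch.randn(k, k); b_enc = torch.randn(k, 1)
V_dec = torch.randn(V, k); c_dec = torch.randn(V, 1)

e_i = torch.zeros(V, 1); e_i[5] = 1.0      # one-hot encoding of a character
s_prev = torch.zeros(k, 1)                 # previous hidden state

x_i = L_enc @ e_i                                            # embedding lookup, (m, 1)
s_i = torch.sigmoid(U_enc @ x_i + W_enc @ s_prev + b_enc)    # recurrent step, (k, 1)
y_i = torch.softmax(V_dec @ s_i + c_dec, dim=0)              # output distribution, (V, 1)

assert x_i.shape == (m, 1) and s_i.shape == (k, 1) and y_i.shape == (V, 1)
```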
We will assume that the number of computations required to multiply two matrices of size $p \times q$ and $q \times r$ is $O(pqr)$.
Now, for one time step in the encoder, we have:
$x_{enc,i} = L_{enc} e_{enc,i} \implies O(mV)$ computations for the matrix multiplication
$s_{enc,i} = \sigma(U_{enc} x_{enc,i} + W_{enc} s_{enc,i-1} + b_{enc}) \implies O(mk + k^2)$ computations for the matrix multiplications, $O(2k)$ computations for the additions and $O(k)$ computations for the sigmoid function (assuming that calculating the sigmoid of a real number counts as a single computation)
Hence, a total of $O(mV + mk + k^2 + 3k)$ computations are performed by the encoder during each time step.
Similarly, for one time step in the decoder, we have:
$x_{dec,i} = L_{dec} e_{dec,i} \implies O(mV)$ computations for the matrix multiplication
$s_{dec,i} = \sigma(U_{dec} x_{dec,i} + W_{dec} s_{dec,i-1} + b_{dec}) \implies O(mk + k^2)$ computations for the matrix multiplications, $O(2k)$ computations for the additions and $O(k)$ computations for the sigmoid function (assuming that calculating the sigmoid of a real number counts as a single computation)
$y_{dec,i} = \text{softmax}(V_{dec} s_{dec,i} + c_{dec}) \implies O(Vk)$ computations for the matrix multiplication, $O(V)$ computations for the addition and $O(V)$ computations for the softmax function (assuming that calculating the softmax of an element in an array counts as a single computation)
Hence, a total of $O(mV + mk + k^2 + 3k + Vk + 2V)$ computations are performed by the decoder during each time step.
We observe that the total number of computations performed by the network is simply the sum of the computations performed by the encoder and the decoder over $T$ time steps.
Hence, the total number of computations performed by the network is $O(T \times (mV + mk + k^2 + 3k) + T \times (mV + mk + k^2 + 3k + Vk + 2V)) = O(T \times (2mV + 2mk + 2k^2 + 6k + Vk + 2V))$.
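As a rough sanity check, the expression above can be evaluated for representative values (the numbers below are illustrative, not the actual vocabulary or sequence sizes of the Aksharantar data):

```python
def total_computations(m, k, V, T):
    # T * (2mV + 2mk + 2k^2 + 6k + Vk + 2V), the count derived above
    return T * (2 * m * V + 2 * m * k + 2 * k * k + 6 * k + V * k + 2 * V)

# Illustrative setting: embedding size 256, hidden size 256, a character
# vocabulary of ~70 symbols, and sequences of length 20.
print(total_computations(m=256, k=256, V=70, T=20))   # ~6.35 million elementary operations
```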
(b) What is the total number of parameters in the network? (assume that the input embedding size is $m$, the encoder and decoder have 1 layer each, the hidden cell state size is $k$ for both the encoder and decoder, the length of the input and output sequence is the same, i.e., $T$, and the size of the vocabulary is the same for the source and target language, i.e., $V$)
Using the same notation as in part (a), we observe that the parameter matrices (and vectors) in the network are $L_{enc}$, $U_{enc}$, $W_{enc}$, $b_{enc}$, $L_{dec}$, $U_{dec}$, $W_{dec}$, $b_{dec}$, $V_{dec}$ and $c_{dec}$. The number of parameters is simply the total number of elements in these matrices, which, using the sizes listed in part (a), is $mV + km + k^2 + k + mV + km + k^2 + k + Vk + V = 2(mV + km + k^2 + k) + Vk + V$.
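Evaluating this formula for the same illustrative setting as in part (a):

```python
def total_parameters(m, k, V):
    # 2(mV + km + k^2 + k) + Vk + V, as derived above
    return 2 * (m * V + k * m + k * k + k) + V * k + V

print(total_parameters(m=256, k=256, V=70))   # ~0.32 million parameters for this setting
```

Note that the built-in torch.nn recurrent modules keep two bias vectors per layer (bias_ih and bias_hh), so the count reported by PyTorch for an equivalent model will be slightly larger than this formula suggests.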
Question 2 (10 Marks)
The model has been trained using the Manipuri language from the Aksharantar dataset. The standard train, validation and test sets from the folder aksharantar_sampled/mni have been used.
Using the sweep feature in wandb, the following values for the hyperparameters were explored (a sketch of the corresponding sweep configuration is given after the list):
- number of epochs: 5, 10
- input embedding size: 64, 128, 256
- number of layers: 1, 2, 3
- hidden layer size: 64, 128, 256
- cell type: RNN, GRU, LSTM
- dropout: 0, 0.2, 0.3
- optimizer: SGD, Adam
- learning rate: 5e-4, 1e-3, 5e-3, 1e-2
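A minimal sketch of such a wandb sweep configuration is shown below; the program and metric names (train, val_accuracy, the project name) are placeholders rather than the exact names used in the repository.

```python
import wandb

sweep_config = {
    "method": "bayes",                       # Bayesian search strategy, as described below
    "metric": {"name": "val_accuracy", "goal": "maximize"},
    "parameters": {
        "epochs":         {"values": [5, 10]},
        "embedding_size": {"values": [64, 128, 256]},
        "num_layers":     {"values": [1, 2, 3]},
        "hidden_size":    {"values": [64, 128, 256]},
        "cell_type":      {"values": ["RNN", "GRU", "LSTM"]},
        "dropout":        {"values": [0, 0.2, 0.3]},
        "optimizer":      {"values": ["sgd", "adam"]},
        "learning_rate":  {"values": [5e-4, 1e-3, 5e-3, 1e-2]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="cs6910_assignment3")
# wandb.agent(sweep_id, function=train, count=50)   # `train` is the (hypothetical) training entry point
```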
The following ideas were employed to reduce the number of runs and still obtain a satisfactory accuracy:
- The Bayesian (bayes) search strategy provided by wandb was utilized to focus on hyperparameter choices that were more likely to improve upon previous choices. By ignoring the regions of the search space expected to yield poor results and instead focusing on the promising regions, it was possible to reach near-optimal solutions in far fewer iterations.
- Furthermore, initial sweeps contained all the hyperparameter values listed above. This was done to obtain a rudimentary understanding of hyperparameter combinations that work well.
- Subsequent sweeps were performed on a reduced hyperparameter space consisting of those combinations expected to work well based on observations from the initial sweeps. For example, some of the subsequent sweeps were restricted to LSTM cells with a low learning rate and large sizes for the input embedding and hidden layers.
- Some later sweeps also focused on promising hyperparameter combinations that previous sweeps had left unexplored. This typically happened when similar combinations containing a few extremely unfavourable hyperparameter values had performed poorly, causing the Bayesian search to assign a negative correlation to every hyperparameter value in those combinations.
Question 3 (15 Marks)
Some insightful observations based on the sweeps used to generate the above plots are as follows:
- LSTM cells outperform RNN and GRU cells by a significant margin and hence constitute a majority (in fact, all of the top 10) of the runs with high validation accuracy.
- RNN cells in particular perform terribly in comparison to the other cell types, owing to their limited ability to retain information across long sequences.
- GRU cells manage to obtain satisfactory performance albeit at the cost of a slightly higher number of hidden layers, which LSTM cells do not necessarily need to generate good results.
- In general, adding hidden layers leads to an improvement in prediction quality (assuming the other hyperparameter values are fixed), indicating possible underfitting with fewer hidden layers. The same holds for the size of these hidden layers. This can be observed from the fact that all the top-performing models have 3 hidden layers of size 128 or 256.
- Models trained for more epochs yield better results. Again, the overwhelming number of top-performing models trained for 10 epochs serves as evidence for this.
- Adding dropout makes little positive impact, if any, again hinting at possible underfitting in the vanilla models. The top models contain a healthy mix of all the dropout values provided.
- A low learning rate is required to ensure stable learning. Models with higher learning rates often suffer from unstable updates and exploding gradients, ending up with near-zero accuracy.
- The Adam optimizer is prevalent in the models yielding high accuracy (again, in all of the top 10), thus indicating that it outperforms the SGD optimizer.
Question 4 (10 Marks)
(a) The best model from the sweep was evaluated on the test set. The hyperparameter values for this model were as follows:
- Cell type: LSTM
- Input embedding size: 256
- Number of hidden layers: 3
- Hidden layer size: 256
- Dropout: 0
- Optimizer: Adam
- Learning rate: 5e-4
When trained for 10 epochs, the accuracy obtained on the test set was 43.85%.
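Assuming accuracy here is measured as word-level exact match (a prediction counts as correct only if the entire transliterated word matches the reference), a minimal sketch of such a metric, with the predictions and references available as lists of strings, would be:

```python
def exact_match_accuracy(predictions, references):
    # Word-level accuracy: a prediction counts as correct only if the
    # entire transliterated word matches the reference string exactly.
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)
```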
(b) The following image depicts sample inputs from the test data and predictions made by the best model.

All the predictions on the test set have been uploaded to the folder predictions_vanilla on the GitHub project.
(c) Some of the significant errors made by the model are as follows:
- The model finds it difficult to predict transliterations of words containing character(s) which are pronounced differently from their usual pronunciation due to certain adjacent characters. Some examples of this from the test set include 'ng' in 'reading' and 'ave' in 'wave'. A solution for this would be to appropriately increase the number of training pairs with such sets of characters. For example, the model predicts the transliteration of 'ch' quite well due to sufficient training data.
- For such sets of characters, instead of using a character from the Meitei script that conveys the combined pronunciation, the model sometimes wrongly transliterates the sequence of characters one by one.
- For words where vowel sounds are implied by the surrounding consonants rather than written explicitly, the model fails to add these vowels in the output, since they are not explicitly present in the input word.
- Since only the final state of the encoder is passed to the decoder, the order of letters appears jumbled in some predictions; this method of encoding and subsequent decoding may fail to keep track of character order in longer words.
Question 5 (20 Marks)
Add an attention network to your base sequence-to-sequence model and train the model again. For the sake of simplicity, you can use a single-layered encoder and a single-layered decoder (if you want, you can use multiple layers also). Please answer the following questions:
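A minimal sketch of a Bahdanau-style additive attention module that could be plugged between the encoder outputs and the decoder is given below (illustrative; the class and layer names are assumptions, and the exact attention variant used in the repository may differ).

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.W_query = nn.Linear(hidden_size, hidden_size, bias=False)
        self.W_key = nn.Linear(hidden_size, hidden_size, bias=False)
        self.v = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, decoder_state, encoder_outputs):
        # decoder_state: (batch, hidden), encoder_outputs: (batch, T, hidden)
        query = self.W_query(decoder_state).unsqueeze(1)                   # (batch, 1, hidden)
        scores = self.v(torch.tanh(query + self.W_key(encoder_outputs)))   # (batch, T, 1)
        weights = torch.softmax(scores, dim=1)                             # attention over encoder steps
        context = (weights * encoder_outputs).sum(dim=1)                   # (batch, hidden)
        return context, weights.squeeze(-1)                                # context vector + weights
```

The context vector returned here is typically concatenated with the decoder input (or hidden state) before the output projection, and the returned weights are what attention heatmaps such as those in part (d) visualize.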
(a) The hyperparameters were tuned yet again and the results have been depicted below.
Some interesting observations from the plots for the attention-based models are as follows:
- Models with RNN cells obtain a huge boost in performance with the addition of attention. This is evident from the presence of such models among the ones with the best performance.
- Unlike the vanilla models, adding too many hidden layers degrades the performance, implying the possibility of overfitting.
- Dropout also starts playing an important role, and models with dropout perform significantly better than those without, again hinting at overfitting in the case of lack of dropout.
- Overall, the performance of the models improves with the addition of attention.
(b) The best model has been evaluated on the test set and the accuracy is 48.56%. The predictions on the test set have been uploaded to the folder predictions_attention on the GitHub project.
(c) The attention-based model does indeed perform better than the vanilla model. Some of the errors that the attention model has corrected are as follows:
- The problem arising in words where certain sets of characters are pronounced differently (compared to their usual pronunciation) by virtue of being placed adjacent to each other is handled much better by the attention-based model. By being able to focus on multiple time steps of the encoder while decoding, and to learn which time steps to focus on, the attention-based model learns to combine these characters during transliteration and output the appropriate character from the Meitei script.
- The attention-based model is also able to correct the errors stemming from incorrect ordering of the output word by learning the order through attention.
- Since data from the encoder can be passed to the decoder at multiple time steps using attention, the number of characters forgotten during transliteration is reduced.
(d) Sample attention heatmaps from the test data are depicted below:
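Such a heatmap can be produced from the attention weights of a single test word, for example as in the following sketch (weights is assumed to be an array of shape (output length, input length)):

```python
import matplotlib.pyplot as plt

def plot_attention(weights, input_chars, output_chars):
    # weights: (len(output_chars), len(input_chars)) attention matrix for one word
    fig, ax = plt.subplots()
    ax.imshow(weights, cmap="viridis")
    ax.set_xticks(range(len(input_chars)))
    ax.set_xticklabels(list(input_chars))
    ax.set_yticks(range(len(output_chars)))
    ax.set_yticklabels(list(output_chars))
    ax.set_xlabel("Input (Latin) characters")
    ax.set_ylabel("Predicted (Meitei) characters")
    plt.show()
```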
Question 7 (10 Marks)
Link to the GitHub repository:
https://github.com/Abdullah-Mohammed0/cs6910_assignment3
Self-Declaration
I, Abdullah Mohammed (Roll no: CS20B001), swear on my honour that I have written the code and the report by myself and have not copied it from the internet or other students.