CS6910 Assignment-3
Question - 1
For this question, we only consider the number of additions and multiplications as computations. Applying non-linearity to an element (scalar) is also considered as a single computation. Further, we assume batch size is 1 for all these computations.
(a) What is the total number of computations done by your network? (assume that the input embedding size is m, encoder and decoder have 1 layer each, the hidden cell state is k for both the encoder and decoder, the length of the input and output sequence is the same, i.e., T, the size of the vocabulary is the same for the source and target language, i.e., V)
$\textbf{Solution}$:
Prerequisite: Multiplying a matrix of size $m \times n$ with a vector of size $n \times 1$ takes $mn$ scalar multiplications and $m(n-1)$ additions. $\therefore$ we have a total of $m(2n-1)$ computations.
- Encoder-Embedding: $x'_t = W_e x_t$
  - dim of $x_t$: $V \times 1$
  - dim of $W_e$: $m \times V$
  - dim of $x'_t$: $m \times 1$
  - Number of computations: $m(2V - 1)$
- Encoder-RNN: $h_{t+1} = \sigma(U_e h_t + V_e x'_t + b_e)$ and $y_t = \mathcal{O}(Z_e h_t + c_e)$
  - dim of $h_t$: $k \times 1$ (same for $h_{t+1}$)
  - dim of $x'_t$: $m \times 1$
  - dim of $U_e$: $k \times k$
  - dim of $V_e$: $k \times m$
  - dim of $b_e$: $k \times 1$
  - dim of $Z_e$: $k \times k$
  - dim of $c_e$: $k \times 1$
  - Number of computations for the former: $k(2k - 1) + k(2m - 1) + 2k + k = 2k^2 + 2km + k$ $\rightarrow$ [here, the final $k$ is for the non-linear op]
  - Number of computations for the latter: $k(2k - 1) + k + k = 2k^2 + k$ $\rightarrow$ [here also, the final $k$ is for the non-linear op]
- Number of computations for the Encoder-RNN (both equations above): $4k^2 + 2km + 2k$ $\rightarrow$ [This is for only one timestep]
- Adding the embedding cost $m(2V - 1) = 2mV - m$ per timestep, the number of computations for the complete Encoder (Embedding + RNN) with $T$ timesteps is $T(2mV - m + 4k^2 + 2km + 2k)$
- Since the decoder has a similar architecture, it performs the same number of computations, and hence the total number of computations by the whole model is $2T(2mV - m + 4k^2 + 2km + 2k)$
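The count above can be checked numerically with a short sketch. The function names (`matvec_ops`, `encoder_ops_per_step`, `total_ops`) are hypothetical helpers introduced here for illustration; they simply re-add the per-term costs listed in the solution and compare the result with the closed form.

```python
def matvec_ops(rows: int, cols: int) -> int:
    """Multiplications plus additions for a (rows x cols) matrix times a (cols x 1) vector."""
    return rows * (2 * cols - 1)


def encoder_ops_per_step(m: int, k: int, V: int) -> int:
    """Computations for one encoder timestep, following the terms in the solution."""
    embedding = matvec_ops(m, V)  # x'_t = W_e x_t
    # h_{t+1}: two mat-vec products, two vector additions, one non-linearity
    hidden = matvec_ops(k, k) + matvec_ops(k, m) + 2 * k + k
    # y_t: one mat-vec product, one vector addition, one non-linearity
    output = matvec_ops(k, k) + k + k
    return embedding + hidden + output


def total_ops(m: int, k: int, V: int, T: int) -> int:
    """Encoder plus decoder over T timesteps (decoder assumed identical in cost)."""
    return 2 * T * encoder_ops_per_step(m, k, V)


def closed_form(m: int, k: int, V: int, T: int) -> int:
    """The closed-form expression from the solution."""
    return 2 * T * (2 * m * V - m + 4 * k**2 + 2 * k * m + 2 * k)


if __name__ == "__main__":
    # Arbitrary illustrative sizes, not the values used in the experiments.
    print(total_ops(m=64, k=256, V=70, T=20) == closed_form(m=64, k=256, V=70, T=20))
```

Summing the terms one by one and comparing against the closed form is a cheap way to catch a dropped $-k$ or $-m$ when simplifying by hand.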
(b) What is the total number of parameters in your network?  (assume that the input embedding size is m, encoder and decoder have 1 layer each, the hidden cell state is k for both the encoder and decoder and the length of the input and output sequence is the same, i.e., T, the size of the vocabulary is the same for the source and target language, i.e., V)
$\textbf{Solution}$:
- Encoder-Embedding: $x'_t = W_e x_t$
  - dim of $x_t$: $V \times 1$
  - dim of $W_e$: $m \times V$
  - dim of $x'_t$: $m \times 1$
  - Number of parameters: $mV$
- Encoder-RNN: $h_{t+1} = \sigma(U_e h_t + V_e x'_t + b_e)$ and $y_t = \mathcal{O}(Z_e h_t + c_e)$
  - dim of $h_t$: $k \times 1$ (same for $h_{t+1}$)
  - dim of $x'_t$: $m \times 1$
  - dim of $U_e$: $k \times k$
  - dim of $V_e$: $k \times m$
  - dim of $b_e$: $k \times 1$
  - dim of $Z_e$: $k \times k$
  - dim of $c_e$: $k \times 1$
  - Number of parameters: $k^2 + km + k + k^2 + k$ (parameters are $U_e$, $V_e$, $b_e$, $Z_e$ and $c_e$)
- Number of parameters for the complete Encoder (Embedding + RNN): $mV + 2k^2 + km + 2k$
- Since the decoder has a similar architecture, it has the same number of parameters, and hence the total number of parameters in the whole model is $2(mV + 2k^2 + km + 2k)$
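As a quick sanity check of the parameter count, the following sketch adds up the sizes of the individual weight matrices and biases listed above. The helper names are hypothetical and the sizes in the example call are arbitrary.

```python
def encoder_params(m: int, k: int, V: int) -> int:
    """Parameters of one embedding layer plus one recurrent layer, as listed in the solution."""
    W_e = m * V   # embedding matrix, m x V
    U_e = k * k   # hidden-to-hidden weights
    V_e = k * m   # input-to-hidden weights
    b_e = k       # hidden bias
    Z_e = k * k   # hidden-to-output weights
    c_e = k       # output bias
    return W_e + U_e + V_e + b_e + Z_e + c_e


def total_params(m: int, k: int, V: int) -> int:
    """Encoder plus decoder, with the decoder assumed to mirror the encoder."""
    return 2 * encoder_params(m, k, V)


if __name__ == "__main__":
    m, k, V = 64, 256, 70  # illustrative sizes only
    assert total_params(m, k, V) == 2 * (m * V + 2 * k**2 + k * m + 2 * k)
    print(total_params(m, k, V))
```

Note that $T$ does not appear anywhere: the same weights are reused at every timestep, so the sequence length affects the number of computations but not the number of parameters.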
Question - 2
- We used Hindi for transliteration. We swept over the following hyperparameters and values to find the best set. We used the $\textbf{Bayes}$ search strategy to converge quickly to the hyperparameter sets that give better word-level validation accuracy. This strategy also helped us avoid combinations that give lower results. We also used $\texttt{EarlyStopping}$ to terminate runs if the performance decreased continuously over 4 epochs. This saved a lot of time while training the models.
- Here, we used dropout at the linear transformation of the inputs (the $\texttt{dropout}$ argument in the LSTM, GRU and RNN constructors corresponds to that). We did this to prevent the model from relying on the complete input and overfitting on it, especially when we have stacked LSTMs, GRUs or RNNs. The model learns not to depend on specific neurons that process the input, so the whole architecture participates in the learning.
sweep_config = {
    'method': 'bayes',
    'metric': {
        'name': 'val_word_accuracy',
        'goal': 'maximize'
    },
    'parameters': {
        'lr': {
            'values': [1e-2, 1e-3, 1e-4]
        },
        'batch_size': {
            'values': [128, 256, 512]
        },
        'num_enc_recurrent_layers': {
            'values': [1, 2, 3]
        },
        'num_dec_recurrent_layers': {
            'values': [1, 2, 3]
        },
        'num_recurrent_units': {
            'values': [64, 128, 256],
        },
        'dropout': {
            'values': [0.1, 0.2, 0.3]
        },
        'recurrent_layer': {
            'values': ["rnn", "gru", "lstm"],
        },
        'weight_decay': {
            'values': [1e-4, 1e-3, 1e-2],
        },
        'enc_embed_dim': {
            'values': [64, 128, 256],
        },
        'dec_embed_dim': {
            'values': [64, 128, 256],
        },
        'optimizer': {
            'values': ["ranger", "adamw", "sgdw", "novograd"],
        }
    }
}
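The early-stopping rule described before the sweep configuration can be sketched as follows. This is a minimal stand-alone illustration, not the callback used in the experiments: the class name `EarlyStopper` is hypothetical, and it interprets "continuously decreased over 4 epochs" literally as four consecutive epoch-to-epoch drops in `val_word_accuracy`. Library callbacks often compare against the best value seen so far instead, which is a slightly different rule.

```python
class EarlyStopper:
    """Signal a stop after `patience` consecutive drops in the monitored metric."""

    def __init__(self, patience: int = 4):
        self.patience = patience
        self.previous = None   # metric value from the previous epoch
        self.bad_epochs = 0    # length of the current run of decreases

    def should_stop(self, val_word_accuracy: float) -> bool:
        if self.previous is not None and val_word_accuracy < self.previous:
            self.bad_epochs += 1
        else:
            self.bad_epochs = 0  # any non-decrease resets the counter
        self.previous = val_word_accuracy
        return self.bad_epochs >= self.patience


def epochs_run(history, patience=4):
    """Return how many epochs of `history` would run before stopping."""
    stopper = EarlyStopper(patience)
    for epoch, accuracy in enumerate(history, start=1):
        if stopper.should_stop(accuracy):
            return epoch
    return len(history)


if __name__ == "__main__":
    # Made-up accuracy curve for illustration.
    print(epochs_run([0.30, 0.29, 0.28, 0.27, 0.26, 0.25]))
```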
Question - 3
$\textbf{Solution}$
For this assignment, we work with two new optimizers namely Ranger and NovoGrad which have shown great performance for NLP and Machine Translation tasks. We also use the decoupled weight-decay versions of Adam and SGD, i.e. AdamW and SGDW.
$\textbf{Observations}$
- An observation about the data: after padding, almost 60-65% of each sequence consisted of pad-tokens ($0$s) towards the end. Even if the model predicts every output as $0$, it achieves a character-level accuracy of over 60%. Hence, we use word-level accuracy to gauge the models, and as the metric in the Bayesian sweep as well.
- RNNs take much less time to run than LSTMs/GRUs, because they have a very simple architecture and far fewer computations to perform. This comes at a cost: RNNs do not perform well because of their smaller number of parameters. The models with RNN layers failed to predict anything, i.e. their word-level accuracy was $0$, and their predictions at every timestep were pad-tokens. In fact, LSTMs > GRUs > RNNs in terms of performance.
- A higher learning rate helped achieve faster convergence, and the same is seen in the correlation plot (lr has the highest importance and a high positive correlation). The $\texttt{amsgrad}$ argument in the optimizers was set to true, which enabled stable learning even with higher learning rates, without getting stuck in local minima.
- The performance of the learning algorithms was as follows: Ranger > AdamW > NovoGrad > SGDW. SGDW was the slowest: even after 25 full epochs, the character-level accuracy improved by just 3-4% and the word-level accuracy stayed stuck at $0$. Ranger, on the other hand, converged to almost 25% or higher word-level accuracy within 5 epochs. It also has a positive correlation and was the preferred optimizer in most cases. The sweeps never chose NovoGrad, as they always alternated between AdamW and Ranger, which give really good performance.
- Increasing the number of LSTM, GRU or RNN layers did aid the learning, especially increasing the decoder layers. This helped the model create more abstract representations of the sequences, and it also added more parameters that could be tuned to get finer results. A single-layer model was highly underfitting and predicted the entire output as pad-tokens (word accuracy was almost $0$).
- Higher embedding dimensions were preferred for the decoder because the Hindi script is complex. A single Devanagari character can correspond to 2-3 letters of English. Hence, projecting the characters into higher dimensions might help the model pick up distinctive qualities between them. The encoder did not require high embedding dimensions, and even smaller projections could result in better performance. A higher embedding dimension would also get good results, but it would involve unnecessary computations and parameters. This can be observed from the table, as the decoder embedding size has a positive correlation and higher importance than the encoder embedding size, which has a negative correlation and lower importance.
- The number of recurrent units (LSTM/GRU/RNN cells) also has a positive correlation with the learning. A higher number of cells in each layer implies more parameters, which can better fit the data. We use the same number of cells in the encoder and decoder, as they must have the same cell_state dimensions.
- Higher batch sizes were preferred by the model, and they also reduced the runtime. This may be because larger batches produce a better approximation of the gradients than small mini-batches (say of size 16, 32, 64 etc.), which stabilizes the training.
- Dropout has a positive correlation with the learning. Larger models, which involve multiple encoder-decoder layers with more neurons, can very easily overfit the data; hence, as the model size increases, we also increase the dropout. This keeps the subsequent layers from overfitting on the output of the previous layers, and makes them learn even when a part of that output is unavailable or drops out. It also causes some of the "dormant" neurons (which were not contributing to the learning) to become active and learn some redundant feature-representations.
- We also added decoupled weight decay to the optimizer. It had more importance than dropout. Decoupled weight decay is similar to L2 regularization but is independent of the learning rate and can constrain the model parameters more. Dropout and weight decay were found to be complementary to each other, i.e. if a lower dropout is used, a higher weight decay is better, and vice-versa. If both are quite high, the model is constrained too much and underfits, and if both are very low, the model overfits.
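The first observation, that an all-pad prediction still scores over 60% character-level accuracy while scoring zero at word level, can be reproduced on toy data. The sketch below is illustrative only: the token ids and sequences are made up, pad is assumed to be id `0`, and the two metric functions are hypothetical stand-ins for whatever the training code computes.

```python
PAD = 0  # assumed pad-token id


def char_accuracy(targets, predictions):
    """Fraction of positions (padding included) where the predicted token matches."""
    total = sum(len(t) for t in targets)
    correct = sum(
        p == t
        for target, prediction in zip(targets, predictions)
        for t, p in zip(target, prediction)
    )
    return correct / total


def word_accuracy(targets, predictions):
    """Fraction of sequences that match the target at every position."""
    matches = sum(t == p for t, p in zip(targets, predictions))
    return matches / len(targets)


# Two made-up padded targets of length 10 (6 and 7 pad tokens respectively).
targets = [
    [5, 7, 2, 9, 0, 0, 0, 0, 0, 0],
    [3, 8, 4, 0, 0, 0, 0, 0, 0, 0],
]
# A degenerate "model" that outputs only pad tokens.
all_pad = [[PAD] * len(t) for t in targets]

if __name__ == "__main__":
    print("char accuracy:", char_accuracy(targets, all_pad))
    print("word accuracy:", word_accuracy(targets, all_pad))
```

Because every padded position counts as a correct character, the character-level score mostly measures how much padding there is, which is why word-level accuracy is the more informative metric here.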
Question - 4
$\textbf{Solution}$
(a) METRICS:
- Train Word-Accuracy: $0.6263$
- Validation Word-Accuracy: $0.4006$
- Test Word-Accuracy: $0.3880$
(b) SAMPLE PREDICTIONS: (a similar prediction format is also seen when you run the code)
Complete predictions on the test set are available in the predictions_vanilla folder


(c) SOME INSIGHTFUL COMMENTS:
- (Not an observation) In all the examples that follow, $x$ -- $y$ -- $z$ means that $x$ is the English word, $y$ is the target transliteration and $z$ is the predicted transliteration.
- The model confuses ि and ी (as seen in agrist -- ग्रीस्ट -- ग्रिस्ट), as there is no way to stress the matraa in English. Similarly, it fails to distinguish the अ and आ matraas.
- It also mis-predicts ड as द and त as ट (each pair is written almost identically in English, i.e. da-da and ta-ta respectively)
- In some predictions, instead of predicting a matraa, the model predicts the character responsible for the matraa. For example, kide -- कीड़े -- काइड
- Longer predicted sequences start correctly, but more mistakes are made as we go along. Shorter ones are predicted more accurately.
- Half-consonants in Hindi are mis-predicted a lot, as their representation in English does not convey the half-form. The model predicts half-consonants even when not required, as in gadadhar -- गदाधर -- गद्धार and gangrel -- गंगरेल -- गैंग्रेल. But in the latter, the prediction is very close to the target in terms of pronunciation.
- The model is unaware that the full vowel character can be used in place of a matraa. For example, in grenite -- ग्रेनाइट -- ग्रेनिट, the model predicts the ि matraa whereas the original इ symbol was needed. The model always prefers to predict matraas rather than these characters, but once again these predictions are close-sounding.
- The model makes more mistakes on vowels than on consonants, as the vowels correspond to matraas, like (a -- े, e -- ि/ी, i -- ै, o -- ो, u -- ु/ू), and there is always an ambiguity in those, as seen above.
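The distinction between a matraa and the full vowel character discussed above is visible at the Unicode level: they are separate code points, so a character-level model (assuming it tokenizes by code point, which is an assumption here) has to learn them as different output symbols. The sketch below uses only the standard `unicodedata` module; `describe` is a hypothetical helper.

```python
import unicodedata


def describe(text: str):
    """Return (code point, Unicode name) for every character in `text`."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# Independent vowel letters versus their dependent vowel signs (matraas),
# plus the sign that produces half-consonants.
SYMBOLS = "\u0907\u093f\u0908\u0940\u094d"

if __name__ == "__main__":
    for code_point, name in describe(SYMBOLS):
        print(code_point, name)
    # Target and prediction from the grenite example above, split into code points.
    for word in ["ग्रेनाइट", "ग्रेनिट"]:
        print(word, [name for _, name in describe(word)])
```

Printing the two words from the grenite example side by side shows that the target spells the vowel with a full letter while the prediction uses a vowel sign attached to the preceding consonant.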
Question - 5
$\textbf{Solution}$
We swept over the following hyperparameters for attention models. This time, we only worked with the Ranger and AdamW optimizers, as they gave the best performance with vanilla models. Once again, we used the $\textbf{Bayes}$ method to converge quickly to the best hyperparameter setting, and also used $\texttt{EarlyStopping}$ to terminate runs if their performance worsened for 4 consecutive epochs.
sweep_config = {
    'method': 'bayes',
    'metric': {
        'name': 'val_word_accuracy',
        'goal': 'maximize'
    },
    'parameters': {
        'lr': {
            'values': [1e-2, 1e-3, 1e-4]
        },
        'attn_dim': {
            'values': [128, 256, 512]
        },
        'batch_size': {
            'values': [128, 256, 512]
        },
        'num_recurrent_units': {
            'values': [64, 128, 256],
        },
        'dropout': {
            'values': [0.1, 0.2, 0.3]
        },
        'num_enc_recurrent_layers': {
            'values': [1, 2, 3]
        },
        'num_dec_recurrent_layers': {
            'values': [1, 2, 3]
        },
        'recurrent_layer': {
            'values': ["lstm", "gru", "rnn"],
        },
        'weight_decay': {
            'values': [1e-4, 1e-3, 1e-2],
        },
        'enc_embed_dim': {
            'values': [64, 128, 256],
        },
        'dec_embed_dim': {
            'values': [64, 128, 256],
        },
        'optimizer': {
            'values': ["ranger", "adamw"],
        }
    }
}
(a) METRICS WITH ATTENTION MECHANISM
- Train Word-Accuracy: $0.6500$
- Validation Word-Accuracy: $0.4325$
- Test Word-Accuracy: $0.4140$
(b) SOME INSIGHTFUL COMMENTS
- With attention, longer sequences are also predicted properly. This was not the case with vanilla models. For example, words like aparivartanshali -- अंचर्तिकारियों (without attention) -- अपरिवर्तशली (with attention) are predicted correctly.
- It also learns the difference between characters like (त, थ) and (ड, द) by attending to the 'h' sound after the 't' or 'd' respectively.
- It also learns to differentiate न from ं. Further, with attention, the model properly uses the ्. Without attention, this character was misused, and the model was never able to identify when to use it. It almost always printed the "full" character without the "halant".
- Without attention, the model uses आई for 'ai' (as in ail), but with attention, it corrects to the proper ऐ character.
- The most confused character for ई is ऐ without attention, but with attention this misprediction is eliminated. For example, neto -- नीटो (without attention) -- नेटो (with attention). Now, the 'ae'/'e' character properly maps based on the context.
- With attention, the model also starts to differentiate between the ु and ू matraas, which it was not able to do earlier. For example, bhukhi -- भुखी (without attention) -- भूखी (with attention)
(c) HEAT-MAPS
- We observe that most of the time we have one-to-one attention (that is why the diagonal is more prominent). But occasionally, we find one Devanagari character paying attention to more than one English character: in kuposhit, the final श attends to both 's' and 'h' in the input. Similarly, in gaye, य attends to both 'y' and 'e'.
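To make the reading of one heat-map row concrete, the sketch below normalises a set of alignment scores with a softmax and lists the input characters that receive a sizeable share of the attention. Everything numeric here is hypothetical: the scores are invented to mimic the gaye example, the `0.3` threshold is an arbitrary choice, and the helper names are not taken from the assignment code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw alignment scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attended_inputs(source, weights, threshold=0.3):
    """Input characters whose attention weight is at least `threshold`."""
    return [ch for ch, w in zip(source, weights) if w >= threshold]


source = list("gaye")
# Invented raw scores for the decoder step that emits the Devanagari "ya";
# these are NOT taken from the trained model.
scores_for_ya = [0.1, 0.2, 2.0, 1.8]
weights = softmax(scores_for_ya)

if __name__ == "__main__":
    for ch, w in zip(source, weights):
        print(ch, round(w, 3))
    print(attended_inputs(source, weights))
```

Each row of the heat-map is one such weight vector, so a mostly diagonal map corresponds to rows where a single input character dominates, and the off-diagonal cases correspond to rows where two neighbouring characters share the weight.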


Question - 6
$\textbf{Solution}$
- Red indicates more importance given to that character, and blue means less importance. Here, we observe which English character from the input the model is looking at while predicting the Devanagari character at the current timestep.
- Very similar to the attention heat-map, there is a one-to-one correspondence between the English characters and the Devanagari characters being predicted. But occasionally we observe that, to predict one character in Hindi, the model looks at two English characters (it even focuses on the matraa after it).
- The following five GIFs denote the connectivity as an animation for five different words (परवा, बरसा, बरात, प्रहलाद and दाव respectively).





Question - 7
Question - 8
$\textbf{Solution}$
GENERATED SONG
I love deep learning,
I just can't get enough of it.
Deep Learning is a great way to learn and you know what that means? 
It's the best thing in life right now! 
Deeplearning has been around for quite some time
but we've never really had any real success with this stuff 
so if there was ever an opportunity like today 
where people could use our technology 
then let me tell ya how much better off they would be 
than when somebody else did their own research on us back before 9/11
SOME COMMENTS ABOUT GPT TRAINING
- We used the standard trainer driver program available here in HuggingFace as it helps us cleanly load the dataset and train on a TPU.
- We used all the song lyrics from the Kaggle dataset provided above, merged them in a single file for training except Rihanna's as it was used as validation set.
- The final perplexity of the model on the validation dataset was 8.5 after 3 epochs. It was observed that after 3 epochs, the model began overfitting and the validation perplexity increased.
- During prediction, we used a beam size of 5 and took the longest prediction among the beams, as most of them were quite short even when a sequence length of 128 was used. If a sequence length of 256 or higher was used, the model started producing meaningless text (syntactically correct, but semantically wrong); the aggression in the song also increased, and it started producing non-parliamentary language (probably due to the rap dataset, which has such words). For sequences longer than 512, the final lines of the generated text are not even syntactically proper and have no correlation to the text generated before.
- Without Beam Search, the output produced is very unnatural, though it is semantically correct.
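For readers unfamiliar with the perplexity figure quoted above, the sketch below shows the usual definition: the exponential of the mean per-token negative log-likelihood. Whether the trainer computes it in exactly this way is an assumption; under that assumption, a validation perplexity of 8.5 corresponds to a mean cross-entropy of roughly $\ln 8.5 \approx 2.14$ nats per token. The function name and the toy inputs are hypothetical.

```python
import math


def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


if __name__ == "__main__":
    # A model that is uniformly unsure among 8 tokens at every step
    # has a per-token NLL of ln(8) and therefore a perplexity of 8.
    uniform_nlls = [math.log(8)] * 5
    print(perplexity(uniform_nlls))
    # Mean cross-entropy implied by a perplexity of 8.5, under the assumption above.
    print(math.log(8.5))
```

Read this way, a perplexity of 8.5 means the model is, on average, about as uncertain as if it were choosing uniformly among 8 to 9 tokens at each position.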
$\textbf{Self Declaration}$
- CS21M070: Varun Gumma ($100\%$ contribution)
  - Computed number of parameters and computations for the model
  - Coded Encoder, Decoder and Full Transliteration model without attention
  - Plotted all required metrics and plots for vanilla model
  - Wrote observations for the vanilla model and made inferences after running it on the test data
  - Coded Encoder, AttentionDecoder and Full Transliteration model with attention
  - Plotted all required metrics and plots for attention model
  - Wrote observations for the attention model and made inferences after running it on the test data
  - Generated heat-maps and connectivity visualizations
  - Cleaned songs data and trained the GPT2 model
- CS21M022: Hanumantappa Budihal ($100\%$ contribution)
  - Computed number of parameters and computations for the model
  - Coded Encoder, Decoder and Full Transliteration model without attention
  - Plotted all required metrics and plots for vanilla model
  - Wrote observations for the vanilla model and made inferences after running it on the test data
  - Coded Encoder, AttentionDecoder and Full Transliteration model with attention
  - Plotted all required metrics and plots for attention model
  - Wrote observations for the attention model and made inferences after running it on the test data
  - Generated heat-maps and connectivity visualizations
  - Cleaned songs data and trained the GPT2 model
We, Varun Gumma and Hanumantappa Budihal, swear on our honour that the above declaration is correct.