ANLP Assignment-1
Description of hyperparameter tuning and final results.
General plot of validation losses and perplexities (showing the first 10 runs).
NNLM:
- Here, AdamW is the best optimizer, at lr = 0.001, though it is only marginally better than Adam.
- SGD takes too long to converge, even with a high lr. (A sketch of the sweep setup follows below.)
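For reference, a minimal sketch of how this optimizer comparison could be wired up in PyTorch. The helper name, the SGD momentum value, and the exact lr grid are assumptions for illustration, not the assignment code:

```python
import torch

def make_optimizer(name: str, params, lr: float):
    """Build one of the optimizers compared above (hypothetical helper)."""
    if name == "adamw":
        return torch.optim.AdamW(params, lr=lr)
    if name == "adam":
        return torch.optim.Adam(params, lr=lr)
    if name == "sgd":
        # momentum=0.9 is an assumed setting, not from the report
        return torch.optim.SGD(params, lr=lr, momentum=0.9)
    raise ValueError(f"unknown optimizer: {name}")

# The comparison discussed above: AdamW/Adam around lr = 1e-3,
# SGD pushed to a higher lr and still slow to converge.
sweep = [("adamw", 1e-3), ("adam", 1e-3), ("sgd", 1e-1)]
```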
Perplexity at batch_size=1
- We see that NNLM has a huge perplexity at test time if batch size = 1. The reason is that there exists a sentence (probably a very short one) for which the loss is very high (the max per-batch loss). In our dataset it is ~19.
- That means its perplexity is $e^{19} \approx 1.8 \times 10^8$. Keep in mind that our test set is ~3k words. On average, this sample alone contributes $e^{19}/3000 \approx 56{,}000$ to the mean. So despite a good average loss, the average per-sentence perplexity is dominated by the largest perplexities, which come from a few wrong predictions.
- Bad perplexity in this case has no relation to bad loss (brown point); the sketch after this list makes the arithmetic concrete.
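A minimal numeric sketch of that effect, with illustrative loss values (the 5.4 "typical" loss and the exact counts are assumptions, not the actual run data):

```python
import math

# ~3,000 test samples at batch_size = 1, one of which has loss ~19.
losses = [5.4] * 2999 + [19.0]   # 5.4 is an illustrative "typical" loss

# Exponent of the mean loss: barely moved by the outlier.
exp_mean = math.exp(sum(losses) / len(losses))

# Mean of per-sample perplexities: the outlier dominates, since
# exp(19)/3000 is ~59,000 on its own (the report quotes ~56,000,
# so the actual max loss was presumably just under 19).
mean_exp = sum(math.exp(l) for l in losses) / len(losses)

print(f"exp(mean loss)    = {exp_mean:,.0f}")                      # ~222
print(f"mean of exp(loss) = {mean_exp:,.0f}")                      # ~59,700
print(f"outlier alone     = {math.exp(19.0) / len(losses):,.0f}")  # ~59,500
```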
LSTM:
- Adam with lr = 0.0025 is the best hyperparameter setting here.
Perplexity with batch size = 1
- Here also, perplexity correlates more with max_loss than with the average loss. The max loss is ~12, here contributing ~52 to the perplexity.
- Seen through its effect on the average of the exponent (i.e., the perplexity): $e^{12}/3000 \approx 52$. So of the ~275 total perplexity, 52 comes from a single example (checked numerically below).
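The same back-of-the-envelope check for the LSTM, under the same assumption of ~3,000 test samples:

```python
import math

n_samples = 3000   # approximate test-set size, as assumed above
max_loss = 12.0    # worst single-sample loss for the LSTM (~12)

# Contribution of that one sample to the mean per-sample perplexity:
print(math.exp(max_loss) / n_samples)  # ~54, close to the ~52 quoted above
```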
Transformers:
- Best hyperparameters are AdamW with lr = 0.1, trained for 5 epochs.
Perplexity at batch size = 1
- Interestingly, the effect on perplexity is minimal here (but still noticeable). The single-sentence contribution is only ~16 to the perplexity.
Final comparison of perplexity:
| Model | Best perplexity | Perplexity at batch_size = 1 | Biggest single-sentence contribution |
|---|---|---|---|
| NNLM | 224 | 70706 | 56000 |
| LSTM | 157 | 276 | 52 |
| Transformer | 162 | 227 | 23 |
- Everything here is trained with a context length of 5. (The sketch below shows why batching damps the outlier's effect.)
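Why does batch size change the reported perplexity at all? At batch size > 1 the outlier's loss is averaged with the rest of its batch before exponentiating, which damps its effect; at batch size = 1 it is exponentiated alone. A minimal sketch with the same illustrative numbers as above:

```python
import math

losses = [5.4] * 2999 + [19.0]  # illustrative per-sample losses, one outlier

def perplexity(losses, batch_size):
    """Mean over batches of exp(mean loss within the batch)."""
    ppls = []
    for i in range(0, len(losses), batch_size):
        batch = losses[i:i + batch_size]
        ppls.append(math.exp(sum(batch) / len(batch)))
    return sum(ppls) / len(ppls)

print(perplexity(losses, batch_size=32))  # ~223: outlier averaged away in its batch
print(perplexity(losses, batch_size=1))   # ~59,700: exp(19) dominates the mean
```

This is also why the LSTM and Transformer, whose worst-case losses are far smaller, show a much smaller gap between their batched and batch_size = 1 perplexities.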