
ANLP Assignment-1

Description of hyperparameter tuning and final results.

General plot of validation losses and perplexities


[Plot: validation loss vs. step, first 10 runs shown (y-axis roughly 5.2–6.2)]
[Plot: validation perplexity vs. step, first 10 runs shown (y-axis roughly 200–450)]


NNLM:



  • Here, AdamW is the best optimizer, with lr = 0.001. It is only marginally better than Adam (a sketch of this setup follows the list).
  • SGD takes too long to converge, even with a high lr.
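As a rough illustration of what the sweep boils down to, here is a minimal sketch of swapping optimizer and learning rate. The model definition, vocabulary/embedding sizes, and the `make_optimizer` helper are assumptions made for the sketch, not the actual assignment code.

```python
import torch
import torch.nn as nn

def make_optimizer(name: str, model: nn.Module, lr: float) -> torch.optim.Optimizer:
    """Build one of the optimizers compared in the sweep."""
    opts = {"adamw": torch.optim.AdamW, "adam": torch.optim.Adam, "sgd": torch.optim.SGD}
    return opts[name](model.parameters(), lr=lr)

# Toy stand-in for the NNLM (context length 5, as noted at the end of the report);
# the real architecture is not shown in this report.
vocab, dim, ctx = 10_000, 128, 5
model = nn.Sequential(nn.Embedding(vocab, dim), nn.Flatten(), nn.Linear(ctx * dim, vocab))

# Best setting found above; Adam was only marginally worse,
# and SGD converged too slowly even with a high learning rate.
optimizer = make_optimizer("adamw", model, lr=1e-3)
```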


Perplexity at batch_size=1



  • We see that NNLM has a huge perplexity at test time if batch size = 1. The reason is that there exists a sentence (probably very short) for which the loss is very high (the max loss per batch); in our dataset it is ~19.
  • That means its perplexity is on the order of $e^{19}$, i.e. $\approx 1.68 \times 10^{8}$ here. Keep in mind that our test set has ~3k words, so on average this sample alone contributes $1.68 \times 10^{8} / 3000 \approx 56{,}000$. So despite a good average loss, the average perplexity per sentence is mostly determined by the largest perplexities coming from a few wrong predictions (see the sketch after this list).
  • The bad perplexity in this case has little relation to the average loss (brown point in the plot).
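A minimal numeric sketch of this effect, using made-up per-item losses that only roughly mirror the numbers above (the count of 3000, the typical loss of ~5.4, and the outlier loss of 19 are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# ~3k test items with typical loss around 5.4, plus one outlier with loss ~19
losses = rng.normal(loc=5.4, scale=0.3, size=3000)
losses[0] = 19.0

# exp of the average loss: what evaluation with large batches effectively reports
print("exp(mean loss) :", round(float(np.exp(losses.mean())), 1))   # ~220

# average of per-item exp(loss): what batch_size=1 evaluation reports
print("mean(exp(loss)):", round(float(np.exp(losses).mean()), 1))   # tens of thousands

# the outlier's share of that average, same order as the ~56,000 quoted above
print("outlier / N    :", round(float(np.exp(19.0) / 3000), 1))
```

The first number stays near the usual validation perplexity, while the second is dominated by the single exponentiated outlier, which is the gap discussed above.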

LSTM:



  • Adam with lr = 0.0025 is the best hyperparameter setting here.


Perplexity with batch size = 1



  • Here also, perplexity correlates more with the max loss than with the average loss. The max loss is 12, contributing $2.71^{12} \approx 156{,}000$ (see the worked numbers below the list).
  • Looking at its effect on the average of the exponentials (the perplexity): $156{,}000 / 3000 = 52$. So out of a total perplexity of ~275, 52 comes from this single example.
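Spelled out as an equation (with $\ell_i$ the per-example loss and $N = 3000$; these symbols are introduced here for clarity and the figures are the report's own rounded values):

$$
\text{mean per-example perplexity} = \frac{1}{N}\sum_{i=1}^{N} e^{\ell_i} \approx 276,
\qquad
\frac{e^{\ell_{\max}}}{N} \approx \frac{1.56 \times 10^{5}}{3000} \approx 52 .
$$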

Transformers:



  • The best hyperparameters are AdamW with lr = 0.1, trained for 5 epochs.

Perplexity at batch size = 1



  • Interestingly, the effect on perplexity here is minimal (but still noticeable). The contribution to perplexity is ~16.


Final comparison of perplexity:

  • Best NNLM: 224; perplexity at batch_size = 1: 70,706; biggest single-sentence contribution to perplexity: 56,000
  • Best LSTM: 157; perplexity at batch_size = 1: 276; biggest single-sentence contribution to perplexity: 52
  • Best Transformer: 162; perplexity at batch_size = 1: 227; biggest single-sentence contribution to perplexity: 23

  • All models here are trained with context length = 5.
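One way to make the reported perplexity insensitive to batch size is to aggregate the token-level loss over the whole test set and exponentiate once, rather than averaging per-sentence exponentials. A minimal sketch, assuming a PyTorch model that returns logits and a loader yielding (context, target) index tensors; this is not the assignment's actual evaluation code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def corpus_perplexity(model: torch.nn.Module, loader, device: str = "cpu") -> float:
    """exp(total NLL / total tokens): exponentiate once, after summing the loss
    over every target token in the test set, so a single high-loss example
    cannot dominate the way it does when per-example perplexities are averaged."""
    model.eval()
    total_nll, total_tokens = 0.0, 0
    for contexts, targets in loader:
        logits = model(contexts.to(device))                      # (batch, vocab)
        nll = F.cross_entropy(logits, targets.to(device), reduction="sum")
        total_nll += nll.item()
        total_tokens += targets.numel()
    return float(torch.exp(torch.tensor(total_nll / total_tokens)))
```

Under this definition the gap between the batched and batch_size = 1 numbers should largely disappear, since the outlier's loss is averaged with all other tokens before the exponential is applied.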