ANLP Assignment-1
Description of hyperparameter tuning and final results.
General plot of validation losses and perplexities (showing the first 10 runs).
NNLM:
- Here, AdamW is the best optimizer, at lr = 0.001, though it is only marginally better than Adam.
- SGD takes too long to converge, even with a high lr. (A sketch of the sweep setup follows below.)
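For reference, a minimal sketch of how this optimizer comparison could be wired up in PyTorch. The helper name, the SGD momentum value, and the exact lr grid are assumptions for illustration, not the assignment code:

```python
import torch

def make_optimizer(name: str, params, lr: float):
    """Build one of the optimizers compared above (hypothetical helper)."""
    if name == "adamw":
        return torch.optim.AdamW(params, lr=lr)
    if name == "adam":
        return torch.optim.Adam(params, lr=lr)
    if name == "sgd":
        # momentum=0.9 is an assumed setting, not from the report
        return torch.optim.SGD(params, lr=lr, momentum=0.9)
    raise ValueError(f"unknown optimizer: {name}")

# The comparison discussed above: AdamW/Adam around lr = 1e-3,
# SGD pushed to a higher lr and still slow to converge.
sweep = [("adamw", 1e-3), ("adam", 1e-3), ("sgd", 1e-1)]
```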
Perplexity at batch_size=1
- We see that NNLM has a huge perplexity at test time if batch size = 1. The reason is that there exists a sentence (probably a very short one) for which the loss is very high (the max per-batch loss). In our dataset it is ~19.
- That means its perplexity is $e^{19} \approx 1.8 \times 10^8$. Keep in mind that our test set is ~3k words. On average, this sample alone contributes $e^{19}/3000 \approx 56{,}000$ to the mean. So despite a good average loss, the average per-sentence perplexity is dominated by the largest perplexities, which come from a few wrong predictions.
- Bad perplexity in this case has no relation to bad loss (brown point); the sketch after this list makes the arithmetic concrete.
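A minimal numeric sketch of that effect, with illustrative loss values (the 5.4 "typical" loss and the exact counts are assumptions, not the actual run data):

```python
import math

# ~3,000 test samples at batch_size = 1, one of which has loss ~19.
losses = [5.4] * 2999 + [19.0]   # 5.4 is an illustrative "typical" loss

# Exponent of the mean loss: barely moved by the outlier.
exp_mean = math.exp(sum(losses) / len(losses))

# Mean of per-sample perplexities: the outlier dominates, since
# exp(19)/3000 is ~59,000 on its own (the report quotes ~56,000,
# so the actual max loss was presumably just under 19).
mean_exp = sum(math.exp(l) for l in losses) / len(losses)

print(f"exp(mean loss)    = {exp_mean:,.0f}")                      # ~222
print(f"mean of exp(loss) = {mean_exp:,.0f}")                      # ~59,700
print(f"outlier alone     = {math.exp(19.0) / len(losses):,.0f}")  # ~59,500
```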
LSTM:
- Adam with lr = 0.0025 is the best hyperparameter setting here.
Perplexity with batch size = 1
- Here also, perplexity correlates more with max_loss than with the average loss. The max loss is ~12, here contributing ~52 to the perplexity.
- Seen through its effect on the average of the exponent (i.e., the perplexity): $e^{12}/3000 \approx 52$. So of the ~275 total perplexity, 52 comes from a single example (checked numerically below).
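The same back-of-the-envelope check for the LSTM, under the same assumption of ~3,000 test samples:

```python
import math

n_samples = 3000   # approximate test-set size, as assumed above
max_loss = 12.0    # worst single-sample loss for the LSTM (~12)

# Contribution of that one sample to the mean per-sample perplexity:
print(math.exp(max_loss) / n_samples)  # ~54, close to the ~52 quoted above
```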
Transformers:
- Best hyperparameters are AdamW with lr = 0.1, trained for 5 epochs.
Perplexity at batch size = 1
- Interestingly, the effect on perplexity is minimal here (but still noticeable). The single-sentence contribution is only ~16 to the perplexity.
Final comparison of perplexity:
| Model | Best perplexity | Perplexity at batch_size = 1 | Biggest single-sentence contribution |
|---|---|---|---|
| NNLM | 224 | 70706 | 56000 |
| LSTM | 157 | 276 | 52 |
| Transformer | 162 | 227 | 23 |
- Everything here is trained with a context length of 5. (The sketch below shows why batching damps the outlier's effect.)
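Why does batch size change the reported perplexity at all? At batch size > 1 the outlier's loss is averaged with the rest of its batch before exponentiating, which damps its effect; at batch size = 1 it is exponentiated alone. A minimal sketch with the same illustrative numbers as above:

```python
import math

losses = [5.4] * 2999 + [19.0]  # illustrative per-sample losses, one outlier

def perplexity(losses, batch_size):
    """Mean over batches of exp(mean loss within the batch)."""
    ppls = []
    for i in range(0, len(losses), batch_size):
        batch = losses[i:i + batch_size]
        ppls.append(math.exp(sum(batch) / len(batch)))
    return sum(ppls) / len(ppls)

print(perplexity(losses, batch_size=32))  # ~223: outlier averaged away in its batch
print(perplexity(losses, batch_size=1))   # ~59,700: exp(19) dominates the mean
```

This is also why the LSTM and Transformer, whose worst-case losses are far smaller, show a much smaller gap between their batched and batch_size = 1 perplexities.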