Grid Search 1
Created on October 19 | Last edited on October 30
We discussed the following grid for small models:
Updates on the small Danish language models: after the above experiment is done, the plan is to train a small encoder to find the right hyperparameters for the larger models. This is the grid I propose we search:
Dataset distribution (nat, twitter, news, dagw):
- [0.50, 0.20, 0.20, 0.10] (default)
- [0.25, 0.25, 0.25, 0.25]
- [0.70, 0.10, 0.10, 0.10]

Learning rate (BERT is the only one that uses 1e-4, so this seems low; the 2e-5 we currently use is definitely too low):
- 1e-4
- 2e-4 (default)
- 6e-4

adam_epsilon (defaults to 1e-8 when unspecified; everyone else uses 1e-6, so probably not a big difference):
- 1e-6 (default)

Gradient clipping (probably better with 0.1):
- 0.0
- 0.1 (default)

Warm-up steps (only RoBERTa uses 24k for small models):
- 10k (default)
- 24k

adam_beta2 (most later papers use 0.98, including RoBERTa, ELECTRA and DeBERTaV3):
- 0.99
- 0.98 (default)

Architecture (which we can currently train):
- RoBERTa
- DeBERTaV2

Tokenizer vocabulary size:
- 32k (default)
- 50k
- 128k

Tokenizer type (unigram is probably better here):
- BPE
- unigram (default)

For these hyperparameters I reviewed DeBERTa v1-3, ELECTRA, RoBERTa and BERT. We could potentially reduce the search space by assuming (some of) the following:
- tokenizer_type: unigram >= BPE
- adam_beta2: 0.98 >= 0.99
- gradient clipping: 0.1 >= 0.0
- learning_rate: [2e-4, 6e-4] >= 1e-4 (especially since we will use a larger batch size than BERT)
- warm-up steps: 10k ~= 24k (approximately equal)
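To make the combinatorics concrete, here is a sketch of the grid as a plain Python dictionary, together with the reduced grid under the assumptions above. The key names and the enumeration helper are illustrative only, not an existing config schema.

```python
from itertools import product

# Proposed search grid (defaults listed first). Values mirror the list above.
GRID = {
    "dataset_distribution": [          # (nat, twitter, news, dagw)
        (0.50, 0.20, 0.20, 0.10),      # default
        (0.25, 0.25, 0.25, 0.25),
        (0.70, 0.10, 0.10, 0.10),
    ],
    "learning_rate": [2e-4, 1e-4, 6e-4],   # 2e-4 default; 2e-5 is too low
    "adam_epsilon": [1e-6],                # fixed; 1e-8 vs 1e-6 likely a small difference
    "gradient_clipping": [0.1, 0.0],       # 0.1 default
    "warmup_steps": [10_000, 24_000],      # 10k default; 24k as in RoBERTa
    "adam_beta2": [0.98, 0.99],            # 0.98 default
    "architecture": ["roberta", "debertav2"],
    "vocab_size": [32_000, 50_000, 128_000],
    "tokenizer_type": ["unigram", "bpe"],  # unigram default
}

# Reduced grid under the assumptions above (unigram >= BPE, 0.98 >= 0.99,
# clipping 0.1 >= 0.0, lr in {2e-4, 6e-4} >= 1e-4, 10k warm-up ~= 24k).
REDUCED = dict(GRID,
               tokenizer_type=["unigram"],
               adam_beta2=[0.98],
               gradient_clipping=[0.1],
               learning_rate=[2e-4, 6e-4],
               warmup_steps=[10_000])

def configurations(grid):
    """Enumerate every hyperparameter combination in the grid."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

print(len(list(configurations(GRID))))     # full grid:    864 runs
print(len(list(configurations(REDUCED))))  # reduced grid:  36 runs
```

Even the reduced grid is 36 configurations, so in practice we would likely vary one group of hyperparameters at a time rather than run the full cross-product.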
Some other things we might try are:
- Changing the activation function (e.g. using GeGLU).
- Similarly, trying RMSNorm, which has also shown better performance for transformers (it is used in Gopher).
- Examining alternative positional embeddings (we already do this to some extent with DeBERTaV2).
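As a reference point for the activation and normalisation ideas, here is a minimal PyTorch sketch of a GeGLU feed-forward block and an RMSNorm module. These are generic implementations written for this note, not code taken from any of the cited papers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeGLU(nn.Module):
    """Feed-forward block with a GELU-gated linear unit (GeGLU)."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.proj = nn.Linear(d_model, 2 * d_ff)  # value and gate in one projection
        self.out = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        value, gate = self.proj(x).chunk(2, dim=-1)
        return self.out(value * F.gelu(gate))

class RMSNorm(nn.Module):
    """Root-mean-square layer norm (no mean subtraction, no bias)."""
    def __init__(self, d_model: int, eps: float = 1e-8):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.scale

x = torch.randn(2, 16, 256)       # (batch, seq, d_model)
print(GeGLU(256, 1024)(x).shape)  # torch.Size([2, 16, 256])
print(RMSNorm(256)(x).shape)      # torch.Size([2, 16, 256])
```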
Experiments
Here we experiment with 10-hour runs using 4 RTX 8000 GPUs:
Run set: 21 runs
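For the 10-hour budget, one way to enforce the limit is a callback that stops training once the wall-clock budget is exhausted. This is a minimal sketch assuming the HuggingFace Trainer API; the callback name and the exact budget handling are ours.

```python
import time
from transformers import TrainerCallback

class WallClockLimitCallback(TrainerCallback):
    """Stop training gracefully once a wall-clock budget is exhausted."""
    def __init__(self, max_hours: float = 10.0):
        self.max_seconds = max_hours * 3600
        self.start = None

    def on_train_begin(self, args, state, control, **kwargs):
        self.start = time.time()

    def on_step_end(self, args, state, control, **kwargs):
        if time.time() - self.start > self.max_seconds:
            control.should_training_stop = True  # Trainer exits the training loop after this step

# usage (assuming an existing Trainer instance):
# trainer.add_callback(WallClockLimitCallback(max_hours=10))
```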