Grid Search 1
Created on October 19 | Last edited on October 30
We discussed the following grid for small models:
Updates on the small Danish language models: after the above experiment is done, the plan is to train a small encoder to find the right hyperparameters for the larger models. This is the grid I propose we search:
Dataset distribution (nat, twitter, news, dagw):
- [0.50, 0.20, 0.20, 0.10] (default)
- [0.25, 0.25, 0.25, 0.25]
- [0.70, 0.10, 0.10, 0.10]

Learning rate (BERT is the only one that uses 1e-4, so this seems low; the 2e-5 we currently use is definitely too low):
- 1e-4
- 2e-4 (default)
- 6e-4

adam_epsilon (defaults to 1e-8 when unspecified; everyone else uses 1e-6, so probably not a big difference):
- 1e-6 (default)

Gradient clipping (probably better with 0.1):
- 0.0
- 0.1 (default)

Warm-up steps (only RoBERTa uses 24k for small models):
- 10k (default)
- 24k

adam_beta2 (most later papers use 0.98, including RoBERTa, ELECTRA and DeBERTaV3):
- 0.99
- 0.98 (default)

Architecture (which we can currently train):
- RoBERTa
- DeBERTaV2

Tokenizer vocabulary size:
- 32k (default)
- 50k
- 128k

Tokenizer type (unigram is probably better here):
- BPE
- unigram (default)

For these hyperparameters I reviewed DeBERTa v1-3, ELECTRA, RoBERTa and BERT. We could potentially reduce the search space by assuming (some of) the following:
- tokenizer_type: unigram >= BPE
- adam_beta2: 0.98 >= 0.99
- gradient clipping: 0.1 >= 0.0
- learning_rate: [2e-4, 6e-4] >= 1e-4 (especially since we will use a larger batch size than BERT)
- warm-up steps: 10k ~= 24k (approximately equal)
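To make the combinatorics concrete, here is a sketch of the grid as a plain Python dictionary, together with the reduced grid under the assumptions above. The key names and the enumeration helper are illustrative only, not an existing config schema.

```python
from itertools import product

# Proposed search grid (defaults listed first). Values mirror the list above.
GRID = {
    "dataset_distribution": [          # (nat, twitter, news, dagw)
        (0.50, 0.20, 0.20, 0.10),      # default
        (0.25, 0.25, 0.25, 0.25),
        (0.70, 0.10, 0.10, 0.10),
    ],
    "learning_rate": [2e-4, 1e-4, 6e-4],   # 2e-4 default; 2e-5 is too low
    "adam_epsilon": [1e-6],                # fixed; 1e-8 vs 1e-6 likely a small difference
    "gradient_clipping": [0.1, 0.0],       # 0.1 default
    "warmup_steps": [10_000, 24_000],      # 10k default; 24k as in RoBERTa
    "adam_beta2": [0.98, 0.99],            # 0.98 default
    "architecture": ["roberta", "debertav2"],
    "vocab_size": [32_000, 50_000, 128_000],
    "tokenizer_type": ["unigram", "bpe"],  # unigram default
}

# Reduced grid under the assumptions above (unigram >= BPE, 0.98 >= 0.99,
# clipping 0.1 >= 0.0, lr in {2e-4, 6e-4} >= 1e-4, 10k warm-up ~= 24k).
REDUCED = dict(GRID,
               tokenizer_type=["unigram"],
               adam_beta2=[0.98],
               gradient_clipping=[0.1],
               learning_rate=[2e-4, 6e-4],
               warmup_steps=[10_000])

def configurations(grid):
    """Enumerate every hyperparameter combination in the grid."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

print(len(list(configurations(GRID))))     # full grid:    864 runs
print(len(list(configurations(REDUCED))))  # reduced grid:  36 runs
```

Even the reduced grid is 36 configurations, so in practice we would likely vary one group of hyperparameters at a time rather than run the full cross-product.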
Some other things we might try are:
- Changing the activation function (e.g. using GeGLU).
- Similarly, trying RMSNorm, which has also shown better performance for transformers (it is used in Gopher).
- Examining alternative positional embeddings (we already do this to some extent with DeBERTaV2).
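As a reference point for the activation and normalisation ideas, here is a minimal PyTorch sketch of a GeGLU feed-forward block and an RMSNorm module. These are generic implementations written for this note, not code taken from any of the cited papers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeGLU(nn.Module):
    """Feed-forward block with a GELU-gated linear unit (GeGLU)."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.proj = nn.Linear(d_model, 2 * d_ff)  # value and gate in one projection
        self.out = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        value, gate = self.proj(x).chunk(2, dim=-1)
        return self.out(value * F.gelu(gate))

class RMSNorm(nn.Module):
    """Root-mean-square layer norm (no mean subtraction, no bias)."""
    def __init__(self, d_model: int, eps: float = 1e-8):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.scale

x = torch.randn(2, 16, 256)       # (batch, seq, d_model)
print(GeGLU(256, 1024)(x).shape)  # torch.Size([2, 16, 256])
print(RMSNorm(256)(x).shape)      # torch.Size([2, 16, 256])
```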
Experiments
Here we experiment with 10-hour runs using 4 RTX 8000 GPUs:
Run set: 21 runs
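For the 10-hour budget, one way to enforce the limit is a callback that stops training once the wall-clock budget is exhausted. This is a minimal sketch assuming the HuggingFace Trainer API; the callback name and the exact budget handling are ours.

```python
import time
from transformers import TrainerCallback

class WallClockLimitCallback(TrainerCallback):
    """Stop training gracefully once a wall-clock budget is exhausted."""
    def __init__(self, max_hours: float = 10.0):
        self.max_seconds = max_hours * 3600
        self.start = None

    def on_train_begin(self, args, state, control, **kwargs):
        self.start = time.time()

    def on_step_end(self, args, state, control, **kwargs):
        if time.time() - self.start > self.max_seconds:
            control.should_training_stop = True  # Trainer exits the training loop after this step

# usage (assuming an existing Trainer instance):
# trainer.add_callback(WallClockLimitCallback(max_hours=10))
```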