
Grid Search 1

We discussed the following grid for the small Danish language models. Once the above experiment is done, the plan is to train a small-sized encoder to find the right hyperparameters for the larger models. This is the grid I propose we search (a Python sketch enumerating it follows the list):
  • dataset distributions (nat, twitter, news, dagw):
    • [0.50, 0.20, 0.20, 0.10] (default)
    • [0.25, 0.25, 0.25, 0.25]
    • [0.70, 0.10, 0.10, 0.10]
  • adjust learning rate (BERT is the only one that uses 1e-4, so this seems low; we currently use 2e-5, which is definitely too low)
    • 1e-4
    • 2e-4 (default)
    • 6e-4
  • adam_epsilon (the default when unspecified is 1e-8; everyone else uses 1e-6, probably not a big difference):
    • 1e-6 (default)
  • gradient clipping (probably better with 0.1)
    • 0.0
    • 0.1 (default)
  • warm-up steps (only RoBERTa uses 24k for small models):
    • 10k (default)
    • 24k
  • adam_beta2 (most later papers use 0.98, including RoBERTa, ELECTRA, and DeBERTaV3)
    • 0.99
    • 0.98 (default)
  • architectures (which we can currently train)
    • RoBERTa
    • DeBERTaV2
  • vocabulary size of tokenizer
    • 32k (default)
    • 50k
    • 128k
  • tokenizer type (here unigram is probably better)
    • BPE
    • unigram (default)
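To make the size of this search explicit, here is a minimal Python sketch that writes the grid out as a dictionary and enumerates it with itertools.product. The parameter names and value encodings are placeholders chosen for readability, not the names used in our actual training config.

```python
from itertools import product

# Hypothetical grid definition; names and value encodings are illustrative only.
GRID = {
    # sampling proportions over (nat, twitter, news, dagw)
    "dataset_distribution": [
        (0.50, 0.20, 0.20, 0.10),  # default
        (0.25, 0.25, 0.25, 0.25),
        (0.70, 0.10, 0.10, 0.10),
    ],
    "learning_rate": [1e-4, 2e-4, 6e-4],       # 2e-4 default
    "adam_epsilon": [1e-6],                    # fixed
    "max_grad_norm": [0.0, 0.1],               # gradient clipping, 0.1 default
    "warmup_steps": [10_000, 24_000],          # 10k default
    "adam_beta2": [0.99, 0.98],                # 0.98 default
    "architecture": ["roberta", "deberta-v2"],
    "vocab_size": [32_000, 50_000, 128_000],   # 32k default
    "tokenizer_type": ["bpe", "unigram"],      # unigram default
}

# Full factorial enumeration of the grid.
configs = [dict(zip(GRID, values)) for values in product(*GRID.values())]
print(len(configs))  # 3 * 3 * 1 * 2 * 2 * 2 * 2 * 3 * 2 = 864 combinations
```

A full factorial search over this grid is 864 configurations, which is why the reductions discussed below matter.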
For these hyperparameters I reviewed DeBERTa v1-v3, ELECTRA, RoBERTa, and BERT.
We could potentially reduce the search space by assuming (some of) the following; a sketch of the resulting reduced grid follows the list:
tokenizer_type: unigram >= BPE
adam_beta2: 0.98 >= 0.99
gradient clipping: 0.1 >= 0.0
learning_rate: [2e-4, 6e-4] >= 1e-4 # especially since we will use a larger batch size than BERT
warm-up steps: 10k ~= 24k # approximately equal
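A minimal sketch, assuming we adopt all of the orderings above and keep the defaults where the options are roughly equal, of what the reduced grid would look like; the names mirror the hypothetical dictionary in the earlier sketch.

```python
from itertools import product

# Dimensions that remain varied after applying the assumptions above.
REDUCED_GRID = {
    "dataset_distribution": [
        (0.50, 0.20, 0.20, 0.10),
        (0.25, 0.25, 0.25, 0.25),
        (0.70, 0.10, 0.10, 0.10),
    ],
    "learning_rate": [2e-4, 6e-4],             # drop 1e-4
    "architecture": ["roberta", "deberta-v2"],
    "vocab_size": [32_000, 50_000, 128_000],
}

# Dimensions fixed by the assumptions (unigram >= BPE, 0.98 >= 0.99,
# clipping 0.1 >= 0.0, and 10k ~= 24k warm-up, so we keep the default).
FIXED = {
    "adam_epsilon": 1e-6,
    "max_grad_norm": 0.1,
    "warmup_steps": 10_000,
    "adam_beta2": 0.98,
    "tokenizer_type": "unigram",
}

reduced_configs = [
    {**dict(zip(REDUCED_GRID, values)), **FIXED}
    for values in product(*REDUCED_GRID.values())
]
print(len(reduced_configs))  # 3 * 2 * 2 * 3 = 36 combinations
```

That takes the full factorial search from 864 down to 36 configurations.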
Some other things we might try are:
  • Changing the activation function (e.g. using GeGLU; see the sketch after this list)
  • Similarly, trying RMSNorm, which also shows better performance for transformers (it is also used in Gopher).
  • We could also examine alternative positional embeddings (we already do this to some extent with DeBERTaV2).
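For concreteness, here is a short PyTorch sketch of the two alternatives mentioned in the list, a GeGLU feed-forward block and RMSNorm. It is a standalone illustration, not code from our training setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GeGLU(nn.Module):
    """Feed-forward block with a GELU-gated linear unit (GeGLU)."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.proj = nn.Linear(d_model, 2 * d_ff)  # value and gate in one matmul
        self.out = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        value, gate = self.proj(x).chunk(2, dim=-1)
        return self.out(value * F.gelu(gate))


class RMSNorm(nn.Module):
    """Root-mean-square layer norm: no mean centering and no bias term."""

    def __init__(self, d_model: int, eps: float = 1e-8):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * inv_rms * self.weight


# Tiny smoke test with dummy dimensions.
x = torch.randn(2, 8, 64)
print(RMSNorm(64)(GeGLU(64, 256)(x)).shape)  # torch.Size([2, 8, 64])
```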

Experiments



Here we experiment with 10-hour runs using 4 RTX 8000 GPUs:

[W&B run set: 21 runs]