MLM: whole word masking?
A comparison of a handful of masked language modeling (MLM) runs using a 6-layer MEGA architecture.
Peter Szemraj
Created on January 16 | Last edited on January 16
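For readers unfamiliar with the question in the title, the following is a minimal sketch of the difference between token-level and whole-word masking. It is not the code used for these runs. It assumes WordPiece-style tokens where a `##` prefix marks a continuation piece, and it reads the `MR0.40` and `noWW` fragments of the run names as "mask ratio 0.40" and "no whole-word masking", which is an interpretation rather than something the report states. The helper names `group_wordpieces` and `mask_tokens` are hypothetical.

```python
import random


def group_wordpieces(tokens):
    """Group token indices into words; '##' pieces continue the previous word.

    Assumption: WordPiece-style continuation markers. Other tokenizers
    mark word boundaries differently.
    """
    groups = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and groups:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups


def mask_tokens(tokens, mask_ratio=0.40, whole_word=True, seed=0,
                mask_token="[MASK]"):
    """Return a copy of `tokens` with roughly `mask_ratio` of them masked.

    whole_word=True masks all pieces of a selected word together;
    whole_word=False selects each token independently of its neighbours.
    Simplification: every selected token becomes `mask_token` (no random
    or unchanged replacements).
    """
    rng = random.Random(seed)
    if whole_word:
        units = group_wordpieces(tokens)
    else:
        units = [[i] for i in range(len(tokens))]
    rng.shuffle(units)

    budget = max(1, round(len(tokens) * mask_ratio))
    chosen = set()
    for unit in units:
        # Skip any unit that would push us past the masking budget.
        if len(chosen) + len(unit) > budget:
            continue
        chosen.update(unit)
    return [mask_token if i in chosen else t for i, t in enumerate(tokens)]


if __name__ == "__main__":
    toks = ["the", "un", "##believ", "##able", "story", "was", "re", "##told"]
    print(mask_tokens(toks, whole_word=True))
    print(mask_tokens(toks, whole_word=False))
```

With whole-word masking a word such as `un ##believ ##able` is either fully masked or left intact, so the model cannot recover a masked piece from its visible sibling pieces. With token-level masking, individual pieces may be masked on their own.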
Charts
eval
[Chart: eval/accuracy vs. train/epoch]
[Chart: eval/loss vs. train/epoch]
[Chart: eval/samples_per_second vs. train/epoch]
Runs shown in each chart (legend):
mega-encoder-6-v0-simple_wikipedia_LM_1024-noWW-drop
mega-encoder-6-v0-MR0.40-1024ctx-noWW (listed twice)
mega-encoder-6-v0-MR0.40-1024ctx-vN