929 MoE
MoE gains (OLMoE replication) [05-05]
Shared Experts [04-29]
Training setup
- TPU: v5e-256
- Data: 42B tokens
- Models:
- Mixtral 8x8b (1 shared expert, 1 routed expert)
- Mixtral 8x8b (2 routed experts)
Results
- Training loss (with the LBL subtracted) is similar to the earlier runs
- Much better MFU (a 6% increase)
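For reference, a minimal JAX sketch (not the actual training code) of the shared-expert variant above: every token goes through one always-on shared expert plus one routed expert picked by a top-1 router, while the baseline keeps the top-2 routed experts instead. The layer shapes, gated-MLP expert, and gating convention are assumptions.

```python
import jax
import jax.numpy as jnp


def expert_mlp(params, x):
    """Gated MLP expert (assumed structure); params = (w_gate, w_in, w_out)."""
    w_gate, w_in, w_out = params
    return (jax.nn.silu(x @ w_gate) * (x @ w_in)) @ w_out


def shared_plus_routed(params, x):
    """x: [tokens, d_model] -> [tokens, d_model]."""
    # Shared expert: applied to every token, so it needs no routing
    # (and no load balancing).
    shared_out = expert_mlp(params["shared"], x)

    # Routed experts: a top-1 router picks a single expert per token.
    logits = x @ params["router"]                      # [tokens, n_experts]
    probs = jax.nn.softmax(logits, axis=-1)
    top_idx = jnp.argmax(probs, axis=-1)               # [tokens]
    top_prob = jnp.take_along_axis(probs, top_idx[:, None], axis=-1)  # [tokens, 1]

    # Dense formulation for clarity (run every expert, keep one); a real
    # implementation dispatches tokens to experts instead.
    all_out = jnp.stack([expert_mlp(p, x) for p in params["experts"]], axis=1)
    routed_out = jnp.take_along_axis(all_out, top_idx[:, None, None], axis=1)
    routed_out = routed_out.squeeze(1) * top_prob      # gate by router prob (one common convention)

    # The 2-routed-expert baseline drops the shared expert and instead keeps
    # the top-2 experts from the same router.
    return shared_out + routed_out


def init_params(key, d_model=64, d_ff=128, n_experts=8):
    def init_expert(k):
        k1, k2, k3 = jax.random.split(k, 3)
        return (0.02 * jax.random.normal(k1, (d_model, d_ff)),
                0.02 * jax.random.normal(k2, (d_model, d_ff)),
                0.02 * jax.random.normal(k3, (d_ff, d_model)))

    keys = jax.random.split(key, n_experts + 2)
    return {
        "shared": init_expert(keys[0]),
        "experts": [init_expert(k) for k in keys[1:-1]],
        "router": 0.02 * jax.random.normal(keys[-1], (d_model, n_experts)),
    }


# Usage: output has the same shape as the input.
params = init_params(jax.random.PRNGKey(0))
y = shared_plus_routed(params, jnp.ones((16, 64)))
```

A practical reason the shared-expert version can reach a higher MFU is that the shared expert is a plain dense matmul with no routing or dispatch overhead.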
Load balancing [04-15]
Training setup
- TPU: v5e-256
- Data: 4B & 42B token
- Models:
- Mixtral 8x8b with load balancing loss
- Mixtral 8x8b without load balancing loss
Results
- The blue run (with LBL) achieves an almost balanced load across the layers, except for the first layer. Overall, the blue run achieves a higher routing entropy compared to the red run.
- We can see that the blue training loss is lower than the red training loss, indicating that the LBL helps the model. We can also observe that the LBL converges at around 30% of the training run.
This difference is less obvious in longer runs (42B tokens):
- Note that the actual training loss of the purple run should have the load balancing loss (~0.16) subtracted from the plotted value; after the subtraction, the purple run is ~0.01 better than the green run.
- The per-layer routing entropies show that the green run is imbalanced.
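As a reference for the comparison above, here is a minimal sketch of an auxiliary load-balancing loss (an assumed Switch-Transformer-style formulation, not necessarily the exact loss in the training code) and of the per-layer routing-entropy diagnostic; the loss coefficient and per-layer aggregation are left out.

```python
import jax
import jax.numpy as jnp


def load_balancing_loss(router_logits, topk_indices, n_experts):
    """router_logits: [tokens, n_experts]; topk_indices: [tokens, k]."""
    probs = jax.nn.softmax(router_logits, axis=-1)
    # f_i: fraction of tokens dispatched to expert i (counting top-k choices).
    dispatch = jax.nn.one_hot(topk_indices, n_experts).sum(axis=1)  # [tokens, n_experts]
    f = dispatch.mean(axis=0)
    # p_i: mean router probability assigned to expert i.
    p = probs.mean(axis=0)
    # Roughly minimized (value k) when routing is perfectly balanced.
    return n_experts * jnp.sum(f * p)


def routing_entropy(router_logits):
    """Entropy of the batch-averaged routing distribution.

    log(n_experts) means a perfectly balanced load; low values mean the
    router collapses onto a few experts, the imbalance seen without LBL.
    """
    p = jax.nn.softmax(router_logits, axis=-1).mean(axis=0)
    return -jnp.sum(p * jnp.log(p + 1e-9))
```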
Initial experiment (PoC) [04-10]
Training setup
- TPU: v5e-256
- Data: 42B tokens
- Models:
- Mixtral 8x8b (2 activated experts) -> # activated parameters = ~13b
- Llama 13b (dense)
Results
- We can see that the two models require approximately the same FLOPs (Mixtral needs ~6% more).
- With the same number of steps, Mixtral is better.
- With the same amount of FLOPs, Mixtral is better.
- With the same amount of wall-clock time, Llama is better. This is because the MoE implementation has a much lower MFU (the sparse model's MFU is 33% worse than the dense model's).
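A back-of-the-envelope sketch of the FLOPs matching above, using the standard ~6·N·D training-FLOPs approximation; the activated-parameter numbers below are illustrative assumptions (the ~6% gap is taken from the plots), not the exact model configs.

```python
# Rough compute comparison (illustrative numbers, not exact model configs).
# Common approximation: training FLOPs ~= 6 * activated_params * tokens.

def train_flops(activated_params: float, tokens: float) -> float:
    return 6.0 * activated_params * tokens

TOKENS = 42e9                     # 42B-token run
DENSE_PARAMS = 13e9               # Llama 13b: all parameters are activated
MOE_ACTIVE_PARAMS = 1.06 * 13e9   # assumed: Mixtral 8x8b top-2, ~6% more activated compute

dense_flops = train_flops(DENSE_PARAMS, TOKENS)
moe_flops = train_flops(MOE_ACTIVE_PARAMS, TOKENS)
print(f"MoE / dense FLOPs ratio: {moe_flops / dense_flops:.2f}")  # ~1.06

# Wall-clock time ~= FLOPs / (peak_flops * MFU), so a ~33% lower MFU for the
# sparse model outweighs the near-equal FLOPs and makes the dense run faster
# per unit of time, matching the observation above.
```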