929 MoE
MoE gains (OLMoE replication) [05-05]
Shared Experts [04-29]
Training setup
- TPU: v5e-256
- Data: 42B tokens
- Models:
- Mixtral 8x8b (1 shared expert, 1 routed expert)
- Mixtral 8x8b (2 routed experts)
Results
- Training loss (with the LBL subtracted) is similar to the earlier runs
- Much better MFU (a 6% increase)
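For reference, a minimal JAX sketch (not the actual training code) of the shared-expert variant above: every token goes through one always-on shared expert plus one routed expert picked by a top-1 router, while the baseline keeps the top-2 routed experts instead. The layer shapes, gated-MLP expert, and gating convention are assumptions.

```python
import jax
import jax.numpy as jnp


def expert_mlp(params, x):
    """Gated MLP expert (assumed structure); params = (w_gate, w_in, w_out)."""
    w_gate, w_in, w_out = params
    return (jax.nn.silu(x @ w_gate) * (x @ w_in)) @ w_out


def shared_plus_routed(params, x):
    """x: [tokens, d_model] -> [tokens, d_model]."""
    # Shared expert: applied to every token, so it needs no routing
    # (and no load balancing).
    shared_out = expert_mlp(params["shared"], x)

    # Routed experts: a top-1 router picks a single expert per token.
    logits = x @ params["router"]                      # [tokens, n_experts]
    probs = jax.nn.softmax(logits, axis=-1)
    top_idx = jnp.argmax(probs, axis=-1)               # [tokens]
    top_prob = jnp.take_along_axis(probs, top_idx[:, None], axis=-1)  # [tokens, 1]

    # Dense formulation for clarity (run every expert, keep one); a real
    # implementation dispatches tokens to experts instead.
    all_out = jnp.stack([expert_mlp(p, x) for p in params["experts"]], axis=1)
    routed_out = jnp.take_along_axis(all_out, top_idx[:, None, None], axis=1)
    routed_out = routed_out.squeeze(1) * top_prob      # gate by router prob (one common convention)

    # The 2-routed-expert baseline drops the shared expert and instead keeps
    # the top-2 experts from the same router.
    return shared_out + routed_out


def init_params(key, d_model=64, d_ff=128, n_experts=8):
    def init_expert(k):
        k1, k2, k3 = jax.random.split(k, 3)
        return (0.02 * jax.random.normal(k1, (d_model, d_ff)),
                0.02 * jax.random.normal(k2, (d_model, d_ff)),
                0.02 * jax.random.normal(k3, (d_ff, d_model)))

    keys = jax.random.split(key, n_experts + 2)
    return {
        "shared": init_expert(keys[0]),
        "experts": [init_expert(k) for k in keys[1:-1]],
        "router": 0.02 * jax.random.normal(keys[-1], (d_model, n_experts)),
    }


# Usage: output has the same shape as the input.
params = init_params(jax.random.PRNGKey(0))
y = shared_plus_routed(params, jnp.ones((16, 64)))
```

A practical reason the shared-expert version can reach a higher MFU is that the shared expert is a plain dense matmul with no routing or dispatch overhead.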
Load balancing [04-15]
Training setup
- TPU: v5e-256
- Data: 4B & 42B token
- Models:
- Mixtral 8x8b with load balancing loss
- Mixtral 8x8b without load balancing loss
Results
- The blue run (with LBL) achieves an almost balanced load across the layers, except for the first layer. Overall, the blue run achieves a higher routing entropy compared to the red run.
- We can see that the blue training loss is lower than the red training loss, indicating that the LBL helps the model. We can also observe that the LBL converges at around 30% of the training run.
This difference is less obvious in longer runs (42B tokens):
- Note that the actual training loss of the purple run should have the load balancing loss (~0.16) subtracted from the plotted value; after the subtraction, the purple run is ~0.01 better than the green run.
- The per-layer routing entropies show that the green run is imbalanced.
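As a reference for the comparison above, here is a minimal sketch of an auxiliary load-balancing loss (an assumed Switch-Transformer-style formulation, not necessarily the exact loss in the training code) and of the per-layer routing-entropy diagnostic; the loss coefficient and per-layer aggregation are left out.

```python
import jax
import jax.numpy as jnp


def load_balancing_loss(router_logits, topk_indices, n_experts):
    """router_logits: [tokens, n_experts]; topk_indices: [tokens, k]."""
    probs = jax.nn.softmax(router_logits, axis=-1)
    # f_i: fraction of tokens dispatched to expert i (counting top-k choices).
    dispatch = jax.nn.one_hot(topk_indices, n_experts).sum(axis=1)  # [tokens, n_experts]
    f = dispatch.mean(axis=0)
    # p_i: mean router probability assigned to expert i.
    p = probs.mean(axis=0)
    # Roughly minimized (value k) when routing is perfectly balanced.
    return n_experts * jnp.sum(f * p)


def routing_entropy(router_logits):
    """Entropy of the batch-averaged routing distribution.

    log(n_experts) means a perfectly balanced load; low values mean the
    router collapses onto a few experts, the imbalance seen without LBL.
    """
    p = jax.nn.softmax(router_logits, axis=-1).mean(axis=0)
    return -jnp.sum(p * jnp.log(p + 1e-9))
```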
Initial experiment (PoC) [04-10]
Training setup
- TPU: v5e-256
- Data: 42B tokens
- Models:
- Mixtral 8x8b (2 activated experts) -> # activated parameters = ~13b
- Llama 13b (dense)
Results
- We can see that the two models require approximately the same FLOPs (Mixtral needs ~6% more).
- With the same number of steps, Mixtral is better.
- With the same amount of FLOPs, Mixtral is better.
- With the same amount of wall-clock time, Llama is better. This is because the MoE implementation has a much lower MFU (the sparse model's MFU is 33% worse than the dense model's).
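A back-of-the-envelope sketch of the FLOPs matching above, using the standard ~6·N·D training-FLOPs approximation; the activated-parameter numbers below are illustrative assumptions (the ~6% gap is taken from the plots), not the exact model configs.

```python
# Rough compute comparison (illustrative numbers, not exact model configs).
# Common approximation: training FLOPs ~= 6 * activated_params * tokens.

def train_flops(activated_params: float, tokens: float) -> float:
    return 6.0 * activated_params * tokens

TOKENS = 42e9                     # 42B-token run
DENSE_PARAMS = 13e9               # Llama 13b: all parameters are activated
MOE_ACTIVE_PARAMS = 1.06 * 13e9   # assumed: Mixtral 8x8b top-2, ~6% more activated compute

dense_flops = train_flops(DENSE_PARAMS, TOKENS)
moe_flops = train_flops(MOE_ACTIVE_PARAMS, TOKENS)
print(f"MoE / dense FLOPs ratio: {moe_flops / dense_flops:.2f}")  # ~1.06

# Wall-clock time ~= FLOPs / (peak_flops * MFU), so a ~33% lower MFU for the
# sparse model outweighs the near-equal FLOPs and makes the dense run faster
# per unit of time, matching the observation above.
```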