MoE vs Dense vs Hybrid LLM architectures
Training 600M MoE, Dense, and Hybrid large language model architectures.
LLMs are based on transformer architectures. The underlying transformer is a stack of layers (encoder and/or decoder blocks), and how we construct those layers leads to three interesting transformer architectures:
The Mixture of Experts (MoE) transformer architecture
The Mixture of Experts (MoE) transformer architecture allows increasing the model's size, and with it the output quality, without a proportional increase in compute cost or loss of inference speed. Instead of dense transformer feed-forward layers, an MoE uses sparse Feedforward Neural Network (FFN) layers (aka experts) and a gate network (router) that determines which experts each token is sent to; usually the top-k experts are selected.
Mixture of Experts models can scale to larger sizes for better quality without significantly raising compute costs, because only a subset of experts (the top-k) is activated for each token regardless of the total number of experts. Since an MoE has far fewer active parameters at inference than a dense transformer of comparable total size, which activates all of its parameters, it is relatively more computationally economical.
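To make the routing concrete, here is a minimal PyTorch sketch of a top-k gated MoE layer. It illustrates the idea described above rather than any particular library's implementation; the expert FFN shape and GELU activation are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Sparse FFN layer: a router picks top-k experts per token (illustrative sketch)."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 6, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)  # gate network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> flatten tokens for routing
        tokens = x.reshape(-1, x.shape[-1])
        logits = self.router(tokens)                      # (n_tokens, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)    # select top-k experts per token
        weights = F.softmax(weights, dim=-1)              # normalize the selected gate scores
        out = torch.zeros_like(tokens)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                     # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(tokens[mask])
        return out.reshape_as(x)
```

With top_k=2 and, say, 6 experts, only a third of the expert weights are touched per token, which is where the compute savings come from.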
The Dense transformer architecture
The Dense transformer architecture is the one described in the groundbreaking paper "Attention Is All You Need". The encoder stack consists of layers, each of which applies multi-head attention (to identify contextual dependencies), normalization with residual connections, and a fully connected feed-forward network; the output of the final encoder layer is fed into the decoder's cross-attention. The decoder stack mirrors this structure with masked self-attention, cross-attention, and feed-forward networks, and is responsible for next-token prediction.
Increasing the model's size improves output quality. However, because every parameter is active for every token (and multi-head attention additionally scales quadratically with sequence length), scaling the traditional dense transformer architecture can be compute-costly for a given number of parameters.
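For contrast, here is a minimal sketch of a dense transformer block in the same style; the pre-norm layout and GELU activation are assumptions, not the exact configuration from the paper. The key difference from the MoE sketch is that every parameter participates in processing every token.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Standard dense transformer block: attention + fully connected FFN (illustrative sketch)."""

    def __init__(self, d_model: int = 2048, n_heads: int = 8, mlp_ratio: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, mlp_ratio * d_model),
            nn.GELU(),
            nn.Linear(mlp_ratio * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # every parameter (attention and MLP) is active for every token;
        # causal masking is omitted for brevity
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))
```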
Hybrid-MoE transformer architecture
The more expert choices a Mixture of Experts has, the higher the quality; however, it becomes inefficient due to high all-to-all communication overhead. A Hybrid-MoE overlaps this communication with computation by combining a residual MoE with a dense transformer, making training faster.
A typical MoE at batch size 1 can incur significant latency just reading its active parameters from memory. In contrast, a Hybrid-MoE can be more efficient than an equivalent vanilla MoE or a Dense transformer, and Hybrid-MoEs can also handle larger batch sizes for faster inference.
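Here is a minimal sketch of the residual ("hybrid") idea, reusing the MoELayer sketch from above: a dense MLP path runs for every token, and the sparse MoE path is added as a residual correction, so the MoE's all-to-all communication can overlap with the dense computation. This is an illustration of the concept, not Arctic's or any framework's actual code.

```python
import torch
import torch.nn as nn

class HybridMoEBlock(nn.Module):
    """Residual/hybrid MoE feed-forward block: dense MLP + sparse MoE correction (sketch)."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 6, top_k: int = 2):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dense_mlp = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model)
        )
        self.moe = MoELayer(d_model, d_hidden, num_experts, top_k)  # top-k gated sketch from earlier

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        # the dense path and the sparse (MoE) path are independent, so the MoE's
        # all-to-all communication can be overlapped with the dense computation
        return x + self.dense_mlp(h) + self.moe(h)
```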
Snowflake’s MoE experiments
Snowflake AI Research recently conducted a series of experiments comparing MoE and Dense transformers, studying how to optimize MoE training and inference, and in particular training a high-performing, low-cost Hybrid-MoE model.
Optimum MoE architecture
- Snowflake studied the ideal number of experts, the size of each expert, and the number (top-k) of active experts. They compared the popular top-1 and top-2 gating (the top_k knob in the MoE sketch above) under fair and controlled conditions, found top-2 gating more effective, and concluded that increasing the number of experts improves model quality but also incurs higher communication costs.

- They also studied the frequency of MoE layers relative to the standard transformer-decoder layer within the model, comparing an MoE layer in every transformer layer against one in every alternate layer (a sketch of this interleaving follows the list).
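A hypothetical sketch of the layer-frequency choice: decide per layer index whether the feed-forward sub-layer is a dense MLP or an MoE, so that MoE layers appear either in every layer or only in every alternate layer. The helper name and defaults are illustrative assumptions.

```python
import torch.nn as nn

def ffn_for_layer(i: int, d_model: int, d_hidden: int, moe_every: int = 2) -> nn.Module:
    """Feed-forward sub-layer for transformer layer i.

    moe_every=1 places an MoE FFN in every layer;
    moe_every=2 places it only in every alternate layer.
    """
    if i % moe_every == moe_every - 1:
        return MoELayer(d_model, d_hidden, num_experts=6, top_k=2)  # sketch from above
    return nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
```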

Snowflake Arctic: A Hybrid-MoE
Snowflake recently open-sourced Arctic, a Dense-Hybrid-MoE transformer-based LLM that is highly efficient and performs better than the Llama3-8B and Llama3-70B models on enterprise metrics (HumanEval+/MBPP+, SQL-Spider, instruction following IFEval) while saving 17x compute.
- Arctic combines a 10B Dense transformer with a residual 128x3.66B MoE, for a total model size of about 480B parameters. Selecting the top-2 experts yields roughly 17B active parameters (see the back-of-the-envelope check after this list). Unlike traditional MoEs with few experts (4 or 8), Arctic spreads its capacity across 128 experts to enhance the model's capability for top-tier intelligence. Combining it with a dense transformer also gives good training efficiency by significantly reducing the relative communication overhead.
- Arctic's inference capabilities are also impressive: it can handle larger batch sizes at high throughput (on the order of 1,000 tokens/s) while saving 4x compute costs compared to CodeLlama-70B and Llama3-70B.
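A quick back-of-the-envelope check of the parameter counts quoted above (rounded, ignoring embeddings and other small terms):

```python
# Rough check of the Arctic numbers: 10B dense + 128 experts of ~3.66B each, top-2 gating.
dense_params = 10e9          # dense transformer
expert_params = 3.66e9       # per-expert size
num_experts, top_k = 128, 2

total = dense_params + num_experts * expert_params   # ~478B, i.e. the ~480B total quoted
active = dense_params + top_k * expert_params        # ~17.3B active parameters per token
print(f"total ≈ {total / 1e9:.0f}B, active ≈ {active / 1e9:.1f}B")
```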
Comparing 600M Dense/ MoE / Hybrid-MoE models
I also ran experiments comparing the performance and cost of 600M-parameter models based on Dense, MoE, and Hybrid-MoE transformers in an equivalent setting on 2x A40 GPUs. The architectural configuration for each of the three models is listed below:
Dense Model Configuration
Params: 589561856
d_model: 2048
n_heads: 8
n_layers: 8
mlp_ratio: 8
n_kv_heads: 2
MoE Model Configuration
Params: 520261632
d_model: 1024
n_heads: 8
n_layers: 6
mlp_ratio: 8
n_kv_heads: 2
moe_num_experts: 6
moe_top_k: 2
Hybrid MoE Model Configuration
Params: 595759104
d_model: 1024
n_heads: 8
n_layers: 6
mlp_ratio: 8
n_kv_heads: 2
moe_num_experts: 6
moe_top_k: 2
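For easier side-by-side comparison, the three configurations can be collected as plain Python dicts (field names and values exactly as reported above). Note that with 6 experts and top-2 gating, only a third of the expert FFN weights are active per token, which helps explain why the MoE and Hybrid models run faster despite similar total parameter counts.

```python
configs = {
    "dense": dict(params=589_561_856, d_model=2048, n_heads=8, n_layers=8,
                  mlp_ratio=8, n_kv_heads=2),
    "moe": dict(params=520_261_632, d_model=1024, n_heads=8, n_layers=6,
                mlp_ratio=8, n_kv_heads=2, moe_num_experts=6, moe_top_k=2),
    "hybrid_moe": dict(params=595_759_104, d_model=1024, n_heads=8, n_layers=6,
                       mlp_ratio=8, n_kv_heads=2, moe_num_experts=6, moe_top_k=2),
}

# With 6 experts and top-2 gating, only 2/6 of the expert FFN weights are used per token.
active_expert_fraction = configs["moe"]["moe_top_k"] / configs["moe"]["moe_num_experts"]
print(f"active expert fraction: {active_expert_fraction:.2f}")   # 0.33
```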
W&B Reports
I evaluated each model on downstream tasks: sciq_acc, arc_easy_acc, hellaswag_len_norm, piqa_len_norm, openbook_qa_len_norm, sst2_acc, commitment_bank_acc, copa_acc, winogrande_acc, mrpc_f1, and rte_len_norm.
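As a side note, here is a hypothetical sketch of how such per-task metrics can be logged to W&B to populate report panels like these; the project name, metric keys, and values are placeholders, not the actual run configuration.

```python
import wandb

run = wandb.init(project="moe-vs-dense-vs-hybrid", name="base-moe-600m")

# eval_results would come from whatever downstream evaluation harness is used
eval_results = {
    "downstream/sciq_acc": 0.0,        # placeholder values
    "downstream/arc_easy_acc": 0.0,
    "downstream/hellaswag_len_norm": 0.0,
}
run.log(eval_results, step=10_000)     # log against the training step
run.finish()
```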
The optimizer trajectory was recorded as:
I also analyzed the efficiency of each model. In this experiment, the Base MoE processed the most tokens per second and was the most efficient (a rough sketch of the throughput measurement follows the list):
- Base MoE throughput: 34k tps
- Hybrid MoE throughput: 27k tps
- Dense throughput: 18k tps
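A rough sketch of how tokens-per-second can be measured; the exact instrumentation behind the numbers above may differ, and warm-up and GPU synchronization are omitted for brevity.

```python
import time

def measure_throughput(train_step, batch_size: int, seq_len: int, n_steps: int = 50) -> float:
    """Average tokens processed per second over n_steps calls to train_step()."""
    start = time.perf_counter()
    for _ in range(n_steps):
        train_step()                     # one optimizer step on one batch
    elapsed = time.perf_counter() - start
    return batch_size * seq_len * n_steps / elapsed
```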
Finally, the perplexity and cross-entropy loss readings:
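As a reminder, perplexity is simply the exponential of the mean per-token cross-entropy loss (in nats), so the two sets of curves carry the same information on different scales.

```python
import math

def perplexity(cross_entropy_loss: float) -> float:
    """Perplexity from a mean per-token cross-entropy loss measured in nats."""
    return math.exp(cross_entropy_loss)

print(perplexity(3.0))   # a loss of 3.0 nats corresponds to perplexity ≈ 20.1
```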
Conclusion
While MoE and Hybrid-MoE remain competitive with each other depending on model size (Hybrid-MoEs tend to perform better at larger parameter counts), it is evident that, for a similar parameter budget, Dense transformer-based models require a greater compute budget than MoE architectures to reach the same performance.
The TPS of the Base MoE is almost 2x that of the Dense model and about 25% higher than that of the Hybrid MoE. The cross-entropy and perplexity curves are almost identical. Next, I will try a larger number of fine-grained experts, probably moving from 6 to 18. Stay tuned!
Special thanks to QueryLoopAI for sponsoring the compute of these experiments.
Also, feel free to drop me a message or:
- Follow me on 📚 Medium
- Subscribe to my 📢 weekly AI newsletter!
- Check out my 🤗 HuggingFace