
Qwen3-Next: A Leap in Training and Inference Efficiency

The Qwen team has introduced Qwen3-Next, a new model architecture designed around two trends shaping large-scale AI development: context-length scaling and total-parameter scaling. Building on Qwen3's MoE structure, Qwen3-Next combines a hybrid attention mechanism, an ultra-sparse Mixture-of-Experts (MoE) design, stability-focused optimizations, and multi-token prediction for faster inference.
The first release, Qwen3-Next-80B-A3B-Base, contains 80 billion parameters but activates only about 3 billion per token during inference. This design lets it match or exceed the performance of Qwen3-32B while consuming less than 10 percent of its training compute. At context lengths above 32K tokens, inference throughput is more than ten times higher, underscoring the efficiency advantage.

Hybrid Attention Design

The model interleaves Gated DeltaNet and standard attention layers in a 3:1 ratio, favoring the more efficient Gated DeltaNet. This layout balances fast linear-time processing with the strong recall of full attention, outperforming architectures that rely exclusively on either method. Enhancements such as output gating, larger head dimensions, and partial rotary position embeddings further strengthen long-context performance.
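As an illustration of that layout, the sketch below assembles a 3:1 interleaving in which every fourth layer is full attention. The layer names and depth are hypothetical placeholders, not the actual Qwen3-Next modeling code.

```python
# Sketch of a 3:1 hybrid stack: in each group of four layers, three use
# linear-time Gated DeltaNet and the fourth uses standard attention.
# Names and depth are illustrative, not the real Qwen3-Next classes.

def build_hybrid_stack(num_layers: int) -> list[str]:
    layers = []
    for i in range(num_layers):
        # Positions 0-2 of each group of four -> Gated DeltaNet;
        # position 3 -> full attention, giving the 3:1 ratio.
        layers.append("full_attention" if i % 4 == 3 else "gated_deltanet")
    return layers

print(build_hybrid_stack(8))
# ['gated_deltanet', 'gated_deltanet', 'gated_deltanet', 'full_attention',
#  'gated_deltanet', 'gated_deltanet', 'gated_deltanet', 'full_attention']
```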


Ultra-Sparse Mixture-of-Experts

Qwen3-Next expands the MoE design to 512 experts, activating ten routed experts plus one shared expert per token, or roughly 3.7 percent of total parameters at each step. This strategy maintains performance while substantially cutting computation, improving on the expert layout of Qwen3's earlier MoE models.
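A minimal sketch of that routing step follows, assuming a standard softmax top-k router with renormalized mixing weights; the shapes and names are illustrative rather than the actual Qwen3-Next implementation.

```python
import torch
import torch.nn.functional as F

NUM_EXPERTS = 512  # routed experts
TOP_K = 10         # routed experts chosen per token
HIDDEN = 2048      # illustrative hidden size, not the real one

def route(x: torch.Tensor, router_weight: torch.Tensor):
    """Pick TOP_K of NUM_EXPERTS per token and renormalize their weights.

    x: (tokens, HIDDEN); router_weight: (NUM_EXPERTS, HIDDEN).
    A shared expert (not shown) always runs in addition to these.
    """
    probs = F.softmax(x @ router_weight.t(), dim=-1)    # (tokens, 512)
    topk_probs, topk_idx = probs.topk(TOP_K, dim=-1)    # (tokens, 10)
    topk_probs = topk_probs / topk_probs.sum(-1, keepdim=True)
    return topk_idx, topk_probs

x = torch.randn(4, HIDDEN)
idx, wts = route(x, torch.randn(NUM_EXPERTS, HIDDEN))
# Final MoE output = shared_expert(x) + sum_k wts[:, k] * expert[idx[:, k]](x)
```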

Stability Enhancements

Training large sparse models often runs into instability. Qwen3-Next mitigates this with attention output gating to suppress issues such as attention sinks and massive activations, Zero-Centered RMSNorm in place of QK-Norm, and weight decay on the norm weights to keep them from growing unbounded. Router parameters are also normalized at initialization so experts are selected without bias early in training, making large-scale runs more reliable.
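The zero-centered variant can be sketched as below, assuming the common formulation in which the learnable gain is stored as a zero-initialized offset from 1, so that weight decay pulls the effective scale toward 1 rather than toward 0. This is an illustrative module, not the actual Qwen3-Next code.

```python
import torch
import torch.nn as nn

class ZeroCenteredRMSNorm(nn.Module):
    """RMSNorm with a zero-centered gain (illustrative sketch).

    The stored parameter is an offset from 1 and starts at zero, so
    applying weight decay to it regularizes the effective scale toward
    1 instead of shrinking it toward 0 as with a plain RMSNorm weight.
    """

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.zeros(dim))  # zero-centered gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * (1.0 + self.weight)  # effective scale = 1 + weight
```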

Multi-Token Prediction

A built-in multi-token prediction (MTP) mechanism gives the model a natively trained draft head for speculative decoding. Multi-step training keeps the MTP head's behavior consistent between training and inference, which raises the acceptance rate of drafted tokens in real-world decoding.
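To make the draft-and-verify idea concrete, here is a minimal greedy speculative-decoding step; draft_next and verify_greedy are hypothetical stand-ins for the MTP head and the full model, not real Qwen3-Next interfaces.

```python
def speculative_step(tokens: list[int], draft_next, verify_greedy,
                     k: int = 4) -> list[int]:
    """One greedy draft-and-verify step.

    draft_next(tokens, k)  -> k tokens proposed by the draft (MTP) head.
    verify_greedy(tokens)  -> the main model's greedy prediction after
                              each prefix, computed in one forward pass.
    Both callables are hypothetical stand-ins for real model calls.
    """
    draft = draft_next(tokens, k)
    preds = verify_greedy(tokens + draft)
    out = list(tokens)
    for i, d in enumerate(draft):
        target = preds[len(tokens) + i - 1]  # main model's choice here
        out.append(target)
        if d != target:        # mismatch: keep the corrected token, stop
            break
    else:
        out.append(preds[-1])  # all k drafts accepted: one bonus token
    return out
```

The more of the k proposed tokens survive verification per step, the fewer full forward passes are needed per generated token, which is exactly what the multi-step MTP training targets.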

Training and Inference Efficiency

Qwen3-Next was pretrained on 15 trillion tokens, a subset of Qwen3's 36-trillion-token corpus, using less than 10 percent of the compute needed for Qwen3-32B while achieving stronger results. At inference time it is nearly seven times faster than Qwen3-32B in the prefill stage and about four times faster in decoding at a 4K context, with the advantage growing beyond 32K tokens.

Performance of Post-Trained Models

Two post-trained versions have been released: Qwen3-Next-80B-A3B-Instruct and Qwen3-Next-80B-A3B-Thinking. The instruct model performs close to the much larger Qwen3-235B flagship, particularly excelling in tasks involving ultra-long context lengths up to 256K tokens. The thinking variant outperforms Qwen3’s mid-sized reasoning models and even surpasses Google’s Gemini-2.5-Flash-Thinking on several benchmarks, nearing the performance of Qwen3’s largest reasoning model.



Deployment and Ecosystem Support

Qwen3-Next is available on Hugging Face and ModelScope, with serving supported by frameworks such as SGLang and vLLM, both of which expose OpenAI-compatible APIs. The model also integrates with Qwen-Agent, which simplifies tool calling and agentic workflows. Native context support extends to 262,144 tokens, and with YaRN scaling it can handle contexts up to 1 million tokens.
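Because both servers speak the OpenAI protocol, querying a locally served model takes only a few lines. The sketch below assumes a vLLM server already running on localhost port 8000; exact launch flags depend on your hardware, so consult the vLLM or SGLang docs.

```python
# Minimal OpenAI-compatible client call, e.g. after starting a server
# with `vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct`.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct",
    messages=[{"role": "user",
               "content": "Summarize Gated DeltaNet in one sentence."}],
)
print(resp.choices[0].message.content)
```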

Summary

Qwen3-Next represents a significant advance in model architecture, delivering efficiency gains in both training and inference without compromising accuracy. By combining hybrid attention, extreme MoE sparsity, and stability-driven design choices, it reaches the performance levels of much larger models while keeping resource costs low. With applications spanning instruction following, reasoning, and ultra-long-context understanding, Qwen3-Next positions itself as a major milestone in scalable AI development.