
Tencent’s 389B-Parameter MoE Hunyuan-Large: An Efficient Alternative to Llama 3.1-405B?

Scale has driven consistent performance improvements for LLMs. Models like OpenAI’s GPT-4o and Meta’s Llama series have set impressive benchmarks, but each comes with trade-offs, primarily around computational demands. Tencent’s Hunyuan-Large model, built with 389 billion total parameters and 52 billion active parameters, stands out not just for its scale but for its use of synthetic data and its efficient Mixture of Experts (MoE) architecture. These innovations enable Hunyuan-Large to outperform similar-sized models, including Meta’s Llama 3.1 series, while activating fewer parameters per task.

Architecture and Dataset

Tencent’s Hunyuan-Large is the largest open-source MoE model to date, yet it activates only 52 billion of its 389 billion parameters for any given input. This sparse architecture lets the model use only the relevant parts of its extensive network, making it both scalable and efficient. Unlike dense models, which engage all parameters for every input, Hunyuan-Large routes each token to the "experts" relevant to the task at hand, conserving computational resources.
This efficiency translates into practical performance: Hunyuan-Large not only competes with Llama 3.1-405B but often surpasses it on benchmarks covering commonsense reasoning, mathematics, and coding, all while using far fewer active parameters.
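To make the routing idea concrete, here is a minimal sketch of gated expert selection in PyTorch. It uses top-1 routing over a pool of specialized experts plus one always-on shared expert; the layer sizes, expert count, and gating details are illustrative assumptions, not Tencent’s released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoELayer(nn.Module):
    """Illustrative MoE layer: one always-on shared expert plus
    top-1 routing over a pool of specialized experts."""

    def __init__(self, d_model=1024, d_ff=4096, n_experts=16):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.router = nn.Linear(d_model, n_experts)  # gating network

    def forward(self, x):  # x: (tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)  # routing probabilities
        weight, idx = gate.max(dim=-1)            # top-1 expert per token
        out = self.shared(x)                      # shared expert sees every token
        for e in range(len(self.experts)):
            mask = idx == e                       # tokens routed to expert e
            if mask.any():
                out[mask] = out[mask] + weight[mask, None] * self.experts[e](x[mask])
        return out
```

Because only the routed expert’s feed-forward weights run for each token, per-token compute tracks the activated-parameter count (52 billion here) rather than the full 389 billion.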
Tencent’s synthetic data strategy was a game-changer for Hunyuan-Large’s pre-training. Leveraging 1.5 trillion tokens of synthetic data within a 7-trillion-token dataset, Tencent focused on generating high-quality, diverse data that bolsters specific skills, including mathematics, logical reasoning, and programming. This enriched dataset allows Hunyuan-Large to develop robust capabilities in areas where naturally occurring data is sparse or inconsistent.
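As a rough illustration of what that corpus mix implies, synthetic data accounts for about a fifth of all training tokens (1.5T of 7T). The sketch below samples training sources in proportion to hypothetical token budgets; only the 1.5T/7T synthetic share comes from the report, and the natural-data split is invented.

```python
import random

# Hypothetical corpus composition (token counts in trillions); the 1.5T/7T
# synthetic share comes from the report, the natural-data split is invented.
corpus = {"web_text": 3.5, "books_and_papers": 1.2, "code": 0.8, "synthetic": 1.5}
total = sum(corpus.values())

print(f"synthetic share: {corpus['synthetic'] / total:.1%}")  # ~21.4%

def sample_source(rng=random):
    """Draw a training source in proportion to its token budget."""
    return rng.choices(list(corpus), weights=corpus.values(), k=1)[0]

counts = {k: 0 for k in corpus}
for _ in range(10_000):
    counts[sample_source()] += 1
print(counts)  # empirical draws roughly match the token-budget proportions
```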

Long-Context Processing Advantages

At its core, Hunyuan-Large is a sparse model that routes each input to the relevant "experts" within its network, minimizing computational overhead. This architecture not only improves inference speed but also reduces cost, since fewer parameters are activated per query than in dense models like Llama 3.1-405B.
A distinctive advantage of Hunyuan-Large is its ability to handle inputs of up to 256K tokens, the longest context window among open-source models at the time of release. This addresses one of the biggest challenges in NLP: maintaining coherence and context over extended interactions or long documents. The capability is particularly impactful for legal, technical, and multi-turn conversational applications, where long memory is essential.
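A large part of why long contexts are hard is the attention KV cache, which grows linearly with sequence length. The worked estimate below uses illustrative layer and head counts (not Hunyuan-Large’s published configuration) to show the memory stakes at 256K tokens and why KV-cache reduction techniques such as grouped-query attention matter at this scale.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """Total bytes for keys + values across all layers for one sequence."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

# Illustrative config (NOT Hunyuan-Large's published numbers).
seq_len = 256 * 1024  # 256K-token context
full = kv_cache_bytes(seq_len, n_layers=64, n_kv_heads=64, head_dim=128)
gqa = kv_cache_bytes(seq_len, n_layers=64, n_kv_heads=8, head_dim=128)

print(f"full multi-head KV cache:   {full / 2**30:.0f} GiB")  # 512 GiB
print(f"grouped-query (8 KV heads): {gqa / 2**30:.0f} GiB")   # 64 GiB
```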

Synthetic Data Generation: A Four-Step Process

One of the standout elements of Hunyuan-Large’s training pipeline is its approach to synthetic data. Synthetic data isn’t just about volume; it’s about diversity and task relevance. Tencent used a structured, four-step process to generate high-quality synthetic data that fills knowledge gaps and amplifies learning.
The first stage is instruction generation. Tencent began by sourcing knowledge-rich data, such as web pages, academic texts, and code, to create varied instructions, ensuring the dataset covered a broad range of topics and instruction styles. In the instruction evolution phase, initial instructions were iteratively refined for clarity and complexity, incorporating multiple levels of difficulty and expanding coverage of low-resource domains, as sketched below. This evolution helps the model learn progressively, handling increasingly sophisticated requests.
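The report does not publish the prompts used, but the evolution step resembles published instruction-evolution recipes. Here is a hedged sketch in which `llm()` is a hypothetical stand-in for any text-generation call and the prompt text is invented for illustration.

```python
# Hypothetical sketch of iterative instruction evolution; `llm` is a
# stand-in for any chat/completion API, not Tencent's actual pipeline.

EVOLVE_PROMPT = (
    "Rewrite the instruction below to be more challenging: add constraints, "
    "require multi-step reasoning, or deepen the domain knowledge needed, "
    "while keeping it answerable and self-contained.\n\nInstruction: {instruction}"
)

def evolve_instruction(llm, seed_instruction: str, rounds: int = 3) -> list[str]:
    """Return the seed plus progressively harder variants of it."""
    variants = [seed_instruction]
    for _ in range(rounds):
        harder = llm(EVOLVE_PROMPT.format(instruction=variants[-1]))
        variants.append(harder.strip())
    return variants

# Usage: each round yields a more complex version of the same task, e.g.
# evolve_instruction(llm, "Sort a list of integers in Python.")
```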
After refinement, specialized smaller models generated responses to these instructions in the response generation phase. This stage provided detailed, domain-specific answers simulating expert-level responses, which proved particularly useful for areas like coding or mathematical proofs. In the response filtering phase, generated responses were subjected to self-consistency checks and a critique model to remove inaccurate or low-quality outputs. This filtering step helped ensure that only high-quality synthetic data was used in training.
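A minimal sketch of that filtering logic follows: sample several candidate responses, keep an example only when a majority agree, then gate on a critique score. Both `llm` and `critic_score` are hypothetical stand-ins, not Tencent’s actual components.

```python
from collections import Counter

def self_consistent_answer(llm, instruction: str, n_samples: int = 5):
    """Sample several responses and return the majority answer, if any."""
    answers = [llm(instruction) for _ in range(n_samples)]
    best, votes = Counter(a.strip() for a in answers).most_common(1)[0]
    return best if votes > n_samples // 2 else None  # no consensus -> drop

def keep_example(llm, critic_score, instruction: str, threshold: float = 0.8):
    """Retain (instruction, answer) only if it is consistent and well-rated."""
    answer = self_consistent_answer(llm, instruction)
    if answer is None:
        return None
    # critic_score: hypothetical critique model scoring quality in [0, 1]
    if critic_score(instruction, answer) < threshold:
        return None
    return (instruction, answer)
```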


Comparative Performance with Llama 3.1

This synthetic data strategy and efficient architecture enable Hunyuan-Large to outperform even larger dense models like Llama 3.1-405B. In comparative evaluations, Hunyuan-Large scored 3.6% higher on math tasks, including the challenging GSM8K benchmark, and also came out ahead on HumanEval for coding. This strength stems from its focused synthetic data training, which hones the model’s skills on computational tasks. In commonsense and reasoning tasks, Hunyuan-Large achieved a 3.2% improvement over Llama 3.1-405B on MMLU, a benchmark that assesses multitask language understanding, highlighting its stronger grasp of complex, nuanced questions.
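For context on how such math scores are typically computed, GSM8K is usually graded by extracting the final number from a model’s worked solution and comparing it with the reference answer. The snippet below is a simplified version of that exact-match scoring, not the harness used in the report.

```python
import re

def extract_final_number(text: str):
    """Pull the last number from a model's worked solution."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return nums[-1] if nums else None

def gsm8k_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of problems where the extracted final answers match."""
    hits = sum(
        extract_final_number(p) == extract_final_number(r)
        for p, r in zip(predictions, references)
    )
    return hits / len(references)

# Example: "...so the total is 42." vs. reference "#### 42" both extract "42".
print(gsm8k_accuracy(["so the total is 42."], ["#### 42"]))  # 1.0
```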


Synthetic Data and the Future of LLMs

The Hunyuan-Large model underscores a key insight for the future of LLMs: quality synthetic data can complement natural data and expand model capabilities in targeted, efficient ways. As models scale and generalize across tasks, synthetic data provides a feasible way to train LLMs on specialized tasks without exorbitant data collection costs or limitations of naturally sourced datasets.