
DeepSeek-V3: Training 671 billion parameters on a $6 million budget

Created on December 26 | Last edited on January 9
DeepSeek AI has launched its latest language model, DeepSeek-V3, marking a significant development in the world of LLMs. The company announced the release through its updated API documentation, signaling a new offering with enhanced capabilities and improved performance. The documentation outlines key advancements that position DeepSeek-V3 as a serious contender in the increasingly competitive landscape of large language models.

The Architecture

DeepSeek-V3 is built on a Mixture-of-Experts (MoE) architecture and boasts a substantial 671 billion parameters, with 37 billion parameters actively engaged during inference. The model was trained on a massive dataset of 14.8 trillion tokens, contributing to its robust performance.
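That gap between total and active parameters is the defining property of a Mixture-of-Experts model: a gating network routes each token to a small subset of expert sub-networks, so only a fraction of the weights run per token. The toy sketch below illustrates generic top-k MoE routing; the shapes, gating rule, and expert count are illustrative assumptions, not DeepSeek-V3's actual design.

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Route input x to the top-k experts by gate score and mix their outputs.

    Because only k experts run per token, a model can hold far more total
    parameters than it activates at inference time -- the same idea that
    lets DeepSeek-V3 carry 671B parameters while engaging only ~37B.
    """
    logits = x @ gate_w                      # one gating score per expert
    top = np.argsort(logits)[-k:]            # indices of the k highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                 # softmax over the selected experts only
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 4
x = rng.normal(size=d)
gate_w = rng.normal(size=(d, n_experts))
# Each "expert" here is just a fixed linear map; real experts are full FFN blocks.
expert_mats = [rng.normal(size=(d, d)) for _ in range(n_experts)]
experts = [lambda v, W=W: v @ W for W in expert_mats]

y = moe_forward(x, gate_w, experts, k=2)
print(y.shape)
```

With k=2 of 4 experts selected, only half the expert weights touch any given input, while the gate's softmax keeps the mixed output a convex combination of the chosen experts.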

DeepSeek-V3 Benchmark Performance: A Closer Look

The release includes benchmark results that provide a basis for comparison with other leading language models. On the MMLU (Massive Multitask Language Understanding) benchmark, DeepSeek-V3 achieves a score of 88.5, slightly trailing Llama3.1's 88.6 but surpassing Qwen2.5's 85.3 and Claude-3.5 Sonnet's 88.3. This places DeepSeek-V3 among the top performers on this general language understanding assessment.

In the domain of complex reasoning, as evaluated by the DROP (3-shot F1) benchmark, DeepSeek-V3 demonstrates strong capabilities with a score of 91.6. This outperforms Qwen2.5's 76.7, Llama3.1's 88.7, and Claude-3.5 Sonnet's 88.3, suggesting a particular strength in tasks requiring intricate reasoning.
DeepSeek-V3's performance in code generation is impressive. On the Codeforces (Percentile) benchmark, where a higher score is better, DeepSeek-V3 scores 51.6. This surpasses Qwen2.5 (24.8), Llama3.1 (25.3), and notably, Claude-3.5 Sonnet (20.3), indicating a strong aptitude for competitive programming challenges. On the Aider-Edit (Acc.) benchmark, DeepSeek-V3 achieves 79.7, outperforming Llama3.1's 63.9 and Qwen2.5's 65.4, but falling short of Claude-3.5 Sonnet at 84.2.
In mathematical reasoning, measured by the MATH-500 (EM) benchmark, DeepSeek-V3 demonstrates impressive proficiency with a score of 90.2. This result is considerably higher than Qwen2.5's 80, Llama3.1's 73.8, and Claude-3.5 Sonnet's 78.3, positioning DeepSeek-V3 as a leader in this category.
For Chinese language understanding, DeepSeek-V3 scores 90.9 on the CLUEWSC (EM) benchmark, slightly below Qwen2.5's 91.4 but above Llama3.1's 84.7 and Claude-3.5 Sonnet's 85.4.

Efficiency and Cost-Effectiveness

What makes DeepSeek-V3 particularly noteworthy is the efficiency with which it was developed. As highlighted by Andrej Karpathy, the model was reportedly trained on a "joke of a budget" of $6 million, using a cluster of only 2048 GPUs for just two months. Karpathy points out that this level of capability typically requires significantly larger GPU clusters, often in the range of 16,000 GPUs or more. He contrasts this with Llama 3 405B, which used 30.8 million GPU-hours, while DeepSeek-V3 achieved seemingly stronger results with only 2.8 million GPU-hours, representing approximately 11 times less compute.

This remarkable efficiency, achieved on a relatively modest hardware setup, raises questions about the necessity of massive GPU clusters for training frontier LLMs. It suggests that advancements in data and algorithms may offer alternative pathways to achieving state-of-the-art results. The specific type of GPU used in the 2048-GPU cluster was not mentioned in the available information.
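The reported figures are easy to sanity-check. Dividing the two GPU-hour totals gives the ~11x compute gap, and dividing DeepSeek-V3's GPU-hours by the cluster size recovers the "two months" wall-clock claim:

```python
# Reported training compute (GPU-hours)
llama3_405b_gpu_hours = 30.8e6   # Llama 3 405B
deepseek_v3_gpu_hours = 2.8e6    # DeepSeek-V3

# Compute gap between the two training runs
ratio = llama3_405b_gpu_hours / deepseek_v3_gpu_hours
print(f"compute ratio: {ratio:.0f}x")

# Wall-clock check: 2.8M GPU-hours spread across a 2048-GPU cluster
cluster_size = 2048
wall_clock_days = deepseek_v3_gpu_hours / cluster_size / 24
print(f"wall clock: ~{wall_clock_days:.0f} days")
```

The ~57-day result is consistent with the reported "just two months" of training on 2048 GPUs.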



Pricing and Future Implications

DeepSeek AI has outlined a new API pricing structure for DeepSeek-V3, effective from February 8th. The pricing is set at $0.27 per million tokens for input (cache miss), $0.07 per million tokens for input (cache hit), and $1.10 per million tokens for output. The company maintains that this pricing remains competitive within the market.
DeepSeek-V3, with its strong performance across various benchmarks, a competitive pricing model, and most notably, its unprecedented development efficiency on a relatively small GPU cluster, marks a significant development in the AI landscape. The model's ability to achieve such results with limited resources challenges conventional notions about the requirements for training cutting-edge LLMs.