Unsloth unveils 1.58-bit DeepSeek R1
DeepSeek R1, an open-source, 671-billion-parameter Mixture of Experts (MoE) model, is gaining attention for competing with proprietary models such as OpenAI's o1 reasoning model. Its sheer size initially put local deployment out of reach for most users, since running it efficiently demands extensive resources. Unsloth, however, developed dynamic quantization techniques that significantly reduce its size while keeping the model functional.
By applying these selective quantization methods, Unsloth shrank DeepSeek R1 from 720GB to as little as 131GB, a roughly 80% reduction. The breakthrough lies in quantizing layers to different precision levels depending on their sensitivity (e.g., 1.58-bit for most MoE layers, 4-bit for sensitive layers). This nuanced approach keeps the model from breaking, a common failure mode when uniform quantization is applied naively to every layer.
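As a quick sanity check on those figures, the average bits per weight implied by the file sizes line up with the 1.58-bit label. A rough back-of-the-envelope calculation (it ignores file overhead and uses the headline parameter count):

```python
# Rough sanity check on the reported sizes (illustrative arithmetic only).
params = 671e9            # DeepSeek R1 parameter count
orig_gb, quant_gb = 720, 131

def bits_per_weight(size_gb: float) -> float:
    return size_gb * 1e9 * 8 / params

print(f"original:  {bits_per_weight(orig_gb):.2f} bits/weight")   # ~8.6
print(f"quantized: {bits_per_weight(quant_gb):.2f} bits/weight")  # ~1.6
print(f"reduction: {1 - quant_gb / orig_gb:.0%}")                 # ~82%
```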
Performance and Hardware Requirements
The dynamically quantized versions of DeepSeek R1 give users flexibility across hardware setups. For high-speed inference, the 1.58-bit version needs approximately 160GB of VRAM (e.g., two NVIDIA H100 80GB GPUs). Users with less powerful setups can still run the model with as little as 20GB of RAM, albeit at much slower speeds; for reasonable performance, combined VRAM and RAM should total at least 80GB.
Benchmarks show the 1.58-bit quantized model reaching around 140 tokens per second of throughput, highly efficient for a model of its size. Four dynamically quantized versions are available, ranging from 131GB to 212GB, to balance quality against resource needs.
Dynamic Quantization Process
Dynamic quantization, a key innovation from Unsloth, involves careful calibration of the precision levels across different parts of the model. For DeepSeek R1, Unsloth’s approach included:
- Keeping the first three dense layers in higher precision (4-bit or 6-bit) since they contain only 0.5% of all weights.
- Using 1.58-bit quantization for the MoE layers, which hold roughly 88% of all weights.
- Preserving higher precision in specific areas like down_proj matrices and attention outputs to avoid performance degradation.
This selective strategy ensures that critical computations remain accurate while dramatically reducing storage and memory requirements. Naive quantization approaches, by contrast, lead to issues such as endless output loops or entirely incorrect results.
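To make the idea concrete, here is a minimal sketch of how a layer-by-layer precision map could be assigned. The layer-name patterns, quant-type names, and thresholds are illustrative assumptions, not Unsloth's actual implementation:

```python
# Illustrative sketch of a selective ("dynamic") quantization plan.
# Layer-name patterns and quant types are assumptions for illustration only.

def choose_precision(layer_name: str, layer_index: int) -> str:
    # First three dense layers hold ~0.5% of weights: keep them at higher precision.
    if layer_index < 3:
        return "q6_k"
    # down_proj matrices and attention outputs are quantization-sensitive.
    if "down_proj" in layer_name or "attn_output" in layer_name:
        return "q4_k"
    # MoE expert weights (~88% of all parameters) go to ~1.58-bit.
    if "experts" in layer_name:
        return "iq1_s"   # llama.cpp's ~1.5-bit quant type
    # Everything else stays conservative.
    return "q4_k"

# Hypothetical layer names, just to show the resulting precision map.
examples = [
    (0, "model.layers.0.mlp.up_proj"),
    (5, "model.layers.5.mlp.experts.12.gate_proj"),
    (5, "model.layers.5.mlp.experts.12.down_proj"),
    (5, "model.layers.5.self_attn.attn_output"),
]
for idx, name in examples:
    print(f"{name:45s} -> {choose_precision(name, idx)}")
```

In the released files this mapping is baked in at conversion time; the point here is only that different tensors get different bit-widths rather than one global setting.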
Implementation and Accessibility
Running the dynamically quantized DeepSeek R1 requires no specialized software beyond standard tools like llama.cpp, Ollama, or the Transformers library. Unsloth has published the pre-quantized versions on Hugging Face, so they are easy to obtain. On systems with a GPU, users can offload a portion of the model's layers to it to speed up inference.
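As an example, the 1.58-bit files can be fetched with the huggingface_hub library. The repository name and file pattern below follow Unsloth's Hugging Face release, but double-check them (and the ~131GB download size) before running:

```python
# Sketch: download the 1.58-bit dynamic quant from Hugging Face.
# Repo id and filename pattern are taken from Unsloth's release;
# verify them on huggingface.co, as names may change.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/DeepSeek-R1-GGUF",    # Unsloth's GGUF repository
    local_dir="DeepSeek-R1-GGUF",
    allow_patterns=["*UD-IQ1_S*"],         # the 1.58-bit dynamic variant
)
```

The resulting .gguf files can then be loaded with llama.cpp or imported into Ollama as usual.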
For example, on a 24GB GPU, the 1.58-bit model can offload seven layers, while on an 80GB GPU, 33 layers can be offloaded. Users without GPUs can still run the model with CPU-only setups, ensuring broad accessibility.
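A small helper reproduces that arithmetic. The formula and the 61-layer count follow Unsloth's rule of thumb for this model; treat the result as a starting point and lower it if you run out of memory:

```python
# Rule of thumb for how many layers to offload to the GPU.
# Based on Unsloth's guidance: (VRAM / file size) * n_layers - 4.
def gpu_layers_to_offload(vram_gb: float, file_size_gb: float,
                          n_layers: int = 61) -> int:
    return max(0, round(vram_gb / file_size_gb * n_layers - 4))

for vram in (24, 80):
    # 131GB is the 1.58-bit file size; prints 7 and 33 layers respectively.
    print(f"{vram}GB VRAM -> offload {gpu_layers_to_offload(vram, 131)} layers")
```

The resulting count is what you would pass to llama.cpp's `--n-gpu-layers` flag (or the equivalent option in your runner).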
Future Directions
Unsloth's work on DeepSeek R1 exemplifies how advanced quantization can democratize access to large-scale AI models. Moving forward, the team aims to refine these methods further, drawing on ideas from the *Super Weights* paper and llama.cpp's 1.5-bit quantization work to push the boundaries of efficiency.
Conclusion
The dynamic quantization of DeepSeek R1 marks a significant leap in AI accessibility and efficiency. By shrinking a massive 671-billion-parameter model to a manageable size while largely preserving its quality, Unsloth has set a benchmark for scalable AI deployment. With tools like these, the future of AI looks increasingly inclusive and efficient.