
HuggingFace Unveils 1.58-bit Fine-Tuning Recipe for Llama 3

Extreme quantization without training from scratch!
Created on September 19|Last edited on September 19
HuggingFace has announced the release of a fine-tuning recipe for applying 1.58-bit quantization to large language models. This advancement brings BitNet’s ternary quantization to the widely used Transformers library, enabling researchers to fine-tune models with unprecedented efficiency. The recipe simplifies the process of working at 1.58-bit precision, which drastically reduces memory usage and computational overhead while retaining competitive performance.

BitNet and HuggingFace Integration

BitNet, developed by Microsoft Research, uses ternary quantization to compress the parameters of linear layers into three values: -1, 0, and 1. Since each weight can take only three values, it carries log2(3) ≈ 1.58 bits of information, which significantly cuts memory usage and improves computation speed.

Source: BitNet paper https://arxiv.org/abs/2402.17764
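To make the idea concrete, here is a minimal sketch of absmean ternary quantization in PyTorch, roughly following the scheme described in the BitNet paper linked above. The function name and details are illustrative, not HuggingFace's actual implementation:

import torch

def ternary_quantize(weight: torch.Tensor, eps: float = 1e-5):
    # Absmean scheme: scale by the mean absolute value, then round and
    # clamp every weight to the ternary set {-1, 0, +1}.
    scale = weight.abs().mean().clamp(min=eps)
    quantized = (weight / scale).round().clamp(-1, 1)
    return quantized, scale  # dequantize as quantized * scale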
While this technique has existed, HuggingFace’s contribution lies in making it accessible through an easy-to-use fine-tuning recipe integrated into its Transformers library. This removes the need for resource-heavy pre-training from scratch and allows users to directly fine-tune pre-trained models like Llama 3 in this low-bit format.
HuggingFace’s implementation uses BitLinear layers to replace the standard linear layers in the model architecture. This approach enables seamless quantization of weights and activations without altering the core workflows for fine-tuning and inference. By simplifying the process, HuggingFace democratizes the use of 1.58-bit quantization, making it available to researchers and developers who may not have had the resources or expertise to implement extreme quantization before.
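As a rough illustration of how a BitLinear-style layer can stand in for a standard linear layer, the sketch below ternarizes the weights on the fly and uses a straight-through estimator so gradients still reach the full-precision master weights. This is an assumption-laden sketch, not HuggingFace's actual BitLinear code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinearSketch(nn.Linear):
    # Illustrative stand-in: full-precision master weights are ternarized in
    # the forward pass; the straight-through estimator lets gradients flow
    # back to the master weights so the optimizer can keep updating them.
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        scale = w.abs().mean().clamp(min=1e-5)
        w_q = (w / scale).round().clamp(-1, 1) * scale
        w_eff = w + (w_q - w).detach()  # forward uses w_q, backward sees w
        return F.linear(x, w_eff, self.bias)

In a setup like this, the optimizer still holds full-precision weights; only the values used in the matrix multiplications are ternary, which is what makes fine-tuning directly in the low-bit format tractable.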

Tuning the Hyperparameters for 1.58-bit Fine-Tuning

HuggingFace’s recipe for 1.58-bit fine-tuning required careful tuning of several critical hyperparameters to keep model performance strong despite the extreme reduction in precision. One of the most important was the learning rate: a value of 1e-4 provided the stability needed for fine-tuning, preventing the model from diverging during training while still allowing it to gradually adjust to the reduced precision.
Another key element was the quantization schedule, which controls how quickly the model transitions to full 1.58-bit quantization. Rather than applying quantization abruptly, HuggingFace introduced a dynamic lambda schedule, gradually increasing quantization strength over time. The function min(2 * training_step / total_training_steps, 1) was found to be effective in making this shift smooth, ensuring the model retained much of its pre-trained knowledge without destabilization.
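For illustration, the schedule and one plausible way to apply it can be sketched as follows. The helper names are hypothetical, and the exact blending used in the recipe may differ:

def quantization_lambda(training_step: int, total_training_steps: int) -> float:
    # Ramp from 0 to 1 over the first half of training, then stay at 1.
    return min(2 * training_step / total_training_steps, 1.0)

# One plausible use: interpolate between full-precision and quantized weights,
# so early steps behave like the original model and later steps are fully ternary.
# lam = quantization_lambda(step, total_steps)
# w_q, scale = ternary_quantize(w)          # from the earlier sketch
# w_eff = (1 - lam) * w + lam * (w_q * scale)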
The team also explored per-row and per-column quantization strategies for the weight matrices. This allowed more structured information to be preserved during fine-tuning, particularly in larger models like Llama 3 8B. Furthermore, the fine-tuning batch size was set at 2 million tokens when working with large datasets like FineWeb-edu, providing enough signal for effective gradient updates despite the low-bit precision.
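A per-row variant of the earlier ternarization sketch is straightforward: each output row of a weight matrix gets its own absmean scale (a per-column version would reduce along dim=0 instead). Again, this is illustrative rather than the exact implementation:

import torch

def ternary_quantize_per_row(weight: torch.Tensor, eps: float = 1e-5):
    # One absmean scale per output row, so rows with very different
    # magnitudes are not forced to share a single scale.
    scale = weight.abs().mean(dim=1, keepdim=True).clamp(min=eps)
    quantized = (weight / scale).round().clamp(-1, 1)
    return quantized, scale  # dequantize row-wise as quantized * scale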

Activation Scaling in 1.58-bit Fine-Tuning

An important aspect of HuggingFace’s fine-tuning recipe is the treatment of activations. Unlike the weights, which are quantized to 1.58 bits, the activations were quantized to 8 bits. This allowed the model to retain more information during training, as activations are more sensitive to precision loss than weights. To perform this quantization, HuggingFace used a scaling technique based on the maximum absolute value of the activations. For each activation matrix, the largest absolute value was identified, and all other values were scaled relative to this maximum. This method ensured the activations fit within the 8-bit range of [-128, 127], preserving essential information while benefiting from the reduced bit-width.
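A minimal sketch of this absmax activation quantization, assuming per-token scaling along the last dimension (the function name is illustrative):

import torch

def quantize_activations_int8(x: torch.Tensor, eps: float = 1e-5):
    # Scale by the largest absolute value so every entry fits in [-128, 127],
    # then round to integers; keep the scale to dequantize later.
    scale = 127.0 / x.abs().max(dim=-1, keepdim=True).values.clamp(min=eps)
    x_q = (x * scale).round().clamp(-128, 127)
    return x_q, scale  # dequantize as x_q / scale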

Performance Metrics with 1.58-bit Fine-Tuning

Despite the extreme quantization, models fine-tuned using HuggingFace’s 1.58-bit recipe demonstrated competitive performance. The Llama3 8B-1.58 model, fine-tuned on 100 billion tokens, outperformed the Llama 1 7B model on the MMLU benchmark by 5%, showing that even with 1.58-bit quantization, the model can excel at general knowledge tasks.
The BitNet model dramatically reduced energy consumption, using 71.4 times less energy for matrix multiplication compared to models that used FP16 precision. These energy savings make 1.58-bit models especially attractive for large-scale deployments where computational costs are a critical concern.

Using 1.58-bit fine-tuning with Transformers

HuggingFace introduced a new quantization method called "bitnet", which replaces the model's linear layers with BitLinear layers.
Loading and testing the model requires no changes to the API:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the 1.58-bit Llama 3 checkpoint exactly as you would any other model
model = AutoModelForCausalLM.from_pretrained(
    "HF1BitLLM/Llama3-8B-1.58-100B-tokens",
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

input_text = "Daniel went back to the the the garden. Mary travelled to the kitchen. Sandra journeyed to the kitchen. Sandra went to the hallway. John went to the bedroom. Mary went back to the garden. Where is Mary?\nAnswer:"

# Tokenize, generate a short completion, and decode it back to text
input_ids = tokenizer.encode(input_text, return_tensors="pt").cuda()
output = model.generate(input_ids, max_new_tokens=10)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
All that's required for this to work is the latest version of the Transformers library.

Conclusion

HuggingFace’s unveiling of a fine-tuning recipe for 1.58-bit quantization marks a significant step forward in the practical application of extreme model compression. By making BitNet’s low-bit precision accessible through its Transformers library, HuggingFace allows a broader range of researchers and developers to benefit from memory-efficient, energy-saving models without sacrificing performance. The careful tuning of hyperparameters, including learning rate and quantization schedules, combined with techniques for activation scaling and structured weight quantization, ensures that these models can still compete with higher-precision alternatives. HuggingFace’s 1.58-bit fine-tuning recipe offers a powerful tool for anyone looking to deploy efficient LLMs at scale.
You can find the full announcement on the HuggingFace blog.
Tags: ML News