
New Method For LLM Quantization

A new quantization method that plays nicely with CUDA
Quantization is a well-studied method for making neural networks more efficient. As neural nets grow in size, so do the compute cost of running them and the memory required to store their weights, a burden that is especially acute for LLMs with billions of parameters.
Quantization makes models more efficient by storing parameters in a lower-precision data type (e.g., FP16 to INT8), reducing memory use and speeding up inference, ideally with only marginal losses in accuracy. Quantizing LLMs has proven effective, but one issue that has plagued existing methods is activation outliers: a small number of activation values can be roughly 100x larger in magnitude than their neighbors, and these outliers cause large accuracy drops after post-training quantization.
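
To see why a single outlier is so damaging, here is a minimal NumPy sketch of symmetric per-tensor INT8 quantization (the toy helper and data are illustrative assumptions, not code from the paper): a lone ~100x value forces a ~100x larger quantization scale, which wipes out most of the resolution available for the typical activations.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor INT8 quantization: map [-max|x|, max|x|] to [-127, 127]."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
acts = rng.normal(0.0, 1.0, size=4096).astype(np.float32)  # "typical" activations
acts_with_outlier = acts.copy()
acts_with_outlier[0] = 100.0                                # a single ~100x outlier

for name, x in [("no outlier", acts), ("with outlier", acts_with_outlier)]:
    q, s = quantize_int8(x)
    err = np.abs(dequantize(q, s) - x).mean()
    print(f"{name}: scale={s:.4f}, mean abs error={err:.4f}")

# The outlier inflates the quantization scale ~100x, so the bulk of the
# activations get rounded onto only a handful of integer levels and the
# reconstruction error for typical values grows accordingly.
```
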
Previous solutions to this issue have relied on mixed-precision decomposition, which keeps the outlier values in FP16 and quantizes the remaining parameters. The limitation of this approach is that it is difficult to implement efficiently on current hardware accelerators.

The Method

The researchers call their new quantization method “SmoothQuant.”
SmoothQuant essentially proposes to “smooth” the input activations by dividing each channel by a per-channel smoothing factor. That division is then offset by multiplying the corresponding weight channels by the same factor, so the overall operation is mathematically equivalent to the original one while the activations become much easier to quantize.
The equation for this method can be seen below, where X is the input activation, W is the weight matrix, and s is the per-channel smoothing factor:

$$Y = \big(X\,\mathrm{diag}(s)^{-1}\big)\big(\mathrm{diag}(s)\,W\big) = \hat{X}\hat{W}$$

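As a rough illustration of that identity (not the authors' implementation), the NumPy sketch below divides each activation channel by its factor and folds the same factor into the weights; the toy shapes and the simple choice s_j = max|X_j| are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy linear layer: X has one high-magnitude activation channel, W is well-behaved.
X = rng.normal(0.0, 1.0, size=(8, 6)).astype(np.float32)
X[:, 0] *= 100.0                      # outlier activation channel
W = rng.normal(0.0, 1.0, size=(6, 4)).astype(np.float32)

# Per-input-channel smoothing factor; the simple choice max|X_j| is used here.
s = np.abs(X).max(axis=0)

X_hat = X / s                         # X · diag(s)^-1
W_hat = W * s[:, None]                # diag(s) · W

# The smoothed product is numerically identical to the original one ...
assert np.allclose(X @ W, X_hat @ W_hat, rtol=1e-4, atol=1e-3)

# ... but the activation channel ranges are now balanced, so a single
# quantization scale no longer has to cover a 100x outlier.
print("original channel maxima:", np.abs(X).max(axis=0))
print("smoothed channel maxima:", np.abs(X_hat).max(axis=0))
```
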
Their next contribution is a recipe for choosing the smoothing factor. A naive choice is to set each channel's factor to that channel's maximum absolute activation value, which makes the activations trivial to quantize but migrates all of the difficulty to the weight quantization step (see below).

$$s_j = \max\left(|X_j|\right)$$

To make the activations quantizable without making the weights impossible to quantize, the solution is to balance the quantization difficulty between the two, which requires a smoothing factor that accounts for the scale of both the weights and the input activations. In the formula below, j indexes the input channel and alpha is a hyperparameter that controls how much of the difficulty is shifted onto the weights versus left with the activations.

$$s_j = \frac{\max\left(|X_j|\right)^{\alpha}}{\max\left(|W_j|\right)^{1-\alpha}}$$

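The sketch below shows one way this formula could be computed in NumPy; it is a toy illustration rather than the paper's code, and the example tensors are assumptions. The paper suggests alpha = 0.5 as a well-balanced default, and setting alpha = 1 recovers the naive activation-only choice from above.

```python
import numpy as np

def smoothing_factors(X: np.ndarray, W: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Per-input-channel factor s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)."""
    act_max = np.abs(X).max(axis=0)   # max |X_j| over tokens, per input channel
    w_max = np.abs(W).max(axis=1)     # max |W_j| over output features, per input channel
    return act_max**alpha / w_max**(1.0 - alpha)

rng = np.random.default_rng(2)
X = rng.normal(0.0, 1.0, size=(8, 6)).astype(np.float32)
X[:, 0] *= 100.0                      # outlier activation channel
W = rng.normal(0.0, 1.0, size=(6, 4)).astype(np.float32)

for alpha in (1.0, 0.5):
    s = smoothing_factors(X, W, alpha)
    X_hat, W_hat = X / s, W * s[:, None]
    print(f"alpha={alpha}: activation range {np.abs(X_hat).max():.2f}, "
          f"weight range {np.abs(W_hat).max():.2f}")

# alpha = 1 reproduces the naive choice (all difficulty moves to the weights);
# alpha = 0.5 splits the dynamic range more evenly between activations and weights.
```
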
Efficiency is Growth

The work reports up to a 1.56x inference speedup as well as a 2x reduction in memory!
Overall, this is a highly creative quantization method that is not only mathematically sound but also compatible with modern hardware accelerators and software stacks such as NVIDIA's CUDA.
As LLMs become more efficient, the cost of deploying them drops, which benefits every area of the machine learning world.

The paper:

https://arxiv.org/abs/2211.10438
Tags: ML News