A Tale of Model Quantization in TF Lite
Model optimization strategies and quantization techniques to help deploy machine learning models in resource constrained environments.
Created on April 29|Last edited on October 11
State-of-the-art machine learning models are often bulky, which makes them inefficient for deployment in resource-constrained environments like mobile phones, Raspberry Pis, and microcontrollers.
Even if you think you might get around this problem by hosting your model on the cloud and serving results through an API, think of constrained environments where internet bandwidth might not always be high, or where data must not leave a particular device.
We need a set of tools that make the transition to on-device machine learning seamless. In this article, we'll look at how TensorFlow Lite (TF Lite) can really shine in situations like this. We'll cover the model optimization strategies and quantization techniques supported by TensorFlow.
Thanks to Arun, Khanh, and Pulkit (Google) for sharing incredibly useful tips for this report.
Table of Contents
- Overview
- Need for on-device machine learning
- Model optimization strategies supported in TensorFlow
- Quantization Techniques
- Experimental settings
- Performance with normal fine-tuning
- Quantizing the fine-tuned model
- A Note on Setting Configuration Options for the Conversions
- Quantization-Aware Training (QAT) With the Same Model
- Brief Comparison Between the QAT & Non QAT Model
- Evaluating the Quantized QAT Model
- Quantizing to Float Models
- Explore Other Quantization Schemes & Concluding Thoughts
Overview
In this report, we'll cover the following topics –
- Need for on-device machine learning
- Model optimization strategies supported in TensorFlow
- Quantization Techniques
- Things to keep in mind while performing quantization
Need for on-device machine learning
In their talk TensorFlow Lite: ML for mobile and IoT devices (TF Dev Summit '20), Tim Davis and T.J. Alumbaugh emphasize the following:
- Lower latency & close-knit interactions: There are many critical applications where you'd like to have zero latency in predictions, self-driving cars for example. You might also need to keep all the internal interactions of your system really compact so that no extra latency is introduced.
- Network connectivity: As I mentioned earlier, when you depend on a cloud-hosted model, you essentially constrain your application to depend on a certain level of network bandwidth that might not always be achievable.
- Privacy-preserving: There can be hard requirements on privacy, e.g. that the data must not leave the device.
To make heavyweight ML models deployable on tiny devices we need to optimize them, for instance, to fit a 1.9GB model into a 2GB application. To help ML developers and mobile application developers, the TensorFlow team has come up with two solutions: TensorFlow Lite and the Model Optimization Toolkit.
Model optimization strategies supported in TensorFlow
TensorFlow, via TensorFlow Lite and the Model Optimization Toolkit, supports the following model optimization strategies today -
- Quantization, where you work with lower-precision formats to reduce the size of your models.
- Pruning, where you discard the parameters in your model that have very little significance for the model's predictions.
In this article, we will focus on quantization.
Quantization Techniques
Generally, our machine learning models operate in the float32 precision format. All the model parameters are stored in this format, which often leads to heavier models, and the heaviness of a model correlates directly with the speed at which it makes predictions. So a natural question is: what if we could reduce the precision at which our models operate and thereby cut down on prediction times? That is what quantization does - it represents the parameters of a model in lower-precision formats such as float16 or int8.
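To make this concrete, here is a toy sketch (not TF Lite's exact implementation) of how a handful of float32 weights could be mapped to int8 with a scale and a zero point -
# A toy illustration of affine quantization: map float32 values to int8
# using a scale and a zero point, then map them back.
import numpy as np

weights = np.array([-0.8, -0.1, 0.0, 0.4, 0.9], dtype=np.float32)

scale = (weights.max() - weights.min()) / 255.0        # int8 offers 256 levels
zero_point = np.round(-128 - weights.min() / scale)    # maps weights.min() to -128

quantized = np.clip(np.round(weights / scale) + zero_point, -128, 127).astype(np.int8)
dequantized = (quantized.astype(np.float32) - zero_point) * scale   # close to the originals, up to rounding error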
Quantization can be applied to a model in two flavors -
- Post-training quantization is applied to a model after it is trained.
- Quantization-aware training, where a model is trained to compensate for the loss in precision that quantization might introduce. When you reduce the precision of your model's parameters, information can be lost and you might see some reduction in accuracy. In these situations, quantization-aware training can be really helpful.
We will see both these flavors in this report. Let's get started!
Experimental settings
All of the experiments in this report were performed on Colab. I used the flowers dataset and fine-tuned a pre-trained MobileNetV2 network to start off with. Here's the code that defines the network architecture -
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.layers import GlobalAveragePooling2D, Dense, Dropout
from tensorflow.keras.models import Model

# Load the MobileNetV2 model but exclude the classification layers
EXTRACTOR = MobileNetV2(weights="imagenet", include_top=False,
    input_shape=(224, 224, 3))

# We will set it to both True and False across experiments
EXTRACTOR.trainable = True

# Construct the head of the model that will be placed on top of
# the base model
class_head = EXTRACTOR.output
class_head = GlobalAveragePooling2D()(class_head)
class_head = Dense(512, activation="relu")(class_head)
class_head = Dropout(0.5)(class_head)
class_head = Dense(5, activation="softmax")(class_head)

# Create the new model
classifier = Model(inputs=EXTRACTOR.input, outputs=class_head)

# Compile the model
classifier.compile(loss="sparse_categorical_crossentropy",
    optimizer="adam",
    metrics=["accuracy"])
The networks were trained for 10 epochs with a batch size of 32.
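For reference, here's a minimal sketch of what that training setup could look like, assuming the flowers dataset is loaded via TensorFlow Datasets as tf_flowers (the exact input pipeline used for the experiments may differ) -
# A minimal sketch of the training loop; the preprocessing and split
# shown here are assumptions, not necessarily the exact pipeline used.
import tensorflow as tf
import tensorflow_datasets as tfds

def preprocess(image, label):
    image = tf.image.resize(image, (224, 224)) / 255.0
    return image, label

train_ds = (tfds.load("tf_flowers", split="train[:85%]", as_supervised=True)
            .map(preprocess)
            .shuffle(1024)
            .batch(32)
            .prefetch(tf.data.AUTOTUNE))

classifier.fit(train_ds, epochs=10)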
Performance with normal fine-tuning
[Run set panel: 1 run]
Quantizing the fine-tuned model
After you have trained a model in tf.keras, the quantization part is just a matter of a few lines of code. So, the way you would do that is as follows -
converter = tf.lite.TFLiteConverter.from_keras_model(non_qat_flower_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
quantized_tflite_model = converter.convert()
You first load your model into the TFLiteConverter class, then specify an optimization policy, and finally ask TF Lite to convert your model with that policy. Serializing the converted TF Lite file is straightforward -
f = open("normal_flower_model.tflite", "wb")
f.write(quantized_tflite_model)
f.close()
This form of quantization is also referred to as post-training dynamic range quantization. It quantizes the weights of your model to 8 bits of precision. Here you can find more details about this and the other post-training quantization schemes.
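If you want to sanity-check the serialized file yourself, here's a minimal sketch that loads it with the TF Lite Interpreter and runs a single prediction (the dummy input below stands in for a preprocessed flower image) -
# A minimal sketch: load the serialized TF Lite model and run one prediction.
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="normal_flower_model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]

# Dummy image standing in for a preprocessed 224x224x3 flower image.
dummy_image = np.random.rand(1, 224, 224, 3).astype(np.float32)
interpreter.set_tensor(input_details["index"], dummy_image)
interpreter.invoke()
predictions = interpreter.get_tensor(output_details["index"])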
A Note on Setting Configuration Options for the Conversions
TF Lite allows us to specify a number of different configurations when converting our models. We saw one such configuration in the aforementioned code, where we specified the optimization policy.
Apart from tf.lite.Optimize.DEFAULT, two other policies are available - tf.lite.Optimize.OPTIMIZE_FOR_SIZE & tf.lite.Optimize.OPTIMIZE_FOR_LATENCY. As the names suggest, TF Lite will try to optimize the models according to the policy you choose.
We can specify other things like -
- target_spec
- representative_dataset
Learn more about the TFLiteConverter class here. It's important to note that these configuration options let us trade off a model's prediction speed against its accuracy. Here, you can find a number of trade-offs with respect to the different post-training quantization schemes available in TF Lite.
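For example, here's a hedged sketch of how representative_dataset and target_spec come together for full integer quantization; representative_images() below is a hypothetical helper that yields preprocessed image batches -
# A sketch of full integer quantization. `representative_images()` is a
# hypothetical helper yielding preprocessed (1, 224, 224, 3) float32 batches.
import tensorflow as tf

def representative_dataset():
    for image in representative_images():
        yield [tf.cast(image, tf.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(non_qat_flower_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
int8_tflite_model = converter.convert()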
Below we can see some useful statistics on this converted model.
[Run set panel: 1 run]
Quantization-Aware Training (QAT) With the Same Model
A good first approach here is to train your model in a way that it learns to compensate for the information loss that quantization might induce. With quantization-aware training we can do just that. To train our network in a quantization-aware manner, we just add the following lines of code -
import tensorflow_model_optimization as tfmot

qat_model = tfmot.quantization.keras.quantize_model(your_keras_model)
Now, you can train qat_model in the same way you would train a tf.keras model. Here you can find a comprehensive coverage of QAT.
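For instance, assuming train_ds and val_ds are the same kind of tf.data pipelines used for the original fine-tuning, training could look like this -
# A minimal sketch: the QAT model is compiled and trained like any tf.keras model.
# `train_ds` and `val_ds` are assumed tf.data pipelines over the flowers dataset.
qat_model.compile(loss="sparse_categorical_crossentropy",
                  optimizer="adam",
                  metrics=["accuracy"])
qat_model.fit(train_ds, validation_data=val_ds, epochs=10)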
Below, we can see that this quantization-aware model does slightly better than our previous model.
[Run set panel: 2 runs]
Brief Comparison Between the QAT & Non QAT Model
In terms of model size, the QAT model is similar to the non-QAT model:
[Run set panel: 3 runs]
But in terms of training time, we see that the QAT model takes longer. This is because during QAT, fake quantization nodes are introduced in the model to compensate for the information loss, which makes the QAT model take more time to converge.
This is important to keep in mind when you are optimizing for time to convergence: if your model already takes a long time to train, QAT will increase that time further.
Quantizing a QAT model is exactly the same as what we saw in the section above (we use the same quantization configuration).
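In other words, the conversion sketch is the same as before with the QAT model plugged in -
# Converting the QAT model uses the same dynamic range settings as earlier.
converter = tf.lite.TFLiteConverter.from_keras_model(qat_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
quantized_qat_tflite_model = converter.convert()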
Let's now compare the performance of the quantized version of the QAT model.
Evaluating the Quantized QAT Model
In the following table, we see that the quantized version of the QAT model indeed performs better than the previous model.
[Run set panel: 2 runs]
Quantizing to Float Models
To convert our models while keeping float precision, we just need to discard this line - converter.optimizations = [tf.lite.Optimize.DEFAULT]. Note that float16 quantization is also supported in TensorFlow Lite; that scheme is particularly helpful if you want to take advantage of GPU delegates. In the table below, we can see the size and accuracy of the models converted this way.
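For the float16 scheme mentioned above, a minimal sketch (following the TF Lite post-training quantization docs) only needs the supported types set on the converter's target spec -
# A minimal sketch of float16 post-training quantization.
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_keras_model(non_qat_flower_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
float16_tflite_model = converter.convert()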
[Run set panel: 6 runs]
Although the size of these models has increased, we see that the original performance of the models remains high (see the Accuracy (%) column). Note that converting a QAT model using this scheme is not recommended: during QAT, the fake quantization ops that get inserted are in int precision, so when we convert a QAT model this way the result can show inconsistencies.
Additionally, hardware accelerators like the Edge TPU USB Accelerator do not support float models.
Explore Other Quantization Schemes & Concluding Thoughts
There are other post-training quantization techniques available as well, such as full integer quantization, float16 quantization, etc. This is where you can learn more about them. Keep in mind that the full integer quantization scheme might not always be compatible with a QAT model.
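For reference, here's a sketch of the integer-only variant that accelerators like the Edge TPU typically expect; representative_dataset is the same hypothetical helper from the configuration section above -
# A sketch of integer-only quantization with int8 inputs and outputs.
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_keras_model(non_qat_flower_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
int8_only_tflite_model = converter.convert()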
There are a number of SoTA pre-trained TF Lite models hosted for developers to use in their applications, and they can be found here.
For mobile developers who are looking to integrate machine learning in their applications, there are a number of example applications in TF Lite worth checking out. TensorFlow Lite also provides tooling for embedded systems and microcontrollers and you can learn more about it from here.
If you'd like to reproduce the results of this analysis, you can –
Check out the code on GitHub →
Tags: Intermediate, Computer Vision, Keras, TensorFlow Lite, Experiment, MobileNet v2, Panels, Plots