Mixed precision training with tf.keras

Train large neural nets significantly faster with very little decrease in the performance
Created on September 22|Last edited on September 22
Comment
In this article, we are going to see how to incorporate mixed precision (MP) training in your tf.keras training workflows. Mixed precision training was proposed by NVIDIA in this paper. It has allowed us to train large neural networks significantly faster with zero to very little decrease in the performance of the networks. Here’s what we are gonna cover -
Several options to incorporate mixed precision training for tf.keras models
Things to remember while performing mixed precision training
Hands-on examples of these options
Use Weights and Biases (W&B) to compare the results of different mixed precision training experiments
If you want to get into the details of mixed precision training, then I highly suggest the following resources:
﻿Mixed precision training by Sylvian Gugger
﻿This PyCon SG 2019 tutorial by Timothy Liu
﻿Training with Mixed Precision by NVIDIA
Let’s get started!
Incorporating mixed precision training in tf.keras (TensorFlow 2.0)TensorFlow 2.0 offers the following options to help you easily incorporate mixed precision training -
﻿tf.train.experimental.enable_mixed_precision_graph_rewrite﻿
﻿tf.keras.mixed_precision.experimental.LossScaleOptimizer﻿
﻿tf.keras.mixed_precision.experimental.set_policy﻿
In my experience, I have found the last two ones to be more performant. There are some configurations needed, however, in order to activate mixed precision training. We will see them in a later section. But before that, let’s discuss some pointers to keep mind while using mixed precision training.
Things to remember while using MPPlease note that you will not be able to attain considerable performance improvements with MP training out of the box. There are some pointers we need to keep in mind when using MP training -
MP training is only supported on the Volta generation of NVIDIA GPUs, Tesla V100, Tesla T4, for example.
To be able to get the most out of MP training, you should use it along with large networks Transformers, ResNet50 etc. The architecture of a network (i.e. the kind of layers, activations a network consists) has an important role to play here. I was not able to gain any performance improvements with networks like VGG16.
Dimensions (batch, channels, image size, dense nodes) should be in multiples of 8.
When using computer vision models, the higher resolution images you use, the better it is.
Hands-on mixed precision training with tf.kerasIn this section, we will see some hands-on examples for using mixed precision training with tf.keras. You can find the full-length experiments in this repo.
Introduction to the datasetThe data for my experiments came from this Analytics Vidhya Hackathon. You’re given a set of images like the following and you need to predict the category of a given ship -
﻿
The labels of the images were given as the following encodings -
{'Cargo': 1,
'Military': 2,
'Carrier': 3,
'Cruise': 4,
'Tankers': 5}
There are 6252 images in train set and 2680 images in test set. Unfortunately, the images in the test set were completely unlabeled (as it should be in a Hackathon). For the experiments, I only used those 6252 images.
The dataset comes in the following format -
﻿
﻿
﻿
﻿
├── train
﻿
│   ├── images [8932 entries]
﻿
│   ├── images [8932 entries]
﻿
└── test_ApKoW4T.csv
﻿
﻿
where, train.csv and test_ApKoW4T.csv contain the names of the training and testing images respectively.
Now that we have a fair introduction to the dataset, we can proceed towards exploring the hands-on examples.
Setting explicit policies for MP trainingtf.keras.mixed_precision.experimental.set_policy allows us to set the default policy for the layers of a network. Policy here refers to the dtype of a specific layer. There are multiple ways to set policies in tf.keras for the layers -
If you specify  tf.keras.mixed_precision.experimental.set_policy('mixed_float16') before defining your network, the default policy of the network’s layers would be mixed_float16 i.e. all the computations in the network would take place in float16 while the parameters would be in float32. However, keep in mind that the computation of a layer containing the softmax activation function should be done in float32 to avoid numerical instability.
Instead of setting the default policy of the layers, you can be more specific and do something like -
policy = tf.keras.mixed_precision.experimental.Policy('mixed_float16')
model = tf.keras.models.Sequential(
   tf.keras.layers.Input((100,)),
   tf.keras.layers.Dense(10, dtype=policy),
   tf.keras.layers.Dense(10, dtype=policy),
﻿
   # Softmax should be done in float32 for numeric stability.
   tf.keras.layers.Activation('Softmax', dtype='float32')
)
model.fit(...)  
# Train `model`
(Code taken from this example)
Now, to be able to use the policies for mixed precision, you need to enable the XLA compiler (an introduction to XLA compilers is available here) like so - tf.config.optimizer.set_jit(True). Note that, this should be done per session basis i.e. if you plan to use XLA compiler (in most of the cases you really should) you need to enable it for each of the new sessions.
It’s also a good idea to clear any existing session before you start your experiments to prevent unforeseen issues -
def create_model(img_size=(224,224), num_class=5, train_base=True):
﻿
   # Accept float16 image inputs
﻿
   input_layer = Input(shape=(img_size[0],img_size[1],3), dtype=tf.float16)
﻿
   base = ResNet50(input_tensor=input_layer,
﻿
                   include_top=False,
﻿
                   weights="imagenet")
﻿
   base.trainable = train_base
﻿
   x = base.output
﻿
   x = GlobalAveragePooling2D()(x)
﻿
   # softmax only accepts float32 - need to manually cast (likely a bug)
﻿
   preds = Dense(num_class, activation="softmax", dtype=tf.float32)(x)
﻿
   return Model(inputs=input_layer, outputs=preds)
We added a (multi) classification head on top of a pre-trained ResNet50 network. Note that the inputs of the model should in float16. Also, take a look at the dtype of the Dense layer. Now, when I trained the network with the above-mentioned data on my GCP Notebook instance (consisting of a Tesla T4), I got the following -
Train for 78 steps
﻿
Epoch 1/5
﻿
78/78 [==============================] - 71s 916ms/step - loss: 0.3686 - accuracy: 0.7503
﻿
Epoch 2/5
﻿
78/78 [==============================] - 18s 227ms/step - loss: 0.2259 - accuracy: 0.7822
﻿
Epoch 3/5
﻿
78/78 [==============================] - 18s 229ms/step - loss: 0.1535 - accuracy: 0.7880
﻿
Epoch 4/5
﻿
78/78 [==============================] - 18s 230ms/step - loss: 0.1268 - accuracy: 0.7979
﻿
Epoch 5/5
﻿
78/78 [==============================] - 18s 230ms/step - loss: 0.1180 - accuracy: 0.7928
As a sanity check, I trained the same network without the mixed precision configurations and I got the following results -
Train for 78 steps
﻿
Epoch 1/5
﻿
78/78 [==============================] - 80s 1s/step - loss: 0.3619 - accuracy: 0.7359
﻿
Epoch 2/5
﻿
78/78 [==============================] - 59s 756ms/step - loss: 0.2056 - accuracy: 0.7683
﻿
Epoch 3/5
﻿
78/78 [==============================] - 60s 770ms/step - loss: 0.1513 - accuracy: 0.7761
﻿
Epoch 4/5
﻿
78/78 [==============================] - 61s 778ms/step - loss: 0.1305 - accuracy: 0.7835
﻿
Epoch 5/5
﻿
78/78 [==============================] - 61s 788ms/step - loss: 0.1017 - accuracy: 0.7869
You can see the improvement in the training time! Let’s visualize some of the most important metrics here:
﻿
﻿
The above set of graphs provides us with information about the following metrics that too from the different experiments (you can see the name of the experiments with a bit struggle) -
Accuracy
Loss
Training time
It can be clearly seen that mixed precision training indeed allows to train our networks faster that too without any loss in the performance. Note that “training_time” is not tracked by W&B by default but it is as easy as the following block of code:
import time
﻿
start = time.time()
﻿
model.fit(...,
﻿
        callbacks=[WandbCallback()])
﻿
training_time = time.time() - start
﻿
wandb.log({"training_time":training_time})
You can log many other things like tables, images, audios etc. If you are interested to explore that space, please check out the documentation here. W&B also allows us to generate the above plots within our Jupyter Notebooks with the magic command %%wandb (read more here) but I like to keep these plots separated from my notebooks.
Additionally, I created the following scatter plot to capture the trends in “training time vs. loss” using the “Add a visualization” button you get on your run page.
﻿
Creating the scatter plot was a matter of a few keystrokes and you can easily.
As you would expect, the graph is interactive as well.
Let’s now explore the second option for doing MP training in tf.keras.
Loss scaling for MP trainingLoss scaling is an important concept in MP training as it prevents the numerical underflow that might happen to lower precision in the computations. To be able to use tf.keras.mixed_precision.experimental.LossScaleOptimizer, you need to enable mixed precision by tf.config.optimizer.set_experimental_options({"auto_mixed_precision": True}) along enabling the XLA compiler. Be sure to review some of the important things you may need to follow if you are using loss scaling with a custom training loop. We could use the same model we saw in the previous section and in order to plug in loss scaling we would do the following -
opt = Adam(learning_rate=1e-4)
﻿
opt = tf.keras.mixed_precision.experimental.LossScaleOptimizer(opt,  
﻿
                                                      "dynamic")
﻿
model.compile(loss="categorical_crossentropy",
﻿
             optimizer=opt,
﻿
             metrics=["accuracy"])
﻿
And that’s it! You should get similar performance as you got with policies in this case as well.
W&B for comparing our experimentsMixed precision does not only result in faster training - it is supposed to reduce memory footprint of the models as well. Because you are not using double precision or single precision fully. But this might not be that evident since we are dealing with less data here. Had it been a larger dataset (for example, say dataset of 500 MB) the difference would have been very evident. But it’s always very practical to see the memory footprint of your machine learning models in general. W&B makes it super easy to visualize that. On my project page, for individual runs like this, you get a tab called “System”. It provides you with a bunch of information like so -
﻿
Apart from the above metrics related to your system, you also get information about GPU memory usage in the bottom. In my experiment, I used the data API for TensorFlow which allows computations related to data loading, preprocessing etc. to be performed on your GPU (if available). This truly sets tf.data apart from other alternatives like ImageDataGenerator as it does not allow for GPU computation. You can use the graphs on GPU memory usage for this kind of comparisons.
The following figure shows the GPU usage from different experiments -
﻿
As you can see the different flavors of mixed precision training tend to follow similar patterns in terms of GPU usage. When I did not use mixed precision, the GPU was actually utilized better. This might hint the developers to delve further and figure out ways to improve the GPU utilization. But there is always trade-off between GPU utilization and model performance and it changes from project to project.
Following are some comprehensive graphs that represent all the necessary stuff from the conducted experiments -
﻿
The legends of the graphs are the different runs (one run = one experiment). It’s very important to turn on the XLA compiler when using mixed precision training. As can be seen from the above figure, without XLA compiler if mixed precision training is used model training will be more time consuming.
As a machine learning practitioner, you will be expected to care about aspects like this more often than not.
For each of the different runs, we get a dedicated logs page as well where we get the local training log -
﻿
﻿
You can also initialize W&B (wandb.init()) just before you are starting training process and it will appear under Runtime tab.
Thank you!Thanks you for sticking together with me till the end. I hope you will benefit from this article and incorporate mixed precision training in your own experiments. Don’t forget to let me know the results!
I would like to thank Lavanya from the W&B team who carefully reviewed the codes and the article itself. Her feedback was immensely productive.
﻿
Add a comment