XLA Compatibility of Vision Models in Keras

A set of comprehensive benchmarks around XLA compatibility of computer vision models implemented in Keras.

Introduction

In this article, we'll present a comprehensive set of benchmarks of XLA-compatible, pre-trained vision models in Keras.
Specifically, we'll use pre-trained computer vision models shipped by Keras Applications, KerasCV Models, and TensorFlow Hub. These models were benchmarked across different image resolutions and different GPU devices (A100, V100, and T4) to provide a holistic overview of the possible gains from XLA.
This report was made possible by Sayak Paul's massive effort of benchmarking all these models as part of the Keras Sprint, 2023, and was co-authored by Sayak Paul, Ayush Thakur, and Soumik Rakshit. Throughout the article, we make extensive use of Weave, an open-source toolkit developed by the team at Weights & Biases for building performant, interactive, and insightful data visualization panels.
The code used for creating this benchmark is available on github.com/sayakpaul/keras-xla-benchmarks.

🐌 Executing a Typical TensorFlow Program

Let's look at a typical TensorFlow program:
from tensorflow import keras

# A simple stack of Dense layers; each operation maps to one or more GPU kernels.
inputs = keras.Input(shape=(784,))  # Dense layers need the last axis defined, so we use a concrete dimension
x = keras.layers.Dense(512, activation="relu")(inputs)
x = keras.layers.Dense(256, activation="relu")(x)
outputs = keras.layers.Dense(10, activation="softmax")(x)
model = keras.Model(inputs, outputs)
When a TensorFlow program is run, all of the operations are executed individually by the TensorFlow executor. Each TensorFlow operation has a pre-compiled GPU kernel implementation that the executor dispatches to.
A GPU kernel is a program that can be executed on the GPU in a massively parallel manner. Like any normal program, these kernels need access to memory and compute, and launching many kernels that perform many memory accesses makes the program costlier to run.

☢️ Optimization using Operator Fusion

A thread posted on Twitter by Horace He neatly sums up the idea behind operator fusion.

The moral of the story is that fusing operators reduces the number of memory operations they perform, which in turn makes the program run faster.
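To make this concrete, here's a minimal sketch that times a chain of pointwise operations eagerly and with XLA fusion enabled via jit_compile=True. The function, tensor shapes, and iteration counts are illustrative choices of ours, not the benchmark harness used later in this report:

import time
import tensorflow as tf

def pointwise_chain(x):
    # Three pointwise operations that XLA can fuse into a single GPU kernel.
    return tf.nn.relu(x) * 2.0 + 1.0

fused_chain = tf.function(pointwise_chain, jit_compile=True)

x = tf.random.normal((4096, 4096))
_ = fused_chain(x)  # warm-up: triggers tracing and XLA compilation

start = time.perf_counter()
for _ in range(100):
    out = fused_chain(x)
out.numpy()  # block until the GPU finishes before reading the clock
print(f"fused: {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
for _ in range(100):
    out = pointwise_chain(x)
out.numpy()
print(f"eager: {time.perf_counter() - start:.3f}s")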
Let's explore how we can extend this idea to neural networks!

✈️ XLA: Optimizing Compiler for Machine Learning

First off, we need to define XLA:
XLA (Accelerated Linear Algebra) is a domain-specific compiler for linear algebra that can accelerate TensorFlow models with potentially no source code changes.
XLA provides an alternative mode of running models: it compiles the TensorFlow graph into a sequence of computation kernels generated specifically for a given model. Because these kernels are unique to the model, they can exploit model-specific information for optimization.
Let's look at an optimization XLA does in the context of a simple TensorFlow computation:
import tensorflow as tf

def model_fn(x, y, z):
    return tf.reduce_sum(x + y * z)
Typically, the graph launches three kernels if executed without XLA:
  • One for the multiplication operation
  • One for the addition operation
  • One for the reduction operation
However, XLA can optimize the graph to compute the result in a single kernel launch. It does this by fusing the addition, multiplication, and reduction into a single GPU kernel. Moreover, this fused operation does not write out the intermediate values produced by y * z and x + y * z to memory. Instead, it streams the results of these intermediate computations directly to their users while keeping them entirely in GPU registers.
Memory bandwidth is typically the scarcest resource on hardware accelerators, so removing memory operations is one of the best ways to improve performance. Hence, Operator Fusion is the single most important optimization in XLA.
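To peek at what XLA actually generates for such a function, tf.function exposes an experimental hook for dumping the compiler IR. Here's a small sketch; note that experimental_get_compiler_ir is, as the name suggests, an experimental API whose behavior may change across TensorFlow versions, and the input shapes are arbitrary:

import tensorflow as tf

@tf.function(jit_compile=True)
def model_fn(x, y, z):
    return tf.reduce_sum(x + y * z)

x = y = z = tf.random.normal((1024,))

# Print the optimized HLO generated for this function; the multiply, add,
# and reduction appear inside a single fusion computation.
print(model_fn.experimental_get_compiler_ir(x, y, z)(stage="optimized_hlo"))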

🚀 Using XLA to Accelerate TensorFlow Programs

Let's explore the different ways we can use XLA to accelerate our TensorFlow programs, shall we?

Explicit Compilation

The explicit compilation API, available as tf.function(jit_compile=True), offers fine-grained control for choosing which functions should be compiled. Let's see how we can accelerate the function we mentioned above:
import tensorflow as tf

def model_fn(x, y, z):
    return tf.reduce_sum(x + y * z)

compiled_model_fn = tf.function(model_fn, jit_compile=True)
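Calling the compiled function is no different from calling the original. The first call traces and compiles; subsequent calls with the same input shapes and dtypes reuse the compiled kernel. The tensor shapes below are illustrative:

x = tf.random.normal((1024, 1024))
y = tf.random.normal((1024, 1024))
z = tf.random.normal((1024, 1024))

result = compiled_model_fn(x, y, z)  # first call: trace + XLA compilation
result = compiled_model_fn(x, y, z)  # later calls: reuse the compiled kernel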
Since tf.function accepts a function and returns a compiled function, we can also use it as a decorator. For example, the following TensorFlow function, which performs a single MNIST training step, is compiled with XLA:
import tensorflow as tf

# Assumed setup: the original tutorial defines a layer, an optimizer, and a
# cast helper elsewhere; these are representative stand-ins.
layer = tf.keras.layers.Dense(10)
optimizer = tf.keras.optimizers.Adam()

def cast(images, labels):
    # Flatten the images and cast inputs to the dtypes the layer and loss expect.
    images = tf.cast(tf.reshape(images, [-1, 784]), tf.float32)
    labels = tf.cast(labels, tf.int64)
    return images, labels

@tf.function(jit_compile=True)
def train_mnist(images, labels):
    images, labels = cast(images, labels)

    with tf.GradientTape() as tape:
        predicted_labels = layer(images)
        loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
            logits=predicted_labels, labels=labels
        ))
    layer_variables = layer.trainable_variables
    grads = tape.gradient(loss, layer_variables)
    optimizer.apply_gradients(zip(grads, layer_variables))
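A minimal driver loop for this step function might look like the following; the batch size is an arbitrary choice on our part:

# Load MNIST and feed batches through the XLA-compiled training step.
(train_images, train_labels), _ = tf.keras.datasets.mnist.load_data()
dataset = tf.data.Dataset.from_tensor_slices((train_images, train_labels)).batch(128)

for images, labels in dataset:
    train_mnist(images, labels)  # the whole step runs as compiled XLA code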

Usage with Keras

For Keras models, jit_compile=True can be set as an argument to model.compile:
from tensorflow import keras

inputs = keras.Input(shape=(784,))  # Dense layers need the last axis defined
x = keras.layers.Dense(512, activation="relu")(inputs)
x = keras.layers.Dense(256, activation="relu")(x)
outputs = keras.layers.Dense(10, activation="softmax")(x)
model = keras.Model(inputs, outputs)

model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", jit_compile=True)
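With jit_compile=True set at compile time, the train step Keras generates for model.fit runs under XLA. As a quick smoke test on random data (the array shapes here are illustrative):

import numpy as np

x_train = np.random.random((1024, 784)).astype("float32")
y_train = np.random.randint(0, 10, size=(1024,))
model.fit(x_train, y_train, epochs=1, batch_size=128)  # XLA-compiled train step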

Auto-clustering

A simple way to start using XLA in TensorFlow models without any source changes is to enable auto-clustering, which automatically finds clusters (connected subgraphs) within TensorFlow functions that can be compiled and executed using XLA.
Auto-clustering on GPU can be enabled by setting the TF_XLA_FLAGS environment variable:
TF_XLA_FLAGS=--tf_xla_auto_jit=2 path/to/your/tf/program
It can also be enabled from within your TensorFlow program by calling tf.config.optimizer.set_jit(True).
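For example, the following is a rough programmatic equivalent of the environment variable above; note that set_jit enables auto-clustering globally for the process:

import tensorflow as tf

# Enable XLA auto-clustering for the whole program, with no model-code changes.
tf.config.optimizer.set_jit(True)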



📒 A Comprehensive Benchmark of TensorFlow Models

As mentioned above, we're presenting comprehensive benchmarks of XLA-compatible pre-trained models in Keras. We used pre-trained computer vision models shipped by Keras Applications, KerasCV Models, and TensorFlow Hub. These benchmarks were conducted across different image resolutions and different GPU devices to provide a holistic overview of the possible gains from XLA.
All benchmarks were measured in full precision (FP32); we did not use mixed precision.
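As a rough illustration of how such a measurement can be set up (the exact harness lives in the GitHub repository linked above; the model, batch size, and iteration counts here are assumptions for the sketch):

import time
import tensorflow as tf

model = tf.keras.applications.ResNet50(weights="imagenet")
infer = tf.function(lambda images: model(images, training=False), jit_compile=True)

batch = tf.random.normal((32, 224, 224, 3))

# Warm up so tracing and XLA compilation are excluded from the measurement.
for _ in range(5):
    _ = infer(batch).numpy()

start = time.perf_counter()
for _ in range(100):
    _ = infer(batch).numpy()  # .numpy() forces the GPU work to finish
elapsed = time.perf_counter() - start
print(f"Throughput: {100 * 32 / elapsed:.1f} images/sec")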
Below, you'll see an exhaustive benchmark plot containing results from all the experiments we conducted.

A comprehensive XLA benchmark of all image recognition models implemented in Keras shipped by Keras Applications, KerasCV Models and TensorFlow Hub.
[Interactive Weave panel: benchmark runs grouped by model family, covering RegNet_X, RegNet_Y, DeiT, EfficientNet_V1, EfficientNet_V2, ViT, Swin, ResNetRS, ConvNext, MLP-Mixer, ResNet_V1, ResNet_V2, DenseNet, YOLOv8, NASNet, RetinaNet, Inception, VGG, Xception, MobileNet_V1, MobileNet_V2, and MobileNet_V3.]

💡 Let's look at some interesting insights...

The curious case of the MobileNet Family


[Interactive panel: interesting insight about the MobileNet family of models.]

We notice that MobileNetV1, despite being a larger model, yields better throughput than the smaller model variants. We encourage you to explore other model families for which this observation holds. How many did you find?

Across different GPUs, how fast are the models with XLA?


[Interactive panel: model throughput with XLA across different GPUs.]



Resolution-wise distribution of the throughputs with XLA


[Interactive panel: resolution-wise distribution of throughputs with XLA.]



The fastest model at a fixed resolution changes depending on which GPU is used for benchmarking, although this effect becomes less pronounced as the resolution increases.



🎊 Which model is the Winner in terms of Throughput?


[Interactive panel: Podium for Throughput!]


🥳 Which model is the Winner in terms of Absolute Speedup?


[Interactive panel: Podium for Absolute Speedup!]



🏆 Which model is the Winner in terms of Speedup Percentage?


[Interactive panel: Podium for Speedup Percentage!]




🏁 Conclusion

  • In this article, we presented a comprehensive set of benchmarks of XLA-compatible, pre-trained vision models in Keras.
  • We analyzed the possible gains from XLA across different image resolutions and different GPU devices (A100, V100, and T4) for all vision models shipped by Keras Applications, KerasCV Models, and TensorFlow Hub.
  • We briefly explored how XLA optimizes TensorFlow programs using Operator Fusion and other techniques.
  • We used Weave, an open-source toolkit developed by the team at Weights & Biases, to build the performant, interactive, and insightful data visualization panels that power our analysis.
  • We are immensely grateful to:
    • Stacey Svetlichnaya, without whose constant guidance on Weave this report wouldn't have been possible.
    • The Keras Team for organizing the Keras Sprint, 2023.
  • For further reading on XLA, we recommend the following resources:

