XLA Compatibility of Vision Models in Keras
A set of comprehensive benchmarks around XLA compatibility of computer vision models implemented in Keras.
Introduction
In this article, we'll present a comprehensive set of benchmarks of XLA-compatible, pre-trained vision models in Keras.
Specifically, we'll use pre-trained computer vision models shipped by Keras Applications, KerasCV Models, and TensorFlow Hub. These models were benchmarked across different image resolutions and different GPU devices (A100, V100, and T4) to provide a holistic overview of the possible gains from XLA.
This report was made possible by Sayak Paul's massive effort of benchmarking all these models as part of the Keras Sprint, 2023. It was co-authored by Sayak Paul, Ayush Thakur, and Soumik Rakshit. In this article, we make extensive use of Weave, an open-source toolkit developed by the team at Weights & Biases for generating performant, interactive, and insightful data visualization panels.
The code used for creating this benchmark is available on github.com/sayakpaul/keras-xla-benchmarks.
Table of Contents
Introduction
🐌 Executing a Typical TensorFlow Program
☢️ Optimization using Operator Fusion
✈️ XLA: Optimizing Compiler for Machine Learning
🚀 Using XLA to Accelerate TensorFlow Programs
📒 A Comprehensive Benchmark of TensorFlow Models
💡 Let's look at some interesting insights...
🏁 Conclusion
🐌 Executing a Typical TensorFlow Program
Let's look at a typical TensorFlow program:
```python
from tensorflow import keras

# Dense layers need a known feature dimension; 784 (e.g., flattened
# MNIST images) is used here for illustration.
inputs = keras.Input(shape=(784,))
x = keras.layers.Dense(512, activation="relu")(inputs)
x = keras.layers.Dense(256, activation="relu")(x)
outputs = keras.layers.Dense(10, activation="softmax")(x)
model = keras.Model(inputs, outputs)
```
When a TensorFlow program is run, all of the operations are executed individually by the TensorFlow executor. Each TensorFlow operation has a pre-compiled GPU kernel implementation that the executor dispatches to.
A GPU kernel is a program that runs on the GPU in a massively parallel manner. Like any other program, these kernels need access to memory and compute, and launching too many kernels with too many memory accesses makes the program costlier to run.
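To make this concrete, here's a minimal sketch (the batch size is an illustrative assumption) of running the model defined above eagerly, where the executor dispatches a separate kernel for each operation:

```python
import tensorflow as tf

# Running the model eagerly: each matmul, bias-add, and activation in the
# three Dense layers is dispatched to its own pre-compiled kernel.
batch = tf.random.uniform((32, 784))
predictions = model(batch)
print(predictions.shape)  # (32, 10)
```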
☢️ Optimization using Operator Fusion
The moral of the story: if we can fuse operators, the program will likely run faster, because fusion reduces the number of memory operations needed to compute the same result.
Let's explore how we can extend this idea to neural networks!
✈️ XLA: Optimizing Compiler for Machine Learning
First off, we need to define XLA:
XLA (Accelerated Linear Algebra) is a domain-specific compiler for linear algebra that can accelerate TensorFlow models with potentially no source code changes.
XLA provides an alternative mode of running models: it compiles the TensorFlow graph into a sequence of computation kernels generated specifically for a given model. Because these kernels are unique to the model, they can exploit model-specific information for optimization.
Let's look at an optimization XLA does in the context of a simple TensorFlow computation:
```python
def model_fn(x, y, z):
    return tf.reduce_sum(x + y * z)
```
Typically, the graph launches three kernels if executed without XLA:
- One for the multiplication operation
- One for the addition operation
- One for the reduction operation
However, XLA can optimize the graph so that it computes the result in a single kernel launch. It does this by fusing the addition, multiplication, and reduction into a single GPU kernel. Moreover, this fused operation does not write the intermediate values produced by y * z and x + y * z out to memory; instead, it streams the results of these intermediate computations directly to their users while keeping them entirely in GPU registers.
Memory bandwidth is typically the scarcest resource on hardware accelerators, so removing memory operations is one of the best ways to improve performance. Hence, Operator Fusion is the single most important optimization in XLA.
🚀 Using XLA to Accelerate TensorFlow Programs
Let's explore the different ways we can use XLA to accelerate our TensorFlow programs, shall we?
Explicit Compilation
The explicit compilation API—given by tf.function(jit_compile=True)—offers fine-grained control over which functions get compiled. Let's see how we can accelerate the function we mentioned above:
```python
def model_fn(x, y, z):
    return tf.reduce_sum(x + y * z)

compiled_model_fn = tf.function(model_fn, jit_compile=True)
```
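To give a feel for the speedup, here's a minimal sketch that times the eager function against its XLA-compiled counterpart; the tensor shapes and iteration count are illustrative assumptions:

```python
import timeit

import tensorflow as tf

x = tf.random.uniform((1000, 1000))
y = tf.random.uniform((1000, 1000))
z = tf.random.uniform((1000, 1000))

# The first call triggers XLA compilation, so run it once before timing.
compiled_model_fn(x, y, z)

eager_time = timeit.timeit(lambda: model_fn(x, y, z), number=100)
xla_time = timeit.timeit(lambda: compiled_model_fn(x, y, z), number=100)
print(f"eager: {eager_time:.4f}s | XLA: {xla_time:.4f}s")
```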
Since tf.function accepts a function and returns a compiled one, we can also use it as a decorator. For example, the following TensorFlow function, which performs an MNIST training step, is compiled with XLA:
```python
@tf.function(jit_compile=True)
def train_mnist(images, labels):
    images, labels = cast(images, labels)

    with tf.GradientTape() as tape:
        predicted_labels = layer(images)
        loss = tf.reduce_mean(
            tf.nn.sparse_softmax_cross_entropy_with_logits(
                logits=predicted_labels, labels=labels
            )
        )

    layer_variables = layer.trainable_variables
    grads = tape.gradient(loss, layer_variables)
    optimizer.apply_gradients(zip(grads, layer_variables))
```
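The snippet above assumes that cast, layer, and optimizer are defined elsewhere in the program. A minimal sketch of what those definitions could look like (the layer, optimizer choice, and shapes are assumptions for illustration):

```python
import tensorflow as tf

layer = tf.keras.layers.Dense(10)       # a simple linear classifier
optimizer = tf.keras.optimizers.Adam()

def cast(images, labels):
    # Flatten the 28x28 MNIST images and cast to the dtypes the loss expects.
    images = tf.cast(tf.reshape(images, [-1, 784]), tf.float32)
    labels = tf.cast(labels, tf.int64)
    return images, labels
```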
Usage with Keras
```python
from tensorflow import keras

inputs = keras.Input(shape=(784,))  # feature dimension fixed for illustration
x = keras.layers.Dense(512, activation="relu")(inputs)
x = keras.layers.Dense(256, activation="relu")(x)
outputs = keras.layers.Dense(10, activation="softmax")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", jit_compile=True)
```
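From here, training runs the XLA-compiled graph under the hood. As a minimal sketch (the loss and the dummy data below are assumptions for illustration):

```python
import numpy as np

model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    jit_compile=True,
)

x_train = np.random.rand(64, 784).astype("float32")
y_train = np.random.randint(0, 10, size=(64,))
model.fit(x_train, y_train, epochs=1)
```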
Auto-clustering
A simple way to start using XLA in TensorFlow models without any source changes is to enable auto-clustering, which automatically finds clusters (connected subgraphs) within TensorFlow functions that can be compiled and executed with XLA.
Auto-clustering on GPU can be enabled by setting the TF_XLA_FLAGS environment variable:
```bash
TF_XLA_FLAGS=--tf_xla_auto_jit=2 path/to/your/tf/program
```
It can also be enabled in your TensorFlow program by calling the tf.config.optimizer.set_jit() function.
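For instance, a minimal sketch of enabling it programmatically:

```python
import tensorflow as tf

# Roughly equivalent in effect to TF_XLA_FLAGS=--tf_xla_auto_jit=2
# for the current process.
tf.config.optimizer.set_jit(True)
```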
📒 A Comprehensive Benchmark of TensorFlow Models
As mentioned above, we're presenting comprehensive benchmarks of XLA-compatible pre-trained models in Keras. We used pre-trained computer vision models shipped by Keras Applications, KerasCV Models, and TensorFlow Hub. These benchmarks were conducted across different image resolutions and different GPU devices to provide a holistic overview of the possible gains from XLA.
We run all benchmarks in full precision (FP32) and do not use mixed precision.
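The full benchmarking harness lives in the repository linked above; the sketch below only illustrates how one such throughput measurement could be structured. The model choice, batch size, and warmup/trial counts are assumptions for illustration:

```python
import time

import tensorflow as tf

model = tf.keras.applications.ResNet50(weights="imagenet")
infer_fn = tf.function(
    lambda images: model(images, training=False), jit_compile=True
)

batch = tf.random.uniform((32, 224, 224, 3))
for _ in range(10):  # warmup; the first call triggers XLA compilation
    _ = infer_fn(batch)

num_trials = 100
start = time.perf_counter()
for _ in range(num_trials):
    _ = infer_fn(batch)
elapsed = time.perf_counter() - start

print(f"throughput: {num_trials * 32 / elapsed:.2f} images/sec")
```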
Below, you'll see an exhaustive benchmark plot containing results from all the experiments we conducted.
A comprehensive XLA benchmark of all image recognition models implemented in Keras shipped by Keras Applications, KerasCV Models and TensorFlow Hub.
[Interactive benchmark panels omitted: the plot is faceted by model family, including MobileNet_V1, MobileNet_V2, MobileNet_V3, and EfficientNet_V2.]
💡 Let's look at some interesting insights...
The curious case of the MobileNet Family
[Interactive panel omitted: interesting insight about the MobileNet family of models.]
We notice that MobileNetV1, despite being a larger model, yields better throughput than its smaller variants. We encourage you to explore other model families for which this observation holds. How many can you find?
Across different GPUs, how fast are the models with XLA?
Resolution-wise distribution of the throughputs with XLA
For a fixed resolution, the fastest model changes when the GPU used for benchmarking changes. This phenomenon becomes less evident as the resolution increases.
🎊 Which model is the Winner in terms of Throughput?
Podium for Throughput!
🥳 Which model is the Winner in terms of Absolute Speedup?
Podium for Absolute Speedup!
🏆 Which model is the Winner in terms of Speedup Percentage?
Podium for Speedup Percentage!
🏁 Conclusion
- In this article, we present a comprehensive set of benchmarks of XLA-compatible pre-trained vision models in Keras.
- We analyze the possible gains from XLA across different image resolutions and different GPU devices (A100, V100, and T4) for all vision models shipped by Keras Applications, KerasCV Models, and TensorFlow Hub.
- We briefly explore how XLA can be used to optimize TensorFlow programs using Operator Fusion and other techniques.
- We use Weave, an open-source toolkit developed by the team at Weights & Biases, to build performant, interactive, and insightful data visualization panels that aid our analysis and present insights.
- We are immensely grateful to everyone involved in the Keras Sprint, 2023, which made this benchmarking effort possible.
- For further reading on XLA, we recommend the official XLA documentation (https://www.tensorflow.org/xla) and the benchmark repository at github.com/sayakpaul/keras-xla-benchmarks.