XLA Compatibility of Vision Models in Keras

A set of comprehensive benchmarks around XLA compatibility of computer vision models implemented in Keras.

Introduction

In this article, we'll present a comprehensive set of benchmarks of XLA-compatible, pre-trained vision models in Keras.
Specifically, we'll use pre-trained computer vision models shipped by Keras Applications, KerasCV Models, and TensorFlow Hub. These models were benchmarked across different image resolutions and different GPU devices (A100, V100, and T4) to provide a holistic overview of the possible gains from XLA.
This report was made possible by Sayak Paul's massive effort of benchmarking all these models as part of the Keras Sprint, 2023, and was co-authored by Sayak Paul, Ayush Thakur, and Soumik Rakshit. Throughout the article, we make extensive use of Weave, an open-source toolkit developed by the team at Weights & Biases for building performant, interactive, and insightful data visualization panels.
The code used for creating this benchmark is available on github.com/sayakpaul/keras-xla-benchmarks.

🐌 Executing a Typical TensorFlow Program

Let's look at a typical TensorFlow program:
from tensorflow import keras

# A simple stack of Dense layers; each operation maps to one or more GPU kernels.
inputs = keras.Input(shape=(784,))  # Dense layers need the last axis defined, so we use a concrete dimension
x = keras.layers.Dense(512, activation="relu")(inputs)
x = keras.layers.Dense(256, activation="relu")(x)
outputs = keras.layers.Dense(10, activation="softmax")(x)
model = keras.Model(inputs, outputs)
When a TensorFlow program is run, all of the operations are executed individually by the TensorFlow executor. Each TensorFlow operation has a pre-compiled GPU kernel implementation that the executor dispatches to.
A GPU kernel is a program that can be executed on the GPU in a massively parallel manner. Like any normal program, these kernels need access to memory and compute, and launching many kernels that perform many memory accesses makes the program costlier to run.

☢️ Optimization using Operator Fusion

A thread posted on Twitter by Horace He neatly sums up the idea behind operator fusion.

The moral of the story is that fusing operators reduces the number of memory operations they perform, which in turn makes the program run faster.
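To make this concrete, here's a minimal sketch that times a chain of pointwise operations eagerly and with XLA fusion enabled via jit_compile=True. The function, tensor shapes, and iteration counts are illustrative choices of ours, not the benchmark harness used later in this report:

import time
import tensorflow as tf

def pointwise_chain(x):
    # Three pointwise operations that XLA can fuse into a single GPU kernel.
    return tf.nn.relu(x) * 2.0 + 1.0

fused_chain = tf.function(pointwise_chain, jit_compile=True)

x = tf.random.normal((4096, 4096))
_ = fused_chain(x)  # warm-up: triggers tracing and XLA compilation

start = time.perf_counter()
for _ in range(100):
    out = fused_chain(x)
out.numpy()  # block until the GPU finishes before reading the clock
print(f"fused: {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
for _ in range(100):
    out = pointwise_chain(x)
out.numpy()
print(f"eager: {time.perf_counter() - start:.3f}s")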
Let's explore how we can extend this idea to neural networks!

✈️ XLA: Optimizing Compiler for Machine Learning

First off, we need to define XLA:
XLA (Accelerated Linear Algebra) is a domain-specific compiler for linear algebra that can accelerate TensorFlow models with potentially no source code changes.
XLA provides an alternative mode of running models: it compiles the TensorFlow graph into a sequence of computation kernels generated specifically for a given model. Because these kernels are unique to the model, they can exploit model-specific information for optimization.
Let's look at an optimization XLA does in the context of a simple TensorFlow computation:
import tensorflow as tf

def model_fn(x, y, z):
    return tf.reduce_sum(x + y * z)
Typically, the graph launches three kernels if executed without XLA:
  • One for the multiplication operation
  • One for the addition operation
  • One for the reduction operation
However, XLA can optimize the graph to compute the result in a single kernel launch. It does this by fusing the addition, multiplication, and reduction into a single GPU kernel. Moreover, this fused operation does not write out the intermediate values produced by y * z and x + y * z to memory. Instead, it streams the results of these intermediate computations directly to their users while keeping them entirely in GPU registers.
Memory bandwidth is typically the scarcest resource on hardware accelerators, so removing memory operations is one of the best ways to improve performance. Hence, Operator Fusion is the single most important optimization in XLA.
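To peek at what XLA actually generates for such a function, tf.function exposes an experimental hook for dumping the compiler IR. Here's a small sketch; note that experimental_get_compiler_ir is, as the name suggests, an experimental API whose behavior may change across TensorFlow versions, and the input shapes are arbitrary:

import tensorflow as tf

@tf.function(jit_compile=True)
def model_fn(x, y, z):
    return tf.reduce_sum(x + y * z)

x = y = z = tf.random.normal((1024,))

# Print the optimized HLO generated for this function; the multiply, add,
# and reduction appear inside a single fusion computation.
print(model_fn.experimental_get_compiler_ir(x, y, z)(stage="optimized_hlo"))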

🚀 Using XLA to Accelerate TensorFlow Programs

Let's explore the different ways we can use XLA to accelerate our TensorFlow programs, shall we?

Explicit Compilation

The explicit compilation API, available as tf.function(jit_compile=True), offers fine-grained control for choosing which functions should be compiled. Let's see how we can accelerate the function we mentioned above:
import tensorflow as tf

def model_fn(x, y, z):
    return tf.reduce_sum(x + y * z)

compiled_model_fn = tf.function(model_fn, jit_compile=True)
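Calling the compiled function is no different from calling the original. The first call traces and compiles; subsequent calls with the same input shapes and dtypes reuse the compiled kernel. The tensor shapes below are illustrative:

x = tf.random.normal((1024, 1024))
y = tf.random.normal((1024, 1024))
z = tf.random.normal((1024, 1024))

result = compiled_model_fn(x, y, z)  # first call: trace + XLA compilation
result = compiled_model_fn(x, y, z)  # later calls: reuse the compiled kernel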
Since tf.function accepts a function and returns a compiled function, we can also use it as a decorator. For example, the following TensorFlow function, which performs a single MNIST training step, is compiled with XLA:
import tensorflow as tf

# Assumed setup: the original tutorial defines a layer, an optimizer, and a
# cast helper elsewhere; these are representative stand-ins.
layer = tf.keras.layers.Dense(10)
optimizer = tf.keras.optimizers.Adam()

def cast(images, labels):
    # Flatten the images and cast inputs to the dtypes the layer and loss expect.
    images = tf.cast(tf.reshape(images, [-1, 784]), tf.float32)
    labels = tf.cast(labels, tf.int64)
    return images, labels

@tf.function(jit_compile=True)
def train_mnist(images, labels):
    images, labels = cast(images, labels)

    with tf.GradientTape() as tape:
        predicted_labels = layer(images)
        loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
            logits=predicted_labels, labels=labels
        ))
    layer_variables = layer.trainable_variables
    grads = tape.gradient(loss, layer_variables)
    optimizer.apply_gradients(zip(grads, layer_variables))
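A minimal driver loop for this step function might look like the following; the batch size is an arbitrary choice on our part:

# Load MNIST and feed batches through the XLA-compiled training step.
(train_images, train_labels), _ = tf.keras.datasets.mnist.load_data()
dataset = tf.data.Dataset.from_tensor_slices((train_images, train_labels)).batch(128)

for images, labels in dataset:
    train_mnist(images, labels)  # the whole step runs as compiled XLA code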

Usage with Keras

For Keras models, jit_compile=True can be set as an argument to model.compile:
from tensorflow import keras

inputs = keras.Input(shape=(784,))  # Dense layers need the last axis defined
x = keras.layers.Dense(512, activation="relu")(inputs)
x = keras.layers.Dense(256, activation="relu")(x)
outputs = keras.layers.Dense(10, activation="softmax")(x)
model = keras.Model(inputs, outputs)

model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", jit_compile=True)
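With jit_compile=True set at compile time, the train step Keras generates for model.fit runs under XLA. As a quick smoke test on random data (the array shapes here are illustrative):

import numpy as np

x_train = np.random.random((1024, 784)).astype("float32")
y_train = np.random.randint(0, 10, size=(1024,))
model.fit(x_train, y_train, epochs=1, batch_size=128)  # XLA-compiled train step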

Auto-clustering

A simple way to start using XLA in TensorFlow models without any source changes is to enable auto-clustering, which automatically finds clusters (connected subgraphs) within TensorFlow functions that can be compiled and executed using XLA.
Auto-clustering on GPU can be enabled by setting the TF_XLA_FLAGS environment variable:
TF_XLA_FLAGS=--tf_xla_auto_jit=2 path/to/your/tf/program
It can also be enabled from within your TensorFlow program by calling tf.config.optimizer.set_jit(True).
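For example, the following is a rough programmatic equivalent of the environment variable above; note that set_jit enables auto-clustering globally for the process:

import tensorflow as tf

# Enable XLA auto-clustering for the whole program, with no model-code changes.
tf.config.optimizer.set_jit(True)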



📒 A Comprehensive Benchmark of TensorFlow Models

As mentioned above, we're presenting comprehensive benchmarks of XLA-compatible pre-trained models in Keras. We used pre-trained computer vision models shipped by Keras Applications, KerasCV Models, and TensorFlow Hub. These benchmarks were conducted across different image resolutions and different GPU devices to provide a holistic overview of the possible gains from XLA.
All benchmarks were measured in full precision (FP32); we did not use mixed precision.
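As a rough illustration of how such a measurement can be set up (the exact harness lives in the GitHub repository linked above; the model, batch size, and iteration counts here are assumptions for the sketch):

import time
import tensorflow as tf

model = tf.keras.applications.ResNet50(weights="imagenet")
infer = tf.function(lambda images: model(images, training=False), jit_compile=True)

batch = tf.random.normal((32, 224, 224, 3))

# Warm up so tracing and XLA compilation are excluded from the measurement.
for _ in range(5):
    _ = infer(batch).numpy()

start = time.perf_counter()
for _ in range(100):
    _ = infer(batch).numpy()  # .numpy() forces the GPU work to finish
elapsed = time.perf_counter() - start
print(f"Throughput: {100 * 32 / elapsed:.1f} images/sec")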
Below, you'll see an exhaustive benchmark plot containing results from all the experiments we conducted.

A comprehensive XLA benchmark of all image recognition models implemented in Keras shipped by Keras Applications, KerasCV Models and TensorFlow Hub.
[Interactive Weave panel: benchmark runs grouped by model family, covering RegNet_X, RegNet_Y, DeiT, EfficientNet_V1, EfficientNet_V2, ViT, Swin, ResNetRS, ConvNext, MLP-Mixer, ResNet_V1, ResNet_V2, DenseNet, YOLOv8, NASNet, RetinaNet, Inception, VGG, Xception, MobileNet_V1, MobileNet_V2, and MobileNet_V3.]

💡 Let's look at some interesting insights...

The curious case of the MobileNet Family


[Interactive panel: interesting insight about the MobileNet family of models.]

We notice that MobileNetV1, despite being a larger model, yields better throughput than the smaller model variants. We encourage you to explore other model families for which this observation holds. How many did you find?

Across different GPUs, how fast are the models with XLA?


[Interactive panel: model throughput with XLA across different GPUs.]



Resolution-wise distribution of the throughputs with XLA


[Interactive panel: resolution-wise distribution of throughputs with XLA.]



The fastest model at a fixed resolution changes depending on which GPU is used for benchmarking, although this effect becomes less pronounced as the resolution increases.



🎊 Which model is the Winner in terms of Throughput?


[Interactive panel: Podium for Throughput!]


🥳 Which model is the Winner in terms of Absolute Speedup?


[Interactive panel: Podium for Absolute Speedup!]



🏆 Which model is the Winner in terms of Speedup Percentage?


[Interactive panel: Podium for Speedup Percentage!]




🏁 Conclusion

  • In this article, we presented a comprehensive set of benchmarks of XLA-compatible, pre-trained vision models in Keras.
  • We analyzed the possible gains from XLA across different image resolutions and different GPU devices (A100, V100, and T4) for all vision models shipped by Keras Applications, KerasCV Models, and TensorFlow Hub.
  • We briefly explored how XLA optimizes TensorFlow programs using Operator Fusion and other techniques.
  • We used Weave, an open-source toolkit developed by the team at Weights & Biases, to build the performant, interactive, and insightful data visualization panels that power our analysis.
  • We are immensely grateful to:
    • Stacey Svetlichnaya, without whose constant guidance on Weave this report wouldn't have been possible.
    • The Keras Team for organizing the Keras Sprint, 2023.
  • For further reading on XLA, we recommend the following resources:

