
Speed Up Stable Diffusion on Your M1 Pro MacBook Pro

How to speed up Stable Diffusion inference and get it running as fast as possible on your M1 Pro MacBook Pro laptop.
Created on September 20 | Last edited on November 9
Everyone is playing with Stable Diffusion (SD), and so am I. But using my own computer to perform inference and generate the hot new memes has been painful. There are multiple implementations, some of which are considerably faster than others. Today, we'll look at what you can do to run these models as fast as possible. Why? So we can generate more images!
NOTE: this is a moving target, and this article outlines the findings as of September 27th, 2022.
💡


The Hardware

My current computer is an M1 Pro MacBook Pro, which is equipped with the new generation of Apple silicon. PyTorch and TensorFlow have been making progress on running natively on this new hardware, but support is still very much in beta.
"PyTorch version of Stable Diffusion on Apple Mac." Courtesy of Tensorflow SD on an Apple M1, 😎 very meta indeed 😎

Performance Evaluation of Inference on Stable Diffusion

What we'll be doing is computing the average time per diffusion step. To do this, we will instrument the codebase with W&B and log the time spent on each diffusion step. Basically, we'll be wrapping the corresponding diffusion step with a timing block similar to this one:
import time
import wandb

## previous code
## somewhere down in the Stable Diffusion sampling loop

for step in diffusion_steps:
    t0 = time.perf_counter()

    ## perform the diffusion step here

    tf = time.perf_counter() - t0

    ## log the elapsed time (in seconds) to W&B
    wandb.log({"seg_per_iter": tf})
Why, you ask? This way, we time only the forward pass of the model and not the pre- and post-processing of the images. This makes for a fairer comparison, as some processing tasks are faster or slower on different architectures. Notably, the preprocessing functions are very fast on Apple hardware, thanks to the unified memory.
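Once a run has finished, one way to get the average time per step is to pull the logged values back with the W&B public API. A minimal sketch, assuming a placeholder run path entity/project/run_id:

import wandb

api = wandb.Api()
run = api.run("entity/project/run_id")  ## placeholder: your entity/project/run_id

## collect every logged "seg_per_iter" value from the run history
times = [row["seg_per_iter"] for row in run.scan_history(keys=["seg_per_iter"])]
print(f"average time per diffusion step: {sum(times) / len(times):.3f} s")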

PyTorch implementations

The official implementation of SD is in PyTorch, and notably, Hugging Face has integrated it into their Diffusers library.
Here, we'll use the Diffusers implementation with some tweaks to make it run faster on Apple hardware.
💻 Our machine is an M1 Pro with 16 GPU cores and 16 GB of memory. The libraries we are using are:
  • python: 3.9.13
  • diffusers: 0.3.0 + 0.7.0
  • torch: 1.13.0
  • coremltools: 6.0b2
We also add a V100 (16GB) machine as a reference.
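For reference, here is a minimal sketch of how the Diffusers pipeline can be run on the Apple GPU through PyTorch's mps backend. The model ID and the warm-up trick follow the Diffusers documentation; they are assumptions, not the exact script used for these numbers:

import torch
from diffusers import StableDiffusionPipeline

## load the pipeline (requires accepting the model license and a Hugging Face login)
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    use_auth_token=True,
)
pipe = pipe.to("mps")  ## move the whole pipeline to the Apple GPU

## one-time warm-up: the first mps forward pass is disproportionately slow
_ = pipe("warm-up prompt", num_inference_steps=1)

image = pipe("a photo of an astronaut riding a horse").images[0]
image.save("astronaut.png")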


If you want to know more about how to install PyTorch to run natively on your MacBook, check this article. To use the CoreML-exported model, you need to follow the steps here.
💡
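A quick sanity check that your PyTorch install can actually see the Apple GPU:

import torch

## both should print True on a working Apple silicon setup
print(torch.backends.mps.is_built())      ## PyTorch was compiled with MPS support
print(torch.backends.mps.is_available())  ## the MPS device is usable right now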

TensorFlow

There is currently only one implementation of Stable Diffusion in TensorFlow/Keras. It was created by @divamgupta and later upgraded by François Chollet himself. This is the best implementation so far: extremely readable and concise. It's also fast!
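To give an idea of the interface, here is a minimal sketch using the KerasCV port of that implementation. keras_cv.models.StableDiffusion is the assumption here; @divamgupta's original stable_diffusion_tensorflow package exposes a very similar API:

from keras_cv.models import StableDiffusion

## build the model at the default 512x512 resolution
model = StableDiffusion(img_height=512, img_width=512)

## returns a numpy array of shape (batch_size, 512, 512, 3)
images = model.text_to_image(
    "a photo of an astronaut riding a horse",
    batch_size=1,
)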


To install TensorFlow on your Mac, you can check this report.
💡

What is CoreML?


The packaging paradigm of CoreML
CoreML is an Apple framework for integrating machine learning models into your apps. It lets you package your model as code capable of using the Apple hardware specifically designed to run neural networks fast. Exporting our model to CoreML (we only exported the UNET part of SD) made our inference twice as fast.
Core ML optimizes on-device performance by leveraging the CPU, GPU, and Apple Neural Engine (ANE) while minimizing its memory footprint and power consumption.
Still: it's not all roses. Stable Diffusion is a complex model with multiple blocks. Converting the whole model is not possible right now, as some parts are simply not compatible with CoreML. The slowest part of the model is the UNET, so converting this block was the priority. Some torch operations are not supported, like torch.einsum or torch.gelu, so some patching is needed. If you want to see the patching used, you can check this repo.
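To give a flavor of what the export looks like, here is a minimal sketch of converting a traced PyTorch module with coremltools. TinyNet is a hypothetical stand-in; the real SD UNET additionally needs the patches mentioned above for its unsupported ops:

import torch
import coremltools as ct

## hypothetical stand-in module; the real export targets the SD UNET
class TinyNet(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.gelu(x)

model = TinyNet().eval()
example = torch.rand(1, 4, 64, 64)

## CoreML conversion starts from a traced (TorchScript) module
traced = torch.jit.trace(model, example)

mlmodel = ct.convert(
    traced,
    convert_to="mlprogram",
    inputs=[ct.TensorType(name="x", shape=example.shape)],
)
mlmodel.save("tinynet.mlpackage")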

What model is the best for the Apple M1 Pro/Max processor?

We compared the TensorFlow and PyTorch implementations on my machine.
The TensorFlow implementation was the most stable and the least finicky: it always performed similarly, regardless of how many programs I had open. In contrast, the PyTorch implementation is very tricky, especially on the GPU. Sometimes it will run fast (around 2.5 samples per second) and sometimes it will be extremely slow.
The best results are obtained with a cold start, with nothing open besides your terminal running the code.



How is the GPU actually being used on my machine?

We can check how the GPU is being used while the code runs by opening Activity Monitor's GPU history window (Cmd+4).
The TensorFlow code maxes out the Apple GPU
PyTorch codebase with the CoreML-exported UNET

In Closing

It is totally possible to run Stable Diffusion on your Mac with decent performance, but if you want to go on an exploratory analysis of architecture or painters, you are better off leveraging some CUDA cores!
I am hoping to see a full end-to-end CoreML model so I can run this on an iPhone and use the integrated camera to feed the diffusion or do inpainting.

