
PyTorch Runs On the GPU of Apple M1 Macs Now! - Announcement With Code Samples

Let's try PyTorch's new Metal backend on Apple Macs equipped with M1 processors!
Created on May 18 | Last edited on June 21
Since Apple launched the M1-equipped Macs, we have been waiting for PyTorch to run natively and make use of the powerful GPU inside these little machines. TensorFlow has been available since the early days of the M1 Macs, but we PyTorch lovers had to fall back to CPU-only PyTorch.
Today, 🔥 PyTorch announced that the wait is finally over: we now have access to the nightly PyTorch preview that supports the Metal backend (similar to the CUDA backend).
🧐 A backend is a highly optimized low-level library that enables PyTorch to take full advantage of the GPU's specific instructions. This makes computations massively parallel and way faster.

😎 Early Benchmark: TensorFlow vs PyTorch

Results of PyTorch on Apple Hardware (and some Nvidia)

We will run two training scripts:
  • A vision ResNet50
  • A Hugging Face BERT model
We use PyTorch Nightly, which is a beta release, but training works out of the box. Just pass device="mps" to your training script and you are good to go!
import torch

torch.tensor([1, 2, 3], device="mps")  # that's it!
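If you want a script that degrades gracefully on machines without the new backend, the nightly also ships an availability check. Here is a minimal sketch using only stock PyTorch APIs (torch.backends.mps.is_available() and torch.cuda.is_available()):

import torch

# Pick the best available device: MPS on Apple Silicon, CUDA on Nvidia, else CPU
if torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

# Models and tensors move to the device exactly as they do with CUDA
model = torch.nn.Conv2d(3, 16, kernel_size=3).to(device)
x = torch.rand(8, 3, 224, 224, device=device)
y = model(x)  # runs on the GPU when device is "mps"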
Apple claims a nice speedup from CPU to GPU, and our results confirm this for both ResNet50 and BERT.



Relative to Nvidia

Let's put some Nvidia hardware in for comparison...😱
Here are the results of a simple ResNet50 training script on the Oxford Pets dataset; see the section below to run it yourself! You can compare this table to the TensorFlow one here.

[Embedded chart: ResNet50 training results (samples/second), run set of 173 runs]

😎 The Nvidia GPU supports Mixed Precision Training, so you get even more out of the hardware!
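For reference, here is what a mixed-precision training step looks like on CUDA, as a minimal sketch with a toy model (torch.cuda.amp is CUDA-only; the MPS backend does not support it at the time of writing):

import torch

device = torch.device("cuda")
model = torch.nn.Linear(512, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid fp16 underflow

x = torch.randn(32, 512, device=device)
target = torch.randint(0, 10, (32,), device=device)

optimizer.zero_grad()
with torch.cuda.amp.autocast():  # ops run in fp16 where it is numerically safe
    loss = torch.nn.functional.cross_entropy(model(x), target)
scaler.scale(loss).backward()  # backward pass on the scaled loss
scaler.step(optimizer)         # unscales the gradients, then takes the step
scaler.update()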

For NLP, let's try BERT training

We use a Hugging Face BERT model in PyTorch. The training script can be found here.
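To get a taste of what the benchmark exercises, here is a minimal sketch of a BERT forward pass on the Metal backend using the Hugging Face Auto classes (bert-base-uncased is my assumption here; the real benchmark runs the linked training script):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = torch.device("mps")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased").to(device)

# Tokenize a batch and move it to the GPU, just like with CUDA
batch = tokenizer(["PyTorch now runs on Apple GPUs!"], return_tensors="pt").to(device)
with torch.no_grad():
    logits = model(**batch).logits  # computed on the Metal backend
print(logits.shape)  # torch.Size([1, 2])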

[Embedded chart: BERT training results (samples/second), run set of 179 runs]

The M1 Max GPU is considerably faster for this attention-based model; the extra RAM probably helps here more than the extra GPU cores.
We are not catching up to Nvidia yet...

Run this benchmark and contribute to this table 🚀

I have put up a repo with a setup.sh script to help automate this procedure here.

Installation

You will need Python installed; the preferred way is using Miniforge.
# The version of Miniforge may be different depending on when you are installing
$ curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh
$ sh Miniforge3-MacOSX-arm64.sh
# and follow the prompts. The defaults are generally good.
Then, create an environment to use Python and PyTorch:
$ conda create --name="pt" "python<3.11"

# activate the environment
$ conda activate pt

# install PyTorch
$ conda install pytorch torchvision -c pytorch

# install dependencies of this training script 😎
$ pip install wandb tqdm transformers datasets
For more details on installing Python, check this report.

Verifying the installation

In Python, run the following:
>>> import torch
>>> torch.__version__
'1.13.1'
>>> torch.tensor([1, 2, 3], device="mps")
tensor([1, 2, 3], device='mps:0')
If this works, you are done and have MPS (Metal) backend support available.

Training a Model

We will train a model on the Oxford Pets dataset; feel free to modify and play with it!
  • You can verify that the GPU is being used by opening Activity Monitor and checking the GPU History window; you should see something like this while training:
As soon as you launch training, the bars fill up!
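The core of the loop is the standard PyTorch training pattern; the only MPS-specific part is where the model and batches live. Here is a simplified sketch (the dummy batch below stands in for a real Oxford Pets DataLoader; the full logic lives in the linked train_pets.py):

import torch
import torchvision

device = torch.device("mps")
model = torchvision.models.resnet50(num_classes=37).to(device)  # 37 pet breeds
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

# Dummy batch standing in for the real Oxford Pets DataLoader
dataloader = [(torch.randn(8, 3, 224, 224), torch.randint(0, 37, (8,)))]

for images, labels in dataloader:
    images, labels = images.to(device), labels.to(device)  # move the batch to the GPU
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()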

How to run this benchmark on your machine?

You will need an environment with a nightly PyTorch setup first. You can download the train_pets.py file from here, and you are good to go!
Then, you can run the training:
python train_pets.py --device="mps" --gpu_name="M1Pro GPU 16 Cores"
  • Pass the --gpu_name flag to group the runs; I am not able to detect the GPU name automatically on Apple hardware.
  • To run on CPU, pass --device="cpu"; for CUDA, pass --device="cuda" (you need a Linux PC with an Nvidia GPU).
  • You can also pass other params and play with different --batch_size and --model_name values; see the argparse sketch below.
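For reference, here is a hypothetical sketch of how these flags could be wired up with argparse. The flag names match the ones above, but the real train_pets.py in the linked repo may differ:

import argparse

parser = argparse.ArgumentParser(description="ResNet50 benchmark on Oxford Pets")
parser.add_argument("--device", default="mps", choices=["cpu", "cuda", "mps"])
parser.add_argument("--gpu_name", default="M1Pro GPU 16 Cores",
                    help="label used to group runs; not auto-detected on Apple")
parser.add_argument("--batch_size", type=int, default=64)
parser.add_argument("--model_name", type=str, default="resnet50")
args = parser.parse_args()

print(f"Training {args.model_name} on {args.device} ({args.gpu_name})")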
If you need any help, please contact me or reply in the comments below.


See the training script

Conclusions

It is available today, but it's not ready for prime time. Keep an eye on the PyTorch GitHub repo; there are already a bunch of issues about missing ops and little problems here and there.
The best thing you can do is play with it and submit issues so it keeps improving.


Jonathan Badger •
I was a bit confused about the performance of the 1080Ti in the figure showing samples/sec for Resnet50. It's a lot higher than I would have expected. When I poked around in the run data I noticed it says the number of GPUs in the run is 4...not 1. Not sure if this was an oversight or not. Great article and comparison chart though...love it!
Willian Zhang •
Suggest using "max" instead of "average" as the aggregation in Training: Samples/second. Also, filter out results tagged with "hidden" so that irrelevant results can be excluded without deleting data.
Morgan •
Regarding "Training Time M1Pro: 16 GPU cores vs CPU 10 cores - PyTorch": just to clarify, this is for PyTorch, right?
Karrtik Iyer •
Thanks Thomas; however, I am unable to access this GitHub link: https://github.com/tcapelle/m1_pro_pytorch
Tags: ML News