
PyTorch Runs On the GPU of Apple M1 Macs Now! - Announcement With Code Samples

Let's try PyTorch's new Metal backend on Apple Macs equipped with M1 processors!
Created on May 18 | Last edited on June 21
Since Apple launched the M1-equipped Macs, we have been waiting for PyTorch to run natively and make use of the powerful GPU inside these little machines. TensorFlow has been available since the early days of the M1 Macs, but we PyTorch lovers had to fall back to CPU-only PyTorch.
Today, 🔥 PyTorch announced that the wait is finally over: we now have access to the nightly PyTorch preview that supports the Metal backend (similar to the CUDA backend).
🧐 A backend is a highly optimized low-level library that enables PyTorch to take full advantage of the GPU's specific instructions. This makes computations massively parallel and way faster.

😎 Early Benchmark: TensorFlow vs PyTorch

Results of PyTorch on Apple Hardware (and some Nvidia)

We will run two training scripts:
  • A vision ResNet50
  • A Hugging Face BERT model
We use PyTorch Nightly, which is a beta release, but training works out of the box. Just pass device="mps" to your training script and you are good to go!
import torch

torch.tensor([1, 2, 3], device="mps")  # that's it!
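If you want a script that degrades gracefully on machines without the new backend, the nightly also ships an availability check. Here is a minimal sketch using only stock PyTorch APIs (torch.backends.mps.is_available() and torch.cuda.is_available()):

import torch

# Pick the best available device: MPS on Apple Silicon, CUDA on Nvidia, else CPU
if torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

# Models and tensors move to the device exactly as they do with CUDA
model = torch.nn.Conv2d(3, 16, kernel_size=3).to(device)
x = torch.rand(8, 3, 224, 224, device=device)
y = model(x)  # runs on the GPU when device is "mps"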
Apple claims a nice speedup from CPU to GPU, and our results confirm this for both ResNet50 and BERT.



Relative to Nvidia

Let's put some Nvidia hardware in for comparison...😱
Here are the results of a simple ResNet50 training script on the Oxford Pets dataset; see the section below to run it yourself! You can compare this table to the TensorFlow one here.

[Embedded chart: ResNet50 training results (samples/second), run set of 173 runs]

😎 The Nvidia GPU supports Mixed Precision Training, so you get even more out of the hardware!
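For reference, here is what a mixed-precision training step looks like on CUDA, as a minimal sketch with a toy model (torch.cuda.amp is CUDA-only; the MPS backend does not support it at the time of writing):

import torch

device = torch.device("cuda")
model = torch.nn.Linear(512, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid fp16 underflow

x = torch.randn(32, 512, device=device)
target = torch.randint(0, 10, (32,), device=device)

optimizer.zero_grad()
with torch.cuda.amp.autocast():  # ops run in fp16 where it is numerically safe
    loss = torch.nn.functional.cross_entropy(model(x), target)
scaler.scale(loss).backward()  # backward pass on the scaled loss
scaler.step(optimizer)         # unscales the gradients, then takes the step
scaler.update()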

For NLP, let's try BERT training

We use a Hugging Face BERT model in PyTorch. The training script can be found here.
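To get a taste of what the benchmark exercises, here is a minimal sketch of a BERT forward pass on the Metal backend using the Hugging Face Auto classes (bert-base-uncased is my assumption here; the real benchmark runs the linked training script):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = torch.device("mps")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased").to(device)

# Tokenize a batch and move it to the GPU, just like with CUDA
batch = tokenizer(["PyTorch now runs on Apple GPUs!"], return_tensors="pt").to(device)
with torch.no_grad():
    logits = model(**batch).logits  # computed on the Metal backend
print(logits.shape)  # torch.Size([1, 2])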

[Embedded chart: BERT training results (samples/second), run set of 179 runs]

The M1 Max GPU is considerably faster for this attention-based model; the extra RAM probably helps here more than the extra GPU cores.
We are not catching up to Nvidia yet...

Run this benchmark and contribute to this table 🚀

I have put up a repo with a setup.sh script to help automate this procedure here.

Installation

You will need Python installed; the preferred way is using Miniforge.
# The version of Miniforge may be different depending on when you are installing
$ curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh
$ sh Miniforge3-MacOSX-arm64.sh
# and follow the prompts. The defaults are generally good.
Then, create an environment to use Python and PyTorch:
$ conda create --name="pt" "python<3.11"

# activate the environment
$ conda activate pt

# install PyTorch
$ conda install pytorch torchvision -c pytorch

# install dependencies of this training script 😎
$ pip install wandb tqdm transformers datasets
For more details on installing Python, check this report.

Verifying the installation

In Python, run the following:
>>> import torch
>>> torch.__version__
'1.13.1'
>>> torch.tensor([1, 2, 3], device="mps")
tensor([1, 2, 3], device='mps:0')
If this works, you are done and have MPS (Metal) backend support available.

Training a Model

We will train a model on the Oxford Pets dataset; feel free to modify and play with it!
  • You can verify that the GPU is being used by opening Activity Monitor and checking the GPU History window; you should see something like this while training:
As soon as you launch training, the bars fill up!
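The core of the loop is the standard PyTorch training pattern; the only MPS-specific part is where the model and batches live. Here is a simplified sketch (the dummy batch below stands in for a real Oxford Pets DataLoader; the full logic lives in the linked train_pets.py):

import torch
import torchvision

device = torch.device("mps")
model = torchvision.models.resnet50(num_classes=37).to(device)  # 37 pet breeds
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

# Dummy batch standing in for the real Oxford Pets DataLoader
dataloader = [(torch.randn(8, 3, 224, 224), torch.randint(0, 37, (8,)))]

for images, labels in dataloader:
    images, labels = images.to(device), labels.to(device)  # move the batch to the GPU
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()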

How to run this benchmark on your machine?

You will need an environment with a nightly PyTorch setup first. You can download the train_pets.py file from here, and you are good to go!
Then, you can run the training:
python train_pets.py --device="mps" --gpu_name="M1Pro GPU 16 Cores"
  • Pass the --gpu_name flag to group the runs; I am not able to detect the GPU name automatically on Apple hardware.
  • To run on CPU, pass --device="cpu"; for CUDA, pass --device="cuda" (you need a Linux PC with an Nvidia GPU).
  • You can also pass other params and play with different --batch_size and --model_name values; see the argparse sketch below.
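For reference, here is a hypothetical sketch of how these flags could be wired up with argparse. The flag names match the ones above, but the real train_pets.py in the linked repo may differ:

import argparse

parser = argparse.ArgumentParser(description="ResNet50 benchmark on Oxford Pets")
parser.add_argument("--device", default="mps", choices=["cpu", "cuda", "mps"])
parser.add_argument("--gpu_name", default="M1Pro GPU 16 Cores",
                    help="label used to group runs; not auto-detected on Apple")
parser.add_argument("--batch_size", type=int, default=64)
parser.add_argument("--model_name", type=str, default="resnet50")
args = parser.parse_args()

print(f"Training {args.model_name} on {args.device} ({args.gpu_name})")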
If you need any help, please contact me or reply in the comments below.


See the training script

Conclusions

It is available today, but it's not ready for prime time. Keep an eye on the PyTorch GitHub repo; there are already a bunch of issues about missing ops and little problems here and there.
The best thing you can do is play with it and submit issues so it keeps improving.


Jonathan Badger •
I was a bit confused about the performance of the 1080Ti in the figure showing samples/sec for Resnet50. It's a lot higher than I would have expected. When I poked around in the run data I noticed it says the number of GPUs in the run is 4...not 1. Not sure if this was an oversight or not. Great article and comparison chart though...love it!
Willian Zhang •
Suggest using "max" instead of "average" as the aggregation in Training: Samples/second. Also, filter out results tagged with "hidden" so that irrelevant results can be excluded without deleting data.
Morgan •
Regarding "Training Time M1Pro: 16 GPU cores vs CPU 10 cores - PyTorch": just to clarify, this is for PyTorch, right?
Karrtik Iyer •
Thanks Thomas; however, I am unable to access this GitHub link: https://github.com/tcapelle/m1_pro_pytorch
Tags: ML News