PyTorch Runs On the GPU of Apple M1 Macs Now! - Announcement With Code Samples
Let's try PyTorch's new Metal backend on Apple Macs equipped with M1 processors!
Created on May 18 | Last edited on June 21
Since Apple launched the M1-equipped Macs, we have been waiting for PyTorch to run natively and make use of the powerful GPU inside these little machines. TensorFlow has been available since the early days of the M1 Macs, but we PyTorch lovers had to fall back to CPU-only PyTorch.
Today, 🔥 PyTorch announced that the wait is finally over: we can now access a nightly PyTorch preview that supports the Metal backend (similar to the CUDA backend).
🧐 A backend is a highly optimized low-level library that lets PyTorch take full advantage of the instructions specific to the GPU. This makes computations much faster by running them in parallel.
😎 Early Benchmark: TensorFlow vs PyTorch
Results of PyTorch on Apple Hardware (and some Nvidia)
We will run two training scripts:
- A vision ResNet50
- A Hugging Face BERT model
We use PyTorch Nightly, which is a beta release, but training works out of the box. Just pass device="mps" to your training script and you are good to go!
```python
import torch

torch.tensor([1, 2, 3], device="mps")  # that's it!
```
Relative to Nvidia
Let's throw in some Nvidia hardware for comparison... 😱
Here are the results of a simple training script for a ResNet50 on the Oxford Pets dataset; see the section below to run it yourself! You can compare this table to the TensorFlow one here.
[Chart: ResNet50 training throughput (samples/second) across devices; run set of 173 runs]
😎 The Nvidia GPU supports mixed-precision training, so you get even more out of the hardware!
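For reference, here is a minimal sketch of what mixed-precision training looks like on a CUDA device in PyTorch; the tiny linear model and random batches are stand-ins for illustration, not the actual benchmark script:

```python
import torch
import torch.nn as nn

# Stand-in model and optimizer, just to show the AMP pattern.
model = nn.Linear(128, 10).to("cuda")
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()  # rescales the loss so fp16 gradients don't underflow

for step in range(10):
    inputs = torch.randn(32, 128, device="cuda")
    labels = torch.randint(0, 10, (32,), device="cuda")
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = loss_fn(model(inputs), labels)  # eligible ops run in float16
    scaler.scale(loss).backward()
    scaler.step(optimizer)  # unscales gradients before the optimizer step
    scaler.update()
```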
For NLP, let's try BERT training
[Chart: BERT training throughput (samples/second) across devices; run set of 179 runs]
The M1 Max GPU is considerably faster for this attention-based model; the extra RAM probably helps more here than the extra GPU cores.
We are not yet chasing Nvidia...
Run this benchmark and contribute to this table 🚀
Installation
First, install Miniforge (a conda installer with native Apple Silicon support):

```bash
# The version of Miniforge may be different depending on when you are installing
$ curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh
$ sh Miniforge3-MacOSX-arm64.sh
# and follow the prompts. The defaults are generally good.
```
Then, create an environment to use Python and PyTorch:
```bash
$ conda create --name="pt" "python<3.11"
# activate the environment
$ conda activate pt
# install PyTorch
$ conda install pytorch torchvision -c pytorch
# install the dependencies of this training script 😎
$ pip install wandb tqdm transformers datasets
```
Verifying the installation
In Python, run the following:
```python
>>> import torch
>>> torch.__version__
'1.13.1'
>>> torch.tensor([1, 2, 3], device="mps")
```
If this works, you are done and have the MPS (Metal) backend available.
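You can also check for the backend programmatically. Here is a small sketch of defensive device selection (torch.backends.mps.is_available() ships with PyTorch 1.12 and later):

```python
import torch

# Pick the best available device: MPS on Apple Silicon, else CUDA, else CPU.
if torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

print(f"Using device: {device}")
```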
Training a Model
We will train a model on the Oxford Pets dataset, feel free to modify and play with it!
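The full script is linked further down; as a rough sketch of what its core loop looks like, here is a torchvision ResNet50 trained on random stand-in batches (the real script uses the actual Pets dataloader; Oxford Pets has 37 breeds, hence 37 classes):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

device = torch.device("mps")

# ResNet50 with 37 output classes; random tensors stand in for the real dataloader.
model = resnet50(num_classes=37).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

model.train()
for step in range(10):
    images = torch.randn(8, 3, 224, 224, device=device)
    labels = torch.randint(0, 37, (8,), device=device)
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()
    print(f"step {step}: loss {loss.item():.4f}")
```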
- You can verify that the GPU is being used by opening Activity Monitor and checking the GPU History window; you should see something like this while training:

As soon as you launch training, the bars fill up!
How to run this benchmark on your machine
You will need an environment with a nightly PyTorch setup first. You can download the train_pets.py file from here and you are good to go!
Then, you can run the training:
```bash
python train_pets.py --device="mps" --gpu_name="M1Pro GPU 16 Cores"
```
- Pass the --gpu_name flag to group the runs; I am not able to detect the GPU name automatically on Apple hardware.
- To run on the CPU, pass --device="cpu"; for CUDA, pass --device="cuda" (you need a Linux PC with an Nvidia GPU).
- You can also pass other parameters and play with different batch_size and model_name values; the sketch below shows how these flags might be wired up.
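For illustration, the flags might be parsed along these lines; this is a hypothetical sketch, not the actual contents of train_pets.py:

```python
import argparse
import torch

# Hypothetical flag parsing; the real script may name or default things differently.
parser = argparse.ArgumentParser()
parser.add_argument("--device", default="mps", choices=["mps", "cuda", "cpu"])
parser.add_argument("--gpu_name", default="unknown", help="only used to group runs in the dashboard")
parser.add_argument("--batch_size", type=int, default=32)
parser.add_argument("--model_name", default="resnet50")
args = parser.parse_args()

device = torch.device(args.device)  # e.g. torch.device("mps")
```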
If you need any help, please contact me or leave a comment below.
See the training script
Conclusions
It is available today, but it's not ready for prime time. Keep an eye on the PyTorch GitHub repo: there are already a bunch of issues about missing ops and little problems here and there.
The best thing you can do is play with it and submit issues so it keeps improving.
Deep Learning on the M1 Pro with Apple Silicon
Let's take my new Macbook Pro for a spin and see how well it performs, shall we?
Can Apple’s M1 Help You Train Models Faster & Cheaper Than NVIDIA’s V100?
In this article, we analyze the runtime, energy usage, and performance of Tensorflow training on an M1 Mac Mini and Nvidia V100.
Comments
I was a bit confused about the performance of the 1080Ti in the figure showing samples/sec for ResNet50. It's a lot higher than I would have expected. When I poked around in the run data, I noticed it says the number of GPUs in the run is 4... not 1. Not sure if this was an oversight or not. Great article and comparison chart though... love it!
Suggestion: use "max" instead of "average" as the aggregation in the Training: Samples/second chart. Also, filter out results tagged "hidden", so irrelevant results can be excluded without deleting data.
Regarding the "Training Time M1Pro: 16 GPU cores vs CPU 10 cores" chart: just to clarify, this is for PyTorch, right?
Thanks Thomas, however I am unable to access this GitHub link: https://github.com/tcapelle/m1_pro_pytorch