
Can Apple’s M1 Help You Train Models Faster & Cheaper Than NVIDIA’s V100?

In this article, we analyze the runtime, energy usage, and performance of TensorFlow training on an M1 Mac Mini and an Nvidia V100.
In this article, we run a sweep of eight different configurations of our training script and analyze the runtime, energy usage, and performance of TensorFlow training on an Apple M1 Mac Mini, comparing it with that of the Nvidia V100.




TLDR

We ran a sweep of 8 different configurations of our training script and show that, for smaller architectures and datasets, the Apple M1 offers impressive performance within reach of much more expensive and less energy-efficient accelerators such as the Nvidia V100.






Methodology

We trained a computer vision model using the MobileNetV2 architecture on CIFAR-10. We trained one model in this Colab notebook on an Nvidia V100 and an identical model using the tensorflow_macos fork on a 16GB M1 Mac Mini. We varied the following hyperparameters using W&B Sweeps:
batch_size:
- 32
- 64
img_dim:
- 96
- 128
trainable:
- true
- false
When trainable is false, we only train the final layer in the network. When trainable is true, we update all weights in MobileNetV2.
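To make the setup concrete, here is a minimal sketch of this kind of transfer-learning script, assuming TensorFlow/Keras and the W&B Python client. The project name, epoch count, and helper names are illustrative, not the exact values used in the sweep.

import tensorflow as tf
import wandb
from wandb.keras import WandbCallback

def train():
    wandb.init(project="m1-benchmark")   # hypothetical project name
    cfg = wandb.config                   # batch_size, img_dim, trainable come from the sweep

    # CIFAR-10 ships as 32x32 images; resize on the fly to the swept img_dim.
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()

    def make_ds(x, y):
        ds = tf.data.Dataset.from_tensor_slices((x, y))
        ds = ds.map(lambda img, label: (
            tf.keras.applications.mobilenet_v2.preprocess_input(
                tf.image.resize(tf.cast(img, tf.float32), (cfg.img_dim, cfg.img_dim))),
            label))
        return ds.batch(cfg.batch_size).prefetch(tf.data.AUTOTUNE)

    base = tf.keras.applications.MobileNetV2(
        input_shape=(cfg.img_dim, cfg.img_dim, 3),
        include_top=False,
        weights="imagenet")
    base.trainable = cfg.trainable       # false => only the new head below gets trained

    model = tf.keras.Sequential([
        base,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10, activation="softmax")])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(make_ds(x_train, y_train),
              validation_data=make_ds(x_test, y_test),
              epochs=10,                 # illustrative epoch count
              callbacks=[WandbCallback()])  # logs metrics to the W&B run

if __name__ == "__main__":
    train()

A sweep over the configuration above could then be launched with the W&B CLI (wandb sweep followed by wandb agent) or programmatically via wandb.sweep and wandb.agent, with each agent invoking a train function like this one.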
The M1 shows better relative performance when there are fewer weights to train, likely due to the superior memory architecture of the M1.




[Charts: Trainable (8 runs) vs. Non-Trainable (8 runs)]



Energy

The Apple hardware used was an M1 Mac Mini with 16GB of RAM. During training, the fan was never audible and the case was cool to the touch. It's remarkable how much less energy the M1 used to achieve the same amount of computation as the V100. The V100 is built on a 12nm process while the M1 uses a 5nm process, yet the V100 consistently drew close to 6 times as much energy.




Caveats

Setting up the Mac Mini to run the new accelerated TensorFlow package was less than trivial. I found the simplest way to install the various packages that require compilation was from the arm64 branch of Miniconda. The tensorflow library is supposed to choose the best path for acceleration by default; however, I was seeing consistent segmentation faults unless I explicitly told the library to use the GPU with the following code:
# Force the ML Compute backend to use the M1's GPU instead of auto-selecting a device
from tensorflow.python.compiler.mlcompute import mlcompute
mlcompute.set_mlc_device(device_name="gpu")
I chose MobileNetV2 to make iteration faster. When I tried ResNet50 or other larger models, the gap between the M1 and the Nvidia V100 grew wider. I also experienced segmentation faults when my inputs exceeded 196x196 dimensions on the M1.
In general, it seems these entry-level Macs are only suitable for smaller architectures for now.
I also observed that the trials on the M1 which only trained the final layer of the network failed to converge; this was not the case on the Nvidia V100. Upon further experimentation, I was able to get the M1 runs to converge by reducing the learning rate. It's unclear why the learning rate was more finicky on the M1 hardware.
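For reference, that workaround amounts to a one-line change at compile time. The value below is only an example of a reduced rate, not the exact setting that made the frozen-backbone runs converge, and it builds on the sketch in the Methodology section above.

# Builds on the earlier sketch; only the optimizer changes.
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)  # example: lower than Adam's 1e-3 default
model.compile(optimizer=optimizer,
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])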


Apple M1




[Charts: 8 Apple M1 training runs]



Nvidia V100


[Charts: 8 Nvidia V100 training runs]



Conclusion

It's still early days, but these preliminary results look very promising. When Apple releases Pro hardware with more cores and RAM, training machine learning models on Apple hardware could become commonplace. At W&B, all of our staff use Mac hardware for development, and I know much of our team can't wait to get their hands on the next generation of hardware. We'll be keeping a close eye on the tensorflow_macos fork and its eventual incorporation into the main TensorFlow repository. I'm also curious to see if the PyTorch team decides to integrate with Apple's ML Compute libraries; there's currently an ongoing discussion on GitHub.
Our most recent release of the W&B library (0.10.13) automatically captures GPU metrics from Apple M1 hardware like those you see in this report.
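No extra instrumentation is needed to get those hardware metrics; as long as a run is active during training, they are collected in the background. A generic example (the project name is illustrative):

# With wandb >= 0.10.13 installed, system metrics (including Apple M1 GPU stats)
# are captured automatically for the duration of the run.
import wandb

wandb.init(project="m1-benchmark")   # hypothetical project name
# ... training code runs here; no explicit GPU-metric logging calls are needed ...
wandb.finish()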
We're excited for other early adopters to give this a try! Please leave comments or ask questions about these results here in this report.


Joe •
Hi, do you plan to evaluate the latest M1 chip (2021)?
Ernesto Mininno •
Heeeeeello! I made a comparison myself using a Xeon E5-2698 (48 threads, 32GB) and an Nvidia K40 against a then brand-new MacBook Air M1. The M1 was a bit faster until the training data became larger than some threshold (most likely a function of the Mac's shared RAM vs the K40's VRAM). Of course, it is now time to revamp these benchmarks with the M1 Pro and Max: I am definitely curious to compare them with more expensive gear from Nvidia, and moreover to find a good reason to get an M1 Max. If it is a boost for ML then OK; otherwise the M1 Pro shall be enough. Anybody here with insights, comments, anything? Ciao
Kevlyn Kadamala •
Hello! I tried running this training script, however on my Mac Mini M1 it would take around 2 minutes 45 seconds per epoch. The total run time came to about 35m 29s, while for you it came to about 13m 44s for 128 image dims and batch size 64 with trainable set to false. Do you have any idea why it would take me twice as long? I've logged it on W&B too; you can find the link here - https://wandb.ai/kad99kev/m1-benchmark Any help would be greatly appreciated. Thank you!
Vishnu Krishnaprasad •  
There's a reddit thread on this, which basically trashes this benchmark as - "This is one of the most useless and disingenuous benchmark comparisons I have ever seen, to the point where I am not sure whether this is out of malignancy or sheer incompetence." https://www.reddit.com/r/MachineLearning/comments/kwqev4/d_comparing_performance_on_apples_m1_with_nvidias/
Antonin Sumner •  
Wow, the idea of being able to work locally on my Apple laptop in the near future is great news. Hope access to various cores gets clearer too.
Scott Atchley •  
Launching kernels to an accelerator such as the V100 has a certain amount of latency. If the kernel runtime was short, the latency was not amortized and was reflected in lower V100 performance. If the kernels are long running, then the latency was amortized and these results are more impressive.
Nick Gold •  
I want to make sure I understand this. These tests were taxing CPU cores and the memory architecture, correct? Were the M1's GPU cores utilized at all for these training runs? What about the so-called Neural Engine? My sense is that none of these SoC subsystems were leveraged, based on your descriptions? Thanks for looking into this, I've been extremely curious about the M1 for these types of applications...