Can Apple's M1 help you train models faster & cheaper than NVIDIA's V100?

Analyzing the runtime, energy usage, and performance of TensorFlow training on an M1 Mac Mini and an Nvidia V100. Made by Chris Van Pelt using Weights & Biases

TLDR

We ran a sweep of 8 different configurations of our training script and show that, for smaller architectures and datasets, the Apple M1 offers impressive performance within reach of much more expensive and less energy-efficient accelerators such as the Nvidia V100.


Methodology

We trained a computer vision model using the MobileNetV2 architecture on CIFAR-10. We trained one model in this Colab on an Nvidia V100 and an identical model using the tensorflow_macos fork on a 16GB M1 Mac Mini. We varied the following hyperparameters using W&B Sweeps:

  batch_size:
    - 32
    - 64
  img_dim:
    - 96
    - 128
  trainable:
    - true
    - false

When trainable is false, we train only the final layer of the network. When trainable is true, we update all of the weights in MobileNetV2.
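To make the setup concrete, here is a minimal sketch of how such a run might be wired up with wandb and tf.keras. The preprocessing, optimizer, epoch count, and default config values below are assumptions for illustration, not taken from the original training script.

import tensorflow as tf
import wandb
from wandb.keras import WandbCallback

# Pull hyperparameters from the sweep; the defaults here are placeholders.
wandb.init(config={"batch_size": 32, "img_dim": 96, "trainable": False})
cfg = wandb.config

# CIFAR-10 images are 32x32, so resize them to the swept input dimension.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()

def make_ds(x, y):
    ds = tf.data.Dataset.from_tensor_slices((x, y))
    ds = ds.map(lambda img, label: (tf.image.resize(img, (cfg.img_dim, cfg.img_dim)) / 255.0, label))
    return ds.batch(cfg.batch_size).prefetch(tf.data.AUTOTUNE)

train_ds, test_ds = make_ds(x_train, y_train), make_ds(x_test, y_test)

# MobileNetV2 backbone; frozen unless `trainable` is true.
base = tf.keras.applications.MobileNetV2(
    input_shape=(cfg.img_dim, cfg.img_dim, 3), include_top=False, weights="imagenet"
)
base.trainable = cfg.trainable

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),  # final layer, always trained
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

model.fit(
    train_ds,
    epochs=10,  # epoch count is an assumption
    validation_data=test_ds,
    callbacks=[WandbCallback()],  # streams metrics to W&B
)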

We see larger performance gains on the M1 when there are fewer weights to train, likely due to the M1's superior memory architecture.


Energy

The Apple hardware used was an M1 Mac Mini with 16GB of RAM. During training the fan was never audible and the case was cool to the touch. It's remarkable how much less energy the M1 used to complete the same amount of computation as the V100. The V100 is built on a 12nm process while the M1 uses a 5nm process, yet the V100 consistently drew close to six times as much energy.


Caveats

Setting up the Mac Mini to run the new accelerated TensorFlow package was far from trivial. I found the simplest way to install the various packages that require compilation was from the arm64 branch of Miniconda. The tensorflow library is supposed to choose the best path for acceleration by default; however, I saw consistent segmentation faults unless I explicitly told the library to use the GPU with the following code:

# Explicitly select the ML Compute GPU backend; automatic device
# selection was causing segmentation faults on the M1.
from tensorflow.python.compiler.mlcompute import mlcompute
mlcompute.set_mlc_device(device_name="gpu")

I chose MobileNetV2 to make iteration faster. When I tried ResNet50 or other larger models, the gap between the M1 and Nvidia grew wider. I also experienced segmentation faults on the M1 when my inputs exceeded 196x196 pixels.

In general, it seems these entry-level Macs are only suitable for smaller architectures for now.

I also observed that the trials on the M1 which trained only the final layer of the network failed to converge. This was not the case on the Nvidia V100. With further experimentation I was able to get the M1 runs to converge by reducing the learning rate. It's unclear why the learning rate was more finicky on the M1 hardware.
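For reference, the fix was simply recompiling with a smaller learning rate for those frozen-backbone runs. The exact value below is an assumption for illustration, and `model` refers to the same Keras model sketched in the Methodology section.

import tensorflow as tf

# The Keras default Adam learning rate (1e-3) failed to converge on the M1
# for the frozen-backbone runs; an order of magnitude lower worked. The value
# here is illustrative, not the precise one from the original sweep.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)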

[Charts: Apple M1 and Nvidia V100 hardware metrics during training]

Conclusion

It's still early days, but these preliminary results look very promising. When Apple releases Pro hardware with more cores and RAM, training machine learning models on Apple hardware could become commonplace. At W&B all of our staff use Mac hardware for development, and I know much of our team can't wait to get their hands on the next generation of hardware. We'll be keeping a close eye on the tensorflow_macos fork and its eventual incorporation into the main TensorFlow repository. I'm also curious to see whether the PyTorch team decides to integrate with Apple's ML Compute libraries; there's currently an ongoing discussion on GitHub.

Our most recent release of the wandb library (0.10.13) automatically captures GPU metrics from Apple M1 hardware like those you see in this report.
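No extra instrumentation is needed for that: once you're on wandb 0.10.13 or newer, system metrics (including the M1's GPU) are sampled in the background for the lifetime of a run. A minimal sketch, with a placeholder project name:

import wandb

# Requires: pip install --upgrade "wandb>=0.10.13"
run = wandb.init(project="m1-benchmark")  # project name is a placeholder

# ... training code runs here; GPU, CPU, and memory utilization are
# captured automatically in the background while the run is active ...

run.finish()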

We're excited for other early adopters to give this a try! Please leave comments or ask questions about these results here in this report.