🤔 GitHub tensorflow macOS alpha had better performance on M1?

Hello,

I noticed a substantial decrease in performance compared to previous releases of tensorflow for M1 Macs.

I previously installed the alpha release of tensorflow for M1 from GitHub, found here: https://github.com/apple/tensorflow_macos and was very impressed by the performance.

I used the following script to benchmark my M1 Mac and other systems: https://gist.github.com/tampapath/662aca8cd0ef6790ade1bf3c23fe611a#file-fashin_mnist-py Running the alpha release from GitHub, my M1 Mac handsomely outperformed both Google Colab's random GPU offerings and a Windows machine with an RTX 2070.
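For context, the linked gist is roughly this shape: a small dense network on Fashion-MNIST, trained and evaluated with the total wall-clock time printed. This is only an illustrative sketch; the layer sizes and epoch count are my guesses, not the gist's exact values:

```python
import time
import tensorflow as tf

# Illustrative sketch of a Fashion-MNIST timing benchmark.
# Layer sizes and epoch count are guesses, not the linked gist's values.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),
])
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=["accuracy"])

start = time.perf_counter()
model.fit(x_train, y_train, epochs=2, verbose=0)
model.evaluate(x_test, y_test, verbose=0)
duration = time.perf_counter() - start
print(f"train + test: {duration:.2f}s")
```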


Recently, I went back to the GitHub repository, looking for new updates on tensorflow support for the M1 and was redirected here to the tensorflow-metal PluggableDevices installation guide: https://developer.apple.com/metal/tensorflow-plugin/

After installing the conda environment and running the same benchmark script, I realized my M1 system was running much slower.

Additionally, the following error messages printed to the console while running the benchmark:

2021-08-12 21:48:16.306946: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.

2021-08-12 21:48:16.307209: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)

2021-08-12 21:48:16.437942: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)

2021-08-12 21:48:16.441196: W tensorflow/core/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz


Has anyone else noticed this loss in performance?


The results I got are as follows (benchmark: script duration):

  • tf GitHub alpha: 🟢 9.62s
  • new tf-metal: 🔴 76.52s
  • Google Colab: 🔴 57.53s
  • RTX 2070 PC: 🔴 23.18s

Both the tf GitHub alpha and the new tf-metal were run on the same 13" M1 MacBook Pro.


I wrote an installation guide for the GitHub alpha release if anyone wants to compare results, or run a faster version of tensorflow compatible with their M1 Mac: https://github.com/apple/tensorflow_macos/issues/215

  • I had a bit of a look into how this was performing on my system (13" M1 MacBook Air).

    Using the tensorflow-metal PluggableDevice, I had a total training and testing time of 62.52s. However, when training on the CPU only, the training and testing time was 9.41s.

    I never managed to successfully install the original Apple TF alpha, so I can't test it directly, but I'm guessing it trained and tested this model on the CPU.

    I have done a bunch of other testing (as have others) showing that for small models and small image dimensions, the CPU is faster than the GPU. Once the model, batch size and image size become a bit larger, the GPU becomes faster. For example, using EfficientNetB0 on CIFAR-100, image size 32x32 is consistently faster on the CPU, 64x64 is pretty even, and 128x128 is generally faster on the GPU.

    Compared to Google Colab, a similar pattern emerges. For small models, batch sizes and image sizes, the M1 compares well, but as the model and the data become larger, the Colab GPU powers ahead.

    This has captured my interest because of the rumours of the M1X with double the CPU high performance cores and quadruple the GPU cores. If that turns out to be true then the Apple machines could become genuinely capable AI development systems (at a very competitive price). Fingers crossed :).


Replies

I also see this issue.

I benchmarked my Mac Pro’s Radeon Pro 580X against this simple CNN model: https://github.com/macports/macports-ports/pull/12678

  • Apple GitHub tensorflow_macos alpha3: 5 s/epoch
  • PyPI tensorflow-macos v2.6 + tensorflow-metal v0.2: 15 s/epoch (3X slower)
  • Mac Pro 12-core Xeon W CPU: 10 s/epoch
  • Tesla V100: 1 s/epoch

I conclude that there are still some significant alpha-release issues with tensorflow-macos/tensorflow-metal.

  • I think you will find that the alpha3 vs. tensorflow-metal difference comes down to CPU versus GPU. For small models the CPU is far faster, and the tensorflow_macos alpha3 seemed to use the CPU for these. If you run the same model with the latest tensorflow-macos, it is still faster without the GPU. However, once the models become large (both image size and batch size matter here), the GPU can become much faster. The new M1 Max chip does really well on anything above small images and tiny batch sizes.

  • No, I explicitly observe the GPU/CPU loads with Performance Monitor, and explicitly set tf.device.

    In contrast, the Tesla V100 outperforms the CPU on the same code by 10 X on a decent Linux GPU cluster.

    This is definitely an issue with tensorflow-metal, at least on macOS 11.6.

  • Try increasing the batch_size parameter in the above code: 128 is typically too low for decent GPU performance (though it's platform dependent). On a previous-gen MacBook Pro with an AMD Radeon Pro 5500M and batch_size 4096, I get 2 s/epoch on the GPU versus 8 s/epoch on the CPU.
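To make the batch-size effect concrete, here is a rough sketch of that kind of timing comparison. The data and model are synthetic stand-ins, not the CNN from the linked pull request:

```python
import time
import tensorflow as tf

# Synthetic stand-ins; shapes and layer sizes are illustrative only.
x = tf.random.normal((4096, 32, 32, 3))
y = tf.random.uniform((4096,), maxval=10, dtype=tf.int32)

def epoch_time(batch_size):
    """Time one training epoch of a small CNN at the given batch size."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(32, 32, 3)),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(10),
    ])
    model.compile(optimizer="adam",
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
    start = time.perf_counter()
    model.fit(x, y, batch_size=batch_size, epochs=1, verbose=0)
    return time.perf_counter() - start

times = {bs: epoch_time(bs) for bs in (128, 1024, 4096)}
for bs, t in times.items():
    print(f"batch_size={bs}: {t:.2f} s/epoch")
```

On a GPU the larger batch sizes should amortize the per-step dispatch overhead; on a CPU the effect is usually much smaller.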

How do I fix these warnings? I am on the macOS 12 public release, but they still persist. @essandess @brendank_ntb

2021-08-12 21:48:16.306946: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.

2021-08-12 21:48:16.307209: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)

2021-08-12 21:48:16.437942: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)

Hi @parthsharma, I get those warnings too (at the start of training) but I also get a message:

I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.

Training then continues using the GPU. I expect that the warning messages could be suppressed but I have not bothered to do so.
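For anyone who does want to suppress them: the C++-side log lines respect the TF_CPP_MIN_LOG_LEVEL environment variable, which must be set before tensorflow is imported. A minimal sketch:

```python
import os

# TF_CPP_MIN_LOG_LEVEL filters TensorFlow's C++-side log lines.
# It must be set before tensorflow is imported:
# 0 = everything, 1 = hide INFO, 2 = hide INFO and WARNING, 3 = errors only
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"

import tensorflow as tf  # the NUMA / CPU-frequency lines should no longer print
print(tf.__version__)
```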


@brendank_ntb yes, I get the same message too, like this:

`2021-10-27 20:22:30.943266: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2021-10-27 20:22:30.946426: W tensorflow/core/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz
2021-10-27 20:22:38.821706: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.`

Thanks for your response. I thought this wasn't supposed to happen, but it seems to happen for everyone.

Wrapping the code with tf.device('/cpu:0'), I get 9.63 seconds on the M1, versus 60s with the GPU.

  • I noticed that as well when I was coding the clothing-classification tutorial on the TensorFlow website and comparing the speed with my Linux machine. On the Linux machine it was running on the CPU, and it's 7x faster than the M1 GPU. Thanks for the tip on how to use the CPU instead of the GPU on the M1; that way it's just as fast as the Linux machine. I was actually expecting more from the M1 Pro, but maybe some models just run better on the CPU.
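For anyone who wants to try the CPU-vs-GPU comparison themselves, here is a minimal sketch of the tf.device wrapping. The data is synthetic so the sketch is self-contained, and the GPU call only works with tensorflow-metal installed:

```python
import time
import tensorflow as tf

# Synthetic Fashion-MNIST-shaped data so the sketch is self-contained.
x = tf.random.normal((2048, 28, 28))
y = tf.random.uniform((2048,), maxval=10, dtype=tf.int32)

def run_on(device):
    """Build and train a small model with ops pinned to `device`."""
    with tf.device(device):
        model = tf.keras.Sequential([
            tf.keras.layers.Input(shape=(28, 28)),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dense(10),
        ])
        model.compile(optimizer="adam",
                      loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
        start = time.perf_counter()
        model.fit(x, y, epochs=1, verbose=0)
        return time.perf_counter() - start

cpu_time = run_on("/cpu:0")
print(f"CPU: {cpu_time:.2f}s")
# With tensorflow-metal installed, compare against the GPU:
# print(f"GPU: {run_on('/gpu:0'):.2f}s")
```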


ME TOO!!!!!!