Tensorflow-Metal slower than "CPU"-version of M1 Tensorflow

Hi everybody,

I was excited to try out TensorFlow for M1, as advertised. I got the latest version running by following the instructions at

https://developer.apple.com/metal/tensorflow-plugin/

Then, I did a little performance check by running the code from

https://www.tensorflow.org/tutorials/quickstart/beginner

To my surprise, with tensorflow-metal installed, an epoch takes 7-8 seconds to complete on average. Without tensorflow-metal installed, it takes just 1 second.

Here is the output I'm getting:

Metal device set to: Apple M1

systemMemory: 16.00 GB
maxCacheSize: 5.33 GB

2021-11-12 19:55:13.379972: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2021-11-12 19:55:13.380077: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: )
[[0.05593071 0.11313408 0.0657519 0.04874061 0.11589434 0.1239315 0.09336308 0.1969078 0.13757557 0.04877035]]
2021-11-12 19:55:13.632319: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
2021-11-12 19:55:13.632493: W tensorflow/core/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz
Epoch 1/5
2021-11-12 19:55:13.721079: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
1875/1875 [==============================] - 7s 4ms/step - loss: 0.2888 - accuracy: 0.9160
Epoch 2/5
1875/1875 [==============================] - 7s 4ms/step - loss: 0.1419 - accuracy: 0.9577
Epoch 3/5
1875/1875 [==============================] - 7s 4ms/step - loss: 0.1057 - accuracy: 0.9679
Epoch 4/5
1875/1875 [==============================] - 7s 4ms/step - loss: 0.0877 - accuracy: 0.9733
Epoch 5/5
1875/1875 [==============================] - 7s 4ms/step - loss: 0.0729 - accuracy: 0.9771
2021-11-12 19:55:47.920935: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.

Best from Berlin, Michael

Hi @xor2k,

The model in the script you tested and the default batch size are likely so small that they cannot amortise the cost of running on the GPU. Try increasing the batch size or the model size and test again. At very small sizes, it is expected that the CPU may actually be faster.
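
To make the amortisation argument concrete, here is a rough back-of-the-envelope sketch in plain Python (no TensorFlow needed). It uses the numbers from the log above: 1875 steps per epoch at about 4 ms per step, which corresponds to the 60,000 MNIST training samples at the Keras default batch size of 32. The helper names are made up for illustration, and the key assumption, that the per-step time stays at 4 ms when the batch grows, is a deliberate simplification: it only holds while fixed dispatch overhead dominates, so treat the results as a best case rather than a prediction.

```python
import math

# Training-set size used by the quickstart (MNIST): 1875 steps * 32 samples.
MNIST_TRAIN_SAMPLES = 60_000


def steps_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of optimizer steps needed to see every sample once."""
    return math.ceil(num_samples / batch_size)


def epoch_seconds(num_samples: int, batch_size: int, ms_per_step: float) -> float:
    """Epoch time if every step costs a fixed ms_per_step.

    ASSUMPTION: the per-step cost does not depend on batch size. That is
    only plausible while fixed GPU launch overhead dominates the step.
    """
    return steps_per_epoch(num_samples, batch_size) * ms_per_step / 1000.0


if __name__ == "__main__":
    # 4.0 ms/step is the value reported in the log above for batch size 32.
    for batch_size in (32, 256, 1024):
        steps = steps_per_epoch(MNIST_TRAIN_SAMPLES, batch_size)
        secs = epoch_seconds(MNIST_TRAIN_SAMPLES, batch_size, ms_per_step=4.0)
        print(f"batch_size={batch_size:5d}  steps={steps:5d}  ~{secs:.2f} s/epoch")
```

For batch size 32 this reproduces the roughly 7 seconds per epoch seen in the log. In the quickstart script itself, the batch size can be changed by passing `batch_size=...` to `model.fit`; the real per-step time will rise somewhat with larger batches, so measure rather than rely on the estimate.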
