Tensorflow-metal runs extremely slow

I am comparing my M1 MBA with my 2019 16" Intel MBP. The M1 MBA has tensorflow-metal, while the Intel MBP has TF directly from Google.

Generally, the same programs run 2-5 times FASTER on the Intel MBP, which presumably has no GPU acceleration.

Is there anything I could have done wrong on the M1?

Here is the start of the metal run:

Metal device set to: Apple M1

systemMemory: 16.00 GB
maxCacheSize: 5.33 GB

2022-01-19 04:43:50.975025: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-01-19 04:43:50.975291: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: )
2022-01-19 04:43:51.216306: W tensorflow/core/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz
Epoch 1/10
2022-01-19 04:43:51.298428: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.

Replies

Hi @ahostmadsen

Thanks for reporting the issue. Do you have a sample script we could use to study the performance issue? Another possibility is that the model size or batch sizes used when running the scripts are too small to take full advantage of the GPU and amortize the cost of dispatching the data to it.
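For example, a minimal, hypothetical sketch of that point (the random data, layer sizes, and batch_size=1024 below are illustrative assumptions, not values from this thread): a larger batch size gives the GPU more work per dispatch.

import numpy as np
import tensorflow as tf

# Made-up data and model, only to illustrate the effect of batch size on GPU utilisation.
x = np.random.rand(60000, 256).astype("float32")
y = np.random.randint(0, 10, size=(60000,))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation="relu", input_shape=(256,)),
    tf.keras.layers.Dense(10),
])
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

# batch_size=32 (the Keras default) issues many small GPU dispatches;
# batch_size=1024 amortises the dispatch overhead over much more work per step.
model.fit(x, y, batch_size=1024, epochs=3)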

This is a simple program I just downloaded to test. Each epoch takes about 6s on the M1 MBA, but 1s on the Intel MBP. But all my programs run slowly. Yes, the examples I have been running are fairly small.

import tensorflow as tf

mnist = tf.keras.datasets.mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10)
])

predictions = model(x_train[:1]).numpy()
tf.nn.softmax(predictions).numpy()

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

loss_fn(y_train[:1], predictions).numpy()

model.compile(optimizer='sgd', loss=loss_fn)
model.fit(x_train, y_train, epochs=10)

  • Confirmed. If I add "with tf.device('/cpu:0'):" to the program I listed, it runs much faster (around 10 times faster than on the GPU), and faster than on my Intel MBP.
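For reference, a minimal sketch of that change, wrapping the same model in a CPU device scope (only the placement differs from the script above):

import tensorflow as tf

mnist = tf.keras.datasets.mnist
(x_train, y_train), _ = mnist.load_data()
x_train = x_train / 255.0

# Force model construction and training onto the CPU instead of the Metal GPU.
with tf.device('/cpu:0'):
    model = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10),
    ])
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    model.compile(optimizer='sgd', loss=loss_fn)
    model.fit(x_train, y_train, epochs=10)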


Now I tried a tutorial example from Google:

https://www.tensorflow.org/tutorials/quickstart/advanced

That one runs about twice as fast on my M1 MBA as on my Intel MBP. Perhaps the example I put in the previous post is not well-suited for the GPU? One would then hope that the Metal framework could choose to run it on the CPU (my experience is that the M1 is about twice as fast as the Intel at running scientific computations on the CPU).

Anyway, I think I will upgrade my 16" Intel MBP to a 16" M1 MBP, hoping that the TF metal framework continues to be developed.

  • OK, this makes it sound like the issue is indeed that the example is too small/simple to see an advantage from running on the GPU. The Metal plugin is bound to respect the device placement TensorFlow expects, so unfortunately we cannot automatically make the decision based on where the task would run faster. However, most realistic use cases should gain from GPU acceleration, so the default placement tends to be correct. But, as you noted, you can always test whether this is the case using with tf.device('/cpu:0'):
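For anyone who wants to check this on their own machine, a minimal comparison sketch (timing the whole fit with time.perf_counter is just one rough way to measure it, and '/gpu:0' assumes the Metal device is visible):

import time
import tensorflow as tf

mnist = tf.keras.datasets.mnist
(x_train, y_train), _ = mnist.load_data()
x_train = x_train / 255.0

def timed_fit(device):
    # Build the same small MNIST model as above and time a short training run on `device`.
    with tf.device(device):
        model = tf.keras.models.Sequential([
            tf.keras.layers.Flatten(input_shape=(28, 28)),
            tf.keras.layers.Dense(128, activation='relu'),
            tf.keras.layers.Dropout(0.2),
            tf.keras.layers.Dense(10),
        ])
        model.compile(optimizer='sgd',
                      loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
        start = time.perf_counter()
        model.fit(x_train, y_train, epochs=3, verbose=0)
        return time.perf_counter() - start

print("CPU seconds:", timed_fit('/cpu:0'))
print("GPU seconds:", timed_fit('/gpu:0'))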


Hi. I am having the same problem. Even with an encoder-decoder architecture, the M1 runs 5 times slower than the Intel machine, and it couldn't find GPU support:

print("Num GPUs available:", len(tf.config.experimental.list_physical_devices('GPU')))

and it outputs "Num GPUs available: 0".

I bought this Mac because of its speed, and now it is even slower. How can I fix this?

  • I had the same problem. My solution was, as mentioned in the documentation, to use the tensorflow-macos package instead of the normal tensorflow package (the install steps are sketched below, after these replies).

    The two together, tensorflow-macos and tensorflow-metal, show "Num GPUs available: 1".

    But my network is still slower on the GPU than on the CPU. So I have two virtual envs, one set up for the CPU and one for the GPU. That way I can easily try both and use whichever is faster.

  • Did you install tensorflow-macos and tensorflow-metal into the same environment, or two separate environments?
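For what it's worth, the setup page linked later in this thread (https://developer.apple.com/metal/tensorflow-plugin/) installs both packages into one and the same environment. A sketch of those steps (the environment name tf-metal-env is just an example):

python3 -m venv ~/tf-metal-env            # one environment for both packages
source ~/tf-metal-env/bin/activate
python -m pip install tensorflow-macos    # the macOS build of TensorFlow
python -m pip install tensorflow-metal    # the Metal GPU plugin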


I have the same problem as mentioned above.

Hello. Based on my observations, tensorflow-metal slows down the processing instead of speeding it up on an M1 Pro MacBook. Easy to explain, as the GPUs are not optimised for neural operations and the "normal" processor has two cores optimised for ML.

Log from a simple program in Python, WITH the tensorflow-metal plugin:

Epoch 16/20
1875/1875 [==============================] - 8s 4ms/step - loss: 0.0844 - accuracy: 0.9747
Epoch 17/20
1875/1875 [==============================] - 8s 4ms/step - loss: 0.0818 - accuracy: 0.9756
Epoch 18/20
1875/1875 [==============================] - 8s 4ms/step - loss: 0.0794 - accuracy: 0.9759

(tinyML-env) remco@Remcos-MBP tinyML % pip uninstall tensorflow-metal
Found existing installation: tensorflow-metal 0.4.0

So without:

Epoch 16/20
1875/1875 [==============================] - 1s 731us/step - loss: 0.0880 - accuracy: 0.9727
Epoch 17/20
1875/1875 [==============================] - 1s 736us/step - loss: 0.0845 - accuracy: 0.9742
Epoch 18/20
1875/1875 [==============================] - 1s 733us/step - loss: 0.0821 - accuracy: 0.9747
Epoch 19/20
1875/1875 [==============================] - 1s 727us/step - loss: 0.0807 - accuracy: 0.9750

Maybe you can try this too.

I have the same problem with my LSTM model on an Apple M2. I followed the instructions at https://developer.apple.com/metal/tensorflow-plugin/ to set up my environment, and the model runs extremely slowly. How can I fix it?

Also, I got this message while running the model: Failed to get CPU frequency: 0 Hz
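One thing worth trying, based on the CPU-placement workaround discussed earlier in this thread: pin the model to the CPU and compare the epoch times. A minimal sketch with a made-up LSTM and random data (the model, shapes, and data below are assumptions, not the poster's actual code):

import numpy as np
import tensorflow as tf

# Made-up sequence data: (samples, timesteps, features).
x = np.random.rand(1000, 50, 8).astype("float32")
y = np.random.randint(0, 2, size=(1000, 1))

# Run the LSTM on the CPU to compare against the default Metal GPU placement.
with tf.device('/cpu:0'):
    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(64, input_shape=(50, 8)),
        tf.keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy')
    model.fit(x, y, epochs=3, batch_size=64)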

It seems like there is no consensus on how to resolve this. I upgraded my Mac to Sonoma (the latest OS to date), and it seems my TensorFlow needed to be updated along with all dependent libraries; at that point it runs EXTREMELY slowly. I have been searching all over for a solution but haven't been able to find one. Any help or direction from you would be greatly appreciated.

I want to raise the same issue. Two Apple Silicon computers, a Mac Studio and an MBP; the only difference is the new OS. They were performing almost identically before the upgrade to Sonoma. Now the newer OS gives 5-6 times slower performance on the same Python code, and it looks like lack of GPU use is making most of the difference.

Apple... seriously... why?!?