TensorFlow is slow after upgrading to Sonoma

Hello - I have been struggling to find a solution online and I hope you can help me quickly. I have installed the latest tensorflow and tensorflow-metal, and I even installed the tensorflow nightly build. My app produces the following output when I call fit() on a CNN model with 8 layers.

2023-09-29 22:21:06.115768: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M1 Pro
2023-09-29 22:21:06.115846: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 16.00 GB
2023-09-29 22:21:06.116048: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 5.33 GB
2023-09-29 22:21:06.116264: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-09-29 22:21:06.116483: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)

Most importantly, the learning process is very slow, and I'd like to take advantage of all the new features of the latest versions. What can I do?

Replies

If you followed https://developer.apple.com/metal/tensorflow-plugin/ to install TF, then you should already be taking advantage of the GPU. The fact that your TF can see your GPU (the Metal device) also confirms that. How slow is your training process? For reference, training the CIFAR-100 example from that link takes about 90 s per epoch on my M1 Pro GPU. In my benchmarks, the M1 Pro GPU is 5-7 times faster than the M1 Pro CPU and is roughly at the level of an RTX 3050 Mobile for deep learning. It is not going to be as fast as a good desktop GPU.
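
In case it helps others reading this, here is a quick way to confirm that the Metal GPU is registered (a minimal check, assuming tensorflow-macos and tensorflow-metal are installed in the active environment):

import tensorflow as tf

# The Metal plugin should expose exactly one GPU device on Apple silicon.
print(tf.config.list_physical_devices('GPU'))
# Expected output:
# [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

If this list is empty, the plugin is not loaded and training silently falls back to the CPU.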

  • What is the M2 Pro equivalent, by the way? Are there any test results we can compare so we can conclude that the M2 Pro's NVIDIA equivalent is "..."? Thank you.

Is it slower as compared to before you upgraded to Sonoma?

My Core ML model training/building is 7x slower after upgrading from Ventura to Sonoma. Perhaps the issues are related?

Same for me. After updating to Sonoma from Ventura my training runs 5 times slower than before. I also noticed that the GPU is no longer used, despite the message “2023-10-03 19:35:08.186175: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:117] Plugin optimizer for device_type GPU is enabled.”

I have installed and reinstalled tensorflow-metal, and if I run this code:

import tensorflow as tf

# CIFAR-100 + ResNet50 trained from scratch (the example from the
# tensorflow-metal page linked above).
cifar = tf.keras.datasets.cifar100
(x_train, y_train), (x_test, y_test) = cifar.load_data()
model = tf.keras.applications.ResNet50(
    include_top=True,
    weights=None,
    input_shape=(32, 32, 3),
    classes=100,
)
model.summary()
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
model.compile(optimizer="adam", loss=loss_fn, metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=64)

I see from the activity monitor that the GPU is fully utilized.

However, another training run that took about 11 seconds per batch on Ventura now takes more than 56 seconds per batch, and the GPU is only utilized at around 20% (before, it was up to 98%). I have also already set the optimizer to tf.keras.optimizers.legacy.Adam, as suggested by the warning. Still no success in recreating the performance I saw on macOS Ventura.
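
For anyone else hitting that optimizer warning, this is roughly what the switch looks like (a minimal sketch; the model and learning rate here are placeholders, not the ones from my actual training run):

import tensorflow as tf

# The warning on Apple silicon suggests the legacy Keras optimizers, since the
# newer tf.keras.optimizers.Adam runs slowly with tensorflow-metal.
model = tf.keras.Sequential([tf.keras.layers.Dense(10)])  # placeholder model
model.compile(
    optimizer=tf.keras.optimizers.legacy.Adam(learning_rate=1e-3),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)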

Same problem here. I also noticed the drop in GPU usage to 50%. It's hard to keep going like this.

I am not using TensorFlow, just training a model from .csv files and DataFrames with something simple like:

// Boosted-tree regressor trained with Create ML from a DataFrame.
let params = MLBoostedTreeRegressor.ModelParameters(validation: .split(strategy: .automatic), maxIterations: 5000)
let model = try MLBoostedTreeRegressor(trainingData: trainingdata, targetColumn: columntopredict, parameters: params)

It is almost 7x slower on Sonoma, roughly the same difference in speed that you are noticing.

I have tried looking for flags to set, I have changed all deprecated code, with nothing making any difference.

My M1 laptop used to run hot whilst running my code under Ventura, and now it is at a pleasant ambient temperature and not really trying under Sonoma.

Something has definitely changed in the update to Sonoma, and it has made my application stupidly slow.

The only advice I've had so far is to try the developer beta, but I'm just not willing to go that route yet.

Same for me. I used the code below with the following library versions:

tensorflow-macos 2.14.0 - tensorflow-metal 1.1.0 - python 3.10.12

import tensorflow as tf
import tensorflow_datasets as tfds

# Load IMDB reviews: 90% of the train split for training, 10% for validation.
raw_train_set, raw_valid_set, raw_test_set = tfds.load(
    name="imdb_reviews",
    split=["train[:90%]", "train[90%:]", "test"],
    as_supervised=True
)
tf.random.set_seed(42)
train_set = raw_train_set.shuffle(5000, seed=42).batch(32).prefetch(1)
valid_set = raw_valid_set.batch(32).prefetch(1)
test_set = raw_test_set.batch(32).prefetch(1)

# Build the vocabulary from the training reviews only.
vocab_size = 1000
text_vec_layer = tf.keras.layers.TextVectorization(max_tokens=vocab_size)
text_vec_layer.adapt(train_set.map(lambda reviews, labels: reviews))

# Text vectorization -> embedding -> GRU -> sigmoid binary classifier.
embed_size = 128
tf.random.set_seed(42)
model = tf.keras.Sequential([
    text_vec_layer,
    tf.keras.layers.Embedding(vocab_size, embed_size, mask_zero=True),
    tf.keras.layers.GRU(128),
    tf.keras.layers.Dense(1, activation="sigmoid")
])
model.compile(loss="binary_crossentropy", optimizer="nadam",
              metrics=["accuracy"])
history = model.fit(train_set, validation_data=valid_set, epochs=3)

--

Mac Mini M1 - Sonoma 14:

The weirdest thing is that it is not only slow, it also does not converge at all: val_accuracy after the last epoch is still ~0.49...

2023-10-06 12:01:37.596357: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M1
2023-10-06 12:01:37.596384: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 16.00 GB
2023-10-06 12:01:37.596389: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 5.33 GB
2023-10-06 12:01:37.596423: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-10-06 12:01:37.596440: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
2023-10-06 12:01:37.930853: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:117] Plugin optimizer for device_type GPU is enabled.

Epoch 1/3 704/704 [==============================] - 434s 601ms/step - loss: 0.6935 - accuracy: 0.4989 - val_loss: 0.6931 - val_accuracy: 0.5020

Epoch 2/3 704/704 [==============================] - 290s 411ms/step - loss: 0.6933 - accuracy: 0.5048 - val_loss: 0.6945 - val_accuracy: 0.4988

Epoch 3/3 704/704 [==============================] - 276s 392ms/step - loss: 0.6916 - accuracy: 0.5021 - val_loss: 0.6955 - val_accuracy: 0.4988

I tried to run my script with the GPU disabled on the Mac (tf.config.set_visible_devices([], 'GPU')). At least it converges...
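
For reference, this is the small change that forces the CPU-only run (a minimal sketch; the call has to execute before any tensors or models are created, otherwise TensorFlow refuses to change the visible devices):

import tensorflow as tf

# Hide the GPU so everything runs on the CPU; put this at the top of the script.
tf.config.set_visible_devices([], 'GPU')
print(tf.config.get_visible_devices())  # should now list only CPU devices

With that in place, the same training loop produces the results below: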

Epoch 1/3 704/704 [==============================] - 345s 485ms/step - loss: 0.5163 - accuracy: 0.7340 - val_loss: 0.4181 - val_accuracy: 0.8180

Epoch 2/3 704/704 [==============================] - 339s 482ms/step - loss: 0.3322 - accuracy: 0.8604 - val_loss: 0.3782 - val_accuracy: 0.8384

Epoch 3/3 704/704 [==============================] - 337s 478ms/step - loss: 0.2840 - accuracy: 0.8839 - val_loss: 0.3229 - val_accuracy: 0.8576

My old notebook with an NVIDIA GeForce GTX 960M mobile GPU (Windows 11 + WSL2):

2023-10-06 12:15:25.031824: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:432] Loaded cuDNN version 8902
Could not load symbol cublasGetSmCountTarget from libcublas.so.11. Error: /home/mzperx/miniconda3/envs/tf/lib/libcublas.so.11: undefined symbol: cublasGetSmCountTarget
2023-10-06 12:15:26.012204: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fda8c02a3e0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-10-06 12:15:26.012311: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): NVIDIA GeForce GTX 960M, Compute Capability 5.0
2023-10-06 12:15:26.180842: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:255] disabling MLIR crash reproducer, set env var MLIR_CRASH_REPRODUCER_DIRECTORY to enable.
2023-10-06 12:15:27.076801: I ./tensorflow/compiler/jit/device_compiler.h:186] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.

Epoch 1/3 704/704 [==============================] - 143s 176ms/step - loss: 0.4835 - accuracy: 0.7684 - val_loss: 0.4299 - val_accuracy: 0.8260

Epoch 2/3 704/704 [==============================] - 60s 85ms/step - loss: 0.3379 - accuracy: 0.8570 - val_loss: 0.3256 - val_accuracy: 0.8600

Epoch 3/3 704/704 [==============================] - 57s 81ms/step - loss: 0.2904 - accuracy: 0.8813 - val_loss: 0.3132 - val_accuracy: 0.8640

Google Colab with T4:

Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.

Epoch 1/3 704/704 [==============================] - 74s 89ms/step - loss: 0.4796 - accuracy: 0.7576 - val_loss: 0.4048 - val_accuracy: 0.8304

Epoch 2/3 704/704 [==============================] - 28s 40ms/step - loss: 0.3402 - accuracy: 0.8589 - val_loss: 0.3149 - val_accuracy: 0.8676

Epoch 3/3 704/704 [==============================] - 27s 38ms/step - loss: 0.2899 - accuracy: 0.8824 - val_loss: 0.3065 - val_accuracy: 0.8684

Same problem here. Running tensorflow-macos 2.6.0 and tensorflow-metal 0.1.1. My Mac Studio on Sonoma uses 20-40% of the GPU, while my MBP on Ventura uses 80-90% on the same code. And the Mac Studio on Sonoma is 6-7 times slower!