Tensorflow_macos performance difference between Python 3.9 and 3.8

Hi team, it's nice that TensorFlow can now run on my MacBook Air M1 (macOS 12).

I have tried it on Python 3.9 and 3.8. Comparing the performance, it looks like the 3.8 version released by the Apple team is much faster. See below:

Code for testing:

import tensorflow as tf
import time

mnist = tf.keras.datasets.mnist

# Load MNIST and scale pixel values from [0, 255] to [0, 1]
(x_train, y_train),(x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Small fully connected classifier
model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# Time only the training loop (default batch size of 32)
start = time.time()
model.fit(x_train, y_train, epochs=5)
end = time.time()
model.evaluate(x_test, y_test)
# Training wall-clock time in seconds
print(end - start)

Below is the result on Python 3.9, with tensorflow-macos released by the Google team, using tensorflow-metal.

Init Plugin
Init Graph Optimizer
Init Kernel
Metal device set to: Apple M1

systemMemory: 16.00 GB
maxCacheSize: 5.33 GB

2021-08-26 23:05:31.742656: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2021-08-26 23:05:31.742796: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
2021-08-26 23:05:31.982641: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2021-08-26 23:05:31.984696: W tensorflow/core/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz
Epoch 1/5
2021-08-26 23:05:32.121772: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
1875/1875 [==============================] - 8s 4ms/step - loss: 0.2914 - accuracy: 0.9160
Epoch 2/5
1875/1875 [==============================] - 8s 4ms/step - loss: 0.1454 - accuracy: 0.9569
Epoch 3/5
1875/1875 [==============================] - 8s 4ms/step - loss: 0.1104 - accuracy: 0.9669
Epoch 4/5
1875/1875 [==============================] - 8s 4ms/step - loss: 0.0878 - accuracy: 0.9728
Epoch 5/5
1875/1875 [==============================] - 8s 5ms/step - loss: 0.0756 - accuracy: 0.9764
2021-08-26 23:06:13.969665: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
313/313 [==============================] - 1s 4ms/step - loss: 0.0814 - accuracy: 0.9752
42.12474703788757

Below is the result on Python 3.8, with tensorflow-macos released by the Apple team.

2021-08-26 23:06:40.105911: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-08-26 23:06:40.106407: W tensorflow/core/platform/profile_utils/cpu_utils.cc:126] Failed to get CPU frequency: 0 Hz
Epoch 1/5
1875/1875 [==============================] - 1s 382us/step - loss: 0.5059 - accuracy: 0.8497
Epoch 2/5
1875/1875 [==============================] - 1s 380us/step - loss: 0.1957 - accuracy: 0.9406
Epoch 3/5
1875/1875 [==============================] - 1s 375us/step - loss: 0.1563 - accuracy: 0.9522
Epoch 4/5
1875/1875 [==============================] - 1s 375us/step - loss: 0.1336 - accuracy: 0.9596
Epoch 5/5
1875/1875 [==============================] - 1s 418us/step - loss: 0.1213 - accuracy: 0.9633
313/313 [==============================] - 0s 255us/step - loss: 0.0831 - accuracy: 0.9741
3.882438898086548

As you can see, the results are quite different: almost 11 times. Python 3.9 version: 42.12s. Python 3.8 version: 3.88s.
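For reference, the ratio between the two runs can be checked directly from the two totals printed in the logs above. This is only a quick arithmetic sketch; the variable names are made up for illustration.

```python
# Training times in seconds, copied from the two log outputs above.
time_py39_metal = 42.12474703788757   # Python 3.9, tensorflow-metal
time_py38_apple = 3.882438898086548   # Python 3.8, Apple release

# How many times longer the Python 3.9 run took.
slowdown = time_py39_metal / time_py38_apple
print(f"Python 3.9 run took {slowdown:.2f}x as long")
```

The per-step figures in the logs (about 4ms versus about 380us) point to a similar factor.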

I understand that the Python 3.8 version uses "ML Compute" and trains much faster. Is it possible to get the same on Python 3.9? Or is there any reference I can follow to do it?

On the other hand, the reason I need to use Python 3.9 is that an error occurs when using "class_weight" in a TensorFlow classification model. See GitHub issue 275.

Could you please give me some guidance? Thanks in advance.

Replies

I don't think that the issue is Python 3.8 versus 3.9. With 3.8, I see the same slowness using tensorflow-metal.

Using the CPU instead of the GPU, each epoch takes 4 seconds rather than 8. It is pretty sad that the GPU is slower.

See also my comment at https://developer.apple.com/forums/thread/686098
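To reproduce the CPU versus GPU comparison, one option is to pin the model to a device with tf.device. The following is a rough sketch, assuming TensorFlow 2.x with tensorflow-metal installed; timed_fit is a hypothetical helper name, and the model is the one from the question.

```python
import time


def timed_fit(device_name, batch_size=32, epochs=1):
    """Train the MNIST model from the question on one device; return seconds."""
    # Imported lazily so the helper can be defined without loading TensorFlow.
    import tensorflow as tf

    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
    x_train = x_train / 255.0

    # Build and train inside the device scope so variables and ops land there.
    with tf.device(device_name):
        model = tf.keras.models.Sequential([
            tf.keras.layers.Flatten(input_shape=(28, 28)),
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dropout(0.2),
            tf.keras.layers.Dense(10, activation="softmax"),
        ])
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        start = time.perf_counter()
        model.fit(x_train, y_train, epochs=epochs,
                  batch_size=batch_size, verbose=0)
        return time.perf_counter() - start
```

Calling timed_fit("/CPU:0") and then timed_fit("/GPU:0") in the same environment should give a like-for-like comparison. The first call on each device includes some one-time setup cost, so running more than one epoch gives steadier numbers.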

When comparing TensorFlow CPU and GPU performance, take special note of the impact of batch size when calling model.fit(). For example, this repository includes a performance graph on the topic: https://github.com/moritzhambach/CPU-vs-GPU-benchmark-on-MNIST
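The 1875 steps per epoch in the logs above match the default Keras batch size of 32 over the 60,000 MNIST training images. As a small illustration of how the step count shrinks as the batch grows (the helper name is made up for this sketch):

```python
import math

MNIST_TRAIN_SAMPLES = 60_000  # size of the MNIST training split


def steps_per_epoch(num_samples, batch_size):
    """Number of batches Keras runs per epoch (the last one may be partial)."""
    return math.ceil(num_samples / batch_size)


for batch_size in (32, 128, 512, 2048):
    steps = steps_per_epoch(MNIST_TRAIN_SAMPLES, batch_size)
    print(f"batch_size={batch_size:>4} -> {steps} steps per epoch")
```

Fewer, larger steps mean less fixed per-step overhead, which may be part of why the GPU looks slow at batch size 32; that is an assumption here, not something measured in this thread. To try it, pass a larger value to the test script, e.g. model.fit(x_train, y_train, epochs=5, batch_size=512).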