tensorflow-metal 0.5.0: converges; tensorflow-metal 1.0.1: fails to converge

MacBook Pro M2 Max / 64 GB / macOS 13.2.1 (22D68)

import tensorflow as tf

def runMnist(device='/device:CPU:0'):
    # Build and train a small MNIST classifier on the given device.
    with tf.device(device):
        #tf.config.set_default_device(device)
        mnist = tf.keras.datasets.mnist
        (x_train, y_train), (x_test, y_test) = mnist.load_data()
        # Scale pixel values from [0, 255] to [0, 1].
        x_train, x_test = x_train / 255.0, x_test / 255.0
        model = tf.keras.models.Sequential([
          tf.keras.layers.Flatten(input_shape=(28, 28)),
          tf.keras.layers.Dense(128, activation='relu'),
          tf.keras.layers.Dropout(0.2),
          tf.keras.layers.Dense(10)
        ])
        # The last layer outputs raw logits, hence from_logits=True.
        loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
        model.compile(optimizer='adam',
                      loss=loss_fn,
                      metrics=['accuracy'])
        model.fit(x_train, y_train, epochs=10)

# Same model and data on both devices; only the GPU run fails to converge.
runMnist(device='/device:CPU:0')
runMnist(device='/device:GPU:0')
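
To make "failure to converge" checkable rather than something read off the progress bar, the per-epoch losses of the two runs could be compared. This is only a sketch: it assumes runMnist is changed to return the History object from model.fit (the code above does not do this), and the helper name `converged` and its threshold are hypothetical choices for illustration.

```python
def converged(losses, max_final_loss=0.1):
    """Return True if the loss went down overall and ended at or below a threshold.

    `losses` is a list of per-epoch loss values, e.g. history.history['loss']
    from the History object returned by model.fit. The threshold is an
    arbitrary illustrative value, not something taken from the thread.
    """
    if len(losses) < 2:
        return False
    return losses[-1] < losses[0] and losses[-1] <= max_final_loss


# Hypothetical usage, assuming runMnist returned model.fit(...):
#   cpu_history = runMnist(device='/device:CPU:0')
#   gpu_history = runMnist(device='/device:GPU:0')
#   print(converged(cpu_history.history['loss']))
#   print(converged(gpu_history.history['loss']))
```

If the CPU run passes this check and the GPU run does not, the problem is device-specific rather than a problem with the model or the data.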

Python 3.9.17 macOS-13.2.1-arm64-arm-64bit

Hi @bruce__lee,

Thank you for reporting the issue and providing sample code! We were able to reproduce the failure and it's currently under investigation.

@bruce__lee, the failure is caused by a problem on the Keras side in its handling of the Adam optimizer:

tensorboard-data-server      0.7.2
tensorflow-estimator         2.15.0
tensorflow-io-gcs-filesystem 0.36.0
tensorflow-macos             2.15.0
tensorflow-metal             1.1.0
keras-nightly                3.1.0.dev2024022103

After upgrading to the latest Keras with the TF base and Metal plugin, things are working as expected:

Running on CPU
2024-02-21 01:03:57.282476: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M2 Ultra
2024-02-21 01:03:57.282496: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 192.00 GB
2024-02-21 01:03:57.282501: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 72.00 GB
2024-02-21 01:03:57.282536: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-02-21 01:03:57.282551: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
/Users/kulinseth/miniconda3/envs/tf39/lib/python3.9/site-packages/keras/src/layers/reshaping/flatten.py:37: UserWarning: Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead.
  super().__init__(**kwargs)
Epoch 1/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 2s 767us/step - accuracy: 0.8592 - loss: 0.4806
Epoch 2/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 1s 735us/step - accuracy: 0.9559 - loss: 0.1509
Epoch 3/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 1s 749us/step - accuracy: 0.9656 - loss: 0.1135
Epoch 4/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 1s 731us/step - accuracy: 0.9726 - loss: 0.0882
Epoch 5/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 1s 743us/step - accuracy: 0.9771 - loss: 0.0734
Epoch 6/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 1s 741us/step - accuracy: 0.9800 - loss: 0.0621
Epoch 7/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 1s 751us/step - accuracy: 0.9819 - loss: 0.0570
Epoch 8/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 1s 748us/step - accuracy: 0.9842 - loss: 0.0514
Epoch 9/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 1s 739us/step - accuracy: 0.9842 - loss: 0.0478
Epoch 10/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 1s 747us/step - accuracy: 0.9862 - loss: 0.0411
Duration on CPU: 14.484932 sec
Running on GPU
/Users/kulinseth/miniconda3/envs/tf39/lib/python3.9/site-packages/keras/src/layers/reshaping/flatten.py:37: UserWarning: Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead.
  super().__init__(**kwargs)
Epoch 1/10
2024-02-21 01:04:12.289801: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:117] Plugin optimizer for device_type GPU is enabled.
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 9s 4ms/step - accuracy: 0.8577 - loss: 0.4767
Epoch 2/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 8s 4ms/step - accuracy: 0.9562 - loss: 0.1466
Epoch 3/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 8s 4ms/step - accuracy: 0.9675 - loss: 0.1067
Epoch 4/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 8s 4ms/step - accuracy: 0.9738 - loss: 0.0830
Epoch 5/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 9s 5ms/step - accuracy: 0.9769 - loss: 0.0748
Epoch 6/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 9s 5ms/step - accuracy: 0.9801 - loss: 0.0636
Epoch 7/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 8s 4ms/step - accuracy: 0.9807 - loss: 0.0582
Epoch 8/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 9s 5ms/step - accuracy: 0.9837 - loss: 0.0506
Epoch 9/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 9s 5ms/step - accuracy: 0.9859 - loss: 0.0442
Epoch 10/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 9s 5ms/step - accuracy: 0.9851 - loss: 0.0430     <===

@bruce__lee, please try installing Keras 3.0 or later, which has the fix.

pip install keras==3.0.0
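
After installing, it may help to confirm which versions are actually active in the environment. The following is a minimal sketch using only the standard library's importlib.metadata; the helper names `parse_release`, `meets_minimum` and `report` are made up for this example, and the package names are the ones from the list earlier in the thread.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def parse_release(ver):
    """Return the leading numeric part of a version string as a tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", ver)
    return tuple(int(part) for part in match.group(0).split(".")) if match else ()


def meets_minimum(ver, minimum=(3, 0)):
    """True if the numeric release part of `ver` is at least `minimum`."""
    release = parse_release(ver)
    # Pad short versions such as "3" so they compare as (3, 0).
    release += (0,) * (len(minimum) - len(release))
    return release >= minimum


def report(packages=("keras", "tensorflow-macos", "tensorflow-metal")):
    """Print the installed version of each package, if any."""
    for name in packages:
        try:
            print(f"{name:20s} {version(name)}")
        except PackageNotFoundError:
            print(f"{name:20s} not installed")


if __name__ == "__main__":
    report()
```

Note that this compares only the numeric release part, so a nightly build such as 3.1.0.dev2024022103 is treated as 3.1.0.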