Loss unable to converge with tensorflow-metal > 0.8.0

Hi,

I've found that the training loss often fails to converge when training a model on the GPU of an M1 or M2 Max.

After finding many similar reports in this forum, with no answer from Apple or anyone else resolving the issue, I tried to identify which package combinations fail and which one works.

The issue seems to depend on the combination of TF (tensorflow), TFM (tensorflow-metal), and batch size.

The most recent combination that seems to work in every situation is:

  • TensorFlow 2.12
  • tensorflow-metal 0.8.0

Both must be installed with pip, not from conda-forge, like this:

pip install tensorflow-macos==2.12
pip install tensorflow-metal==0.8.0
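
To make sure the right builds are actually active (and not a leftover conda-forge install in the same environment), a quick sanity check like the following can help; it simply assumes the two pip packages above are installed:

import tensorflow as tf
from importlib.metadata import version

# Versions actually loaded in this environment
print("tensorflow-macos:", tf.__version__)
print("tensorflow-metal:", version("tensorflow-metal"))

# The Metal plugin should expose one GPU device
print(tf.config.list_physical_devices('GPU'))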

Every other recent combination fails to get the training loss to converge.

Here is the code to reproduce the issue. Sometimes the divergence appears clearly after only 10 epochs; sometimes the number of epochs must be increased to 30 to see it more clearly.

import tensorflow as tf
import pandas as pd
from tensorflow.keras import datasets

# Load CIFAR-10
(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()

# Normalize pixel values to be between 0 and 1
train_images, test_images = train_images / 255.0, test_images / 255.0

class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']


epochs = 20
batch_size = 128

with tf.device('gpu:0'):

    model = tf.keras.models.Sequential([
            tf.keras.layers.Conv2D(32,3,activation = 'relu',padding='same',input_shape=train_images.shape[1:]),
            tf.keras.layers.MaxPooling2D(2),
            tf.keras.layers.Conv2D(64,3,activation = 'relu',padding='same'),
            tf.keras.layers.MaxPooling2D(2),
            tf.keras.layers.Conv2D(128,3,activation = 'relu',padding='same'),
            tf.keras.layers.MaxPooling2D(2),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(64,activation='relu'),
            tf.keras.layers.Dense(10)
        ])
    
    model.compile(optimizer='adam',
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=['accuracy'])
    
    history = model.fit(train_images, train_labels, epochs=epochs, batch_size=batch_size,
                        validation_data=(test_images, test_labels))


# Plot training/validation loss and accuracy side by side
pd.DataFrame(history.history).plot(subplots=(['loss','val_loss'],['accuracy','val_accuracy']),layout=(1,2),figsize=(15,5));
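
To see the divergence without relying on the plot, a quick numeric check on the history returned above can be added; this is just an illustrative comparison of the first and last training loss:

# Compare the first and last training loss to spot divergence numerically
losses = history.history['loss']
print(f"epoch 1 loss: {losses[0]:.4f}, epoch {len(losses)} loss: {losses[-1]:.4f}")
if losses[-1] > losses[0]:
    print("Training loss went UP during training -> divergence")
else:
    print("Training loss decreased as expected")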

Here are the results for some combinations.

TF        TF Metal   Batch Size   Epochs   Training Loss Convergence
2.14.0    1.1.0      128          10       NO
2.14.0    1.1.0      512          10       YES
2.13.0    1.0.0      128          20       NO
2.13.0    1.0.0      512          20       YES
2.12.0    1.0.0      128          20       NO
2.12.0    1.0.0      512          30       NO
2.12.0    0.8.0      128          20       YES
2.12.0    0.8.0      512          30       YES

As an example, here are the loss and accuracy curves for TF 2.14 and TFM 1.1.0 with batch size = 128. The training loss (blue line) goes up.

For TF 2.12, TFM 1.0.0, batch size = 128, the training loss (blue line) also goes up.

And the one that works as expected: TF 2.12, TFM 0.8.0, batch size = 128.

So, Apple, can you please fix this in the next release?

I also suggest that, before publishing a release, you implement a simple automated test procedure that trains a few models like this one with various batch sizes and epoch counts and analyzes the loss in the history to detect major training loss divergence, as sketched below.
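
As a rough sketch of what such a check could look like (the model, the convergence threshold, and the batch-size grid below are my own illustrative assumptions, not an existing test suite):

import tensorflow as tf
from tensorflow.keras import datasets

def loss_converges(batch_size, epochs=10, threshold=1.0):
    """Train the small CNN used above; return True if the training loss
    decreased and ended below an (arbitrary) absolute threshold."""
    (train_images, train_labels), _ = datasets.cifar10.load_data()
    train_images = train_images / 255.0

    model = tf.keras.models.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation='relu', padding='same',
                               input_shape=train_images.shape[1:]),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Conv2D(64, 3, activation='relu', padding='same'),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(10)
    ])
    model.compile(optimizer='adam',
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=['accuracy'])
    history = model.fit(train_images, train_labels,
                        epochs=epochs, batch_size=batch_size, verbose=0)

    losses = history.history['loss']
    return losses[-1] < losses[0] and losses[-1] < threshold

# Flag any configuration whose training loss does not come down
for bs in (128, 512):
    print(f"batch_size={bs}: {'OK' if loss_converges(bs) else 'DIVERGED'}")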

Thank you
Best regards