GPU training deadlock with tensorflow-metal 0.5

I am training a model with tensorflow-metal and hitting a training deadlock similar to the one described in https://developer.apple.com/forums/thread/703081. The following is a minimal script that reproduces the problem.

import tensorflow as tf

# Switching to '/cpu:0' avoids the hang described below.
# dev = '/cpu:0'
dev = '/gpu:0'
epochs = 1000
batch_size = 32
hidden = 128


# Load MNIST and scale pixel values to [0, 1].
mnist = tf.keras.datasets.mnist
train, _ = mnist.load_data()
x_train, y_train = train[0] / 255.0, train[1]

with tf.device(dev):
    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(hidden, activation='relu'))
    model.add(tf.keras.layers.Dropout(0.3))
    model.add(tf.keras.layers.Dense(hidden, activation='relu'))
    model.add(tf.keras.layers.Dropout(0.3))
    model.add(tf.keras.layers.Dense(10, activation='softmax'))
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

    model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs)

Test configuration (a quick version check follows the list):

  • MacBook Air M1
  • macOS 12.4
  • tensorflow-deps 2.9
  • tensorflow-macos 2.9.2
  • tensorflow-metal 0.5.0
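
For reference, the installed versions and GPU visibility can be confirmed like this (a small sketch; importlib.metadata assumes the packages were installed via pip):

import importlib.metadata

import tensorflow as tf

# Print the TensorFlow build and the Metal plugin version.
print("tensorflow-macos:", tf.__version__)
print("tensorflow-metal:", importlib.metadata.version("tensorflow-metal"))

# The Metal plugin should register the M1 GPU as a PhysicalDevice.
print("GPUs:", tf.config.list_physical_devices("GPU"))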

With this configuration and the above code, training stops in the middle of the 27th epoch (100% of the time, as far as I have tested). Interestingly, the problem cannot be reproduced if I change any of the following (a sketch of workaround 2 follows the list):

  1. switch from GPU to CPU
  2. remove the Dropout layers
  3. downgrade tensorflow-metal to 0.4
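
For example, workaround 2 can be applied without restructuring the script by gating the Dropout layers behind a flag (a minimal sketch reusing dev and hidden from the script above; the use_dropout flag is my own addition):

use_dropout = False  # the hang reproduces only when the Dropout layers are present

with tf.device(dev):
    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.Flatten())
    # Two Dense blocks, each optionally followed by Dropout.
    for _ in range(2):
        model.add(tf.keras.layers.Dense(hidden, activation='relu'))
        if use_dropout:
            model.add(tf.keras.layers.Dropout(0.3))
    model.add(tf.keras.layers.Dense(10, activation='softmax'))
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])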

Replies

@masa6s

Thanks for reporting the issue and for the excellent test script to reproduce it. I can confirm that I have reproduced this locally and found an issue related to the Dropout layer that causes the training to stop. Once we have verified the fix, we will include it in tensorflow-metal==0.5.1.
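
In the meantime, downgrading tensorflow-metal to 0.4 as in item 3 of your list will avoid the hang. If it helps, a startup check along these lines can flag the affected version (a sketch, not an official check; assumes the plugin was installed with pip):

import importlib.metadata

# Warn if the plugin version affected by the Dropout hang is installed.
if importlib.metadata.version("tensorflow-metal") == "0.5.0":
    print("tensorflow-metal 0.5.0 can hang with Dropout on GPU; "
          "a temporary fix is: pip install tensorflow-metal==0.4.0")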