I am training a model with tensorflow-metal and am hitting a training deadlock similar to the one described in https://developer.apple.com/forums/thread/703081. The following is a minimal script that reproduces the problem.
import tensorflow as tf
#dev = '/cpu:0'
dev = '/gpu:0'
epochs = 1000
batch_size = 32
hidden = 128
mnist = tf.keras.datasets.mnist
train, _ = mnist.load_data()
x_train, y_train = train[0] / 255.0, train[1]
with tf.device(dev):
    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(hidden, activation='relu'))
    model.add(tf.keras.layers.Dropout(0.3))
    model.add(tf.keras.layers.Dense(hidden, activation='relu'))
    model.add(tf.keras.layers.Dropout(0.3))
    model.add(tf.keras.layers.Dense(10, activation='softmax'))
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs)
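As a sanity check that the ops really run on the Metal GPU (and do not silently fall back to the CPU), I enable device placement logging before building the model; this is just the check I use, not part of the repro:

import tensorflow as tf

# Print the devices TensorFlow can see; the Metal plugin should expose one GPU.
print(tf.config.list_physical_devices('GPU'))

# Log where each op is placed; lines mentioning GPU:0 confirm Metal placement.
tf.debugging.set_log_device_placement(True)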
My test configuration is as follows (a snippet for printing these versions is shown after the list):
- MacBook Air M1
- macOS 12.4
- tensorflow-deps 2.9
- tensorflow-macos 2.9.2
- tensorflow-metal 0.5.0
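In case it helps with reproducing this, here is one way to print those versions from Python (importlib.metadata is just the lookup I happen to use; pip show gives the same information):

import platform
import tensorflow as tf
from importlib.metadata import version

print("macOS:", platform.mac_ver()[0])                   # 12.4 on my machine
print("tensorflow-macos:", tf.__version__)               # 2.9.2
print("tensorflow-metal:", version("tensorflow-metal"))  # 0.5.0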
With this configuration and the code above, training hangs in the middle of the 27th epoch (100% of the time, as far as I have tested). Interestingly, the problem cannot be reproduced if I change any of the following (sketched in code after the list):
- switch from the GPU to the CPU (dev = '/cpu:0')
- remove the Dropout layers
- downgrade tensorflow-metal to 0.4
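To spell out what each change looks like (these are my own edits to the script above; each one avoids the hang on its own, they are only shown together here for brevity):

import tensorflow as tf

hidden = 128

# Change 1: pin everything to the CPU instead of the Metal GPU.
dev = '/cpu:0'

# Change 2: same architecture, but with the two Dropout layers removed.
with tf.device(dev):
    model = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(hidden, activation='relu'),
        tf.keras.layers.Dense(hidden, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])

# Change 3 is a package change rather than a code change (run in the shell):
#   python -m pip install tensorflow-metal==0.4.0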