I am training a model with tensorflow-metal and I am hitting a training deadlock similar to the one described in https://developer.apple.com/forums/thread/703081. The following is a minimal code example that reproduces the problem.
```python
import tensorflow as tf

# dev = '/cpu:0'
dev = '/gpu:0'

epochs = 1000
batch_size = 32
hidden = 128

mnist = tf.keras.datasets.mnist
train, _ = mnist.load_data()
x_train, y_train = train[0] / 255.0, train[1]

with tf.device(dev):
    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(hidden, activation='relu'))
    model.add(tf.keras.layers.Dropout(0.3))
    model.add(tf.keras.layers.Dense(hidden, activation='relu'))
    model.add(tf.keras.layers.Dropout(0.3))
    model.add(tf.keras.layers.Dense(10, activation='softmax'))

    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])

    model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs)
```
My test configuration is as follows (a quick device check is included after the list):
- MacBook Air M1
- macOS 12.4
- tensorflow-deps 2.9
- tensorflow-macos 2.9.2
- tensorflow-metal 0.5.0
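
For reference, the GPU visibility can be confirmed with standard TensorFlow calls; a minimal sketch (nothing here is specific to my setup beyond the versions listed above):

```python
import tensorflow as tf

# Sanity check: print the TensorFlow version and the GPU devices it can see.
# With tensorflow-metal installed, one GPU device ('/GPU:0') should be listed.
print(tf.__version__)
print(tf.config.list_physical_devices('GPU'))
```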
With this configuration and the code above, training hangs in the middle of the 27th epoch (100% of the time in my tests). Interestingly, the problem cannot be reproduced if I make any of the following changes:
- switch the device from GPU to CPU (see the sketch after this list)
- remove the Dropout layers
- downgrade tensorflow-metal to 0.4
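
For now I work around the hang by running on the CPU. Besides setting `dev = '/cpu:0'` in the snippet above, the GPU can also be hidden globally; a minimal sketch using the standard `tf.config` API (this must be called before any ops are created):

```python
import tensorflow as tf

# Hide all GPUs so that every op is placed on the CPU.
# Call this before any tensors or models are created.
tf.config.set_visible_devices([], 'GPU')

# Training then proceeds as in the reproduction code above,
# with dev = '/cpu:0' (or with no tf.device context at all).
```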