GPU training deadlock with tensorflow-metal 0.5

I am training a model using tensorflow-metal and I am running into a training deadlock similar to https://developer.apple.com/forums/thread/703081. Below is minimal code to reproduce the problem.

import tensorflow as tf

#dev = '/cpu:0'
dev = '/gpu:0'
epochs = 1000
batch_size = 32
hidden = 128


mnist = tf.keras.datasets.mnist
train, _ = mnist.load_data()
x_train, y_train = train[0] / 255.0, train[1]

with tf.device(dev):
    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(hidden, activation='relu'))
    model.add(tf.keras.layers.Dropout(0.3))
    model.add(tf.keras.layers.Dense(hidden, activation='relu'))
    model.add(tf.keras.layers.Dropout(0.3))
    model.add(tf.keras.layers.Dense(10, activation='softmax'))
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

    model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs)

Test configurations are:

  • MacBook Air M1
  • macOS 12.4
  • tensorflow-deps 2.9
  • tensorflow-macos 2.9.2
  • tensorflow-metal 0.5.0

With this configuration and the above code, training stops in the middle of the 27th epoch (100% of the time, as far as I have tested). Interestingly, the problem cannot be reproduced if I change any of the following:

  1. switch from GPU to CPU
  2. remove the Dropout layers
  3. downgrade tensorflow-metal to 0.4

Replies

@masa6s

Thanks for reporting the issue and the excellent test script to reproduce it. I can confirm that I have reproduced this locally and found an issue relating to the dropout layer that causes the training to stop. After we have verified the fix we will include it in tensorflow-metal==0.5.1.
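
Until that release is out, one possible stopgap (an unverified sketch of my own, not an Apple-provided fix) is to replace tf.keras.layers.Dropout with a hand-rolled dropout layer built from a random mask; whether this actually sidesteps the kernel that hangs on tensorflow-metal 0.5.0 is an assumption.

import tensorflow as tf

# Unverified workaround sketch: manual inverted dropout via a random mask,
# used in place of tf.keras.layers.Dropout. It may or may not avoid the
# kernel that triggers the hang on tensorflow-metal 0.5.0.
class ManualDropout(tf.keras.layers.Layer):
    def __init__(self, rate, **kwargs):
        super().__init__(**kwargs)
        self.rate = rate

    def call(self, inputs, training=False):
        if training:
            keep_prob = 1.0 - self.rate
            mask = tf.cast(tf.random.uniform(tf.shape(inputs)) < keep_prob,
                           inputs.dtype)
            return inputs * mask / keep_prob  # inverted dropout scaling
        return inputs

# e.g. model.add(ManualDropout(0.3)) instead of model.add(tf.keras.layers.Dropout(0.3))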

  • Hi! Was there any solution to this problem? I have been facing the same issue.


Same for me. (Python 3.9.13, tensorflow-macos 2.9.2, tensorflow-metal 0.5.0)

  • Edit: some respite with Python 3.8.13, tensorflow-macos 2.8.0, tensorflow-metal 0.4.0.

  • Looks like there is some scheduling issue! Mine stopped somewhere in the middle of epoch two, and I did not use a very large dataset. Does anyone know how to upload snapshots here?

  • This is the suggestion from FAQ 2: note that not all forum features work as expected. Inline images/screenshots are a silent fail, as one example - you can add them, but others can't see them. Use your words, or upload to iCloud via Photo Sharing and add the link to your post, or upload elsewhere and add the (borked) link instead.


Same problem here.

Python 3.8.9, tensorflow-macos 2.9.2, tensorflow-metal 0.5.0

Hello, I don't know if it is for the same reason, but I tried to fine-tune a BERT model and at some point I also hit a deadlock (I need to kill the kernel and start over). Whether the deadlock happens depends on the quantity of data I use for fine-tuning. In the case below, training stops in the middle of the 3rd epoch.

my machine:

  • macOS 12.5
  • MacBook Pro, Apple M1 Max

I use:

  • python                    3.10.5
  • tensorflow-macos          2.9.2
  • tensorflow-metal          0.5.0
  • tokenizers                0.12.1.dev0 
  • transformers              4.22.0.dev0

data : https://www.kaggle.com/datasets/kazanova/sentiment140

Quantity of tweets used: 11200

import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                             num_labels=2)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=tf.metrics.SparseCategoricalAccuracy(),
)

# tf_train_dataset and tf_validation_dataset are tokenized tf.data.Dataset
# objects; their construction is not shown in this post (see the sketch below).
model.fit(tf_train_dataset,
          validation_data=tf_validation_dataset,
          epochs=4,
          )
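
For anyone trying to reproduce this: the post does not show how tf_train_dataset and tf_validation_dataset are built. Below is a minimal sketch of one way to build them from the Sentiment140 CSV; the file name, column handling, sample size, and train/validation split are my assumptions, not the original poster's code.

import pandas as pd
import tensorflow as tf
from transformers import AutoTokenizer

# Assumed preprocessing: load the Sentiment140 CSV (latin-1 encoded, no header),
# sample 11200 tweets, and map the 0/4 sentiment labels to 0/1.
cols = ["target", "ids", "date", "flag", "user", "text"]
df = pd.read_csv("training.1600000.processed.noemoticon.csv",
                 encoding="latin-1", names=cols).sample(11200, random_state=0)
labels = (df["target"] == 4).astype("int32").values

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer(df["text"].tolist(), padding=True, truncation=True,
                max_length=128, return_tensors="np")

dataset = tf.data.Dataset.from_tensor_slices((dict(enc), labels))
n_val = 1200  # arbitrary validation size, for illustration only
tf_validation_dataset = dataset.take(n_val).batch(32)
tf_train_dataset = dataset.skip(n_val).shuffle(10_000).batch(32)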
  • I think the problem might be linked to the memory leakage issue (https://developer.apple.com/forums/thread/711753). By the way, when is tensorflow-metal==0.5.1 coming? Thanks!

  • That information on 0.5.1 would be more than welcome... Meanwhile I am jumping between the CPU and PyTorch...


We have now released tensorflow-metal==0.5.1 which addresses multiple memory leak issues leading to GPU hangups. Please give it a try and see if it helps with the problems you are seeing.

Hi,

I did not see any improvement with TF-MACOS==2.9.2 and TF-METAL==0.5.1 on Python 3.9.13. Please see my latest (relevant) response in the thread https://developer.apple.com/forums/thread/711753.

This is why I am sticking to my old setup of TF-MACOS==2.8.0 and TF-METAL==0.4.0 along with Python 3.8.13. And I am using the CPU-only option, which gives relatively less memory leakage. Even then, I have to wait until the end to see whether all the epochs (merely 3) finish.

Thanks, Bapi

I ran into the same issue. The training would stop at some random epoch with no error or warning when using tensorflow-metal 0.5.1.

The only way I could fix this was to reinstall my environment from scratch following Apple's instructions, using the Miniforge3-MacOSX-arm64.sh installer, and this time installing tensorflow-metal 0.4.0.

  • TF-METAL==0.4.0 so far serves the purpose for me, along with TF-MACOS==2.8.0 and Python 3.8.13. But I am desperately looking to jump to TF-METAL 0.5.x or higher with TF-MACOS 2.9.x on Python 3.9.x for faster performance from the GPU (which we paid for). Otherwise, my 2017 MBP-13 (Intel i5, 16GB RAM) does a decent job for smaller datasets.

  • Are performance expectations for tensorflow-metal 0.4.0 and tensorflow-metal 0.5.x documented somewhere?


Hi @bahman_n and @karbapi! Thanks for verifying that this issue still persists in 0.5.1. I'll continue looking into the issue to get to the bottom of this.

Hi,

I wish to share a strange thing I noticed apart from this issue (and the memory leak issue for GPU):

**There is a huge gap between two epochs. An epoch typically takes around 4-5 minutes, but the inter-epoch gap spans around 6-7 minutes. This is possibly because the process scheduler is under-optimised for the M1 Ultra.**

Hope this pointer helps with your fix and yields a better resolution in subsequent TF-METAL releases.

Thanks, Bapi

  • I wish I could share a screenshot to showcase my observation in a better way!
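
In lieu of a screenshot, one way to put numbers on the inter-epoch gap described above is a small Keras callback that logs wall-clock time per epoch and the pause before the next epoch starts. This is a sketch added here for illustration, not code from the original posts.

import time
import tensorflow as tf

# Logs how long each epoch takes and how long the pause is before the next
# epoch begins, to quantify the reported 6-7 minute inter-epoch gap.
class EpochGapLogger(tf.keras.callbacks.Callback):
    def __init__(self):
        super().__init__()
        self._epoch_start = None
        self._last_epoch_end = None

    def on_epoch_begin(self, epoch, logs=None):
        now = time.time()
        if self._last_epoch_end is not None:
            print(f"Gap before epoch {epoch}: {now - self._last_epoch_end:.1f}s")
        self._epoch_start = now

    def on_epoch_end(self, epoch, logs=None):
        now = time.time()
        print(f"Epoch {epoch} took {now - self._epoch_start:.1f}s")
        self._last_epoch_end = now

# Usage: model.fit(..., callbacks=[EpochGapLogger()])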


I am really disappointed with my Mac Studio (M1 Ultra, 64-core GPU, 128GB RAM). Now I am wondering why I spent that much money on this ****** machine!

Now I am getting an error due to multiprocessing, and the training has stopped!

2022-08-26 07:49:49.615373: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-08-26 07:49:49.615621: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
Process Keras_worker_SpawnPoolWorker-92576:
Traceback (most recent call last):
  File "/Users/bapikar/miniforge3/envs/tf28_python38/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/Users/bapikar/miniforge3/envs/tf28_python38/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/bapikar/miniforge3/envs/tf28_python38/lib/python3.8/multiprocessing/pool.py", line 109, in worker
    initializer(*initargs)
  File "/Users/bapikar/miniforge3/envs/tf28_python38/lib/python3.8/site-packages/keras/utils/data_utils.py", line 812, in init_pool_generator
    id_queue.put(worker_proc.ident, block=True, timeout=0.1)
  File "/Users/bapikar/miniforge3/envs/tf28_python38/lib/python3.8/multiprocessing/queues.py", line 84, in put
    raise Full
queue.Full

Regarding the deadlock, it seems I accidentally found a workaround. You have to include a line that explicitly says you want to use the GPU, especially if, like me, you work cell by cell. Example below:

with tf.device('/gpu:0'):
    # build your model here
    ...

Then you do other things in your notebook, like batching and such... Then you train your model:

with tf.device('/gpu:0'):
    hist_1 = model_1.fit(...)  # pass your datasets and training arguments here

Somehow, this stopped my deadlocks. In addition (and I don't know if it is related, but just in case), I stopped using Safari for my Jupyter notebook and switched to Chrome instead (not for this reason, but mainly because Safari kept reloading my "heavy" notebook...).

Hope this helps.

cheers
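
To make the suggested workaround concrete, here is a minimal end-to-end sketch (my own illustration, not the poster's exact code) that applies the same pattern to the MNIST reproducer from the original post, with model construction and training each wrapped in an explicit GPU device scope:

import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train / 255.0

# Build the model inside an explicit GPU device scope...
with tf.device('/gpu:0'):
    model_1 = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    model_1.compile(loss='sparse_categorical_crossentropy',
                    optimizer='adam', metrics=['accuracy'])

# ...do other notebook work here (batching, callbacks, and so on)...

# ...and train inside another explicit GPU device scope.
with tf.device('/gpu:0'):
    hist_1 = model_1.fit(x_train, y_train, batch_size=32, epochs=5)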

Hi, Thanks for sharing the info.

However, my issue is a little different (please see the thread on memory leakage: https://developer.apple.com/forums/thread/711753).

My training stops apparently due to memory leakage, and one potential reason (my guess) is a CPU/GPU scheduling issue when memory usage is very high (say ~125GB out of the 128GB RAM in my system, with no swap being used for whatever reason) on my M1 Ultra machine with 64-core GPU (Mac Studio).

And FYI, my training setup uses a custom training loop:

with tf.GradientTape() as tape:
       .......
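
For context, a bare-bones training step of the kind implied above might look like the following. This is purely illustrative, since the post only shows the GradientTape line; model, optimizer, and loss_fn stand in for whatever the original code defines.

import tensorflow as tf

# Illustrative custom training step built around tf.GradientTape.
# model, optimizer, and loss_fn are placeholders for the original setup.
@tf.function
def train_step(model, optimizer, loss_fn, x_batch, y_batch):
    with tf.GradientTape() as tape:
        logits = model(x_batch, training=True)
        loss = loss_fn(y_batch, logits)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss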

And I do not use a Jupyter notebook. I run my work from the command line, and my code (structured over multiple files) is written with a text editor such as GVIM.

--Bapi

  • I tried to have a look at the thread you linked, but it seems the link is dead. I think you are probably right. I suspect that somewhere in the code there is an exchange of information between CPU and GPU, and if we fail to "force" it onto the GPU it somehow gets lost on the way... Sorry that my solution can't help you more. Hope you find a way out.

  • For the memory leakage issue, please search for

    "Huge memory leakage issue with tf.keras.models.predict()"


CPU-only run on Mac Studio (20c CPU, 64c GPU, 128GB RAM). Training is STALLED; perhaps the CPUs are DEAD for some FABULOUS REASONS. Below is the snapshot (with temperatures of the different cores).

HEY, ANY UPDATE? SHOULD MY 64c GPUs BE ALLOWED TO SIT IDLE?

I am wondering if this is a manifestation of a related problem.

My python code starts with: from transformers import AutoTokenizer, AutoModel

Then crashes during execution of the following code: model = AutoModel.from_pretrained("bert-base-uncased")

Running from within the PyCharm IDE, I get this error: Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)

Interestingly, this crashes on my Mac mini (Intel i5 with 16GB RAM) but runs fine on my MacBook Air (Apple M1 with 16GB RAM). Both are running macOS Ventura 13.0.1 at the moment.