tensorflow-metal freezing nondeterministically during model training

I am training a model using tensorflow-metal and model training (and the whole application) freezes up. The behavior is nondeterministic. I believe the problem is with Metal (1) because of the contents of the backtraces below, and (2) because when I run the same code on a machine with non-Metal TensorFlow (using a GPU), everything works fine.

I can't share my code publicly, but I would be willing to share it privately with an Apple engineer over email if that would help. It's hard to create a minimal reproduction example since my program is somewhat complex and the bug is nondeterministic, though the bug does appear pretty reliably.
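To give a rough idea of the shape of the workload (this is just a placeholder sketch, not my actual code and not a reproduction -- the real model and input pipeline are much more involved), the training is essentially a long-running Keras fit over a tf.data pipeline:

import numpy as np
import tensorflow as tf

# Placeholder model and data -- the real ones are much larger and more complex.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(128,)),
    tf.keras.layers.Dense(10),
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

x = np.random.rand(10_000, 128).astype("float32")
y = np.random.randint(0, 10, size=(10_000,))
ds = tf.data.Dataset.from_tensor_slices((x, y)).shuffle(1024).batch(64).repeat()

# The freeze happens nondeterministically partway through a long fit() like this,
# with threads stuck inside Metal / MPS calls according to the backtraces.
model.fit(ds, steps_per_epoch=500, epochs=20)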

It looks like the problem might be in some Metal Performance Shaders init code.

The state of everything (backtraces, etc.) when the program freezes is attached.

  • I should say -- these logs are under 2.6.0, but the same bug was happening under 2.8.0. I downgraded to see if the bug would go away. I am upgrading to 2.8.0 again now, and if/when the bug appears again I will post more backtraces.


Replies

Follow-up: just got the same freeze on 2.8.0. Backtraces attached, they look the same.

Hi @andmis!

Thanks for reporting this and providing the backtraces. Based on those, it looks like a thread is left waiting, causing the program to hang. We are already investigating a similar issue (https://developer.apple.com/forums/thread/702174) and I'm hoping the root cause is the same for both. We are working on a fix, and I'll update here once it's out so you can try it on your code. If the issue you are seeing persists after that, we can look into ways to debug your specific case.

I'm seeing something similar when training the SwinTransformerV2Tiny_ns model from https://github.com/leondgarse/keras_cv_attention_models. After roughly 4075 training steps it pretty reliably seems to just give up on using the GPU: GPU memory usage and utilization drop off, and CPU usage also stays low. You can see the steps/sec absolutely tank in the training logs (a simplified sketch of the setup follows the logs):

FastEstimator-Train: step: 3975; ce: 1.2872236; model_lr: 0.00022985446; steps/sec: 4.19;
FastEstimator-Train: step: 4000; ce: 1.3085787; model_lr: 0.00022958055; steps/sec: 4.2;
FastEstimator-Train: step: 4025; ce: 1.3924551; model_lr: 0.00022930496; steps/sec: 4.19;
FastEstimator-Train: step: 4050; ce: 1.4702798; model_lr: 0.0002290277; steps/sec: 4.16;
FastEstimator-Train: step: 4075; ce: 1.2734954; model_lr: 0.00022874876; steps/sec: 0.05;
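For reference, the model is built roughly like this (a simplified sketch -- the real run goes through FastEstimator's training loop, and the constructor arguments below are guesses rather than my exact settings):

import tensorflow as tf
from keras_cv_attention_models import swin_transformer_v2

# Assumed constructor arguments; the actual input shape / class count differ.
model = swin_transformer_v2.SwinTransformerV2Tiny_ns(
    input_shape=(256, 256, 3),
    num_classes=100,
    pretrained=None,
)
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
)
# Training runs normally at ~4 steps/sec for about 4000 steps, then GPU usage
# drops to near zero and steps/sec collapses, as in the logs above.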

GPU memory utilization over time: about 30% during training, then it just cuts out. (The first dip is an evaluation step during training; training then resumes before cutting out.)

GPU utilization over time: about 100% during training, then it just stalls out. (The first dip is an evaluation step during training; training then resumes before cutting out.)

After the GPU gives up, the terminal no longer responds to attempts to kill the training with Ctrl-C.

Hi @TortoiseHam

Could you test whether using tensorflow-macos==2.9.2 and tensorflow-metal==0.5.1 solves your issue? Those releases include multiple bug fixes addressing GPU hangups that will hopefully resolve this for you.
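As a quick sanity check that the new versions are actually the ones being picked up in your environment (a minimal snippet, assuming a standard pip install of the two packages suggested above):

import tensorflow as tf

# Expect 2.9.2 here after installing tensorflow-macos==2.9.2 / tensorflow-metal==0.5.1.
print(tf.__version__)
# The Metal plugin should expose the GPU as a physical device.
print(tf.config.list_physical_devices("GPU"))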

  • I have the same problem with tensorflow-macos==2.9.2 and tensorflow-metal==0.5.1 on an M1 Pro. Logs are empty.
