Crash when running custom train step and layers

Question

tf_noob OP

Created Dec ’23

My environment: TensorFlow 2.14, tensorflow-metal 1.1, M3 Max

I am working on a GAN that makes heavy use of residual sums and concatenations. It trains correctly when using the CPU only. However, if I enable the GPU, it crashes with:

oc("mps_slice_1"("(mpsFileLoc): /AppleInternal/Library/BuildRoots/d615290d-668b-11ee-9734-0697ca55970a/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm":359:0)): error: 'mps.slice' op failed: length value 32 does not fit within the dimension size (33) with start value (32)
/AppleInternal/Library/BuildRoots/d615290d-668b-11ee-9734-0697ca55970a/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm:2133: failed assertion `Error: MLIR pass manager failed'

Some customizations that I suspect might be related to the error:

tf.bitwise.bitwise_xor, tf.concat, tf.pad in custom layers
numpy.random in train steps.
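
One way to narrow down which of these is involved is to pin the suspect ops to the CPU while leaving the rest of the model on the GPU, then see whether the crash goes away. The snippet below is only a rough sketch of that idea: the function name suspect_ops_on_cpu, the tensor shapes, the dtype, and the padding amounts are all assumptions for illustration, not the actual layer from the model.

```python
import tensorflow as tf


def suspect_ops_on_cpu(a, b):
    """Hypothetical stand-in for a custom-layer body, pinned to the CPU."""
    # Everything inside this block is placed on the CPU, so the Metal
    # backend never has to compile these particular ops.
    with tf.device("/CPU:0"):
        mixed = tf.bitwise.bitwise_xor(a, b)          # integer inputs only
        joined = tf.concat([a, mixed], axis=-1)       # doubles the channels
        padded = tf.pad(joined, [[0, 0], [1, 1], [1, 1], [0, 0]])
    return padded


# Assumed NHWC integer inputs with 32 channels, matching the channel count
# mentioned in the post; the real shapes may differ.
a = tf.ones([1, 4, 4, 32], dtype=tf.int32)
b = tf.zeros([1, 4, 4, 32], dtype=tf.int32)
out = suspect_ops_on_cpu(a, b)
print(out.shape)
```

If training becomes stable with the block pinned to the CPU, moving the ops back to the GPU one at a time should show which one triggers the failing slice.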

Another debugging hint I found: the "32" in the error is the number of channels in my models' conv layers, and the value in the error changes when I change the number of channels.
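
Reading the numbers in the error message against this hint: the slice asks for 32 elements starting at index 32 along an axis that only has 33 elements. The small check below spells out that arithmetic. It assumes the usual bounds rule (start + length must not exceed the axis size); the helper name slice_fits is made up for illustration and is not part of TensorFlow or MPSGraph.

```python
def slice_fits(dim_size: int, start: int, length: int) -> bool:
    """Return True if the range [start, start + length) lies inside the axis."""
    return start >= 0 and length >= 0 and start + length <= dim_size


# Values taken from the error message: size 33, start 32, length 32.
reported = slice_fits(dim_size=33, start=32, length=32)

# For comparison: a 32-wide slice from the start of the same axis,
# and a 1-wide slice at index 32.
from_zero = slice_fits(dim_size=33, start=0, length=32)
last_one = slice_fits(dim_size=33, start=32, length=1)

print(reported, from_zero, last_one)
```

So the failing slice looks like an attempt to take a second block of 32 channels from a tensor whose axis is only one element longer than the first block, which might point at a concat or pad that yields 33 where 64 was expected.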

Does anyone know what is wrong? Thank you so much.

Answer 1

tf_noob OP

Dec ’23

There is one more possible clue: although it crashes immediately with the above error most of the time, in rare cases it trains for around 2-3 epochs and then crashes with the same error.
