Crash when running custom train step and layers

My environment: TensorFlow 2.14, tensorflow-metal 1.1, M3 Max

I am working on a GAN full of residual sums and concatenations. It trains correctly when using the CPU only. However, if I enable the GPU, it crashes with:

oc("mps_slice_1"("(mpsFileLoc): /AppleInternal/Library/BuildRoots/d615290d-668b-11ee-9734-0697ca55970a/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm":359:0)): error: 'mps.slice' op failed: length value 32 does not fit within the dimension size (33) with start value (32)
/AppleInternal/Library/BuildRoots/d615290d-668b-11ee-9734-0697ca55970a/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm:2133: failed assertion `Error: MLIR pass manager failed'

Some customizations that I suspect might be related to the error:

  • tf.bitwise.bitwise_xor, tf.concat, tf.pad in custom layers
  • numpy.random in train steps.

Another debugging hint I found is that the "32" is the number of channels in my models' conv layers, and it changes as I change the number of channels.
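To spell out the numbers in the error message, here is a minimal plain-Python sketch of the bounds check the message appears to describe. The helper name `slice_fits` is made up for illustration and is not an MPS or TensorFlow API, and the condition itself is an assumption about what "does not fit" means.

```python
def slice_fits(start: int, length: int, dim_size: int) -> bool:
    """Return True if the range [start, start + length) lies inside a dimension of size dim_size."""
    return start >= 0 and length >= 0 and start + length <= dim_size


# Values copied from the error message above.
start, length, dim_size = 32, 32, 33

# The requested slice would end at index 64, past the end of a size-33 dimension.
print(slice_fits(start, length, dim_size))

# Number of elements actually available from the start index onward.
print(dim_size - start)
```

If that reading is right, the failing slice asks for 32 channels starting at index 32 of a 33-channel tensor, where only one channel remains. That would be consistent with a 32-channel tensor concatenated with a 1-channel one, but this is a guess and not something the log confirms.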

Does anyone know what is wrong? Thank you so much.

There is one more possible clue: although it crashes immediately with the above error most of the time, in some rare cases it trains for around 2-3 epochs and then crashes with the same error.
