Error: command buffer exited with error status.

Experimenting with the TensorFlow text_classification example (https://www.tensorflow.org/tutorials/keras/text_classification), I consistently get the following error after increasing the batch size to 512:


Epoch 2/10
 5/40 [==>...........................] - ETA: 5s - loss: 0.6887 - binary_accuracy: 0.7086

Error: command buffer exited with error status.
	The Metal Performance Shaders operations encoded on it may not have completed.
	Error: 
	(null)
	Internal Error (0000000e:Internal Error)
	<AGXG13XFamilyCommandBuffer: 0x2e1897c10>
    label = <none> 
    device = <AGXG13XDevice: 0x119460c00>
        name = Apple M1 Max 
    commandQueue = <AGXG13XFamilyCommandQueue: 0x11946e400>
        label = <none> 
        device = <AGXG13XDevice: 0x119460c00>
            name = Apple M1 Max 
    retainedReferences = 1

Other experiments (which run fine on other GPUs/systems) produce the same error. How should it be interpreted? Are there any workarounds?

Setup:

  • TensorFlow 2.6.0 (installed as described here)
  • Apple M1 Max, 64 GB
  • Monterey 12.0.1
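For context, the "5/40" in the progress bar above is consistent with that batch size: the tutorial splits the 25,000 IMDB training reviews with validation_split=0.2, and Keras rounds the step count up. A quick sanity check (assuming the tutorial's default 80/20 split):

```python
import math

# The tutorial's aclImdb training set has 25,000 labeled reviews;
# with validation_split=0.2 that leaves 20,000 for training.
train_examples = int(25_000 * 0.8)

# Keras runs ceil(examples / batch_size) steps per epoch.
steps_per_epoch = math.ceil(train_examples / 512)
print(steps_per_epoch)  # 40 -- matching the "5/40" in the log above
```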
  • I'm having the same problem. I don't know how to interpret it; did you find any solutions?

  • I am also facing the same issue; after a few epochs it starts throwing this error. Any solution yet?

  • I am having the same error trying to run ResUNet++. I installed TensorFlow through miniforge, on an M1 Pro with Monterey. Any insight to understand/fix the error would be very helpful, thanks!


Replies

I have the same error with SqueezeNet.

I get the same error when training a DeepLabv3+ network for semantic segmentation. I am not even sure whether the error has a bad impact on the training result; so far everything looks fine and keeps improving. I use an Apple M1 Mac with Python 3.9 and TensorFlow 2.5 (installed through miniforge).

I have exactly the same error, and it happens on an M1 Ultra chip; however, EXACTLY THE SAME CODE runs perfectly fine on an i9 MacBook Pro. So far the "upgrade" to M1 Ultra is not going well: software is not updated/ready for this new processor, and errors for which neither Google nor Apple take accountability are not helping. I tried updating to Python 3.10 and the latest TensorFlow and tensorflow-metal plugin in a second environment, but got exactly the same error, which I again attribute to TensorFlow not being coded correctly for this hardware, but who knows. I would really appreciate Apple helping customers here, at least by saying where exactly the error comes from (I'm sure the folks at Google would love to know what they are not doing properly to use the M1 GPU).

More on the error: there seems to be a correlation between the number of layers, the batch size, and the number of epochs in triggering it. If I drastically simplify the model, the error goes away. So it is not that TensorFlow isn't working at all; rather, the more complex the model gets, the less capable it is of using the M1's GPU, which is really the reason I upgraded!

I'm experiencing the error with Apple M1, TensorFlow 2.10.0 and Python 3.10.4.

In my application, this error was caused by threads having to run long tasks (longer than about two seconds). However, I have not been able to find precise information about this thread timeout!

I am on tensorflow-macos 2.9 and tensorflow-metal 0.5, M2 Max, 96 GB.

I ran into this issue using the Hugging Face DistilBERT model to train on my dataset. My batch size is just 128 (less than the 512 you reported, but the impact depends somewhat on the model). I suspect this may be a memory issue (or mismanagement/misalignment due to framework bugs). I will try reducing the batch size and see if that helps.

Even so, this would be quite a disappointment, since I got 96 GB precisely so I could push the batch size up in my local environment.
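If smaller batches avoid the error but you still want the optimization behaviour of a large batch, one standard technique (not an official fix for this Metal error) is gradient accumulation: run several small micro-batches and average their gradients before applying an update. A minimal NumPy sketch of why this is equivalent for a mean-reduced loss:

```python
import numpy as np

# Toy linear model: loss = mean((X @ w - y)**2).
# For a mean-reduced loss, the gradient over one batch of 512 equals
# the average of the gradients over 4 micro-batches of 128, so
# accumulating micro-batch gradients keeps the effective batch size
# at 512 while each GPU submission stays small.

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 8))
y = rng.normal(size=512)
w = rng.normal(size=8)

def grad(Xb, yb, w):
    # d/dw mean((Xb @ w - yb)**2) = (2/n) * Xb.T @ (Xb @ w - yb)
    n = len(yb)
    return (2.0 / n) * (Xb.T @ (Xb @ w - yb))

full = grad(X, y, w)

# Accumulate over 4 equal micro-batches of 128, then average.
micro = [grad(X[i:i + 128], y[i:i + 128], w) for i in range(0, 512, 128)]
accumulated = np.mean(micro, axis=0)

print(np.allclose(full, accumulated))  # True
```

The equivalence holds exactly when the micro-batches are equal-sized and the loss is a mean over the batch; the micro-batch size (128 here) is just an illustration, not a recommended value.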